Skip to content

Change parser state only after .await points to prevent state corruption#935

Draft
Mingun wants to merge 1 commit intotafia:masterfrom
Mingun:fix-async-cancel-safety
Draft

Change parser state only after .await points to prevent state corruption#935
Mingun wants to merge 1 commit intotafia:masterfrom
Mingun:fix-async-cancel-safety

Conversation

@Mingun
Copy link
Collaborator

@Mingun Mingun commented Feb 12, 2026

See explanation in #624 (comment)

Unfortunately, I cannot create stable test for this, so this bugfix without tests.

…t state corruption after canceling of async reading
@Mingun Mingun added the bug label Feb 12, 2026
@Mingun
Copy link
Collaborator Author

Mingun commented Feb 12, 2026

Failing Fuzzing pipeline is due to google/oss-fuzz#14945

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.98%. Comparing base (d1acdb5) to head (a765dca).
⚠️ Report is 33 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #935      +/-   ##
==========================================
- Coverage   58.08%   54.98%   -3.11%     
==========================================
  Files          42       44       +2     
  Lines       15513    16776    +1263     
==========================================
+ Hits         9011     9224     +213     
- Misses       6502     7552    +1050     
Flag Coverage Δ
unittests 54.98% <ø> (-3.11%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Mingun
Copy link
Collaborator Author

Mingun commented Feb 13, 2026

After reflection, I think that still this PR does not fix the problem. It is that we do not save the state of the parser between .await points. Thus, if an interruption occurred at this point, we could read some piece of data from AsyncRead to the buffer, but we do not reflect that in our ParseState -- each variant hold the information only about that we in start of the next event, but not that we inside it.

I was already working on rewriting the parser in the style of "reading to the buffer separately, then feed the read piece to the parser" and suddenly this work became relevant to fix this problem. Parser will maintain state after consuming piece of data, so we will never lost it.

So I mark this a draft for now.

@Mingun Mingun marked this pull request as draft February 13, 2026 06:30
@adamreichold
Copy link
Contributor

I was already working on rewriting the parser in the style of "reading to the buffer separately, then feed the read piece to the parser" and suddenly this work became relevant to fix this problem.

Sorry for the random tangent: Did you consider using async instead of manual state machines to write your parsers? I think this could very much simplify them, basing them off an async trait like AsyncBufRead and keeping the code in a much more readable recursive descent parser style with the compiler building those state machines for you. (Sync usage would then just need a little shim executor to reimplement the current "feed" style interface.)

@Mingun
Copy link
Collaborator Author

Mingun commented Feb 13, 2026

I don't quite understand the question. async is needed to wait for data from somewhere, when parsing we do not need to await anything, except the data, but getting data is independent of its parsing. For example, when we parse from slice we already have all needed data in the memory.

@adamreichold
Copy link
Contributor

async can also be used to automatically generate the state machines currently open-coded via Parser::feed so that instead of manually managing a state enum, you just write imperative style code with the state enum a.k.a future being built by the Rust compiler so that you can keep writing sans I/O parsers but write them in recursive descent style. In that approach, the parser would use async in its implementation (but not its API) even when the source is not async at all.

@Mingun
Copy link
Collaborator Author

Mingun commented Feb 13, 2026

It is still unclear for me how the async may help with managing parse states. Could you provide, for example, how parsing of <tag attribute1=">" attribute2='<'> could look? We need 3 states here:

  • Outside -- outside of any quoted region
  • DoubleQ -- inside double-quoted region
  • SingleQ -- inside single-quoted region

@adamreichold
Copy link
Contributor

It is still unclear for me how the async may help with managing parse states. Could you provide, for example, how parsing of <tag attribute1=">" attribute2='<'> could look? We need 3 states here:

* Outside -- outside of any quoted region

* DoubleQ -- inside double-quoted region

* SingleQ -- inside single-quoted region

Instead of a signature like

fn feed(&mut self, bytes: &[u8]) -> Option<usize>;

which tracks state via self, you'd track state via control flow, e.g.

async fn parse_quoted(read: &mut AsyncBufRead) -> Result<Vec<u8>> {
    let quote = expect_oneof(read, [b'\'', b'"']).await?;

    let mut buf = Vec::new();
    read.read_until(quote, &mut buf).await?;

    Ok(buf)
}

where the compiler will effectively create that state enum for you when it prepares the future to suspend after each suspension/await point.

This is just how I/O-based recursive descent parsers have been written forever and their main upside is that the code flow represents what symbols are expected where and when which is usually simpler for humans to understand than an explicit state machine.

The main trick here would to wrap this using a custom implementation of AsyncBufRead and trivial executor, so that one would could present this as a sans I/O interface on the outside, e.g.

fn feed(&mut self, bytes: &[u8]) -> Result<usize> {
   self.bytes = bytes;

   let mut cx = std::task::Context::from_waker(std::task::Waker::noop());

   match self.parser.as_mut().poll(&cx) {
     Poll::Pending => { Ok(self.consumed) }
     Poll::Ready(Ok(())) => { self.finished = true; Ok(self.consumed) }
     Poll::Ready(Err(err))) => { Err(err) }
   }
}

with some suitably constructed future stored in self.parser which was passed an AsyncBufRead implementation that is essentially backed by self.bytes and self.consumed. (Yes, there is a bit of hand-waving to safely making the sharing happen behind the scenes here.)

This would turn this around making the parser async-first and then derive the sync version from it, but just as a means to get state-management-via-control-flow known from classic recursive descent parsers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants