Scan character runs in bulk while parsing #490

Open
tfoutrein wants to merge 2 commits into python-poetry:master from AstekGroup:perf/bulk-scan

Conversation

@tfoutrein
Contributor

Stacked on #489 — the first commit here is #489 (index-based Source). Best reviewed/merged after #489; GitHub will collapse this to just the bulk-scan commit once #489 lands. Happy to rebase on request.

What

The parser advances one character at a time through runs of whitespace, bare-key and number characters, paying a `Source.inc()` call (attribute lookups + a `TOMLChar` build + bounds check) for every character of the run.

This adds `Source.advance_while(charset)` / `advance_until(stopset)`, which scan the underlying string in a single pass and update the index and current character only once, and uses them for the leading-whitespace, bare-key and number/date runs. They keep the exact value contract of the `while ... and self.inc()` loops they replace: on return, `current` is the first character outside the set (`advance_while`) or inside it (`advance_until`), or EOF if the input ends first.
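For readers who want to see the contract concretely, here is a minimal sketch of how such bulk-scan methods could look on an index-based source. `MiniSource`, its `EOF` sentinel, `_settle` and `BARE_KEY` are hypothetical names made up for illustration; this is not the code from the PR or tomlkit's actual `Source` class.

```python
import string

# Assumption for illustration: the characters allowed in a TOML bare key.
BARE_KEY = string.ascii_letters + string.digits + "-_"


class MiniSource:
    """Hypothetical stand-in for an index-based source (not tomlkit's class)."""

    EOF = "\0"  # assumed end-of-input sentinel

    def __init__(self, text: str) -> None:
        self._text = text
        self._idx = 0
        self.current = text[0] if text else self.EOF

    def inc(self) -> bool:
        """Advance one character; return False once the end is reached."""
        self._idx += 1
        if self._idx >= len(self._text):
            self._idx = len(self._text)
            self.current = self.EOF
            return False
        self.current = self._text[self._idx]
        return True

    def _settle(self, idx: int) -> None:
        # Update the index and the current character once, after the scan.
        self._idx = idx
        self.current = self._text[idx] if idx < len(self._text) else self.EOF

    def advance_while(self, charset) -> None:
        """Skip the run of characters that are in `charset`."""
        text, idx, end = self._text, self._idx, len(self._text)
        while idx < end and text[idx] in charset:
            idx += 1
        self._settle(idx)

    def advance_until(self, stopset) -> None:
        """Skip the run of characters that are not in `stopset`."""
        text, idx, end = self._text, self._idx, len(self._text)
        while idx < end and text[idx] not in stopset:
            idx += 1
        self._settle(idx)


src = MiniSource("  key_1 = 42")
src.advance_while(" \t")    # leading whitespace
src.advance_while(BARE_KEY)  # bare key
src.advance_until("=")      # stop on the first '='
```

In this sketch, `advance_while(charset)` ends in the same state as `while self.current in charset and self.inc(): pass`, but it touches `current` only once per run instead of once per character.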

Benchmarks

Parsing speedup across document sizes/shapes (median, interleaved A/B vs master, includes #489):

| Document | Speedup |
| --- | --- |
| typical mixed (~4 KB) | 1.32× |
| poetry.lock-like (~64 KB) | 1.26× |
| pyproject.toml | 1.17× |
| array-heavy | 1.16× |
| large flat (~90 KB) | 1.15× |

No regression on any shape.
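The benchmark script itself is not shown in this PR. As a rough idea of what a "median, interleaved A/B" measurement can look like, here is a hedged sketch: `interleaved_speedup`, `baseline` and `candidate` are hypothetical names, and the real harness may differ in every detail.

```python
import statistics
import time
from typing import Callable


def interleaved_speedup(
    baseline: Callable[[str], object],
    candidate: Callable[[str], object],
    doc: str,
    rounds: int = 25,
) -> float:
    """Return median(baseline time) / median(candidate time) for one document.

    The two parsers are timed alternately in each round, so slow drift in
    machine load affects both sides roughly equally.
    """
    base_times: list[float] = []
    cand_times: list[float] = []
    for _ in range(rounds):
        for fn, bucket in ((baseline, base_times), (candidate, cand_times)):
            start = time.perf_counter()
            fn(doc)
            bucket.append(time.perf_counter() - start)
    return statistics.median(base_times) / statistics.median(cand_times)
```

A value above 1.0 means the candidate was faster on that document. Using the median rather than the mean keeps a single slow outlier round from skewing the ratio.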

Tests

Full suite passes (972 tests, incl. the toml-test conformance submodule). No public API or behaviour change — round-trip output is byte-identical to master on a varied corpus.

tfoutrein added 2 commits June 5, 2026 16:19
`Source.__init__` built `iter([(i, TOMLChar(c)) for i, c in enumerate(self)])`,
allocating one tuple and one TOMLChar per character of the whole input up
front. Track an integer index into the underlying string instead: `inc()`
bumps the index and reads `self[idx]`, and state save/restore snapshots the
index rather than copying an iterator. Construction is O(1) and per-character
work is deferred to the read.
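A minimal sketch of the index-based idea, assuming a `str` subclass as the commit text suggests (`enumerate(self)`, `self[idx]`). `IndexedSource`, `snapshot` and `restore` are hypothetical names for illustration, not tomlkit's API, and the real class wraps characters in `TOMLChar` rather than using plain strings.

```python
class IndexedSource(str):
    """Hypothetical sketch: position tracked as an integer index into the string."""

    EOF = "\0"  # assumed end-of-input sentinel

    def __init__(self, _text: str = "") -> None:
        # O(1) construction: no per-character tuples or wrappers built up front.
        self._idx = 0
        self.current = self[0] if self else self.EOF

    def inc(self) -> bool:
        """Bump the index and read the character lazily."""
        self._idx += 1
        if self._idx >= len(self):
            self._idx = len(self)
            self.current = self.EOF
            return False
        self.current = self[self._idx]
        return True

    def snapshot(self) -> int:
        """Saving state is just remembering an integer."""
        return self._idx

    def restore(self, idx: int) -> None:
        """Restoring state re-reads the character at the saved index."""
        self._idx = idx
        self.current = self[idx] if idx < len(self) else self.EOF


src = IndexedSource("a=1")
mark = src.snapshot()
src.inc()
src.inc()
src.restore(mark)  # back at "a" without copying any iterator
```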

No behaviour change (full suite incl. the toml-test conformance submodule
passes); ~1.07-1.14x faster parsing across document sizes.

The parser advanced one character at a time through runs of whitespace,
bare-key and number characters, paying a `Source.inc()` call (attribute
lookups + a `TOMLChar` build + bounds check) for every character.

Add `Source.advance_while(charset)` / `advance_until(stopset)`, which scan
the underlying string in a single pass and update the index and current
character only once, and use them for the leading-whitespace, bare-key and
number/date runs. Same value contract as the `while ... and self.inc()`
loops they replace.

No behaviour change (full suite incl. the toml-test conformance submodule
passes; round-trip output byte-identical on a varied corpus). ~1.05-1.32x
faster parsing depending on shape (e.g. ~1.26x on a poetry.lock-like file).

@frostming
Contributor

LGTM, please resolve the conflicts.

BTW: we merge with "squash and merge", so stacked PRs are inconvenient.
