Scan character runs in bulk while parsing by tfoutrein · Pull Request #490 · python-poetry/tomlkit

tfoutrein · 2026-06-05T14:37:40Z

Stacked on #489 — the first commit here is #489 (index-based Source). Best reviewed/merged after #489; GitHub will collapse this to just the bulk-scan commit once #489 lands. Happy to rebase on request.

What

The parser advances one character at a time through runs of whitespace, bare-key and number characters, paying a `Source.inc()` call (attribute lookups + a `TOMLChar` build + bounds check) for every character of the run.

This adds `Source.advance_while(charset)` / `advance_until(stopset)`, which scan the underlying string in a single pass and update the index + current character only once, and uses them for the leading-whitespace, bare-key and number/date runs. They keep the exact value contract of the `while ... and self.inc()` loops they replace: on return, `current` is the first character outside the set (`advance_while`) or inside the set (`advance_until`), or EOF if the input ends first.
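To make the contract concrete, here is a minimal sketch of the two bulk-scan methods. It is not tomlkit's actual code: `SketchSource`, its `_text`/`_idx` attributes and the `"\0"` EOF sentinel are assumptions made for illustration, and plain `str` characters stand in for `TOMLChar`.

```python
class SketchSource:
    """Hypothetical index-based source; a stand-in, not tomlkit's Source."""

    EOF = "\0"  # assumed end-of-input sentinel for this sketch

    def __init__(self, text: str) -> None:
        self._text = text
        self._idx = 0
        self.current = text[0] if text else self.EOF

    def inc(self) -> bool:
        """Advance one character; return False once the end is reached."""
        self._idx = min(self._idx + 1, len(self._text))
        if self._idx == len(self._text):
            self.current = self.EOF
            return False
        self.current = self._text[self._idx]
        return True

    def advance_while(self, charset: frozenset) -> None:
        """Skip the run of characters in charset; update state once."""
        text, i, n = self._text, self._idx, len(self._text)
        while i < n and text[i] in charset:
            i += 1
        self._idx = i
        self.current = text[i] if i < n else self.EOF

    def advance_until(self, stopset: frozenset) -> None:
        """Skip characters until one in stopset; update state once."""
        text, i, n = self._text, self._idx, len(self._text)
        while i < n and text[i] not in stopset:
            i += 1
        self._idx = i
        self.current = text[i] if i < n else self.EOF


WS = frozenset(" \t")

bulk = SketchSource("   key = 1")
bulk.advance_while(WS)           # lands on "k"
bulk.advance_until(frozenset(" ="))  # lands on the space after "key"

# The per-character loop that advance_while is meant to replace.
loop = SketchSource("   key = 1")
while loop.current in WS and loop.inc():
    pass
```

In this sketch the inner loops touch only local variables, and `current` is assigned once per run rather than once per character, which is the saving the description refers to.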

Benchmarks

Parsing speedup across document sizes/shapes (median, interleaved A/B vs master, includes #489):

| document | speedup |
| --- | --- |
| typical mixed (~4 KB) | 1.32× |
| poetry.lock-like (~64 KB) | 1.26× |
| pyproject.toml | 1.17× |
| array-heavy | 1.16× |
| large flat (~90 KB) | 1.15× |

No regression on any shape.
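For readers unfamiliar with the method, a "median, interleaved A/B" measurement can be sketched roughly as follows. The harness used for the numbers above is not shown in this PR, so `interleaved_speedup`, its parameters and the round count are hypothetical; `baseline` and `candidate` stand for the parse function from master and from this branch.

```python
import statistics
import time
from typing import Callable


def interleaved_speedup(
    baseline: Callable[[str], object],
    candidate: Callable[[str], object],
    doc: str,
    rounds: int = 21,
) -> float:
    """Return median(baseline time) / median(candidate time).

    Alternating the two callables in every round spreads machine noise
    (frequency scaling, background load) across both sides.
    """
    base_times = []
    cand_times = []
    for _ in range(rounds):
        t0 = time.perf_counter()
        baseline(doc)
        t1 = time.perf_counter()
        candidate(doc)
        t2 = time.perf_counter()
        base_times.append(t1 - t0)
        cand_times.append(t2 - t1)
    cand_median = statistics.median(cand_times)
    if cand_median <= 0:
        raise ValueError("candidate ran too fast to time; use a larger input")
    return statistics.median(base_times) / cand_median
```

A value above 1.0 means the candidate is faster. The median is used rather than the mean so that a few slow outlier rounds do not skew the ratio.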

Tests

Full suite passes (972 tests, incl. the toml-test conformance submodule). No public API or behaviour change — round-trip output is byte-identical to master on a varied corpus.

`Source.__init__` built `iter([(i, TOMLChar(c)) for i, c in enumerate(self)])`, allocating one tuple and one TOMLChar per character of the whole input up front. Track an integer index into the underlying string instead: `inc()` bumps the index and reads `self[idx]`, and state save/restore snapshots the index rather than copying an iterator. Construction is O(1) and per-character work is deferred to the read. No behaviour change (full suite incl. the toml-test conformance submodule passes); ~1.07-1.14x faster parsing across document sizes.
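The index-based approach can be illustrated with a small sketch. This is an assumption-laden stand-in rather than the code from #489: `IndexCursor`, `snapshot()`, `restore()` and the `"\0"` EOF sentinel are hypothetical names, and plain `str` characters replace `TOMLChar`.

```python
class IndexCursor:
    """Hypothetical cursor that tracks an integer index into the text."""

    EOF = "\0"  # assumed end-of-input sentinel for this sketch

    def __init__(self, text: str) -> None:
        # O(1): nothing is allocated per character up front.
        self._text = text
        self._idx = 0

    @property
    def current(self) -> str:
        # Per-character work happens only when a character is read.
        if self._idx < len(self._text):
            return self._text[self._idx]
        return self.EOF

    def inc(self) -> bool:
        """Bump the index; return False once the end is reached."""
        if self._idx < len(self._text):
            self._idx += 1
        return self._idx < len(self._text)

    def snapshot(self) -> int:
        """Saving state is just copying one integer."""
        return self._idx

    def restore(self, saved: int) -> None:
        """Rewind to a previously saved position."""
        self._idx = saved


cursor = IndexCursor("ab")
saved = cursor.snapshot()
cursor.inc()           # now at "b"
cursor.restore(saved)  # back at "a" without copying any iterator
```

Because the saved state is a single integer, backtracking costs the same regardless of how much input remains.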

The parser advanced one character at a time through runs of whitespace, bare-key and number characters, paying a `Source.inc()` call (attribute lookups + a `TOMLChar` build + bounds check) for every character. Add `Source.advance_while(charset)` / `advance_until(stopset)`, which scan the underlying string in a single pass and update the index and current character only once, and use them for the leading-whitespace, bare-key and number/date runs. Same value contract as the `while ... and self.inc()` loops they replace. No behaviour change (full suite incl. the toml-test conformance submodule passes; round-trip output byte-identical on a varied corpus). ~1.05-1.32x faster parsing depending on shape (e.g. ~1.26x on a poetry.lock-like file).

frostming · 2026-06-08T01:39:27Z

LGTM, please resolve the conflicts.

BTW: we are merging by "squash and merge" so the stacked PR is inconvenient.

tfoutrein added 2 commits June 5, 2026 16:19

tfoutrein force-pushed the perf/bulk-scan branch from 21d9186 to f64d9c6 June 5, 2026 14:39

This was referenced Jun 5, 2026

Bulk-scan single-line string bodies #491

Open

Speed up parsing by interning TOMLChar instances #488

Open

Remove the internal TOMLChar wrapper #492

Open

tfoutrein wants to merge 2 commits into python-poetry:master from AstekGroup:perf/bulk-scan
