perf: inline row decoding and eliminate closures in recv_results_rows (100's to 1000's of ns, x1.3-1.8 speedup, Python only) #765

Draft
mykaul wants to merge 1 commit into scylladb:master from mykaul:perf/inline-row-decode

@mykaul mykaul commented Mar 25, 2026

Summary

  • Split recv_results_rows into a fast path (no column encryption) and a slow path (CE enabled)
  • Eliminate per-call closure allocation and merge two-pass row processing into single-pass decoding

Note: This optimization applies to the pure Python decode path only. When Cython extensions are compiled (the default for pip-installed packages), FastResultMessage from row_parser.pyx replaces recv_results_rows entirely. Users running without Cython (e.g., environments where C compilation is unavailable, or explicit use of _ProtocolHandler) will benefit from this change.

Details

Problem

The current recv_results_rows has three sources of overhead on every call:

  1. Two passes over row data: First recv_row reads all raw bytes into a list[list[bytes]], then decode_row iterates again to deserialize — doubling iteration and creating intermediate lists that are immediately discarded.

  2. Per-call closures: decode_val and decode_row are defined as closures inside recv_results_rows, meaning Python allocates new function objects on every result set.

  3. Unconditional ColDesc creation: ColDesc namedtuples are built for every column even when column encryption is not configured (the vast majority of deployments).
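
The first two overheads can be illustrated with a simplified model. This is a sketch, not the driver's code: `read_value`, `decode_int`, and `recv_rows_two_pass` are hypothetical names, and every column is assumed to be a 4-byte int. The only protocol detail relied on is that a column value is a signed 32-bit big-endian length followed by that many bytes, with a negative length meaning NULL.

```python
import io
import struct


def read_value(f):
    """Read one value: signed 32-bit big-endian length, then the payload."""
    (size,) = struct.unpack(">i", f.read(4))
    return None if size < 0 else f.read(size)


def decode_int(raw):
    """Decode a 4-byte big-endian signed int (stand-in for type decoding)."""
    return struct.unpack(">i", raw)[0]


def recv_rows_two_pass(f, rowcount, colcount):
    # Pass 1: materialize the raw bytes of every row as a list of lists.
    raw_rows = [[read_value(f) for _ in range(colcount)]
                for _ in range(rowcount)]

    # These nested functions are rebuilt as new objects on every call.
    def decode_val(raw):
        return None if raw is None else decode_int(raw)

    def decode_row(raw_row):
        return tuple(decode_val(raw) for raw in raw_row)

    # Pass 2: walk the intermediate lists again to decode them.
    return [decode_row(raw_row) for raw_row in raw_rows]


def encode_value(raw):
    """Build the wire form of one value (None encodes as length -1)."""
    if raw is None:
        return struct.pack(">i", -1)
    return struct.pack(">i", len(raw)) + raw


# Two rows of two columns: (1, NULL) and (2, 3).
cells = [struct.pack(">i", 1), None, struct.pack(">i", 2), struct.pack(">i", 3)]
wire = b"".join(encode_value(c) for c in cells)
rows = recv_rows_two_pass(io.BytesIO(wire), rowcount=2, colcount=2)
```

In this model `raw_rows` exists only to be consumed by the second pass, which is the intermediate structure the PR aims to remove.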

Solution

Fast path (no column encryption — the common case):

  • _decode_row_inline(f, colcount, col_types, protocol_version) reads each column's size, reads the bytes, and immediately calls from_binary() — one pass, no intermediate list
  • ColDesc creation is skipped entirely
  • No closures allocated
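
A minimal sketch of what single-pass decoding might look like, assuming the signature quoted above. `Int32Type` is a hypothetical stand-in for a driver column type that exposes `from_binary()`, and the function body is an illustration, not the PR's actual implementation.

```python
import io
import struct

# Built once at import time rather than on every call.
_unpack_int32 = struct.Struct(">i").unpack


class Int32Type:
    """Hypothetical stand-in for a column type exposing from_binary()."""

    @staticmethod
    def from_binary(raw, protocol_version):
        return _unpack_int32(raw)[0]


def decode_row_inline(f, colcount, col_types, protocol_version):
    """Read and decode one row in a single pass (illustrative model only)."""
    row = [None] * colcount
    for i in range(colcount):
        (size,) = _unpack_int32(f.read(4))
        if size >= 0:
            row[i] = col_types[i].from_binary(f.read(size), protocol_version)
        # A negative size marks NULL: keep None and never call from_binary().
    return tuple(row)


# One row: an int column holding 7, then a NULL column.
wire = struct.pack(">i", 4) + struct.pack(">i", 7) + struct.pack(">i", -1)
decoded = decode_row_inline(io.BytesIO(wire), 2, [Int32Type, Int32Type], 4)
```

Each value is decoded as soon as its bytes are read, so no list of raw bytes is ever built, and NULL columns skip the type decoder entirely.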

Slow path (column encryption enabled):

  • Preserves the existing two-pass logic (needed because CE must decrypt before type decoding)
  • decode_val/decode_row moved to module-level functions (_decode_val_ce, _decode_row_ce) to avoid per-call closure overhead
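
The slow path could be sketched along these lines. Everything here is an assumption made for illustration: `ToyPolicy` is a hypothetical stand-in for a column encryption policy (its XOR "cipher" is not real encryption), the `ColDesc` field names are guessed, and `decode_val_ce` / `decode_row_ce` only mirror the names in the PR without reproducing its code.

```python
import struct
from collections import namedtuple

# Field names are an assumption made for this sketch.
ColDesc = namedtuple("ColDesc", ["ks", "table", "col"])


class ToyPolicy:
    """Hypothetical policy; XOR is used only to keep the example short."""

    def __init__(self, encrypted_cols, key=0x5A):
        self._cols = set(encrypted_cols)
        self._key = key

    def contains_column(self, coldesc):
        return coldesc in self._cols

    def decrypt(self, coldesc, raw):
        return bytes(b ^ self._key for b in raw)


def decode_int(raw):
    """Decode a 4-byte big-endian signed int (stand-in for type decoding)."""
    return struct.unpack(">i", raw)[0]


# Module-level functions: defined once, not rebuilt per result set.
def decode_val_ce(raw, coldesc, policy):
    if raw is None:
        return None
    if policy.contains_column(coldesc):
        # Decryption has to happen before the type decoder sees the bytes.
        raw = policy.decrypt(coldesc, raw)
    return decode_int(raw)


def decode_row_ce(raw_row, coldescs, policy):
    return tuple(decode_val_ce(raw, cd, policy)
                 for raw, cd in zip(raw_row, coldescs))


coldescs = [ColDesc("ks", "t", "a"), ColDesc("ks", "t", "b")]
policy = ToyPolicy({coldescs[1]})
plain = struct.pack(">i", 5)
encrypted = bytes(b ^ 0x5A for b in struct.pack(">i", 9))
decoded = decode_row_ce([plain, encrypted], coldescs, policy)
```

State that the closures previously captured from the enclosing scope is passed as explicit arguments here, which is what allows the functions to live at module level.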

Benchmark results

Measured on CPython 3.14.3, Protocol V4, 300 iterations, 100 warmup. All values in nanoseconds per row.

| Scenario | Master (min ns/row) | PR (min ns/row) | Master (median ns/row) | PR (median ns/row) | Speedup (min) | Speedup (median) |
|---|---|---|---|---|---|---|
| 5 int cols, 10 rows | 2677 | 1911 | 3192 | 2558 | 1.40x | 1.25x |
| 5 int cols, 100 rows | 2155 | 1489 | 2877 | 1908 | 1.45x | 1.51x |
| 5 int cols, 1000 rows | 2675 | 1848 | 3165 | 2260 | 1.45x | 1.40x |
| 5 mixed cols, 100 rows | 2625 | 2024 | 3225 | 2203 | 1.30x | 1.46x |
| 5 mixed cols, 1000 rows | 2942 | 1926 | 3880 | 2118 | 1.53x | 1.83x |
| 10 int cols, 100 rows, 50% NULL | 4666 | 3095 | 5284 | 3314 | 1.51x | 1.59x |
| 10 int cols, 1000 rows, 50% NULL | 4812 | 2737 | 6156 | 3166 | 1.76x | 1.94x |
| 10 int cols, 100 rows, no NULL | 5082 | 3826 | 5339 | 4201 | 1.33x | 1.27x |
| 10 int cols, 1000 rows, no NULL | 5116 | 3647 | 6184 | 4589 | 1.40x | 1.35x |

On the pure Python path, the speedup ranges from 1.30x to 1.76x on minimum times and from 1.25x to 1.94x on medians. It is higher for NULL-heavy workloads because the inline path short-circuits from_binary() for negative-length (NULL) columns.
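
For readers who want to reproduce this kind of measurement, one way to collect min and median ns/row is sketched below. The harness is hypothetical and is not the script used for the table above; `bench_ns_per_row` and its parameters are names invented for this example, with defaults chosen to match the iteration counts quoted earlier.

```python
import statistics
import time


def bench_ns_per_row(fn, rows, iterations=300, warmup=100):
    """Return (min, median) nanoseconds per row for calling fn().

    fn is assumed to decode one result set containing `rows` rows.
    """
    # Warm-up runs are executed but not recorded.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        elapsed = time.perf_counter_ns() - start
        samples.append(elapsed / rows)

    return min(samples), statistics.median(samples)


# Placeholder workload standing in for a real decode call.
lowest, median = bench_ns_per_row(lambda: sum(range(100)), rows=100,
                                  iterations=20, warmup=5)
```

Reporting the minimum alongside the median helps separate the best achievable time from typical runs, which include scheduler and allocator noise.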

Merge conflict note

⚠️ This PR modifies the same recv_results_rows method as PR #630, which also splits the method into CE/non-CE branches. If both PRs are accepted, there will be a merge conflict requiring manual resolution.

Testing

  • All 651 existing unit tests pass (16 pre-existing skips)
  • Added test for decode error wrapping in the inline path (test_protocol.py)
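
As an illustration of what decode error wrapping can look like, here is a self-contained sketch. It does not reproduce the test added in test_protocol.py: `DecodeError`, `Int32Type`, `decode_row_checked`, and the message format are all hypothetical.

```python
import io
import struct


class DecodeError(Exception):
    """Hypothetical exception carrying column context for diagnostics."""


class Int32Type:
    """Hypothetical stand-in for a column type exposing from_binary()."""

    @staticmethod
    def from_binary(raw, protocol_version):
        return struct.unpack(">i", raw)[0]


def decode_row_checked(f, col_names, col_types, protocol_version):
    """Decode one row, re-raising failures with the column name and type."""
    row = []
    for name, col_type in zip(col_names, col_types):
        (size,) = struct.unpack(">i", f.read(4))
        if size < 0:
            row.append(None)  # NULL column
            continue
        raw = f.read(size)
        try:
            row.append(col_type.from_binary(raw, protocol_version))
        except Exception as exc:
            raise DecodeError(
                'Failed decoding result column "%s" of type %s: %s'
                % (name, col_type.__name__, exc)
            ) from exc
    return tuple(row)


# A 2-byte payload cannot be decoded as a 4-byte int, so decoding fails.
bad_wire = struct.pack(">i", 2) + b"\x00\x01"
try:
    decode_row_checked(io.BytesIO(bad_wire), ["id"], [Int32Type], 4)
    message = None
except DecodeError as exc:
    message = str(exc)
```

A test in this style would assert that the raised message names both the failing column and its type, so that a malformed value can be traced without a debugger.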

@mykaul mykaul marked this pull request as draft March 25, 2026 20:33
@mykaul mykaul force-pushed the perf/inline-row-decode branch from 3c3fea8 to 020c764 Compare April 7, 2026 10:57
@mykaul mykaul changed the title from "perf: inline row decoding and eliminate closures in recv_results_rows" to "perf: inline row decoding and eliminate closures in recv_results_rows (100's to 1000's of ns, x1.3-1.8 speedup)" Apr 7, 2026
@mykaul mykaul changed the title from "perf: inline row decoding and eliminate closures in recv_results_rows (100's to 1000's of ns, x1.3-1.8 speedup)" to "perf: inline row decoding and eliminate closures in recv_results_rows (100's to 1000's of ns, x1.3-1.8 speedup, Python only)" Apr 7, 2026
Split recv_results_rows into fast path (no column encryption) and slow
path (column encryption enabled):

Fast path (common case):
  - Reads raw column bytes and decodes types in a single pass per row
    via _decode_row_inline(), eliminating the intermediate list-of-lists
  - Skips ColDesc namedtuple creation entirely (only needed for CE)
  - No closure allocation per call
  - Wraps decode errors with column name/type info for diagnostics

Slow path (column encryption):
  - Preserves full CE logic with ColDesc creation
  - Moves decode_val/decode_row closures to module-level functions
    (_decode_val_ce, _decode_row_ce) to avoid per-call closure overhead

Note: This PR modifies the same method as PR scylladb#630 (which also splits
recv_results_rows into CE/non-CE branches). There will be a merge
conflict that needs manual resolution if both PRs are accepted.
@mykaul mykaul force-pushed the perf/inline-row-decode branch from 020c764 to 30d3a44 Compare April 9, 2026 17:19