(improvement) Optimize Cython deserialization primitives and add VectorType Cython deserializer (substantial - 11x-30x speedup mainly via DesVectorType.deserialize_bytes with Cython) by mykaul · Pull Request #732 · scylladb/python-driver

mykaul · 2026-03-07T10:01:50Z

Summary

Optimize foundational Cython byte-unpacking and add a dedicated VectorType Cython deserializer.

Commits (4, squashed from 7)

1. Optimize Cython byte unpacking with ntohs/ntohl and int.from_bytes

- Replace generic byte-swap loop in unpack_num() with ntohs()/ntohl() intrinsics for 16/32-bit types (compiles to a single bswap on x86)
- Replace varint_unpack() hex-string-based conversion with int.from_bytes(term, 'big', signed=True) (7.7x speedup)
- Simplify read_int() to a direct pointer cast + ntohl()
- Remove slice_buffer(), replace all call sites with from_ptr_and_size()
- Add Windows support: platform-conditional #ifdef _WIN32 for winsock2.h vs arpa/inet.h
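
To make the varint change concrete, here is a minimal pure-Python sketch contrasting a hex-string decode with `int.from_bytes()`. The function names `varint_unpack_hex` and `varint_unpack_fast` are hypothetical and used only for illustration; the hex variant shows the general style of conversion being replaced and is not a copy of the driver's code.

```python
def varint_unpack_hex(term: bytes) -> int:
    """Decode a big-endian two's-complement integer via a hex string.

    Illustrative only: builds an intermediate string, then fixes up the sign.
    Assumes `term` is non-empty.
    """
    val = int(term.hex(), 16)
    if term[0] & 0x80:
        # High bit set: the value is negative in two's complement.
        val -= 1 << (len(term) * 8)
    return val


def varint_unpack_fast(term: bytes) -> int:
    """Decode the same encoding directly, with no intermediate string."""
    return int.from_bytes(term, "big", signed=True)


if __name__ == "__main__":
    for raw in (b"\x00", b"\x7f", b"\x80", b"\xff", b"\x00\x80", b"\xff\x7f"):
        assert varint_unpack_hex(raw) == varint_unpack_fast(raw)
        print(raw.hex(), varint_unpack_fast(raw))
```

Both functions agree on every non-empty input; the second one avoids the string allocation, which is where the speedup reported above is expected to come from.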

2. Optimize float deserialization with ntohl() intrinsic

- Add float-specific branch: reinterpret float bits as uint32_t, apply ntohl(), reinterpret back to float
- Eliminates the 4-iteration byte-swap loop for every float value
- Refactor to use from_ptr_and_size() helper consistently
- Add buffer bounds validation (CQL protocol NULL/not-set handling in subelem(), bounds checks in _unpack_len(), DesTupleType, DesCompositeType)
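
The reinterpret / ntohl() / reinterpret sequence can be mirrored in plain Python for illustration. This is a sketch of the idea, not the Cython code from the PR: `deserialize_float` is a hypothetical name, and `struct` stands in for the C-level pointer casts.

```python
import socket
import struct


def deserialize_float(raw4: bytes) -> float:
    """Decode a 4-byte big-endian IEEE 754 float via a uint32 byte swap."""
    if len(raw4) != 4:
        raise ValueError("expected exactly 4 bytes, got %d" % len(raw4))
    # Step 1: read the raw bytes as a native-order uint32 (the "reinterpret").
    (net_bits,) = struct.unpack("=I", raw4)
    # Step 2: convert from network (big-endian) order to host order.
    host_bits = socket.ntohl(net_bits)
    # Step 3: reinterpret the host-order bits as a float.
    (value,) = struct.unpack("=f", struct.pack("=I", host_bits))
    return value


if __name__ == "__main__":
    wire = struct.pack(">f", 1.5)
    print(deserialize_float(wire))
```

The result matches `struct.unpack(">f", raw4)`; the point of the three-step form is that in C each step is a single load, byte swap, or store rather than a per-byte loop.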

3. Optimize VectorType deserialization with Cython deserializer

New DesVectorType class with specialized deserialization methods:
- _deserialize_float(): C-level memcpy + ntohl + pointer-cast (no Python dispatch per element)
- _deserialize_double() / _deserialize_int64(): 8-byte manual byte-swap
- _deserialize_int32(): memcpy + ntohl + cast
- _deserialize_int16(): ntohs cast
- Numpy fast-path for vectors >= 32 elements
- Generic fallback for other fixed-size types with size validation
Automatically registered via find_deserializer() for the Cython row parser
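
As a rough pure-Python analogue of the dispatch described above, the sketch below decodes a float vector with a size check, a NumPy path at 32 or more elements, and a `struct` path otherwise. The names `deserialize_float_vector` and `NUMPY_THRESHOLD` are assumptions for illustration, as is returning a list; the actual DesVectorType does this work at the C level.

```python
import struct

try:
    import numpy as np  # optional; only used for the large-vector path
except ImportError:
    np = None

NUMPY_THRESHOLD = 32  # threshold taken from the PR summary


def deserialize_float_vector(buf: bytes, dimension: int) -> list:
    """Decode `dimension` big-endian 32-bit floats from `buf`."""
    expected = dimension * 4
    if len(buf) != expected:
        # Size validation: reject buffers that do not match the declared dimension.
        raise ValueError("expected %d bytes, got %d" % (expected, len(buf)))
    if np is not None and dimension >= NUMPY_THRESHOLD:
        # One bulk, byte-order-aware conversion for large vectors.
        return np.frombuffer(buf, dtype=">f4", count=dimension).tolist()
    # Small vectors (or no NumPy installed): a single struct call.
    return list(struct.unpack(">%df" % dimension, buf))


if __name__ == "__main__":
    small = struct.pack(">4f", 1.0, 2.5, -0.5, 0.0)
    print(deserialize_float_vector(small, 4))
```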

4. Remove dead `values = []` in DesTupleType.deserialize

The values list was allocated but never used — results built directly into pre-allocated tuple via tuple_set()

Benchmark Results

All benchmarks: min(timeit.repeat(number=N, repeat=5)), per-call nanoseconds.
Machine: idle Linux workstation, Cython extensions compiled.
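
For readers who want to reproduce the methodology, a minimal harness in the same spirit might look like the following. The helper name `per_call_ns` and the sample workload are assumptions; the PR's actual benchmark scripts are not shown here.

```python
import timeit


def per_call_ns(func, number=100_000, repeat=5):
    """Return the best per-call time in nanoseconds.

    Runs `func` `number` times per trial, `repeat` trials, and keeps the
    minimum trial, which is the one least disturbed by other system activity.
    """
    best_total_seconds = min(timeit.repeat(func, number=number, repeat=repeat))
    return best_total_seconds / number * 1e9


if __name__ == "__main__":
    payload = (123456789).to_bytes(8, "big", signed=True)
    ns = per_call_ns(lambda: int.from_bytes(payload, "big", signed=True))
    print("int.from_bytes: %.0f ns per call" % ns)
```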

Primitives (via `CqlType.deserialize()`)

| Benchmark | Master (ns) | PR #732 (ns) | Speedup |
|---|---|---|---|
| Int32Type | 175 | 175 | 1.0x |
| ShortType | 153 | 147 | 1.04x |
| FloatType | 171 | 165 | 1.04x |
| DoubleType | 174 | 171 | 1.02x |
| IntegerType (8-byte varint) | 1489 | 193 | 7.7x |
| Tuple | 2657 | 2432 | 1.09x |

The ntohs/ntohl change replaces a byte-swap loop that was already fast for 2/4-byte types. The big win is varint_unpack() where int.from_bytes() replaces hex-string conversion.

VectorType — Python path (via `VectorType.deserialize()`)

| Benchmark | Master (ns) | PR #732 (ns) | Speedup |
|---|---|---|---|
| Vector<float,4> | 1587 | 1592 | 1.0x |
| Vector<float,128> | 33546 | 30763 | 1.09x |
| Vector<float,1536> | 516789 | 371960 | 1.39x |

VectorType — Cython path (via `DesVectorType.deserialize_bytes()`)

The Cython DesVectorType is used by the Cython row parser (find_deserializer()), bypassing the Python VectorType.deserialize() entirely:

| Benchmark | Python path, master (ns) | Cython DesVectorType (ns) | Speedup |
|---|---|---|---|
| Vector<float,4> | 1587 | 140 | 11.3x |
| Vector<float,128> | 33546 | 2105 | 15.9x |
| Vector<float,1536> | 516789 | 24825 | 20.8x |

Unit tests

640 passed, 49 skipped (baseline: 645 passed, 43 skipped on master).

….from_bytes

Performance improvements to serialization/deserialization hot paths:

1. unpack_num(): Use ntohs()/ntohl() for 16-bit and 32-bit integer types instead of a byte-by-byte swapping loop. These compile to single bswap instructions on x86, providing more predictable performance.
2. read_int(): Simplify to use ntohl() directly instead of going through unpack_num() with a temporary Buffer.
3. varint_unpack(): Replace hex string conversion with int.from_bytes(). This eliminates string allocations and provides a 4-18x speedup for the function itself (larger gains for longer varints).
4. Remove slice_buffer() and replace it with direct assignment.
5. _unpack_len() is now implemented similarly to read_int().

Also removes unused 'start' and 'end' variables from unpack_num().

End-to-end benchmark shows ~4-5% improvement in row throughput.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

…helpers

Add buffer bounds validation to Cython deserializers for safety against malformed buffers, refactor to use from_ptr_and_size() helper consistently, and add float ntohl() specialization for consistency with int32/int16 paths.

Changes:
- subelem(): Add CQL protocol-compliant value handling (NULL/-1, not-set/-2, invalid/<-2) with bounds checking
- _unpack_len(): Add bounds check and use memcpy for alignment safety
- DesTupleType: Add defensive bounds checking for tuple item lengths
- DesCompositeType: Add bounds validation for composite element lengths
- Refactor 4 locations to use from_ptr_and_size() instead of manual Buffer field assignment
- Add float branch to unpack_num(): reinterpret bits as uint32, ntohl(), reinterpret back (consistent with int16/int32 intrinsic paths)
- Add from_ptr_and_size() declaration to buffer.pxd

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
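
The length handling listed for subelem() can be sketched in plain Python as follows. This is an illustration of the rules named in the commit message (-1 is NULL, -2 is not set, anything below -2 is invalid, and a non-negative length must fit in the remaining buffer); `read_element` and the `NOT_SET` sentinel are hypothetical names, not the driver's API.

```python
import struct

NOT_SET = object()  # hypothetical sentinel for the "not set" marker


def read_element(buf: bytes, offset: int):
    """Read one length-prefixed element; return (value, next_offset)."""
    if offset < 0 or offset + 4 > len(buf):
        raise ValueError("buffer too short for a 4-byte length prefix")
    # The length is a signed 32-bit big-endian integer.
    (length,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    if length == -1:
        return None, offset      # NULL value, no payload bytes follow
    if length == -2:
        return NOT_SET, offset   # "not set" marker, no payload bytes follow
    if length < -2:
        raise ValueError("invalid element length %d" % length)
    if offset + length > len(buf):
        raise ValueError("element length %d exceeds remaining buffer" % length)
    return buf[offset:offset + length], offset + length


if __name__ == "__main__":
    data = struct.pack(">i", 3) + b"abc" + struct.pack(">i", -1)
    value, pos = read_element(data, 0)
    print(value, pos)
    print(read_element(data, pos))
```

Checking the length before slicing is what turns a malformed buffer into a clear error instead of an out-of-bounds read at the C level.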

…izer

Added DesVectorType Cython deserializer with C-level optimizations for improved performance in row parsing for vectors.

The deserializer uses:
- Direct C byte swapping (ntohl, ntohs) for numeric types
- Memory operations without Python object overhead
- Unified numpy path for large vectors (≥32 elements)
- struct.unpack fallback for small vectors (<32 elements)

Performance improvements:
- Small vectors (3-4 elements): 4.4-4.7x faster
- Medium vectors (128 elements): 1.0-1.5x faster
- Large vectors (384-1536 elements): 0.9-1.0x (marginal)

The Cython deserializer is automatically used by the row parser when available via find_deserializer(). Includes unit tests and benchmark code. Follow-up commits will try to get NumPy arrays, and perhaps more.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

The 'values' list was allocated but never used — the method builds results directly into a pre-allocated tuple via tuple_set(res, i, item). Removes one unnecessary list allocation per tuple deserialization.

mykaul marked this pull request as draft March 7, 2026 10:23

mykaul mentioned this pull request Mar 14, 2026

Tracking: Vector search (VectorType) performance improvement PRs #746

Open

mykaul mentioned this pull request Apr 2, 2026

[DO NOT MERGE] (Improvement) improve performance of Vector type parsing #689

Draft

8 tasks

mykaul force-pushed the cython-vector-deser branch from 8673d95 to 34dd41e Compare April 7, 2026 10:08

mykaul added 3 commits April 7, 2026 13:24

perf: Remove dead 'values = []' assignment in DesTupleType.deserialize

9b7697c


mykaul force-pushed the cython-vector-deser branch from 34dd41e to 9b7697c Compare April 7, 2026 10:24

mykaul wants to merge 4 commits into scylladb:master from mykaul:cython-vector-deser
