(improvement) Optimize Cython deserialization primitives and add VectorType Cython deserializer (substantial - 11x-30x speedup mainly via DesVectorType.deserialize_bytes with Cython)#732

Draft
mykaul wants to merge 4 commits into scylladb:master from mykaul:cython-vector-deser

@mykaul mykaul commented Mar 7, 2026

Summary

Optimize foundational Cython byte-unpacking and add a dedicated VectorType Cython deserializer.

Commits (4, squashed from 7)

1. Optimize Cython byte unpacking with ntohs/ntohl and int.from_bytes

  • Replace generic byte-swap loop in unpack_num() with ntohs()/ntohl() intrinsics for 16/32-bit types (compiles to single bswap on x86)
  • Replace varint_unpack() hex-string-based conversion with int.from_bytes(term, 'big', signed=True): 7.7x speedup
  • Simplify read_int() to direct pointer cast + ntohl()
  • Remove slice_buffer(), replace all call sites with from_ptr_and_size()
  • Add Windows support: platform-conditional #ifdef _WIN32 for winsock2.h vs arpa/inet.h
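The varint change in this commit can be illustrated in plain Python. This is a minimal sketch, not the driver's actual code: `varint_unpack_hex` is a hypothetical stand-in for a hex-string-based conversion, and `varint_unpack_fast` shows the `int.from_bytes()` call named above. Both are assumed to receive a non-empty big-endian two's-complement byte string.

```python
def varint_unpack_hex(term: bytes) -> int:
    """Hypothetical hex-string-style conversion (illustrative only)."""
    # Build an unsigned integer from the hex text, then fix up the sign
    # by subtracting 2**(8 * len) when the top bit is set.
    val = int(term.hex(), 16)
    if term[0] & 0x80:
        val -= 1 << (len(term) * 8)
    return val


def varint_unpack_fast(term: bytes) -> int:
    """Single C-level call, no intermediate string allocation."""
    return int.from_bytes(term, "big", signed=True)


samples = [
    b"\x00",
    b"\x7f",
    b"\x80",
    b"\xff\x7f",
    (2**63 - 1).to_bytes(8, "big"),
    (-2**63).to_bytes(8, "big", signed=True),
]
# Both conversions should agree on every sample.
for s in samples:
    assert varint_unpack_hex(s) == varint_unpack_fast(s)
```

The two functions return the same values; the difference is that the second avoids building and parsing a temporary string, which is where the reported gain for varint_unpack() would come from.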

2. Optimize float deserialization with ntohl() intrinsic

  • Add float-specific branch: reinterpret float bits as uint32_t, apply ntohl(), reinterpret back to float
  • Eliminates 4-iteration byte-swap loop for every float value
  • Refactor to use from_ptr_and_size() helper consistently
  • Add buffer bounds validation (CQL protocol NULL/not-set handling in subelem(), bounds checks in _unpack_len(), DesTupleType, DesCompositeType)
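The float branch described above (load the bits as a 32-bit unsigned integer, apply ntohl(), reinterpret as float) can be mimicked with the standard library. This is only a sketch of the idea: `unpack_float_via_ntohl` is a hypothetical name, and `struct` with native byte order stands in for the C-level memcpy and pointer cast.

```python
import socket
import struct


def unpack_float_via_ntohl(buf: bytes) -> float:
    """Decode a 4-byte big-endian float by swapping its bits as a uint32."""
    if len(buf) != 4:
        raise ValueError("expected exactly 4 bytes, got %d" % len(buf))
    # Load the raw bytes as a native-endian uint32 (analogous to memcpy).
    (raw,) = struct.unpack("=I", buf)
    # Convert network byte order to host byte order in one step.
    host = socket.ntohl(raw)
    # Reinterpret the swapped bits as a native-endian float.
    (value,) = struct.unpack("=f", struct.pack("=I", host))
    return value


# The result should match a direct big-endian unpack.
wire = struct.pack(">f", 1.5)
assert unpack_float_via_ntohl(wire) == struct.unpack(">f", wire)[0]
```

In pure Python this is slower than `struct.unpack(">f", ...)`; the point is only to show the bit-level steps that the Cython code performs without a per-byte loop.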

3. Optimize VectorType deserialization with Cython deserializer

  • New DesVectorType class with specialized deserialization methods:
    • _deserialize_float(): C-level memcpy + ntohl + pointer-cast (no Python dispatch per element)
    • _deserialize_double() / _deserialize_int64(): 8-byte manual byte-swap
    • _deserialize_int32(): memcpy + ntohl + cast
    • _deserialize_int16(): ntohs + cast
    • Numpy fast-path for vectors >= 32 elements
    • Generic fallback for other fixed-size types with size validation
  • Automatically registered via find_deserializer() for the Cython row parser
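The dispatch described for float vectors (size validation, a per-element path for small vectors, a bulk path at 32 elements and above) can be sketched as follows. This is not the DesVectorType implementation: `deserialize_float_vector` and `BULK_THRESHOLD` are hypothetical names, and the stdlib `array` module with `byteswap()` is swapped in for the NumPy fast path so the example has no third-party dependency.

```python
import struct
import sys
from array import array

BULK_THRESHOLD = 32  # the PR description switches paths at 32 elements


def deserialize_float_vector(buf: bytes, size: int) -> list:
    """Decode `size` big-endian 32-bit floats from `buf`."""
    expected = size * 4
    if len(buf) != expected:
        raise ValueError(
            "expected %d bytes for %d floats, got %d" % (expected, size, len(buf))
        )
    if size < BULK_THRESHOLD:
        # Small vectors: one struct call decodes all elements.
        return list(struct.unpack(">%df" % size, buf))
    # Large vectors: copy the bytes in bulk, then swap in place if the
    # host is little-endian (stand-in for the NumPy path).
    values = array("f")
    values.frombytes(buf)
    if sys.byteorder == "little":
        values.byteswap()
    return values.tolist()


small = struct.pack(">4f", 1.0, 2.0, -3.5, 0.25)
assert deserialize_float_vector(small, 4) == [1.0, 2.0, -3.5, 0.25]
```

Both branches return the same values for the same input; the threshold only decides which decoding strategy is used.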

4. Remove dead values = [] in DesTupleType.deserialize

  • The values list was allocated but never used — results built directly into pre-allocated tuple via tuple_set()

Benchmark Results

All benchmarks: min(timeit.repeat(number=N, repeat=5)), per-call nanoseconds.
Machine: idle Linux workstation, Cython extensions compiled.
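For readers who want to reproduce the measurement style, the formula above can be wrapped in a small helper. This is a sketch under assumptions: `per_call_ns` is a hypothetical name, and the default `number` is an arbitrary choice, since the PR does not state the value of N.

```python
import timeit


def per_call_ns(func, number=100_000, repeat=5):
    """Best-of-`repeat` wall time for `number` calls, as nanoseconds per call."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e9


# Example: time decoding an 8-byte varint payload.
payload = (2**62).to_bytes(8, "big", signed=True)
ns = per_call_ns(lambda: int.from_bytes(payload, "big", signed=True), number=10_000)
print("int.from_bytes: %.0f ns per call" % ns)
```

Taking the minimum over repeats reduces the influence of scheduler noise, which is why the fastest run is reported rather than the mean.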

Primitives (via CqlType.deserialize())

| Benchmark | Master (ns) | PR #732 (ns) | Speedup |
|---|---|---|---|
| Int32Type | 175 | 175 | 1.0x |
| ShortType | 153 | 147 | 1.04x |
| FloatType | 171 | 165 | 1.04x |
| DoubleType | 174 | 171 | 1.02x |
| IntegerType (8-byte varint) | 1489 | 193 | 7.7x |
| Tuple | 2657 | 2432 | 1.09x |

The ntohs/ntohl change replaces a byte-swap loop that was already fast for 2/4-byte types. The big win is varint_unpack() where int.from_bytes() replaces hex-string conversion.

VectorType — Python path (via VectorType.deserialize())

| Benchmark | Master (ns) | PR #732 (ns) | Speedup |
|---|---|---|---|
| Vector<float,4> | 1587 | 1592 | 1.0x |
| Vector<float,128> | 33546 | 30763 | 1.09x |
| Vector<float,1536> | 516789 | 371960 | 1.39x |

VectorType — Cython path (via DesVectorType.deserialize_bytes())

The Cython DesVectorType is used by the Cython row parser (find_deserializer()), bypassing the Python VectorType.deserialize() entirely:

| Benchmark | Python path, master (ns) | Cython DesVectorType (ns) | Speedup |
|---|---|---|---|
| Vector<float,4> | 1587 | 140 | 11.3x |
| Vector<float,128> | 33546 | 2105 | 15.9x |
| Vector<float,1536> | 516789 | 24825 | 20.8x |

Unit tests

640 passed, 49 skipped (baseline: 645 passed, 43 skipped on master).

….from_bytes

Performance improvements to serialization/deserialization hot paths:

1. unpack_num(): Use ntohs()/ntohl() for 16-bit and 32-bit integer types
   instead of byte-by-byte swapping loop. These compile to single bswap
   instructions on x86, providing more predictable performance.

2. read_int(): Simplify to use ntohl() directly instead of going through
   unpack_num() with a temporary Buffer.

3. varint_unpack(): Replace hex string conversion with int.from_bytes().
   This eliminates string allocations and provides 4-18x speedup for the
   function itself (larger gains for longer varints).

4. Remove slice_buffer() and replace it with direct assignment

5. _unpack_len() is now implemented similarly to read_int()

Also removes unused 'start' and 'end' variables from unpack_num().

End-to-end benchmark shows ~4-5% improvement in row throughput.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
@mykaul mykaul force-pushed the cython-vector-deser branch from 8673d95 to 34dd41e Compare April 7, 2026 10:08
mykaul added 3 commits April 7, 2026 13:24
…helpers

Add buffer bounds validation to Cython deserializers for safety against
malformed buffers, refactor to use from_ptr_and_size() helper consistently,
and add float ntohl() specialization for consistency with int32/int16 paths.

Changes:
- subelem(): Add CQL protocol-compliant value handling (NULL/-1,
  not-set/-2, invalid/<-2) with bounds checking
- _unpack_len(): Add bounds check and use memcpy for alignment safety
- DesTupleType: Add defensive bounds checking for tuple item lengths
- DesCompositeType: Add bounds validation for composite element lengths
- Refactor 4 locations to use from_ptr_and_size() instead of manual
  Buffer field assignment
- Add float branch to unpack_num(): reinterpret bits as uint32,
  ntohl(), reinterpret back (consistent with int16/int32 intrinsic paths)
- Add from_ptr_and_size() declaration to buffer.pxd

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
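The length handling listed for subelem() in the commit message above (NULL/-1, not-set/-2, invalid/<-2, plus bounds checks) can be sketched in plain Python. This is an illustration, not the Cython code: `read_element` and the `NOT_SET` sentinel are hypothetical, and the element is assumed to be prefixed by a 4-byte big-endian signed length.

```python
import struct

NOT_SET = object()  # hypothetical sentinel for the "not set" marker (-2)


def read_element(buf: bytes, offset: int):
    """Read one length-prefixed element; return (value, next_offset)."""
    if offset < 0 or offset + 4 > len(buf):
        raise ValueError("buffer too short for a 4-byte length prefix")
    (length,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    if length == -1:
        return None, offset      # NULL value, no payload bytes follow
    if length == -2:
        return NOT_SET, offset   # "not set" value, no payload bytes follow
    if length < -2:
        raise ValueError("invalid element length %d" % length)
    if offset + length > len(buf):
        raise ValueError("element length %d exceeds remaining buffer" % length)
    return buf[offset:offset + length], offset + length


data = struct.pack(">i", 3) + b"abc" + struct.pack(">i", -1)
value, pos = read_element(data, 0)
assert value == b"abc"
value, pos = read_element(data, pos)
assert value is None
```

Checking the declared length against the remaining bytes before slicing is what protects against malformed buffers; without it, a corrupt length could read past the end of the data in C-level code.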
…izer

Added DesVectorType Cython deserializer with C-level optimizations for
improved performance in row parsing for vectors.
The deserializer uses:
- Direct C byte swapping (ntohl, ntohs) for numeric types
- Memory operations without Python object overhead
- Unified numpy path for large vectors (≥32 elements)
- struct.unpack fallback for small vectors (<32 elements)

Performance improvements:
- Small vectors (3-4 elements): 4.4-4.7x faster
- Medium vectors (128 elements): 1.0-1.5x faster
- Large vectors (384-1536 elements): 0.9-1.0x (marginal)

The Cython deserializer is automatically used by the row parser when
available via find_deserializer().

Includes unit tests and benchmark code.

Follow-up commits will try to get Numpy arrays, and perhaps more.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The 'values' list was allocated but never used — the method builds
results directly into a pre-allocated tuple via tuple_set(res, i, item).
Removes one unnecessary list allocation per tuple deserialization.
@mykaul mykaul force-pushed the cython-vector-deser branch from 34dd41e to 9b7697c Compare April 7, 2026 10:24
@mykaul mykaul changed the title (improvement) Optimize Cython deserialization primitives and add VectorType Cython deserializer (improvement) Optimize Cython deserialization primitives and add VectorType Cython deserializer (substantial - 11x-30x speedup mainly via DesVectorType.deserialize_bytes with Cyhon) Apr 7, 2026
@mykaul mykaul changed the title (improvement) Optimize Cython deserialization primitives and add VectorType Cython deserializer (substantial - 11x-30x speedup mainly via DesVectorType.deserialize_bytes with Cyhon) (improvement) Optimize Cython deserialization primitives and add VectorType Cython deserializer (substantial - 11x-30x speedup mainly via DesVectorType.deserialize_bytes with Cython) Apr 12, 2026