
(improvement) Optimize VectorType deserialization with struct.unpack and numpy (us level improvements - 2-13x speedup - Python path only!)#730

Open
mykaul wants to merge 3 commits into scylladb:master from mykaul:vector-struct-numpy-deser

Conversation


@mykaul mykaul commented Mar 7, 2026

Summary

  • Replace element-by-element VectorType deserialization with bulk struct.unpack for known numeric types (float, double, int32, int64, short), caching a struct.Struct object at type-creation time
  • Add numpy fast-path (np.frombuffer().tolist()) for vectors with >= 32 elements
  • Cache serial_size() results to eliminate per-call method dispatch overhead
  • Fix exception handling in variable-size vector path: remove dead KeyError catch, wrap subtype.deserialize failures with element context and proper exception chaining

Performance (pure Python, best of 5)

Deserialization:

| Vector Config | Master | PR #730 | Speedup |
| --- | --- | --- | --- |
| Vector<float, 4> | 1.12 us | 0.22 us | 5.1x |
| Vector<float, 16> | 3.23 us | 0.35 us | 9.2x |
| Vector<float, 128> | 23.46 us | 1.91 us | 12.3x |
| Vector<float, 768> | 146.07 us | 11.22 us | 13.0x |
| Vector<float, 1536> | 293.27 us | 21.98 us | 13.3x |

Serialization:

| Vector Config | Master | PR #730 | Speedup |
| --- | --- | --- | --- |
| Vector<float, 4> | 0.55 us | 0.16 us | 3.4x |
| Vector<float, 16> | 1.67 us | 0.24 us | 7.0x |
| Vector<float, 128> | 11.15 us | 1.01 us | 11.0x |
| Vector<float, 768> | 62.53 us | 5.12 us | 12.2x |
| Vector<float, 1536> | 123.69 us | 10.82 us | 11.4x |

serial_size() overhead:

| | Master | PR #730 | Speedup |
| --- | --- | --- | --- |
| serial_size() call (768-dim) | 104 ns | 50 ns | 2.1x |

Details

Commit 1 -- struct.unpack optimization + variable-size path fixes:

  • At apply_parameters() time, cache a struct.Struct('>Nf') for the vector's subtype+dimension
  • deserialize() calls list(struct.unpack(byts)) -- single C-level bulk unpack
  • Also optimizes serialization via struct.pack(*v)
  • Fallback for non-numeric fixed-size types uses pre-allocated result list + cached method reference
  • Variable-size path: remove dead KeyError from except clause (uvint_unpack only raises IndexError), wrap subtype.deserialize failures in ValueError with element index and proper exception chaining (from e)
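
A minimal sketch of the fixed-size fast path described above, under the assumption of illustrative names (`_STRUCT_FORMATS` and `VectorDeserializer` are stand-ins, not the driver's actual internals):

```python
import struct

# Map of numeric CQL type names to struct format characters (illustrative).
_STRUCT_FORMATS = {
    'float': 'f', 'double': 'd', 'int': 'i', 'bigint': 'q', 'smallint': 'h',
}

class VectorDeserializer:
    """Hypothetical stand-in for a parameterized VectorType subclass."""

    def __init__(self, subtype_name, dim):
        fmt = _STRUCT_FORMATS.get(subtype_name)
        # Cache one big-endian Struct for the whole vector at type-creation
        # time, e.g. '>4f' for Vector<float, 4>.
        self._struct = struct.Struct('>%d%s' % (dim, fmt)) if fmt else None

    def serialize(self, values):
        # Single C-level bulk pack.
        return self._struct.pack(*values)

    def deserialize(self, byts):
        # Single C-level bulk unpack instead of a per-element Python loop.
        return list(self._struct.unpack(byts))
```

Compiling the format string once in `struct.Struct` avoids re-parsing `'>Nf'` on every call, which is where much of the per-call overhead lives.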

Commit 2 -- numpy for large vectors:

  • For vectors >= 32 elements with a known numeric dtype, use np.frombuffer(byts, dtype='>f4', count=N).tolist()
  • numpy avoids intermediate Python object creation during unpacking; .tolist() batch-converts with better cache locality
  • Threshold of 32 chosen empirically: below this, struct.unpack is faster due to lower fixed overhead
  • _numpy_dtype cached on the class at type-creation time (no per-call dict construction)
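
The hybrid selection between struct.unpack and numpy can be sketched as follows (the threshold and dtype follow the description above; `deserialize_float_vector` is a hypothetical helper, not the driver's API):

```python
import struct
import numpy as np

NUMPY_THRESHOLD = 32  # below this, struct.unpack's lower fixed overhead wins

def deserialize_float_vector(byts, dim):
    # Hypothetical helper mirroring the hybrid strategy described above.
    if dim >= NUMPY_THRESHOLD:
        # Bulk-decode big-endian float32 without creating intermediate
        # Python objects; .tolist() batch-converts to Python floats.
        return np.frombuffer(byts, dtype='>f4', count=dim).tolist()
    return list(struct.unpack('>%df' % dim, byts))
```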

Commit 3 -- serial_size caching:

  • Cache subtype.serial_size() result as _subtype_serial_size and the full vector serial size as _serial_size during apply_parameters()
  • serial_size() returns cached value directly (no method dispatch chain)
  • serialize() and deserialize() use cls._subtype_serial_size instead of calling cls.subtype.serial_size() each time
  • Eliminates ~50ns overhead per serialize/deserialize call
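
A simplified model of the caching in apply_parameters(); only the attribute names (`_subtype_serial_size`, `_serial_size`) come from the PR description, the rest is an illustrative sketch:

```python
class VectorType:
    # Simplified sketch; the real driver builds parameterized subclasses
    # with more machinery than shown here.
    @classmethod
    def apply_parameters(cls, subtype, vector_size):
        sub = type('VectorInstance', (cls,), {})
        sub.subtype = subtype
        sub.vector_size = vector_size
        # Computed once at type-creation time, not per serialize/deserialize.
        sub._subtype_serial_size = subtype.serial_size()
        sub._serial_size = sub._subtype_serial_size * vector_size
        return sub

    @classmethod
    def serial_size(cls):
        # Direct cached read; no per-call method dispatch chain.
        return cls._serial_size


class FloatType:
    """Stand-in subtype: a 4-byte fixed-size element."""
    @classmethod
    def serial_size(cls):
        return 4
```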

All three commits modify only cassandra/cqltypes.py. No Cython dependency.


Copilot AI left a comment


Pull request overview

This PR optimizes VectorType (de)serialization in cassandra/cqltypes.py by introducing bulk numeric (de)serialization via a cached struct.Struct, and an optional numpy-based deserialization fast path for larger vectors.

Changes:

  • Cache a per-parameterized-vector struct.Struct to bulk unpack/pack common numeric vector subtypes.
  • Add an optional numpy frombuffer(...).tolist() deserialization fast-path for vectors with vector_size >= 32.
  • Refactor variable-size vector deserialization to a fixed-iteration loop with stricter bounds checks.
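
The per-element error wrapping in the variable-size path can be sketched as follows (the helper name and caller-supplied deserializer are illustrative; the ValueError wrapping and `from e` chaining follow the PR description):

```python
def deserialize_elements(chunks, element_deserialize):
    # Illustrative sketch of the fixed-iteration loop: wrap any element
    # failure in a ValueError that records the element index, chaining
    # the original exception with `from e` so the root cause survives.
    result = []
    for idx, chunk in enumerate(chunks):
        try:
            result.append(element_deserialize(chunk))
        except Exception as e:
            raise ValueError(
                'Failed to deserialize vector element at index %d' % idx
            ) from e
    return result
```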


mykaul added 2 commits April 2, 2026 13:51
…ct.unpack

Add bulk deserialization using struct.unpack for common numeric vector types
instead of element-by-element deserialization. This provides significant
performance improvements, especially for small vectors and integer types.

Optimized types:
- FloatType  ('>Nf' format)
- DoubleType ('>Nd' format)
- Int32Type  ('>Ni' format)
- LongType   ('>Nq' format)
- ShortType  ('>Nh' format)

Performance improvements (measured with CASS_DRIVER_NO_CYTHON=1):

Small vectors (3-4 elements):
  Vector<float, 3>  : 0.88 μs → 0.25 μs  (3.58x faster)
  Vector<float, 4>  : 0.78 μs → 0.28 μs  (2.79x faster)

Medium vectors (128 elements):
  Vector<float, 128>  : 4.72 μs → 4.06 μs  (1.16x faster)
  Vector<double, 128> : 4.83 μs → 4.01 μs  (1.20x faster)
  Vector<int, 128>    : 2.27 μs → 1.25 μs  (1.82x faster)

Large vectors (384-1536 elements):
  Vector<float, 384>  : 15.38 μs → 14.67 μs  (1.05x faster)
  Vector<float, 768>  : 32.43 μs → 30.72 μs  (1.06x faster)
  Vector<float, 1536> : 63.74 μs → 63.24 μs  (1.01x faster)

The optimization is most effective for:
- Small vectors (3-4 elements): 2.8-3.6x speedup
- Integer vectors: 1.8x speedup
- Medium-sized float/double vectors: 1.2-1.3x speedup

For very large vectors (384+ elements), the benefit is minimal as the
deserialization time is dominated by data copying rather than function
call overhead.

Variable-size subtypes and other numeric types continue to use the
element-by-element fallback path.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
For vectors with 32 or more elements, use numpy.frombuffer() which provides
1.3-1.5x speedup for large vectors (128+ elements) compared to struct.unpack.

The hybrid approach:
- Small vectors (< 32 elements): struct.unpack (2.8-3.6x faster than baseline)
- Large vectors (>= 32 elements): numpy.frombuffer().tolist() (1.3-1.5x faster than struct.unpack)

Threshold of 32 elements balances code complexity with performance gains.

Benchmark results:
- float[128]:  2.15 μs → 1.87 μs (1.15x faster)
- float[384]:  6.17 μs → 4.44 μs (1.39x faster)
- float[768]: 12.25 μs → 8.45 μs (1.45x faster)
- float[1536]: 24.44 μs → 15.77 μs (1.55x faster)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
@mykaul mykaul force-pushed the vector-struct-numpy-deser branch from c417e73 to 0535ecd on April 2, 2026 10:52
@mykaul mykaul self-assigned this Apr 2, 2026
@mykaul mykaul marked this pull request as ready for review April 2, 2026 12:22
…ated method dispatch

Cache subtype.serial_size() and the full vector serial_size() as class
attributes (_subtype_serial_size, _serial_size) during apply_parameters().
This eliminates per-call method dispatch overhead in serialize(),
deserialize(), and serial_size() hot paths.

serial_size() call: 99ns -> 46ns (2.2x faster)
Attribute access: 54ns -> 17ns (3.2x faster)
@mykaul mykaul changed the title (improvement) Optimize VectorType deserialization with struct.unpack and numpy (improvement) Optimize VectorType deserialization with struct.unpack and numpy (us level improvements - 2-13x speedup - Python path only) Apr 7, 2026
@mykaul mykaul changed the title (improvement) Optimize VectorType deserialization with struct.unpack and numpy (us level improvements - 2-13x speedup - Python path only) (improvement) Optimize VectorType deserialization with struct.unpack and numpy (us level improvements - 2-13x speedup - Python path only!) Apr 7, 2026