(improvement) Optimize VectorType deserialization with struct.unpack and numpy (µs-level improvements, 2-13x speedup, Python path only!) #730
Open
mykaul wants to merge 3 commits into scylladb:master from
Conversation
Pull request overview
This PR optimizes VectorType (de)serialization in cassandra/cqltypes.py by introducing bulk numeric (de)serialization via a cached struct.Struct, and an optional numpy-based deserialization fast path for larger vectors.
Changes:
- Cache a per-parameterized-vector `struct.Struct` to bulk `unpack`/`pack` common numeric vector subtypes.
- Add an optional numpy `frombuffer(...).tolist()` deserialization fast path for vectors with `vector_size >= 32`.
- Refactor variable-size vector deserialization to a fixed-iteration loop with stricter bounds checks.
Add bulk deserialization using struct.unpack for common numeric vector types
instead of element-by-element deserialization. This provides significant
performance improvements, especially for small vectors and integer types.
Optimized types:
- FloatType ('>Nf' format)
- DoubleType ('>Nd' format)
- Int32Type ('>Ni' format)
- LongType ('>Nq' format)
- ShortType ('>Nh' format)
Performance improvements (measured with CASS_DRIVER_NO_CYTHON=1):
Small vectors (3-4 elements):
Vector<float, 3> : 0.88 μs → 0.25 μs (3.58x faster)
Vector<float, 4> : 0.78 μs → 0.28 μs (2.79x faster)
Medium vectors (128 elements):
Vector<float, 128> : 4.72 μs → 4.06 μs (1.16x faster)
Vector<double, 128> : 4.83 μs → 4.01 μs (1.20x faster)
Vector<int, 128> : 2.27 μs → 1.25 μs (1.82x faster)
Large vectors (384-1536 elements):
Vector<float, 384> : 15.38 μs → 14.67 μs (1.05x faster)
Vector<float, 768> : 32.43 μs → 30.72 μs (1.06x faster)
Vector<float, 1536> : 63.74 μs → 63.24 μs (1.01x faster)
The optimization is most effective for:
- Small vectors (3-4 elements): 2.8-3.6x speedup
- Integer vectors: 1.8x speedup
- Medium-sized float/double vectors: 1.2-1.3x speedup
For very large vectors (384+ elements), the benefit is minimal as the
deserialization time is dominated by data copying rather than function
call overhead.
Variable-size subtypes and other numeric types continue to use the
element-by-element fallback path.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
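The bulk-unpack idea behind this commit can be sketched as follows. This is an illustrative standalone sketch, not the driver's actual internals: the `_FORMATS` map and `make_vector_struct` helper are hypothetical names, but the format strings match the ones listed above.

```python
import struct

# Map of optimized fixed-size numeric subtypes to struct format characters
# (names here are illustrative, not the driver's actual internals).
_FORMATS = {
    'FloatType': 'f', 'DoubleType': 'd', 'Int32Type': 'i',
    'LongType': 'q', 'ShortType': 'h',
}

def make_vector_struct(subtype_name, vector_size):
    """Build a struct.Struct for a fixed-size numeric vector, to be cached
    once per parameterized type; return None for subtypes that must fall
    back to element-by-element (de)serialization."""
    fmt = _FORMATS.get(subtype_name)
    if fmt is None:
        return None
    # e.g. '>3f' for Vector<float, 3> -- big-endian, N elements
    return struct.Struct('>%d%s' % (vector_size, fmt))

s = make_vector_struct('FloatType', 3)
blob = s.pack(1.0, 2.0, 3.0)      # bulk serialize: one C-level call
values = list(s.unpack(blob))     # bulk deserialize: one C-level call
```

Caching the `struct.Struct` at type-creation time avoids re-parsing the format string on every call, which is where the small-vector speedup comes from.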
For vectors with 32 or more elements, use numpy.frombuffer(), which provides a 1.3-1.5x speedup for large vectors (128+ elements) compared to struct.unpack.

The hybrid approach:
- Small vectors (< 32 elements): struct.unpack (2.8-3.6x faster than baseline)
- Large vectors (>= 32 elements): numpy.frombuffer().tolist() (1.3-1.5x faster than struct.unpack)

A threshold of 32 elements balances code complexity with performance gains.

Benchmark results:
- float[128]: 2.15 μs → 1.87 μs (1.15x faster)
- float[384]: 6.17 μs → 4.44 μs (1.39x faster)
- float[768]: 12.25 μs → 8.45 μs (1.45x faster)
- float[1536]: 24.44 μs → 15.77 μs (1.55x faster)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
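The hybrid dispatch can be sketched like this (an illustrative sketch, assuming a float subtype; the function name is hypothetical, not the driver's actual code):

```python
import struct
import numpy as np

NUMPY_THRESHOLD = 32  # crossover point chosen in this commit

def deserialize_float_vector(byts, n):
    """Hybrid fast path: struct.unpack for small vectors, numpy for large ones."""
    if n >= NUMPY_THRESHOLD:
        # '>f4' = big-endian 32-bit float; tolist() bulk-converts the whole
        # array to Python floats in one pass
        return np.frombuffer(byts, dtype='>f4', count=n).tolist()
    return list(struct.unpack('>%df' % n, byts))
```

`np.frombuffer` creates a zero-copy view over the input bytes, so the only real work is the single `tolist()` conversion, which explains why the advantage grows with vector size.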
…ated method dispatch

Cache subtype.serial_size() and the full vector serial_size() as class attributes (_subtype_serial_size, _serial_size) during apply_parameters(). This eliminates per-call method dispatch overhead in the serialize(), deserialize(), and serial_size() hot paths.

serial_size() call: 99 ns → 46 ns (2.2x faster)
Attribute access: 54 ns → 17 ns (3.2x faster)
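The caching pattern described in this commit can be sketched as follows. The classes below are hypothetical stand-ins for the driver's types; only the attribute names (`_subtype_serial_size`, `_serial_size`) mirror the commit message.

```python
# Sketch of caching serial sizes as class attributes at parameterization time.
class VectorTypeSketch:
    @classmethod
    def apply_parameters(cls, subtype, vector_size):
        # Method dispatch through the subtype happens exactly once, here.
        sub_size = subtype.serial_size()
        return type('ParameterizedVector', (cls,), {
            'subtype': subtype,
            'vector_size': vector_size,
            '_subtype_serial_size': sub_size,
            '_serial_size': None if sub_size is None else sub_size * vector_size,
        })

    @classmethod
    def serial_size(cls):
        return cls._serial_size  # plain attribute read, no dispatch chain

# Hypothetical fixed-size subtype for demonstration.
class FloatSub:
    @classmethod
    def serial_size(cls):
        return 4

V = VectorTypeSketch.apply_parameters(FloatSub, 768)
size = V.serial_size()  # reads the cached value
```

Since parameterized CQL types are created once and reused for every row, paying the dispatch cost at `apply_parameters()` time moves it entirely out of the per-value hot path.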
Summary
- Use `struct.unpack` for known numeric types (float, double, int32, int64, short), caching a `struct.Struct` object at type-creation time
- numpy fast path (`np.frombuffer().tolist()`) for vectors with >= 32 elements
- Cache `serial_size()` results to eliminate per-call method dispatch overhead
- Replace an unreachable `KeyError` catch; wrap `subtype.deserialize` failures with element context and proper exception chaining

Performance (pure Python, best of 5)
Deserialization: Vector<float, 4>, Vector<float, 16>, Vector<float, 128>, Vector<float, 768>, Vector<float, 1536>

Serialization: Vector<float, 4>, Vector<float, 16>, Vector<float, 128>, Vector<float, 768>, Vector<float, 1536>

serial_size() overhead: serial_size() call (768-dim)

Details
Commit 1 -- struct.unpack optimization + variable-size path fixes:
- At `apply_parameters()` time, cache a `struct.Struct('>Nf')` for the vector's subtype + dimension
- `deserialize()` calls `list(struct.unpack(byts))` -- a single C-level bulk unpack
- Serialization uses `struct.pack(*v)`
- Remove `KeyError` from the except clause (`uvint_unpack` only raises `IndexError`); wrap `subtype.deserialize` failures in `ValueError` with the element index and proper exception chaining (`from e`)

Commit 2 -- numpy for large vectors:
- Deserialization uses `np.frombuffer(byts, dtype='>f4', count=N).tolist()`; `.tolist()` batch-converts with better cache locality
- `_numpy_dtype` is cached on the class at type-creation time (no per-call dict construction)

Commit 3 -- serial_size caching:
- Cache the `subtype.serial_size()` result as `_subtype_serial_size` and the full vector serial size as `_serial_size` during `apply_parameters()`
- `serial_size()` returns the cached value directly (no method dispatch chain)
- `serialize()` and `deserialize()` use `cls._subtype_serial_size` instead of calling `cls.subtype.serial_size()` each time

All three commits modify only `cassandra/cqltypes.py`. No Cython dependency.
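The exception-wrapping change from Commit 1 can be sketched as follows. Function and class names here are illustrative, not the driver's actual internals; the sketch only shows the `from e` chaining plus element-index context described above.

```python
# Sketch of wrapping per-element deserialize failures with element context
# and chaining the original exception (names are illustrative).
def deserialize_elements(chunks, subtype, protocol_version):
    result = []
    for i, chunk in enumerate(chunks):
        try:
            result.append(subtype.deserialize(chunk, protocol_version))
        except Exception as e:
            # 'from e' preserves the original traceback as __cause__
            raise ValueError('failed to deserialize vector element %d' % i) from e
    return result

# Hypothetical subtype that fails on a specific input, for demonstration.
class FlakySub:
    @staticmethod
    def deserialize(byts, protocol_version):
        if byts == b'bad':
            raise IndexError('truncated element')
        return byts.decode()
```

A caller now sees which element failed, while the underlying `IndexError` stays reachable via `__cause__` for debugging.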