(Improvement) reduced memory for metadata schema parsing (main improvement - Select only needed columns from system_schema.column )#745
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to reduce CPU time and memory usage during schema metadata refresh by avoiding per-row dict allocations and trimming schema column queries, while also shrinking metadata object overhead.
Changes:
- Introduces an internal
_RowView+_row_factoryand routes schema query result handling through it to reduce per-row allocations. - Adds
__slots__to several metadata model classes and replaces someOrderedDictusages with plaindictto reduce memory overhead. - Narrows the
system_schema.columnsquery inSchemaParserV3to only the fields needed by the parser and refactors_build_table_columnsto classify rows in a single pass.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
3a7f4bc to
4f9e4f5
Compare
There was a problem hiding this comment.
Pull request overview
This PR optimizes schema metadata refresh by reducing per-row allocations during schema parsing and narrowing the system_schema.columns select list to only the fields needed for building table/column metadata.
Changes:
- Introduces an internal lightweight row representation (
_RowView+_row_factory) and uses it in schema parser result handling to reduce time/memory overhead. - Reduces the
system_schema.columnsquery to fetch only required columns and refactors_build_table_columnsto classify rows in a single pass. - Adds unit tests for
_RowViewand_row_factorybehavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| cassandra/metadata.py | Adds _RowView/_row_factory, switches schema parser row handling away from per-row dicts, tightens system_schema.columns query, and updates metadata classes/docstrings/slots. |
| tests/unit/test_metadata.py | Adds unit tests validating _RowView and _row_factory semantics (getitem/get/contains/read-only/shared index map). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
4f9e4f5 to
069ce0f
Compare
Introduce _RowView, a __slots__-based read-only row wrapper that stores data as tuples with a shared column-name-to-index map, and _row_factory that creates these views. _RowView inherits from collections.abc.Mapping, providing a complete dict-like read interface. This eliminates per-row dict allocation during schema parsing. All rows from the same result set share a single index map object.
Python 3.7+ guarantees dict preserves insertion order, making OrderedDict
unnecessary. Replace OrderedDict() with {} in TableMetadata.columns,
TableMetadata.triggers, and MaterializedViewMetadata.columns. Remove the
now-unused OrderedDict import.
….columns Replace SELECT * with an explicit column list for the system_schema.columns query in SchemaParserV3 (inherited by V4). Only the 7 columns actually consumed by the parser are fetched: keyspace_name, table_name, column_name, clustering_order, kind, position, type. This reduces network transfer and deserialization overhead during schema refresh.
Replace dict_factory in _SchemaParser._handle_results and get_column_from_system_local with _row_factory, eliminating per-row dict allocation during schema parsing. Also refactor SchemaParserV4._build_keyspace_metadata_internal to read from the row without mutating it, since _RowView is read-only. Note: V22-only dict_factory call sites are left unchanged as they do not affect the V3/V4 code path (V3 and V4 fully override _query_all).
Replace three list comprehension passes over col_rows with a single classification loop that sorts columns into partition, clustering, and other buckets. Also use in-place sort() instead of sorted() and reuse the already-built column_meta instead of a redundant dict lookup.
Cover __getitem__, get(), __contains__, __repr__, shared index map, read-only enforcement, empty input, single-column, and multi-row scenarios.
069ce0f to
eda1086
Compare
Reduce the time and memory required to parse schema metadata when refreshing. The key changes are:
system_schema.columns— the biggest network/memory win for large schemasdict_factorywith_RowView— lightweight tuple-backed Mapping that shares a single index map across all rows (5.7x memory reduction per row)__slots__on all metadata classes — no__dict__overhead, saving ~264 bytes perColumnMetadatainstance (~516 KB for 2000-column schemas)_build_table_columns— avoids repeated list-filtering passesOrderedDictwithdict— Python 3.7+ guarantees insertion orderPerformance results (Python 3.14, rebased on master)
_RowViewnow implements the fullcollections.abc.Mappingprotocol (keys(),values(),items(),len(), iteration).Pre-review checklist
./docs/source/.Fixes:annotations to PR description.