Skip to content

(Improvement) reduced memory for metadata schema parsing (main improvement - Select only needed columns from system_schema.column )#745

Draft
mykaul wants to merge 6 commits into
scylladb:masterfrom
mykaul:improvement/metadata-schema-parsing
Draft

(Improvement) reduced memory for metadata schema parsing (main improvement - Select only needed columns from system_schema.column )#745
mykaul wants to merge 6 commits into
scylladb:masterfrom
mykaul:improvement/metadata-schema-parsing

Conversation

@mykaul
Copy link
Copy Markdown

@mykaul mykaul commented Mar 14, 2026

Reduce the time and memory required to parse schema metadata when refreshing. The key changes are:

  1. Select only needed columns from system_schema.columns — the biggest network/memory win for large schemas
  2. Replace dict_factory with _RowView — lightweight tuple-backed Mapping that shares a single index map across all rows (5.7x memory reduction per row)
  3. Pure __slots__ on all metadata classes — no __dict__ overhead, saving ~264 bytes per ColumnMetadata instance (~516 KB for 2000-column schemas)
  4. Single-pass column classification in _build_table_columns — avoids repeated list-filtering passes
  5. Replace OrderedDict with dict — Python 3.7+ guarantees insertion order

Performance results (Python 3.14, rebased on master)

Metric _RowView (new) dict_factory (old) Improvement
Row creation + access 651 ns/row 886 ns/row 1.36x faster
Full pipeline (100 tables × 20 cols) 3.37 ms 4.26 ms 1.26x faster
Memory per row 48 bytes 272 bytes 5.7x reduction
Bulk memory (2000 rows) 113 KB 561 KB 80% reduction
ColumnMetadata instance 80 bytes 344 bytes 264 bytes saved

_RowView now implements the full collections.abc.Mapping protocol (keys(), values(), items(), len(), iteration).

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@mykaul mykaul marked this pull request as draft March 14, 2026 09:10
@mykaul mykaul changed the title (Improvement) metadata schema parsing (Improvement) faster metadata schema parsing Mar 14, 2026
@mykaul mykaul requested a review from Copilot March 14, 2026 09:11
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to reduce CPU time and memory usage during schema metadata refresh by avoiding per-row dict allocations and trimming schema column queries, while also shrinking metadata object overhead.

Changes:

  • Introduces an internal _RowView + _row_factory and routes schema query result handling through it to reduce per-row allocations.
  • Adds __slots__ to several metadata model classes and replaces some OrderedDict usages with plain dict to reduce memory overhead.
  • Narrows the system_schema.columns query in SchemaParserV3 to only the fields needed by the parser and refactors _build_table_columns to classify rows in a single pass.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cassandra/metadata.py Outdated
Comment thread cassandra/metadata.py
Comment thread cassandra/metadata.py Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes schema metadata refresh by reducing per-row allocations during schema parsing and narrowing the system_schema.columns select list to only the fields needed for building table/column metadata.

Changes:

  • Introduces an internal lightweight row representation (_RowView + _row_factory) and uses it in schema parser result handling to reduce time/memory overhead.
  • Reduces the system_schema.columns query to fetch only required columns and refactors _build_table_columns to classify rows in a single pass.
  • Adds unit tests for _RowView and _row_factory behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
cassandra/metadata.py Adds _RowView/_row_factory, switches schema parser row handling away from per-row dicts, tightens system_schema.columns query, and updates metadata classes/docstrings/slots.
tests/unit/test_metadata.py Adds unit tests validating _RowView and _row_factory semantics (getitem/get/contains/read-only/shared index map).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mykaul mykaul self-assigned this Mar 16, 2026
@mykaul mykaul force-pushed the improvement/metadata-schema-parsing branch from 4f9e4f5 to 069ce0f Compare April 2, 2026 20:47
mykaul added 6 commits April 3, 2026 18:20
Introduce _RowView, a __slots__-based read-only row wrapper that stores
data as tuples with a shared column-name-to-index map, and _row_factory
that creates these views. _RowView inherits from collections.abc.Mapping,
providing a complete dict-like read interface.

This eliminates per-row dict allocation during schema parsing. All rows
from the same result set share a single index map object.
Python 3.7+ guarantees dict preserves insertion order, making OrderedDict
unnecessary. Replace OrderedDict() with {} in TableMetadata.columns,
TableMetadata.triggers, and MaterializedViewMetadata.columns. Remove the
now-unused OrderedDict import.
….columns

Replace SELECT * with an explicit column list for the system_schema.columns
query in SchemaParserV3 (inherited by V4). Only the 7 columns actually
consumed by the parser are fetched: keyspace_name, table_name, column_name,
clustering_order, kind, position, type. This reduces network transfer and
deserialization overhead during schema refresh.
Replace dict_factory in _SchemaParser._handle_results and
get_column_from_system_local with _row_factory, eliminating per-row
dict allocation during schema parsing.

Also refactor SchemaParserV4._build_keyspace_metadata_internal to read
from the row without mutating it, since _RowView is read-only.

Note: V22-only dict_factory call sites are left unchanged as they do not
affect the V3/V4 code path (V3 and V4 fully override _query_all).
Replace three list comprehension passes over col_rows with a single
classification loop that sorts columns into partition, clustering, and
other buckets. Also use in-place sort() instead of sorted() and reuse
the already-built column_meta instead of a redundant dict lookup.
Cover __getitem__, get(), __contains__, __repr__, shared index map,
read-only enforcement, empty input, single-column, and multi-row
scenarios.
@mykaul mykaul force-pushed the improvement/metadata-schema-parsing branch from 069ce0f to eda1086 Compare April 3, 2026 15:39
@mykaul mykaul changed the title (Improvement) faster metadata schema parsing (Improvement) reduced memory for metadata schema parsing Apr 3, 2026
@mykaul mykaul changed the title (Improvement) reduced memory for metadata schema parsing (Improvement) reduced memory for metadata schema parsing (main improvement - Select only needed columns from system_schema.column ) Apr 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants