Stream 3d plot#11
Conversation
WalkthroughThe changes introduce a new optional parameter, Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant initialize.py
participant FileManager
participant pyarrow
Client->>initialize.py: Request '3D_SN_plot' data
initialize.py->>FileManager: get_results(..., use_pyarrow=True)
FileManager->>pyarrow: Load parquet dataset via pyarrow
pyarrow-->>FileManager: Return dataset
FileManager-->>initialize.py: Return results
initialize.py-->>Client: Return data
sequenceDiagram
participant update.py
participant pyarrow
participant pandas
update.py->>pyarrow: Filter per_scan_data by scanIndex
alt massIndex is provided
pyarrow->>pandas: Convert filtered table to DataFrame
pandas->>update.py: Extract SignalPeaks and NoisyPeaks by massIndex
else massIndex not provided
pyarrow-->>update.py: Return filtered table
end
update.py-->>update.py: Update data['per_scan_data']
Poem
Note ⚡️ AI Code Reviews for VS Code, Cursor, WindsurfCodeRabbit now has a plugin for VS Code, Cursor and Windsurf. This brings AI code reviews directly in the code editor. Each commit is reviewed immediately, finding bugs before the PR is raised. Seamless context handoff to your AI code agent ensures that you can easily incorporate review feedback. Note ⚡️ Faster reviews with cachingCodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (2)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/render/update.py (2)
80-92: Performance optimized filtering for 'Precursor Signals', consider adding error handling.The implementation now uses PyArrow dataset filtering instead of pandas operations, which should significantly improve performance for the 3D plot as mentioned in the PR objectives. The approach effectively:
- Filters data using PyArrow's predicate pushdown capability
- Only converts to pandas when necessary for the specific massIndex operations
- Handles missing indices gracefully
However, there's no explicit error handling if PyArrow operations fail.
Consider adding error handling to gracefully manage potential PyArrow-related exceptions:
elif component == 'Precursor Signals': scan_index = selection_store.get("scanIndex") mass_index = selection_store.get("massIndex") if scan_index is None: - data['per_scan_data'] = data['per_scan_data'].to_table(filter=(ds.field("index") == -1)).slice(0, 0) + try: + data['per_scan_data'] = data['per_scan_data'].to_table(filter=(ds.field("index") == -1)).slice(0, 0) + except Exception as e: + st.error(f"Error filtering data: {e}") + data['per_scan_data'] = pd.DataFrame() else: - filtered_table = data['per_scan_data'].to_table(filter=(ds.field("index") == scan_index)) - if mass_index is not None: - df = filtered_table.to_pandas() - df['SignalPeaks'] = df['SignalPeaks'].apply(lambda peaks: peaks[mass_index] if len(peaks) > mass_index else None) - df['NoisyPeaks'] = df['NoisyPeaks'].apply(lambda peaks: peaks[mass_index] if len(peaks) > mass_index else None) - filtered_table = df - data['per_scan_data'] = filtered_table + try: + filtered_table = data['per_scan_data'].to_table(filter=(ds.field("index") == scan_index)) + if mass_index is not None: + df = filtered_table.to_pandas() + df['SignalPeaks'] = df['SignalPeaks'].apply(lambda peaks: peaks[mass_index] if len(peaks) > mass_index else None) + df['NoisyPeaks'] = df['NoisyPeaks'].apply(lambda peaks: peaks[mass_index] if len(peaks) > mass_index else None) + filtered_table = df + data['per_scan_data'] = filtered_table + except Exception as e: + st.error(f"Error processing data: {e}") + data['per_scan_data'] = pd.DataFrame()
89-90: Consider using vectorized operations for better performance.The current approach uses
applywith lambda functions to transform the 'SignalPeaks' and 'NoisyPeaks' columns. For even better performance, consider using vectorized operations if possible, especially since this is part of a performance optimization PR.- df['SignalPeaks'] = df['SignalPeaks'].apply(lambda peaks: peaks[mass_index] if len(peaks) > mass_index else None) - df['NoisyPeaks'] = df['NoisyPeaks'].apply(lambda peaks: peaks[mass_index] if len(peaks) > mass_index else None) + # Vectorized approach if lists are of uniform length + # Check if any list is shorter than mass_index to avoid index errors + has_short_lists = any(len(peaks) <= mass_index for peaks in df['SignalPeaks'] if peaks is not None) + + if not has_short_lists: + # Fast path: vectorized extraction + df['SignalPeaks'] = df['SignalPeaks'].map(lambda peaks: None if peaks is None else peaks[mass_index]) + df['NoisyPeaks'] = df['NoisyPeaks'].map(lambda peaks: None if peaks is None else peaks[mass_index]) + else: + # Fallback: safe element-wise extraction + df['SignalPeaks'] = df['SignalPeaks'].apply(lambda peaks: peaks[mass_index] if peaks is not None and len(peaks) > mass_index else None) + df['NoisyPeaks'] = df['NoisyPeaks'].apply(lambda peaks: peaks[mass_index] if peaks is not None and len(peaks) > mass_index else None)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
js-component/dist/assets/index-e2941e20.jsis excluded by!**/dist/**js-component/dist/index.htmlis excluded by!**/dist/**
📒 Files selected for processing (7)
environment.yml(1 hunks)openms-streamlit-vue-component(1 hunks)requirements.txt(1 hunks)src/render/components.py(1 hunks)src/render/initialize.py(1 hunks)src/render/update.py(2 hunks)src/workflow/FileManager.py(3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/render/initialize.py (1)
src/workflow/FileManager.py (1)
get_results(404-449)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-openms
- GitHub Check: build-full-app
🔇 Additional comments (10)
openms-streamlit-vue-component (1)
1-1: Confirm correctness of the submodule commit update
The subproject pointer has been bumped to9aabf9919e7a46837a29030ff16cac7f1aad8319. Please verify that this commit includes all required Vue component changes, that the submodule is properly initialized/updated, and that the component build and integration tests pass with this new reference.src/render/initialize.py (1)
80-80: PyArrow optimization for 3D plot data loadingThe addition of
use_pyarrow=Trueto theget_resultscall for the 3D scatter and noise plot component enables more efficient loading of parquet data using PyArrow's dataset API instead of pandas.read_parquet. This change aligns with the PR objective of improving 3D plot performance.requirements.txt (1)
12-12: Consistent PyArrow version constraintAdding the minimum version
pyarrow>=15to complement the existingpyarrow<16ensures compatibility with the new PyArrow dataset API features being used. This version range is appropriately specified across both dependency files.environment.yml (1)
13-13: Consistent PyArrow version constraintAdding the minimum version constraint
pyarrow>=15to match requirements.txt ensures consistent dependency management across both conda and pip installations. This is necessary for the PyArrow dataset features being used for the 3D plot performance enhancement.src/workflow/FileManager.py (3)
9-9: PyArrow dataset module importAdded import for PyArrow's dataset module to enable more efficient loading and filtering of parquet files. This is appropriate for optimizing the 3D plot performance.
404-405: Optional PyArrow data loading parameterThe addition of
use_pyarrowparameter with a default ofFalsemaintains backward compatibility while enabling improved performance for specific components that opt-in to the feature.
441-444: PyArrow dataset loading implementationThis implementation provides a performance optimization by allowing components to use PyArrow's dataset API for loading parquet files when
use_pyarrow=True. This approach is more memory-efficient and faster for large datasets, especially when combined with filtering operations, making it well-suited for 3D visualization data.src/render/update.py (3)
3-3: Appropriate import for PyArrow integration.The addition of PyArrow enables more efficient data filtering operations, aligning with the PR's performance improvement objective.
73-73: LGTM: Component list extension.Inclusion of 'Augmented Deconvolved Spectrum' in the list of components that share the same filtering logic is appropriate.
84-84:Details
❓ Verification inconclusive
Verify that the field name is correct.
The code uses
ds.field("index")to filter the dataset, but make sure that "index" is the correct field name in the PyArrow dataset. Sometimes dataset field names might differ from DataFrame column names.
🏁 Script executed:
#!/bin/bash # Verify field names in PyArrow datasets # Look for usages of PyArrow dataset creation to confirm field names rg -A 3 "pyarrow\.dataset\.(dataset|parquet_dataset)" rg -A 3 "ds\.dataset"Length of output: 368
To confirm whether the
"index"field actually exists onper_scan_data, let’s inspect howper_scan_datais defined and where it comes from insrc/render/update.py.
🏁 Script executed:
#!/bin/bash # Show how per_scan_data is populated and used around line 84 rg -n "per_scan_data" -A5 -B5 src/render/update.py # Dump the first 200 lines of update.py for broader context sed -n '1,200p' src/render/update.pyLength of output: 6720
Confirm Parquet column name for filtering
At
src/render/update.py:84, you’re usingds.field("index")to filter the dataset. Please verify that your Parquet source actually defines a column named"index". You can inspect the schema (for example, in a REPL or small script) with:import pyarrow.dataset as ds dataset = ds.dataset(<your_parquet_path>, format="parquet") print(dataset.schema.names)If the field is named differently (e.g.
"scanIndex"), update the filter to use the correct column name.
Change the nginx listen directive from `listen 8501` to `listen 0.0.0.0:8501` so the load balancer explicitly accepts connections on all network interfaces, not just the default. https://claude.ai/code/session_01721rHYJhSZfJnnUpWHXbVs Co-authored-by: Claude <noreply@anthropic.com>
Significantly increases performance for 3d plot
Summary by CodeRabbit