1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@
* [FEATURE] Memberlist: Add `-memberlist.cluster-label` and `-memberlist.cluster-label-verification-disabled` to prevent accidental cross-cluster gossip joins and support rolling label rollout. #7385
* [FEATURE] Querier: Add timeout classification to classify query timeouts as 4XX (user error) or 5XX (system error) based on phase timing. When enabled, queries that spend most of their time in PromQL evaluation return `422 Unprocessable Entity` instead of `503 Service Unavailable`. #7374
* [FEATURE] Querier: Implement Resource Based Throttling in Querier. #7442
* [FEATURE] Querier: Add resource-based query eviction that automatically cancels the heaviest running query when CPU or heap utilization exceeds configured thresholds. #7488
* [ENHANCEMENT] Upgrade prometheus alertmanager version to v0.32.1. #7462
* [ENHANCEMENT] Tenant Federation: Avoid purging the regex resolver LRU cache on user-sync ticks when the set of known users has not changed. #7489
* [ENHANCEMENT] Parquet Converter: Add a ring status page to expose the ring status. #7455
42 changes: 42 additions & 0 deletions docs/blocks-storage/querier.md
@@ -330,6 +330,48 @@ querier:
# type. 0 to disable.
# CLI flag: -querier.query-protection.rejection.threshold.heap-utilization
[heap_utilization: <float> | default = 0]

eviction:
threshold:
# EXPERIMENTAL: Max CPU utilization that this instance can reach before
# evicting the heaviest running query (across all tenants) in
# percentage, between 0 and 1. monitored_resources config must include
# the resource type. 0 to disable.
# CLI flag: -querier.query-protection.eviction.threshold.cpu-utilization
[cpu_utilization: <float> | default = 0]

# EXPERIMENTAL: Max heap utilization that this instance can reach before
# evicting the heaviest running query (across all tenants) in
# percentage, between 0 and 1. monitored_resources config must include
# the resource type. 0 to disable.
# CLI flag: -querier.query-protection.eviction.threshold.heap-utilization
[heap_utilization: <float> | default = 0]

# EXPERIMENTAL: How frequently the evictor checks system resource
# utilization.
# CLI flag: -querier.query-protection.eviction.check-interval
[check_interval: <duration> | default = 1s]

# EXPERIMENTAL: Number of check intervals to wait after an eviction before
# evicting again.
# CLI flag: -querier.query-protection.eviction.cooldown-period
[cooldown_period: <int> | default = 3]

# EXPERIMENTAL: The query metric used to determine the heaviest query for
# eviction. Supported values: fetched_samples, fetched_series,
# fetched_chunks, fetched_chunk_bytes.
# CLI flag: -querier.query-protection.eviction.eviction-metric
[eviction_metric: <string> | default = "fetched_samples"]
Contributor


Any particular reason you chose `fetched_samples` as the default? Is there any data you can share on how well each metric correlates with query heaviness?

Contributor Author


Most of the queries I looked at that caused heavy heap usage on queriers combined a low scrape interval with a long time range. I believe samples is the best metric we currently have for capturing both of those dimensions together.

In the queriers that reach high heap usage, the pod is usually dominated by shards of one heavy query, with all 3 of those metrics higher than for other queries, so any of them would correctly identify it.

Ideally, a future upstream PR will give us access to a metric that correlates more accurately and directly with heap usage.


# EXPERIMENTAL: Minimum time a query must be running before it becomes
# eligible for eviction. Queries younger than this are ignored.
# CLI flag: -querier.query-protection.eviction.min-query-age
[min_query_age: <duration> | default = 10s]

# EXPERIMENTAL: Maximum number of queries to evict in a single check cycle
# when resource thresholds are breached.
# CLI flag: -querier.query-protection.eviction.max-evictions-per-cycle
[max_evictions_per_cycle: <int> | default = 1]
```
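
As a hedged illustration of how the eviction options above fit together, the fragment below enables eviction on a querier. The nesting is an assumption inferred from the CLI flag names (the indentation is not visible in this view), and the threshold values are illustrative only, not recommendations. Per the option descriptions, the `monitored_resources` config must also include the corresponding resource types; that setting is not shown here.

```yaml
# Sketch only: nesting inferred from the -querier.query-protection.eviction.* flags.
querier:
  query_protection:
    eviction:
      threshold:
        # Illustrative thresholds (fractions between 0 and 1); 0 disables the check.
        cpu_utilization: 0.9
        heap_utilization: 0.85
      # The remaining values are the documented defaults.
      check_interval: 1s
      cooldown_period: 3
      eviction_metric: fetched_samples
      min_query_age: 10s
      max_evictions_per_cycle: 1
```

Read against the option descriptions, this would check utilization every second, ignore queries younger than 10s, cancel at most one query per check, and then wait 3 check intervals (about 3s) before evicting again.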

### `blocks_storage_config`
42 changes: 42 additions & 0 deletions docs/blocks-storage/store-gateway.md
@@ -372,6 +372,48 @@ store_gateway:
# CLI flag: -store-gateway.query-protection.rejection.threshold.heap-utilization
[heap_utilization: <float> | default = 0]

eviction:
threshold:
# EXPERIMENTAL: Max CPU utilization that this instance can reach before
# evicting the heaviest running query (across all tenants) in
# percentage, between 0 and 1. monitored_resources config must include
# the resource type. 0 to disable.
# CLI flag: -store-gateway.query-protection.eviction.threshold.cpu-utilization
[cpu_utilization: <float> | default = 0]

# EXPERIMENTAL: Max heap utilization that this instance can reach before
# evicting the heaviest running query (across all tenants) in
# percentage, between 0 and 1. monitored_resources config must include
# the resource type. 0 to disable.
# CLI flag: -store-gateway.query-protection.eviction.threshold.heap-utilization
[heap_utilization: <float> | default = 0]

# EXPERIMENTAL: How frequently the evictor checks system resource
# utilization.
# CLI flag: -store-gateway.query-protection.eviction.check-interval
[check_interval: <duration> | default = 1s]

# EXPERIMENTAL: Number of check intervals to wait after an eviction before
# evicting again.
# CLI flag: -store-gateway.query-protection.eviction.cooldown-period
[cooldown_period: <int> | default = 3]

# EXPERIMENTAL: The query metric used to determine the heaviest query for
# eviction. Supported values: fetched_samples, fetched_series,
# fetched_chunks, fetched_chunk_bytes.
# CLI flag: -store-gateway.query-protection.eviction.eviction-metric
[eviction_metric: <string> | default = "fetched_samples"]

# EXPERIMENTAL: Minimum time a query must be running before it becomes
# eligible for eviction. Queries younger than this are ignored.
# CLI flag: -store-gateway.query-protection.eviction.min-query-age
[min_query_age: <duration> | default = 10s]

# EXPERIMENTAL: Maximum number of queries to evict in a single check cycle
# when resource thresholds are breached.
# CLI flag: -store-gateway.query-protection.eviction.max-evictions-per-cycle
[max_evictions_per_cycle: <int> | default = 1]

hedged_request:
# If true, hedged requests are applied to object store calls. It can help
# with reducing tail latency.
126 changes: 126 additions & 0 deletions docs/configuration/config-file-reference.md
@@ -3877,6 +3877,48 @@ query_protection:
# disable.
# CLI flag: -ingester.query-protection.rejection.threshold.heap-utilization
[heap_utilization: <float> | default = 0]

eviction:
threshold:
# EXPERIMENTAL: Max CPU utilization that this instance can reach before
# evicting the heaviest running query (across all tenants) in percentage,
# between 0 and 1. monitored_resources config must include the resource
# type. 0 to disable.
# CLI flag: -ingester.query-protection.eviction.threshold.cpu-utilization
[cpu_utilization: <float> | default = 0]

# EXPERIMENTAL: Max heap utilization that this instance can reach before
# evicting the heaviest running query (across all tenants) in percentage,
# between 0 and 1. monitored_resources config must include the resource
# type. 0 to disable.
# CLI flag: -ingester.query-protection.eviction.threshold.heap-utilization
[heap_utilization: <float> | default = 0]

# EXPERIMENTAL: How frequently the evictor checks system resource
# utilization.
# CLI flag: -ingester.query-protection.eviction.check-interval
[check_interval: <duration> | default = 1s]

# EXPERIMENTAL: Number of check intervals to wait after an eviction before
# evicting again.
# CLI flag: -ingester.query-protection.eviction.cooldown-period
[cooldown_period: <int> | default = 3]

# EXPERIMENTAL: The query metric used to determine the heaviest query for
# eviction. Supported values: fetched_samples, fetched_series,
# fetched_chunks, fetched_chunk_bytes.
# CLI flag: -ingester.query-protection.eviction.eviction-metric
[eviction_metric: <string> | default = "fetched_samples"]

# EXPERIMENTAL: Minimum time a query must be running before it becomes
# eligible for eviction. Queries younger than this are ignored.
# CLI flag: -ingester.query-protection.eviction.min-query-age
[min_query_age: <duration> | default = 10s]

# EXPERIMENTAL: Maximum number of queries to evict in a single check cycle
# when resource thresholds are breached.
# CLI flag: -ingester.query-protection.eviction.max-evictions-per-cycle
[max_evictions_per_cycle: <int> | default = 1]
```
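
A second sketch, this time for an ingester, shows heap-only eviction with a slower cadence and a different ranking metric. The top-level `ingester:` key and the nesting are assumptions inferred from the `-ingester.query-protection.eviction.*` flag names, and every value is illustrative.

```yaml
# Sketch only: heap-driven eviction on an ingester, ranked by fetched chunk bytes.
ingester:
  query_protection:
    eviction:
      threshold:
        cpu_utilization: 0      # 0 disables CPU-based eviction
        heap_utilization: 0.8   # illustrative value
      check_interval: 2s
      cooldown_period: 5        # counted in check intervals: 5 x 2s = 10s
      eviction_metric: fetched_chunk_bytes
      min_query_age: 30s
      max_evictions_per_cycle: 1
```

Because `cooldown_period` is expressed in check intervals rather than as a duration, changing `check_interval` also changes the effective cooldown; here it works out to roughly 10s between evictions.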

### `ingester_client_config`
@@ -5032,6 +5074,48 @@ query_protection:
# disable.
# CLI flag: -querier.query-protection.rejection.threshold.heap-utilization
[heap_utilization: <float> | default = 0]

eviction:
threshold:
# EXPERIMENTAL: Max CPU utilization that this instance can reach before
# evicting the heaviest running query (across all tenants) in percentage,
# between 0 and 1. monitored_resources config must include the resource
# type. 0 to disable.
# CLI flag: -querier.query-protection.eviction.threshold.cpu-utilization
[cpu_utilization: <float> | default = 0]

# EXPERIMENTAL: Max heap utilization that this instance can reach before
# evicting the heaviest running query (across all tenants) in percentage,
# between 0 and 1. monitored_resources config must include the resource
# type. 0 to disable.
# CLI flag: -querier.query-protection.eviction.threshold.heap-utilization
[heap_utilization: <float> | default = 0]

# EXPERIMENTAL: How frequently the evictor checks system resource
# utilization.
# CLI flag: -querier.query-protection.eviction.check-interval
[check_interval: <duration> | default = 1s]

# EXPERIMENTAL: Number of check intervals to wait after an eviction before
# evicting again.
# CLI flag: -querier.query-protection.eviction.cooldown-period
[cooldown_period: <int> | default = 3]

# EXPERIMENTAL: The query metric used to determine the heaviest query for
# eviction. Supported values: fetched_samples, fetched_series,
# fetched_chunks, fetched_chunk_bytes.
# CLI flag: -querier.query-protection.eviction.eviction-metric
[eviction_metric: <string> | default = "fetched_samples"]

# EXPERIMENTAL: Minimum time a query must be running before it becomes
# eligible for eviction. Queries younger than this are ignored.
# CLI flag: -querier.query-protection.eviction.min-query-age
[min_query_age: <duration> | default = 10s]

# EXPERIMENTAL: Maximum number of queries to evict in a single check cycle
# when resource thresholds are breached.
# CLI flag: -querier.query-protection.eviction.max-evictions-per-cycle
[max_evictions_per_cycle: <int> | default = 1]
```

### `query_frontend_config`
@@ -6801,6 +6885,48 @@ query_protection:
# CLI flag: -store-gateway.query-protection.rejection.threshold.heap-utilization
[heap_utilization: <float> | default = 0]

eviction:
threshold:
# EXPERIMENTAL: Max CPU utilization that this instance can reach before
# evicting the heaviest running query (across all tenants) in percentage,
# between 0 and 1. monitored_resources config must include the resource
# type. 0 to disable.
# CLI flag: -store-gateway.query-protection.eviction.threshold.cpu-utilization
[cpu_utilization: <float> | default = 0]

# EXPERIMENTAL: Max heap utilization that this instance can reach before
# evicting the heaviest running query (across all tenants) in percentage,
# between 0 and 1. monitored_resources config must include the resource
# type. 0 to disable.
# CLI flag: -store-gateway.query-protection.eviction.threshold.heap-utilization
[heap_utilization: <float> | default = 0]

# EXPERIMENTAL: How frequently the evictor checks system resource
# utilization.
# CLI flag: -store-gateway.query-protection.eviction.check-interval
[check_interval: <duration> | default = 1s]

# EXPERIMENTAL: Number of check intervals to wait after an eviction before
# evicting again.
# CLI flag: -store-gateway.query-protection.eviction.cooldown-period
[cooldown_period: <int> | default = 3]

# EXPERIMENTAL: The query metric used to determine the heaviest query for
# eviction. Supported values: fetched_samples, fetched_series,
# fetched_chunks, fetched_chunk_bytes.
# CLI flag: -store-gateway.query-protection.eviction.eviction-metric
[eviction_metric: <string> | default = "fetched_samples"]

# EXPERIMENTAL: Minimum time a query must be running before it becomes
# eligible for eviction. Queries younger than this are ignored.
# CLI flag: -store-gateway.query-protection.eviction.min-query-age
[min_query_age: <duration> | default = 10s]

# EXPERIMENTAL: Maximum number of queries to evict in a single check cycle
# when resource thresholds are breached.
# CLI flag: -store-gateway.query-protection.eviction.max-evictions-per-cycle
[max_evictions_per_cycle: <int> | default = 1]

hedged_request:
# If true, hedged requests are applied to object store calls. It can help with
# reducing tail latency.
7 changes: 7 additions & 0 deletions docs/configuration/v1-guarantees.md
@@ -130,6 +130,13 @@ Currently experimental features are:
- `-validation.max-label-cardinality-for-unoptimized-regex` (int) - maximum label cardinality
- `-validation.max-total-label-value-length-for-unoptimized-regex` (int) - maximum total length of all label values in bytes
- HATracker: `-distributor.ha-tracker.enable-startup-sync` (bool) - If enabled, fetches all tracked keys on startup to populate the local cache.
- Querier: Resource-based query eviction
- `-querier.query-protection.eviction.threshold.cpu-utilization` (float)
- `-querier.query-protection.eviction.threshold.heap-utilization` (float)
- `-querier.query-protection.eviction.check-interval` (duration)
- `-querier.query-protection.eviction.cooldown-period` (int)
- `-querier.query-protection.eviction.eviction-metric` (string)
- `-querier.query-protection.eviction.min-query-age` (duration)
- Ingester: Active Series Tracker
- Per-tenant `active_series_trackers` configuration in runtime config overrides
- Counts active series matching PromQL label matchers and exposes `cortex_ingester_active_series_per_tracker` metric