From c7d81119b4a33c55a3f4e356ca5b2eec75aec8e7 Mon Sep 17 00:00:00 2001
From: SoundMindsAI
Date: Wed, 27 May 2026 20:35:48 -0400
Subject: [PATCH 1/2] docs: reframe positioning around verified Bayesian + Git-PR + three-OSS-engine moat
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Reframes RelyLoop's documented value proposition based on the May 2026
competitive landscape research (see docs/07_research/comparison.md):
OpenSearch SRW is the closest competitor and ships GA for query sets,
judgment lists (LLM-as-judge + UBI-via-COEC), A/B comparison, and
scheduled experiments, but its only optimizer is a 66-cell grid search
over hybrid weights, it has no Git-PR apply path by explicit RFC choice,
and it is OpenSearch-only by architecture. Elastic deprecated Behavioral
Analytics + Search Applications in 9.0 with no SRW equivalent. The Solr
ecosystem (Quepid, RRE, Chorus) is mature for manual evaluation but has
no auto-optimizer.

The defensible bundle is therefore: Bayesian/TPE optimization across the
full query-time search space + Git-PR apply path + all three OSS engines
+ hybrid UBI+LLM judgments + conversational agent that runs the loop +
local-first observability.

Release-matrix reshuffle to land all six differentiators by MVP3:
- MVP1 (shipped): The Loop. ES + OpenSearch + Optuna + Git PR + agent.
- MVP2 (new): Three-Engine + Real Signals. Apache Solr adapter + UBI
  judgments + hybrid UBI+LLM converter (bundled because Solr's
  first-party solr.UBIComponent writes the same UBI schema).
- MVP3 (was MVP2): Observable. Langfuse + SigNoz + audit-log + lineage.
- GA v1: Production-ready. No new product surface; polish + governance.
- Backlog: multi-Git providers, multi-tenancy, multi-LLM provider SDKs,
  LTR training, Path B (monitoring + bandits), Lucidworks Fusion adapter
  (explicitly dropped; see chore_drop_fusion_scope/idea.md).

Spec file renamed docs/00_overview/product/relevance-copilot-spec.md →
docs/00_overview/relyloop-spec.md; all 24 active-doc references updated.

New artifacts:
- docs/07_research/comparison.md (factual+neutral competitive matrix)
- docs/02_product/planned_features/infra_adapter_solr/idea.md (MVP2)
- docs/02_product/planned_features/chore_drop_fusion_scope/idea.md

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 CLAUDE.md | 42 +-
 CONTRIBUTING.md | 2 +-
 README.md | 49 +-
 architecture.md | 14 +-
 docs/00_overview/DASHBOARD.md | 6 +-
 docs/00_overview/MVP1_DASHBOARD.md | 48 +-
 docs/00_overview/MVP2_DASHBOARD.md | 19 +-
 docs/00_overview/README.md | 2 +-
 docs/00_overview/dashboard.html | 10 +-
 docs/00_overview/mvp1_dashboard.html | 104 ++-
 docs/00_overview/mvp2_dashboard.html | 23 +-
 ...vance-copilot-spec.md => relyloop-spec.md} | 627 +++++++----------
 docs/01_architecture/adapters.md | 68 +-
 docs/01_architecture/agent-tools.md | 4 +-
 docs/01_architecture/api-conventions.md | 2 +-
 docs/01_architecture/apply-path.md | 18 +-
 docs/01_architecture/data-model.md | 2 +-
 docs/01_architecture/deployment.md | 16 +-
 docs/01_architecture/llm-orchestration.md | 2 +-
 docs/01_architecture/mvp1-overview.md | 60 +-
 docs/01_architecture/optimization.md | 9 +-
 docs/01_architecture/system-overview.md | 12 +-
 docs/01_architecture/tech-stack.md | 37 +-
 docs/01_architecture/ui-architecture.md | 2 +-
 docs/02_product/mvp1-user-stories.md | 19 +-
 .../chore_drop_fusion_scope/idea.md | 117 ++++
 .../feat_ubi_judgments/idea.md | 38 +-
 .../infra_adapter_solr/idea.md | 97 +++
 docs/07_research/comparison.md | 81 +++
 docs/08_guides/tutorial-first-study.md | 2 +-
 docs/08_guides/workflows-overview.md | 2 +-
 docs/README.md | 2 +-
 state.md | 4 +-
 ui/public/docs/tutorial-first-study.md | 2 +-
 ui/public/docs/workflows-overview.md | 2 +-
 35 files changed, 894 insertions(+), 650 deletions(-)
 rename docs/00_overview/{product/relevance-copilot-spec.md => relyloop-spec.md} (76%)
 create mode 100644
docs/02_product/planned_features/chore_drop_fusion_scope/idea.md create mode 100644 docs/02_product/planned_features/infra_adapter_solr/idea.md create mode 100644 docs/07_research/comparison.md diff --git a/CLAUDE.md b/CLAUDE.md index 2975814f..f3398295 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -12,9 +12,9 @@ Continue execution without constantly asking for permission to execute tests or ## Project Overview -RelyLoop is an open-source tool for enterprise search platform teams. It combines a conversational LLM agent with an automated overnight optimization loop ("Karpathy loop") to systematically tune query-time search relevance on Elasticsearch, OpenSearch, and Lucidworks Fusion (with pure-Solr support deferred to v2). Engineers describe relevance problems in chat; the agent introspects the cluster, proposes search-space parameters, and queues thousands of trials against `ir_measures`-computed metrics. Winning configurations are surfaced as Pull Requests / Merge Requests against a central search-config Git repo, where named approvers review and merge them into production. +RelyLoop is the only open-source tool that runs **automated Bayesian search-space optimization** (Optuna/TPE, thousands of trials) across the **full query-time search space** on every major OSS search engine — Elasticsearch, OpenSearch, and Apache Solr (Solr ships at MVP2) — and ships winning configurations as **Pull Requests** to a central search-config Git repo for the operator's existing approvers to review and merge. A conversational LLM agent describes the loop and proposes the search space, but the engineering moat is the loop itself, the Git-PR posture, and the three-engine reach. See [`docs/07_research/comparison.md`](docs/07_research/comparison.md) for the citation-backed competitive matrix vs OpenSearch SRW, Quepid, RRE, Chorus, and Elastic's native tooling. -The tool is a single, engine-agnostic, provider-agnostic system: one UI, one workflow, one schema. 
Differences between Elasticsearch / OpenSearch, Lucidworks Fusion, and any future engine (pure Solr, Vespa, etc.) are isolated behind a thin adapter interface — and the same adapter pattern applies to LLM providers (OpenAI, Anthropic, Bedrock, Azure OpenAI, Vertex, self-hosted Ollama / vLLM) and Git providers (GitHub, GitLab, Bitbucket). Multi-tenancy is supported from the schema level so a single deployment can serve many downstream customers in isolation (activates at MVP4). +The tool is a single, engine-neutral, provider-neutral system: one UI, one workflow, one schema. Differences between Elasticsearch, OpenSearch, and Apache Solr are isolated behind a thin `SearchAdapter` Protocol; LLM providers behind a `ChatModel` adapter (OpenAI-compatible endpoint today, including Ollama / vLLM / LM Studio / TGI via `OPENAI_BASE_URL`; native non-OpenAI SDKs in the backlog); Git providers behind a `GitProvider` adapter (GitHub today; GitLab + Bitbucket in the backlog). Multi-tenancy is in the backlog — RelyLoop is single-tenant through GA v1, with SSO via reverse proxy as the recommended path. **Personas** (per umbrella spec §6): @@ -32,13 +32,11 @@ The tool is a single, engine-agnostic, provider-agnostic system: one UI, one wor | Release | Theme | Adds | |---|---|---| -| MVP1 / v0.1 | "The Loop" | ES + OpenSearch adapter, OpenAI-compatible LLM, GitHub provider, single-tenant, no auth, Docker Compose, 80% coverage gate | -| MVP1.5 / v0.1.5 | "Real Signals" | OpenSearch UBI judgments as a first-class source — `UbiReader` (engine-agnostic; reads `ubi_queries` + `ubi_events`) + pluggable `SignalsConverter` (position-bias-corrected CTR, dwell-time, hybrid UBI+LLM); judgment lists can mix sources via existing `source` enum; new `POST /api/v1/judgment-lists/generate-from-ubi` + `generate_judgments_from_ubi` agent tool. No schema migration, no new Compose service. Predicated on operator running the OpenSearch UBI plugin. 
| -| MVP2 / v0.2 | "Observable" | Langfuse + ClickHouse + SigNoz; canonical event catalog; `audit_log` table + immutability trigger (no users/tenants yet); lineage columns; PII redaction; trace propagation | -| MVP3 / v0.3 | "Production Stacks" | Lucidworks Fusion adapter; multi-Git-provider abstraction (GitLab, Bitbucket); production install (TLS via Caddy + Let's Encrypt, managed Postgres/Redis); AWS managed OpenSearch | -| MVP4 / v0.4 | "Multi-tenant, Multi-LLM" | `tenants` + `tenant_memberships` + `users` + `api_keys`; `tenant_id` columns + backfill; SSO via reverse proxy; Argon2id-hashed bearer API keys; native non-OpenAI provider SDKs (Anthropic, Bedrock, Vertex) | -| GA v1 | "Production-ready" | LangGraph orchestrator + `PostgresSaver`; full RFC 7807 errors; `Idempotency-Key`; Helm chart; container scanning; image signing; 90% coverage gate | -| v2+ | post-GA | Apache Solr adapter | +| MVP1 / v0.1 (shipped) | "The Loop" | ES + OpenSearch adapter, OpenAI-compatible LLM, GitHub provider, single-tenant, no auth, Docker Compose, 80% coverage gate, Optuna/TPE Bayesian loop, Git-PR apply path, conversational agent | +| MVP2 / v0.2 | "Three-Engine + Real Signals" | Apache Solr adapter (Solr 9.x + 10.x via `edismax` + `{!ltr}` rescore) + UBI judgments (`UbiReader` reads `ubi_queries` + `ubi_events` via any `SearchAdapter`) + pluggable `SignalsConverter` (position-bias-corrected CTR, dwell-time, **hybrid UBI+LLM**) + `POST /api/v1/judgment-lists/generate-from-ubi` + `generate_judgments_from_ubi` agent tool. Solr's first-party `solr.UBIComponent` writes the same UBI schema, so UBI works on all three engines from day one. 
| +| MVP3 / v0.3 | "Observable" | Langfuse + ClickHouse + SigNoz; canonical event catalog; `audit_log` table + immutability trigger (no users/tenants yet); lineage columns; PII redaction; trace propagation across all three engines + both judgment sources | +| GA v1 / v1.0 | "Production-ready" | LangGraph orchestrator + `PostgresSaver`; full RFC 7807 errors; `Idempotency-Key`; full four-layer test pyramid at 90% coverage; complete CI/CD with security gates; container scanning; image signing; design-partner references; public Optuna-vs-SRW-grid benchmark. **No new product surface** — all six differentiators are GA by MVP3; GA v1 is polish + governance + hardening. | +| Backlog | — | Multi-Git provider abstraction (GitLab, Bitbucket); multi-tenancy + multi-LLM provider abstraction (Anthropic, Bedrock, Vertex, Azure OpenAI); LTR training; Path B (production monitoring, bandits, shadow validation); Lucidworks Fusion adapter (explicitly dropped — see [`chore_drop_fusion_scope/idea.md`](docs/02_product/planned_features/chore_drop_fusion_scope/idea.md)) | If a CLAUDE.md statement conflicts with the canonical release matrix, the matrix wins — flag the drift in your PR. 
@@ -59,7 +57,7 @@ After completing a task, evaluate whether documentation needs updating: - `state.md` — update if: the active branch changed, new features were completed, priorities shifted, new debt was introduced, or the Alembic head moved - `architecture.md` — update if: new services/layers were added, new data flows were introduced, design decisions were made, invariants changed, or the topical docs in `docs/01_architecture/` got a new entry -- `CLAUDE.md` — update if: new conventions, rules, environment variables, or build commands were added; or if a release crossed a maturity boundary that activates new rules (e.g., MVP4 turning on the multi-tenant rules below) +- `CLAUDE.md` — update if: new conventions, rules, environment variables, or build commands were added; or if a release crossed a maturity boundary that activates new rules (e.g., multi-tenancy rules below being activated) - `docs/03_runbooks/` — add or update if new ops procedures, deployment steps, or troubleshooting needed ## Repository Structure @@ -77,7 +75,7 @@ backend/ services/ # use-case orchestrators (study lifecycle, judgment generation, digest, PR worker) domain/ # pure business logic — search-space rules, study state machine, query rendering adapters/ # engine adapters (MVP1: ElasticAdapter for ES + OpenSearch) - llm/ # OpenAI-compatible client + capability check + provider abstraction (MVP4 multi-provider) + llm/ # OpenAI-compatible client + capability check + provider abstraction (backlog: native non-OpenAI provider SDKs) git/ # Git provider clients (MVP1: GitHub; MVP3: + GitLab + Bitbucket) workers/ # Arq WorkerSettings + job functions (run_trial, generate_digest, open_pr — arrive with their owning features) tests/ @@ -101,7 +99,7 @@ scripts/ install.sh # auto-generates required + optional secrets, then docker compose up -d check-conventional-commit.sh # commit-msg pre-commit hook docs/ - 00_overview/ # umbrella spec (relevance-copilot-spec.md), implemented_features/_/ + 00_overview/ 
# umbrella spec (relyloop-spec.md), implemented_features/_/ 01_architecture/# topical arch docs: tech-stack, system-overview, data-model, deployment, api-conventions, adapters, llm-orchestration, optimization, ui-architecture, agent-tools, apply-path, mvp1-overview 02_product/ # mvp1-user-stories.md + planned_features// 03_runbooks/ # local-dev.md (and per-feature runbooks as features ship) @@ -117,13 +115,13 @@ docs/ 2. **Secrets via mounted files, never bare env vars.** RelyLoop's Pydantic Settings reads `*_FILE`-suffixed env vars (e.g., `OPENAI_API_KEY_FILE=/run/secrets/openai_key`) and resolves the file content. **Bare env vars (`OPENAI_API_KEY=sk-...`) are NOT supported** — they appear in container `inspect`, logs, and `ps` output, defeating the secrets-management purpose. The `.env` file at repo root is for non-secret Compose overrides only (e.g., `OPENAI_BASE_URL`, `ES_HEAP_SIZE`). Real secrets live in `./secrets/` files mounted as Docker secrets. See [`docs/01_architecture/deployment.md` §"Secrets"](docs/01_architecture/deployment.md) and `infra_foundation` FR-3. -3. **Never call OpenAI directly when the LLM abstraction exists.** MVP1 ships a thin `openai` SDK client pointed at `OPENAI_BASE_URL`; once the multi-provider `BaseChatModel` abstraction lands at MVP4, every LLM call MUST go through it (no `openai.AsyncClient(...)` in services). MVP1 services may use the SDK directly while the abstraction is still scoped — but always read `OPENAI_BASE_URL` and `OPENAI_MODEL` from `Settings`, never hardcode model names. See [`docs/01_architecture/llm-orchestration.md`](docs/01_architecture/llm-orchestration.md). +3. **Never call OpenAI directly when the LLM abstraction exists.** MVP1 ships a thin `openai` SDK client pointed at `OPENAI_BASE_URL`; once the multi-provider `BaseChatModel` abstraction lands (backlog), every LLM call MUST go through it (no `openai.AsyncClient(...)` in services). 
MVP1 services may use the SDK directly while the abstraction is still scoped — but always read `OPENAI_BASE_URL` and `OPENAI_MODEL` from `Settings`, never hardcode model names. See [`docs/01_architecture/llm-orchestration.md`](docs/01_architecture/llm-orchestration.md). 4. **Never bypass the engine adapter Protocol.** Engine-specific code lives ONLY in `backend/app/adapters/.py`. The orchestrator, study runner, evaluator, and UI consume the unified `SearchAdapter` Protocol per [`docs/01_architecture/adapters.md`](docs/01_architecture/adapters.md). No `elasticsearch.AsyncElasticsearch(...)` instances outside the adapter module. This rule activates the moment `infra_adapter_elastic` lands; until then, the adapter Protocol is the spec, not the code. 5. **All Alembic migrations must include `downgrade()` and round-trip cleanly.** Verify with `alembic upgrade head && alembic downgrade -1 && alembic upgrade head` before merging. The MVP1 baseline is `0001_baseline` — the empty migration that registers `alembic_version`; subsequent feature migrations build on it. -6. **`/healthz` is unauthenticated by design.** It's an operator-facing probe, unprefixed (not under `/api/v1/`), and reports subsystem status. Never gate it behind auth. The shape is documented in [`infra_foundation/feature_spec.md`](docs/02_product/planned_features/infra_foundation/feature_spec.md) §7.3 — any change requires a spec patch first. When TLS + auth land at MVP4, `/healthz` stays open via the reverse proxy's localhost or internal-network ACL. +6. **`/healthz` is unauthenticated by design.** It's an operator-facing probe, unprefixed (not under `/api/v1/`), and reports subsystem status. Never gate it behind auth. The shape is documented in [`infra_foundation/feature_spec.md`](docs/02_product/planned_features/infra_foundation/feature_spec.md) §7.3 — any change requires a spec patch first. 
When TLS + auth land (TLS via Caddy is a GA-v1 hardening item; multi-tenant auth is in the backlog), `/healthz` stays open via the reverse proxy's localhost or internal-network ACL. 7. **Conventional Commits format is enforced** (per `infra_foundation` FR-6). Pre-commit `commit-msg` hook validates the message against `^(feat|fix|chore|docs|infra|refactor|test|style|perf|build|ci)(\([a-z0-9-]+\))?(!)?:`. Never bypass with `--no-verify` or `-n`. If a hook fails, fix the message; don't skip. @@ -135,9 +133,9 @@ docs/ 11. **Per-route LLM/network calls inside `/healthz` must respect the 200ms timeout.** The health endpoint orchestrates 5 parallel subsystem probes via `asyncio.wait_for(probe(), timeout=0.2)` so total response stays under 500ms p99. Never add a probe that synchronously waits on a slow upstream — wrap it in the timeout, return `down`/`unreachable` on TimeoutError. The OpenAI capability check (FR-7) does NOT run inside `/healthz` — it runs once at startup as a fire-and-forget task and `/healthz` reads the cached result from Redis. -**Activates at MVP2:** `audit_log` table + Postgres immutability trigger + canonical event catalog. When MVP2 lands, add an Absolute Rule: every state-mutating endpoint or service function must call `create_audit_event()` in the same transaction as the primary mutation (before `db.commit()`); see [`docs/01_architecture/data-model.md`](docs/01_architecture/data-model.md) §"Forthcoming: audit_log". +**Activates at MVP3 (Observable):** `audit_log` table + Postgres immutability trigger + canonical event catalog. When MVP3 lands, add an Absolute Rule: every state-mutating endpoint or service function must call `create_audit_event()` in the same transaction as the primary mutation (before `db.commit()`); see [`docs/01_architecture/data-model.md`](docs/01_architecture/data-model.md) §"Forthcoming: audit_log". -**Activates at MVP4:** Multi-tenancy. 
When MVP4 lands, add an Absolute Rule: every DB write on a tenant-scoped table must include `tenant_id`; admin endpoints bypass tenant scoping but require explicit role check via `require_role({"platform_admin"})`. Until then, RelyLoop is single-tenant — no `tenants` table, no `tenant_id` column, no membership check. +**Activates when multi-tenancy is promoted from backlog:** Multi-tenancy. When multi-tenancy ships, add an Absolute Rule: every DB write on a tenant-scoped table must include `tenant_id`; admin endpoints bypass tenant scoping but require explicit role check via `require_role({"platform_admin"})`. Until then, RelyLoop is single-tenant — no `tenants` table, no `tenant_id` column, no membership check. ## Build, Test, and Lint Commands @@ -187,14 +185,14 @@ make migrate-create name= # alembic revision --autogenerate -m "" ## Environments -MVP1 has one environment: local development on a developer's laptop or in CI. Production-style install lands at MVP3 (TLS via Caddy + Let's Encrypt, no SSO yet); SSO + multi-tenant arrive at MVP4. +MVP1 has one environment: local development on a developer's laptop or in CI. Production-style install (TLS via Caddy + Let's Encrypt, managed Postgres/Redis, AWS managed OpenSearch) lands with GA v1 hardening; SSO + multi-tenant remain in the backlog. 
| Context | `ENVIRONMENT` value | Where it runs | Notes | |---|---|---|---| | Local development | `development` (default) | Developer machine via `make up` | All defaults; no auth; no TLS | | CI (GitHub Actions) | `development` | GitHub Actions runners with service containers | Same toolchain as local; backend tests use a service-container Postgres + ES + OpenSearch | | Staging (MVP3+) | `staging` | TBD operator deployment | TLS on; trusted-network deployment | -| Production (MVP4+) | `production` | TBD operator deployment | TLS + SSO + multi-tenant; arrives with the auth surface | +| Production (post-GA) | `production` | TBD operator deployment | TLS + SSO + multi-tenant; arrives with the auth surface | There is no remote staging in MVP1 — every contributor runs the stack locally. The umbrella spec describes this as "evaluation-only" and the README labels it "alpha." @@ -306,7 +304,7 @@ See [`docs/01_architecture/data-model.md`](docs/01_architecture/data-model.md) f - JSONB for flexible structured fields (settings, params, metrics, payloads). - Soft delete via `deleted_at` on user-facing tables; hard delete on internal append-only tables (e.g., `trials`). - All foreign keys explicit; no implicit relationships. -- Indexes on `(tenant_id, created_at)` for tenant-scoped tables — **MVP4+ only**; MVP1–3 has no `tenant_id` column. +- Indexes on `(tenant_id, created_at)` for tenant-scoped tables — **backlog only**; RelyLoop is single-tenant through GA v1 — no `tenant_id` column. 
## Frontend Conventions @@ -328,7 +326,7 @@ See [`docs/01_architecture/data-model.md`](docs/01_architecture/data-model.md) f - `/proposals` and `/proposals/[id]` — landing in `feat_proposals_ui` - `/chat` — landing in `feat_chat_agent` - `/judgments/[id]` — landing in `feat_llm_judgments` -- No admin routes in MVP1 (admin model arrives at MVP4) +- No admin routes through GA v1 (admin model is in the backlog) ### Common UI Patterns (when UI features land) @@ -356,9 +354,9 @@ When you add a `` option list, filter dropdown, status badge variant, and sort-key literal the frontend sends to the backend must be grounded in a concrete backend source file. (See "Enumerated Value Contract Discipline" above.) - **Do not** edit a file and then `git mv` it in the same commit. `git mv old new` writes the *last-committed blob* of `old` into the index entry for `new` — any prior working-tree edits to `old` end up unstaged at the new path (visible only as the lowercase "M" in `git status`'s `RM` indicator) and `git add && git commit` will silently drop them. **Order:** `git mv` first, then edit at the new path, then `git add `. Verify with `git diff --cached --stat` before commit — every file you intended to edit must show non-zero `+`/`-` counts. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index fb1b2b2c..194b4144 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -141,7 +141,7 @@ RelyLoop's engine, LLM provider, and Git provider adapters are designed for comm - Includes unit tests with `pytest-recording` cassettes - Documents auth flow, version support, and any quirks in `docs/06_vendor_docs/adapters/.md` -See the spec (`docs/00_overview/product/relevance-copilot-spec.md` §8 for engine adapters, §15 for LLM providers, §16 for Git providers) for the full contracts. +See the spec (`docs/00_overview/relyloop-spec.md` §8 for engine adapters, §15 for LLM providers, §16 for Git providers) for the full contracts. 
## Maintainers diff --git a/README.md b/README.md index a2d66c10..cd900680 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,21 @@ # RelyLoop -> **Status: alpha (MVP1, v0.1.0).** Open-source automated relevance tuning for enterprise search platforms. - -RelyLoop combines an LLM-driven chat agent with an Optuna-driven optimization -loop ("Karpathy loop") to systematically tune query-time relevance on -Elasticsearch and OpenSearch. Engineers describe the problem in chat; the -agent introspects the cluster, proposes a search-space, and runs thousands -of trials against `ir_measures`-computed metrics. Winning configurations -land as Pull Requests against a central search-config Git repo, where named -approvers review and merge. +> **Status: alpha (MVP1, v0.1.0).** The only open-source tool that runs automated Bayesian search-space optimization across thousands of trials, on every major open-source search engine (Elasticsearch, OpenSearch, Apache Solr at MVP2), and ships winning configs as Pull Requests for your existing approval workflow. + +A conversational LLM agent describes the problem and proposes the search +space, but the engineering moat is the loop itself, the Git-PR posture, and +the three-engine reach. RelyLoop runs **thousands of Optuna/TPE trials** +across the full query-time search space (field boosts, function scores, +fuzziness, `mm`, tie-breakers, hybrid weights — not just one slice), +evaluates each trial against `ir_measures`-computed metrics, and opens a +**Pull Request** with the winning configuration against your central +search-config Git repo. Your existing approvers and CI handle deployment; +RelyLoop never sits on the live search-serving path. + +See [`docs/07_research/comparison.md`](docs/07_research/comparison.md) for +the citation-backed comparison vs OpenSearch Search Relevance Workbench, +Quepid, RRE, Chorus, and Elastic's native tooling — and why the bundle is +genuinely unique in May 2026. 
## 5-minute quickstart @@ -37,9 +44,11 @@ see Step 0 of the tutorial. ## What's in MVP1 / What's coming MVP1 ships the full Karpathy loop end-to-end on Elasticsearch + OpenSearch: -chat agent, Optuna optimizer, LLM-as-judge, digest, GitHub PR worker, single- -tenant install. Observable / Production Stacks / Multi-tenant land in MVP2 → -MVP3 → MVP4. +chat agent, Optuna/TPE optimizer, LLM-as-judge, digest, GitHub PR worker, +single-tenant install. **MVP2** adds Apache Solr + UBI judgments + hybrid +UBI+LLM (bundled). **MVP3** adds local-first observability (Langfuse + +SigNoz). **GA v1** is polish + governance + hardening — no new product +surface; all six differentiators are in by MVP3. Canonical release matrix: [`docs/01_architecture/tech-stack.md`](docs/01_architecture/tech-stack.md) — @@ -47,20 +56,20 @@ do not duplicate here, the matrix is the source of truth. ## Key design choices -- **Engine-agnostic** — Elasticsearch + OpenSearch in MVP1 via one adapter; Lucidworks Fusion in MVP3; pure Solr in v2. -- **Provider-agnostic** — OpenAI in MVP1; Anthropic, AWS Bedrock, Azure OpenAI, Vertex, Ollama / vLLM in MVP4. -- **Git-as-source-of-truth** — winning configs land as PRs against a central config repo; deployment is the operator's CI's job, not RelyLoop's. -- **Local-first observability** — Langfuse + SigNoz both self-hosted (MVP2+); no LLM trace data leaves the deployment VM. -- **Multi-tenant from MVP4** — single deployment serves many downstream customers in isolation. -- **Agent-first API** — every operation the in-tool orchestrator can perform is also callable by external agents; OpenAPI 3.1, idempotency keys, RFC 7807 errors, outgoing webhooks. -- **Deliberate, not real-time** — RelyLoop is for offline experimentation and change management; it does not sit on the live search-serving path. +- **Engine-neutral across the three OSS engines** — Elasticsearch + OpenSearch in MVP1 via one adapter; Apache Solr in MVP2. 
Lucidworks Fusion explicitly dropped (see [`chore_drop_fusion_scope/idea.md`](docs/02_product/planned_features/chore_drop_fusion_scope/idea.md)). +- **Full-search-space Bayesian/TPE optimization** — Optuna across field boosts, function scores, fuzziness, `mm`, tie-breakers, hybrid weights, LTR rescoring. Not a 66-cell grid over hybrid weights alone (the only thing OpenSearch SRW's optimizer covers today). +- **Git-as-source-of-truth** — winning configs land as PRs against a central config repo; deployment is the operator's CI's job, not RelyLoop's. OpenSearch SRW has no apply path by explicit RFC choice; this is a stable differentiator. +- **Provider-neutral LLM** — OpenAI-compatible endpoint in MVP1 (works against api.openai.com, Ollama, LM Studio, vLLM, HuggingFace TGI via `OPENAI_BASE_URL`). Native non-OpenAI provider SDKs are in the backlog. +- **Local-first observability** — Langfuse + SigNoz both self-hosted (MVP3); no LLM trace data leaves the deployment VM. +- **Single-tenant through GA v1** — multi-tenancy is in the backlog; SSO via reverse proxy is the recommended path for now. +- **Deliberate, not real-time** — RelyLoop is for offline experimentation and change management; it does not sit on the live search-serving path. Online learning / bandits / production-quality monitoring are a v2 Path B direction. See spec §4 (non-goals) for the full set. 
## Links - Tutorial: [`docs/08_guides/tutorial-first-study.md`](docs/08_guides/tutorial-first-study.md) -- Umbrella spec: [`docs/00_overview/product/relevance-copilot-spec.md`](docs/00_overview/product/relevance-copilot-spec.md) +- Umbrella spec: [`docs/00_overview/relyloop-spec.md`](docs/00_overview/relyloop-spec.md) - Architecture index: [`docs/01_architecture/`](docs/01_architecture/) - Local-dev runbook: [`docs/03_runbooks/local-dev.md`](docs/03_runbooks/local-dev.md) - Release checklist (maintainers): [`docs/03_runbooks/release-checklist.md`](docs/03_runbooks/release-checklist.md) diff --git a/architecture.md b/architecture.md index a33bd8e1..9b062d59 100644 --- a/architecture.md +++ b/architecture.md @@ -8,8 +8,9 @@ RelyLoop is an **off-line** relevance-tuning tool for enterprise search platforms. The architecture has four cooperating layers: 1. **Adapter** — a thin Protocol behind which engine differences - (Elasticsearch / OpenSearch / Lucidworks Fusion) and provider differences - (OpenAI / Anthropic / Bedrock / Ollama / Vertex) are isolated. + (Elasticsearch / OpenSearch in MVP1; Apache Solr in MVP2) and LLM + provider differences (OpenAI-compatible endpoints today; Anthropic / + Bedrock / Vertex / Azure OpenAI in the backlog) are isolated. 2. **Domain** — pure Python (no I/O): study state machine, search-space rules, query rendering, evaluator helpers. 3. **Service** — orchestrators (study runner, judgment generation, digest, @@ -29,7 +30,7 @@ tests, and never modifies cluster schema/mapping/analyzer settings. 
| [`mvp1-overview.md`](docs/01_architecture/mvp1-overview.md) | The MVP1 reading guide — start here if you're new | | [`tech-stack.md`](docs/01_architecture/tech-stack.md) | Languages, frameworks, lockfiles, code organization, **canonical release matrix** | | [`system-overview.md`](docs/01_architecture/system-overview.md) | Service inventory, how containers fit together | -| [`deployment.md`](docs/01_architecture/deployment.md) | Compose layout, secrets pattern, MVP1→MVP4 deployment evolution | +| [`deployment.md`](docs/01_architecture/deployment.md) | Compose layout, secrets pattern, MVP1→GA v1 deployment evolution | | [`api-conventions.md`](docs/01_architecture/api-conventions.md) | Endpoint conventions, error envelope, pagination, idempotency | | [`data-model.md`](docs/01_architecture/data-model.md) | Per-table column-level reference; lineage; future audit_log | | [`adapters.md`](docs/01_architecture/adapters.md) | The `SearchAdapter` Protocol shape | @@ -67,9 +68,10 @@ expected to honor them. The full text lives in [`CLAUDE.md`](CLAUDE.md): 1. **Never commit directly to `main`** — feature branches + PRs only. 2. **Secrets via mounted files** (`*_FILE` env vars), never bare env vars. -3. LLM calls go through the `BaseChatModel` abstraction once it lands at - MVP4; until then services may use the `openai` SDK directly but always - read model + base URL from `Settings`. +3. LLM calls go through the `BaseChatModel` abstraction once it lands + (backlog item — native non-OpenAI provider SDKs); until then services + may use the `openai` SDK directly but always read model + base URL + from `Settings`. 4. **Engine-specific code lives only in `backend/app/adapters/.py`** — the orchestrator and study runner consume the unified `SearchAdapter` Protocol. 
diff --git a/docs/00_overview/DASHBOARD.md b/docs/00_overview/DASHBOARD.md
index 3e5f6469..9103694a 100644
--- a/docs/00_overview/DASHBOARD.md
+++ b/docs/00_overview/DASHBOARD.md
@@ -1,13 +1,13 @@
 # RelyLoop — Release Roadmap

-_Top-level index across MVP1 → GA v1+ as of **2026-05-27**. Click a release name to drill into the per-release dashboard. Theme labels sourced from [`docs/01_architecture/tech-stack.md` §"Canonical release matrix"](../01_architecture/tech-stack.md). For the rich local view, open [`dashboard.html`](dashboard.html) in a browser._
+_Top-level index across MVP1 → GA v1+ as of **2026-05-28**. Click a release name to drill into the per-release dashboard. Theme labels sourced from [`docs/01_architecture/tech-stack.md` §"Canonical release matrix"](../01_architecture/tech-stack.md). For the rich local view, open [`dashboard.html`](dashboard.html) in a browser._

 ## Releases

 | Release | Theme | Progress | Status |
 |---|---|---|---|
-| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 88 / 89 scoped done · 12 remaining | **In progress** |
-| [MVP1.5 / v0.1.5](MVP1_5_DASHBOARD.md) | Real Signals | 1 item(s) queued | **Held / queued** |
+| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 88 / 89 scoped done · 14 remaining | **In progress** |
+| MVP1.5 / v0.1.5 | Real Signals | — | **Not yet scoped** |
 | [MVP2 / v0.2](MVP2_DASHBOARD.md) | Observable | 1 / 1 scoped done · 1 remaining | **In progress** |
 | MVP3 / v0.3 | Production Stacks | — | **Not yet scoped** |
 | MVP4 / v0.4 | Multi-tenant, Multi-LLM | — | **Not yet scoped** |
diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md
index 97249189..eeb14aa7 100644
--- a/docs/00_overview/MVP1_DASHBOARD.md
+++ b/docs/00_overview/MVP1_DASHBOARD.md
@@ -2,7 +2,7 @@

 # RelyLoop MVP1 Dashboard

-_Reflects feature-folder state as of **2026-05-27** (latest mtime of any planned/implemented feature `.md` file). Regenerated by `make dashboard` and the `mvp1-dashboard-regen` pre-commit hook. For the rich local view (filter chips, type colors), open [`mvp1_dashboard.html`](mvp1_dashboard.html) in a browser._
+_Reflects feature-folder state as of **2026-05-28** (latest mtime of any planned/implemented feature `.md` file). Regenerated by `make dashboard` and the `mvp1-dashboard-regen` pre-commit hook. For the rich local view (filter chips, type colors), open [`mvp1_dashboard.html`](mvp1_dashboard.html) in a browser._

 ## Next up

@@ -21,19 +21,19 @@ Implementation in progress — resume to finish
 | Metric | Value |
 |---|---|
 | Scoped items done | **88 / 89** (99%) — feat_/infra_/chore_/epic_ past idea stage |
-| Pending work | **13** items (every not-done feat/infra/chore/bug across all priorities) |
+| Pending work | **16** items (every not-done feat/infra/chore/bug across all priorities) |
 | → P0 — do next | **0** unblocking / paying daily cost |
-| → P1 | **1** high-value, ready when P0 clears |
+| → P1 | **4** high-value, ready when P0 clears |
 | → P2 (default) | 10 important to file, not blocking |
 | → Backlog | 2 captured for record, not planned |
 | Open bugs | 5 |
-| Legacy "Path to MVP1" | 12 items — scoped-not-done + bugs + chore-ideas only (excludes feat/infra ideas) |
-| Backlog ideas | 1 idea-only feat/infra (not yet scoped into MVP1) |
+| Legacy "Path to MVP1" | 14 items — scoped-not-done + bugs + chore-ideas only (excludes feat/infra ideas) |
+| Backlog ideas | 2 idea-only feat/infra (not yet scoped into MVP1) |
 | In flight | 1 feature(s) actively shipping |

 ## Pipeline

-### Done (118)
+### Done (119)

 | Feature | Type | One-liner | Depends on | Status |
 |---|---|---|---|---|
@@ -53,7 +53,7 @@ Implementation in progress — resume to finish
 | [feat_github_webhook](implemented_features/2026_05_12_feat_github_webhook/feature_spec.md) | Feature | GitHub posts to `POST /webhooks/github` with HMAC-SHA256 signature; the receiver verifies the signature, looks up the proposal by `pr_url`, updates `pr_state` and `pr_merged_at`. | `infra_foundation` `infra_adapter_elastic` `feat_github_pr_worker` | [PR #56](https://github.com/SoundMindsAI/relyloop/pull/56) merged 2026-05-12 |
 | [feat_home_demo_reseed_endpoint](implemented_features/2026_05_24_feat_home_demo_reseed_endpoint/feature_spec.md) | Feature | A dev-only `POST /api/v1/_test/demo/reseed` endpoint plus a "Reset to demo state" button inside `StartHereChecklist` that lets an operator wipe + re-seed the 4 demo scenarios from the browser. | — | [PR #228](https://github.com/SoundMindsAI/relyloop/pull/228) merged 2026-05-24 |
 | [feat_home_first_run_demo_nudge](implemented_features/2026_05_22_feat_home_first_run_demo_nudge/feature_spec.md) | Feature | An operator landing on a freshly-seeded stack sees an unambiguous banner above the dashboard's empty/populated content that names the present demo clusters, explains they ship with realistic queries + | — | [PR #188](https://github.com/SoundMindsAI/relyloop/pull/188) merged 2026-05-22 |
-| [feat_index_document_browser](../02_product/planned_features/feat_index_document_browser/feature_spec.md) | Feature | A read-only document browser, reachable from two independent entry points (cluster detail + study detail), that lets operators see corpus shape, paginate documents, and inspect any single doc's `_sour | — | [PR #282](https://github.com/SoundMindsAI/relyloop/pull/282) |
+| [feat_index_document_browser](implemented_features/2026_05_27_feat_index_document_browser/feature_spec.md) | Feature | A read-only document browser, reachable from two independent entry points (cluster detail + study detail), that lets operators see corpus shape, paginate documents, and inspect any single doc's `_sour | — | [PR #285](https://github.com/SoundMindsAI/relyloop/pull/285) merged 2026-05-27 |
 | [feat_judgments_periodic_resume_sweep](implemented_features/2026_05_14_feat_judgments_periodic_resume_sweep/feature_spec.md) | Feature | A new Arq cron job `resume_stuck_judgment_lists` ticks every `RELYLOOP_JUDGMENTS_RESUME_SWEEP_MINUTES` minutes (default 15), re-enqueues every `judgment_lists.status='generating'` row via deterministi | — | [PR #104](https://github.com/SoundMindsAI/relyloop/pull/104) merged 2026-05-12 |
 | [feat_llm_judgments](implemented_features/2026_05_11_feat_llm_judgments/feature_spec.md) | Feature | A relevance engineer selects a query set + cluster + target + rubric and the system runs the current template to fetch top-K hits per query, asks OpenAI to rate each (query, doc) on a 0–3 scale with r | `infra_foundation` `infra_adapter_elastic` `feat_study_lifecycle` | [PR #35](https://github.com/SoundMindsAI/relyloop/pull/35) merged 2026-05-11 |
 | [feat_orchestrator_zero_streak_abort](implemented_features/2026_05_22_feat_orchestrator_zero_streak_abort/feature_spec.md) | Feature | Complete (PR #191, merged 2026-05-22 as squash `51ae4b3c`) | — | [PR #191](https://github.com/SoundMindsAI/relyloop/pull/191) merged 2026-05-22 |
@@ -136,6 +136,7 @@ Implementation in progress — resume to finish
 | [bug_dashboard_reset_disclosure_gating_too_strict](implemented_features/2026_05_26_bug_dashboard_reset_disclosure_gating_too_strict/idea.md) | Bug | [`ui/src/components/dashboard/start-here-checklist.tsx:150-160`](../ui/src/components/dashboard/start-here-checklist.tsx#L150-L160): | — | Complete |
 | [bug_datatable_col_vis_density_localstorage_undefined_jsdom](implemented_features/2026_05_26_bug_datatable_col_vis_density_localstorage_undefined_jsdom/idea.md) | Bug | The first integration test in the file (`toggling a column off via the menu removes its cells and persists to localStorage`, line 148) accesses `window.localStorage` successfully. By the time the 3rd– | — | Complete |
 | [bug_demo_clusters_unreachable_in_healthz](implemented_features/2026_05_25_bug_demo_clusters_unreachable_in_healthz/feature_spec.md) | Bug | **After** the warmup task completes (typically within ~5 seconds of API startup, bounded by per-cluster `httpx` probe latency), `/healthz` reports the accurate `healthy` / `unreachable` aggregate for | — | [PR #236](https://github.com/SoundMindsAI/relyloop/pull/236) merged 2026-05-25 |
+| [bug_demo_reseed_fake_metric_regression](implemented_features/2026_05_27_bug_demo_reseed_fake_metric_regression) | Bug | Complete | — | Complete |
 | [bug_digest_param_importance_seam](implemented_features/2026_05_13_bug_digest_param_importance_seam/idea.md) | Bug | The test fixture builds its own `RDBStorage` via `build_storage(...)`, constructs sampler/pruner with `seed=42`, and calls `tell()` against THAT handle. The worker independently calls `build_storage(. | — | Complete |
 | [bug_dockerfile_missing_prompts](implemented_features/2026_05_13_bug_dockerfile_missing_prompts/idea.md) | Bug | The `Dockerfile` at the repo root copies `backend/`, `migrations/`, `alembic.ini`, and `pyproject.toml` into `/app/` but does NOT copy `prompts/`. Any code that loads a file from `prompts/` at module- | — | Complete |
 | [bug_dockerfile_missing_scripts_dir](implemented_features/2026_05_24_bug_dockerfile_missing_scripts_dir/idea.md) | Bug | [`backend/app/services/demo_seeding.py:39`](../../backend/app/services/demo_seeding.py#L39) imports four constants from `scripts/seed_meaningful_demos.py`: | — | Complete |
@@ -170,22 +171,25 @@ _None._

 _None._

-### Idea (12)
+### Idea (15)

 | # | Priority | Feature | Type | One-liner | Depends on | Status |
 |---|---|---|---|---|---|---|
-| 1 | P1 | [infra_smoke_job_chronic_flake](../02_product/planned_features/infra_smoke_job_chronic_flake/idea.md) | Infra | Recent `pr.yml` runs on `main` (newest first): | — | Idea — captured during feat_index_document_browser CI watch (PR #285) |
-| 2 | P2 | [chore_e2e_api_base_url_construction](../02_product/planned_features/chore_e2e_api_base_url_construction/idea.md) | Chore | Five sites in three e2e specs concatenate `API_BASE` with a path string: | — | Idea — surfaced during Gemini Code Assist review on PR #273 (`chore_clone_narrow_bounds_full_roundtrip_e2e`). |
-| 3 | P2 | [chore_state_md_size_compression](../02_product/planned_features/chore_state_md_size_compression/idea.md) | Chore | `state.md` is structured around two concerns conflated into one file: | — | Idea — tangential observation surfaced during `/impl-execute` for `infra_agent_sibling_worktree_isolation` (Phase 1, this PR). |
-| 4 | P2 | [chore_studies_post_arq_spy_fixture](../02_product/planned_features/chore_studies_post_arq_spy_fixture/idea.md) | Chore | The studies POST handler at [`backend/app/api/v1/studies.py:307`](../../backend/app/api/v1/studies.py#L307) calls `await _enqueue_start_study(request, study_id)` after a successful create. The helper | — | Idea — surfaced during `feat_study_preflight_overlap_probe` (PR ___) phase-gate review |
-| 5 | P2 | [chore_template_library_expansion](../02_product/planned_features/chore_template_library_expansion/idea.md) | Chore | Three connected gaps: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. |
-| 6 | P2 | [bug_ceiling_badge_assumes_maximize_direction](../02_product/planned_features/bug_ceiling_badge_assumes_maximize_direction/idea.md) | Bug | The `CEILING` badge in [`studies-table.column-config.tsx:METRIC_CEILING_THRESHOLD`](../ui/src/components/studies/studies-table.column-config.tsx) flags rows where `best_metric >= 0.99`. The threshold | — | — |
-| 7 | P2 | [bug_demo_reseed_fake_metric_regression](../02_product/planned_features/bug_demo_reseed_fake_metric_regression) | Bug | | — | — |
-| 8 | P2 | [bug_smoke_studies_data_table_search_flake](../02_product/planned_features/bug_smoke_studies_data_table_search_flake/idea.md) | Bug | [`ui/tests/e2e/studies-data-table.spec.ts:20-40`](../../ui/tests/e2e/studies-data-table.spec.ts#L20-L40): | — | Idea — surfaced during PR #273 CI watch. |
-| 9 | P2 | [bug_starlette_request_poisons_fastapi_depends_tests](../02_product/planned_features/bug_starlette_request_poisons_fastapi_depends_tests/idea.md) | Bug | There is shared state somewhere in starlette / FastAPI that is mutated by `Request(scope={"type": "http", ...})` and breaks subsequent `Depends` resolution. Possible suspects: | — | Idea — bug captured during feat_index_document_browser Story 2.1 |
-| 10 | P2 | [bug_webhook_concurrent_merge_race_timing_sensitive](../02_product/planned_features/bug_webhook_concurrent_merge_race_timing_sensitive/idea.md) | Bug | Idea — surfaced during `bug_demo_clusters_unreachable_in_healthz` PR #236 CI. | — | Idea — surfaced during `bug_demo_clusters_unreachable_in_healthz` PR #236 CI. |
-| 11 | Backlog | [chore_auto_followup_parent_advisory_lock](../02_product/planned_features/chore_auto_followup_parent_advisory_lock/idea.md) | Chore | The shipped `feat_auto_followup_studies` worker uses a two-layer idempotency scheme: | — | Idea — captured as a standalone file to resolve broken cross-references in `feat_auto_followup_studies` D-11 + plan F2 + `bug_auto_followup_completed_parent_stop_chain_race/idea.md`. The slug was coined 2026-05-24 in D-11 but only existed as descriptive prose across other documents until now. |
-| 12 | Backlog | [chore_e2e_seed_acme_helper_dead](../02_product/planned_features/chore_e2e_seed_acme_helper_dead/idea.md) | Chore | `seedAcmeProductsChain` is a 140-line helper that constructs a cluster + query_set + template + judgment_list + study + optional proposal/digest chain "Acme Products" demo scenario. The function is co | — | Closed (2026-05-25) — superseded by guide-06 spec wiring (commit `2cbcb93b`, 2026-05-22). Real caller: `ui/tests/e2e/guides/06_create_and_monitor_study.spec.ts`. No further action beyond the coverage-audit refresh that ships in the same PR. |
+| 1 | P1 | [feat_ubi_judgments](../02_product/planned_features/feat_ubi_judgments/idea.md) | Feature | MVP1 ships with **LLM-as-judge** as the only authoritative judgment source. The architecture anticipated this would change — the `judgments.source` CHECK already accepts `click`… | — | Idea — bundled with [`infra_adapter_solr`](../infra_adapter_solr/idea.md) into MVP2 / v0.2 "Three-Engine + Real Signals" |
+| 2 | P1 | [infra_smoke_job_chronic_flake](../02_product/planned_features/infra_smoke_job_chronic_flake/idea.md) | Infra | Recent `pr.yml` runs on `main` (newest first): | — | Idea — captured during feat_index_document_browser CI watch (PR #285) |
+| 3 | P1 | [chore_drop_fusion_scope](../02_product/planned_features/chore_drop_fusion_scope/idea.md) | Chore | The prior umbrella spec ([`docs/00_overview/relyloop-spec.md`](relyloop-spec.md)) planned Lucidworks Fusion as the MVP3 engine target and Apache Solr as a v2+ "architectural reference, not v1 scope" a | — | Idea — scope decision, paired with [`infra_adapter_solr`](../infra_adapter_solr/idea.md) |
+| 4 | P1 | [bug_demo_reseed_button_silent_enqueue_failure](../02_product/planned_features/bug_demo_reseed_button_silent_enqueue_failure/idea.md) | Bug | There is at least one untrapped exception path in `backend/workers/demo_reseed.py:run_demo_reseed`'s pre-main-body initialization that: | — | Idea — bug captured during PR #286 first-run testing |
+| 5 | P2 | [chore_demo_seeding_integration_tests_rewrite](../02_product/planned_features/chore_demo_seeding_integration_tests_rewrite/idea.md) | Chore | The async flow's contract: | — | Idea — chore captured during PR #286 |
+| 6 | P2 | [chore_e2e_api_base_url_construction](../02_product/planned_features/chore_e2e_api_base_url_construction/idea.md) | Chore | Five sites in three e2e specs concatenate `API_BASE` with a path string: | — | Idea — surfaced during Gemini Code Assist review on PR #273 (`chore_clone_narrow_bounds_full_roundtrip_e2e`). |
+| 7 | P2 | [chore_state_md_size_compression](../02_product/planned_features/chore_state_md_size_compression/idea.md) | Chore | `state.md` is structured around two concerns conflated into one file: | — | Idea — tangential observation surfaced during `/impl-execute` for `infra_agent_sibling_worktree_isolation` (Phase 1, this PR). |
+| 8 | P2 | [chore_studies_post_arq_spy_fixture](../02_product/planned_features/chore_studies_post_arq_spy_fixture/idea.md) | Chore | The studies POST handler at [`backend/app/api/v1/studies.py:307`](../../backend/app/api/v1/studies.py#L307) calls `await _enqueue_start_study(request, study_id)` after a successful create. The helper | — | Idea — surfaced during `feat_study_preflight_overlap_probe` (PR ___) phase-gate review |
+| 9 | P2 | [chore_template_library_expansion](../02_product/planned_features/chore_template_library_expansion/idea.md) | Chore | Three connected gaps: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. |
+| 10 | P2 | [bug_ceiling_badge_assumes_maximize_direction](../02_product/planned_features/bug_ceiling_badge_assumes_maximize_direction/idea.md) | Bug | The `CEILING` badge in [`studies-table.column-config.tsx:METRIC_CEILING_THRESHOLD`](../ui/src/components/studies/studies-table.column-config.tsx) flags rows where `best_metric >= 0.99`. The threshold | — | — |
+| 11 | P2 | [bug_smoke_studies_data_table_search_flake](../02_product/planned_features/bug_smoke_studies_data_table_search_flake/idea.md) | Bug | [`ui/tests/e2e/studies-data-table.spec.ts:20-40`](../../ui/tests/e2e/studies-data-table.spec.ts#L20-L40): | — | Idea — surfaced during PR #273 CI watch. |
+| 12 | P2 | [bug_starlette_request_poisons_fastapi_depends_tests](../02_product/planned_features/bug_starlette_request_poisons_fastapi_depends_tests/idea.md) | Bug | There is shared state somewhere in starlette / FastAPI that is mutated by `Request(scope={"type": "http", ...})` and breaks subsequent `Depends` resolution. Possible suspects: | — | Idea — bug captured during feat_index_document_browser Story 2.1 |
+| 13 | P2 | [bug_webhook_concurrent_merge_race_timing_sensitive](../02_product/planned_features/bug_webhook_concurrent_merge_race_timing_sensitive/idea.md) | Bug | Idea — surfaced during `bug_demo_clusters_unreachable_in_healthz` PR #236 CI. | — | Idea — surfaced during `bug_demo_clusters_unreachable_in_healthz` PR #236 CI. |
+| 14 | Backlog | [chore_auto_followup_parent_advisory_lock](../02_product/planned_features/chore_auto_followup_parent_advisory_lock/idea.md) | Chore | The shipped `feat_auto_followup_studies` worker uses a two-layer idempotency scheme: | — | Idea — captured as a standalone file to resolve broken cross-references in `feat_auto_followup_studies` D-11 + plan F2 + `bug_auto_followup_completed_parent_stop_chain_race/idea.md`. The slug was coined 2026-05-24 in D-11 but only existed as descriptive prose across other documents until now. |
+| 15 | Backlog | [chore_e2e_seed_acme_helper_dead](../02_product/planned_features/chore_e2e_seed_acme_helper_dead/idea.md) | Chore | `seedAcmeProductsChain` is a 140-line helper that constructs a cluster + query_set + template + judgment_list + study + optional proposal/digest chain "Acme Products" demo scenario. The function is co | — | Closed (2026-05-25) — superseded by guide-06 spec wiring (commit `2cbcb93b`, 2026-05-22). Real caller: `ui/tests/e2e/guides/06_create_and_monitor_study.spec.ts`. No further action beyond the coverage-audit refresh that ships in the same PR. |

 ## Dependency graph

@@ -198,8 +202,6 @@ graph LR
     classDef plan fill:#fef9c3,stroke:#854d0e,color:#854d0e;
     classDef spec fill:#dbeafe,stroke:#1e40af,color:#1e40af;
     classDef idea fill:#f1f5f9,stroke:#334155,color:#334155;
-    feat_index_document_browser["index document browser"]
-    class feat_index_document_browser done;
     infra_agent_sibling_worktree_isolation["agent sibling worktree isolation"]
     class infra_agent_sibling_worktree_isolation implement;
     infra_foundation["foundation"]
@@ -376,6 +378,8 @@ graph LR
     class infra_dockerfile_invariant_smoke_in_ci done;
     infra_test_worktree_missing_integration_envs["test worktree missing integration envs"]
     class infra_test_worktree_missing_integration_envs done;
+    feat_index_document_browser["index document browser"]
+    class feat_index_document_browser done;
     feat_study_lifecycle --> feat_digest_proposal
     feat_llm_judgments --> feat_digest_proposal
     infra_foundation --> feat_llm_judgments
diff --git a/docs/00_overview/MVP2_DASHBOARD.md b/docs/00_overview/MVP2_DASHBOARD.md
index bef843ad..8d0eb00d 100644
--- a/docs/00_overview/MVP2_DASHBOARD.md
+++ b/docs/00_overview/MVP2_DASHBOARD.md
@@ -2,7 +2,7 @@

 # RelyLoop MVP2 Dashboard

-_Reflects feature-folder state as of **2026-05-27** (latest mtime of any planned/implemented feature `.md` file). Regenerated by `make dashboard` and the `mvp1-dashboard-regen` pre-commit hook. For the rich local view (filter chips, type colors), open [`mvp2_dashboard.html`](mvp2_dashboard.html) in a browser._
+_Reflects feature-folder state as of **2026-05-28** (latest mtime of any planned/implemented feature `.md` file). Regenerated by `make dashboard` and the `mvp1-dashboard-regen` pre-commit hook. For the rich local view (filter chips, type colors), open [`mvp2_dashboard.html`](mvp2_dashboard.html) in a browser._

 ## Next up

@@ -15,14 +15,14 @@ Pull from the Idea backlog or capture a new feature spec.
 | Metric | Value |
 |---|---|
 | Scoped items done | **1 / 1** (100%) — feat_/infra_/chore_/epic_ past idea stage |
-| Pending work | **4** items (every not-done feat/infra/chore/bug across all priorities) |
+| Pending work | **5** items (every not-done feat/infra/chore/bug across all priorities) |
 | → P0 — do next | **0** unblocking / paying daily cost |
-| → P1 | **0** high-value, ready when P0 clears |
+| → P1 | **1** high-value, ready when P0 clears |
 | → P2 (default) | 1 important to file, not blocking |
 | → Backlog | 3 captured for record, not planned |
 | Open bugs | 1 |
 | Legacy "Path to MVP2" | 1 items — scoped-not-done + bugs + chore-ideas only (excludes feat/infra ideas) |
-| Backlog ideas | 3 idea-only feat/infra (not yet scoped into MVP2) |
+| Backlog ideas | 4 idea-only feat/infra (not yet scoped into MVP2) |
 | In flight | 0 feature(s) actively shipping |

 ## Pipeline
@@ -45,14 +45,15 @@ _None._

 _None._

-### Idea (4)
+### Idea (5)

 | # | Priority | Feature | Type | One-liner | Depends on | Status |
 |---|---|---|---|---|---|---|
-| 1 | P2 | [feat_chat_last_message_preview](../02_product/planned_features/feat_chat_last_message_preview/idea.md) | Feature | The `/chat` list page ([ui/src/app/chat/page.tsx](../../ui/src/app/chat/page.tsx)) renders each conversation row as `title + relative timestamp + "{N} messages"` via… | — | Held for MVP2 (decided 2026-05-13). No technical dependency on MVP2 infra; bundling with [`bug_chat_long_conversation_truncation_mvp2`](../bug_chat_long_conversation_truncation_mvp2/idea.md) as chat polish. `feat_chat_agent` has been live since 2026-05-12 (PR #60) and no operator has asked for the preview yet. Folder renamed from `chore_chat_last_message_preview` 2026-05-14 per `/idea-preflight` audit — `chore_` is reserved for changes with no user-visible behavior per [feature_templates/README.md](../feature_templates/README.md). |
-| 2 | Backlog | [feat_fts_rank_ordering_mvp2](../02_product/planned_features/feat_fts_rank_ordering_mvp2/idea.md) | Feature | `feat_data_table_primitive` shipped filter-only FTS — `?q=foo` matches rows where `search_vector @@ plainto_tsquery('english', 'foo')` is true but orders results by `created_at DESC, id DESC` (the def | — | Idea — deferred from `feat_data_table_primitive` (MVP1) per spec §16. |
-| 3 | Backlog | [infra_arq_subprocess_test_mvp2](../02_product/planned_features/infra_arq_subprocess_test_mvp2/idea.md) | Infra | Idea (deferred from `feat_study_lifecycle` Phase 2 / PR #25 final GPT-5.5 review). Still applicable as of 2026-05-14: the three in-process tests cited below still cover the resume contract correctly; | — | Idea (deferred from `feat_study_lifecycle` Phase 2 / PR #25 final GPT-5.5 review). Still applicable as of 2026-05-14: the three in-process tests cited below still cover the resume contract correctly; a subprocess test would add a narrow Arq-version-regression guard. |
-| 4 | Backlog | [bug_chat_long_conversation_truncation_mvp2](../02_product/planned_features/bug_chat_long_conversation_truncation_mvp2/idea.md) | Bug | [`backend/app/services/agent_chat.send_user_message`](../../backend/app/services/agent_chat.py) defensively caps the OpenAI history at the most recent `HISTORY_MAX_MESSAGES = 100` messages… | — | Held for MVP2 (decided 2026-05-13). Folder renamed with `_mvp2` suffix to make the deferral visible at-a-glance in `ls docs/02_product/planned_features/`. Resume work when MVP2 starts — no technical dependency on MVP2 infra (audit_log is N/A; Langfuse is convenience only); the deferral is scope discipline + zero current impact (latent bug, no operator has hit the 100-message cap). |
+| 1 | P1 | [infra_adapter_solr](../02_product/planned_features/infra_adapter_solr/idea.md) | Infra | After MVP1.5, RelyLoop runs against Elasticsearch and OpenSearch — but the "engine-neutral" positioning is aspirational until a third engine ships. Apache Solr is the right third engine because: | — | Idea — anchor feature for MVP2 / v0.2 "Three-Engine + Real Signals" (bundled with [`feat_ubi_judgments`](../feat_ubi_judgments/idea.md)) |
+| 2 | P2 | [feat_chat_last_message_preview](../02_product/planned_features/feat_chat_last_message_preview/idea.md) | Feature | The `/chat` list page ([ui/src/app/chat/page.tsx](../../ui/src/app/chat/page.tsx)) renders each conversation row as `title + relative timestamp + "{N} messages"` via… | — | Held for MVP2 (decided 2026-05-13). No technical dependency on MVP2 infra; bundling with [`bug_chat_long_conversation_truncation_mvp2`](../bug_chat_long_conversation_truncation_mvp2/idea.md) as chat polish. `feat_chat_agent` has been live since 2026-05-12 (PR #60) and no operator has asked for the preview yet. Folder renamed from `chore_chat_last_message_preview` 2026-05-14 per `/idea-preflight` audit — `chore_` is reserved for changes with no user-visible behavior per [feature_templates/README.md](../feature_templates/README.md). |
+| 3 | Backlog | [feat_fts_rank_ordering_mvp2](../02_product/planned_features/feat_fts_rank_ordering_mvp2/idea.md) | Feature | `feat_data_table_primitive` shipped filter-only FTS — `?q=foo` matches rows where `search_vector @@ plainto_tsquery('english', 'foo')` is true but orders results by `created_at DESC, id DESC` (the def | — | Idea — deferred from `feat_data_table_primitive` (MVP1) per spec §16. |
+| 4 | Backlog | [infra_arq_subprocess_test_mvp2](../02_product/planned_features/infra_arq_subprocess_test_mvp2/idea.md) | Infra | Idea (deferred from `feat_study_lifecycle` Phase 2 / PR #25 final GPT-5.5 review). Still applicable as of 2026-05-14: the three in-process tests cited below still cover the resume contract correctly; | — | Idea (deferred from `feat_study_lifecycle` Phase 2 / PR #25 final GPT-5.5 review). Still applicable as of 2026-05-14: the three in-process tests cited below still cover the resume contract correctly; a subprocess test would add a narrow Arq-version-regression guard. |
+| 5 | Backlog | [bug_chat_long_conversation_truncation_mvp2](../02_product/planned_features/bug_chat_long_conversation_truncation_mvp2/idea.md) | Bug | [`backend/app/services/agent_chat.send_user_message`](../../backend/app/services/agent_chat.py) defensively caps the OpenAI history at the most recent `HISTORY_MAX_MESSAGES = 100` messages… | — | Held for MVP2 (decided 2026-05-13). Folder renamed with `_mvp2` suffix to make the deferral visible at-a-glance in `ls docs/02_product/planned_features/`. Resume work when MVP2 starts — no technical dependency on MVP2 infra (audit_log is N/A; Langfuse is convenience only); the deferral is scope discipline + zero current impact (latent bug, no operator has hit the 100-message cap). |

 ## Dependency graph

diff --git a/docs/00_overview/README.md b/docs/00_overview/README.md
index 9ddd9efe..e55a7d02 100644
--- a/docs/00_overview/README.md
+++ b/docs/00_overview/README.md
@@ -5,6 +5,6 @@ Use this section for high-level project context, repository orientation, and umb
 Current contents:

 - `product/` — full product spec
-  - `product/relevance-copilot-spec.md` — full product and system specification
+  - `relyloop-spec.md` — full product and system specification

 For MVP1 decomposition (user stories + per-feature spec folders), see [`docs/02_product/`](../02_product/).
diff --git a/docs/00_overview/dashboard.html b/docs/00_overview/dashboard.html
index fff82db8..ba8bdfcd 100644
--- a/docs/00_overview/dashboard.html
+++ b/docs/00_overview/dashboard.html
@@ -368,7 +368,7 @@
 RelyLoop — Release Roadmap
- Top-level index across MVP1 → GA v1+ as of 2026-05-27. Click a release name to
+ Top-level index across MVP1 → GA v1+ as of 2026-05-28. Click a release name to
  drill into the per-release dashboard. Theme labels sourced from
  tech-stack.md §"Canonical release matrix". See state.md for
@@ -384,16 +384,16 @@ Releases
 The Loop
-88 / 89 scoped done · 12 remaining
+88 / 89 scoped done · 14 remaining
 In progress
 MVP1.5 / v0.1.5
 Real Signals
-1 item(s) queued
-Held / queued
+Not yet scoped
diff --git a/docs/00_overview/mvp1_dashboard.html b/docs/00_overview/mvp1_dashboard.html
index 9c79b71c..cf5bfe4a 100644
--- a/docs/00_overview/mvp1_dashboard.html
+++ b/docs/00_overview/mvp1_dashboard.html
@@ -369,7 +369,7 @@ RelyLoop MVP1 Dashboard
- Reflects feature-folder state as of 2026-05-27 (latest mtime of any
+ Reflects feature-folder state as of 2026-05-28 (latest mtime of any
  docs/02_product/planned_features/ or docs/00_overview/implemented_features/ file).
  See state.md for the active branch context,
@@ -403,7 +403,7 @@ MVP1 Progress
 Pending work
-13
+16
 every not-done feat/infra/chore/bug across all priorities
@@ -420,7 +420,7 @@ MVP1 Progress
 P1
-1
+4
 high-value, ready when P0 clears
@@ -435,14 +435,14 @@ MVP1 Progress
 Legacy "Path to MVP1"
-12
+14
 scoped not-done + bugs + chore-ideas only (excludes feat/infra ideas)
 Backlog ideas:
- 1 idea-only feat/infra folders (not yet scoped into MVP1)
+ 2 idea-only feat/infra folders (not yet scoped into MVP1)
 In flight:
@@ -463,7 +463,20 @@ Pipeline
-Idea 12
+Idea 15
+Feature
+P1
+MVP1 ships with **LLM-as-judge** as the only authoritative judgment source. The architecture anticipated this would change — the `judgments.source` CHECK already accepts `click`…
@@ -478,6 +491,45 @@ Idea 12
+Chore
+P1
+The prior umbrella spec ([`docs/00_overview/relyloop-spec.md`](relyloop-spec.md)) planned Lucidworks Fusion as the MVP3 engine target and Apache Solr as a v2+ "architectural reference, not v1 scope" a
+Bug
+P1
+There is at least one untrapped exception path in `backend/workers/demo_reseed.py:run_demo_reseed`'s pre-main-body initialization that:
+Chore
+P2
+The async flow's contract:
@@ -543,19 +595,6 @@ Idea 12
-Bug
-P2
@@ -650,7 +689,7 @@ Implementing 1
-Done 118
+Done 119
@@ -861,11 +900,11 @@ Done 118
 Feature
-PR #282
+PR #285 merged 2026-05-27
 A read-only document browser, reachable from two independent entry points (cluster detail + study detail), that lets operators see corpus shape, paginate documents, and inspect any single doc's `_sour
@@ -1939,6 +1978,19 @@ Done 118
+Bug
+merged 2026-05-27
+Complete
@@ -2198,8 +2250,6 @@ Dependency graph (feat_ + infra_)
     classDef plan fill:#fef9c3,stroke:#854d0e,color:#854d0e;
     classDef spec fill:#dbeafe,stroke:#1e40af,color:#1e40af;
     classDef idea fill:#f1f5f9,stroke:#334155,color:#334155;
-    feat_index_document_browser["index document browser"]
-    class feat_index_document_browser done;
     infra_agent_sibling_worktree_isolation["agent sibling worktree isolation"]
     class infra_agent_sibling_worktree_isolation implement;
     infra_foundation["foundation"]
@@ -2376,6 +2426,8 @@ Dependency graph (feat_ + infra_)
     class infra_dockerfile_invariant_smoke_in_ci done;
     infra_test_worktree_missing_integration_envs["test worktree missing integration envs"]
     class infra_test_worktree_missing_integration_envs done;
+    feat_index_document_browser["index document browser"]
+    class feat_index_document_browser done;
     feat_study_lifecycle --> feat_digest_proposal
     feat_llm_judgments --> feat_digest_proposal
     infra_foundation --> feat_llm_judgments
@@ -2429,8 +2481,6 @@ Dependency graph (feat_ + infra_)
     classDef plan fill:#fef9c3,stroke:#854d0e,color:#854d0e;
     classDef spec fill:#dbeafe,stroke:#1e40af,color:#1e40af;
     classDef idea fill:#f1f5f9,stroke:#334155,color:#334155;
-    feat_index_document_browser["index document browser"]
-    class feat_index_document_browser done;
     infra_agent_sibling_worktree_isolation["agent sibling worktree isolation"]
     class infra_agent_sibling_worktree_isolation implement;
     infra_foundation["foundation"]
@@ -2607,6 +2657,8 @@ Dependency graph (feat_ + infra_)
     class infra_dockerfile_invariant_smoke_in_ci done;
     infra_test_worktree_missing_integration_envs["test worktree missing integration envs"]
     class infra_test_worktree_missing_integration_envs done;
+    feat_index_document_browser["index document browser"]
+    class feat_index_document_browser done;
     feat_study_lifecycle --> feat_digest_proposal
     feat_llm_judgments --> feat_digest_proposal
     infra_foundation --> feat_llm_judgments
diff --git a/docs/00_overview/mvp2_dashboard.html b/docs/00_overview/mvp2_dashboard.html
index d016746e..43e430a1 100644
--- a/docs/00_overview/mvp2_dashboard.html
+++ b/docs/00_overview/mvp2_dashboard.html
@@ -369,7 +369,7 @@ RelyLoop MVP2 Dashboard
- Reflects feature-folder state as of 2026-05-27 (latest mtime of any
+ Reflects feature-folder state as of 2026-05-28 (latest mtime of any
  docs/02_product/planned_features/ or docs/00_overview/implemented_features/ file).
  See state.md for the active branch context,
@@ -403,7 +403,7 @@ MVP2 Progress
 Pending work
-4
+5
 every not-done feat/infra/chore/bug across all priorities
@@ -420,7 +420,7 @@ MVP2 Progress
 P1
-0
+1
 high-value, ready when P0 clears
@@ -442,7 +442,7 @@ MVP2 Progress
 Backlog ideas:
- 3 idea-only feat/infra folders (not yet scoped into MVP2)
+ 4 idea-only feat/infra folders (not yet scoped into MVP2)
 In flight:
@@ -463,7 +463,20 @@ Pipeline
-Idea 4
+Idea 5
+Infra
+P1
+After MVP1.5, RelyLoop runs against Elasticsearch and OpenSearch — but the "engine-neutral" positioning is aspirational until a third engine ships. Apache Solr is the right third engine because:
diff --git a/docs/00_overview/product/relevance-copilot-spec.md b/docs/00_overview/relyloop-spec.md similarity index 76% rename from docs/00_overview/product/relevance-copilot-spec.md rename to docs/00_overview/relyloop-spec.md index 1e3755bc..08408b65 100644 --- a/docs/00_overview/product/relevance-copilot-spec.md +++ b/docs/00_overview/relyloop-spec.md @@ -9,71 +9,88 @@ ## 1. Summary -RelyLoop is an open-source tool for enterprise search platform teams. It combines a conversational LLM agent with an automated overnight optimization loop ("Karpathy loop") to systematically tune query-time search relevance on Elasticsearch, OpenSearch, and Lucidworks Fusion (with pure-Solr support deferred to v2). Engineers describe relevance problems in chat; the agent introspects the cluster, proposes search-space parameters, and queues thousands of trials against `ir_measures`-computed metrics. Winning configurations are surfaced as Pull Requests / Merge Requests against a central search-config Git repo, where named approvers review and merge them into production. +RelyLoop is the only open-source tool that runs **automated Bayesian search-space optimization across thousands of trials** on every major open-source search engine (Elasticsearch, OpenSearch, Apache Solr) and ships winning configurations as **Pull Requests** to a central search-config Git repo, where named approvers review and merge them into production. A conversational LLM agent describes the loop, proposes the search space, and dispatches the trials — but the engineering moat is the loop itself, not the agent. -The tool is a single, engine-agnostic, provider-agnostic system: one UI, one workflow, one schema. Differences between Elasticsearch / OpenSearch, Lucidworks Fusion, and any future engine (pure Solr, Vespa, etc.) 
are isolated behind a thin adapter interface — and the same adapter pattern applies to LLM providers (OpenAI, Anthropic, Bedrock, Azure OpenAI, Vertex, self-hosted Ollama/vLLM) and Git providers (GitHub, GitLab, Bitbucket). Multi-tenancy is supported from the schema level so a single deployment can serve many downstream customers in isolation. +This combination — *Bayesian/TPE optimization across the full query-time search space, on every major OSS engine, with a Git-PR apply path* — is genuinely unique in May 2026: -**Delivery is incremental across six releases**, each meaningful as a discrete capability bundle: +- **OpenSearch's Search Relevance Workbench** ships GA query sets, judgment lists (LLM-as-judge + UBI-via-COEC), and search-config A/B comparison, but its only optimizer is a **66-cell grid search** restricted to hybrid-search weights ([SRW docs](https://docs.opensearch.org/latest/search-plugins/search-relevance/optimize-hybrid-search/)). Bayesian optimization is in RFC #934 with no shipped code. SRW also has **no apply path by explicit RFC design** ([RFC #17735](https://github.com/opensearch-project/OpenSearch/issues/17735)): "focuses on evaluation and analysis, not production deployment mechanisms." SRW is OpenSearch-only by architecture. +- **Elasticsearch** deprecated its higher-level tuning tools (Behavioral Analytics + Search Applications) in 9.0 and ships only the raw `_rank_eval` API primitive. There is no native ES equivalent to SRW or RelyLoop. +- **The Solr ecosystem** (Quepid + Chorus + RRE) is mature for manual evaluation and judgment-list management, but has no auto-optimizer and no Git-PR apply path. -- **MVP1 / v0.1 (5 weeks) — "The Loop":** Karpathy loop end-to-end on a laptop. ES + OpenSearch, OpenAI, GitHub, single-tenant, basic logging. Demonstrates the value prop. -- **MVP1.5 / v0.1.5 (+2 weeks) — "Real Signals":** OpenSearch UBI as a first-class judgment source. 
`UbiReader` (engine-agnostic) + pluggable `SignalsConverter` Protocol + hybrid UBI+LLM mode. Earns the evaluation of operators with real traffic who distrust LLM-as-judge as the primary trust anchor. -- **MVP2 / v0.2 (+3 weeks) — "Observable":** Langfuse + SigNoz + event catalog + audit immutability + lineage columns + PII redaction. Trustworthy enough for serious evaluation. -- **MVP3 / v0.3 (+3 weeks) — "Production Stacks":** Lucidworks Fusion adapter (and its native signals reader feeding the MVP1.5 Protocol) + multi-Git provider abstraction (GitLab, Bitbucket) + adapter contract tests. Works against real enterprise stacks. -- **MVP4 / v0.4 (+3 weeks) — "Multi-tenant, Multi-LLM":** Tenants + tenant-scoped API keys + multi-LLM provider abstraction (Anthropic, Bedrock, Azure OpenAI, Vertex, Ollama/vLLM). Platform-team scale. -- **GA v1 / v1.0 (+3 weeks) — "Production-ready":** LangGraph orchestrator + full agent-first API surface + four-layer test pyramid + full GitHub Actions CI/CD with security gates + complete OSS governance. +See [`docs/07_research/comparison.md`](../../07_research/comparison.md) for the full citation-backed comparison matrix. -Total: ~19 weeks single-engineer, 12–14 weeks with two. Each release ships a coherent step-up in adopter value and audience reach. +The system is **engine-neutral, provider-neutral, and Git-provider-neutral by design**: differences between Elasticsearch, OpenSearch, and Solr are isolated behind a thin `SearchAdapter` Protocol; LLM providers (OpenAI today, others post-GA) are isolated behind a `ChatModel` adapter; Git providers (GitHub today, others in the backlog) are isolated behind a `GitProvider` adapter. One UI, one workflow, one schema, regardless of what the operator already runs. + +**Delivery is incremental across three pre-GA releases plus a polish-and-governance GA:** + +- **MVP1 / v0.1 (shipped) — "The Loop":** Karpathy loop end-to-end on a laptop. 
Elasticsearch + OpenSearch, OpenAI-compatible LLM, GitHub, single-tenant, basic logging. Demonstrates the moat: Bayesian/TPE optimization + Git-PR apply + conversational agent that runs the loop. +- **MVP2 / v0.2 — "Three-Engine + Real Signals":** Apache Solr adapter + UBI judgments + hybrid UBI+LLM converter, bundled. After MVP2, RelyLoop runs on all three OSS engines with UBI on every one of them. The engine-neutral claim becomes verifiable, not rhetorical. +- **MVP3 / v0.3 — "Observable":** Langfuse + SigNoz + event catalog + audit immutability + lineage columns + PII redaction + trace context propagation. Trustworthy enough for unattended overnight runs by a serious platform team. +- **GA v1 / v1.0 — "Production-ready":** Polish, governance, hardening — LangGraph orchestrator, full four-layer test pyramid at 90% coverage, full CI/CD with security gates, complete OSS launch infrastructure (docs, ADRs, signed container images, contributor onboarding, design-partner references). No new product surface beyond MVP3; the lift is from "working" to "production-ready, contributor-ready, fully governed." + +All six of RelyLoop's differentiators (Bayesian/TPE optimization, Git-PR apply path, conversational agent that runs the loop, all three OSS engines, hybrid UBI+LLM judgments, local-first observability) are live by MVP3. GA v1 adds no new product surface. + +**Backlog** (captured, not in flight): multi-Git-provider abstraction (GitLab, Bitbucket); multi-tenancy primitives + multi-LLM provider abstraction (Anthropic, Bedrock, Azure OpenAI, Vertex, Ollama/vLLM); Path B production-monitoring + bandit-style online learning; Lucidworks Fusion adapter (explicitly dropped, see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md)). The HTTP API is designed as a first-class product, not just the back end of the UI. 
Every operation a human or the in-tool orchestrator can perform is also callable by an external agent over plain REST, with bearer-token auth, OpenAPI 3.1 publication, idempotency keys, outgoing webhooks, SSE event streams, and machine-readable capability discovery. See §21 *Agent integration*. -The orchestrator itself is built on **LangGraph** with Postgres-backed state persistence; LLM observability uses **self-hosted Langfuse**, distributed observability uses **self-hosted SigNoz**. Nothing about LLM behavior or system telemetry leaves the deployment VM. See §15 *LLM orchestration & observability*. +The orchestrator is built on **LangGraph** with Postgres-backed state persistence; LLM observability uses **self-hosted Langfuse**, distributed observability uses **self-hosted SigNoz**. Nothing about LLM behavior or system telemetry leaves the deployment VM. See §15 *LLM orchestration & observability*. -Engineering quality is gated by a four-layer test pyramid (unit ≥90% coverage, contract, integration, end-to-end) and GitHub Actions CI/CD. See §23 *Non-functional requirements*. +Engineering quality is gated by a four-layer test pyramid (unit ≥90% coverage at GA, contract, integration, end-to-end) and GitHub Actions CI/CD. See §23 *Non-functional requirements*. Released under **Apache License 2.0**. Initial maintainer: soundminds.ai, with an explicit transition path to community maintainership over 12–24 months. See §28 *OSS positioning & governance*. ## 2. Context & motivation -Search relevance tuning at our organization is currently manual, ad-hoc, and engineer-time-bound. A relevance engineer hypothesizes a change, edits a query template, eyeballs a few queries, and either ships it or doesn't. Two things are missing: +Search relevance tuning at most organizations is manual, ad-hoc, and engineer-time-bound. A relevance engineer hypothesizes a change, edits a query template, eyeballs a few queries, and either ships it or doesn't. Two things are missing: + +1. 
**Systematic exploration.** The space of tunable parameters (field weights, boosts, tie-breakers, fuzziness, slop, function-score parameters, hybrid-search alphas) is too large to explore manually. Teams routinely ship the first plausible win rather than the best win. Off-the-shelf workbenches (Quepid, RRE, Chorus) make manual A/B comparison easier but don't drive automated overnight sweeps. +2. **Quantified evaluation.** Without a standing query set and judgment list, teams can't tell whether a change generalizes or just happens to fix the three queries the engineer noticed. + +The current OSS landscape (May 2026): -1. **Systematic exploration.** The space of tunable parameters (field weights, boosts, tie-breakers, fuzziness, slop, function-score parameters, hybrid-search alphas) is too large to explore manually. We routinely ship the first plausible win rather than the best win. -2. **Quantified evaluation.** Without a standing query set and judgment list, we can't tell whether a change generalizes or just happens to fix the three queries the engineer noticed. +- **OpenSearch Search Relevance Workbench** covers a substantial slice of the workflow GA today — query sets, judgment lists (LLM-as-judge and UBI-derived via COEC), search-config A/B comparison, multi-cluster, scheduled experiments. But its only optimizer is a 66-cell grid search restricted to hybrid-search weights; the full-search-space Bayesian optimization that would close the "systematic exploration" gap is in RFC #934 with no shipped code. SRW also has no apply path by explicit RFC design, and is architecturally OpenSearch-only. +- **OpenSearch Relevance Agent** (experimental in 3.6) is a conversational DSL recommender. It suggests edits; it does not run multi-thousand-trial sweeps and does not write to Git. +- **Elasticsearch** deprecated Behavioral Analytics + Search Applications in 9.0 and offers only the `_rank_eval` API primitive. 
The implicit Elastic message is "DIY through query DSL + retrievers." +- **The Solr ecosystem** (Quepid + Chorus + RRE) is mature for manual evaluation but has no auto-optimizer. -Off-the-shelf tools (Quepid, RRE, Chorus) cover the manual workbench problem well but don't drive automated overnight studies, and don't have an LLM in the loop to design the search space. The OpenSearch Relevance Agent does the LLM-and-conversation part but is OpenSearch-only and lacks the autonomous-optimization loop. This tool combines both. +RelyLoop is the tool that **fills the gap none of the above closes**: automated Bayesian/TPE optimization across the full search space, on every major OSS engine, with a Git-PR apply path. The conversational agent is the front door that makes the loop accessible; the Bayesian loop and the Git-PR posture are the actual engineering moat. See [`docs/07_research/comparison.md`](../../07_research/comparison.md) for the full citation-backed comparison. ## 3. Goals -The tool must enable the relevance team to: +The tool must enable a relevance engineering team to: - Define a query set and a calibrated judgment list once, reuse them across studies - Conversationally describe a relevance problem and have an agent propose what to tune -- Run automated, parallelized, overnight studies of thousands of trials per query set +- Run automated, parallelized, overnight studies of thousands of trials per query set against the full query-time search space (field weights, function scores, fuzziness, slop, `mm`, tie-breakers, hybrid weights — not just one slice) - Produce a parameter-importance analysis and an LLM-written digest by morning -- Open a Git PR against a central search-config repo with the winning configuration +- Open a Git PR against a central search-config repo with the winning configuration, where the operator's existing approvers and CI handle deployment - Track which proposals are pending, merged, deployed, or rejected — across multiple clusters and 
environments -- Operate identically against Elasticsearch and Lucidworks Fusion clusters, with a path to add pure Solr, Vespa, or others later +- Operate identically against Elasticsearch, OpenSearch, and Apache Solr — the three open-source engines the relevance community treats as canonical ## 4. Non-goals The tool will not: +- **Compete with OpenSearch SRW on manual A/B comparison.** SRW already does that GA, in-cluster, for OpenSearch. RelyLoop's value is what SRW deliberately doesn't do (full-search-space Bayesian optimization + Git-PR apply path) and what it can't do (engine-neutral across ES, OpenSearch, and Solr). Operators who only need manual A/B on OpenSearch should use SRW; RelyLoop is for the next-class problem. - Run online A/B tests on production traffic. It evaluates offline against judgment lists. -- Train Learning-to-Rank (LTR) models in v1. The output is query-time DSL/edismax parameter changes, not learned reranker weights. -- Manage the search-config repo's CI/CD. The tool opens PRs; the user's existing CI handles deployment. +- Train Learning-to-Rank (LTR) models in v1. The output is query-time DSL/edismax parameter changes, not learned reranker weights. LTR training is a v2 candidate. +- Manage the search-config repo's CI/CD. The tool opens PRs; the operator's existing CI handles deployment. - Make schema/mapping/analyzer changes. Tuning is restricted to query-time parameters. - Function as a search-engine UI. It does not show end-user search results; it shows experiment results. - Modify production cluster configuration directly. All changes flow through Git. - Provide an MCP server. The tool's HTTP API uses OpenAPI 3.1 + idiomatic REST + outgoing webhooks instead, which is testable with any HTTP client and consumable by any agent framework. The same operations the in-tool orchestrator uses are exposed externally — there is no second-class agent interface. 
+- **Support Lucidworks Fusion.** Fusion was scoped as MVP3 in an earlier plan and is dropped in the 2026-05-27 reframe; see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md) for the rationale (vendor entanglement, narrower audience overlap with the OSC/Haystack community, materially higher build cost than Solr, no roadmap commitment to commercial engines). A community-contributed Fusion adapter remains possible because the `SearchAdapter` Protocol is unchanged; the project does not own that direction. - **Sit on the live search-serving path.** The tool is for offline experimentation and change management. It does not score, rank, or rerank production search results in real time, and it is never an inline dependency of the search-serving infrastructure. Production search behavior is determined by the configs that have been merged into the config repo and deployed by the operator's CI — the tool's role ends at the PR. -- **Provide real-time production search-quality monitoring.** Streaming user signals into rolling-window quality metrics, alerting on degradation, and incident dashboards belong to operational observability tooling (APM, Fusion's own analytics, custom Grafana boards). The tool is deliberately scoped to the experimentation-and-change-management problem; expanding into production monitoring is a coherent v2 direction (see §27) but is **not** in v1. -- **Provide shadow validation against a live production traffic stream.** Pre-deploy validation in v1 is offline against query sets and judgment lists, plus the optional read-only "validate on prod" pass already in §17. Streaming a sample of live queries through a candidate config in real time is more confidence-building but requires stream-processing infrastructure that v1 deliberately avoids. 
+- **Provide real-time production search-quality monitoring.** Streaming user signals into rolling-window quality metrics, alerting on degradation, and incident dashboards belong to operational observability tooling (APM, SRW's own metrics surface, custom Grafana boards). The tool is deliberately scoped to the experimentation-and-change-management problem; expanding into production monitoring is a coherent v2 Path B direction (see §27) but is **not** in scope through GA v1. +- **Provide shadow validation against a live production traffic stream.** Pre-deploy validation is offline against query sets and judgment lists, plus the optional read-only "validate on prod" pass already in §17. Streaming a sample of live queries through a candidate config in real time is more confidence-building but requires stream-processing infrastructure that the project deliberately avoids through GA v1. - **Auto-rollback merged proposals based on real-time metrics.** Even if v2 adds production monitoring, auto-rollback is explicitly rejected. False positives are common, and auto-reverting deliberate human-approved changes breaks the change-management posture the tool is built around. v2 will surface alerts and a one-click manual rollback path; the human stays in the loop. -- **Bandit-style online learning / continuous deployment of mixture configs.** This is the most attractive v2 candidate (multi-armed bandits routing real production traffic across promising configs and progressively shifting toward winners) but is explicitly rejected for v1. It requires real-time integration into the search-serving path, which v1's architecture deliberately stays out of. Documented as a v2 direction in §27. +- **Bandit-style online learning / continuous deployment of mixture configs.** This is the most attractive v2 candidate (multi-armed bandits routing real production traffic across promising configs and progressively shifting toward winners) but is explicitly rejected for v1. 
It requires real-time integration into the search-serving path, which the project's architecture deliberately stays out of. Documented as a v2 Path B direction in §27. ## 5. Glossary -- **Cluster** — a single Elasticsearch, Lucidworks Fusion, or Solr deployment (e.g., `products-prod-es`, `inventory-staging-fusion`). -- **Target** — a specific index (ES) or collection (Fusion / Solr) on a cluster, plus a query template. For Fusion, the target also implies a Fusion app and (default) query pipeline. +- **Cluster** — a single Elasticsearch, OpenSearch, or Apache Solr deployment (e.g., `products-prod-es`, `products-staging-opensearch`, `catalog-prod-solr`). +- **Target** — a specific index (ES / OpenSearch) or collection (Solr) on a cluster, plus a query template. - **Query set** — a named, versioned collection of queries used as the input population for evaluation. - **Judgment list** — for each (query, document) pair in scope, a relevance rating (0–3 or binary). Sourced from LLM-as-judge initially; human-overridable. - **Query template** — a parametrized query definition (Jinja-rendered) for a specific engine. Has named parameters that match the search space. @@ -131,15 +148,16 @@ The tool will not: │ │ │ ┌────────▼──┐ ┌─────▼────┐ ┌───▼────────┐ │ Adapters │ │ ir_ │ │ Git provider│ - │ - ES │ │ measures │ │ - GitHub │ - │ - Fusion │ │ │ │ - PR API │ - │ - (Solr) │ │ │ │ │ + │ - ES/OS │ │ measures │ │ - GitHub │ + │ - Solr │ │ │ │ - PR API │ + │ (MVP2) │ │ │ │ │ └─┬─────────┘ └──────────┘ └────────────┘ │ ┌──────────▼──────────────────────────┐ │ Tuned clusters │ │ - ES: products-prod, products-staging, products-dev - │ - Fusion: inventory-prod, inventory-staging, ... 
+ │ - OpenSearch: catalog-prod, catalog-staging + │ - Solr: archive-prod (MVP2) └─────────────────────────────────────┘ ``` @@ -168,7 +186,7 @@ from typing import Protocol, runtime_checkable @runtime_checkable class SearchAdapter(Protocol): - engine_type: str # "elasticsearch" | "opensearch" | "solr" | "lucidworks_fusion" + engine_type: str # "elasticsearch" | "opensearch" | "solr" def health_check(self) -> HealthStatus: ... def list_targets(self) -> list[TargetInfo]: ... @@ -201,9 +219,9 @@ class SearchAdapter(Protocol): `search_batch` is the only hot-path method during a study. Everything else is define-time or debug-time. -### ElasticAdapter / OpenSearchAdapter +### ElasticAdapter / OpenSearchAdapter (MVP1, shipped) -A single adapter handles both **Elasticsearch** and **OpenSearch**. The engine_type column distinguishes them at the database level (`elasticsearch` vs `opensearch`), and the adapter branches on that flag for the small set of behaviors that differ between the two engines. Reasons for one-adapter-two-engines: +A single adapter handles both **Elasticsearch** and **OpenSearch**. The `engine_type` column distinguishes them at the database level (`elasticsearch` vs `opensearch`), and the adapter branches on that flag for the small set of behaviors that differ between the two engines. Reasons for one-adapter-two-engines: - ES and OpenSearch share the same Query DSL — `multi_match`, `function_score`, `bool`, etc. work identically across both - The `_msearch` and `_explain` endpoints exist on both with the same shape @@ -213,53 +231,43 @@ A single adapter handles both **Elasticsearch** and **OpenSearch**. The engine_t Implementation notes: - `search_batch` is implemented via the `_msearch` API for efficiency. -- `render` produces ES/OpenSearch Query DSL JSON; Jinja templates live under `templates/elasticsearch/` and work against both engines unmodified for the v1 query patterns. 
+- `render` produces ES/OpenSearch Query DSL JSON; Jinja templates live under `templates/elasticsearch/` and work against both engines unmodified for the MVP1 query patterns. - `explain` calls the `_explain` endpoint. - Engine support: Elasticsearch 8.11+ and 9.x; OpenSearch 2.x (matches ES 7.10 baseline) and 3.x. Older versions explicitly out of scope. - Authentication: ES uses API keys (or basic auth for older deployments); OpenSearch supports basic auth, API keys, and AWS SigV4 (when running in AWS managed OpenSearch). The adapter selects auth flow via `cluster.auth_kind`. Why this matters for licensing and OSS positioning: Elasticsearch's Basic license is free for self-hosting but is not OSI-approved OSS. OpenSearch is Apache 2.0. Supporting both means RelyLoop adopters who care about the licensing distinction can choose OpenSearch without losing functionality, and adopters already on ES don't need to migrate. -### LucidworksFusionAdapter notes +### SolrAdapter (MVP2, bundled with UBI judgments) -The primary "Solr-side" adapter for v1. Lucidworks Fusion is built on Solr but exposes a different API surface centered on Query Pipelines. Pure-Solr deployments are supported architecturally (see SolrAdapter notes below) but are not in v1 scope. +Apache Solr ships in MVP2 alongside the UBI judgments feature; together they complete RelyLoop's three-engine sweep with UBI on every engine. See [`infra_adapter_solr/idea.md`](../../02_product/planned_features/infra_adapter_solr/idea.md) for the full scope. -- `search_batch` posts to Fusion's query API: `POST /api/apps/{app}/query/{collection}` with the request body holding query text and per-stage parameter overrides (`params.{stageId}.{paramName}`). Parallelism is handled with a small connection pool, similar to the Solr adapter. -- `render` produces a Fusion request body, **not** a raw Solr query. 
A "template" in Fusion is a query pipeline definition exported as JSON, plus a parameter-binding map that says which template parameters override which pipeline-stage parameters at request time. Rendering takes the pipeline definition + binding + parameter values and produces an override-laden Fusion request. -- `get_schema` queries Fusion's catalog API for the schema of the configured collection. -- `explain` uses Fusion's debug-enabled query (`params.solr.debugQuery=true`) and parses the `debug.explain` block returned through the Fusion gateway. -- **Authentication.** Fusion uses session-based auth (`POST /api/session` returning a session cookie) or JWT. The adapter manages a session pool; the `auth_kind` field on the cluster row distinguishes `fusion_session` vs `fusion_jwt`. Credentials referenced via the same `credentials_ref` pattern. -- **Pipeline export/import.** Apply path uses Fusion's `objects-export` and `objects-import` APIs (see §16). Pipeline JSON is the canonical Git artifact. -- **Signals** (v1.5+). Fusion's signals collections (`{app}_signals`) capture user click, view, and refinement events. The adapter exposes a `pull_signals` operation that returns aggregated signals over a window, suitable for click-derived judgment generation. Not on the v1 hot path because the user's deployment hasn't enabled signals yet. -- Supports Fusion 5.x (current). Fusion 4.x deferred until needed. - -### SolrAdapter notes (architectural reference; not v1 scope) - -Pure Apache Solr is supported by the same adapter pattern but is not built in v1 because the user's deployment is Lucidworks Fusion. The notes below describe what a SolrAdapter implementation would do, both as a future engine and as evidence that the architecture isn't Fusion-locked. - -- `search_batch` is implemented via parallel `/select` requests, one per query, with a small connection pool. (Solr has no `_msearch` equivalent; the JSON Request API allows multi-query but is awkward.) 
-- `render` produces Solr query parameters as a dict (later URL-encoded); supports `lucene`, `edismax`, and `dismax` parsers. +- `search_batch` uses parallel `/select` requests with a small connection pool. Solr has no `_msearch` equivalent; the JSON Request API allows multi-query but is awkward and undertested across versions. +- `render` produces a Solr request parameter dict (later URL-encoded). Supports `edismax` (primary), `dismax`, and `lucene` parsers. Templates live under `templates/solr/` as Jinja templates that emit parameter maps, mirroring `templates/elasticsearch/` shape. +- `get_schema` uses Solr's Schema API (`/schema/fields`, `/schema/dynamicfields`, `/schema/fieldtypes`). +- `list_targets` uses CoresAdmin API (`/admin/cores?action=STATUS`) for standalone; CollectionsAdmin (`/admin/collections?action=LIST`) for SolrCloud. Selects automatically based on a startup capability probe. - `explain` uses `debugQuery=true&debug=results` and parses the `debug.explain` block. -- Supports Solr 8.11+ and 9.x. SolrCloud and standalone both supported. -- Authentication via basic auth or API tokens. +- Engine support: Solr 9.x (current widely-deployed) and Solr 10.x (released 2026-03 with `modules/ltr` stable + new LTR cache). SolrCloud and standalone modes both supported. Solr 8.x and earlier explicitly out of scope. +- Authentication: `auth_kind` extended to include `solr_basic` (HTTP Basic) and `solr_apikey` (Solr 9+ JWT via the security.json `JWTAuthPlugin`). +- LTR rescoring: applies a pre-existing `MultipleAdditiveTreesModel` (XGBoost-compatible) loaded via Solr's `/schema/model-store` as a rescore stage in a trial. Training is out of scope; the adapter consumes models the operator uploads separately. 
+- UBI on Solr: Solr ships `solr.UBIComponent` in core ([reference guide](https://solr.apache.org/guide/solr/latest/query-guide/learning-to-rank.html); [UBI tools index](https://www.ubisearch.dev/tools/)) writing the same `ubi_queries` + `ubi_events` schema as the OpenSearch UBI plugin. The MVP2 `UbiReader` works on Solr unchanged.
 
 ### Cross-engine parameter naming
 
 Each adapter maps a unified parameter vocabulary to native names. Templates use the unified names; rendering pivots them.
 
-| Concept | Unified name | ES (`multi_match`) | Lucidworks Fusion | Solr (`edismax`) |
-|---|---|---|---|---|
-| Per-field weights | `field_boosts: {f: w}` | `fields: ["f^w"]` | stage param `searchFields.fields` or `params.solr.qf` override | `qf=f^w` |
-| Phrase fields | `phrase_field_boosts` | nested `phrase` clause | `params.solr.pf` override | `pf` |
-| Tie breaker | `tie_breaker` | `tie_breaker` | `params.solr.tie` override | `tie` |
-| Min should match | `min_should_match` | `minimum_should_match` | `params.solr.mm` override | `mm` |
-| Fuzziness | `fuzziness` | `fuzziness` | (manual via `~` in query parser) | (manual via `~`) |
-| Slop | `slop` | `slop` | `params.solr.ps` override | `ps` |
-| Boost function | `boost_fn: {field, type, params}` | `function_score` | boosting stage `bq` override | `boost`, `bf` |
-| Reranker model | `rerank_model: {id, top_k}` | `rescore.window_size` + LTR | rerank stage `modelId`, `topK` | LTR plugin model |
-| Pipeline stage toggle | `stage_enabled: {stage_id: bool}` | (n/a) | per-stage `enabled` param | (n/a) |
-
-Where a concept doesn't exist natively (e.g., ES `function_score` rendered as Fusion `bq`), the adapter either provides a best-effort translation or raises `UnsupportedParameter` at render time and the search-space validator rejects the study before it runs. Fusion's `stage_enabled` parameter is unique to Fusion — it lets a study toggle individual pipeline stages on/off as a categorical parameter, which is a powerful and engine-specific tuning lever.
+| Concept | Unified name | ES / OpenSearch (`multi_match`) | Solr (`edismax`) |
+|---|---|---|---|
+| Per-field weights | `field_boosts: {f: w}` | `fields: ["f^w"]` | `qf=f^w` |
+| Phrase fields | `phrase_field_boosts` | nested `phrase` clause | `pf` |
+| Tie breaker | `tie_breaker` | `tie_breaker` | `tie` |
+| Min should match | `min_should_match` | `minimum_should_match` | `mm` (accepts richer arithmetic syntax — `2<-25% 9<-3`) |
+| Fuzziness | `fuzziness` | `fuzziness` | (manual via `~` in query parser) |
+| Slop | `slop` | `slop` | `ps` |
+| Boost function | `boost_fn: {field, type, params, combine: "add"\|"multiply"}` | `function_score` (multiplicative by default; additive when `combine=add`) | `bf` (additive) or `boost` (multiplicative) chosen by `combine` |
+| Reranker model | `rerank_model: {id, top_k}` | `rescore.window_size` + LTR | `rq={!ltr model=... reRankDocs=...}` |
+
+Where a concept doesn't exist natively, the adapter either provides a best-effort translation or raises `UnsupportedParameter` at render time and the search-space validator rejects the study before it runs. Engine-version differences (Solr 9 vs 10 LTR module path, OpenSearch 2 vs 3 hybrid retriever shape) are handled inside the adapter's capability probe — the unified vocabulary is engine-version-stable.
 
 ## 9. 
Data model (Postgres) @@ -285,10 +293,10 @@ clusters ( id UUID PRIMARY KEY, tenant_id UUID NOT NULL REFERENCES tenants(id), name TEXT NOT NULL, -- "products-prod-es" - engine_type TEXT NOT NULL, -- "elasticsearch" | "opensearch" | "solr" | "lucidworks_fusion" + engine_type TEXT NOT NULL, -- "elasticsearch" | "opensearch" | "solr" environment TEXT NOT NULL, -- "prod" | "staging" | "dev" base_url TEXT NOT NULL, - auth_kind TEXT NOT NULL, -- "es_apikey" | "es_basic" | "opensearch_basic" | "opensearch_sigv4" | "solr_basic" | "fusion_session" | "fusion_jwt" + auth_kind TEXT NOT NULL, -- "es_apikey" | "es_basic" | "opensearch_basic" | "opensearch_sigv4" | "solr_basic" | "solr_apikey" credentials_ref TEXT NOT NULL, -- key into mounted secrets config_repo_id UUID REFERENCES config_repos(id), config_path TEXT NOT NULL, -- where in repo this cluster's templates live @@ -301,8 +309,7 @@ clusters ( -- engine_config shape per engine_type: -- elasticsearch: null or {api_version: "8" | "9"} -- opensearch: null or {os_version: "2" | "3"} --- solr: {solr_cloud: bool, default_collection: text} --- lucidworks_fusion: {app: text, default_pipeline: text, signals_collection: text?, fusion_version: "5"} +-- solr: {solr_cloud: bool, default_collection: text, solr_version: "9" | "10"} -- Config repository registry config_repos ( @@ -563,33 +570,7 @@ Templates are Jinja2 source files. Storage: rows in `query_templates`, body is t } ``` -### Example: Lucidworks Fusion template (Query Pipeline override) - -A Fusion template stores the pipeline definition as a versioned blob in the config repo (canonical source of truth) and a Jinja-rendered request body that supplies parameter overrides at request time. The pipeline itself is unchanged by tuning — only its parameters at request time vary. 
- -```jinja -{ - "params": [ - {"name": "q", "value": "{{ query_text }}"}, - {"name": "rows", "value": {{ top_k | default(10) }}}, - {"name": "params.solr.qf", "value": "title^{{ field_boosts.title }} body^{{ field_boosts.body }}{% if field_boosts.tags %} tags^{{ field_boosts.tags }}{% endif %}"}, - {"name": "params.solr.tie", "value": "{{ tie_breaker }}"}, - {"name": "params.solr.mm", "value": "{{ min_should_match | default('2<-25%') }}"}, - {"name": "params.solr.ps", "value": "{{ slop | default(0) }}"} - {% if rerank_model and rerank_model.id %}, - {"name": "params.rerank.modelId", "value": "{{ rerank_model.id }}"}, - {"name": "params.rerank.topK", "value": "{{ rerank_model.top_k | default(50) }}"} - {% endif %} - {% for stage_id, enabled in stage_enabled.items() %}, - {"name": "params.{{ stage_id }}.enabled", "value": "{{ enabled }}"} - {% endfor %} - ] -} -``` - -This is dispatched via `POST /api/apps/{app}/query/{collection}` with the rendered body. The pipeline definition itself (the stages, their default parameters) lives in `pipeline.json` alongside the template — versioned together so a study is reproducible against a known pipeline shape. - -### Example: Solr template (edismax) — reference for future engine support +### Example: Solr template (edismax) — MVP2 ```jinja { @@ -600,10 +581,13 @@ This is dispatched via `POST /api/apps/{app}/query/{collection}` with the render "ps": "{{ slop | default(0) }}", "q": "{{ query_text }}", "rows": {{ top_k | default(10) }} + {% if rerank_model and rerank_model.id %}, + "rq": "{!ltr model={{ rerank_model.id }} reRankDocs={{ rerank_model.top_k | default(50) }}}" + {% endif %} } ``` -All three templates declare parameters using the unified vocabulary (`field_boosts.*`, `tie_breaker`, `min_should_match`, `slop`, etc.). Engine-unique parameters like `fuzziness` (ES) and `stage_enabled` (Fusion) are declared per template. The search space references these names. 
+Both templates declare parameters using the unified vocabulary (`field_boosts.*`, `tie_breaker`, `min_should_match`, `slop`, etc.). Engine-unique parameters like `fuzziness` (ES) are declared per template. The search space references these names. ### Authoring & versioning @@ -688,7 +672,7 @@ queued → running → completed ### Engine: provider-abstracted IR evaluation via `ir_measures` -Workers always evaluate via `ir_measures`, never `_rank_eval`. This guarantees identical metric semantics across ES, Fusion, and Solr, and simplifies cross-engine comparisons. Reasoning: +Workers always evaluate via `ir_measures`, never `_rank_eval`. This guarantees identical metric semantics across Elasticsearch, OpenSearch, and Apache Solr, and simplifies cross-engine comparisons. Reasoning: - `ir_measures` (from the PyTerrier team) wraps multiple IR-evaluation backends behind a typed metric-object DSL (`nDCG@10`, `AP@5`, `RR`, `P@k`, `R@k`). The provider abstraction means swapping the underlying backend is a config change rather than a rewrite — protecting against future single-maintainer abandonment risk. - ES `_rank_eval` and `ir_measures` don't always agree to many decimal places (different normalization conventions across engines). @@ -735,13 +719,13 @@ Because UBI is just two indices in the cluster RelyLoop is already adapting, the - conversion rate (where the operator emits conversion events) - query-refinement rate -The pluggable `SignalsConverter` then maps these features to a 0–3 rating. Initial converters: position-bias-corrected CTR threshold, dwell-time threshold, and **hybrid UBI+LLM** (UBI rates the dense head; LLM-as-judge fills the long tail for queries below an impression threshold). Counterfactual click models (CCM, DBN) are documented as v1.5+ post-GA extensions because they need enough impressions per (query, doc) to be statistically meaningful. +The pluggable `SignalsConverter` then maps these features to a 0–3 rating. 
Initial converters: position-bias-corrected CTR threshold, dwell-time threshold, and **hybrid UBI+LLM** (UBI rates the dense head; LLM-as-judge fills the long tail for queries below an impression threshold). Counterfactual click models (CCM, DBN) are documented as post-MVP2 extensions because they need enough impressions per (query, doc) to be statistically meaningful. -The judgments table accepts mixed-source lists today (the `source IN ('llm', 'human', 'click')` CHECK has shipped since MVP1) — no schema migration is required to turn this on. The MVP1.5 deliverable is the `UbiReader` + `SignalsConverter` + a new `POST /api/v1/judgment-lists/generate-from-ubi` endpoint + a new `generate_judgments_from_ubi` agent tool. See [`feat_ubi_judgments/idea.md`](../../02_product/planned_features/feat_ubi_judgments/idea.md) for the planned-feature scope. +The judgments table accepts mixed-source lists today (the `source IN ('llm', 'human', 'click')` CHECK has shipped since MVP1) — no schema migration is required to turn this on. The MVP2 deliverable bundles the `UbiReader` + `SignalsConverter` + `POST /api/v1/judgment-lists/generate-from-ubi` endpoint + `generate_judgments_from_ubi` agent tool with the Solr adapter, so all three engines support UBI judgments from the moment MVP2 ships. See [`feat_ubi_judgments/idea.md`](../../02_product/planned_features/feat_ubi_judgments/idea.md) and [`infra_adapter_solr/idea.md`](../../02_product/planned_features/infra_adapter_solr/idea.md) for the planned-feature scope. -Predicated on the operator having installed the OpenSearch UBI plugin and logged enough events to be statistically useful. Deployments without UBI continue to run LLM-as-judge unchanged. +Predicated on the operator having installed the UBI plugin on their engine (OpenSearch UBI plugin, the o19s Elasticsearch UBI fork, or Solr's first-party `solr.UBIComponent`) and logged enough events to be statistically useful. 
Deployments without UBI continue to run LLM-as-judge unchanged. -**Engine-native readers as a drop-in extension.** Operators on engines that haven't adopted UBI but have their own behavioral-data stream — Elastic Behavioral Analytics for ES clusters, the Fusion `{app}_signals` collection for Fusion clusters — get a thin engine-specific reader feeding the same `SignalsConverter` Protocol. Reader work is local to the adapter that ships it (the Fusion reader rides MVP3 alongside the Fusion adapter; the ES Behavioral Analytics reader rides v2). The converter library, the API surface, and the storage shape are unchanged across all readers. +**Engine-native readers as a drop-in extension.** Operators on ES who haven't adopted UBI but use Elastic Behavioral Analytics (despite Elastic's 9.0 deprecation, residual deployments remain through ~2028) can be supported by a thin engine-specific reader feeding the same `SignalsConverter` Protocol. This is a backlog item, not pre-GA scope — the engine-neutral UBI path covers the vast majority of clickstream sources today. ### LLM-as-judge @@ -985,7 +969,7 @@ OTLP exporter pointed at `http://signoz-otel-collector:4317`. One configuration - `relyloop_openai_tokens_total{kind}` - `relyloop_optuna_ask_duration_seconds` -W3C Trace Context (`traceparent`) is propagated through to ES / Fusion, so distributed traces span the full agent → API → engine boundary. +W3C Trace Context (`traceparent`) is propagated through to ES / OpenSearch / Solr, so distributed traces span the full agent → API → engine boundary. 
### How they fit together @@ -998,7 +982,7 @@ Workers ├── Trial executions ──→ OTLP ──→ SigNoz └── Digest LLM calls ──→ Langfuse handler (also OTLP for surrounding spans) -Adapters → ES / Fusion +Adapters → ES / OpenSearch / Solr └── HTTP spans ──→ OTLP ──→ SigNoz (with traceparent propagated) ``` @@ -1027,12 +1011,10 @@ search-configs/ templates/ product_search.yaml product_search.yaml.params.json - inventory-prod-fusion/ - pipelines/ - product_search_pipeline.json - product_search_pipeline.params.json - profiles/ - product_search.profile.json + catalog-prod-solr/ + templates/ + catalog_search.yaml + catalog_search.yaml.params.json ``` The `*.params.json` file holds the production parameter values that the deployment pipeline injects into the template at deploy time. **The tool only edits `*.params.json`; it does not edit templates themselves.** @@ -1045,26 +1027,11 @@ This matters because: ### Engine-specific apply path details -**Elasticsearch.** The `*.params.json` file is read by the user's deployment pipeline and injected into the index template / search application configuration at deploy time. The tool does not interact with the cluster directly during apply. +The apply path is uniform across all three supported engines — the tool edits `*.params.json`, the operator's CI does the rest. Engine-specific notes: -**Lucidworks Fusion.** Fusion's pipelines are versioned objects in Fusion's own catalog, so the apply path is two-step: +**Elasticsearch / OpenSearch.** The `*.params.json` file is read by the operator's deployment pipeline and injected into the index template / search application configuration at deploy time. The tool does not interact with the cluster directly during apply. -1. Tool edits `*.params.json` and (where the change is large enough to warrant a new pipeline version) commits an updated `pipeline.json` to the same path. PR is opened. -2. 
After PR merge, the user's CI runs Fusion's `objects-import` API (or `fusion-cli`) to push the updated pipeline into Fusion. The tool does **not** push to Fusion directly — same principle as the ES case, the tool stops at the PR and CI handles deployment. - -The conventions we recommend for the config repo when targeting Fusion: - -``` -search-configs/ - products-prod-fusion/ - pipelines/ - product_search_pipeline.json ← canonical pipeline definition - product_search_pipeline.params.json ← what the tool edits - profiles/ - product_search.profile.json ← optional: query profile binding -``` - -CI should verify that `pipeline.json` plus `params.json` together produce a valid Fusion pipeline before importing. A small validator script using Fusion's pipeline-validate API is recommended. +**Apache Solr (MVP2).** The `*.params.json` file is consumed by the operator's CI, which writes the updated parameters into Solr via the Request Parameters API (`POST /api/config/params`) or by editing the `solrconfig.xml` defaults and reposting. The tool does not push to Solr directly — same principle as the ES/OpenSearch case, the tool stops at the PR and CI handles deployment. ### PR creation flow @@ -1194,9 +1161,9 @@ The `clusters` table holds every cluster the tool can talk to, scoped by tenant. |---|---|---|---| | acme-corp | products-prod-es | elasticsearch | prod | | acme-corp | products-staging-es | elasticsearch | staging | -| acme-corp | inventory-prod-fusion | lucidworks_fusion | prod | -| beta-co | search-prod-fusion | lucidworks_fusion | prod | -| beta-co | search-staging-fusion | lucidworks_fusion | staging | +| acme-corp | catalog-prod-solr | solr | prod | +| beta-co | search-prod-opensearch | opensearch | prod | +| beta-co | search-staging-opensearch | opensearch | staging | | internal-platform | docs-prod-es | elasticsearch | prod | In single-tenant deployments the `tenant` column is implicit (always `default`), and the cluster name is the unique identifier on its own. 
@@ -1280,13 +1247,6 @@ The orchestrator agent in the API backend uses OpenAI function calling. Tool inv - `get_schema(cluster_id, target)` → `Schema` - `list_query_parsers(cluster_id)` → `[str]` -### Fusion-specific (Fusion clusters only) - -- `list_pipelines(cluster_id)` → `[PipelineSummary]` — list query pipelines available in the Fusion app -- `get_pipeline(cluster_id, pipeline_id)` → `PipelineDefinition` — full pipeline JSON with stages -- `list_query_profiles(cluster_id)` → `[QueryProfileSummary]` -- `pull_signals(cluster_id, since, until?, query_filter?)` → `SignalsAggregate` — *(MVP3, requires Fusion Signals enabled)* aggregate raw Fusion `{app}_signals` events into per-(query, doc) interaction features. Engine-specific reader feeding the shared `SignalsConverter` Protocol introduced at MVP1.5; see §14 "Click-derived judgments from user behavior data". - ### Templates - `list_templates(engine_type?)` → `[TemplateSummary]` @@ -1299,7 +1259,7 @@ The orchestrator agent in the API backend uses OpenAI function calling. Tool inv - `create_query_set(name, queries[])` → `QuerySet` - `import_queries_from_csv(query_set_id, csv_data)` → `int` - `generate_judgments_llm(query_set_id, cluster_id, target, current_template_id, rubric)` → `JudgmentList` -- `generate_judgments_from_ubi(query_set_id, cluster_id, target, since, until?, converter, llm_fill_threshold?)` → `JudgmentList` — *(MVP1.5, requires OpenSearch UBI plugin)* read `ubi_queries` + `ubi_events`, aggregate per-(query, doc) features via `UbiReader`, run the named `SignalsConverter`, and (optionally) fill the long tail with LLM-as-judge when impression count < `llm_fill_threshold`. Emits a judgment list with mixed `source` rows (`click` + optional `llm`). See §14. 
+- `generate_judgments_from_ubi(query_set_id, cluster_id, target, since, until?, converter, llm_fill_threshold?)` → `JudgmentList` — *(MVP2, requires the UBI plugin installed on the engine — OpenSearch UBI plugin, Elasticsearch o19s fork, or Solr's first-party `solr.UBIComponent`)* read `ubi_queries` + `ubi_events`, aggregate per-(query, doc) features via `UbiReader`, run the named `SignalsConverter`, and (optionally) fill the long tail with LLM-as-judge when impression count < `llm_fill_threshold`. Emits a judgment list with mixed `source` rows (`click` + optional `llm`). See §14. - `get_calibration(judgment_list_id)` → `CalibrationStats` ### Search space proposal @@ -1426,7 +1386,7 @@ Three endpoints let an agent learn what the API can do without out-of-band docum ```json { "version": "1.0.0", - "engines_supported": ["elasticsearch", "lucidworks_fusion"], + "engines_supported": ["elasticsearch", "opensearch", "solr"], "clusters": [ {"id": "...", "name": "products-prod-es", "engine_type": "elasticsearch", "environment": "prod"}, ... @@ -1565,7 +1525,7 @@ Agents can pass through W3C Trace Context: traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01 ``` -The API and workers honor and propagate it, so distributed traces span the agent → API → ES/Fusion boundary. `X-Request-ID` is also accepted and echoed in responses. +The API and workers honor and propagate it, so distributed traces span the agent → API → search-cluster boundary. `X-Request-ID` is also accepted and echoed in responses. Rate-limit headers are present on every response: @@ -1678,7 +1638,7 @@ This is an internal tool used during business hours; targets reflect that. - Langfuse down → LLM calls still succeed; observability traces queue locally and replay. - SigNoz down → spans buffer in OTel collector; metrics best-effort. - GitHub API down → proposal stays in `pending`; PR worker retries with backoff. - - Single ES/Fusion target down → studies on that target fail; others unaffected. 
+ - Single search-cluster target down → studies on that target fail; others unaffected. Recovery objectives: @@ -1756,7 +1716,7 @@ A four-layer test pyramid. Each layer has a distinct purpose, runtime profile, a - **Adapter contract tests** — every `SearchAdapter` implementation runs the same conformance suite. Lives in `tests/contracts/test_search_adapter_contract.py`, parameterized by adapter. Verifies `render`, `search_batch`, `explain`, `health_check` behavior. - **Tool definition contract tests** — every `@tool`-decorated function has its OpenAPI schema, OpenAI function-calling schema, and Python signature checked for mutual consistency. - **OpenAPI contract tests** — every endpoint listed in §20 is reachable, returns the documented schema, and accepts the documented parameters. - - **External provider contract tests** — a small suite that hits OpenAI's API, GitHub's API, and (via cassette refresh) Fusion's gateway to confirm our assumptions about request/response shapes still hold. Run on the nightly schedule, not per-PR. + - **External provider contract tests** — a small suite that hits OpenAI's API and GitHub's API to confirm our assumptions about request/response shapes still hold. Run on the nightly schedule, not per-PR. Search-engine contracts are exercised by integration tests against the Compose ES/OpenSearch/Solr stack, not by external probes. - **Mocking**: minimal — only mocks the layer below the contract boundary. Adapter contract tests use `pytest-recording` cassettes. - **Coverage gate**: structural — every adapter implementation, every `@tool`, and every endpoint must have at least one contract test. Enforced by a custom CI check. - **Runtime**: < 60 s in CI. @@ -1764,7 +1724,7 @@ A four-layer test pyramid. Each layer has a distinct purpose, runtime profile, a #### Integration tests - **Scope**: multiple components composed together, with **only external systems mocked**. 
Internal services (Postgres, Redis, the agent backend, workers) run for real in CI via Docker Compose. -- **Mocking**: external HTTP only — OpenAI calls (cassetted via `vcrpy`), Fusion query gateway (cassetted), GitHub API (cassetted), ES (real, free, runs in CI Compose). +- **Mocking**: external HTTP only — OpenAI calls (cassetted via `vcrpy`), GitHub API (cassetted). Search engines (ES, OpenSearch, Solr) run for real in the CI Compose stack — all three are Apache 2.0 or free-Basic and run in service containers. - **Examples**: - Full Optuna loop with a real Postgres but cassetted OpenAI / search-engine calls - LangGraph orchestrator processing a chat message end-to-end with cassetted LLM responses @@ -1777,13 +1737,13 @@ A four-layer test pyramid. Each layer has a distinct purpose, runtime profile, a - **Scope**: full system, **no mocking, real external services**. Run against a dedicated test environment. - **Live services used**: - - Real OpenAI API (separate budget-capped API key for E2E) - - The shared dev Fusion cluster (with namespaced test pipelines per CI run; see §25 *Deployment*) - - A live Elasticsearch instance (free, deployed alongside) - - A test config repo on GitHub (separate from the production config repo) + - Real OpenAI-compatible API endpoint (separate budget-capped key for E2E) + - Compose ES + OpenSearch + Solr (MVP2+) service containers + - A test config repo on GitHub (separate from any production config repo) - **Examples**: - - Run a 10-trial study against the staging Fusion cluster, verify metrics improve - - Generate judgments via real OpenAI calls for a small fixed query set + - Run a 10-trial study against each of the three engines, verify metrics improve + - Generate judgments via real LLM calls for a small fixed query set + - Generate UBI-derived judgments (MVP2+) against seeded `ubi_queries` / `ubi_events` - Create a proposal that opens a real PR in the test config repo, then auto-close it - Drive a complete chat conversation 
with the orchestrator, verify expected tool calls - **Mocking**: forbidden — if a test needs to mock something, it belongs in integration tests instead. @@ -1800,7 +1760,6 @@ tests/ integration/ # Compose-based, external HTTP cassetted e2e/ # full-stack, no mocks fixtures/ - fusion-cassettes/ openai-cassettes/ github-cassettes/ ``` @@ -1841,7 +1800,7 @@ Five workflows in `.github/workflows/`: - Dependency vulnerability scan against the latest images 5. **`cassette-refresh.yml`** — manual `workflow_dispatch`: - - Re-records cassettes for the named external service (Fusion, OpenAI, GitHub) + - Re-records cassettes for the named external service (OpenAI, GitHub) - Opens a PR with the updated cassette files Caching: @@ -1863,7 +1822,7 @@ Branch protection rules on `main`: - **README** with a 5-minute quickstart for new engineers (clone, `docker compose up`, point at the local UI). - **OpenAPI spec** auto-published at `/openapi.json` and rendered by Stoplight or Redoc at `/docs`. -- **ADRs** (Architecture Decision Records) for big choices: LangGraph, Langfuse, SigNoz, Fusion-as-primary-Solr-side-adapter, no-MCP, etc. One file per decision in `docs/09_decisions/`. +- **ADRs** (Architecture Decision Records) for big choices: LangGraph, Langfuse, SigNoz, Solr-as-MVP2-third-engine, Fusion-explicitly-dropped, no-MCP, etc. One file per decision in `docs/09_decisions/`. - **Runbooks** in `docs/03_runbooks/` for: cassette refresh, eval-suite failure investigation, Langfuse storage cleanup, Postgres restore, study cancellation cleanup. - **Inline**: every Pydantic model has field descriptions; every public function has a docstring (enforced by `ruff D`). 
@@ -1951,7 +1910,7 @@ Event domains and a representative subset: | `digest.*` | `digest.requested`, `digest.generated`, `digest.failed` | | `proposal.*` | `proposal.created`, `proposal.pr_open_requested`, `proposal.pr_opened`, `proposal.pr_merged`, `proposal.pr_closed`, `proposal.rejected`, `proposal.cancelled` | | `agent.*` | `agent.conversation_started`, `agent.message_received`, `agent.tool_called`, `agent.tool_call_failed`, `agent.interrupt_requested`, `agent.interrupt_resolved` | -| `adapter.*` | `adapter.search_batch_started`, `adapter.search_batch_completed`, `adapter.session_renewed` (Fusion), `adapter.pipeline_drift_detected` (Fusion) | +| `adapter.*` | `adapter.search_batch_started`, `adapter.search_batch_completed`, `adapter.auth_renewed`, `adapter.capability_probe_completed` | | `worker.*` | `worker.started`, `worker.job_picked`, `worker.job_completed`, `worker.job_failed`, `worker.shutdown` | | `git.*` | `git.clone_started`, `git.branch_created`, `git.commit_pushed`, `git.pr_created`, `git.webhook_received` | | `system.*` | `system.startup`, `system.shutdown`, `system.config_loaded`, `system.config_invalid`, `system.slow_operation` | @@ -1995,7 +1954,7 @@ W3C `traceparent` propagates through every service boundary: | API → Redis (queue enqueue) | Custom: serialize `traceparent` into Arq job headers | | Redis → worker (job pickup) | Custom: deserialize `traceparent`, attach to worker span | | Worker → adapter HTTP | OTel httpx instrumentation injects header | -| Adapter → ES / Fusion | Outbound `traceparent` header on every search call | +| Adapter → ES / OpenSearch / Solr | Outbound `traceparent` header on every search call | | API → Git provider | OTel httpx instrumentation injects header | | API → OpenAI | Langfuse handler reads ambient OTel context, records as Langfuse trace metadata | @@ -2197,11 +2156,9 @@ Sizing rule of thumb: one VM with 8 vCPU + 32 GB RAM handles 10 concurrent studi ### Local development environment -Lucidworks Fusion has no 
free tier or community edition; the only supported paths are commercial licenses, evaluation licenses (30–90 days), and Fusion Cloud. To keep day-one onboarding friction low and avoid blocking on license requests, the development model **does not require a local Fusion instance**. - -Three tiers of test/dev environment: +All three supported engines (Elasticsearch, OpenSearch, Apache Solr) are free and open source, so the development model runs entirely on a laptop with no external dependencies, eval licenses, or vendor accounts. -**Tier 1 — Local docker-compose (no Fusion).** The default `docker-compose.yml` adds three free-and-open engine containers: Elasticsearch (free Basic license), OpenSearch (Apache 2.0), and Apache Solr. ~80% of the system — data model, agent orchestrator, Optuna loop, ir_measures, UI, proposals, PR flow, agent integration layer — can be developed and tested entirely on this stack. New engineers clone, `docker compose up`, and are productive without any Lucidworks involvement. **For the MVP / v0.1 release, ES + OpenSearch are the only engines supported**; Fusion ships in GA v1 and Solr in v2. +**Local docker-compose.** The default `docker-compose.yml` adds three engine containers and the supporting services. New engineers clone, `make up`, and are productive against the full three-engine stack from day one. MVP1 ships with the ES + OpenSearch containers live; the Solr container activates with MVP2 alongside the `SolrAdapter`. 
```yaml # docker-compose.yml additions for local dev @@ -2220,54 +2177,22 @@ services: - DISABLE_SECURITY_PLUGIN=true # local dev only; production requires security plugin ports: ["9201:9200"] # different host port to coexist with ES - solr: - image: solr:9.5 + solr: # activates at MVP2 + image: solr:10.0 ports: ["8983:8983"] - command: ["solr-precreate", "default-collection"] -``` - -**Tier 2 — Fusion adapter unit tests with replay fixtures.** When developing the `LucidworksFusionAdapter`, use [`pytest-recording`](https://pytest-recording.readthedocs.io/) (built on `vcrpy`). Tests run once against a real Fusion to record HTTP interactions into YAML cassettes (`tests/fixtures/fusion-cassettes/`), then replay deterministically without network access. Cassettes are checked into the repo and refreshed only when the upstream Fusion API contract changes. Engineers running just unit tests never need Fusion access. - -```python -@pytest.mark.vcr # replays from tests/fixtures/fusion-cassettes/test_query_pipeline.yaml -def test_fusion_query_pipeline_render(fusion_adapter, sample_template): - result = fusion_adapter.search_batch(target="products", queries=[...], top_k=10) - assert "products" in result + command: ["solr-precreate", "products"] ``` -**Tier 3 — Integration tests against the shared dev Fusion cluster.** CI runs Fusion-touching integration tests against the org's existing Fusion dev environment. The integration test runner creates dedicated, namespaced pipelines (`relyloop-test-{branch}-{run_id}`) at setup and tears them down at teardown, so concurrent CI runs and engineer-driven manual tweaks never conflict. This is a hard requirement, not optional — without namespace isolation, the dev cluster's pipeline state will drift and tests will fail for unrelated reasons. 
- -**Tier 4 (optional) — Mock Fusion service for UI/demo work.** A small companion overlay file `docker-compose.dev.yml` adds a `fusion-mock` service: ~200 lines of FastAPI emulating the Fusion query gateway with canned responses. Useful for UI development, screenshots, and demos when even the shared dev cluster isn't reachable (e.g., off-network, intermittent connectivity). Not used in unit or integration tests — those use cassettes or real Fusion. - -```yaml -# docker-compose.dev.yml — opt-in via `docker compose -f docker-compose.yml -f docker-compose.dev.yml up` -services: - fusion-mock: - build: ./fusion-mock - ports: ["8764:8764"] - environment: - MOCK_FIXTURES_DIR: /fixtures - volumes: [./fusion-mock/fixtures:/fixtures] -``` - -### When a real local Fusion is required - -A handful of cases still need real Fusion locally; for these, request a 30-day Lucidworks evaluation license per engineer (renewable as needed): - -- Initial Fusion adapter development — recording the first round of cassettes -- Adding new Fusion-specific search-space parameters (e.g., `stage_enabled` extensions) -- Reproducing Fusion-specific bugs that only manifest under specific cluster state -- Validating session-auth or JWT flows in a controlled environment +**Tests.** Unit tests are hermetic and don't need any engine running. Integration tests for ES + OpenSearch + Solr run against the Compose containers (CI provisions service containers in GitHub Actions). E2E tests run the full Karpathy loop against the live Compose stack with a test config repo. -In each case, the org's existing Fusion dev cluster is usually a viable substitute for an eval license, depending on access policy. +There is no third-party vendor license, no shared dev cluster, no search-engine replay-cassette infrastructure to maintain. 
The forward-only documentation stance applies here too — earlier drafts of this spec included a four-tier model with Fusion eval licenses and replay cassettes; that complexity went away with the Fusion drop. ## 26. Failure modes & edge cases | Failure | Detection | Handling | |---|---|---| -| ES/Fusion cluster down | Adapter `health_check()` fails or batch search returns 5xx | Trial marked failed; if 5+ consecutive trial failures, study auto-cancels | -| Fusion session expired mid-study | 401 from Fusion gateway | Adapter re-authenticates transparently and retries the trial; counts as one ordinary retry, not a failure | -| Fusion pipeline edited out of band during study | Same template, different upstream pipeline shape | Detected by hashing pipeline JSON at study start vs. trial time; mismatch fails the trial with `error_code = pipeline_drift` | +| Search cluster down (ES / OpenSearch / Solr) | Adapter `health_check()` fails or batch search returns 5xx | Trial marked failed; if 5+ consecutive trial failures, study auto-cancels | +| Auth token expired mid-study (ES API key, OpenSearch SigV4, Solr JWT) | 401 from cluster | Adapter re-authenticates transparently and retries the trial; counts as one ordinary retry, not a failure | | Worker crashes mid-trial | Arq job failure; Optuna ask-without-tell | Trial lost; Optuna will re-suggest similar params; idempotent | | Optuna RDB lock contention | Slow `study.ask()` calls | Backoff; if persistent, reduce study parallelism | | OpenAI API rate-limit | Tool call fails | Exponential backoff; surface to user if all retries fail | @@ -2285,18 +2210,29 @@ In each case, the org's existing Fusion dev cluster is usually a viable substitu ## 27. Phased delivery -Delivery is incremental: six releases (MVP1 → MVP1.5 → MVP2 → MVP3 → MVP4 → GA v1), each meaningful as a discrete capability bundle. Each release ships a coherent step-up in adopter value and audience reach, never a partial build. 
Total wall-clock estimate: **~19 weeks single-engineer**, or roughly **12–14 weeks with two engineers** working in parallel after MVP1. +Delivery is incremental across three pre-GA releases plus a polish-and-governance GA. The 2026-05-27 reframe (see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md)) compressed the prior six-release plan: Fusion was dropped outright; Solr was promoted to MVP2 and bundled with UBI judgments; observability moved to MVP3; multi-Git, multi-tenancy, and multi-LLM moved to the backlog. The result is a tighter narrative that lands all six of RelyLoop's differentiators by MVP3 and reserves GA for polish + governance + hardening. -| Release | Theme | Timeline | Audience | +| Release | Theme | Adds | Audience | |---|---|---|---| -| MVP1 / v0.1 | The Loop | 5 weeks | Technical evaluators willing to test on a laptop | -| MVP1.5 / v0.1.5 | Real Signals | +2 weeks | Operators running OpenSearch UBI; teams that want trust anchored in real user behavior, not LLM ratings | -| MVP2 / v0.2 | Observable | +3 weeks | Platform teams considering serious evaluation | -| MVP3 / v0.3 | Production Stacks | +3 weeks | Lucidworks shops, GitLab/Bitbucket enterprises | -| MVP4 / v0.4 | Multi-tenant, Multi-LLM | +3 weeks | Platform teams operating for many customers | -| GA v1 / v1.0 | Production-ready | +3 weeks | Production deployments, contributors, the community | +| MVP1 / v0.1 (shipped) | The Loop | ES + OpenSearch + LLM-as-judge + Optuna/TPE + Git PR + conversational agent | Technical evaluators willing to test on a laptop | +| MVP2 / v0.2 | Three-Engine + Real Signals | Apache Solr adapter + UBI judgments + hybrid UBI+LLM converter (bundled) | Operators on any of the three OSS engines who want the engine-neutral claim verifiable + trust anchored in real user behavior | +| MVP3 / v0.3 | Observable | Langfuse + SigNoz + audit-log immutability + lineage columns + PII redaction + trace propagation | 
Platform teams considering unattended overnight runs in earnest | +| GA v1 / v1.0 | Production-ready | LangGraph orchestrator + 90% coverage + full CI/CD with security gates + docs + design-partner references + signed container images + complete OSS governance | Production adopters, contributors, the broader community | -### MVP1 / v0.1 — "The Loop" (target: 5 weeks, 1 engineer — or ~3 weeks with two) +**All six differentiators are GA by MVP3.** GA v1 adds no new product surface — the lift is from "working" to "production-ready, contributor-ready, fully governed." + +| Differentiator | Lands in | +|---|---| +| Bayesian/TPE optimization over the full query-time search space | MVP1 (shipped) | +| Git-PR apply path with named-approver merge gate | MVP1 (shipped) | +| Conversational agent that runs the loop (not just suggests DSL edits) | MVP1 (shipped) | +| All three OSS engines (ES + OpenSearch + Apache Solr) | MVP2 | +| Hybrid UBI+LLM judgments + position-bias correction | MVP2 | +| Local-first LLM observability (self-hosted Langfuse + SigNoz) | MVP3 | + +**Backlog** (captured but not in flight): multi-Git provider abstraction (GitLab, Bitbucket); multi-tenancy primitives + multi-LLM provider abstraction; Path B (production-quality monitoring, bandit-style online learning, shadow validation, manual one-click rollback); LTR training; Lucidworks Fusion adapter (explicitly dropped, would require a community contribution to revive). 
+ +### MVP1 / v0.1 — "The Loop" (shipped) **Headline: The Karpathy loop, working.** @@ -2336,115 +2272,82 @@ What MVP1 delivers: a relevance engineer can `docker compose up`, point at a loc --- -### MVP1.5 / v0.1.5 — "Real Signals" (target: +2 weeks) +### MVP2 / v0.2 — "Three-Engine + Real Signals" (target: ~4–5 engineer-weeks combined, ~3–4 with two engineers) + +**Headline: Engine-neutral becomes verifiable, and judgments come from real users.** -**Headline: The loop, grounded in what users actually do.** +MVP2 bundles two capabilities into one release: the Apache Solr adapter and UBI-derived judgments with a hybrid UBI+LLM converter. They ship together because they tell one coherent story — RelyLoop runs on all three OSS engines (Elasticsearch, OpenSearch, Apache Solr) with UBI on every one of them — and because Solr's `solr.UBIComponent` writes the same standardized UBI schema as the OpenSearch UBI plugin, so the UBI reader works on Solr unchanged the moment the adapter lands. -MVP1 ships with LLM-as-judge as the only authoritative judgment source. That's enough to demonstrate the optimization loop, but for operators with production traffic it's a weaker trust anchor than real user behavior. MVP1.5 closes that gap by making **OpenSearch UBI** (User Behavior Insights — a standardized, engine-neutral event-capture schema championed by Eric Pugh / OpenSource Connections, shipped as the OpenSearch UBI plugin in 2024) a first-class judgment source alongside LLM-as-judge. +**MVP2 adds on top of MVP1:** -**MVP1.5 adds on top of MVP1:** +*Apache Solr adapter* (see [`infra_adapter_solr/idea.md`](../../02_product/planned_features/infra_adapter_solr/idea.md)): -- **`UbiReader`** (engine-agnostic) reads the standardized `ubi_queries` + `ubi_events` indices via any `SearchAdapter`'s `search_batch` — no engine-specific code, no new Compose service. 
Aggregates raw events over an operator-specified window into per-(query, doc) interaction features: click count, impression count, position-bias-corrected CTR, post-click dwell-time mean, conversion rate (where conversions are emitted), refinement rate. +- Full `SearchAdapter` Protocol implementation: `search_batch` via parallel `/select`; `render` for `edismax` (primary), `dismax`, `lucene`; `get_schema` via Solr Schema API; `list_targets` via CoresAdmin (standalone) / CollectionsAdmin (SolrCloud); `explain` via `debugQuery=true`. +- Engine support: Solr 9.x + Solr 10.x; SolrCloud + standalone. +- Authentication: `solr_basic` (HTTP Basic) and `solr_apikey` (Solr 9+ JWT via `JWTAuthPlugin`). +- LTR rescoring: applies pre-existing `MultipleAdditiveTreesModel` (XGBoost-compatible) via Solr's `/schema/model-store` as a rescore stage in a trial. Training is out of scope. +- Compose service `solr` (Apache 2.0 image, `solr:10`) bound to `127.0.0.1:8983`, mirroring the existing `elasticsearch` and `opensearch` service shape. +- Sample collection `products` seeded from the same `samples/products.json` MVP1 uses for ES. +- Capability probe at adapter construction: detects Solr version, SolrCloud-vs-standalone, presence of `solr.UBIComponent`, presence of `ltr` module — written to `clusters.engine_config` JSONB. +- One migration extending `clusters.auth_kind` and `engine_type` CHECK constraints to accept the Solr values. No new tables. + +*UBI judgments* (see [`feat_ubi_judgments/idea.md`](../../02_product/planned_features/feat_ubi_judgments/idea.md)): + +- **`UbiReader`** (engine-agnostic) reads the standardized `ubi_queries` + `ubi_events` indices via any `SearchAdapter`'s `search_batch`. Works on Elasticsearch, OpenSearch (via the OpenSearch UBI plugin), and Solr (via `solr.UBIComponent`) without engine-specific code. 
+- Aggregates raw events over an operator-specified window into per-(query, doc) interaction features: click count, impression count, position-bias-corrected CTR (Wang-Bendersky correction), post-click dwell-time mean, conversion rate (where conversions are emitted), refinement rate. - **Pluggable `SignalsConverter` Protocol** mapping features → 0–3 ratings. Initial implementations: - **Position-bias-corrected CTR threshold** (default, conservative) - **Dwell-time threshold** (good for content discovery / long-read use cases) - - **Hybrid UBI+LLM** — UBI rates the dense head; LLM-as-judge fills the long tail for queries below an impression threshold. The mixed-`source` judgment list is the operating mode most adopters will ship to production. -- **No schema migration.** The `judgments.source` CHECK constraint accepts `click` today; a single judgment list can mix `llm` + `human` + `click` rows. The MVP1 schema was designed for this. + - **Hybrid UBI+LLM** — UBI rates the dense head; LLM-as-judge fills the long tail for queries below an impression threshold. The mixed-`source` judgment list is the operating mode most adopters will ship to production. This is the differentiated converter (SRW's UBI path uses COEC alone; no hybrid). +- **No schema migration for UBI.** The `judgments.source` CHECK constraint accepts `click` today; a single judgment list can mix `llm` + `human` + `click` rows. The MVP1 schema was designed for this. - **`POST /api/v1/judgment-lists/generate-from-ubi`** endpoint + **`generate_judgments_from_ubi`** agent tool. Same code path on both surfaces (agent-first symmetry per §21). -- **Calibration spot-check workflow** — same Cohen's kappa / agreement-stat surface as MVP1's LLM calibration, run between UBI-derived ratings and a 30–50 row hand-labeled sample. Catches mis-tuned converters (e.g., dwell-time threshold set too low for the traffic shape). 
-- **Operator docs** — runbook for installing the OpenSearch UBI plugin, configuring event capture in the application, choosing the right converter for the use case, and a tutorial extension to the MVP1 tutorial that swaps the LLM judgment list for a UBI-derived one once enough events have been captured. -- **Documented Phase 2 extensions** (NOT shipped at MVP1.5): counterfactual click models (CCM, DBN); engine-native behavioral-data readers for clusters that haven't adopted UBI — Elastic Behavioral Analytics and others — all feeding the same `SignalsConverter` Protocol unchanged. +- **Calibration spot-check workflow** — same Cohen's kappa / agreement-stat surface as MVP1's LLM calibration, run between UBI-derived ratings and a 30–50 row hand-labeled sample. +- **Documented post-MVP2 extensions** (not shipped here): counterfactual click models (CCM, DBN) as additional `SignalsConverter` implementations once operators accumulate enough impressions per (query, doc) to make them statistically valid; engine-native behavioral-data readers for clusters that haven't adopted UBI (e.g., Elastic Behavioral Analytics) — all feeding the same `SignalsConverter` Protocol unchanged. + +*Documentation*: -**MVP1.5 does NOT include:** +- **New runbook:** `docs/03_runbooks/solr-cluster-registration.md` — register a Solr cluster, configure `edismax`, enable `solr.UBIComponent`, upload an LTR model. +- **New runbook:** `docs/03_runbooks/ubi-judgment-generation.md` — install OpenSearch UBI plugin, configure event capture, choose the right converter, calibrate thresholds. +- **Tutorial extensions:** `docs/08_guides/tutorial-first-study.md` gains a Step 0 Path C ("Run the tutorial against Solr instead of ES") and a Step 7 ("Swap the LLM judgment list for a UBI-derived one"). -- A second Compose service. `UbiReader` runs inside the existing API + worker containers. -- Real-time signal streaming. 
UBI ratings are computed batch-wise at judgment-list creation time, not on the live serving path — this is still strictly offline Path A (per §27 "Why the deferral is right today"). -- Production quality monitoring or alerting (Path B, v2). -- A schema migration. UBI rides the existing `judgments` table. +**MVP2 does NOT include:** -**Audience expansion:** Operators with production search traffic and OpenSearch UBI logging enabled. These adopters disproportionately distrust LLM-as-judge ratings as a primary trust anchor; MVP1.5 is the release that earns their evaluation. Also: open-source signals that UBI is a first-class direction for RelyLoop, not deferred to a post-GA milestone — relevant for the OSC community where UBI was incubated. +- A second observability stack. Langfuse + SigNoz land at MVP3. +- Multi-Git provider abstraction (GitLab, Bitbucket) — in the backlog. +- Multi-tenancy primitives — in the backlog. +- Multi-LLM provider abstraction (Anthropic, Bedrock, Vertex, etc.) — in the backlog. OpenAI-compatible endpoints (Ollama, vLLM, LM Studio, TGI) continue to work via `OPENAI_BASE_URL` redirection, exactly as in MVP1. +- LTR training (cross-engine model training is a v2 candidate; MVP2's Solr LTR support is consume-only). +- Real-time signal streaming. UBI ratings are computed batch-wise at judgment-list creation time, not on the live serving path — strictly offline Path A. +- Fusion. Explicitly dropped; see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md). -**Strategic rationale:** The optimization loop's quality is bounded by the quality of the judgments it scores against. LLM-as-judge unblocks the MVP1 demonstration, but it caps the believability of every winning trial behind "did the LLM actually get the relevance call right?" UBI removes that ceiling for operators with real traffic. 
Shipping it as the very next release (rather than waiting for MVP2's observability layer or MVP3's Fusion work) keeps the focus on the core value proposition: trustworthy automated relevance tuning. +**Audience expansion:** Apache Solr operators (the OSC + Sease + Querqy + Quepid/Chorus community, predominantly Solr-native); operators with production search traffic and UBI logging enabled on any of the three engines; operators who distrust LLM-as-judge as the only trust anchor. + +**Strategic rationale:** MVP2 is the release that makes the engine-neutral claim verifiable rather than rhetorical, and the trust-anchor problem (LLM-as-judge alone is bounded by "did the LLM get the relevance call right?") solved for operators with traffic. Bundling Solr + UBI is cheaper than splitting them across two releases because UBI on Solr is free once the adapter ships, and the combined narrative is sharper than either capability alone. --- -### MVP2 / v0.2 — "Observable" (target: +3 weeks) +### MVP3 / v0.3 — "Observable" (target: +3 weeks) **Headline: The loop you can audit.** -Without trustworthy observability, no platform team will run RelyLoop unattended overnight. v0.2 adds the full observability layer so adopters can see what the tool is doing, why, and what it produced — and have an immutable audit trail for governance. +Without trustworthy observability, no platform team will run RelyLoop unattended overnight. MVP3 adds the full observability layer so adopters can see what the tool is doing, why, and what it produced — and have an immutable audit trail for governance. Because all three engines and both judgment sources are in place by the start of MVP3, this work instruments the full system in one pass rather than retrofitting per engine. -**MVP2 adds on top of MVP1:** +**MVP3 adds on top of MVP2:** -- **Langfuse self-hosted** — every LLM call captured: prompts, responses, costs, token counts, latency. LangChain callback handler integrated. 
Eval datasets seeded for `propose_search_space`, `generate_judgments_llm`, `digest_narrative`. -- **SigNoz self-hosted** — distributed traces, metrics, logs via OpenTelemetry. Auto-instrumentation for FastAPI, Postgres, Redis, OpenAI client. +- **Langfuse self-hosted** — every LLM call captured: prompts, responses, costs, token counts, latency. LangChain callback handler integrated. Eval datasets seeded for `propose_search_space`, `generate_judgments_llm`, `generate_judgments_from_ubi`, `digest_narrative`. +- **SigNoz self-hosted** — distributed traces, metrics, logs via OpenTelemetry. Auto-instrumentation for FastAPI, Postgres, Redis, the OpenAI-compatible client, and all three engine adapters. - **Structured event catalog** — `src/events.py` with ~50 named events across 13 domains (auth, study, proposal, agent, etc.), backed by Pydantic schemas. CI gate rejects unregistered event names. -- **Audit log immutability** — Postgres trigger blocking UPDATE/DELETE on `audit_log`. INSERT-only role for the API. -- **Lineage columns** — `langfuse_trace_id`, `prompt_version`, `input_hash` on judgments, digests, proposals. Full provenance for every LLM-produced artifact. +- **Audit log immutability** — `audit_log` table + Postgres trigger blocking UPDATE/DELETE. INSERT-only role for the API. Every state-mutating endpoint or service function calls `create_audit_event()` in the same transaction as the primary mutation. +- **Lineage columns** — `langfuse_trace_id`, `prompt_version`, `input_hash` on judgments (`source='llm'` rows; `source='click'` rows leave it NULL), digests, proposals. Full provenance for every LLM-produced artifact. - **Trace context propagation** — W3C `traceparent` flows through API → Redis → worker → adapter → engine, including the Arq enqueue→pickup boundary that needs custom serialization. -- **Cross-system correlation** — Langfuse traces annotated with SigNoz span IDs and vice versa. Two-clicks navigation between the two observability stacks. 
+- **Cross-system correlation** — Langfuse traces annotated with SigNoz span IDs and vice versa. Two-click navigation between the two observability stacks. - **PII redaction processor** — centralized `structlog` processor scrubbing tokens, keys, credentials; configurable email and query-text redaction. CI runs a regex sweep of test logs to flag accidental leakage. - **Slow-operation flagging** — spans exceeding 5× their p99 SLO emit `system.slow_operation` regardless of trace sampling rate. - **Unified retention policy** — documented across audit log, application logs, traces, LLM traces, eval results, backups. -**Audience expansion:** Platform teams considering serious evaluation. Without observability, the tool is a curiosity; with it, the tool can be assessed for production-style operation. - -**Strategic rationale:** Observability is a foundational reliability layer that benefits every adopter regardless of engine, LLM provider, or scale. Adding it before broadening engine support means all subsequent MVPs ship with full traceability from day one — no retrofit needed. - ---- - -### MVP3 / v0.3 — "Production Stacks" (target: +3 weeks) - -**Headline: Works against your real production stack — Fusion, GitLab, Bitbucket.** - -v0.3 broadens the supported production stack by adding the Lucidworks Fusion adapter and the multi-Git-provider abstraction. After v0.3, RelyLoop can be evaluated against the search engine and Git provider you already run, not just ES + GitHub. 
- -**MVP3 adds on top of MVP2:** - -- **Lucidworks Fusion adapter** — full implementation: - - `search_batch` via Fusion's gateway query API (`POST /api/apps/{app}/query/{collection}`) - - `render` produces Fusion request bodies with parameter overrides; pipeline JSON is the canonical Git artifact - - Auth via session cookies or JWT - - Fusion-specific tools: `list_pipelines`, `get_pipeline`, `list_query_profiles` - - Two-step apply path (PR edits pipeline params; CI runs `objects-import` to deploy) - - `auth_kind = "fusion_session"` and `"fusion_jwt"` paths -- **Engine-native signals reader for Fusion** — aggregates events from the `{app}_signals` collection into the same per-(query, doc) feature shape MVP1.5's `UbiReader` produces. Reuses the MVP1.5 `SignalsConverter` Protocol unchanged; only the read path is Fusion-specific. Relevant for Fusion deployments that haven't adopted UBI. -- **Multi-Git-provider abstraction** — `GitProvider` Protocol with three implementations: - - GitHub (already present from MVP1) - - GitLab — token or app auth, project-level webhooks, MR + approval rules - - Bitbucket — workspace tokens, webhook UUID, default reviewers + branch restrictions - - Per-provider webhook endpoints (`/webhooks/github`, `/webhooks/gitlab`, `/webhooks/bitbucket`) -- **Adapter contract tests** — every `SearchAdapter` and `GitProvider` implementation runs the same conformance suite. Future community-contributed adapters pass the same suite to be merged. -- **Cassette-based testing infrastructure** — `pytest-recording` for Fusion adapter unit tests; deterministic replay without requiring a live Fusion instance. -- **Fusion-specific docs** — config-repo conventions for Fusion (pipelines + params + profiles directory layout), pipeline-validate CI integration, two-step apply path runbook. - -**Audience expansion:** Lucidworks shops (a substantial enterprise-search audience), GitLab-using enterprises, Bitbucket-using enterprises. 
Roughly doubles the addressable adopter pool. - -**Strategic rationale:** Engine and Git providers are the two interfaces that gate enterprise adoption. v0.3 removes both as blockers for the most common production deployments — and the adapter contract tests it introduces become the foundation for community-contributed adapters going forward. - ---- - -### MVP4 / v0.4 — "Multi-tenant, Multi-LLM" (target: +3 weeks) - -**Headline: Run RelyLoop for many customers, with the LLM provider you need.** - -v0.4 enables platform-team-scale adoption: a single deployment serving many downstream customers in isolation, optionally with different LLM provider choices per tenant. - -**MVP4 adds on top of MVP3:** - -- **Multi-tenancy primitives** — `tenants` table, `tenant_id` columns across all user-facing tables (clusters, query_sets, judgment_lists, query_templates, studies, proposals, conversations, audit_log, config_repos), `tenant_memberships` junction table with per-tenant roles (`viewer`, `runner`, `tenant_admin`), `platform_admin` super-role for cross-tenant operations. -- **Tenant scoping on all operations** — list endpoints filter by tenant, write endpoints enforce tenant context, audit log rolls up per tenant. -- **Per-tenant configuration overrides** — `tenants.settings` JSONB allows different LLM providers, cost caps, default samplers per tenant. -- **Bearer-token API keys** — `api_keys` table with Argon2id-hashed keys, role + scopes (e.g., `studies:write`, `proposals:write`), expiration, revocation. Tenant-scoped by default. Service accounts get long-lived keys; admins issue and rotate. -- **Multi-LLM provider abstraction** — pluggable `ChatModel` adapter with implementations for OpenAI (already from MVP1), Anthropic, AWS Bedrock, Azure OpenAI, Google Vertex AI, and self-hosted (Ollama, vLLM). Provider selection per-tenant via config; capability validation at startup (refuses to start if the chosen provider lacks structured-output support). 
-- **Cost tracking** — Langfuse-derived per-tenant LLM cost rollups exposed in the UI. -- **Tenant switcher in UI** — for users who belong to multiple tenants. +**Audience expansion:** Platform teams considering serious evaluation of unattended overnight runs. Without observability, the tool is a curiosity; with it, the tool can be assessed for production-style operation. -**Migration:** Single-tenant MVP1-MVP3 deployments are migrated into the new schema with an auto-created `default` tenant and all existing rows backfilled with that tenant_id. The migration is documented and CI-tested. - -**Audience expansion:** Platform teams running search for many internal/external customers (the target audience that motivated the project from the start); orgs with strict LLM provider policies (Bedrock-only AWS shops, Vertex-only GCP shops, air-gapped deployments on Ollama/vLLM). - -**Strategic rationale:** Multi-tenancy is the boundary between "internal team tool" and "platform-team product." Multi-LLM is the boundary between "OpenAI-only" and "fits any enterprise's LLM strategy." Both are needed for the platform-team use case that motivated the project from the start. +**Strategic rationale:** Observability is a foundational reliability layer. Landing it after the engine sweep (MVP2) means three engines × two judgment sources are instrumented in one release of work, rather than per-engine retrofit. After MVP3, every product capability is in place; GA v1 is pure polish and governance. 
--- @@ -2452,17 +2355,17 @@ v0.4 enables platform-team-scale adoption: a single deployment serving many down **Headline: The 1.0 release — production-ready, contributor-ready, fully governed.** -GA v1 layers in the polish that elevates RelyLoop from a working tool to a proper open-source product: orchestrator architecture migration to LangGraph, the full agent-first API surface, the four-layer test pyramid, complete CI/CD with security gates, and the OSS launch infrastructure (governance, docs, ADRs, distribution). +GA v1 layers in the polish that elevates RelyLoop from a working tool to a proper open-source product: orchestrator architecture migration to LangGraph, the full agent-first API surface, the four-layer test pyramid at 90% coverage, complete CI/CD with security gates, and the OSS launch infrastructure (governance, docs, ADRs, distribution). **No new product surface beyond MVP3** — all six differentiators are already live. -**GA v1 adds on top of MVP4:** +**GA v1 adds on top of MVP3:** - **LangGraph orchestrator** — replaces MVP1's plain OpenAI function calling with a state graph (orchestrator + hypothesis-gen subagent + evaluation subagent). Postgres-backed state persistence via `PostgresSaver`; resumable conversations; human-in-the-loop interrupts at three points (PR open, prod-cluster studies, judgment regeneration). - **Full agent-first API surface** — `/openapi.json`, `/capabilities`, `/tools.json`, idempotency keys with conflict semantics, RFC 7807 error format with `error_code` + `retryable` extensions, cursor pagination, rate-limit headers, outgoing webhook subscriptions with HMAC signing, SSE streams on `/studies/{id}/events` and `/proposals/{id}/events`. 
- **Full four-layer test pyramid:** - - Unit tests: **90% line coverage** (up from 80% in MVP1-4), 85% branch coverage - - Contract tests: every adapter, every `@tool`, every endpoint covered (extends the contract-test foundation laid in MVP3) + - Unit tests: **90% line coverage** (up from 80% in MVP1–3), 85% branch coverage + - Contract tests: every adapter (ES/OpenSearch + Solr), every `@tool`, every endpoint covered - Integration tests: Compose-based with cassetted external HTTP, < 5 min runtime - - E2E tests: live OpenAI + shared Fusion dev cluster + test config repo, < 20 min runtime, $5/run budget cap + - E2E tests: live OpenAI-compatible endpoint + Compose ES/OpenSearch/Solr stack + test config repo, < 20 min runtime, $5/run budget cap - **Full GitHub Actions CI/CD** — five workflows (`pr.yml`, `main.yml`, `release.yml`, `nightly.yml`, `cassette-refresh.yml`) with security scans (Trivy, bandit, pip-audit, npm audit), branch protection on `main`, auto-deploy to staging on merge, manual gate to prod on tag. - **Complete code quality gates** — ruff, mypy strict, eslint, prettier, tsc strict, pre-commit hooks, secret-leak detection. - **Backup & DR baseline** — daily Postgres dumps with 30-day retention, runbook for restore, quarterly DR exercise. @@ -2475,69 +2378,53 @@ GA v1 layers in the polish that elevates RelyLoop from a working tool to a prope - API reference auto-generated from OpenAPI and rendered with Stoplight or Redoc - **ZDR (Zero Data Retention) enforcement** — deployment refuses to start if ZDR is required by config but the LLM key isn't enrolled. - **Telemetry stance** — explicit zero-telemetry commitment with CI grep gate against telemetry-pattern strings. -- **Public-launch readiness** — design partners onboarded and live, brand naming and trademark verifications complete (see §28 and §29 #23), at least one public reference customer with permission. 
+- **Public-launch readiness** — design partners onboarded and live (target: one each on ES, OpenSearch, Solr), brand naming and trademark verifications complete (see §28 and §29), at least one public reference customer with permission. +- **Public benchmark** — head-to-head comparison of RelyLoop's Optuna/TPE loop vs OpenSearch SRW's 66-cell grid search on the same hybrid-weight problem, run on the same OpenSearch cluster, published with code and reproduction steps. The single most credible proof artifact for the Bayesian-optimization differentiator. **Audience expansion:** Production deployment by enterprise platform teams; foundation for community contributors; long-term sustainability of the project. -**Strategic rationale:** GA v1 is the moment RelyLoop becomes a real open-source product, not just a working tool. It's contributor-ready (governance), production-ready (testing, security, observability already in place since MVP2), and adoption-ready (docs, distribution, design partners). +**Strategic rationale:** GA v1 is the moment RelyLoop becomes a real open-source product, not just a working tool. It's contributor-ready (governance), production-ready (testing, security, observability all in place since MVP3), and adoption-ready (docs, distribution, design partners, public benchmark). -### v1.5+ (post-GA, target: +4 weeks) +--- -Post-GA polish items. UBI (MVP1.5) and engine-native behavioral-data readers (MVP3 / v2) used to live here; they were promoted to the release timeline when MVP1.5 was introduced as a formal tier. 
+### Backlog (captured, not in flight) -- Multiple config repos -- Outgoing webhooks for resource lifecycle events (study, digest, proposal, PR state) — replaces polling for both internal and external agents -- SSE streams on `/studies/{id}/events` and `/proposals/{id}/events` -- Prod-validation flow (run winning config read-only against the prod cluster before opening the staging PR) -- Calibration UI for judgment lists -- Audit log UI -- Performance hardening (worker pool tuning, RDB indexes) -- Cost dashboard and per-user OpenAI quotas -- W3C Trace Context (`traceparent`) propagation through to ES/Fusion -- Counterfactual click models (CCM, DBN) as additional `SignalsConverter` implementations on top of the MVP1.5 Protocol — relevant once enough impressions per (query, doc) have accumulated to make them statistically valid +Items previously in the release timeline that the 2026-05-27 reframe moved out of the pre-GA path. Captured here so they're not lost; promoted to a release if and when a design-partner conversation or a specific adopter request makes them load-bearing. -### v2 (TBD) +**Multi-Git provider abstraction.** `GitProvider` Protocol with GitLab + Bitbucket implementations alongside the existing GitHub provider. Was bundled with the dropped MVP3 Fusion work in the prior plan. Promoted out when an adopter on a non-GitHub provider commits to evaluating. Until then, all adopters use GitHub (which the global enterprise-search community overwhelmingly does). -#### Path A continuations — refinements to the experimentation-and-change-management tool +**Multi-tenancy primitives.** `tenants` + `tenant_memberships` + `users` + `api_keys` tables, `tenant_id` columns across all user-facing tables, per-tenant configuration overrides, tenant switcher UI. Was MVP4 in the prior plan. The platform-team-running-search-for-many-customers use case is real but underserved by the pre-GA path — single-tenant + SSO via reverse proxy is sufficient through GA v1. 
Promoted out when a multi-customer platform team commits to evaluating. -- Conditional parameters in search space -- Multi-objective optimization (nDCG vs latency Pareto) -- Pure-Solr adapter (when needed by a non-Fusion deployment) -- Elastic Behavioral Analytics integration (real click data → judgments) for ES clusters -- LTR plugin support (train + deploy XGBoost rerankers); Fusion ML reranker training integration -- Vespa adapter -- Cross-cluster fan-out studies +**Multi-LLM provider abstraction.** Native (non-OpenAI-compatible) provider SDKs for Anthropic, AWS Bedrock, Azure OpenAI, Google Vertex AI. OpenAI-compatible endpoints (Ollama, vLLM, LM Studio, TGI) work today via `OPENAI_BASE_URL` redirection — covering the air-gapped + local-LLM use cases without provider-specific code. Was MVP4 in the prior plan. Promoted out when an adopter with strict Bedrock-only or Vertex-only policy commits to evaluating. -#### Path B — Search Quality Platform expansion +**LTR training.** Cross-engine model training (XGBoost rerankers for ES + OpenSearch via the LTR plugin / native LTR; same XGBoost path for Solr via `MultipleAdditiveTreesModel`). MVP2's Solr LTR support is consume-only. Promoted to release status when adopter feedback prioritizes it over Path B. -A coherent v2 direction is to expand from "experimentation and change management" into "experimentation and change management *plus* real-time production observability and online learning." This shifts the tool from Quepid-territory toward commercial-platform-territory (Coveo, Algolia, Bloomreach). It's deliberately deferred from v1 because: +**Lucidworks Fusion adapter.** Dropped outright; see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md). The `SearchAdapter` Protocol shape means a community-contributed Fusion adapter remains possible, but the project does not own that direction. 
-- v1 is already substantial scope; piling Path B on top jeopardizes shipping -- Path A is independently valuable; Path B builds on Path A but isn't a prerequisite for it -- Path B requires stream-processing infrastructure (Kafka or Redis Streams + ClickHouse rolling-window aggregation), which is a meaningful architectural addition -- Path B changes the audience — Path A serves search engineers; Path B also serves search ops / SREs. Different mental model in the UI. +**Path B — Search Quality Platform expansion.** A coherent v2 direction is to expand from "experimentation and change management" into "experimentation and change management *plus* real-time production observability and online learning." This shifts the tool from Quepid-territory toward commercial-platform-territory (Coveo, Algolia, Bloomreach). Captured as a backlog/v2 set rather than a GA path because it requires stream-processing infrastructure (Kafka or Redis Streams + ClickHouse rolling-window aggregation) and changes the audience (Path A serves search engineers; Path B also serves search ops / SREs). -**Path B candidates, ordered by likely priority:** +Path B candidates, ordered by likely priority: -- **Production quality monitoring.** Stream signals (Fusion `*_signals` collection or ES Behavioral Analytics) into rolling-window quality metrics — CTR, dwell time, refinement rate, zero-result rate, position-1 abandonment. Alert when metrics degrade beyond thresholds. Optionally trigger an LLM agent investigation that pulls recent failing queries and surfaces hypotheses. Most universally valuable Path B capability; turns the tool into a daily-driver for search platform teams, not just a tuning workbench. -- **Bandit-style online learning.** Multi-armed bandits (Thompson sampling, contextual bandits via Vowpal Wabbit or similar) routing live production traffic across promising candidate configs and progressively shifting toward winners. 
The offline Karpathy-loop studies feed the bandit candidates; the bandit produces real-time learning. This is the most ambitious Path B addition because it requires the tool to participate in (or coordinate with) the production search-serving path, not just sit alongside it. Architecturally, two viable shapes: +- **Production quality monitoring.** Stream signals into rolling-window quality metrics — CTR, dwell time, refinement rate, zero-result rate, position-1 abandonment. Alert when metrics degrade beyond thresholds. Optionally trigger an LLM agent investigation that pulls recent failing queries and surfaces hypotheses. Most universally valuable Path B capability; turns the tool into a daily-driver for search platform teams, not just a tuning workbench. +- **Bandit-style online learning.** Multi-armed bandits (Thompson sampling, contextual bandits via Vowpal Wabbit or similar) routing live production traffic across promising candidate configs and progressively shifting toward winners. The offline Karpathy-loop studies feed the bandit candidates; the bandit produces real-time learning. Two viable architectural shapes: - **External coordinator.** The tool maintains the bandit state and exposes a `/api/v1/bandit/select?cluster=X` endpoint the search service calls per query to choose which config to serve. Adds latency to the hot path; clean integration boundary. - - **In-engine.** The bandit logic lives in the search engine itself (a Solr request handler or a Fusion stage), driven by a config the tool publishes. No hot-path latency; harder to debug. - - The decision affects v2 scoping significantly. External-coordinator is the more natural OSS extension; in-engine implementations would likely be community-contributed adapters per engine. 
-- **Shadow validation pre-deploy.** When a PR is merged but before CI promotes it to live serving, run the new config against a sampled live-query stream (read-only, results discarded) for 30–60 minutes, compare metrics against the current production config, and either auto-approve the deploy or flag for human review. Stronger confidence than offline judgment-list eval, lower risk than direct deploy. Builds on production monitoring infra. -- **Fusion Experiments integration.** Online A/B testing of winning configs against current production via Fusion's native experiments feature; results flow back to the tool's experiment table. + - **In-engine.** The bandit logic lives in the search engine itself (a Solr request handler driven by a config the tool publishes). No hot-path latency; harder to debug. Likely community-contributed per engine. +- **Shadow validation pre-deploy.** When a PR is merged but before CI promotes it to live serving, run the new config against a sampled live-query stream (read-only, results discarded) for 30–60 minutes, compare metrics against the current production config, and either auto-approve the deploy or flag for human review. - **Manual one-click rollback.** Surfaced from the production monitoring UI when metrics degrade. Opens a revert PR against the config repo, triggering the same review-and-deploy path. Auto-rollback explicitly rejected (see §4 Non-goals). -#### Why the deferral is right today +**Path A continuations (post-GA polish).** -The honest reasoning, in case the priority changes later: - -1. Shipping Path A as a focused, high-quality OSS release is more valuable than shipping a partial Path B that doesn't fully cover either side. -2. Path A has demonstrable value standalone — Quepid users get a meaningful upgrade, search platform teams get measurable relevance improvements, and the experimentation-and-change-management problem is real and underserved on its own. -3. Path B is a different *kind* of problem. 
It pulls in stream processing, real-time alerting, on-call operational thinking. Mixing both in v1 creates a product that's less coherent on each axis. -4. If the project succeeds in Path A, Path B becomes the natural roadmap. If Path A struggles (low adoption, slow community formation), Path B was never going to save it. - -The bandit capability specifically has been called out as the single most interesting v2 candidate by the project sponsor; it's deliberately set aside for v1 to keep focus, with the explicit option to revisit after Path A ships. +- Conditional parameters in search space +- Multi-objective optimization (nDCG vs latency Pareto) +- Counterfactual click models (CCM, DBN) as additional `SignalsConverter` implementations — relevant once enough impressions per (query, doc) have accumulated to make them statistically valid +- Elastic Behavioral Analytics-derived judgments for ES clusters that haven't adopted UBI (despite Elastic's BA deprecation in 9.0, residual deployments remain through ~2028) +- Vespa adapter +- Cross-cluster fan-out studies +- Multiple config repos per cluster +- Outgoing webhooks for resource lifecycle events (study, digest, proposal, PR state) +- Prod-validation flow (run winning config read-only against the prod cluster before opening the staging PR) +- Cost dashboard and per-user LLM cost quotas +- W3C Trace Context (`traceparent`) propagation extending through to the search engine ## 28. 
Tech stack & implementation decisions @@ -2597,7 +2484,7 @@ This section consolidates every implementation-level decision that shapes how Re | Database (app) | Postgres 16 | Primary application state + Optuna RDBStorage (single instance) | | Cache / queue | Redis 7 | Arq queue + LangChain cache (MVP4+) | | Trace storage (LLM) | ClickHouse 24 | Required by Langfuse (MVP2+) | -| Search engines (targets) | Elasticsearch 8.11+/9.x; OpenSearch 2.x/3.x; Lucidworks Fusion 5.x; Solr 9.x (v2+) | Per-engine version support documented in §8 | +| Search engines (targets) | Elasticsearch 8.11+/9.x; OpenSearch 2.x/3.x (MVP1); Apache Solr 9.x/10.x (MVP2) | Per-engine version support documented in §8 | | Reverse proxy | Caddy 2 | TLS termination, SSO via oauth2-proxy or Authelia | | Container runtime | Docker 24+ with Compose | MVP1 deployment target | | Helm chart (v1.5+) | Helm 3 | Kubernetes deployment for adopters that prefer it | @@ -2753,13 +2640,13 @@ The Microsoft Loop product (collaboration app, separate goods/services category) ### Audience -The primary intended adopter is an internal search platform team at a medium-to-large enterprise that runs search engines (Elasticsearch, Lucidworks Fusion, Solr) for one or more downstream "customers" (other product teams, business units, or external clients). These teams typically share three pains: +The primary intended adopter is an internal search platform team at a medium-to-large enterprise that runs an open-source search engine (Elasticsearch, OpenSearch, or Apache Solr) for one or more downstream "customers" (other product teams, business units, or external clients). 
These teams typically share three pains: - Manual relevance tuning is slow and expert-bound; doesn't scale across many indexes/customers -- Quantifying relevance improvements for stakeholders is hard without a standing eval harness -- AI/LLM tooling for search is hyped but practical, deployable, customer-data-respecting answers are scarce +- Quantifying relevance improvements for stakeholders is hard without a standing eval harness — and the closest workbench tools (Quepid, RRE) require human judgments at a scale most teams don't have +- The OpenSearch-only auto-tuning surface (SRW's hybrid-weight grid search) doesn't cover the field-boost / function-score / fuzziness / `mm` parameter space where most relevance wins actually live -Secondary adopters: search-as-a-service vendors building on top of OSS engines, and sophisticated single-product teams with one important search. +Secondary adopters: search-as-a-service vendors building on top of OSS engines, sophisticated single-product teams with one important search, and the OSC + Sease + Querqy + Haystack community (predominantly Solr-native; the natural early-adopter pool for a Bayesian-loop upgrade to their existing manual workbenches). ### License @@ -2853,19 +2740,22 @@ Without design partners, OSS projects in this space often ship features that don ### Comparison with alternatives -The README's `comparison.md` covers the full set; representative summary: +The full citation-backed matrix lives at [`docs/07_research/comparison.md`](../../07_research/comparison.md). Representative summary at a glance: + +| Tool | OSS? | Engines | Bayesian/TPE optimizer over full search space? | Git-PR apply path? | Local LLM obs? | Apache 2.0? 
| +|---|---|---|---|---|---|---| +| RelyLoop | yes | ES + OpenSearch (MVP1); + Solr (MVP2) | yes (Optuna TPE, thousands of trials) | yes | yes (MVP3) | yes | +| **OpenSearch Search Relevance Workbench (3.6)** | yes | OpenSearch only | no — 66-cell grid search over hybrid weights only ([docs](https://docs.opensearch.org/latest/search-plugins/search-relevance/optimize-hybrid-search/)); Bayesian in RFC #934, no shipped code | no — explicitly out of scope by [RFC #17735](https://github.com/opensearch-project/OpenSearch/issues/17735) | n/a (no LLM obs surface) | yes | +| **OpenSearch Relevance Agent (3.6, experimental)** | yes | OpenSearch only | no — DSL recommender, doesn't run sweeps | no | n/a | yes | +| Quepid | yes | Solr + ES + OpenSearch | no — manual workbench | no | n/a | yes | +| RRE (Sease) | yes | Solr + ES | no — offline evaluator, no sweeps | no | n/a | yes | +| Chorus (Querqy / OSC) | yes | Solr (primary) + OpenSearch (partial) | no | no | n/a | yes | +| Elasticsearch (native) | Elastic License 2.0 + SSPL | ES only | no — `_rank_eval` is an API primitive; BA + Search Applications deprecated in 9.0 ([release notes](https://www.elastic.co/guide/en/elastic-stack/9.0/release-notes-elasticsearch-9.0.0.html)) | no | n/a | no | +| Coveo / Algolia / Bloomreach | no (SaaS) | vendor only | partial (proprietary) | no | n/a | n/a | -| Tool | OSS? | Multi-engine? | Karpathy loop? | Local LLM obs? | Apache 2.0? 
| -|---|---|---|---|---|---| -| RelyLoop | yes | ES + Fusion (+ Solr v2) | yes | yes | yes | -| Quepid | yes | yes | no | no LLM | yes | -| RRE | yes | yes | no | no LLM | Apache 2.0 | -| LangSmith | no | n/a | partial | hosted only | n/a | -| Phoenix (Arize) | yes | n/a | no | yes | Apache 2.0 | -| Lucidworks Springboard | no | Fusion only | partial | n/a | n/a | -| Coveo / Algolia / Bloomreach | no (SaaS) | vendor only | partial (proprietary) | n/a | n/a | +**The defensible bundle:** *Bayesian/TPE optimization across the full search space + Git-PR apply path + works on every major OSS engine (Elasticsearch, OpenSearch, Apache Solr) + conversational agent that runs the loop + hybrid UBI+LLM judgments + local-first observability + Apache 2.0.* -The defensible position: **Quepid + LLM-driven Karpathy loop + agent-first API + local-first observability + multi-engine + Git-as-source-of-truth**, all OSS under Apache 2.0. No other project covers this combination. +Each individual ingredient above has at least one OSS comparable. The combination does not. The closest competitor is OpenSearch SRW, which is OpenSearch-only by architecture, grid-search-only by current implementation, and has no apply path by explicit RFC choice — three constraints RelyLoop is built specifically to lift. ### Sustainability risks @@ -2886,21 +2776,14 @@ A few honest acknowledgements: 6. **Parameter ranges.** When the LLM proposes ranges, can it propose ranges that are out of bounds for the engine (e.g., negative boost)? Validator catches this, but worth defensive testing. 7. **Agent runtimes to test against in v1.5.** The API is framework-agnostic, but we should pick 2–3 reference agent runtimes (LangGraph? OpenAI Assistants? Bedrock Agents? Claude Agent SDK with HTTP tools? a hand-rolled agent?) to validate the workflow ergonomics on. Choice influences the worked example in `x-agent-workflows`. 8. 
**Service-account naming and rotation policy.** Are agent service accounts shared across multiple agent codebases or always one-per-agent? What's the rotation cadence and the rotation runbook? Affects API-key UX in v1.5. -9. **Fusion pipeline forking strategy.** When a study recommends parameter changes that effectively constitute a new pipeline shape (e.g., a previously-disabled stage now matters), should the tool propose creating a *new* pipeline version (new ID) or modifying the existing one in place? Implications for promotion across environments and for rollback. Default v1 stance: edit in place; revisit if it bites us. -10. **Fusion Signals enablement plan.** When does the user enable Signals in DEV, then STAGING, then PROD? What sample sizes do we need before signals-derived judgments are trustworthy enough to drive studies? Belongs in the v1.5 kickoff conversation. -11. **Fusion app/collection scoping.** Some Fusion installations use one app per collection; others use one app for many collections. Does our `clusters.engine_config.app` model fit, or do we need a finer-grained "app + collection" target? Currently spec'd as one app per cluster row; revisit if the user has multi-app clusters. -12. **Lucidworks eval license policy for engineers.** When a developer needs hands-on Fusion access (recording new cassettes, reproducing a bug, validating new adapter parameters), what's the request flow? Options: (a) negotiate a longer-term Lucidworks dev license that the team shares, (b) rely on the org's existing Fusion dev cluster with per-engineer scoped credentials, (c) per-engineer 30-day eval licenses on demand. Affects developer-onboarding ergonomics. Recommended default: option (b) for routine work, option (c) for engineers doing initial adapter implementation. -13. **Cassette refresh cadence and ownership.** Who is responsible for re-recording the Fusion replay cassettes when the upstream Fusion API changes (e.g., a Fusion version upgrade)? 
Include in the v1 runbook. Consider a quarterly cassette-freshness CI check that pings the dev cluster and flags drift. -14. **Mock Fusion fidelity scope.** The `fusion-mock` service emulates a small subset of the Fusion query gateway. How comprehensive should it be — just enough for UI demos, or a high-fidelity simulator suitable for some classes of integration testing? Bigger ambition increases maintenance burden. Recommended default: minimal, demo-only. -15. **LLM eval cadence and triggers.** The Langfuse eval suite runs nightly and on prompt PRs by default. Should it also run on every model-version bump? On every Langfuse upgrade? On a schedule independent of code changes (e.g., monthly model-drift checks against the same prompts)? Affects CI runtime and cost. Recommended default: nightly + on prompt PRs in v1; add monthly drift checks in v1.5 once we have baseline scores to compare against. -16. **Eval gold-set ownership.** Who maintains the `judgment_generation_eval` 200-tuple gold set? Refresh cadence? This is the single most important quality signal for the LLM-as-judge layer; if it drifts or rots, evals stop catching regressions. Recommended default: relevance team owns it, quarterly refresh, with a CI check that flags if the gold set hasn't been touched in 6 months. -17. **Langfuse retention policy.** ClickHouse storage for traces grows linearly with usage. What's the retention period — 30 days? 90 days? 1 year? Affects disk sizing. Recommended default: 90 days for traces, indefinite for eval results (low volume). -18. **v1 scope vs. team size.** v1 is now a 12-week single-engineer effort or ~7 weeks with two engineers. Three options: (a) commit two engineers and ship in 7 weeks, (b) accept 12 weeks for one engineer, (c) defer one major area (most reversibly: cut Langfuse and SigNoz from v1 — accept basic logging only — and add them in v1.5 once the core loop is proven; saves ~2 weeks). 
Recommended default: (a) if a second engineer is available, otherwise (b). Option (c) is structurally riskier because retrofitting observability is painful. -19. **E2E test budget and frequency.** E2E tests use real OpenAI calls (~$5/run cap) and hit the shared Fusion dev cluster. At per-merge-to-main + nightly cadence, this could be ~$200/month in OpenAI costs alone, plus Fusion dev cluster contention. Worth confirming the budget envelope and whether per-merge E2E is the right cadence (alternative: nightly only + on-demand via PR label). -20. **Performance benchmark suite for v1.5.** What hot paths are most worth regressionproofing — trial execution, OpenAPI serving, agent first-token, the digest LLM call? Pick 3–5 for v1.5 `pytest-benchmark` suite and decide pass/fail thresholds. -21. **Path A vs. Path B long-term commitment.** v1 is strictly Path A (experimentation and change management). Path B (production quality monitoring, bandit-style online learning, shadow validation) is documented as a v2 direction but explicitly deferred. The strategic question is whether soundminds.ai commits to Path B as the long-term direction once v1 is shipped and adopted, or stays focused on Path A and treats Path B as community-driven expansion / fork territory. Affects roadmap signals to early adopters and contributor recruitment. Recommended default: revisit after 2–3 design partners are running Path A in production and we have real signal on what they want next. Early bandit-capability scoping (which architectural shape — external coordinator vs. in-engine, see §27) can begin in parallel without committing to v2 timelines. -22. **Bandit architectural shape if Path B is pursued.** External coordinator (tool maintains bandit state, search service calls a tool endpoint per query) vs. in-engine (bandit logic embedded in Solr request handler or Fusion stage, driven by tool-published config). 
External coordinator is cleaner but adds hot-path latency; in-engine has no latency but is harder to debug and requires per-engine implementations. The bandit decision has the most architectural blast radius of any Path B capability — worth reaching alignment before any work begins. -23. **Pre-launch RelyLoop trademark and namespace verification.** Before public announcement, the following must be completed and signed off: +9. **LLM eval cadence and triggers.** The Langfuse eval suite runs nightly and on prompt PRs by default. Should it also run on every model-version bump? On every Langfuse upgrade? On a schedule independent of code changes (e.g., monthly model-drift checks against the same prompts)? Affects CI runtime and cost. Recommended default: nightly + on prompt PRs at MVP3; add monthly drift checks at GA once we have baseline scores to compare against. +10. **Eval gold-set ownership.** Who maintains the `judgment_generation_eval` 200-tuple gold set? Refresh cadence? This is the single most important quality signal for the LLM-as-judge layer; if it drifts or rots, evals stop catching regressions. Recommended default: relevance team owns it, quarterly refresh, with a CI check that flags if the gold set hasn't been touched in 6 months. +11. **Langfuse retention policy.** ClickHouse storage for traces grows linearly with usage. What's the retention period — 30 days? 90 days? 1 year? Affects disk sizing. Recommended default: 90 days for traces, indefinite for eval results (low volume). +12. **E2E test budget and frequency.** E2E tests use real LLM calls (~$5/run cap) and hit the Compose ES/OpenSearch/Solr stack. At per-merge-to-main + nightly cadence, this could be ~$200/month in LLM costs. Worth confirming the budget envelope and whether per-merge E2E is the right cadence (alternative: nightly only + on-demand via PR label). +13. 
**Performance benchmark suite.** What hot paths are most worth regression-proofing — trial execution, OpenAPI serving, agent first-token, the digest LLM call? Pick 3–5 for a `pytest-benchmark` suite at MVP3 and decide pass/fail thresholds. +14. **Path A vs. Path B long-term commitment.** GA v1 is strictly Path A (experimentation and change management). Path B (production quality monitoring, bandit-style online learning, shadow validation) is documented as a v2 direction but explicitly deferred. The strategic question is whether soundminds.ai commits to Path B as the long-term direction once GA v1 is shipped and adopted, or stays focused on Path A and treats Path B as community-driven expansion / fork territory. Recommended default: revisit after 2–3 design partners are running GA v1 in production and we have real signal on what they want next. +15. **Bandit architectural shape if Path B is pursued.** External coordinator (tool maintains bandit state, search service calls a tool endpoint per query) vs. in-engine (bandit logic embedded in a Solr request handler, driven by tool-published config). External coordinator is cleaner but adds hot-path latency; in-engine has no latency but is harder to debug and requires per-engine implementations. The bandit decision has the most architectural blast radius of any Path B capability — worth reaching alignment before any work begins. +16. **Pre-launch RelyLoop trademark and namespace verification.** Before public announcement, the following must be completed and signed off: - **USPTO TESS search** for "RELYLOOP" and stylization variants (RelyLoop, Rely Loop, Rely-Loop) in software-related classes (Class 9 — downloadable software; Class 42 — SaaS / IT services). If a live registration or pending application is found, escalate to legal review before proceeding. - **Domain registration** for `relyloop.com`, `relyloop.io`, `relyloop.dev`, and ideally `relyloop.org`. Cost is minimal; squatting after public announcement is expensive.
- **GitHub organization** `relyloop` reserved (and `rely-loop` as a backup). Ditto for npm scope `@relyloop` and PyPI package prefix `relyloop-*` to prevent typosquatting. diff --git a/docs/01_architecture/adapters.md b/docs/01_architecture/adapters.md index ca167ae5..eceac832 100644 --- a/docs/01_architecture/adapters.md +++ b/docs/01_architecture/adapters.md @@ -1,7 +1,7 @@ # Adapters -**Status:** Adopted for MVP1. ElasticAdapter (handling ES + OpenSearch) is the only implementation in MVP1; Lucidworks Fusion ships at MVP3; Apache Solr at v2+. Per-release timing per [`tech-stack.md` §"Canonical release matrix"](tech-stack.md). -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §8](../00_overview/product/relevance-copilot-spec.md) ("Engine adapter specification") and §11 ("Search space & parameters"). +**Status:** Adopted for MVP1. ElasticAdapter (handling ES + OpenSearch) is the only implementation in MVP1; SolrAdapter ships at MVP2 alongside UBI judgments. Lucidworks Fusion is explicitly dropped (see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md)) — a community-contributed Fusion adapter remains possible against this Protocol, but the project does not own that direction. Per-release timing per [`tech-stack.md` §"Canonical release matrix"](tech-stack.md). +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §8](../00_overview/relyloop-spec.md) ("Engine adapter specification") and §11 ("Search space & parameters"). --- @@ -20,7 +20,7 @@ from typing import Protocol, runtime_checkable @runtime_checkable class SearchAdapter(Protocol): - engine_type: str # "elasticsearch" | "opensearch" | "lucidworks_fusion" | "solr" + engine_type: str # "elasticsearch" | "opensearch" | "solr" def health_check(self) -> HealthStatus: ... def list_targets(self, *, target_filter: str | None = None) -> list[TargetInfo]: ... 
@@ -64,7 +64,7 @@ The asymmetry on 401/403 (`list_targets` distinguishes; `get_schema` conflates w **`list_targets` filter semantics** (added by [`feat_cluster_target_filter`](../00_overview/implemented_features/_feat_cluster_target_filter/)). When the caller passes `target_filter=""`, the adapter restricts the result to names where `fnmatch.fnmatchcase(name, glob)` returns True. Glob syntax: `*`, `?`, `[seq]`, `[!seq]` — no brace expansion. Case-sensitive via `fnmatchcase` (avoids platform-dependent `os.path.normcase` in `fnmatch.fnmatch`). **Order of operations:** the engine's system-index `.` exclusion runs FIRST; the glob filter runs SECOND. Operators cannot re-expose `.kibana_1` or similar via a permissive filter. The router resolves `cluster.target_filter` from the DB row before calling the adapter — `target_filter` is per-cluster metadata, not a per-request query parameter. -The Protocol lives in `backend/app/adapters/protocol.py`. Adapter implementations live as siblings (`backend/app/adapters/elastic.py`, future `backend/app/adapters/fusion.py`, etc.). +The Protocol lives in `backend/app/adapters/protocol.py`. Adapter implementations live as siblings (`backend/app/adapters/elastic.py` today; `backend/app/adapters/solr.py` arrives with MVP2). ## ElasticAdapter (MVP1) @@ -95,21 +95,20 @@ The adapter selects engine-specific behavior via the `engine_type` flag passed a Templates use **unified parameter names**. The adapter pivots them to native names. This table is the contract; adding a new parameter means extending the unified vocabulary and updating every adapter that supports it. 
-| Concept | Unified name | ES (`multi_match`) | Lucidworks Fusion (MVP3) | Solr (`edismax`) (v2+) | -|---|---|---|---|---| -| Per-field weights | `field_boosts: {f: w}` | `fields: ["f^w"]` | stage param `searchFields.fields` or `params.solr.qf` override | `qf=f^w` | -| Phrase fields | `phrase_field_boosts` | nested `phrase` clause | `params.solr.pf` override | `pf` | -| Tie breaker | `tie_breaker` | `tie_breaker` | `params.solr.tie` override | `tie` | -| Min should match | `min_should_match` | `minimum_should_match` | `params.solr.mm` override | `mm` | -| Fuzziness | `fuzziness` | `fuzziness` | (manual via `~` in query parser) | (manual via `~`) | -| Slop | `slop` | `slop` | `params.solr.ps` override | `ps` | -| Boost function | `boost_fn: {field, type, params}` | `function_score` | boosting stage `bq` override | `boost`, `bf` | -| Reranker model | `rerank_model: {id, top_k}` | `rescore.window_size` + LTR | rerank stage `modelId`, `topK` | LTR plugin model | -| Pipeline stage toggle | `stage_enabled: {stage_id: bool}` | (n/a) | per-stage `enabled` param | (n/a) | +| Concept | Unified name | ES / OpenSearch (`multi_match`) | Solr (`edismax`) (MVP2) | +|---|---|---|---| +| Per-field weights | `field_boosts: {f: w}` | `fields: ["f^w"]` | `qf=f^w` | +| Phrase fields | `phrase_field_boosts` | nested `phrase` clause | `pf` | +| Tie breaker | `tie_breaker` | `tie_breaker` | `tie` | +| Min should match | `min_should_match` | `minimum_should_match` | `mm` (richer arithmetic syntax — `2<-25% 9<-3`) | +| Fuzziness | `fuzziness` | `fuzziness` | (manual via `~` in query parser) | +| Slop | `slop` | `slop` | `ps` | +| Boost function | `boost_fn: {field, type, params, combine: "add"\|"multiply"}` | `function_score` (multiplicative default; additive when `combine=add`) | `bf` (additive) or `boost` (multiplicative) chosen by `combine` | +| Reranker model | `rerank_model: {id, top_k}` | `rescore.window_size` + LTR | `rq={!ltr model=... 
reRankDocs=...}` | -**When a concept doesn't exist natively** (e.g., ES `function_score` rendered as Fusion `bq`), the adapter either provides a best-effort translation OR raises `UnsupportedParameter` at render time. The search-space validator catches this before a study runs (rejects the study definition rather than failing trials individually). +**When a concept doesn't exist natively**, the adapter either provides a best-effort translation OR raises `UnsupportedParameter` at render time. The search-space validator catches this before a study runs (rejects the study definition rather than failing trials individually). -**Fusion's `stage_enabled` parameter** is unique to Fusion — it lets a study toggle individual pipeline stages on/off as a categorical parameter, which is a powerful and engine-specific tuning lever. +The earlier `stage_enabled` unified-vocabulary parameter (Fusion-specific pipeline stage toggle) was removed when Fusion was dropped — see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md). ## Authentication and credentials @@ -120,36 +119,23 @@ Credentials never live in the database. 
The `clusters.credentials_ref` column is | `es_apikey` | base64-encoded `id:api_key` string | Active | | `es_basic` | YAML: `{username, password}` | Active | | `opensearch_basic` | YAML: `{username, password}` | Active | -| `opensearch_sigv4` | YAML: `{access_key_id, secret_access_key, region, role_arn?}` | Reserved; raises `NotImplementedError` until **MVP3** (AWS managed OpenSearch) | -| `fusion_session` | YAML: `{username, password, session_url}` | Reserved for **MVP3** (Lucidworks Fusion adapter) | -| `fusion_jwt` | YAML: `{jwt_token, refresh_url?}` | Reserved for **MVP3** (Lucidworks Fusion adapter) | -| `solr_basic` | YAML: `{username, password}` | Reserved for **v2+** (Apache Solr adapter) | +| `opensearch_sigv4` | YAML: `{access_key_id, secret_access_key, region, role_arn?}` | Reserved; raises `NotImplementedError` until AWS managed OpenSearch is wired up (GA v1 hardening) | +| `solr_basic` | YAML: `{username, password}` | Activates at **MVP2** (Apache Solr adapter) | +| `solr_apikey` | YAML: `{jwt_token, refresh_url?}` for Solr 9+ `JWTAuthPlugin` | Activates at **MVP2** (Apache Solr adapter) | ## Reserved for later releases -Adapter implementations described here for architectural orientation. Each will get its own implementation file when it ships. - -### LucidworksFusionAdapter (MVP3) - -Lucidworks Fusion is built on Solr but exposes a different API surface centered on Query Pipelines. Pure-Solr deployments will be supported architecturally (see SolrAdapter notes below) but are deferred to v2+. - -- `search_batch` posts to Fusion's query API: `POST /api/apps/{app}/query/{collection}` with the request body holding query text and per-stage parameter overrides (`params.{stageId}.{paramName}`). -- `render` produces a Fusion request body, NOT a raw Solr query. A "template" in Fusion is a query pipeline definition exported as JSON, plus a parameter-binding map. -- `get_schema` queries Fusion's catalog API. 
-- `explain` uses `params.solr.debugQuery=true` and parses the `debug.explain` block returned through the Fusion gateway. -- **Authentication:** session-based (`POST /api/session`) or JWT. -- **Pipeline export/import:** apply path uses Fusion's `objects-export` and `objects-import` APIs. -- **Signals (v1.5+):** Fusion's signals collections capture user click/view/refinement events. The adapter exposes a `pull_signals` operation for click-derived judgment generation. -- Supports Fusion 5.x. - -### SolrAdapter (v2+; architectural reference only) +### SolrAdapter (MVP2) -Pure Apache Solr is supported by the same adapter pattern but is not built before v2 because the early-release user's deployment is Lucidworks Fusion (which arrives at MVP3). +Apache Solr ships in MVP2 alongside UBI judgments. Full scope in [`infra_adapter_solr/idea.md`](../../02_product/planned_features/infra_adapter_solr/idea.md). Summary: -- `search_batch` uses parallel `/select` requests (Solr has no `_msearch` equivalent). -- `render` produces Solr query parameters as a dict; supports `lucene`, `edismax`, `dismax` parsers. +- `search_batch` uses parallel `/select` requests with a connection pool (Solr has no `_msearch` equivalent). +- `render` produces a Solr request parameter dict; templates under `templates/solr/` mirror the `templates/elasticsearch/` shape. Supports `edismax` (primary), `dismax`, `lucene` parsers. +- `get_schema` uses Solr's Schema API; `list_targets` selects CoresAdmin (standalone) or CollectionsAdmin (SolrCloud) based on a startup capability probe. - `explain` uses `debugQuery=true&debug=results`. -- Supports Solr 8.11+ and 9.x; SolrCloud and standalone. +- LTR rescoring: applies a pre-existing `MultipleAdditiveTreesModel` (XGBoost-compatible) loaded via Solr's `/schema/model-store` as a rescore stage in a trial. Training is out of scope (LTR training is in the backlog). 
+- UBI on Solr: Solr ships `solr.UBIComponent` in core writing the same `ubi_queries` + `ubi_events` schema as the OpenSearch UBI plugin. The MVP2 `UbiReader` works on Solr unchanged. +- Supports Solr 9.x and 10.x; SolrCloud and standalone. ## Cross-references diff --git a/docs/01_architecture/agent-tools.md b/docs/01_architecture/agent-tools.md index 6f84d170..1c6e817b 100644 --- a/docs/01_architecture/agent-tools.md +++ b/docs/01_architecture/agent-tools.md @@ -1,7 +1,7 @@ # Agent Tools **Status:** Adopted for MVP1 with OpenAI function-calling. The tool registry pattern persists into LangGraph (GA v1) without breaking changes. -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §19](../00_overview/product/relevance-copilot-spec.md) ("Agent tools") + §21 ("Agent integration"). +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §19](../00_overview/relyloop-spec.md) ("Agent tools") + §21 ("Agent integration"). --- @@ -125,7 +125,7 @@ The dispatcher **MUST** validate `tool_call.arguments` against the tool's Pydant | `fork_study(study_id, narrowed_search_space?, name?)` → `Study` | MVP2 (study forking with narrowed ranges) | | `run_pairwise(cluster_id, target, query_a, query_b, query_text)` → `PairwiseResult` | MVP2 (interactive comparison) | | `run_rank_eval(cluster_id, target, template_rendered, query_set_id, judgment_list_id, metric)` → `EvalResult` | MVP2 (one-off eval without a study) | -| Fusion-specific tools (`list_pipelines`, `get_pipeline`, `list_query_profiles`, `pull_signals`) | MVP3 (with Fusion adapter) | +| `generate_judgments_from_ubi(query_set_id, cluster_id, target, since, until?, converter, llm_fill_threshold?)` → `JudgmentList` | MVP2 (with UBI judgments + Solr adapter) | | LangGraph state-graph orchestrator (replaces plain `openai` + function calling) | GA v1 | | Hypothesis-gen + evaluation subagents (per umbrella §15 architecture diagram) | GA v1 | | Human-in-the-loop interrupts 
before `open_pr`, prod-cluster studies, judgment regen | GA v1 | diff --git a/docs/01_architecture/api-conventions.md b/docs/01_architecture/api-conventions.md index d25928c9..11c8d864 100644 --- a/docs/01_architecture/api-conventions.md +++ b/docs/01_architecture/api-conventions.md @@ -1,7 +1,7 @@ # API Conventions **Status:** Adopted for MVP1. New conventions activate at the release noted on each row. -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §28](../00_overview/product/relevance-copilot-spec.md) ("API conventions" subsection). +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §28](../00_overview/relyloop-spec.md) ("API conventions" subsection). --- diff --git a/docs/01_architecture/apply-path.md b/docs/01_architecture/apply-path.md index 4440cb9f..7dcaec91 100644 --- a/docs/01_architecture/apply-path.md +++ b/docs/01_architecture/apply-path.md @@ -1,7 +1,7 @@ # Apply Path: Git PR Workflow **Status:** Adopted for MVP1 with GitHub-only. Multi-Git-provider abstraction (GitLab + Bitbucket) ships at MVP3 per [`tech-stack.md` §"Canonical release matrix"](tech-stack.md). -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §16](../00_overview/product/relevance-copilot-spec.md) ("Apply path: Git PR workflow"). +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §16](../00_overview/relyloop-spec.md) ("Apply path: Git PR workflow"). 
--- @@ -105,14 +105,14 @@ Per [`feat_github_webhook`](../02_product/planned_features/feat_github_webhook/f | Capability | Activates at | |---|---| -| Multi-Git-provider abstraction (`GitProvider` Protocol) with GitLab + Bitbucket implementations | **MVP3** ("Production Stacks") | -| GitLab (project token / app, project-level webhooks, MR + approval rules) | MVP3 | -| Bitbucket (workspace tokens, webhook UUID, default reviewers + branch restrictions) | MVP3 | -| GitHub App auth (installation tokens, JWT signing) | MVP3 | -| Per-provider webhook signature verification beyond GitHub HMAC-SHA256 | MVP3 | -| Lucidworks Fusion two-step apply path (PR edits pipeline params; CI runs `objects-import` to deploy) | MVP3 (with Fusion adapter) | -| Slack notifications on PR open / review-requested / merged | MVP2 | -| Validation re-run on prod after staging win (top user story #2 from umbrella §6) | MVP2 | +| Multi-Git-provider abstraction (`GitProvider` Protocol) with GitLab + Bitbucket implementations | **Backlog** (was MVP3 in the prior plan) | +| GitLab (project token / app, project-level webhooks, MR + approval rules) | Backlog | +| Bitbucket (workspace tokens, webhook UUID, default reviewers + branch restrictions) | Backlog | +| GitHub App auth (installation tokens, JWT signing) | Backlog | +| Per-provider webhook signature verification beyond GitHub HMAC-SHA256 | Backlog | +| Apache Solr apply path (PR edits `*.params.json`; CI writes to Solr via Request Parameters API or `solrconfig.xml` swap) | MVP2 (with Solr adapter) | +| Slack notifications on PR open / review-requested / merged | MVP3 (observability layer) | +| Validation re-run on prod after staging win (top user story #2 from umbrella §6) | MVP3 | ## Cross-references diff --git a/docs/01_architecture/data-model.md b/docs/01_architecture/data-model.md index 0630d8ac..8594d7fa 100644 --- a/docs/01_architecture/data-model.md +++ b/docs/01_architecture/data-model.md @@ -1,7 +1,7 @@ # Data Model **Status:** Adopted 
for MVP1. Tables shown with their MVP1 shape; deferred columns and tables are flagged. -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §9](../00_overview/product/relevance-copilot-spec.md) ("Data model"). +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §9](../00_overview/relyloop-spec.md) ("Data model"). --- diff --git a/docs/01_architecture/deployment.md b/docs/01_architecture/deployment.md index 9a7b29ae..96723a38 100644 --- a/docs/01_architecture/deployment.md +++ b/docs/01_architecture/deployment.md @@ -1,7 +1,7 @@ # Deployment **Status:** Adopted for MVP1. Local Docker Compose only; production-grade deployment activates as later releases add the missing pieces (TLS, SSO, observability). -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §25](../00_overview/product/relevance-copilot-spec.md) ("Deployment"). +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §25](../00_overview/relyloop-spec.md) ("Deployment"). --- @@ -189,19 +189,19 @@ Resetting state: `docker compose down -v && rm -rf ./data` returns to a clean in **MVP1: all services bind to `127.0.0.1` only.** The API is reachable on `localhost:8000`; ES on `localhost:9200`; OpenSearch on `localhost:9201`. No service is reachable from the network beyond the host. -This is appropriate for laptop installs. **MVP3** adds a Caddy reverse proxy with TLS termination (Let's Encrypt) for production-style network exposure — but with **no authentication yet** (the API is reachable over TLS but unauthenticated; appropriate only for trusted-network deployments). **MVP4** adds SSO (oauth2-proxy or Authelia in front of Caddy) and bearer API keys, completing the authenticated-install story per umbrella §18. +This is appropriate for laptop installs. 
**GA v1** adds a Caddy reverse proxy with TLS termination (Let's Encrypt) for production-style network exposure — but with **no authentication yet** (the API is reachable over TLS but unauthenticated; appropriate only for trusted-network deployments). SSO (oauth2-proxy or Authelia in front of Caddy) and bearer API keys ship when multi-tenancy is promoted from backlog. ## Reserved for later releases -The umbrella spec §25 lists the full GA v1 deployment (which includes Caddy, Langfuse, ClickHouse, SigNoz, fusion-mock). MVP1 ships only the 6 containers above. The remaining services activate at: +The umbrella spec §25 lists the full GA v1 deployment (which includes Caddy, Langfuse, ClickHouse, SigNoz). MVP1 ships only the 6 containers above. The remaining services activate at: | Service | Activates at | Why | |---|---|---| -| `langfuse-web`, `langfuse-worker`, `clickhouse` | **MVP2** | LLM observability theme. | -| `signoz`, `signoz-otel-collector` | **MVP2** | Distributed tracing theme. | -| `caddy` (reverse proxy + Let's Encrypt TLS) | **MVP3** | Production-style install (TLS, network exposure) lands with production-stack hardening. **No SSO yet** at MVP3 — Caddy alone provides TLS for trusted-network deployments. | -| `fusion-mock` | **MVP3** | Lucidworks Fusion adapter ships here; mock service for UI/demo dev when shared dev cluster isn't reachable. | -| `oauth2-proxy` / Authelia (SSO in front of Caddy) | **MVP4** | Auth surface arrives with `users` + `tenants` + API keys; SSO completes the authenticated-install story per umbrella §18. | +| `solr` | **MVP2** | Apache Solr 10 container, bound to `127.0.0.1:8983`; ships alongside the `SolrAdapter` and UBI judgments. | +| `langfuse-web`, `langfuse-worker`, `clickhouse` | **MVP3** | LLM observability theme ("Observable"). | +| `signoz`, `signoz-otel-collector` | **MVP3** | Distributed tracing also MVP3. 
| +| `caddy` (reverse proxy + Let's Encrypt TLS) | **GA v1** | Production-style install (TLS, network exposure) lands with GA v1 hardening. **No SSO yet** — Caddy alone provides TLS for trusted-network deployments. | +| `oauth2-proxy` / Authelia (SSO in front of Caddy) | **Backlog** | Auth surface arrives when multi-tenancy is promoted from backlog (`users` + `tenants` + API keys). | ## Operator workflow (MVP1) diff --git a/docs/01_architecture/llm-orchestration.md b/docs/01_architecture/llm-orchestration.md index 7d5207b9..ded67502 100644 --- a/docs/01_architecture/llm-orchestration.md +++ b/docs/01_architecture/llm-orchestration.md @@ -1,7 +1,7 @@ # LLM Orchestration **Status:** Adopted for MVP1 with the plain `openai` SDK + function calling. **The SDK is pointed at any OpenAI-compatible endpoint via `OPENAI_BASE_URL`** (defaults to `https://api.openai.com/v1`; works against Ollama, LM Studio, vLLM, HuggingFace TGI for air-gapped evaluation). LangGraph orchestrator + native non-OpenAI-compatible provider SDKs (Anthropic, Bedrock, Vertex) + Langfuse + RedisCache arrive at later releases per the canonical [`tech-stack.md` §"Canonical release matrix"](tech-stack.md). -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §15](../00_overview/product/relevance-copilot-spec.md) ("LLM orchestration & observability"). +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §15](../00_overview/relyloop-spec.md) ("LLM orchestration & observability"). --- diff --git a/docs/01_architecture/mvp1-overview.md b/docs/01_architecture/mvp1-overview.md index a39d1044..12900515 100644 --- a/docs/01_architecture/mvp1-overview.md +++ b/docs/01_architecture/mvp1-overview.md @@ -2,7 +2,7 @@ **Status:** This is the architecture as it exists in MVP1 ("The Loop"). Each topical doc covers all releases; this page is a fast entry point that filters them down to MVP1's active scope. 
-**For product context:** [docs/00_overview/product/relevance-copilot-spec.md §27](../00_overview/product/relevance-copilot-spec.md) ("MVP1 / v0.1 — The Loop"). +**For product context:** [docs/00_overview/relyloop-spec.md §27](../00_overview/relyloop-spec.md) ("MVP1 / v0.1 — The Loop"). --- @@ -25,44 +25,44 @@ A single `make up` (which auto-generates required secrets on first run, then inv These appear in the topical arch docs because the docs cover all releases — but they're **not MVP1 work**. Skip them while building MVP1. Per-release timing is the canonical [`tech-stack.md` §"Canonical release matrix"](tech-stack.md); the lists below are derived from it. -### Reserved for MVP2 ("Observable") +### Reserved for MVP2 ("Three-Engine + Real Signals") +- **`SolrAdapter`** + `solr` Compose service (Apache Solr 9.x / 10.x; `edismax` + `{!ltr}` rescoring) +- **UBI judgments**: `UbiReader` (engine-agnostic) + `SignalsConverter` Protocol with three impls (CTR threshold, dwell-time, hybrid UBI+LLM) +- `POST /api/v1/judgment-lists/generate-from-ubi` endpoint + `generate_judgments_from_ubi` agent tool +- Templates under `templates/solr/` mirroring the `templates/elasticsearch/` shape +- Tutorial extensions (Step 0 Path C "Run against Solr"; Step 7 "Swap LLM judgments for UBI-derived") +- One migration extending `clusters.engine_type` + `auth_kind` CHECK constraints + +### Reserved for MVP3 ("Observable") - `langfuse-web`, `langfuse-worker`, `clickhouse` — LLM observability stack - `signoz`, `signoz-otel-collector` — distributed tracing -- **`audit_log` table + Postgres immutability trigger** (no users/tenants yet — `actor_id`/`tenant_id` nullable, no FKs; `actor_type` ENUM (`system`, `agent`, `anonymous`); FKs added at MVP4) +- **`audit_log` table + Postgres immutability trigger** (no users/tenants yet — `actor_id`/`tenant_id` nullable, no FKs; `actor_type` ENUM (`system`, `agent`, `anonymous`)) - Lineage columns on `judgments`, `digests`, `proposals` (`langfuse_trace_id`, 
`prompt_version`, `input_hash`) - PII redaction processor in structlog - Canonical event catalog (`backend/app/events.py`) -- Trace context propagation through API → Redis → worker → adapter → engine (custom Arq enqueue→pickup serialization) -- Forking studies with narrowed search-space ranges -- Slack notifications on PR open -- Validation re-run on prod after staging win - -### Reserved for MVP3 ("Production Stacks") -- **`LucidworksFusionAdapter`** (and the `fusion-mock` Compose service) -- GitLab and Bitbucket as Git providers; multi-Git-provider abstraction (`GitProvider` Protocol) -- Adapter contract test suite (every `SearchAdapter` and `GitProvider` runs the same conformance suite) -- AWS managed OpenSearch (`opensearch_sigv4` auth kind activates) -- Production-style install: Caddy + Let's Encrypt TLS, managed Postgres/Redis. **No SSO yet** — production-stack hardening only. -- Container image scanning (Trivy) -- Image signing (cosign) — *may slip earlier into chore_tutorial_polish if cheap* - -### Reserved for MVP4 ("Multi-tenant, Multi-LLM") -- `tenants`, `tenant_memberships`, `users`, `api_keys` tables -- `tenant_id` column on every user-facing table (with backfill auto-creating a `default` tenant) -- FK constraints added to `audit_log.actor_id` and `audit_log.tenant_id`; `actor_type` ENUM extended to include `user` -- **SSO for humans** via reverse proxy (oauth2-proxy or Authelia injecting `X-Auth-Email`); proxy verified by mTLS or shared secret -- **Argon2id-hashed bearer API keys** for service accounts (`Authorization: Bearer `) -- Roles: `viewer` / `runner` / `tenant_admin` (per-tenant) + `platform_admin` (cross-tenant) -- Multi-LLM provider abstraction (Anthropic, AWS Bedrock, Google Vertex, Ollama, vLLM) via LangChain provider packages -- LangChain `RedisCache` for LLM responses +- Trace context propagation through API → Redis → worker → adapter → engine (custom Arq enqueue→pickup serialization) for all three engines ### Reserved for GA v1 
("Production-ready") - LangGraph orchestrator + `PostgresSaver` (replaces the plain `openai` SDK + function calling) - Full RFC 7807 Problem Details for errors - `Idempotency-Key` header on POST/PATCH/DELETE -- Helm 3 chart for Kubernetes deployments -- Container scanning, deps audit, image signing all operational -- 90% backend coverage gate (up from 80% in MVP1) +- Full four-layer test pyramid at 90% coverage +- Container scanning (Trivy), deps audit (pip-audit, npm audit), image signing (cosign keyless OIDC) +- Production-style install: Caddy + Let's Encrypt TLS, managed Postgres/Redis (trusted-network deployments; SSO is in the backlog) +- AWS managed OpenSearch (`opensearch_sigv4` auth kind activates) +- Adapter contract test suite (every `SearchAdapter` runs the same conformance suite) +- Public Optuna-vs-SRW-grid benchmark +- Design-partner references (target: one each on ES, OpenSearch, Solr) + +### Backlog (out of pre-GA scope) +- Multi-Git provider abstraction (`GitProvider` Protocol with GitLab + Bitbucket implementations) +- Multi-tenancy primitives (`tenants`, `tenant_memberships`, `users`, `api_keys` tables; `tenant_id` columns) +- SSO via reverse proxy (oauth2-proxy or Authelia); Argon2id-hashed bearer API keys for service accounts +- Native non-OpenAI provider SDKs (Anthropic, AWS Bedrock, Google Vertex AI, Azure OpenAI) via LangChain `BaseChatModel`; per-tenant LLM provider selection +- LTR training (cross-engine model training; MVP2's LTR support is consume-only) +- Path B (production monitoring, bandits, shadow validation) +- Helm chart maturity; Kubernetes-native operator +- Lucidworks Fusion adapter (explicitly dropped — see [`chore_drop_fusion_scope/idea.md`](../02_product/planned_features/chore_drop_fusion_scope/idea.md)) ### Reserved for v2+ - `SolrAdapter` (pure Apache Solr support) @@ -126,4 +126,4 @@ The "TBA" docs are authored alongside their corresponding feature spec. 
- All arch docs in this section: [`docs/01_architecture/`](./) - MVP1 feature folders: [`docs/02_product/planned_features/`](../02_product/planned_features/) - MVP1 user stories: [`docs/02_product/mvp1-user-stories.md`](../02_product/mvp1-user-stories.md) -- Umbrella spec MVP1 section: [`docs/00_overview/product/relevance-copilot-spec.md` §27](../00_overview/product/relevance-copilot-spec.md) +- Umbrella spec MVP1 section: [`docs/00_overview/relyloop-spec.md` §27](../00_overview/relyloop-spec.md) diff --git a/docs/01_architecture/optimization.md b/docs/01_architecture/optimization.md index 7b1bb7df..e5323812 100644 --- a/docs/01_architecture/optimization.md +++ b/docs/01_architecture/optimization.md @@ -1,7 +1,7 @@ # Optimization (Optuna + ir_measures) **Status:** Adopted for MVP1. Single-objective TPE + median pruner; provider-abstracted IR evaluation via `ir_measures` (wraps multiple cut-aware-metric backends behind a typed metric-object DSL). Multi-objective optimization (CMA-ES + multi-metric) reserved for v2 per umbrella spec. -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §13–§14](../00_overview/product/relevance-copilot-spec.md). Per-release timing per [`tech-stack.md` §"Canonical release matrix"](tech-stack.md). +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §13–§14](../00_overview/relyloop-spec.md). Per-release timing per [`tech-stack.md` §"Canonical release matrix"](tech-stack.md). --- @@ -52,7 +52,7 @@ Per umbrella spec §14, RelyLoop **always** evaluates via `ir_measures` — neve - `ir_measures` (from the PyTerrier team) wraps multiple IR-evaluation backends behind a typed metric-object DSL (`nDCG@10`, `AP@5`, `RR`, `P@k`, `R@k`). The provider abstraction means swapping the underlying backend is a config change rather than a rewrite — protecting against future single-maintainer abandonment risk. 
- ES `_rank_eval` and `ir_measures` don't always agree to many decimal places (different normalization conventions across engines). - Per-query scores are inspectable, enabling deep debugging. -- Cross-engine comparability: the same metric semantics apply whether the underlying engine is ES, OpenSearch, Fusion, or Solr. +- Cross-engine comparability: the same metric semantics apply whether the underlying engine is Elasticsearch, OpenSearch, or Apache Solr. ### Supported metrics (MVP1) @@ -192,5 +192,6 @@ contract is reviewed in [`feat_pr_metric_confidence/feature_spec.md`](../02_prod | CMA-ES sampler (selectable per study) | MVP2 | TPE is sufficient for MVP1's low-dim search spaces; CMA-ES becomes valuable when adopters tune ≥7 continuous parameters. | | Intermediate-step pruning (truly active `MedianPruner`) | MVP2 | Requires multi-step trials (e.g., evaluate after each query batch); MVP1 trials evaluate once per (params, full query set). | | Multi-objective optimization (Pareto fronts via NSGA-II) | v2 | Single scalar objective is sufficient through GA v1; multi-objective adds product complexity (which Pareto trade-off do you ship?). | -| Click-derived judgments from Fusion Signals | v1.5+ | Requires Fusion adapter (MVP3) + Signals enabled in the user's deployment. The judgment `source = 'click'` enum value is reserved from MVP1 forward; the converter plug-ins land at v1.5+. | -| LLM+signals hybrid judgments | v1.5+ | Same — depends on Fusion Signals integration. | +| UBI-derived judgments + hybrid UBI+LLM converter | MVP2 | Bundled with the Solr adapter in MVP2 (see [`feat_ubi_judgments/idea.md`](../02_product/planned_features/feat_ubi_judgments/idea.md)). The judgment `source = 'click'` enum value is reserved from MVP1 forward; the `UbiReader` + `SignalsConverter` land at MVP2. 
| +| Counterfactual click models (CCM, DBN) as additional `SignalsConverter` impls | Backlog | Require enough impressions per (query, doc) to be statistically valid; promoted out when post-MVP2 adopter traffic supports it. | +| Engine-native click readers (Elastic Behavioral Analytics) | Backlog | UBI covers the engine-neutral path for ES + OpenSearch + Solr. Elastic BA is a residual ES-shop bridge despite Elastic's 9.0 deprecation; landed when an adopter requires it. | diff --git a/docs/01_architecture/system-overview.md b/docs/01_architecture/system-overview.md index 1efcc865..85fa2b3b 100644 --- a/docs/01_architecture/system-overview.md +++ b/docs/01_architecture/system-overview.md @@ -1,7 +1,7 @@ # System Overview **Status:** Adopted for MVP1. Each release adds services; this doc shows the full topology with MVP1-active services highlighted. -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §7](../00_overview/product/relevance-copilot-spec.md) ("System architecture"). +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §7](../00_overview/relyloop-spec.md) ("System architecture"). --- @@ -114,11 +114,11 @@ Services in the umbrella spec §25 deployment that are NOT in MVP1: | Service | Activates at | Why deferred for MVP1 | |---|---|---| | `ui` (containerized) | Late MVP1 / chore_tutorial_polish | UI runs via `pnpm dev` during MVP1 development; containerization is a polish item. | -| `caddy` (reverse proxy + Let's Encrypt TLS) | MVP3 | Production-style install adds TLS + network exposure. **No SSO yet** — trusted-network deployments only. | -| `oauth2-proxy` / Authelia (SSO in front of Caddy) | MVP4 | Auth surface arrives with `users` + `tenants` + API keys per umbrella §18. | -| `langfuse-web`, `langfuse-worker`, `clickhouse` | MVP2 | LLM observability is the MVP2 theme ("Observable"). | -| `signoz`, `signoz-otel-collector` | MVP2 | Distributed tracing also MVP2. 
| -| `fusion-mock` | MVP3 | Ships with the Lucidworks Fusion adapter; mock service for UI/demo dev when shared dev cluster isn't reachable. | +| `solr` | MVP2 | Apache Solr 10 container, bound to `127.0.0.1:8983`. Mirrors the existing `elasticsearch` and `opensearch` service shape; ships alongside the `SolrAdapter`. | +| `langfuse-web`, `langfuse-worker`, `clickhouse` | MVP3 | LLM observability is the MVP3 theme ("Observable"). | +| `signoz`, `signoz-otel-collector` | MVP3 | Distributed tracing also MVP3. | +| `caddy` (reverse proxy + Let's Encrypt TLS) | GA v1 | Production-style install adds TLS + network exposure. **No SSO yet** — trusted-network deployments only. | +| `oauth2-proxy` / Authelia (SSO in front of Caddy) | Backlog | Auth surface arrives when multi-tenancy is promoted from backlog. | ## Deployment diff --git a/docs/01_architecture/tech-stack.md b/docs/01_architecture/tech-stack.md index 7d29b262..5fa69485 100644 --- a/docs/01_architecture/tech-stack.md +++ b/docs/01_architecture/tech-stack.md @@ -1,7 +1,7 @@ # Tech Stack **Status:** Adopted for MVP1. Revisited per release as new layers come online. -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §28](../00_overview/product/relevance-copilot-spec.md) ("Tech stack & implementation decisions"). This document is the engineering-facing distillation of those decisions, scoped to what's relevant for MVP1 with explicit notes on what activates in later releases. +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §28](../00_overview/relyloop-spec.md) ("Tech stack & implementation decisions"). This document is the engineering-facing distillation of those decisions, scoped to what's relevant for MVP1 with explicit notes on what activates in later releases. 
--- @@ -11,16 +11,13 @@ This is the source-of-truth release matrix that every other arch doc derives fro | Release | Theme | Adds on top of previous | |---|---|---| -| **MVP1 / v0.1** | "The Loop" | ES + OpenSearch adapter (single `ElasticAdapter`); LLM via `openai` SDK pointed at any **OpenAI-compatible endpoint** (`OPENAI_BASE_URL` config; defaults to `https://api.openai.com/v1`; works against Ollama, LM Studio, vLLM, HuggingFace TGI for air-gapped evaluation); GitHub Git provider; single-tenant (no `tenants` table, no `tenant_id`); no auth; basic structured logging; Docker Compose; Apache 2.0 LICENSE; 80% backend coverage gate. **No** native non-OpenAI-compatible providers (Anthropic/Bedrock/Vertex SDKs ship at MVP4), **no** observability stack, **no** audit_log, **no** lineage, **no** Fusion, **no** SSO, **no** API keys. | -| **MVP1.5 / v0.1.5** | "Real Signals" | **OpenSearch UBI judgments** as a first-class judgment source. New `UbiReader` (engine-agnostic; reads the standardized `ubi_queries` + `ubi_events` indices via any `SearchAdapter`'s `search_batch`) + pluggable `SignalsConverter` Protocol (initial impls: position-bias-corrected CTR, dwell-time threshold, hybrid UBI+LLM where UBI rates the dense head and LLM fills the long tail). Judgment lists can mix sources (`llm` + `human` + `click` rows in the same list — the existing `judgments.source` enum already permits this). New `POST /api/v1/judgment-lists/generate-from-ubi` endpoint + new agent tool `generate_judgments_from_ubi`. **No** schema migration (additive — uses existing `source = 'click'` enum value), **no** new Compose service. Predicated on the operator having the OpenSearch UBI plugin installed and logging events. 
| -| **MVP2 / v0.2** | "Observable" | Langfuse + ClickHouse + SigNoz + OpenTelemetry exporters wired; canonical event catalog; **`audit_log` table + Postgres immutability trigger** (no users/tenants yet — `actor_id`/`tenant_id` nullable, no FKs; FKs added at MVP4); lineage columns (`langfuse_trace_id`, `prompt_version`, `input_hash`) on `judgments`/`digests`/`proposals`; PII redaction; trace context propagation through API → Redis → worker → adapter → engine. | -| **MVP3 / v0.3** | "Production Stacks" | **Lucidworks Fusion adapter** (`auth_kind = fusion_session` and `fusion_jwt`); multi-Git-provider abstraction (GitLab + Bitbucket alongside GitHub); adapter contract test suite; production-style install (TLS via Caddy + Let's Encrypt, managed Postgres/Redis); AWS managed OpenSearch (`auth_kind = opensearch_sigv4` activates). **No** SSO/auth yet (production-stack hardening only). | -| **MVP4 / v0.4** | "Multi-tenant, Multi-LLM" | `tenants` + `tenant_memberships` + `users` + `api_keys` tables; `tenant_id` columns on every user-facing table (with backfill); roles `viewer` / `runner` / `tenant_admin` (per-tenant) + `platform_admin` (cross-tenant); **SSO via reverse proxy** (oauth2-proxy or Authelia injecting `X-Auth-Email`); **Argon2id-hashed bearer API keys** for service accounts; **native non-OpenAI-compatible LLM providers via LangChain `BaseChatModel` abstraction** (Anthropic, AWS Bedrock, Google Vertex AI); per-tenant LLM provider selection + cost rollups; FK constraints added to `audit_log.actor_id` / `audit_log.tenant_id`. (OpenAI-compatible providers — including Ollama, LM Studio, vLLM, HuggingFace TGI — already work in MVP1 via `OPENAI_BASE_URL`.) 
| -| **GA v1 / v1.0** | "Production-ready" | **LangGraph orchestrator** (replaces plain `openai` SDK + function calling); `PostgresSaver` for resumable conversations; full RFC 7807 Problem Details on errors; `Idempotency-Key` header on POST/PATCH/DELETE; Helm 3 chart; container scanning (Trivy), deps audit (pip-audit/npm audit), image signing (cosign keyless OIDC); 90% backend coverage gate (up from 80% in MVP1). | -| **v1.5+** | post-GA | Helm chart maturity, Kubernetes-native operator. | -| **v2+** | post-GA | Apache Solr adapter (`auth_kind = solr_basic` activates). | +| **MVP1 / v0.1 (shipped)** | "The Loop" | ES + OpenSearch adapter (single `ElasticAdapter`); LLM via `openai` SDK pointed at any **OpenAI-compatible endpoint** (`OPENAI_BASE_URL` config; defaults to `https://api.openai.com/v1`; works against Ollama, LM Studio, vLLM, HuggingFace TGI for air-gapped evaluation); GitHub Git provider; single-tenant (no `tenants` table, no `tenant_id`); no auth; basic structured logging; Docker Compose; Apache 2.0 LICENSE; 80% backend coverage gate; Optuna/TPE optimization loop over the full query-time search space; Git-PR apply path; conversational agent that runs the loop. **No** native non-OpenAI-compatible providers (backlog), **no** observability stack, **no** audit_log, **no** lineage, **no** Solr yet, **no** SSO, **no** API keys. 
| +| **MVP2 / v0.2** | "Three-Engine + Real Signals" | **Apache Solr adapter** (`auth_kind = solr_basic` and `solr_apikey`) covering Solr 9.x + 10.x via `edismax` + `{!ltr}` rescoring + `solr.UBIComponent` for UBI capture; **UBI judgments** via engine-agnostic `UbiReader` (reads `ubi_queries` + `ubi_events` via any `SearchAdapter`'s `search_batch`); pluggable `SignalsConverter` Protocol (position-bias-corrected CTR, dwell-time threshold, **hybrid UBI+LLM**); `POST /api/v1/judgment-lists/generate-from-ubi` + `generate_judgments_from_ubi` agent tool; mixed-source judgment lists (`llm` + `human` + `click` rows in the same list — the existing `judgments.source` enum already permits this). After MVP2 ships, RelyLoop runs on all three OSS engines with UBI on every one of them. **No** schema migration for UBI (additive — uses existing `source = 'click'` enum value); **one** small migration extends `engine_type` + `auth_kind` CHECK constraints to accept Solr values. | +| **MVP3 / v0.3** | "Observable" | Langfuse + ClickHouse + SigNoz + OpenTelemetry exporters wired; canonical event catalog; **`audit_log` table + Postgres immutability trigger** (no users/tenants yet — `actor_id`/`tenant_id` nullable, no FKs); lineage columns (`langfuse_trace_id`, `prompt_version`, `input_hash`) on `judgments`/`digests`/`proposals`; PII redaction; trace context propagation through API → Redis → worker → adapter → engine for all three engines. 
| +| **GA v1 / v1.0** | "Production-ready" | **LangGraph orchestrator** (replaces plain `openai` SDK + function calling); `PostgresSaver` for resumable conversations; full RFC 7807 Problem Details on errors; `Idempotency-Key` header on POST/PATCH/DELETE; full four-layer test pyramid at 90% coverage; complete CI/CD with security gates (Trivy, bandit, pip-audit, npm audit); image signing (cosign keyless OIDC); Helm 3 chart; complete OSS launch infrastructure (docs, ADRs, contributor onboarding, design-partner references); public Optuna-vs-SRW-grid benchmark. **No new product surface** — all six differentiators are GA by MVP3; GA v1 is polish + governance + hardening. | +| **Backlog** | — | Multi-Git provider abstraction (GitLab + Bitbucket); multi-tenancy primitives (`tenants` + `tenant_memberships` + `users` + `api_keys` tables; `tenant_id` columns; roles `viewer`/`runner`/`tenant_admin`/`platform_admin`); SSO via reverse proxy; Argon2id-hashed bearer API keys; native non-OpenAI provider SDKs (Anthropic, Bedrock, Vertex, Azure OpenAI); LTR training; Path B (production-quality monitoring, bandits, shadow validation, manual one-click rollback); Helm chart maturity; Lucidworks Fusion adapter (explicitly dropped — see [`chore_drop_fusion_scope/idea.md`](../02_product/planned_features/chore_drop_fusion_scope/idea.md)). | -**Audit-without-users design:** MVP2 ships `audit_log` with `actor_id` / `tenant_id` as nullable UUIDs with **no FK constraints**, plus an `actor_type` ENUM constrained to `system` / `agent` / `anonymous`. MVP4 adds the FK constraints, extends `actor_type` to include `user`, and backfills `tenant_id` from the auto-created `default` tenant. Pre-MVP4 audit rows keep `actor_id = NULL`. See [`data-model.md` §"`audit_log`"](data-model.md) for the schema. 
+**Audit-without-users design:** MVP3 ships `audit_log` with `actor_id` / `tenant_id` as nullable UUIDs with **no FK constraints**, plus an `actor_type` ENUM constrained to `system` / `agent` / `anonymous`. The FK constraints and the `user` actor type ship when multi-tenancy is promoted from backlog. Pre-multi-tenancy audit rows keep `actor_id = NULL`. See [`data-model.md` §"`audit_log`"](data-model.md) for the schema. --- @@ -41,8 +38,8 @@ This is the source-of-truth release matrix that every other arch doc derives fro | Optimization | Optuna with TPE sampler + RDBStorage | RDBStorage points at the same Postgres as the app. | | IR evaluation | ir_measures | Provider-abstracted; wraps multiple IR-evaluation backends behind a typed metric-object DSL; consistent metrics across engines. | | LLM SDK (MVP1) | `openai` Python SDK with function calling | LangGraph deferred to GA v1. No provider-abstraction layer in MVP1 — direct OpenAI calls. | -| Auth — humans (MVP4+) | SSO via reverse proxy (oauth2-proxy or Authelia); proxy injects `X-Auth-Email` header; API trusts the header only when verified by mTLS or a shared secret | Not present in MVP1–3. No password storage in RelyLoop itself — identity provider owns credentials. | -| Auth — service accounts (MVP4+) | Bearer API keys (`Authorization: Bearer `); keys hashed with Argon2id (passlib) at rest | Not present in MVP1–3. Per-key role + scopes + expiration; revocation via `revoked_at`. | +| Auth — humans (backlog) | SSO via reverse proxy (oauth2-proxy or Authelia); proxy injects `X-Auth-Email` header; API trusts the header only when verified by mTLS or a shared secret | Not present through GA v1. No password storage in RelyLoop itself — identity provider owns credentials. | +| Auth — service accounts (backlog) | Bearer API keys (`Authorization: Bearer `); keys hashed with Argon2id (passlib) at rest | Not present through GA v1. Per-key role + scopes + expiration; revocation via `revoked_at`. 
| | Testing | pytest + pytest-asyncio + pytest-mock + pytest-recording | `pytest-recording` cassettes are checked in for every external HTTP integration. | | Coverage | coverage.py | CI gate: 80% backend Python (MVP1) → 90% (GA v1). | | Linter / formatter | ruff (`check` + `format`) | Replaces flake8 + isort + black. | @@ -74,8 +71,8 @@ This is the source-of-truth release matrix that every other arch doc derives fro |---|---|---| | Database (app) | Postgres 16 | Single instance. Holds app state + Optuna RDBStorage. | | Cache / queue | Redis 7 | Used by Arq for the worker queue. | -| Search engines (targets) | Elasticsearch 8.11+ / 9.x; OpenSearch 2.x / 3.x | Lucidworks Fusion (MVP3) and Solr (v2+) are NOT in MVP1. | -| Reverse proxy | Caddy 2 | NOT in MVP1. **MVP3** adds Caddy + Let's Encrypt TLS for production-style network exposure (no SSO yet — trusted-network deployments only). **MVP4** adds oauth2-proxy or Authelia in front of Caddy for SSO. | +| Search engines (targets) | Elasticsearch 8.11+ / 9.x; OpenSearch 2.x / 3.x (MVP1); Apache Solr 9.x / 10.x (MVP2) | Lucidworks Fusion explicitly dropped (see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md)). | +| Reverse proxy | Caddy 2 | NOT in MVP1. Production-style install (TLS via Caddy + Let's Encrypt) lands as GA v1 hardening for trusted-network deployments. SSO (oauth2-proxy or Authelia in front of Caddy) is in the backlog with multi-tenancy. | | Trace storage (LLM) | ClickHouse 24 | NOT in MVP1 (Langfuse is MVP2+). | | Container runtime | Docker 24+ with Compose v2 | MVP1 deployment target. | | Helm chart | Helm 3 | NOT in MVP1 (v1.5+). | @@ -129,7 +126,7 @@ This is the source-of-truth release matrix that every other arch doc derives fro - snake_case table and column names. - JSONB for flexible structured fields (settings, params, metrics, payloads). - All foreign keys explicit; no implicit relationships. 
-- Indexes on `(tenant_id, created_at)` for tenant-scoped tables — **MVP4+ only**; MVP1–3 has no `tenant_id` column. +- Indexes on `(tenant_id, created_at)` for tenant-scoped tables — **backlog only**; RelyLoop is single-tenant through GA v1 with no `tenant_id` column. ### Logging conventions @@ -142,17 +139,17 @@ This is the source-of-truth release matrix that every other arch doc derives fro - Mounted secret files only — never set in environment variables. - Source of truth: 1Password / Vault / SSM / equivalent (operator's choice). -- API keys hashed with Argon2id at rest — **MVP4+** (no auth in MVP1). +- API keys hashed with Argon2id at rest — **backlog** (no auth through GA v1). - For MVP1: `.env.example` enumerates every secret; `.env` is gitignored; Docker secrets mount each value as a file inside the container. ## Reserved for later releases These appear in the umbrella spec because the spec covers all releases. None of them are MVP1 work. Per-release timing per the §"Canonical release matrix" above: -- **MVP2:** Langfuse + ClickHouse + SigNoz + OpenTelemetry exporters; canonical event catalog; `audit_log` table + immutability trigger (no users/tenants yet); lineage columns; PII redaction; trace context propagation through DB/Redis/worker/adapter/engine. -- **MVP3:** Lucidworks Fusion adapter; multi-Git-provider abstraction (GitLab, Bitbucket); production-style install (TLS via Caddy + Let's Encrypt, managed Postgres/Redis); AWS managed OpenSearch. -- **MVP4:** Multi-tenancy (`tenants`, `tenant_memberships`, `users`, `api_keys` tables; `tenant_id` columns); SSO via reverse proxy for humans; Argon2id-hashed bearer API keys for service accounts; roles `viewer/runner/tenant_admin/platform_admin`; multi-LLM provider abstraction (Anthropic, AWS Bedrock, Google Vertex, Ollama, vLLM); LangChain `RedisCache` for LLM responses. 
-- **GA v1:** LangGraph orchestrator + `PostgresSaver`; full RFC 7807 Problem Details on errors; `Idempotency-Key` header; Helm chart; container scanning (Trivy); deps audit (pip-audit/npm audit); image signing (cosign); 90% backend coverage gate. +- **MVP2 (Three-Engine + Real Signals):** Apache Solr adapter (Solr 9.x + 10.x; `edismax` + `{!ltr}` rescore; `solr.UBIComponent` support); UBI judgments via engine-agnostic `UbiReader`; pluggable `SignalsConverter` Protocol (CTR threshold, dwell-time, hybrid UBI+LLM); `POST /api/v1/judgment-lists/generate-from-ubi` + `generate_judgments_from_ubi` agent tool. +- **MVP3 (Observable):** Langfuse + ClickHouse + SigNoz + OpenTelemetry exporters; canonical event catalog; `audit_log` table + immutability trigger (no users/tenants yet); lineage columns; PII redaction; trace context propagation through DB/Redis/worker/adapter/engine. +- **GA v1 (Production-ready):** LangGraph orchestrator + `PostgresSaver`; full RFC 7807 Problem Details on errors; `Idempotency-Key` header; full four-layer test pyramid at 90% coverage; complete CI/CD with security gates (Trivy, bandit, pip-audit, npm audit); image signing (cosign); production-style install (TLS via Caddy + Let's Encrypt, managed Postgres/Redis); design-partner references; public Optuna-vs-SRW-grid benchmark. **No new product surface** — all six differentiators are GA by MVP3. +- **Backlog:** Multi-Git provider abstraction (GitLab, Bitbucket); multi-tenancy (`tenants`, `tenant_memberships`, `users`, `api_keys` tables; `tenant_id` columns; roles `viewer/runner/tenant_admin/platform_admin`); SSO via reverse proxy for humans; Argon2id-hashed bearer API keys for service accounts; native non-OpenAI provider SDKs (Anthropic, AWS Bedrock, Google Vertex AI, Azure OpenAI); LangChain `RedisCache` for LLM responses; Helm chart maturity; Kubernetes-native operator; LTR training; Path B (production monitoring, bandits, shadow validation); Lucidworks Fusion adapter (explicitly dropped). 
- **Out of scope (no scheduled release):** Mobile UI, i18n, WCAG AA gating, Kubernetes-native operator, multi-region. ## Cross-references diff --git a/docs/01_architecture/ui-architecture.md b/docs/01_architecture/ui-architecture.md index 048dccfb..4d07323b 100644 --- a/docs/01_architecture/ui-architecture.md +++ b/docs/01_architecture/ui-architecture.md @@ -1,7 +1,7 @@ # UI Architecture **Status:** Adopted for MVP1. Next.js 16 App Router (React 19, Turbopack) + shadcn/ui + Tailwind 4 (CSS-first) + TanStack Query + Vitest 4. Per-screen feature specs (`feat_studies_ui`, `feat_proposals_ui`, `feat_chat_agent`) implement the patterns documented here. Stack bumped from Next 14 / React 18 / Tailwind 3 / Vitest 2 on 2026-05-12 via `infra_frontend_stack_refresh` (the placeholder UI was the optimal upgrade window before `feat_studies_ui` adds component volume). -**Source of truth for product context:** [docs/00_overview/product/relevance-copilot-spec.md §22](../00_overview/product/relevance-copilot-spec.md) ("UI screens") and §28 ("Frontend stack"). +**Source of truth for product context:** [docs/00_overview/relyloop-spec.md §22](../00_overview/relyloop-spec.md) ("UI screens") and §28 ("Frontend stack"). --- diff --git a/docs/02_product/mvp1-user-stories.md b/docs/02_product/mvp1-user-stories.md index cffbbe0b..53de666d 100644 --- a/docs/02_product/mvp1-user-stories.md +++ b/docs/02_product/mvp1-user-stories.md @@ -3,11 +3,11 @@ **Status:** Source-of-truth user-story enumeration for MVP1 ("The Loop"). Each story is referenced by ID (`US-N`) from the matching feature_spec.md in `planned_features//`. 
**Source material:** -- Umbrella spec [§6 Personas & user stories](../00_overview/product/relevance-copilot-spec.md) (lines 85–100) — system-level stories -- Umbrella spec [§27 MVP1 scope](../00_overview/product/relevance-copilot-spec.md) (lines 2286–2322) — in-scope capabilities +- Umbrella spec [§6 Personas & user stories](../00_overview/relyloop-spec.md) (lines 85–100) — system-level stories +- Umbrella spec [§27 MVP1 scope](../00_overview/relyloop-spec.md) (lines 2286–2322) — in-scope capabilities - Umbrella spec §8, §12, §14, §15, §16, §19, §22 — capability detail -**Scope boundary:** MVP1 only. Stories that depend on later-release capabilities (Langfuse → MVP2; Lucidworks Fusion + GitLab/Bitbucket → MVP3; multi-tenant + multi-LLM provider abstraction + SSO + API keys → MVP4; LangGraph state graph + subagents + PostgresSaver → GA v1) are explicitly out of scope and live in their respective release plans. See [`docs/01_architecture/tech-stack.md` §"Canonical release matrix"](../01_architecture/tech-stack.md) for the source of truth. +**Scope boundary:** MVP1 only. Stories that depend on later-release capabilities (Apache Solr adapter + UBI judgments → MVP2; Langfuse + SigNoz + audit-log immutability → MVP3; LangGraph state graph + subagents + PostgresSaver + production-style install → GA v1; multi-Git providers, multi-tenant, multi-LLM, LTR training → Backlog) are explicitly out of scope and live in their respective release plans. Lucidworks Fusion was previously in MVP3 scope but is now explicitly dropped — see [`chore_drop_fusion_scope/idea.md`](planned_features/chore_drop_fusion_scope/idea.md). See [`docs/01_architecture/tech-stack.md` §"Canonical release matrix"](../01_architecture/tech-stack.md) for the source of truth. --- @@ -123,13 +123,14 @@ For visibility — these capabilities appear in the umbrella spec but are explicitly NOT MVP1 user stories: -- **Langfuse / SigNoz observability dashboards** → MVP2 (per §27 line 2308). 
-- **Multi-LLM provider abstraction** (Anthropic, Bedrock, Ollama, vLLM) → MVP4 (per §27 line 2297). -- **GitLab / Bitbucket** as Git providers → MVP3 (per §27 line 2298). -- **Lucidworks Fusion** as an engine adapter → MVP3 (per umbrella §27 — "Production Stacks"). -- **Multi-tenant** (`tenants` table, `tenant_id` scoping) → MVP4 (per §27 lines 2299–2300). +- **Apache Solr adapter + UBI judgments + hybrid UBI+LLM converter** → MVP2 (per umbrella §27 — "Three-Engine + Real Signals"). +- **Langfuse / SigNoz observability dashboards** → MVP3 (per umbrella §27 — "Observable"). +- **Multi-LLM provider abstraction** (native non-OpenAI SDKs: Anthropic, Bedrock, Vertex, Azure OpenAI) → Backlog. OpenAI-compatible endpoints (Ollama, LM Studio, vLLM, TGI) already work in MVP1 via `OPENAI_BASE_URL`. +- **GitLab / Bitbucket** as Git providers → Backlog (was MVP3 in the prior plan; promoted out when an adopter on a non-GitHub provider commits to evaluating). +- **Lucidworks Fusion** as an engine adapter → **Dropped** — see [`chore_drop_fusion_scope/idea.md`](planned_features/chore_drop_fusion_scope/idea.md). +- **Multi-tenant** (`tenants` table, `tenant_id` scoping) → Backlog (was MVP4 in the prior plan). - **LangGraph state graph + subagents + `PostgresSaver`** → GA v1 per [`docs/01_architecture/tech-stack.md` §"Canonical release matrix"](../01_architecture/tech-stack.md). MVP1 uses plain `openai` SDK + function calling. -- **Auth / RBAC** (`viewer` / `runner` / `tenant_admin` / `platform_admin` role enforcement; SSO via reverse proxy; bearer API keys) → MVP4 per umbrella §18. +- **Auth / RBAC** (`viewer` / `runner` / `tenant_admin` / `platform_admin` role enforcement; SSO via reverse proxy; bearer API keys) → Backlog per umbrella §18 (was MVP4). - **Forking studies with narrowed search-space ranges** (top story #4 from §6) → MVP2. - **Pairwise quick-experiment tool** (`run_pairwise`) → MVP2 nice-to-have, not required for MVP1 loop. 
- **Slack notifications on PR open** (top story #3 from §6) → MVP2. diff --git a/docs/02_product/planned_features/chore_drop_fusion_scope/idea.md b/docs/02_product/planned_features/chore_drop_fusion_scope/idea.md new file mode 100644 index 00000000..7a157b99 --- /dev/null +++ b/docs/02_product/planned_features/chore_drop_fusion_scope/idea.md @@ -0,0 +1,117 @@ +# Drop Lucidworks Fusion from the engine roadmap + +**Date:** 2026-05-27 +**Status:** Idea — scope decision, paired with [`infra_adapter_solr`](../infra_adapter_solr/idea.md) +**Priority:** P1 — gates the umbrella spec rewrite and the MVP2 release-theme rename +**Origin:** Positioning reframe on 2026-05-27. Triggered by the competitive analysis vs OpenSearch Search Relevance Workbench (see [`docs/07_research/comparison.md`](../../../07_research/comparison.md)) which surfaced that the "engine-neutral" pitch is the strongest moat — but only if the engines RelyLoop supports are the three open-source engines (ES, OpenSearch, Solr) that the OSC/Haystack community treats as canonical. Fusion as a fourth engine adds vendor entanglement without strengthening the moat. +**Depends on:** None — this is a documentation decision. The Fusion adapter was never implemented (Fusion was MVP3 scope per the prior release matrix; only design references existed). + +## Problem + +The prior umbrella spec ([`docs/00_overview/relyloop-spec.md`](../../../00_overview/relyloop-spec.md)) planned Lucidworks Fusion as the MVP3 engine target and Apache Solr as a v2+ "architectural reference, not v1 scope" addition. After the 2026-05-27 reframe, this ordering is reversed and compressed: + +- **Solr is promoted to MVP2**, bundled with UBI judgments (see [`infra_adapter_solr`](../infra_adapter_solr/idea.md) + [`feat_ubi_judgments`](../feat_ubi_judgments/idea.md)). +- **Fusion is dropped entirely** — this idea documents why. 
+- **Multi-Git provider abstraction (GitLab, Bitbucket) is moved to the backlog** — was previously bundled with Fusion in the prior MVP3 release. + +The reasons are stack-ranked below from most to least decisive. + +### 1. Fusion doesn't strengthen the engine-neutral moat + +The competitive analysis vs OpenSearch SRW ([`docs/07_research/comparison.md`](../../../07_research/comparison.md)) identifies the defensible moat as "Bayesian/TPE optimization across the full query-time search space, on every major open-source engine, with a Git-PR apply path." SRW is OpenSearch-only by architecture; Elasticsearch has no SRW equivalent (deprecated Behavioral Analytics + Search Applications in 9.0); Solr's ecosystem (Quepid + Chorus + RRE) is mature for manual evaluation but has no auto-optimizer. + +The three engines that complete the OSS sweep are ES + OpenSearch + Solr. Fusion is a commercial layer on top of Solr — supporting it doesn't extend the engine-neutral claim, it just adds a vendor-specific surface. + +### 2. Fusion creates vendor entanglement + +The original spec called out at §29 #12 ("Lucidworks eval license policy for engineers") that hands-on Fusion access requires a Lucidworks evaluation license, with three options ranging from a shared team license to per-engineer 30-day evals. Every contributor touching the adapter needs license logistics. The replay-cassette infrastructure for offline tests was a separate maintenance burden (recording cassettes, refreshing them on Fusion version upgrades, owning the `fusion-mock` service). + +None of this overhead applies to Solr — Apache 2.0 image runs locally in Compose with no licensing. + +### 3. Fusion's audience overlap with the Quepid/Chorus community is smaller than Solr's + +The natural early-adopter community for RelyLoop is the OSC + Sease + Querqy + Haystack ecosystem — the people who already run query sets and judgment lists for a living. 
Their primary engine, by a wide margin, is Apache Solr (Quepid was Solr-first; Chorus is Solr-centric; RRE was originally Solr-only). Fusion's audience is enterprise platform teams who chose Lucidworks as a vendor — overlapping but smaller, and disproportionately concentrated in industries (large e-commerce, government) where the design-partner conversation is longer. + +### 4. Fusion adapter cost was material + +The prior §27 estimated MVP3 at +3 weeks for "Lucidworks Fusion adapter + multi-Git-provider abstraction." The Fusion adapter alone was estimated at substantially more than the Solr adapter (which is ~2–3 engineer-weeks per the [Solr ecosystem research](../../../07_research/comparison.md) — see also [`infra_adapter_solr/idea.md`](../infra_adapter_solr/idea.md) scope signals): + +- Fusion's query API is fundamentally different from ES/Solr Query DSL — pipeline-based, with per-stage parameter overrides. The adapter's `render` path is ~2× the complexity of the Solr adapter's edismax rendering. +- Fusion's auth model (session cookies, JWT, the session pool) is its own thing. +- The two-step apply path (PR edits pipeline params + CI runs `objects-import`) is more complex than the Solr-side single-step (PR edits `*.params.json`, CI runs `bin/post` or `solrconfig.xml` swap). +- Fusion's `*_signals` collection has a different schema from UBI, requiring a Fusion-specific reader feeding the `SignalsConverter` Protocol. Solr uses `solr.UBIComponent` with the standard UBI schema — no Solr-specific reader needed. + +Dropping Fusion + deferring multi-Git makes room in MVP2 for the Solr + UBI bundle (~4–5 engineer-weeks combined; see [`infra_adapter_solr/idea.md`](../infra_adapter_solr/idea.md) §"Why bundled with UBI into MVP2"). The four big-ticket Fusion items (adapter, signals reader, replay cassettes, mock service) are gone outright; the multi-Git work is captured separately in the backlog so it's not lost. + +### 5. 
Path B (future production-monitoring + bandits) doesn't need Fusion either + +The v2 Path B roadmap in the original spec called out "Fusion Experiments integration" as one Path B candidate. After this drop, that candidate is gone. The remaining Path B candidates (production quality monitoring via signal streams, bandit-style online learning, shadow validation, manual one-click rollback) are all engine-agnostic and work on ES/OpenSearch/Solr equally. + +## Proposed action + +This is a documentation-only change. No code is touched (the Fusion adapter was never implemented). + +### Files to update + +1. **`docs/00_overview/relyloop-spec.md`** (~110 Fusion mentions): + - §1 Summary — remove Fusion from the engine list; add Solr alongside ES + OpenSearch + - §6 Personas — drop Fusion-specific references + - §8 Engine adapter specification — delete the `LucidworksFusionAdapter notes` subsection; promote the `SolrAdapter notes` subsection from "architectural reference" to a concrete MVP2 plan; drop the Fusion column from the cross-engine parameter table + - §14 Evaluation — remove "Fusion Signals" subsection; the engine-native signals reader for Fusion is gone + - §16 Apply path — remove the Fusion-specific two-step apply path; the Solr apply path matches ES (single-step PR edit) + - §17 Multi-cluster — remove Fusion-specific cluster examples + - §22 UI screens — drop Fusion-specific config-repo conventions + - §25 Deployment — drop the Fusion eval-license appendix; remove `fusion-mock` from the Compose plan + - §27 Phased delivery — full release-matrix rewrite: MVP2 becomes "Three-Engine + Real Signals" (Solr adapter + UBI judgments, bundled); MVP3 becomes "Observable" (was MVP2 in the prior plan); GA v1 becomes mostly polish + governance + hardening over MVP3; multi-Git + multi-tenant + multi-LLM (prior MVP3 + MVP4 scope) moved to backlog; remove "Fusion Experiments integration" from v2 Path B + - §28 Tech stack — drop Fusion-related entries + - §29 Comparison + Open 
questions — drop Lucidworks-eval-license question, Fusion-cassette-refresh question, Fusion-pipeline-forking-strategy question, Fusion-app/collection-scoping question, mock-Fusion-fidelity-scope question + +2. **`docs/01_architecture/adapters.md`** (~18 mentions): + - Remove `lucidworks_fusion` from the `engine_type` Protocol literal + - Remove Fusion column from the cross-engine parameter table; promote Solr column to first-class MVP2 status + - Drop the `stage_enabled` unified-vocabulary parameter (Fusion-only) + - Remove the line about future `backend/app/adapters/fusion.py` + +3. **`docs/01_architecture/tech-stack.md`** (4 mentions): + - Update release matrix: MVP2 = "Three-Engine + Real Signals" (Solr adapter + UBI judgments); MVP3 = "Observable" (was MVP2 in the prior plan); GA = polish + governance + hardening; multi-Git + multi-tenant + multi-LLM moved to backlog; v2+ no longer lists Apache Solr + +4. **`CLAUDE.md`** (3 mentions): + - Update project overview blurb to list ES + OpenSearch + Solr (not Fusion) + - Update release matrix + +5. **`README.md`** (1 mention): + - Update headline pitch and "key design choices" + +6. **`architecture.md`** (1 mention): + - Layer 1 adapter description: drop Fusion, add Solr + +7. **`state.md`** — capture the release-matrix reshuffle (MVP2 scope, MVP3 renumber, MVP4 → backlog) + +8. **Smaller docs** — `optimization.md`, `system-overview.md`, `agent-tools.md`, `mvp1-overview.md`, `deployment.md`, `apply-path.md`, `mvp1-user-stories.md` — 1–3 prune each + +### Forward-only + +Per the project's forward-only documentation stance, the Fusion sections are deleted outright, not commented out or kept as "deprecated." The git history is the audit trail; future readers find this idea file for the rationale. + +## Scope signals + +- **Backend:** zero LOC. No Fusion adapter ever existed; no code to remove. +- **Frontend:** zero LOC. No Fusion-specific UI ever shipped. +- **Migration:** none. +- **Config:** none. 
+- **Audit events:** N/A. +- **Tests:** none — no Fusion test coverage to remove. +- **Documentation:** ~120 Fusion mentions across ~14 files. All deletions or rewrites, no additions beyond what `infra_adapter_solr/idea.md` adds. + +## Why drop, not defer + +Deferring Fusion to v2+ would carry the architectural surface (the `lucidworks_fusion` engine_type literal, the Fusion column in the parameter table, the Fusion-specific apply path documentation) forward indefinitely. Future contributors would read the spec, see Fusion, and assume it's the plan. Documentation-as-aspiration rots fastest. + +Dropping outright makes the spec truthful: RelyLoop supports the three open-source engines and does not have a roadmap commitment to commercial engines. If a Fusion adopter materializes later with a real workload, the adapter Protocol shape makes contributing a community adapter straightforward — but the project isn't owning that direction. + +## Relationship to other work + +- **Paired with [`infra_adapter_solr`](../infra_adapter_solr/idea.md)** — Solr fills the MVP2 engine slot Fusion is vacating. +- **Triggered by the reframe in [`docs/07_research/comparison.md`](../../../07_research/comparison.md)** — that doc names the moat (Bayesian + Git-PR + all three OSS engines); this doc executes the engine-list cleanup. +- **Coordinates with the spec §27 revision** that compresses the release matrix to three pre-GA stops (MVP1 shipped → MVP2 Three-Engine + Real Signals → MVP3 Observable → GA v1 polish). +- **Does NOT block UBI on Solr** — the `solr.UBIComponent` writes the standard UBI schema; the MVP2 UBI reader works against Solr unchanged because both ship in the same release. 
diff --git a/docs/02_product/planned_features/feat_ubi_judgments/idea.md b/docs/02_product/planned_features/feat_ubi_judgments/idea.md
index a005d573..102fb340 100644
--- a/docs/02_product/planned_features/feat_ubi_judgments/idea.md
+++ b/docs/02_product/planned_features/feat_ubi_judgments/idea.md
@@ -1,20 +1,20 @@
-# UBI Judgments — make OpenSearch User Behavior Insights a first-class judgment source
+# UBI Judgments — engine-neutral User Behavior Insights as a first-class judgment source
 
-**Date:** 2026-05-22
-**Status:** Idea — anchor feature for MVP1.5 / v0.1.5 "Real Signals"
-**Priority:** P1 — MVP1.5 is named for this capability; nothing else in that release ships without it.
-**Origin:** Reframing prompted by an external review on 2026-05-22 (LinkedIn outreach to a senior search engineer at a relevance-tooling company who pushed back on LLM-as-judge as the only authoritative judgment source for v1). Cross-checked against [`docs/00_overview/product/relevance-copilot-spec.md`](../../../00_overview/product/relevance-copilot-spec.md) §14 — the existing spec anticipated click-derived judgments but framed them per-engine without naming UBI's standardized cross-engine schema. This idea consolidates that surface around the OpenSearch UBI plugin as the engine-neutral primary path.
-**Depends on:** MVP1 shipped (specifically: [`judgments`](../../../../backend/app/db/models/judgment.py) + [`judgment_lists`](../../../../backend/app/db/models/judgment_list.py) tables, [`ElasticAdapter`](../../../../backend/app/adapters/elastic.py) with `SearchAdapter.search_batch`, [`generate_judgments_llm`](../../../../backend/workers/judgments.py) agent tool pattern). All prerequisites are in `main` as of 2026-05-23.
+**Date:** 2026-05-22 (refreshed 2026-05-27 for the positioning reframe)
+**Status:** Idea — bundled with [`infra_adapter_solr`](../infra_adapter_solr/idea.md) into MVP2 / v0.2 "Three-Engine + Real Signals"
+**Priority:** P1 — MVP2 ships UBI + Solr together; the hybrid UBI+LLM converter is the differentiated capability vs OpenSearch SRW (which has UBI-via-COEC GA but no hybrid mode, no full-search-space Bayesian optimizer to feed). See [`docs/07_research/comparison.md`](../../../07_research/comparison.md) for the citation-backed competitive position.
+**Origin:** Originally prompted by an external review on 2026-05-22 (LinkedIn outreach to a senior search engineer at a relevance-tooling company who pushed back on LLM-as-judge as the only authoritative judgment source). The 2026-05-27 reframe bundled this work with the Solr adapter into one MVP2 release because Solr's first-party `solr.UBIComponent` writes the same UBI schema as the OpenSearch UBI plugin — UBI on Solr is free once the adapter ships, and the combined release tells the engine-neutral story coherently.
+**Depends on:** MVP1 shipped (`judgments` + `judgment_lists` tables, `ElasticAdapter` with `SearchAdapter.search_batch`, `generate_judgments_llm` agent tool pattern). Co-ships with [`infra_adapter_solr`](../infra_adapter_solr/idea.md) in MVP2.
 
 ## Problem
 
-MVP1 ships with **LLM-as-judge** as the only authoritative judgment source. The architecture anticipated this would change — the `judgments.source` CHECK already accepts `click` ([`backend/app/db/models/judgment.py:42-48`](../../../../backend/app/db/models/judgment.py#L42-L48)), and judgment lists can mix sources by design ([umbrella spec §14 line 719](../../../00_overview/product/relevance-copilot-spec.md)). But the actual reader, converter, and ingestion endpoint have never been built.
+MVP1 ships with **LLM-as-judge** as the only authoritative judgment source. The architecture anticipated this would change — the `judgments.source` CHECK already accepts `click` ([`backend/app/db/models/judgment.py:42-48`](../../../../backend/app/db/models/judgment.py#L42-L48)), and judgment lists can mix sources by design ([umbrella spec §14 line 719](../../../00_overview/relyloop-spec.md)). But the actual reader, converter, and ingestion endpoint have never been built.
 
 This leaves three unsolved gaps for operators with production search traffic:
 
 1. **LLM-as-judge is a weaker trust anchor than real user behavior.** For e-commerce, content discovery, and any surface where user intent is the source of truth, ratings derived from clicks + dwell + conversions reflect what users *find* relevant, not what an LLM *guesses* should be relevant. The optimization loop's quality ceiling is the judgment list's quality; replacing the ceiling is the single biggest believability upgrade RelyLoop can ship.
 2. **Judgment-list scale and freshness are bounded.** LLM-as-judge produces hundreds to low thousands of (query, doc) ratings per call (rate-limited, cost-bounded). The 80/20 long tail of queries users actually issue never gets rated. Each new study reuses a snapshot judgment list that goes stale; there's no continuous-refresh path.
-3. **UBI is the standardized schema, and OpenSearch is the MVP1 engine target.** The OpenSearch UBI plugin (shipped 2024, championed by Eric Pugh / OpenSource Connections — the same team behind Quepid and the Haystack conference) writes two standardized indices into the cluster RelyLoop is already adapting: `ubi_queries` and `ubi_events`. The integration friction is unusually low — RelyLoop reads two indices in a cluster it already talks to, no new infrastructure on either side. The current spec framing (engine-specific `pull_signals` adapter methods, Fusion Signals at v1.5, ES Behavioral Analytics at v2) under-uses this standardization.
+3.
**UBI is the standardized schema across all three OSS engines.** The UBI plugin (championed by Eric Pugh / OpenSource Connections — the same team behind Quepid and the Haystack conference) writes two standardized indices: `ubi_queries` and `ubi_events`. The OpenSearch UBI plugin shipped in 2024 ([opensearch-project/user-behavior-insights](https://github.com/opensearch-project/user-behavior-insights), Apache 2.0). The o19s ES fork ([repo](https://github.com/o19s/user-behavior-insights-elasticsearch)) extends UBI to ES. Apache Solr ships UBI **first-party** as `` in core ([Solr reference guide](https://solr.apache.org/guide/solr/latest/query-guide/learning-to-rank.html); [UBI tools index](https://www.ubisearch.dev/tools/)). The integration friction is unusually low — RelyLoop reads two indices in a cluster it already talks to, on any of the three engines, with no engine-specific UBI code. ## Proposed capabilities @@ -24,20 +24,20 @@ Single-tier — small, additive, no schema migration. Five capability blocks bel - **Location:** new module `backend/app/services/ubi_reader.py` + supporting feature aggregation in `backend/app/domain/ubi/features.py`. - **Inputs:** `cluster_id`, `target` (the live index being tuned, used to disambiguate UBI events emitted from multiple applications against the same UBI indices), `since` / `until` window, optional `query_filter` (substring or exact-match), optional `max_queries` (default 5000). -- **Reads:** the standardized `ubi_queries` and `ubi_events` indices via `SearchAdapter.search_batch` — the engine adapter is unchanged, the reader uses two scrolling searches and a client-side join on `query_id`. No new adapter method, no Fusion-side branch. +- **Reads:** the standardized `ubi_queries` and `ubi_events` indices via `SearchAdapter.search_batch` — the engine adapter is unchanged, the reader uses two scrolling searches and a client-side join on `query_id`. No new adapter method, no engine-specific UBI code. 
- **Output:** a per-(query, doc) feature dict with click count, impression count, position-bias-corrected CTR (Wang-Bendersky correction with a configurable position-bias prior; CCM/DBN deferred to v1.5+), post-click dwell-time mean, conversion rate (where the operator emits conversion events; NULL otherwise), refinement rate. -- **Engine-agnostic by construction.** Any `SearchAdapter` that can run a `search_batch` over `ubi_queries` + `ubi_events` is supported. ES + OpenSearch both work in MVP1.5; engines added later (Fusion at MVP3, others as adapters land) work the moment their adapter ships, no UBI-specific code required. +- **Engine-agnostic by construction.** Any `SearchAdapter` that can run a `search_batch` over `ubi_queries` + `ubi_events` is supported. ES + OpenSearch both work in MVP2; engines added later (Solr ships in the same MVP2 release; future adapters land) work the moment their adapter ships, no UBI-specific code required. - **Operator-facing constraint:** the OpenSearch UBI plugin must be installed and event capture enabled in the operator's application. A capability check at endpoint entry returns 412 `UBI_NOT_ENABLED` if `ubi_queries` is absent. ### `SignalsConverter` Protocol + initial implementations - **Location:** new module `backend/app/domain/ubi/converter.py` with the Protocol + three concrete impls. - **Protocol:** `convert(features: dict[QueryDocPair, FeatureVec]) -> dict[QueryDocPair, Rating]` where `Rating` is 0–3 graded. Pure-domain, no I/O. -- **Initial implementations (MVP1.5):** +- **Initial implementations (MVP2):** - `CtrThresholdConverter` — position-bias-corrected CTR mapped to 0/1/2/3 via configurable thresholds (defaults: 0.05 / 0.15 / 0.30). Conservative, works on small-traffic clusters. - `DwellTimeThresholdConverter` — post-click dwell-time mapped to ratings. Good for content discovery / long-read surfaces where clicks alone don't separate scan-and-bounce from genuine engagement. 
- `HybridUbiLlmConverter` — UBI converter applies where `impressions >= llm_fill_threshold` (default 20); below the threshold the LLM-as-judge path runs over the (query, doc) pair and the resulting `source='llm'` row is interleaved with `source='click'` rows in the same judgment list. This is the operating mode most adopters will ship to production. -- **Deferred to v1.5+ post-GA:** `CcmConverter` and `DbnConverter` (counterfactual click models). Require enough impressions per (query, doc) to be statistically valid, which most early-MVP1.5 adopters won't have. Same Protocol — additive. +- **Deferred to v1.5+ post-GA:** `CcmConverter` and `DbnConverter` (counterfactual click models). Require enough impressions per (query, doc) to be statistically valid, which most early-MVP2 adopters won't have. Same Protocol — additive. ### API surface @@ -48,7 +48,7 @@ Single-tier — small, additive, no schema migration. Five capability blocks bel ### Agent tool - **New tool:** `generate_judgments_from_ubi(query_set_id, cluster_id, target, since, until?, converter, llm_fill_threshold?)` → `JudgmentList`. Mirrors `generate_judgments_llm` shape so the chat agent can switch between the two transparently. Listed in spec §19 Query sets & judgments alongside `generate_judgments_llm`. -- **System prompt update:** the orchestrator's tool description for "generate a judgment list" now prefers UBI when the operator's cluster has UBI enabled (detected via a one-shot `get_schema` probe for the `ubi_queries` index), and falls back to LLM-as-judge otherwise. This is the chat ergonomic that earns the MVP1.5 release name. +- **System prompt update:** the orchestrator's tool description for "generate a judgment list" now prefers UBI when the operator's cluster has UBI enabled (detected via a one-shot `get_schema` probe for the `ubi_queries` index), and falls back to LLM-as-judge otherwise. This is the chat ergonomic that earns the MVP2 release name. 
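
The converter rules in this section can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names `ctr_to_rating` and `judgment_source` are hypothetical, and treating each threshold as an inclusive lower bound is an assumption the text does not settle. The default thresholds (0.05 / 0.15 / 0.30) and the default `llm_fill_threshold` of 20 are the values quoted in the bullets.

```python
from bisect import bisect_right
from typing import Sequence

# Defaults quoted in the converter description.
DEFAULT_CTR_THRESHOLDS: tuple[float, float, float] = (0.05, 0.15, 0.30)
DEFAULT_LLM_FILL_THRESHOLD = 20


def ctr_to_rating(
    ctr: float, thresholds: Sequence[float] = DEFAULT_CTR_THRESHOLDS
) -> int:
    """Map a position-bias-corrected CTR to a 0-3 grade.

    Assumption: each threshold is an inclusive lower bound, so a CTR of
    exactly 0.05 already earns a rating of 1.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    # The number of thresholds at or below `ctr` is the grade.
    return bisect_right(thresholds, ctr)


def judgment_source(
    impressions: int, llm_fill_threshold: int = DEFAULT_LLM_FILL_THRESHOLD
) -> str:
    """Hybrid routing: enough impressions uses clicks, otherwise the LLM."""
    return "click" if impressions >= llm_fill_threshold else "llm"
```

Keeping both rules as pure functions matches the "pure-domain, no I/O" constraint on the Protocol, so they can be unit-tested without a cluster.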
### Operator-facing documentation @@ -61,7 +61,7 @@ Single-tier — small, additive, no schema migration. Five capability blocks bel - **Frontend:** ~150 LOC — extend the judgment-generation modal (`ui/src/components/judgments/create-judgment-modal.tsx` or whatever sibling shape lands by then) with a "source: LLM | UBI | Hybrid" picker + UBI window controls; new empty-state on the judgment-list detail page when the converter dropped some pairs as insufficient-data. - **Migration:** **none.** UBI rides the existing `judgments` table; the `source IN ('llm', 'human', 'click')` CHECK already accepts the new value. Alembic head unchanged at whatever MVP1 ships. - **Config:** one new optional env var `UBI_POSITION_BIAS_PRIOR_FILE` for operators who want to override the default Wang-Bendersky prior with a learned table. Default behaves like an uninformed prior. -- **Audit events:** N/A (MVP1.5 still pre-`audit_log`; that surface activates at MVP2). +- **Audit events:** N/A (MVP2 still pre-`audit_log`; that surface activates at MVP3). - **Tests:** - Unit: converter math (CTR thresholds, dwell-time thresholds, hybrid routing), feature aggregation, position-bias correction edge cases (zero impressions, single-impression queries, NULL dwell) - Integration: end-to-end `POST /api/v1/judgment-lists/generate-from-ubi` against a stubbed `UbiReader` that returns canned feature vectors; mixed-source judgment list round-trip (INSERT + SELECT + calibration roll-up) @@ -70,16 +70,16 @@ Single-tier — small, additive, no schema migration. Five capability blocks bel ## Why not implemented inline in MVP1 -1. **MVP1 is sized to demonstrate the loop, not to maximize judgment quality.** Adding UBI inline doubles the judgment-source code path before the LLM-as-judge path has been proven against real adopter feedback. Shipping LLM-only first lets MVP1 stay focused on the optimization-loop value prop; MVP1.5 then earns the trust upgrade for operators with traffic. -2. 
**Converter strategy benefits from MVP1 adopter feedback.** Position-bias priors, dwell-time thresholds, and the LLM-fill cutoff are all judgment calls that get sharper after watching adopters run MVP1's LLM-as-judge against their real data. Building MVP1.5 against MVP1 adopter signal is meaningfully cheaper than building it speculatively. +1. **MVP1 is sized to demonstrate the loop, not to maximize judgment quality.** Adding UBI inline doubles the judgment-source code path before the LLM-as-judge path has been proven against real adopter feedback. Shipping LLM-only first lets MVP1 stay focused on the optimization-loop value prop; MVP2 then earns the trust upgrade for operators with traffic. +2. **Converter strategy benefits from MVP1 adopter feedback.** Position-bias priors, dwell-time thresholds, and the LLM-fill cutoff are all judgment calls that get sharper after watching adopters run MVP1's LLM-as-judge against their real data. Building MVP2 against MVP1 adopter signal is meaningfully cheaper than building it speculatively. 3. **No schema migration is required to wait.** The `judgments.source` enum, the mixed-source judgment list contract, and the `SignalsConverter` Protocol shape were designed for this upgrade from day one. Delaying ships nothing important earlier; rushing ships a less-tuned converter. -4. **Strategic positioning.** Naming a dedicated MVP1.5 "Real Signals" release for UBI signals that UBI is a first-class direction — relevant for adoption in the OSC community where UBI was incubated, and for design partners who'd otherwise discount RelyLoop as an LLM-only tuning toy. Burying UBI in MVP2 "Observable" or MVP3 "Production Stacks" misses that positioning. +4. **Strategic positioning.** Naming a dedicated MVP2 "Real Signals" release for UBI signals that UBI is a first-class direction — relevant for adoption in the OSC community where UBI was incubated, and for design partners who'd otherwise discount RelyLoop as an LLM-only tuning toy. 
Burying UBI in a later observability or hardening release misses that positioning. ## Relationship to other work -- **Cleans up [`docs/00_overview/product/relevance-copilot-spec.md`](../../../00_overview/product/relevance-copilot-spec.md) §14 + §19 + §27** — the spec previously framed click data as a per-engine adapter concern with engine-specific timelines. The §14 patch (landing with this idea) re-anchors the architecture around the engine-neutral OpenSearch UBI schema, with engine-native readers (Elastic Behavioral Analytics, the Fusion `{app}_signals` collection, etc.) as thin extensions feeding the same `SignalsConverter` Protocol. +- **Cleans up [`docs/00_overview/relyloop-spec.md`](../../../00_overview/relyloop-spec.md) §14 + §19 + §27** — the spec previously framed click data as a per-engine adapter concern with engine-specific timelines. The §14 patch (landing with this idea) re-anchors the architecture around the engine-neutral UBI schema (which works across all three OSS engines via their respective UBI implementations), with engine-native readers (Elastic Behavioral Analytics, etc.) as thin extensions feeding the same `SignalsConverter` Protocol when an adopter needs them. - **Composes with [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md)** — auto-chained follow-up studies become dramatically more useful with a continuously-refreshed UBI judgment list than with a snapshot LLM-as-judge list. The two features are complementary; UBI ships first. - **Composes with [`feat_pr_metric_confidence`](../../../00_overview/implemented_features/2026_05_21_feat_pr_metric_confidence/)** (shipped 2026-05-21) — the confidence framing in the PR body becomes meaningfully stronger when "the metric was scored against 50,000 UBI-derived ratings covering 90% of last week's traffic" replaces "the metric was scored against 500 LLM ratings against a snapshot query set." 
- **Composes with [`feat_study_baseline_trial`](../feat_study_baseline_trial/idea.md) + [`feat_config_repo_baseline_tracking`](../feat_config_repo_baseline_tracking/idea.md)** — once UBI is the judgment source, "the baseline metric on the live config" becomes a meaningful absolute number rather than a synthetic LLM-rated approximation. Materially raises the credibility of every winning trial. - **Does NOT block MVP3 "Observable"** — Langfuse and SigNoz instrumentation can layer on top of `generate_judgments_from_ubi` exactly as it would on top of `generate_judgments_llm`. The `langfuse_trace_id` lineage column landing at MVP3 will be NULL for `source='click'` rows (which never invoke an LLM) and populated for `source='llm'` rows in the hybrid case — same column, source-dependent fill. -- **Does NOT block later engine work** — the MVP1.5 `SignalsConverter` Protocol is engine-agnostic. New adapters added in later releases contribute their own engine-native reader (where they have one) feeding the same Protocol; the converter library and the API surface are unchanged regardless of which engines ship. +- **Does NOT block later engine work** — the MVP2 `SignalsConverter` Protocol is engine-agnostic. New adapters added in later releases contribute their own engine-native reader (where they have one) feeding the same Protocol; the converter library and the API surface are unchanged regardless of which engines ship. 
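
To make the "engine-agnostic Protocol" point concrete, here is a hedged Python sketch of what the `SignalsConverter` shape could look like. The `convert` signature is the one given in this document. The concrete shapes of `QueryDocPair`, `FeatureVec`, and `Rating` are assumptions, and `DwellSketchConverter` with its cutoffs is a hypothetical placeholder rather than the planned `DwellTimeThresholdConverter`.

```python
from typing import Protocol, runtime_checkable

# Assumed shapes: the document names these types but does not define them.
QueryDocPair = tuple[str, str]   # (query_id, doc_id)
FeatureVec = dict[str, float]    # e.g. {"dwell_mean_s": 42.0}
Rating = int                     # 0-3 graded


@runtime_checkable
class SignalsConverter(Protocol):
    """Pure-domain contract: features in, graded ratings out, no I/O."""

    def convert(
        self, features: dict[QueryDocPair, FeatureVec]
    ) -> dict[QueryDocPair, Rating]: ...


class DwellSketchConverter:
    """Hypothetical converter; the cutoffs are placeholders, not defaults."""

    def __init__(
        self, cutoffs_s: tuple[float, float, float] = (10.0, 30.0, 60.0)
    ) -> None:
        self.cutoffs_s = cutoffs_s

    def convert(
        self, features: dict[QueryDocPair, FeatureVec]
    ) -> dict[QueryDocPair, Rating]:
        ratings: dict[QueryDocPair, Rating] = {}
        for pair, vec in features.items():
            dwell = vec.get("dwell_mean_s")
            if dwell is None:
                continue  # no dwell signal: drop the pair as insufficient data
            # Grade = number of cutoffs the mean dwell time meets or exceeds.
            ratings[pair] = sum(dwell >= cutoff for cutoff in self.cutoffs_s)
        return ratings
```

Because the contract is structural, a reader for any engine only has to produce the feature dict; nothing in the converter refers to Elasticsearch, OpenSearch, or Solr.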
diff --git a/docs/02_product/planned_features/infra_adapter_solr/idea.md b/docs/02_product/planned_features/infra_adapter_solr/idea.md new file mode 100644 index 00000000..34928e4b --- /dev/null +++ b/docs/02_product/planned_features/infra_adapter_solr/idea.md @@ -0,0 +1,97 @@ +# Apache Solr adapter — MVP2 scope (bundled with UBI) + +**Date:** 2026-05-27 +**Status:** Idea — anchor feature for MVP2 / v0.2 "Three-Engine + Real Signals" (bundled with [`feat_ubi_judgments`](../feat_ubi_judgments/idea.md)) +**Priority:** P1 — MVP2 is named for the bundle of this adapter + UBI judgments; together they ship four of RelyLoop's six differentiators (all three OSS engines + hybrid UBI+LLM) +**Origin:** Positioning reframe on 2026-05-27 (see [`chore_drop_fusion_scope/idea.md`](../chore_drop_fusion_scope/idea.md) for the paired Fusion-drop rationale and [`docs/07_research/comparison.md`](../../../07_research/comparison.md) for the moat analysis). Replaces the previously-planned Lucidworks Fusion adapter as the next engine target. +**Depends on:** MVP1 shipped (`ElasticAdapter`, `SearchAdapter` Protocol, study lifecycle, judgment lists, PR worker). Co-released with [`feat_ubi_judgments`](../feat_ubi_judgments/idea.md): Solr's `solr.UBIComponent` writes the same `ubi_queries` + `ubi_events` schema, so the MVP2 `UbiReader` works unchanged against a Solr cluster from day one. + +## Problem + +After MVP1, RelyLoop runs against Elasticsearch and OpenSearch — but the "engine-neutral" positioning is aspirational until a third engine ships. Apache Solr is the right third engine because: + +1. **It completes the OSS-engine sweep.** Elasticsearch, OpenSearch, and Apache Solr are the three engines OSC + Sease + Querqy + the Haystack community treat as the canonical OSS search stack. Supporting all three makes the "works wherever you are" pitch verifiable rather than rhetorical. +2. 
**UBI on Solr is first-party.** Solr ships `solr.UBIComponent` in core ([Solr reference guide](https://solr.apache.org/guide/solr/latest/query-guide/learning-to-rank.html); [UBI tools index](https://www.ubisearch.dev/tools/)) using the same schema as the OpenSearch UBI plugin. The MVP2 `UbiReader` works unmodified — no Solr-specific UBI code. +3. **Quepid + Chorus user base is Solr-native.** OSC's primary reference stack is Solr-based. Operators who already run Quepid for manual relevance evaluation are the natural adopters for RelyLoop's Bayesian-loop upgrade on the same engine they already manage. +4. **LTR is stable.** Solr 10 (March 2026) ships `modules/ltr` with `LinearModel`, `MultipleAdditiveTreesModel` (XGBoost-compatible), and `NeuralNetworkModel`. The module has been stable since Solr 6 and is the de facto OSS LTR baseline outside ES native LTR ([Sease: Solr 10 LTR overview](https://sease.io/2026/03/apache-solr-10-what-is-new-for-vector-search-and-ltr.html)). + +The Lucidworks Fusion adapter that previously occupied this slot is dropped — see [`chore_drop_fusion_scope`](../chore_drop_fusion_scope/idea.md) for the rationale (vendor entanglement, narrower audience overlap with the Quepid/Chorus community, materially higher build cost). + +## Proposed capabilities + +### `SolrAdapter` implementation + +- **Location:** new module `backend/app/adapters/solr.py` implementing the `SearchAdapter` Protocol from [`backend/app/adapters/protocol.py`](../../../../backend/app/adapters/protocol.py). +- **Engine support:** Solr 9.x (current widely-deployed) + Solr 10.x (released 2026-03). SolrCloud and standalone modes both supported. Solr 8.x and earlier explicitly out of scope. +- **`search_batch`:** parallel `/select` requests with a connection pool. Solr has no `_msearch` equivalent; the JSON Request API allows multi-query but is awkward and undertested across versions. Connection pool sized via existing settings (`HTTPX_POOL_LIMITS`). 
+- **`render`:** produces a Solr request parameter dict (later URL-encoded). Supports `edismax` (primary), `dismax`, and `lucene` parsers. Templates live under `templates/solr/` as Jinja templates that emit parameter maps, mirroring `templates/elasticsearch/` shape. +- **`get_schema`:** uses Solr's Schema API (`/schema/fields`, `/schema/dynamicfields`, `/schema/fieldtypes`). Result shape matches `Schema` type unchanged. +- **`list_targets`:** uses CoresAdmin API (`/admin/cores?action=STATUS`) for standalone; CollectionsAdmin (`/admin/collections?action=LIST`) for SolrCloud. Selects automatically based on a startup capability probe. +- **`explain`:** uses `debugQuery=true&debug=results` and parses `debug.explain` from the response. +- **Authentication:** `auth_kind` extended to include `solr_basic` (HTTP Basic) and `solr_apikey` (Solr 9+ JWT through the security.json `JWTAuthPlugin`). PKI auth is internal-only and not exposed. +- **Capability probe at adapter construction:** detects Solr version, SolrCloud-vs-standalone, presence of `solr.UBIComponent`, presence of `ltr` module, and writes the result to the `clusters.engine_config` JSONB. Used by the search-space validator to reject studies that reference parameters the cluster can't honor. + +### Cross-engine parameter map (additions) + +The unified parameter vocabulary defined in [`docs/01_architecture/adapters.md` §"Cross-engine parameter naming"](../../../01_architecture/adapters.md) gets a third column. The `field_boosts` / `phrase_field_boosts` / `tie_breaker` / `min_should_match` / `slop` / `boost_fn` / `rerank_model` parameters already had Solr `edismax` mappings documented in the original spec — they become real implementation, not architectural reference. + +Solr-specific notes: + +- **`mm` syntax is richer than ES `minimum_should_match`.** Solr's `mm` accepts arithmetic expressions (`2<-25% 9<-3`); the adapter accepts unified `int | float | str` and validates against the Solr syntax server-side. 
+- **Boosts in Solr are additive (`bf`) by default; multiplicative via `boost`.** ES `function_score` defaults to multiplicative. The unified `boost_fn` parameter carries an explicit `combine: "add" | "multiply"` field; the Solr adapter renders into `bf` or `boost` respectively. +- **LTR rescoring is `{!ltr model=... reRankDocs=...}` injected as `rq=`**, not the ES `rescore.learning_to_rank` shape. The adapter handles both at the unified `rerank_model` parameter. +- **No Solr-side "pipeline stage toggle" concept.** The `stage_enabled` parameter (was Fusion-only) is removed from the unified vocabulary as part of the Fusion drop. + +### LTR rescoring + +- **In scope for MVP2:** apply a pre-existing `MultipleAdditiveTreesModel` (XGBoost-compatible) loaded via Solr's `/schema/model-store` as a rescore stage in a study trial. Training the model is out of scope (LTR training lands in v2 Path A as a cross-engine capability). +- The unified `rerank_model: {id, top_k}` parameter renders to Solr `rq={!ltr model=${id} reRankDocs=${top_k}}`. + +### UBI on Solr + +- **Bundled with [`feat_ubi_judgments`](../feat_ubi_judgments/idea.md) in the same MVP2 release.** The `UbiReader` reads `ubi_queries` + `ubi_events` collections via `SearchAdapter.search_batch` — works against any adapter that implements the Protocol. The `solr.UBIComponent` writes the same schema as the OpenSearch UBI plugin. Once both this adapter and `feat_ubi_judgments` ship in MVP2, every UBI path (`POST /api/v1/judgment-lists/generate-from-ubi`, `generate_judgments_from_ubi` agent tool, hybrid UBI+LLM converter) works on all three engines from day one. +- Operator-facing docs gain a section on enabling `solr.UBIComponent` in `solrconfig.xml` and routing search requests through it (analogous to the OpenSearch UBI plugin enablement runbook). + +### Compose service + tests + +- New Compose service `solr` (Apache 2.0 image, `solr:10`) bound to `127.0.0.1:8983`. Mirrors the existing `elasticsearch` and `opensearch` service shape. 
+- Sample collection `products` seeded from `samples/products.json` (the same dataset MVP1 uses for ES). +- Adapter unit tests under `backend/tests/unit/adapters/test_solr.py` (mocked HTTP; fast). +- Adapter integration tests under `backend/tests/integration/adapters/test_solr_live.py` against the Compose Solr service. +- Contract tests extending the existing `SearchAdapter` conformance suite — every Protocol method that ES + OpenSearch pass, Solr must also pass. +- E2E test: `ui/tests/e2e/solr-study-end-to-end.spec.ts` runs the full Karpathy loop (register Solr cluster → create study → generate judgments → run trials → open PR) against the live Compose Solr. + +### Operator-facing documentation + +- **New runbook:** `docs/03_runbooks/solr-cluster-registration.md` — how to register a Solr cluster, configure `edismax` defaults, enable `solr.UBIComponent`, upload an LTR model. +- **Tutorial extension:** `docs/08_guides/tutorial-first-study.md` gains a Step 0 Path C — "Run the tutorial against Solr instead of Elasticsearch." Demonstrates the same study, same loop, same PR — different engine. + +## Scope signals + +- **Backend:** ~1,200 LOC total. Adapter ~600 LOC; templates ~150 LOC; capability probe ~100 LOC; auth + connection handling ~150 LOC; ~200 LOC tests (unit + integration + contract). Roughly 40–50% conceptually shared with ES adapter (orchestration shell, validator hooks, error mapping); the rest is Solr-specific (parameter rendering, LTR injection, `mm` syntax handling, JSON Request API quirks). +- **Frontend:** ~100 LOC. New `engine_type` option in the cluster-registration form; engine-specific help text for Solr auth flows; engine badge on cluster cards / study headers (small Solr SVG). +- **Migration:** **one migration**. Extends the `clusters.auth_kind` CHECK constraint to accept `solr_basic` and `solr_apikey`; extends `engine_type` CHECK to accept `solr`. No new tables. 
+- **Config:** new optional env vars `SOLR_HOST` / `SOLR_PORT` / `SOLR_ADMIN_USERNAME_FILE` / `SOLR_ADMIN_PASSWORD_FILE` for the Compose Solr service. Mirrors the ES + OpenSearch settings pattern. +- **Audit events:** N/A at the engine-adapter layer. `audit_log` activates at MVP3 (Observable) and uses event names independent of engine; multi-tenancy is in the backlog. +- **Tests:** + - Unit: parameter rendering for each `edismax` parameter, LTR rescore injection, `mm` arithmetic syntax, capability probe parsing, error mapping for 4xx/5xx + - Integration: end-to-end search against Compose Solr; LTR model upload + rescore round-trip; UBI reader against seeded `ubi_queries`/`ubi_events` + - Contract: `SearchAdapter` Protocol conformance — every method that ES + OpenSearch implement, Solr must implement with the same signature and error envelope + - E2E: full Karpathy loop against the live Compose Solr (Step 0 Path C of the tutorial, automated) + +## Why bundled with UBI into MVP2 (not split into two releases) + +1. **Together they ship the engine-neutral story.** UBI alone catches up with OpenSearch SRW (which already has UBI-via-COEC GA). The Solr adapter alone is a third engine without a Real Signals story. Bundled, they ship "RelyLoop runs on all three OSS engines with UBI on every one of them" as a single coherent headline. +2. **Solr's `solr.UBIComponent` is first-party.** The `UbiReader` works against Solr unchanged the moment the adapter lands. Splitting them into two releases means one of the two would ship a half-finished UBI story (UBI without Solr support, or Solr without same-release UBI parity). +3. **MVP3 is reserved for observability.** Langfuse + SigNoz + audit-log immutability is a foundational reliability layer that benefits every adapter and every judgment source. Landing it after the engine sweep means MVP3 instruments three engines × two judgment sources in one release of work, rather than retrofitting observability per engine. +4. 
**No schema or Protocol changes are required.** The `SearchAdapter` Protocol shape is engine-agnostic by design; the `judgments.source` CHECK already accepts `click`. Both capabilities are additive — bundling them is a release-cadence decision, not a technical compromise. + +**Release size estimate:** ~4–5 engineer-weeks combined (Solr adapter ~2–3, UBI ~2, ~1 week of co-integration testing + the engine-neutral tutorial extensions). Solo-engineer; ~3–4 weeks with two engineers working in parallel on the Solr and UBI tracks. + +## Relationship to other work + +- **Replaces the previously-planned Lucidworks Fusion adapter** as the next engine target. See [`chore_drop_fusion_scope/idea.md`](../chore_drop_fusion_scope/idea.md) for why Fusion was dropped. +- **Bundled with [`feat_ubi_judgments`](../feat_ubi_judgments/idea.md)** in MVP2 — Solr's `solr.UBIComponent` writes the same UBI schema; the UBI reader and hybrid UBI+LLM converter work on Solr unchanged from day one. +- **Multi-Git provider abstraction (GitLab, Bitbucket) is in the backlog** — was previously bundled with the Fusion-era MVP3; reframed as backlog because it serves a smaller adopter axis than the engine sweep + observability path. GitHub remains the only Git provider through GA v1. +- **Unlocks the verifiable "engine-neutral" claim** in [`docs/07_research/comparison.md`](../../../07_research/comparison.md) and the umbrella spec §1. The claim is rhetorical at MVP1; it becomes factual at MVP2. +- **MVP3 "Observable" follows** — Langfuse + SigNoz + audit-log immutability + lineage layers on top of all three engines and both judgment sources in one go. diff --git a/docs/07_research/comparison.md b/docs/07_research/comparison.md new file mode 100644 index 00000000..e3957b2b --- /dev/null +++ b/docs/07_research/comparison.md @@ -0,0 +1,81 @@ +# Comparison with adjacent tools + +**Status:** Factual reference. Updated when a referenced tool ships a release that changes a row. +**Last updated:** 2026-05-27. 
+**Scope:** OSS and commercial relevance-tuning tools that overlap RelyLoop's surface. Excludes general-purpose ML observability (Phoenix, LangSmith, Helicone) — those are complementary, not competitive. + +This page is a factual matrix, not a sales sheet. Each row links to the tool's own docs so readers can verify claims independently. Where a capability is partial, the cell describes the partial state rather than claiming "no." + +## Snapshot matrix + +| Capability | RelyLoop (v0.1.0) | OpenSearch SRW (3.6) | OpenSearch Relevance Agent (3.6) | Quepid | RRE | Chorus | Elasticsearch (native) | Splainer | +|---|---|---|---|---|---|---|---|---| +| **Bayesian / TPE optimization over full query-time search space** | yes (Optuna TPE, thousands of trials) | no — Hybrid Search Optimizer is a 66-cell grid search over `{2 norms × 3 combiners × 11 weight steps}` for hybrid weights only ([docs](https://docs.opensearch.org/latest/search-plugins/search-relevance/optimize-hybrid-search/)); Bayesian is in [RFC #934](https://github.com/opensearch-project/neural-search/issues/934) with no shipped code | no | no | no | no | no | no | +| **LLM-as-judge with customizable prompts** | yes | yes — GA in 3.5 ([release notes](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-3.5.0.md)) | suggests judgments conversationally | community plugin, not in the OSS core | no | no | no — operators DIY against `_rank_eval` | no | +| **UBI-derived judgments (click streams → ratings)** | planned MVP2 (`feat_ubi_judgments`); hybrid UBI+LLM is the differentiated mode | yes — COEC click model GA ([docs](https://docs.opensearch.org/latest/search-plugins/search-relevance/judgments/)) | no | no | no | yes — reference UBI showcase ([repo](https://github.com/o19s/chorus-opensearch-edition)) | no native UI; UBI plugin available as community fork ([repo](https://github.com/o19s/user-behavior-insights-elasticsearch)) | no | +| **Engine support** | 
Elasticsearch 8.11+/9.x + OpenSearch 2.x/3.x today; **Apache Solr planned MVP2**; one adapter, one workflow | OpenSearch only | OpenSearch only | Solr (since day one) + ES + OpenSearch | Solr + ES | Solr (primary) + OpenSearch (partial) | ES only | Solr + ES | +| **Git-PR apply path (winning configs land as PRs, named approvers merge)** | yes (GitHub today; GitLab + Bitbucket in backlog) | **no — explicitly out of scope by RFC** ([RFC #17735](https://github.com/opensearch-project/OpenSearch/issues/17735): "focuses on evaluation and analysis, not production deployment mechanisms") | no | no — Quepid writes judgments, not configs | no | no | no | no | +| **Search-configuration A/B comparison runner** | indirect (compare studies) | yes — GA in 3.1 ([docs](https://docs.opensearch.org/latest/search-plugins/search-relevance/using-search-relevance-workbench/)) | suggests configs, doesn't compare them at scale | yes (manual) | yes (CLI) | yes (via Quepid) | no — `_rank_eval` is an API primitive, no UI | yes (drill-down) | +| **Scheduled / unattended experiment runs** | yes (Optuna study runs overnight; `feat_auto_followup_studies` chains them) | yes — GA in 3.5 (nightly/weekly/monthly cadence) | no | no | yes (cron-driven CLI) | no | no | no | +| **Multi-cluster support** | yes (one tool, many `clusters` rows) | yes — added in 3.6 ([release notes](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-3.6.0.md)) | yes (3.6) | yes | yes | yes | no | no | +| **Conversational agent that runs the loop** | yes (chat orchestrator dispatches `start_study`, `generate_judgments_*`, `open_proposal` tools) | no (SRW UI is form-driven; the Relevance Agent is a separate experimental product) | yes — DSL recommender, **but does not run multi-thousand-trial sweeps** ([blog](https://opensearch.org/blog/introducing-opensearch-relevance-agent-ai-powered-search-tuning/)) | no | no | no | no | no | +| **Local-first LLM observability 
(self-hosted Langfuse/SigNoz)** | planned MVP3 | n/a (no LLM-as-judge observability surface) | n/a | n/a | n/a | n/a | n/a | n/a | +| **Apache 2.0 license** | yes | yes ([repo](https://github.com/opensearch-project/search-relevance)) | yes (OpenSearch project) | yes ([repo](https://github.com/o19s/quepid)) | yes ([repo](https://github.com/SeaseLtd/rated-ranking-evaluator)) | yes ([repo](https://github.com/querqy/chorus)) | Elastic License 2.0 + SSPL (not OSI-OSS); `_rank_eval` API is Basic-tier | yes ([repo](https://github.com/o19s/splainer-search)) | +| **License tier for relevance-tuning features** | Apache 2.0, all tiers | Apache 2.0, all tiers | Apache 2.0, all tiers | Apache 2.0, all tiers | Apache 2.0, all tiers | Apache 2.0, all tiers | `_rank_eval` Basic; native LTR + ML inference Platinum or higher ([subscriptions](https://www.elastic.co/subscriptions)) | Apache 2.0, all tiers | + +## Why the bundle matters + +Each individual capability above has at least one OSS comparable. The combination — *Bayesian/TPE optimization across the full search space, on every major open-source engine, with a Git-PR apply path* — does not. Concretely: + +- OpenSearch SRW is the closest competitor and ships GA query sets, judgment lists, A/B comparison, LLM-as-judge, scheduled experiments, and UBI judgments — but its optimizer is a 66-cell grid restricted to hybrid weights, and it has no apply path by explicit RFC decision. +- Quepid is the closest *workbench* (manual A/B with judgments) and is the strongest tool for human-rated judgment management; it does not run automated sweeps and is not LLM-driven. +- Elasticsearch ships `_rank_eval` (an API primitive) and deprecated its higher-level Behavioral Analytics and Search Applications products in 9.0 ([release notes](https://www.elastic.co/guide/en/elastic-stack/9.0/release-notes-elasticsearch-9.0.0.html)). There is no native ES equivalent to SRW or RelyLoop. 
+- Solr's ecosystem (Quepid + Chorus + RRE) is mature for manual evaluation but has no auto-optimizer. UBI ships first-party on Solr as `solr.UBIComponent` ([reference guide](https://solr.apache.org/guide/solr/latest/query-guide/learning-to-rank.html); [UBI tools](https://www.ubisearch.dev/tools/)). + +## What RelyLoop deliberately does NOT do + +To stay honest about scope: + +- **Online A/B testing on production traffic.** Offline evaluation only. See [umbrella spec §4](../00_overview/relyloop-spec.md). +- **Online learning / bandits.** Documented as a v2 Path B direction; deliberately deferred from v1. +- **Production search-quality monitoring.** APM, Grafana, and SRW's own metrics surface own this space. +- **Schema / mapping / analyzer changes.** Tuning is restricted to query-time parameters. +- **Sitting on the live search-serving path.** RelyLoop opens PRs; operator CI deploys them. +- **Training Learning-to-Rank models** in v1. Output is query-time parameter changes, not learned reranker weights. LTR support is a v2 Path A candidate. + +## Update cadence + +This page is updated when: + +- A row changes (a referenced tool ships a release that flips a capability from "no" to "yes" or vice versa). +- A new comparable tool ships its first GA release. +- RelyLoop ships a release that changes its own row. + +Pull requests welcomed from operators using any of the listed tools — corrections preferred over new claims. 
+
+## Sources
+
+- [OpenSearch Search Relevance Workbench documentation](https://docs.opensearch.org/latest/search-plugins/search-relevance/using-search-relevance-workbench/)
+- [OpenSearch SRW repository](https://github.com/opensearch-project/search-relevance)
+- [OpenSearch 3.5 release notes (LLM-as-judge GA, scheduled experiments)](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-3.5.0.md)
+- [OpenSearch 3.6 release notes (multi-datasource SRW, Relevance Agent)](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-3.6.0.md)
+- [OpenSearch Hybrid Search Optimizer docs (grid search, 66 trials)](https://docs.opensearch.org/latest/search-plugins/search-relevance/optimize-hybrid-search/)
+- [Hybrid Search Optimization blog post](https://opensearch.org/blog/hybrid-search-optimization/)
+- [RFC #17735 — Search Relevance Workbench scope (apply-path out of scope)](https://github.com/opensearch-project/OpenSearch/issues/17735)
+- [RFC #934 — Hybrid Search Optimizer Bayesian future work](https://github.com/opensearch-project/neural-search/issues/934)
+- [OpenSearch Relevance Agent blog (experimental, 3.6)](https://opensearch.org/blog/introducing-opensearch-relevance-agent-ai-powered-search-tuning/)
+- [OpenSearch UBI plugin documentation](https://docs.opensearch.org/latest/search-plugins/ubi/index/)
+- [OpenSearch SRW judgments documentation (COEC)](https://docs.opensearch.org/latest/search-plugins/search-relevance/judgments/)
+- [UBI specification (o19s/ubi)](https://github.com/o19s/ubi) · [rendered spec](https://o19s.github.io/ubi/)
+- [UBI tools and plugins index](https://www.ubisearch.dev/tools/)
+- [Apache Solr LTR reference guide](https://solr.apache.org/guide/solr/latest/query-guide/learning-to-rank.html)
+- [Sease: Solr 10 — Vector Search and LTR (March 2026)](https://sease.io/2026/03/apache-solr-10-what-is-new-for-vector-search-and-ltr.html)
+- [Quepid repository](https://github.com/o19s/quepid)
+- [Chorus repository (Solr-centric reference stack)](https://github.com/querqy/chorus)
+- [Chorus OpenSearch edition](https://github.com/o19s/chorus-opensearch-edition)
+- [Rated Ranking Evaluator (Sease)](https://github.com/SeaseLtd/rated-ranking-evaluator)
+- [Splainer](https://github.com/o19s/splainer-search)
+- [Elasticsearch 9.0 release notes (Behavioral Analytics + Search Applications deprecation)](https://www.elastic.co/guide/en/elastic-stack/9.0/release-notes-elasticsearch-9.0.0.html)
+- [Elasticsearch `_rank_eval` API reference](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval)
+- [Elasticsearch native LTR docs](https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr)
+- [Elastic subscriptions / license tier matrix](https://www.elastic.co/subscriptions)
diff --git a/docs/08_guides/tutorial-first-study.md b/docs/08_guides/tutorial-first-study.md
index 6927ddf5..80d4708a 100644
--- a/docs/08_guides/tutorial-first-study.md
+++ b/docs/08_guides/tutorial-first-study.md
@@ -365,7 +365,7 @@ curl -X POST http://localhost:8000/api/v1/config-repos \
   [`docs/01_architecture/`](../01_architecture/) — start with
   [`system-overview.md`](../01_architecture/system-overview.md).
 - The umbrella product spec is at
-  [`docs/00_overview/product/relevance-copilot-spec.md`](../00_overview/product/relevance-copilot-spec.md).
+  [`docs/00_overview/relyloop-spec.md`](../00_overview/relyloop-spec.md).
 - File feedback or bug reports at
   [GitHub Discussions](https://github.com/SoundMindsAI/relyloop/discussions).
diff --git a/docs/08_guides/workflows-overview.md b/docs/08_guides/workflows-overview.md index 1d6b6802..b4c29dbf 100644 --- a/docs/08_guides/workflows-overview.md +++ b/docs/08_guides/workflows-overview.md @@ -292,7 +292,7 @@ Important framing for new engineers, because the negative space defines the tool - **Has no in-tool approval surface.** Approval is delegated to the config repo's branch protection (CODEOWNERS, required reviewers). - **Single-tenant in MVP1.** Multi-tenancy, SSO, and role enforcement land at MVP4. -If your team needs any of the above, RelyLoop is complementary, not a replacement — the umbrella spec (`docs/00_overview/relevance-copilot-spec.md` §4) is the canonical non-goals list. +If your team needs any of the above, RelyLoop is complementary, not a replacement — the umbrella spec (`docs/00_overview/relyloop-spec.md` §4) is the canonical non-goals list. --- diff --git a/docs/README.md b/docs/README.md index 4958a8f2..d843cb66 100644 --- a/docs/README.md +++ b/docs/README.md @@ -9,7 +9,7 @@ Pick the path that matches your goal: | Goal | Read this first | |---|---| | **Boot the stack on my laptop** | [`03_runbooks/local-dev.md`](03_runbooks/local-dev.md) — `make up` → `/healthz` → debug | -| **Understand what RelyLoop is + the release roadmap** | [`00_overview/product/relevance-copilot-spec.md`](00_overview/product/relevance-copilot-spec.md) (umbrella spec; ~2,800 lines) | +| **Understand what RelyLoop is + the release roadmap** | [`00_overview/relyloop-spec.md`](00_overview/relyloop-spec.md) (umbrella spec; ~2,800 lines) | | **Onboard as a contributor** | [`../state.md`](../state.md) (active branch / focus / debt) → [`../architecture.md`](../architecture.md) (navigation) → [`../CLAUDE.md`](../CLAUDE.md) (conventions + absolute rules) | | **Look up an architectural decision** | [`01_architecture/`](01_architecture/) — topical docs (tech-stack, deployment, adapters, llm-orchestration, etc.) 
| | **Find the spec for a planned feature** | [`02_product/planned_features//feature_spec.md`](02_product/planned_features/) | diff --git a/state.md b/state.md index 98e30a39..c457e35b 100644 --- a/state.md +++ b/state.md @@ -1,6 +1,8 @@ # RelyLoop — Active State -> Read this first. Snapshots the active branch, what just shipped, what's in flight, what's queued, and where the project currently sits in the MVP1 → GA roadmap. Updated whenever a feature lands or a priority shifts. +> Read this first. Snapshots the active branch, what just shipped, what's in flight, what's queued, and where the project currently sits in the MVP1 → MVP2 → MVP3 → GA roadmap. Updated whenever a feature lands or a priority shifts. + +**Major release-matrix reshuffle landed 2026-05-27** on branch `chore_reframe_positioning_vs_opensearch_srw` (pending PR): positioning rewrite around the verified Bayesian/TPE-over-full-search-space + Git-PR + three-OSS-engine moat (see [`docs/07_research/comparison.md`](docs/07_research/comparison.md) for the citation-backed competitive matrix vs OpenSearch SRW / Quepid / RRE / Chorus / Elastic). Concrete changes: (a) Lucidworks Fusion dropped outright — see [`chore_drop_fusion_scope/idea.md`](docs/02_product/planned_features/chore_drop_fusion_scope/idea.md). (b) Apache Solr promoted to MVP2 — see [`infra_adapter_solr/idea.md`](docs/02_product/planned_features/infra_adapter_solr/idea.md) — bundled with UBI judgments because Solr's first-party `solr.UBIComponent` writes the same UBI schema, so UBI on Solr is free once the adapter ships. (c) Release matrix compressed from six stops to four: MVP1 (shipped) → MVP2 (Three-Engine + Real Signals) → MVP3 (Observable; was MVP2 in prior plan) → GA v1 (polish + governance + hardening; no new product surface). (d) MVP4 (multi-tenant + multi-LLM) + multi-Git provider abstraction + LTR training + Path B moved to backlog. 
(e) Spec file renamed `docs/00_overview/product/relevance-copilot-spec.md` → `docs/00_overview/relyloop-spec.md` and all 24 active-doc references updated. The umbrella spec §1/§2/§4/§8/§27/§29 + CLAUDE.md + README.md + architecture.md + tech-stack.md + adapters.md + 7 smaller arch docs + feat_ubi_judgments idea + 3 new artifacts (comparison.md, infra_adapter_solr/idea.md, chore_drop_fusion_scope/idea.md) all updated in one bundled PR. **All six differentiators (Bayesian/TPE optimizer, Git-PR apply path, conversational agent that runs the loop, all three OSS engines, hybrid UBI+LLM, local-first observability) are now GA by MVP3.** **Last updated:** 2026-05-26 (after `chore_clone_narrow_bounds_full_roundtrip_e2e` merged into `main` as PR #273 squash `7ecdd171` — 47th MVP1-era artifact. E2E spec extension: extended `ui/tests/e2e/study-clone-narrow-bounds.spec.ts` from the textarea-clamped-only assertion to the full 6-step plan from its own docblock — re-check the narrow-bounds checkbox to put textarea back to clamped state, advance Step 4 → 5, click Create study, capture POST response, GET the new study, assert persisted `search_space.params.boost.low/high` close to 2.0/3.0 (FR-12) AND `parent_study_id === sourceId` (FR-9). Pattern lifted from v1 clone-spec `ui/tests/e2e/study-clone.spec.ts:24`. The full submit round-trip PASSED on both CI runs against the live smoke stack. Cross-model review: 1 Gemini round (1 finding, accepted — URL constructor on the new request.get to defend against trailing-slash PLAYWRIGHT_API_BASE_URL; deferred broader sweep of 4 sibling sites as `chore_e2e_api_base_url_construction`) + 2 GPT-5.5 final-review rounds (2 accepted in round 1: `.ok()` → `.status() === 201` for spec-correct create semantics in BOTH this spec AND v1 clone-spec to avoid divergence; added `expect.poll` wait on `cs-search-space.inputValue()` to defend against fast-click race after re-check; clean round 2). 
Two tangential ideas captured: `chore_e2e_api_base_url_construction` (sweep 4 string-concat sites) + `bug_smoke_studies_data_table_search_flake` (first CI run showed transient flake on `studies-data-table.spec.ts:20`; second CI run passed → confirmed flake; folder ready if it resurfaces). After: `bug_smoke_dashboard_demo_state_locator_missing` closed via docs-only PR — confirmed fixed as a side effect of PR #268's disclosure-gating change. Empirical evidence: smoke job failure count dropped 4 → 1 across two consecutive main runs (pre-PR#268 main `20f59bc7` had all 3 dashboard tests failing; post-PR#268 main `66244f74` + PR #270 build `4810cfa4` both passed all 3). The closing PR moves `bug_smoke_dashboard_demo_state_locator_missing/` → `implemented_features/2026_05_26_bug_smoke_dashboard_demo_state_locator_missing/` with a `fix_attribution.md` documenting the cross-PR causation chain so a future reader can trace the close-out reasoning. Backlog count drops by 1 with zero additional implementation work — the cheap-win pattern (act on empirical-fix confirmation from CI rather than starting a fresh bug-fix flow). After: `infra_dockerfile_invariant_smoke_in_ci` merged into `main` as PR #270 squash `4810cfa4` — 46th MVP1-era artifact. CI workflow change: adds `load: true` to the existing `docker buildx (relyloop/api)` build step + a new "Verify runtime image invariants" step that `docker run`s the freshly-built image and asserts (a) `/app/.venv` is fully relyloop-owned (the `bug_dockerfile_venv_root_owned_after_user_switch` failure mode), (b) default user is relyloop UID 1000. Fast-fail signal on image-construction regressions (~30s on the PR's checks panel vs ~5 min for the smoke job to even reach the failure surface, which often can't catch venv-ownership at all since prod consumers read the venv). End-to-end self-verified: PR #270's own CI runs the new step against PR #270's build SHA — buildx job concluded green = invariants held. 
Single-PR ad-hoc ship without `/pipeline` scaffolding (idea's "Proposed fix" was concrete enough to land verbatim; ~42 LOC across pr.yml + 9 LOC of idea-preflight doc edits). Cross-model review: 1 Gemini round (1 finding, rejected as stale — identical to GPT-5.5 round-1 #3, pinned to pre-fix SHA `717828d0` before commit `b08c0502` landed the fix) + 2 GPT-5.5 final-review rounds (3 accepted in round 1: missing `/app/.venv` existence guard so a broken image where the dir is missing doesn't false-pass, fixed dashboard regen link drift by replacing my idea.md's `../../../00_overview/...` link with plain-text slug, softened the inline comment from "fires BEFORE smoke" to "fast-fail signal" matching parallel-job reality; clean round 2). Idea preflight pass (Audit & Patch mode) landed 3 doc edits before the implementation commit: P3 → P2 priority (same dashboard-regen-only-recognizes-P0/P1/P2/Backlog drift as PR #265's chore_clone_narrow_bounds_full_roundtrip_e2e fix), Status line citing merged PR #263 / #264 SHAs, stale `infra_ci_smoke_makeup` (already shipped 2026-05-13) reference rewritten to point at the live smoke job as the actual bundling adjacency. After: the "stuck-stack self-rescue" bundle PR #268 squash `66244f74` — 45th MVP1-era artifact. Bundles two sibling fixes captured in PR #267 + an interactive debugging session earlier the same day: (1) `bug_seed_demo_if_empty_counts_soft_deleted` — `scripts/seed_meaningful_demos.py:824`'s `count_existing_clusters()` ran `SELECT COUNT(*) FROM clusters` without `WHERE deleted_at IS NULL`, so soft-deleted rows (from E2E cleanup) counted as "exists" and permanently false-skipped the auto-seed at `install.sh:95` on every subsequent `make up`. Fix: extract SQL to module-level constant `_COUNT_LIVE_CLUSTERS_SQL`, add the WHERE clause. 
(2) `bug_dashboard_reset_disclosure_gating_too_strict` — `ui/src/components/dashboard/start-here-checklist.tsx:150` gated the "Reset to demo state" disclosure on a 3-way AND (`!hasClusters && !hasQuerySetsWithJudgments && !hasStudies`), hiding the rescue affordance from operators stuck with orphan data but no live clusters. Fix: tighten predicate to `!hasClusters` only. Together the two fixes restore recovery at both layers: auto-seed on `make up` correctly re-seeds when only soft-deleted clusters exist, AND the in-product disclosure renders whenever the operator has no live clusters. Two-layer regression coverage for the seed-script bug: unit-static guard at `backend/tests/unit/scripts/test_seed_meaningful_demos_sql.py` asserting the SQL includes `WHERE deleted_at IS NULL`, integration-semantic guard at `backend/tests/integration/test_seed_meaningful_demos_if_empty.py` (2 tests using transaction-rollback + baseline-snapshot to avoid wiping operator data — pins both the negative case "soft-deleted row doesn't count" and the mixed case "live counts + deleted doesn't"). Frontend test coverage: 3 new/flipped vitest cases at `start-here-checklist.test.tsx` pinning the new render contract. **Unexpected positive side-effect: smoke CI dropped from 4 → 1 failures.** The 3 dashboard tests previously failing under `bug_smoke_dashboard_demo_state_locator_missing` (dashboard-reseed.spec.ts:77 + dashboard.spec.ts:47,63) all passed on the PR's CI runs — the disclosure-gating fix changed the render-condition into one the smoke stack satisfies, incidentally fixing those tests. Only `followup_run.spec.ts:111` (swap-template `template_id` assertion under `bug_smoke_followup_clone_e2e_flakes`) remains red — unrelated root cause. If the dashboard test passes hold, that bug folder may be closeable. 
Cross-model review: 1 Gemini round (1 finding, rejected as stale — pinned to a pre-round-1-fix SHA) + 2 GPT-5.5 final-review rounds (2 accepted in round 1: transaction-rollback refactor + mixed-row positive assertion; clean round 2). One CI hiccup: first push failed backend full-suite with `column "updated_at" of relation "clusters" does not exist` — `clusters` schema has only `created_at` + `deleted_at` per `backend/app/db/models/cluster.py:83-88`; fixed in commit `42a39da6`. After: `bug_clone_e2e_seed_template_params_mismatch` merged into `main` as PR #265 squash `20f59bc7` — 44th MVP1-era artifact. Test-fixture rename: 9 occurrences of `title.boost` → `boost` in `backend/app/services/test_seeding.py` to match the e2e `seedTemplate()` helper's `declared_params: { boost: 'float' }` at `ui/tests/e2e/helpers/seed.ts:316`. Pre-fix: both helpers bypassed `validate_against_template`, so the mismatch landed in DB silently — but every E2E spec that opened `CreateStudyModal` against the seeded source hit Step-4's client-side `validateSearchSpaceAgainstTemplate` and stalled with `Param 'title.boost' is not declared by template — Declared params: ['boost']`. Co-located consumer updates in lockstep: integration test's declared_params (1 LOC), `ui/tests/e2e/followup_run.spec.ts` NARROW_SEARCH_SPACE + rationale string, `ui/tests/e2e/study-clone-narrow-bounds.spec.ts` (5 literal references + stale fixture-inconsistency comment block replaced with a one-line pointer to remaining out-of-scope smoke flakes), `ui/tests/e2e/studies.spec.ts:173` assertion text follow-on. Resolves 2 of 6 currently-red smoke-stack failures: `study-clone.spec.ts:24` and `followup_run.spec.ts:28` both turned green in CI (verified — smoke dropped from 6 → 4 failures). 
Remaining 4 failures continue to fall under `bug_smoke_dashboard_demo_state_locator_missing` (dashboard demo-data testids) + `bug_smoke_followup_clone_e2e_flakes` (swap-template `template_id` assertion in `followup_run.spec.ts:111`) — both need live smoke-stack reproduction in separate focused PRs. Tangential discovery captured: `chore_clone_narrow_bounds_full_roundtrip_e2e/idea.md` (P2 — now that the seed mismatch is fixed, `study-clone-narrow-bounds.spec.ts` can be extended with the full submit + GET-the-new-study round-trip assertion that was deferred per a stale comment). Cross-model review: Gemini Code Assist 0 findings; GPT-5.5 across 3 rounds — 4 accepted (bug_fix.md tangential bullet rewritten + test title corrected to "textarea is clamped and restores on uncheck" + chore idea relative-link path fixed + chore idea priority P3 → P2 to match dashboard's recognized tiers) + 2 rejected with cited counter-evidence (tangential-idea-capture flagged as scope drift — defensible because the sweep is a BLOCKING skill step; dashboard "pending bug" listing flagged — defensible because it accurately reflects pre-merge folder state and post-merge Step 8.7 handles the transition). After: `bug_dockerfile_venv_root_owned_after_user_switch` merged into `main` as PR #263 squash `644b0b80` — 43rd MVP1-era artifact. Dockerfile fix: moved `USER relyloop` BEFORE the runtime-stage `RUN uv sync --frozen --no-dev` so the project-install runs as the unprivileged user and writes `relyloop-0.1.0.dist-info/*` as relyloop:relyloop. Pre-fix: 11 root-owned files in `/app/.venv` blocked `uv run` from the relyloop user in any one-shot container (`make test-worktree` etc.). Decision flipped during Gemini review from the originally-locked Option A (`RUN chown -R relyloop:relyloop /app/.venv` after the sync) to Option B (USER before sync) after empirically measuring a 385MB chown-layer copy-up — final image dropped from 963MB → 577MB (386MB saved). 
Same PR reverts the `--user root` + `-e PYTHONDONTWRITEBYTECODE=1` workaround in `scripts/run-tests-in-worktree.sh` (Phase 2 of `infra_agent_sibling_worktree_isolation`) that cited this bug as its sole justification; mount count stays at 12 (the reverted flags weren't `-v` mounts). Static regression guard at `backend/tests/unit/test_dockerfile_runtime_stage.py` (3 tests): pin USER-before-sync ordering via `_find_directive()` (skips Dockerfile comments — hardened in GPT-5.5 round 2 against substring-match weakness), pin absence of `RUN chown -R /app/.venv` (image-bloat guard, also skips comments). Cross-model review: 1 Gemini round (3 findings, all accepted) + 3 GPT-5.5 final-review rounds (5 accepted, 1 rejected with cited counter-evidence — the rejected "MVP1_DASHBOARD.md still shows bug as open" finding correctly reflects pre-merge folder location; post-merge Step 8.7 move handles transition). Tangential discovery captured: `infra_dockerfile_invariant_smoke_in_ci/idea.md` (P3 — add a CI runtime smoke step that executes the built image and asserts `find /app/.venv -not -user relyloop | wc -l = 0` + uid=relyloop default; surfaced because the static unit test catches structural regressions but not runtime-state ones, and `bug_fix.md` Decision #3 declined to add it inline to avoid extending the PR into pr.yml). Operator-path verification: `docker build` + `docker run` measured pre-fix/post-fix file counts + image sizes. After: `infra_test_worktree_missing_integration_envs` merged into `main` as PR #257 squash `4ffc83a5` — 42nd MVP1-era artifact. Infra-script fix to `scripts/run-tests-in-worktree.sh` that propagates `POSTGRES_PASSWORD_FILE` (required, fail-loud exit 5 — mirrors the existing DB-secret check shape) and `CLUSTER_CREDENTIALS_FILE` (optional, mount-if-present via `[[ -r && -s ]]` probe with a `--dry-run` stderr hint when skipped) from the main worktree's `secrets/` into the one-shot test container. 
Closes the silent-skip footgun where every Postgres-touching integration test invoked via `make test-worktree CMD="pytest backend/tests/integration -v"` reported `SKIPPED [..] Postgres not reachable` because the `postgres_reachable()` helper at `backend/tests/conftest.py:50-72` gates on BOTH env vars being present, but the script propagated only `DATABASE_URL_FILE`. Operator-path verification: `make test-worktree CMD="pytest backend/tests/integration/test_studies_api.py"` ran 43 tests to completion (43 passed, 0 skipped) from a sibling worktree at `/private/tmp/relyloop-verify-test-worktree`; pre-PR baseline was all 43 skipping. Smoke-test coverage in `backend/tests/unit/scripts/test_run_tests_in_worktree.py` extended from 6 → 9 tests (the parametrized `test_cluster_credentials_skipped_when_host_file_absent_or_empty` covers absent/empty/unreadable + skips the unreadable subcase when running as root on Windows). Docs synced: CLAUDE.md §"Running tests against a sibling worktree" recipe uses a bash-safe `CLUSTER_CREDS_ARGS=()` array splice (NOT the literal `# Optional` inline comment shape the spec requested — caught at plan-stage GPT-5.5 cycle 1 as broken bash that would orphan the next `-e` flag); `docs/03_runbooks/parallel-worktrees.md` extended with a new "Adding a new `*_FILE` env var to the Compose stack" subsection documenting the durable three-place update contract. Cross-model review: 3-cycle Opus↔GPT-5.5 review at spec stage (9 findings, all accepted) + 3-cycle review at plan stage (5 findings, all accepted including the broken-bash catch) + Gemini Code Assist (3 Medium accepted — Windows portability + inverted docker-compose service↔line mapping in 2 docs files) + final GPT-5.5 review (1 accepted cycle 1 catching the same mapping bug in CLAUDE.md; 2 rejected cycle 2 hallucinations claiming script + runbook diffs were missing when both were demonstrably present). 
Two tangential discoveries captured: `chore_db_session_skip_reason_disambiguation/idea.md` (the `db_session` fixture's `"Postgres not reachable"` skip reason lies when the actual cause is the env-var presence check failing — same misattribution that surfaced this whole feature; ~10 LOC fix) and `bug_smoke_followup_clone_e2e_flakes/idea.md` (PR #257 CI's `smoke` job is red on 3 followup/clone E2E tests; cross-checked against main commit `9928d763`, same failures pre-existed, so not a PR #257 regression).) diff --git a/ui/public/docs/tutorial-first-study.md b/ui/public/docs/tutorial-first-study.md index 6927ddf5..80d4708a 100644 --- a/ui/public/docs/tutorial-first-study.md +++ b/ui/public/docs/tutorial-first-study.md @@ -365,7 +365,7 @@ curl -X POST http://localhost:8000/api/v1/config-repos \ [`docs/01_architecture/`](../01_architecture/) — start with [`system-overview.md`](../01_architecture/system-overview.md). - The umbrella product spec is at - [`docs/00_overview/product/relevance-copilot-spec.md`](../00_overview/product/relevance-copilot-spec.md). + [`docs/00_overview/relyloop-spec.md`](../00_overview/relyloop-spec.md). - File feedback or bug reports at [GitHub Discussions](https://github.com/SoundMindsAI/relyloop/discussions). diff --git a/ui/public/docs/workflows-overview.md b/ui/public/docs/workflows-overview.md index 1d6b6802..b4c29dbf 100644 --- a/ui/public/docs/workflows-overview.md +++ b/ui/public/docs/workflows-overview.md @@ -292,7 +292,7 @@ Important framing for new engineers, because the negative space defines the tool - **Has no in-tool approval surface.** Approval is delegated to the config repo's branch protection (CODEOWNERS, required reviewers). - **Single-tenant in MVP1.** Multi-tenancy, SSO, and role enforcement land at MVP4. -If your team needs any of the above, RelyLoop is complementary, not a replacement — the umbrella spec (`docs/00_overview/relevance-copilot-spec.md` §4) is the canonical non-goals list. 
+If your team needs any of the above, RelyLoop is complementary, not a replacement — the umbrella spec (`docs/00_overview/relyloop-spec.md` §4) is the canonical non-goals list. --- From 07c4ceaf528c9ab43ab09332dbfbd5f1d388ba1b Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Wed, 27 May 2026 20:52:54 -0400 Subject: [PATCH 2/2] fix(docs): correct relative paths to planned_features (Gemini PR #289 review) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adjudicates all 6 Gemini Code Assist findings on PR #289. Single root cause: I used "../../02_product/planned_features/..." from files under docs/01_architecture/ when the correct path is one level up only — "../02_product/planned_features/...". All 6 findings ACCEPTED (verifiable correct): - docs/01_architecture/adapters.md:3, :130 - docs/01_architecture/mvp1-overview.md:65 - docs/01_architecture/optimization.md:195 - docs/01_architecture/tech-stack.md:18, :74 Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: SoundMindsAI --- docs/01_architecture/adapters.md | 6 +++--- docs/01_architecture/mvp1-overview.md | 2 +- docs/01_architecture/optimization.md | 2 +- docs/01_architecture/tech-stack.md | 4 ++-- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/01_architecture/adapters.md b/docs/01_architecture/adapters.md index eceac832..3627ea52 100644 --- a/docs/01_architecture/adapters.md +++ b/docs/01_architecture/adapters.md @@ -1,6 +1,6 @@ # Adapters -**Status:** Adopted for MVP1. ElasticAdapter (handling ES + OpenSearch) is the only implementation in MVP1; SolrAdapter ships at MVP2 alongside UBI judgments. Lucidworks Fusion is explicitly dropped (see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md)) — a community-contributed Fusion adapter remains possible against this Protocol, but the project does not own that direction. 
Per-release timing per [`tech-stack.md` §"Canonical release matrix"](tech-stack.md). +**Status:** Adopted for MVP1. ElasticAdapter (handling ES + OpenSearch) is the only implementation in MVP1; SolrAdapter ships at MVP2 alongside UBI judgments. Lucidworks Fusion is explicitly dropped (see [`chore_drop_fusion_scope/idea.md`](../02_product/planned_features/chore_drop_fusion_scope/idea.md)) — a community-contributed Fusion adapter remains possible against this Protocol, but the project does not own that direction. Per-release timing per [`tech-stack.md` §"Canonical release matrix"](tech-stack.md). **Source of truth for product context:** [docs/00_overview/relyloop-spec.md §8](../00_overview/relyloop-spec.md) ("Engine adapter specification") and §11 ("Search space & parameters"). --- @@ -108,7 +108,7 @@ Templates use **unified parameter names**. The adapter pivots them to native nam **When a concept doesn't exist natively**, the adapter either provides a best-effort translation OR raises `UnsupportedParameter` at render time. The search-space validator catches this before a study runs (rejects the study definition rather than failing trials individually). -The earlier `stage_enabled` unified-vocabulary parameter (Fusion-specific pipeline stage toggle) was removed when Fusion was dropped — see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md). +The earlier `stage_enabled` unified-vocabulary parameter (Fusion-specific pipeline stage toggle) was removed when Fusion was dropped — see [`chore_drop_fusion_scope/idea.md`](../02_product/planned_features/chore_drop_fusion_scope/idea.md). ## Authentication and credentials @@ -127,7 +127,7 @@ Credentials never live in the database. The `clusters.credentials_ref` column is ### SolrAdapter (MVP2) -Apache Solr ships in MVP2 alongside UBI judgments. Full scope in [`infra_adapter_solr/idea.md`](../../02_product/planned_features/infra_adapter_solr/idea.md). 
Summary: +Apache Solr ships in MVP2 alongside UBI judgments. Full scope in [`infra_adapter_solr/idea.md`](../02_product/planned_features/infra_adapter_solr/idea.md). Summary: - `search_batch` uses parallel `/select` requests with a connection pool (Solr has no `_msearch` equivalent). - `render` produces a Solr request parameter dict; templates under `templates/solr/` mirror the `templates/elasticsearch/` shape. Supports `edismax` (primary), `dismax`, `lucene` parsers. diff --git a/docs/01_architecture/mvp1-overview.md b/docs/01_architecture/mvp1-overview.md index 12900515..f5f2320c 100644 --- a/docs/01_architecture/mvp1-overview.md +++ b/docs/01_architecture/mvp1-overview.md @@ -62,7 +62,7 @@ These appear in the topical arch docs because the docs cover all releases — bu - LTR training (cross-engine model training; MVP2's LTR support is consume-only) - Path B (production monitoring, bandits, shadow validation) - Helm chart maturity; Kubernetes-native operator -- Lucidworks Fusion adapter (explicitly dropped — see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md)) +- Lucidworks Fusion adapter (explicitly dropped — see [`chore_drop_fusion_scope/idea.md`](../02_product/planned_features/chore_drop_fusion_scope/idea.md)) ### Reserved for v2+ - `SolrAdapter` (pure Apache Solr support) diff --git a/docs/01_architecture/optimization.md b/docs/01_architecture/optimization.md index e5323812..d3ae8fdd 100644 --- a/docs/01_architecture/optimization.md +++ b/docs/01_architecture/optimization.md @@ -192,6 +192,6 @@ contract is reviewed in [`feat_pr_metric_confidence/feature_spec.md`](../02_prod | CMA-ES sampler (selectable per study) | MVP2 | TPE is sufficient for MVP1's low-dim search spaces; CMA-ES becomes valuable when adopters tune ≥7 continuous parameters. 
| | Intermediate-step pruning (truly active `MedianPruner`) | MVP2 | Requires multi-step trials (e.g., evaluate after each query batch); MVP1 trials evaluate once per (params, full query set). | | Multi-objective optimization (Pareto fronts via NSGA-II) | v2 | Single scalar objective is sufficient through GA v1; multi-objective adds product complexity (which Pareto trade-off do you ship?). | -| UBI-derived judgments + hybrid UBI+LLM converter | MVP2 | Bundled with the Solr adapter in MVP2 (see [`feat_ubi_judgments/idea.md`](../../02_product/planned_features/feat_ubi_judgments/idea.md)). The judgment `source = 'click'` enum value is reserved from MVP1 forward; the `UbiReader` + `SignalsConverter` land at MVP2. | +| UBI-derived judgments + hybrid UBI+LLM converter | MVP2 | Bundled with the Solr adapter in MVP2 (see [`feat_ubi_judgments/idea.md`](../02_product/planned_features/feat_ubi_judgments/idea.md)). The judgment `source = 'click'` enum value is reserved from MVP1 forward; the `UbiReader` + `SignalsConverter` land at MVP2. | | Counterfactual click models (CCM, DBN) as additional `SignalsConverter` impls | Backlog | Require enough impressions per (query, doc) to be statistically valid; promoted out when post-MVP2 adopter traffic supports it. | | Engine-native click readers (Elastic Behavioral Analytics) | Backlog | UBI covers the engine-neutral path for ES + OpenSearch + Solr. Elastic BA is a residual ES-shop bridge despite Elastic's 9.0 deprecation; landed when an adopter requires it. 
| diff --git a/docs/01_architecture/tech-stack.md b/docs/01_architecture/tech-stack.md index 46dde55f..59ab265d 100644 --- a/docs/01_architecture/tech-stack.md +++ b/docs/01_architecture/tech-stack.md @@ -15,7 +15,7 @@ This is the source-of-truth release matrix that every other arch doc derives fro | **MVP2 / v0.2** | "Three-Engine + Real Signals" | **Apache Solr adapter** (`auth_kind = solr_basic` and `solr_apikey`) covering Solr 9.x + 10.x via `edismax` + `{!ltr}` rescoring + `solr.UBIComponent` for UBI capture; **UBI judgments** via engine-agnostic `UbiReader` (reads `ubi_queries` + `ubi_events` via any `SearchAdapter`'s `search_batch`); pluggable `SignalsConverter` Protocol (position-bias-corrected CTR, dwell-time threshold, **hybrid UBI+LLM**); `POST /api/v1/judgment-lists/generate-from-ubi` + `generate_judgments_from_ubi` agent tool; mixed-source judgment lists (`llm` + `human` + `click` rows in the same list — the existing `judgments.source` enum already permits this). After MVP2 ships, RelyLoop runs on all three OSS engines with UBI on every one of them. **No** schema migration for UBI (additive — uses existing `source = 'click'` enum value); **one** small migration extends `engine_type` + `auth_kind` CHECK constraints to accept Solr values. | | **MVP3 / v0.3** | "Observable" | Langfuse + ClickHouse + SigNoz + OpenTelemetry exporters wired; canonical event catalog; **`audit_log` table + Postgres immutability trigger** (no users/tenants yet — `actor_id`/`tenant_id` nullable, no FKs); lineage columns (`langfuse_trace_id`, `prompt_version`, `input_hash`) on `judgments`/`digests`/`proposals`; PII redaction; trace context propagation through API → Redis → worker → adapter → engine for all three engines. 
| | **GA v1 / v1.0** | "Production-ready" | **LangGraph orchestrator** (replaces plain `openai` SDK + function calling); `PostgresSaver` for resumable conversations; full RFC 7807 Problem Details on errors; `Idempotency-Key` header on POST/PATCH/DELETE; full four-layer test pyramid at 90% coverage; complete CI/CD with security gates (Trivy, bandit, pip-audit, npm audit); image signing (cosign keyless OIDC); Helm 3 chart; complete OSS launch infrastructure (docs, ADRs, contributor onboarding, design-partner references); public Optuna-vs-SRW-grid benchmark. **No new product surface** — all six differentiators are GA by MVP3; GA v1 is polish + governance + hardening. | -| **Backlog** | — | Multi-Git provider abstraction (GitLab + Bitbucket); multi-tenancy primitives (`tenants` + `tenant_memberships` + `users` + `api_keys` tables; `tenant_id` columns; roles `viewer`/`runner`/`tenant_admin`/`platform_admin`); SSO via reverse proxy; Argon2id-hashed bearer API keys; native non-OpenAI provider SDKs (Anthropic, Bedrock, Vertex, Azure OpenAI); LTR training; Path B (production-quality monitoring, bandits, shadow validation, manual one-click rollback); Helm chart maturity; Lucidworks Fusion adapter (explicitly dropped — see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md)). | +| **Backlog** | — | Multi-Git provider abstraction (GitLab + Bitbucket); multi-tenancy primitives (`tenants` + `tenant_memberships` + `users` + `api_keys` tables; `tenant_id` columns; roles `viewer`/`runner`/`tenant_admin`/`platform_admin`); SSO via reverse proxy; Argon2id-hashed bearer API keys; native non-OpenAI provider SDKs (Anthropic, Bedrock, Vertex, Azure OpenAI); LTR training; Path B (production-quality monitoring, bandits, shadow validation, manual one-click rollback); Helm chart maturity; Lucidworks Fusion adapter (explicitly dropped — see [`chore_drop_fusion_scope/idea.md`](../02_product/planned_features/chore_drop_fusion_scope/idea.md)). 
| **Audit-without-users design:** MVP3 ships `audit_log` with `actor_id` / `tenant_id` as nullable UUIDs with **no FK constraints**, plus an `actor_type` ENUM constrained to `system` / `agent` / `anonymous`. The FK constraints and the `user` actor type ship when multi-tenancy is promoted from backlog. Pre-multi-tenancy audit rows keep `actor_id = NULL`. See [`data-model.md` §"`audit_log`"](data-model.md) for the schema. @@ -71,7 +71,7 @@ This is the source-of-truth release matrix that every other arch doc derives fro |---|---|---| | Database (app) | Postgres 16 | Single instance. Holds app state + Optuna RDBStorage. | | Cache / queue | Redis 7 | Used by Arq for the worker queue. | -| Search engines (targets) | Elasticsearch 8.11+ / 9.x; OpenSearch 2.x / 3.x (MVP1); Apache Solr 9.x / 10.x (MVP2) | Lucidworks Fusion explicitly dropped (see [`chore_drop_fusion_scope/idea.md`](../../02_product/planned_features/chore_drop_fusion_scope/idea.md)). | +| Search engines (targets) | Elasticsearch 8.11+ / 9.x; OpenSearch 2.x / 3.x (MVP1); Apache Solr 9.x / 10.x (MVP2) | Lucidworks Fusion explicitly dropped (see [`chore_drop_fusion_scope/idea.md`](../02_product/planned_features/chore_drop_fusion_scope/idea.md)). | | Reverse proxy | Caddy 2 | NOT in MVP1. Production-style install (TLS via Caddy + Let's Encrypt) lands as GA v1 hardening for trusted-network deployments. SSO (oauth2-proxy or Authelia in front of Caddy) is in the backlog with multi-tenancy. | | Trace storage (LLM) | ClickHouse 24 | NOT in MVP1 (Langfuse is MVP2+). | | Container runtime | Docker 24+ with Compose v2 | MVP1 deployment target. |