feat(inference): multi-route proxy with alias-based model routing #618

cosmicnet wants to merge 5 commits into NVIDIA:main
Conversation
All contributors have signed the DCO ✍️ ✅
Pull request overview
Adds multi-route inference proxying so sandboxes can route inference.local requests to multiple LLM backends by using a model alias in the request body.
Changes:
- Extends the inference proto and gateway storage to support multiple (alias, provider_name, model_id) entries per route.
- Adds alias-first route selection in the router and passes a model_hint extracted from sandbox request bodies.
- Expands sandbox L7 inference patterns and adds an Ollama provider profile and endpoint validation probe.
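The model-hint extraction mentioned above can be sketched as follows. This is a dependency-free illustration, not the actual proxy.rs code (which presumably uses a real JSON parser); the helper name is hypothetical, and a production version would parse the body properly rather than scanning for the key.

```rust
/// Best-effort extraction of the `model` field from a JSON request body.
/// Hypothetical sketch: scans for a string-valued "model" key and returns
/// None when the body has no such field, so the router can fall back to
/// protocol-based selection.
fn extract_model_hint(body: &str) -> Option<String> {
    let idx = body.find("\"model\"")?;
    let rest = &body[idx + "\"model\"".len()..];
    let rest = rest.trim_start().strip_prefix(':')?;
    let rest = rest.trim_start().strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(rest[..end].to_string())
}

fn main() {
    assert_eq!(
        extract_model_hint(r#"{"model":"claude","messages":[]}"#),
        Some("claude".to_string())
    );
    // Non-JSON bodies or bodies without a model field yield no hint.
    assert_eq!(extract_model_hint("not json"), None);
    println!("ok");
}
```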
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| proto/inference.proto | Adds InferenceModelEntry and models fields for multi-model inference config. |
| crates/openshell-server/src/inference.rs | Implements multi-model upsert + resolves each alias into separate ResolvedRoute entries. |
| crates/openshell-sandbox/src/proxy.rs | Extracts model from JSON body and forwards it as model_hint to the router. |
| crates/openshell-sandbox/src/l7/inference.rs | Adds Codex + Ollama native API patterns and tests. |
| crates/openshell-router/src/lib.rs | Adds select_route() and extends proxy APIs to accept model_hint. |
| crates/openshell-router/src/backend.rs | Adds Ollama validation probe and changes backend URL construction behavior. |
| crates/openshell-router/tests/backend_integration.rs | Updates tests for new proxy function signatures and /v1 endpoint expectations. |
| crates/openshell-core/src/inference.rs | Adds OLLAMA_PROFILE (protocols/base URL/config keys). |
| crates/openshell-cli/src/run.rs | Adds gateway_inference_set_multi() to send multi-model configs. |
| crates/openshell-cli/src/main.rs | Adds --model-alias ALIAS=PROVIDER/MODEL CLI flag and dispatch. |
| architecture/inference-routing.md | Documents alias-based route selection, new patterns, and multi-model route behavior. |
Force-pushed af1748b to ab71175
Force-pushed ab71175 to d887f04
@pimlock Happy to address any feedback or questions. Let me know if you'd like anything restructured or split differently.
I am curious: if you need this level of routing support, have you considered setting up a dedicated proxy/router that is accessible outside of the sandbox and just configuring access to it with network policies? This is a typical pattern several of our users follow. |
```rust
const OLLAMA_PROTOCOLS: &[&str] = &[
    "ollama_chat",
    "ollama_model_discovery",
    "openai_chat_completions",
    "openai_completions",
    "model_discovery",
];
```
Is there a reason for using ollama inference protocol, rather than OpenAI one? Is there something extra that ollama supports that cannot be accessed through OpenAI one?
Ollama exposes native endpoints (/api/chat, /api/tags, /api/show) that provide capabilities not available through its OpenAI-compatible layer:
- /api/tags lists all locally available models (no OpenAI equivalent)
- /api/show returns model metadata: parameters, template, license, quantization info
- /api/chat supports Ollama-specific options like num_ctx, num_predict, temperature variants, and raw mode
The OLLAMA_PROTOCOLS list includes both native and OpenAI-compatible protocols (openai_chat_completions, openai_completions, model_discovery), so agents can use either interface. The native protocols are there so tools that use the Ollama client library directly (which targets /api/*) work through inference.local without needing to switch to the OpenAI-compat paths.
If you'd prefer to keep it simpler and only support Ollama through its OpenAI-compat layer, I can drop the native patterns and the ollama_chat/ollama_model_discovery protocols. The tradeoff is that model discovery (/api/tags) and agent tooling that uses the Ollama SDK directly wouldn't work.
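The native/OpenAI-compat split described above can be illustrated with a minimal dispatch table. This is only a sketch: the real matcher in l7/inference.rs uses InferenceApiPattern structs with glob paths, and the exact OpenAI-compat paths shown here (e.g. /v1/models for model_discovery) are assumptions.

```rust
/// Sketch of mapping a sandbox request to an inference protocol.
/// Mirrors the patterns discussed above; paths for the OpenAI-compat
/// protocols are assumed, not taken from the PR.
fn classify(method: &str, path: &str) -> Option<&'static str> {
    match (method, path) {
        // Ollama native endpoints
        ("POST", "/api/chat") => Some("ollama_chat"),
        ("GET", "/api/tags") | ("POST", "/api/show") => Some("ollama_model_discovery"),
        // OpenAI-compatible layer
        ("POST", "/v1/chat/completions") => Some("openai_chat_completions"),
        ("POST", "/v1/completions") => Some("openai_completions"),
        ("GET", "/v1/models") => Some("model_discovery"),
        _ => None,
    }
}

fn main() {
    assert_eq!(classify("POST", "/api/chat"), Some("ollama_chat"));
    assert_eq!(classify("GET", "/api/tags"), Some("ollama_model_discovery"));
    // Wrong method is rejected, matching the PR's GET /api/chat rejection test.
    assert_eq!(classify("GET", "/api/chat"), None);
    println!("ok");
}
```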
Thanks for the feedback. This PR follows the approach outlined in #203 (option B: single record with repeated entries, alias-first selection with protocol fallback, model hint from the request body). I appreciate that was closed off citing the replacement issue #207, but that covers a different concern. System vs user inference is about who the route serves, not how many backends it can reach. This PR already accommodates that split through the system route guard and separate sandbox.inference.local endpoint.

On the external proxy: it's a valid pattern, but the overhead feels disproportionate here. This is a static alias lookup table. There's no load balancing, retries, rate limiting, or discovery. The maintenance surface is one function, one proto field, and one server method. For users with 2-3 providers, standing up a separate proxy service is a lot of ceremony for a lookup table.

More broadly, my understanding is that NemoClaw/OpenShell is positioned as an enterprise-ready platform for running AI agents securely out of the box. In that context, multi-model access feels like a baseline expectation rather than an edge case. Agents routinely need a fast cheap model for simple tasks and a more capable one for complex reasoning, or a specialised model for specific domains. If each of those requires its own external proxy and network policy, that's a significant barrier to the "out of the box" experience. Maybe I'm misunderstanding the intended scope, but it's hard to see how single-model inference serves that use case long term.

If the team has decided this doesn't belong in the embedded proxy, I can scope this down to just the Ollama native API support and Codex pattern matching (commits 1-2) and drop the multi-model routing. Happy to go either way.
Force-pushed d887f04 to db606c1
Latest update is mostly a rebase onto the current main, plus one follow-up fix.

While applying the earlier Copilot feedback around URL handling, I ended up introducing a Codex-specific regression. The generic /v1 cleanup was fine for the normal OpenAI-style endpoints, but Codex needs slightly different path handling. This update fixes that by making the Codex rewrite explicit and scoped to the /v1/codex/* pattern instead of changing the behaviour for every provider.

I also tightened multi-model route selection so the request model hint can match either the configured alias or the configured model ID before falling back to the first protocol-compatible route. That avoids Codex and other openai_responses requests being routed to the wrong backend when multiple routes share the same protocol.

The branch has also been rebased onto the latest main, and I reran the relevant Rust test suites after the rebase.
Add pattern detection, provider profile, and validation probe for Ollama's native /api/chat, /api/tags, and /api/show endpoints.

Proxy changes (l7/inference.rs):
- POST /api/chat -> ollama_chat protocol
- GET /api/tags -> ollama_model_discovery protocol
- POST /api/show -> ollama_model_discovery protocol

Provider profile (openshell-core/inference.rs):
- New 'ollama' provider type with default endpoint http://host.openshell.internal:11434
- Supports ollama_chat, ollama_model_discovery, and OpenAI-compatible protocols (openai_chat_completions, openai_completions, model_discovery)
- Credential lookup via OLLAMA_API_KEY, base URL via OLLAMA_BASE_URL

Validation (backend.rs):
- Ollama validation probe sends minimal /api/chat request with stream:false

Tests: 4 new tests for pattern detection (ollama chat, tags, show, and GET /api/chat rejection).

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
- Proto: add InferenceModelEntry message with alias/provider/model fields; add repeated models field to ClusterInferenceConfig, Set/Get request/response
- Server: add upsert_multi_model_route() for storing multiple model entries under a single route slot; update resolve_route_by_name() to expand multi-model configs into per-alias ResolvedRoute entries
- Router: add select_route() with alias-first, protocol-fallback strategy; add model_hint parameter to proxy_with_candidates() variants
- Sandbox proxy: extract model field from JSON body as routing hint
- Tests: 7 new tests covering select_route, multi-model resolution, and bundle expansion; all 291 existing tests continue to pass

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
- Add --model-alias flag to 'inference set' for multi-model config (e.g. --model-alias gpt=openai/gpt-4 --model-alias claude=anthropic/claude-sonnet-4-20250514)
- Add gateway_inference_set_multi() handler in run.rs
- Update inference get/print to display multi-model entries
- Import InferenceModelEntry proto type in CLI
- Fix build_backend_url to always strip /v1 prefix for codex paths
- Add /v1/codex/* inference pattern for openai_responses protocol
- Fix backend tests to use /v1 endpoint suffix

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
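The ALIAS=PROVIDER/MODEL flag syntax in the commit above can be parsed with a small helper. This is a sketch of the CLI-side parsing only; the function name is hypothetical and the real flag handling in main.rs may differ (e.g. using clap value parsers).

```rust
/// Parse a --model-alias value of the form ALIAS=PROVIDER/MODEL into
/// its three components. Hypothetical helper mirroring the flag syntax.
fn parse_model_alias(arg: &str) -> Result<(String, String, String), String> {
    let (alias, rest) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected ALIAS=PROVIDER/MODEL, got '{arg}'"))?;
    // split_once keeps everything after the first '/' as the model ID,
    // so model names containing '/' would need extra care in practice.
    let (provider, model) = rest
        .split_once('/')
        .ok_or_else(|| format!("expected PROVIDER/MODEL after '=', got '{rest}'"))?;
    if alias.is_empty() || provider.is_empty() || model.is_empty() {
        return Err(format!("empty component in '{arg}'"));
    }
    Ok((alias.to_string(), provider.to_string(), model.to_string()))
}

fn main() {
    let (alias, provider, model) = parse_model_alias("gpt=openai/gpt-4").unwrap();
    assert_eq!(alias, "gpt");
    assert_eq!(provider, "openai");
    assert_eq!(model, "gpt-4");
    // Malformed values are rejected rather than silently mis-parsed.
    assert!(parse_model_alias("no-equals-sign").is_err());
    println!("ok");
}
```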
…te guard

- Add timeout_secs parameter to gateway_inference_set_multi and pass through to SetClusterInferenceRequest
- Add print_timeout to multi-model output display
- Add timeout field to router test helper make_route (upstream added timeout to ResolvedRoute)
- Add system route guard: upsert_multi_model_route rejects route_name == sandbox-system with InvalidArgument
- Add timeout_secs: 0 to multi-model test ClusterInferenceConfig structs
- Add upsert_multi_model_route_rejects_system_route test

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
…election
When multiple routes share the same protocol (e.g. openai_responses),
select_route() only matched the model hint from the request body against
route aliases (names). If an agent sent the actual model ID (e.g.
"gpt-5.4") instead of the alias ("openai-codex"), the alias lookup
missed and the router fell back to the first protocol-compatible route,
which could be a completely different provider.
Add a second lookup pass that matches the hint against route.model before
falling back to blind protocol selection. Priority order:
1. Alias match (route name == hint) — existing behavior
2. Model ID match (route model == hint) — new
3. First protocol-compatible route — existing fallback
Also add strip_version_prefix field to InferenceApiPattern so the codex
pattern (/v1/codex/*) can strip the /v1 proxy artifact before forwarding,
allowing backends whose base URL omits /v1 to receive the correct path.
Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
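The three-tier priority in the commit message above can be sketched as follows. This is a simplified model, not the actual select_route() in openshell-router: the Route struct and the example aliases ("anthropic-default") are hypothetical, though "openai-codex" and "gpt-5.4" come from the commit message.

```rust
/// Simplified stand-in for a ResolvedRoute entry.
struct Route {
    alias: String,    // route name used for alias matching
    model: String,    // configured model ID
    protocol: String, // e.g. "openai_responses"
}

/// Sketch of the selection order described above:
/// 1. alias match, 2. model-ID match, 3. first protocol-compatible route.
fn select_route<'a>(routes: &'a [Route], protocol: &str, hint: Option<&str>) -> Option<&'a Route> {
    let compat = |r: &&Route| r.protocol == protocol;
    if let Some(h) = hint {
        if let Some(r) = routes.iter().filter(compat).find(|r| r.alias == h) {
            return Some(r); // pass 1: alias match (existing behavior)
        }
        if let Some(r) = routes.iter().filter(compat).find(|r| r.model == h) {
            return Some(r); // pass 2: model-ID match (new in this commit)
        }
    }
    routes.iter().find(compat) // pass 3: first protocol-compatible route
}

fn main() {
    let routes = vec![
        Route { alias: "anthropic-default".into(), model: "claude-x".into(), protocol: "openai_responses".into() },
        Route { alias: "openai-codex".into(), model: "gpt-5.4".into(), protocol: "openai_responses".into() },
    ];
    // A raw model ID now matches pass 2 instead of falling through to the
    // first route sharing the protocol.
    assert_eq!(select_route(&routes, "openai_responses", Some("gpt-5.4")).unwrap().alias, "openai-codex");
    // Alias matching still takes priority.
    assert_eq!(select_route(&routes, "openai_responses", Some("openai-codex")).unwrap().alias, "openai-codex");
    // No hint: blind protocol fallback picks the first compatible route.
    assert_eq!(select_route(&routes, "openai_responses", None).unwrap().alias, "anthropic-default");
    println!("ok");
}
```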
Force-pushed db606c1 to e36f9f5
Hi @cosmicnet, thanks for your patience; I took some time today to properly review this. This PR touches a few distinct areas: Ollama support, Codex routing, and multi-model routing. For future changes, smaller PRs split by concern would make it easier to review and land incrementally. It would also be helpful to see examples of use-cases that the change is addressing (e.g. without this change X is not possible, and here's how this change supports that). Without this it's hard to properly evaluate the solution.
> running AI agents securely out of the box

The path that goes through the network policy is the more secure path. Opening up the inference.local path to many more upstream endpoints would be considered less secure -> every sandbox gets access to it, and there is less control over it. Configuring inference via network policies gives more granular controls and better security. It's configured per-sandbox, so every sandbox can use a different configuration.
There are some improvements to how providers work coming, which will make it easier to manage policies with providers (right now they are separate, but we are planning to include a policy with a provider, so adding a provider to a sandbox will automatically add a policy entry).
```rust
InferenceApiPattern {
    method: "POST".to_string(),
    path_glob: "/v1/codex/*".to_string(),
    protocol: "openai_responses".to_string(),
    kind: "codex_responses".to_string(),
    strip_version_prefix: true,
```
When would this happen? Do you have an example of something that fails and this is a fix for it?
I looked into how the codex is configured and when running inside of the sandbox with the inference.local configured as base_url (docs: https://developers.openai.com/codex/config-advanced, under "Custom model providers"), the request path is /responses.
My guess is that you're trying something like:

- sandbox: codex running with base_url: inference.local/v1
- inference is configured with chatgpt.com/backend-api/codex
- sandbox: codex makes a request to inference.local/v1/responses
- supervisor: the request gets parsed, v1/responses becomes the path, gets appended to chatgpt.com/backend-api/codex -> chatgpt.com/backend-api/codex/v1/responses, which is incorrect.
Is this the problem? And the solution is to set the codex inside of the sandbox to inference.local/v1/codex, the v1 gets stripped, and the request goes to chatgpt.com/backend-api/codex/responses?
I think this may be a bug in how we handle inference, it seems like when we intercept the request from the sandbox, we should either:
- match on the path without the version prefix
- match with the version prefix, but not include it in the request upstream
For example:
- request to inference.local/v1/responses with inference configured for foo.com/bar should go to foo.com/bar/responses
Right now, such a thing is not possible, which feels like a real limitation. At least when dealing with OpenAI's client libraries, the base_url is expected to have /v1 as a suffix, so the request path is just /responses and not v1/responses.
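The behavior proposed here (match with the version prefix but drop it from the upstream request) reduces to a small path-join helper. This is a sketch under assumed names, not the actual build_backend_url() in backend.rs:

```rust
/// Join a backend base URL and a sandbox request path, optionally
/// stripping a leading /v1 so a backend whose base URL omits /v1
/// still receives the un-versioned path.
fn build_upstream_url(base: &str, path: &str, strip_version_prefix: bool) -> String {
    let path = if strip_version_prefix {
        // Note: a production version should also check for a path-segment
        // boundary, so that e.g. /v1beta/... is not truncated to beta/...
        path.strip_prefix("/v1").unwrap_or(path)
    } else {
        path
    };
    format!("{}{}", base.trim_end_matches('/'), path)
}

fn main() {
    // The example from the comment above: inference configured for
    // foo.com/bar should receive /responses, not /v1/responses.
    assert_eq!(
        build_upstream_url("https://foo.com/bar", "/v1/responses", true),
        "https://foo.com/bar/responses"
    );
    // Without stripping, the /v1 proxy artifact leaks into the upstream path.
    assert_eq!(
        build_upstream_url("https://foo.com/bar", "/v1/responses", false),
        "https://foo.com/bar/v1/responses"
    );
    println!("ok");
}
```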
Please let me know what the use-case is for this change.
```rust
};

static OLLAMA_PROFILE: InferenceProviderProfile = InferenceProviderProfile {
    provider_type: "ollama",
```
I looked closer into this, and to make this actually work you'd need to create an ollama provider as well. This code is capturing the ollama-specific inference API out of the sandbox, but there is no upstream to go to.

openshell provider create --type ollama  # doesn't exist

Without that, this code will capture ollama-specific requests, but these will get rejected anyway.
Overall, the inference.local path isn't meant to support any potential inference request; those are still possible via network policies. It's very much possible today to route inference to ollama from a sandbox: you'd use host.openshell.internal, add it to the network policy, and then you can talk directly to that endpoint.
The original intent of the inference.local was to use a local model that is managed within the cluster. The work on this is still ongoing, but requires better GPU support to be implemented. We added the ability for inference.local to route to external models for cases where the local model could not be deployed (e.g. no GPU is available). I will take a look at our docs and make sure this is clear.
Summary

Adds multi-route inference proxy support, allowing sandboxed agents to reach multiple LLM providers (OpenAI, Anthropic, NVIDIA, Ollama) through a single inference.local endpoint. Agents select a backend by setting the model field to an alias name. Also adds Ollama native API support and Codex URL pattern matching.

Related Issue

Closes #203

Changes

- Proto: add InferenceModelEntry message (alias, provider_name, model_id); add models repeated field to set/get request/response messages
- Server: upsert_multi_model_route() validates and stores multiple alias -> provider mappings; resolves each entry into a separate ResolvedRoute at bundle time
- Router: select_route() implements alias-first, protocol-fallback selection; proxy_with_candidates / proxy_with_candidates_streaming accept an optional model_hint
- Sandbox proxy: extracts the model field from the request body as model_hint for route selection
- Patterns: add /v1/codex/*, /api/chat, /api/tags, /api/show inference patterns
- Backend: build_backend_url() always strips the /v1 prefix to support both versioned and non-versioned endpoints (e.g. Codex)
- Core: add OLLAMA_PROFILE provider profile with native + OpenAI-compat protocols
- CLI: add --model-alias ALIAS=PROVIDER/MODEL flag (repeatable, conflicts with --provider / --model)
- Docs: update inference-routing.md with all new sections

Testing

- mise run pre-commit passes

Checklist