feat(inference): multi-route proxy with alias-based model routing#618

Open
cosmicnet wants to merge 5 commits into NVIDIA:main from cosmicnet:203-multi-route-inference/lh

Conversation

@cosmicnet

@cosmicnet cosmicnet commented Mar 25, 2026

Summary

Adds multi-route inference proxy support, allowing sandboxed agents to reach multiple LLM providers (OpenAI, Anthropic, NVIDIA, Ollama) through a single inference.local endpoint. Agents select a backend by setting the model field to an alias name. Also adds Ollama native API support and Codex URL pattern matching.

Related Issue

Closes #203

Changes

  • Proto: Add InferenceModelEntry message (alias, provider_name, model_id); add models repeated field to set/get request/response messages
  • Server: upsert_multi_model_route() validates and stores multiple alias→provider mappings; resolves each entry into a separate ResolvedRoute at bundle time
  • Router: select_route() implements alias-first, protocol-fallback selection; proxy_with_candidates/proxy_with_candidates_streaming accept optional model_hint
  • Sandbox proxy: Extracts model field from request body as model_hint for route selection
  • Sandbox L7: Add /v1/codex/*, /api/chat, /api/tags, /api/show inference patterns
  • Backend: build_backend_url() always strips /v1 prefix to support both versioned and non-versioned endpoints (e.g. Codex)
  • Core: Add OLLAMA_PROFILE provider profile with native + OpenAI-compat protocols
  • CLI: --model-alias ALIAS=PROVIDER/MODEL flag (repeatable, conflicts with --provider/--model)
  • Architecture docs: Updated inference-routing.md with all new sections
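
For illustration, the ALIAS=PROVIDER/MODEL format accepted by --model-alias could be parsed along these lines (a hedged sketch; `ModelEntry` and the error messages are hypothetical, not the actual CLI types):

```rust
/// Illustrative parsed form of one `--model-alias` value.
/// NOTE: hypothetical type, not the CLI's actual representation.
#[derive(Debug, PartialEq)]
struct ModelEntry {
    alias: String,
    provider: String,
    model_id: String,
}

/// Parse "ALIAS=PROVIDER/MODEL", e.g. "gpt=openai/gpt-4".
fn parse_model_alias(arg: &str) -> Result<ModelEntry, String> {
    // Split off the alias at the first '='.
    let (alias, rest) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected ALIAS=PROVIDER/MODEL, got '{arg}'"))?;
    // Split provider and model at the first '/'; anything after it
    // stays in the model ID.
    let (provider, model_id) = rest
        .split_once('/')
        .ok_or_else(|| format!("expected PROVIDER/MODEL after '=', got '{rest}'"))?;
    if alias.is_empty() || provider.is_empty() || model_id.is_empty() {
        return Err(format!("empty component in '{arg}'"));
    }
    Ok(ModelEntry {
        alias: alias.to_string(),
        provider: provider.to_string(),
        model_id: model_id.to_string(),
    })
}
```

So `--model-alias gpt=openai/gpt-4` would yield alias "gpt", provider "openai", model ID "gpt-4", with the flag repeatable once per entry.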

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@cosmicnet cosmicnet requested a review from a team as a code owner March 25, 2026 23:52
Copilot AI review requested due to automatic review settings March 25, 2026 23:52
@github-actions

github-actions bot commented Mar 25, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@cosmicnet
Author

I have read the DCO document and I hereby sign the DCO.


Copilot AI left a comment


Pull request overview

Adds multi-route inference proxying so sandboxes can route inference.local requests to multiple LLM backends by using a model alias in the request body.

Changes:

  • Extends the inference proto + gateway storage to support multiple (alias, provider_name, model_id) entries per route.
  • Adds alias-first route selection in the router and passes a model_hint extracted from sandbox request bodies.
  • Expands sandbox L7 inference patterns and adds an Ollama provider profile + endpoint validation probe.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

proto/inference.proto Adds InferenceModelEntry and models fields for multi-model inference config.
crates/openshell-server/src/inference.rs Implements multi-model upsert + resolves each alias into separate ResolvedRoute entries.
crates/openshell-sandbox/src/proxy.rs Extracts model from JSON body and forwards it as model_hint to the router.
crates/openshell-sandbox/src/l7/inference.rs Adds Codex + Ollama native API patterns and tests.
crates/openshell-router/src/lib.rs Adds select_route() and extends proxy APIs to accept model_hint.
crates/openshell-router/src/backend.rs Adds Ollama validation probe and changes backend URL construction behavior.
crates/openshell-router/tests/backend_integration.rs Updates tests for new proxy function signatures and /v1 endpoint expectations.
crates/openshell-core/src/inference.rs Adds OLLAMA_PROFILE (protocols/base URL/config keys).
crates/openshell-cli/src/run.rs Adds gateway_inference_set_multi() to send multi-model configs.
crates/openshell-cli/src/main.rs Adds --model-alias ALIAS=PROVIDER/MODEL CLI flag and dispatch.
architecture/inference-routing.md Documents alias-based route selection, new patterns, and multi-model route behavior.


@cosmicnet cosmicnet force-pushed the 203-multi-route-inference/lh branch from af1748b to ab71175 on March 26, 2026 00:36
@pimlock pimlock self-assigned this Mar 30, 2026
@cosmicnet cosmicnet force-pushed the 203-multi-route-inference/lh branch from ab71175 to d887f04 on April 1, 2026 19:44
@cosmicnet
Author

@pimlock Happy to address any feedback or questions. Let me know if you'd like anything restructured or split differently.

@johntmyers
Collaborator

The use of inference.local was to provide, at a minimum, a default model for all sandboxes to have access to (if configured). We're cautious about bloating the embedded sandbox inference router to support arbitrary upstream providers and models, and about turning it into a larger-scale model router that would require ongoing maintenance. We're still determining what level of routing support we should have on our roadmap.

I am curious: if you need this level of routing support, have you considered setting up a dedicated proxy/router that is accessible outside the sandbox and just configuring access to it with network policies? This is a typical pattern several of our users follow.

Comment on lines +59 to +65
const OLLAMA_PROTOCOLS: &[&str] = &[
"ollama_chat",
"ollama_model_discovery",
"openai_chat_completions",
"openai_completions",
"model_discovery",
];
Collaborator


Is there a reason for using ollama inference protocol, rather than OpenAI one? Is there something extra that ollama supports that cannot be accessed through OpenAI one?

Author


Ollama exposes native endpoints (/api/chat, /api/tags, /api/show) that provide capabilities not available through its OpenAI-compatible layer:

  • /api/tags lists all locally available models (no OpenAI equivalent)
  • /api/show returns model metadata: parameters, template, license, quantization info
  • /api/chat supports Ollama-specific options like num_ctx, num_predict, temperature variants, and raw mode

The OLLAMA_PROTOCOLS list includes both native and OpenAI-compatible protocols (openai_chat_completions, openai_completions, model_discovery), so agents can use either interface. The native protocols are there so tools that use the Ollama client library directly (which targets /api/*) work through inference.local without needing to switch to the OpenAI-compat paths.

If you'd prefer to keep it simpler and only support Ollama through its OpenAI-compat layer, I can drop the native patterns and the ollama_chat/ollama_model_discovery protocols. The tradeoff is that model discovery (/api/tags) and agent tooling that uses the Ollama SDK directly wouldn't work.
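
As a sketch, the native-endpoint dispatch described above amounts to a small method+path table (`InferencePattern` and `match_pattern` are illustrative names, not the crate's actual `InferenceApiPattern` API):

```rust
/// Hypothetical pattern entry; the real crate type has more fields.
struct InferencePattern {
    method: &'static str,
    path: &'static str,
    protocol: &'static str,
}

// Native Ollama endpoints mapped to their inference protocols,
// mirroring the commit description (POST /api/chat, GET /api/tags,
// POST /api/show).
const OLLAMA_PATTERNS: &[InferencePattern] = &[
    InferencePattern { method: "POST", path: "/api/chat", protocol: "ollama_chat" },
    InferencePattern { method: "GET",  path: "/api/tags", protocol: "ollama_model_discovery" },
    InferencePattern { method: "POST", path: "/api/show", protocol: "ollama_model_discovery" },
];

/// Return the protocol for a matching method+path, or None
/// (e.g. GET /api/chat is rejected because only POST is listed).
fn match_pattern(method: &str, path: &str) -> Option<&'static str> {
    OLLAMA_PATTERNS
        .iter()
        .find(|p| p.method == method && p.path == path)
        .map(|p| p.protocol)
}
```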

@cosmicnet
Author

> The use of inference.local was to provide, at a minimum, a default model for all sandboxes to have access to (if configured). We're cautious about bloating the embedded sandbox inference router to support arbitrary upstream providers and models, and about turning it into a larger-scale model router that would require ongoing maintenance. We're still determining what level of routing support we should have on our roadmap.
>
> I am curious: if you need this level of routing support, have you considered setting up a dedicated proxy/router that is accessible outside the sandbox and just configuring access to it with network policies? This is a typical pattern several of our users follow.

Thanks for the feedback. This PR follows the approach outlined in #203 (option B: single record with repeated entries, alias-first selection with protocol fallback, model hint from the request body). I appreciate that was closed off citing the replacement issue #207, but that covers a different concern. System vs user inference is about who the route serves, not how many backends it can reach. This PR already accommodates that split through the system route guard and separate sandbox.inference.local endpoint.

On the external proxy: it's a valid pattern, but the overhead feels disproportionate here. This is a static alias lookup table. There's no load balancing, retries, rate limiting, or discovery. The maintenance surface is one function, one proto field, and one server method. For users with 2-3 providers, standing up a separate proxy service is a lot of ceremony for a lookup table.

More broadly, my understanding is that NemoClaw/OpenShell is positioned as an enterprise-ready platform for running AI agents securely out of the box. In that context, multi-model access feels like a baseline expectation rather than an edge case. Agents routinely need a fast cheap model for simple tasks and a more capable one for complex reasoning, or a specialised model for specific domains. If each of those requires its own external proxy and network policy, that's a significant barrier to the "out of the box" experience. Maybe I'm misunderstanding the intended scope, but it's hard to see how single-model inference serves that use case long term.

If the team has decided this doesn't belong in the embedded proxy, I can scope this down to just the Ollama native API support and Codex pattern matching (commits 1-2) and drop the multi-model routing. Happy to go either way.

@cosmicnet cosmicnet force-pushed the 203-multi-route-inference/lh branch from d887f04 to db606c1 on April 7, 2026 20:03
@cosmicnet
Author

Latest update is mostly a rebase onto the current main, plus one follow-up fix.

While applying the earlier Copilot feedback around URL handling, I ended up introducing a Codex-specific regression. The generic /v1 cleanup was fine for the normal OpenAI-style endpoints, but Codex needs slightly different path handling. This update fixes that by making the Codex rewrite explicit and scoped to the /v1/codex/* pattern instead of changing the behaviour for every provider.

I also tightened multi-model route selection so the request model hint can match either the configured alias or the configured model ID before falling back to the first protocol-compatible route. That avoids Codex and other openai_responses requests being routed to the wrong backend when multiple routes share the same protocol.

The branch has also been rebased onto the latest main, and I reran the relevant Rust test suites after the rebase.

Add pattern detection, provider profile, and validation probe for
Ollama's native /api/chat, /api/tags, and /api/show endpoints.

Proxy changes (l7/inference.rs):
- POST /api/chat -> ollama_chat protocol
- GET /api/tags -> ollama_model_discovery protocol
- POST /api/show -> ollama_model_discovery protocol

Provider profile (openshell-core/inference.rs):
- New 'ollama' provider type with default endpoint
  http://host.openshell.internal:11434
- Supports ollama_chat, ollama_model_discovery, and OpenAI-compatible
  protocols (openai_chat_completions, openai_completions, model_discovery)
- Credential lookup via OLLAMA_API_KEY, base URL via OLLAMA_BASE_URL

Validation (backend.rs):
- Ollama validation probe sends minimal /api/chat request with stream:false

Tests: 4 new tests for pattern detection (ollama chat, tags, show,
and GET /api/chat rejection).

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
- Proto: add InferenceModelEntry message with alias/provider/model fields;
  add repeated models field to ClusterInferenceConfig, Set/Get request/response
- Server: add upsert_multi_model_route() for storing multiple model entries
  under a single route slot; update resolve_route_by_name() to expand
  multi-model configs into per-alias ResolvedRoute entries
- Router: add select_route() with alias-first, protocol-fallback strategy;
  add model_hint parameter to proxy_with_candidates() variants
- Sandbox proxy: extract model field from JSON body as routing hint
- Tests: 7 new tests covering select_route, multi-model resolution, and
  bundle expansion; all 291 existing tests continue to pass

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
- Add --model-alias flag to 'inference set' for multi-model config
  (e.g. --model-alias gpt=openai/gpt-4 --model-alias claude=anthropic/claude-sonnet-4-20250514)
- Add gateway_inference_set_multi() handler in run.rs
- Update inference get/print to display multi-model entries
- Import InferenceModelEntry proto type in CLI
- Fix build_backend_url to always strip /v1 prefix for codex paths
- Add /v1/codex/* inference pattern for openai_responses protocol
- Fix backend tests to use /v1 endpoint suffix

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
…te guard

- Add timeout_secs parameter to gateway_inference_set_multi and pass
  through to SetClusterInferenceRequest
- Add print_timeout to multi-model output display
- Add timeout field to router test helper make_route (upstream added
  timeout to ResolvedRoute)
- Add system route guard: upsert_multi_model_route rejects
  route_name == sandbox-system with InvalidArgument
- Add timeout_secs: 0 to multi-model test ClusterInferenceConfig structs
- Add upsert_multi_model_route_rejects_system_route test

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
…election

When multiple routes share the same protocol (e.g. openai_responses),
select_route() only matched the model hint from the request body against
route aliases (names). If an agent sent the actual model ID (e.g.
"gpt-5.4") instead of the alias ("openai-codex"), the alias lookup
missed and the router fell back to the first protocol-compatible route,
which could be a completely different provider.

Add a second lookup pass that matches the hint against route.model before
falling back to blind protocol selection. Priority order:

  1. Alias match (route name == hint) — existing behavior
  2. Model ID match (route model == hint) — new
  3. First protocol-compatible route — existing fallback
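
The priority order can be sketched as follows (`Route` is an illustrative stand-in for the crate's ResolvedRoute; the field names are assumptions):

```rust
/// Illustrative stand-in for the router's resolved route entry.
struct Route {
    name: String,     // configured alias, e.g. "openai-codex"
    model: String,    // configured model ID, e.g. "gpt-5.4"
    protocol: String, // e.g. "openai_responses"
}

/// Three-pass selection: alias match, then model-ID match,
/// then first protocol-compatible route.
fn select_route<'a>(
    routes: &'a [Route],
    protocol: &str,
    hint: Option<&str>,
) -> Option<&'a Route> {
    if let Some(h) = hint {
        // 1. Alias match (route name == hint)
        if let Some(r) = routes.iter().find(|r| r.name == h) {
            return Some(r);
        }
        // 2. Model ID match (route model == hint)
        if let Some(r) = routes.iter().find(|r| r.model == h) {
            return Some(r);
        }
    }
    // 3. First protocol-compatible route
    routes.iter().find(|r| r.protocol == protocol)
}
```

With two routes sharing openai_responses, a hint of "gpt-5.4" now lands on the route whose configured model is "gpt-5.4" instead of falling through to whichever route happens to be first.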

Also add strip_version_prefix field to InferenceApiPattern so the codex
pattern (/v1/codex/*) can strip the /v1 proxy artifact before forwarding,
allowing backends whose base URL omits /v1 to receive the correct path.

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
@cosmicnet cosmicnet force-pushed the 203-multi-route-inference/lh branch from db606c1 to e36f9f5 on April 9, 2026 13:50
@copy-pr-bot

copy-pr-bot bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pimlock
Collaborator

pimlock commented Apr 11, 2026

Hi @cosmicnet, thanks for your patience.

I took some time today to properly review this. This PR touches a few distinct areas: Ollama support, Codex routing, and multi-model routing. For future changes, smaller PRs split by concern would be easier to review and land incrementally. It would also be helpful to see examples of the use-cases the change addresses (e.g. "without this change X is not possible, and here's how this change supports it"). Without that, it's hard to properly evaluate the solution.

> More broadly, my understanding is that NemoClaw/OpenShell is positioned as an enterprise-ready platform for running AI agents securely out of the box. In that context, multi-model access feels like a baseline expectation rather than an edge case. Agents routinely need a fast cheap model for simple tasks and a more capable one for complex reasoning, or a specialised model for specific domains. If each of those requires its own external proxy and network policy, that's a significant barrier to the "out of the box" experience. Maybe I'm misunderstanding the intended scope, but it's hard to see how single-model inference serves that use case long term.

> running AI agents securely out of the box

The path that goes through the network policy is the more secure path. inference.local is still in its early stage, with plans to have it backed by a model running within the cluster, but some prerequisites needed to support this are still being worked on.

Opening up that path to many more upstream endpoints would be considered less secure: every sandbox gets access to it, and there is less control over it.

Configuring inference via network policies gives more granular controls and better security. It's configured per-sandbox, so every sandbox can use different configuration.

> If each of those requires its own external proxy and network policy, that's a significant barrier to the "out of the box" experience. Maybe I'm misunderstanding the intended scope, but it's hard to see how single-model inference serves that use case long term.

There are some improvements to how providers work coming, which will make it easier to manage policies with providers (right now they are separate, but we are planning to include a policy with a provider, so adding a provider to a sandbox will automatically add a policy entry).

Comment on lines +47 to +52
InferenceApiPattern {
method: "POST".to_string(),
path_glob: "/v1/codex/*".to_string(),
protocol: "openai_responses".to_string(),
kind: "codex_responses".to_string(),
strip_version_prefix: true,
Collaborator


When would this happen? Do you have an example of something that fails and this is a fix for it?

I looked into how the codex is configured and when running inside of the sandbox with the inference.local configured as base_url (docs: https://developers.openai.com/codex/config-advanced, under "Custom model providers"), the request path is /responses.

My guess is that you're trying something like:

  • sandbox: codex running with base_url: inference.local/v1
  • inference is configured with chatgpt.com/backend-api/codex
  • sandbox: codex makes a request to inference.local/v1/responses
  • supervisor: the request gets parsed, v1/responses becomes the path and gets appended, producing chatgpt.com/backend-api/codex/v1/responses, which is incorrect.

Is this the problem? And the solution is to set the codex inside of the sandbox to inference.local/v1/codex, the v1 gets stripped, and the request goes to chatgpt.com/backend-api/codex/v1/responses?

I think this may be a bug in how we handle inference, it seems like when we intercept the request from the sandbox, we should either:

  • match on the path without the version prefix
  • match with the version prefix, but not include it in the request upstream

For example:

  • request to inference.local/v1/responses with inference configured for foo.com/bar should go to foo.com/bar/responses

Right now, such a thing is not possible, which feels like a real limitation. At least when dealing with OpenAI's client libraries, the base_url is expected to have /v1 as a suffix, so the request path is just /responses, not v1/responses.
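
The second option above (match with the version prefix, but drop it before forwarding upstream) could look roughly like this (`upstream_path` is a hypothetical helper, not the crate's actual build_backend_url):

```rust
/// Hypothetical sketch: join a backend base URL with the request path,
/// treating a leading /v1 as a proxy artifact to be dropped.
fn upstream_path(base: &str, request_path: &str) -> String {
    // Strip the /v1 prefix if present; otherwise forward the path as-is.
    let path = request_path.strip_prefix("/v1").unwrap_or(request_path);
    // Avoid a double slash when the base URL has a trailing '/'.
    format!("{}{}", base.trim_end_matches('/'), path)
}
```

Under this sketch, a request to inference.local/v1/responses with inference configured for foo.com/bar would be forwarded to foo.com/bar/responses, matching the example above.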


Please let me know what the use-case is for this change.

};

static OLLAMA_PROFILE: InferenceProviderProfile = InferenceProviderProfile {
provider_type: "ollama",
Collaborator


I looked closer into this, and to make this actually work you'd need to create an ollama provider as well. This code captures the Ollama-specific inference API out of the sandbox, but there is no upstream to go to.

openshell provider create --type ollama # doesn't exist

Without that, this code will capture Ollama-specific requests, but they will be rejected anyway.

Overall, the inference.local path isn't meant to support every potential inference request; those are still possible via network policies. It's very much possible today to route inference to Ollama from a sandbox: use host.openshell.internal, add it to the network policy, and then you can talk directly to that endpoint.

The original intent of the inference.local was to use a local model that is managed within the cluster. The work on this is still ongoing, but requires better GPU support to be implemented. We added the ability for inference.local to route to external models for cases where the local model could not be deployed (e.g. no GPU is available). I will take a look at our docs and make sure this is clear.
