feat(inference): multi-route proxy with alias-based model routing#618

Open
cosmicnet wants to merge 5 commits into NVIDIA:main from cosmicnet:203-multi-route-inference/lh

Conversation

@cosmicnet

@cosmicnet cosmicnet commented Mar 25, 2026

Summary

Adds multi-route inference proxy support, allowing sandboxed agents to reach multiple LLM providers (OpenAI, Anthropic, NVIDIA, Ollama) through a single inference.local endpoint. Agents select a backend by setting the model field to an alias name. Also adds Ollama native API support and Codex URL pattern matching.

Related Issue

Closes #203

Changes

  • Proto: Add InferenceModelEntry message (alias, provider_name, model_id); add models repeated field to set/get request/response messages
  • Server: upsert_multi_model_route() validates and stores multiple alias→provider mappings; resolves each entry into a separate ResolvedRoute at bundle time
  • Router: select_route() implements alias-first, protocol-fallback selection; proxy_with_candidates/proxy_with_candidates_streaming accept optional model_hint
  • Sandbox proxy: Extracts model field from request body as model_hint for route selection
  • Sandbox L7: Add /v1/codex/*, /api/chat, /api/tags, /api/show inference patterns
  • Backend: build_backend_url() always strips /v1 prefix to support both versioned and non-versioned endpoints (e.g. Codex)
  • Core: Add OLLAMA_PROFILE provider profile with native + OpenAI-compat protocols
  • CLI: --model-alias ALIAS=PROVIDER/MODEL flag (repeatable, conflicts with --provider/--model)
  • Architecture docs: Updated inference-routing.md with all new sections
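
For illustration, the ALIAS=PROVIDER/MODEL format accepted by --model-alias could be parsed along these lines (a hedged sketch; `ModelEntry` and the error messages are hypothetical, not the actual CLI types):

```rust
/// Illustrative parsed form of one `--model-alias` value.
/// NOTE: hypothetical type, not the CLI's actual representation.
#[derive(Debug, PartialEq)]
struct ModelEntry {
    alias: String,
    provider: String,
    model_id: String,
}

/// Parse "ALIAS=PROVIDER/MODEL", e.g. "gpt=openai/gpt-4".
fn parse_model_alias(arg: &str) -> Result<ModelEntry, String> {
    // Split off the alias at the first '='.
    let (alias, rest) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected ALIAS=PROVIDER/MODEL, got '{arg}'"))?;
    // Split provider and model at the first '/'; anything after it
    // stays in the model ID.
    let (provider, model_id) = rest
        .split_once('/')
        .ok_or_else(|| format!("expected PROVIDER/MODEL after '=', got '{rest}'"))?;
    if alias.is_empty() || provider.is_empty() || model_id.is_empty() {
        return Err(format!("empty component in '{arg}'"));
    }
    Ok(ModelEntry {
        alias: alias.to_string(),
        provider: provider.to_string(),
        model_id: model_id.to_string(),
    })
}
```

So `--model-alias gpt=openai/gpt-4` would yield alias "gpt", provider "openai", model ID "gpt-4", with the flag repeatable once per entry.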

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@cosmicnet cosmicnet requested a review from a team as a code owner March 25, 2026 23:52
Copilot AI review requested due to automatic review settings March 25, 2026 23:52
@github-actions

github-actions bot commented Mar 25, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@cosmicnet
Author

I have read the DCO document and I hereby sign the DCO.


Copilot AI left a comment


Pull request overview

Adds multi-route inference proxying so sandboxes can route inference.local requests to multiple LLM backends by using a model alias in the request body.

Changes:

  • Extends the inference proto + gateway storage to support multiple (alias, provider_name, model_id) entries per route.
  • Adds alias-first route selection in the router and passes a model_hint extracted from sandbox request bodies.
  • Expands sandbox L7 inference patterns and adds an Ollama provider profile + endpoint validation probe.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

proto/inference.proto Adds InferenceModelEntry and models fields for multi-model inference config.
crates/openshell-server/src/inference.rs Implements multi-model upsert + resolves each alias into separate ResolvedRoute entries.
crates/openshell-sandbox/src/proxy.rs Extracts model from JSON body and forwards it as model_hint to the router.
crates/openshell-sandbox/src/l7/inference.rs Adds Codex + Ollama native API patterns and tests.
crates/openshell-router/src/lib.rs Adds select_route() and extends proxy APIs to accept model_hint.
crates/openshell-router/src/backend.rs Adds Ollama validation probe and changes backend URL construction behavior.
crates/openshell-router/tests/backend_integration.rs Updates tests for new proxy function signatures and /v1 endpoint expectations.
crates/openshell-core/src/inference.rs Adds OLLAMA_PROFILE (protocols/base URL/config keys).
crates/openshell-cli/src/run.rs Adds gateway_inference_set_multi() to send multi-model configs.
crates/openshell-cli/src/main.rs Adds --model-alias ALIAS=PROVIDER/MODEL CLI flag and dispatch.
architecture/inference-routing.md Documents alias-based route selection, new patterns, and multi-model route behavior.


@cosmicnet cosmicnet force-pushed the 203-multi-route-inference/lh branch from af1748b to ab71175 on March 26, 2026 00:36
@pimlock pimlock self-assigned this Mar 30, 2026
@cosmicnet cosmicnet force-pushed the 203-multi-route-inference/lh branch from ab71175 to d887f04 on April 1, 2026 19:44
@cosmicnet
Author

@pimlock Happy to address any feedback or questions. Let me know if you'd like anything restructured or split differently.

@johntmyers
Collaborator

The use of inference.local was to provide, at a minimum, a default model for all sandboxes to have access to (if configured). We're cautious about bloating the embedded sandbox inference router to support arbitrary upstream providers and models, and about turning it into a larger-scale model router that would require ongoing maintenance. We're still determining what level of routing support we should have on our roadmap.

I am curious: if you need this level of routing support, have you considered setting up a dedicated proxy/router that is accessible outside the sandbox and just configuring access to it with network policies? This is a typical pattern several of our users follow.

Comment on lines +59 to +65
const OLLAMA_PROTOCOLS: &[&str] = &[
"ollama_chat",
"ollama_model_discovery",
"openai_chat_completions",
"openai_completions",
"model_discovery",
];
Collaborator


Is there a reason for using ollama inference protocol, rather than OpenAI one? Is there something extra that ollama supports that cannot be accessed through OpenAI one?

Author


Ollama exposes native endpoints (/api/chat, /api/tags, /api/show) that provide capabilities not available through its OpenAI-compatible layer:

  • /api/tags lists all locally available models (no OpenAI equivalent)
  • /api/show returns model metadata: parameters, template, license, quantization info
  • /api/chat supports Ollama-specific options like num_ctx, num_predict, temperature variants, and raw mode

The OLLAMA_PROTOCOLS list includes both native and OpenAI-compatible protocols (openai_chat_completions, openai_completions, model_discovery), so agents can use either interface. The native protocols are there so tools that use the Ollama client library directly (which targets /api/*) work through inference.local without needing to switch to the OpenAI-compat paths.

If you'd prefer to keep it simpler and only support Ollama through its OpenAI-compat layer, I can drop the native patterns and the ollama_chat/ollama_model_discovery protocols. The tradeoff is that model discovery (/api/tags) and agent tooling that uses the Ollama SDK directly wouldn't work.
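
As a sketch, the native-endpoint dispatch described above amounts to a small method+path table (`InferencePattern` and `match_pattern` are illustrative names, not the crate's actual `InferenceApiPattern` API):

```rust
/// Hypothetical pattern entry; the real crate type has more fields.
struct InferencePattern {
    method: &'static str,
    path: &'static str,
    protocol: &'static str,
}

// Native Ollama endpoints mapped to their inference protocols,
// mirroring the commit description (POST /api/chat, GET /api/tags,
// POST /api/show).
const OLLAMA_PATTERNS: &[InferencePattern] = &[
    InferencePattern { method: "POST", path: "/api/chat", protocol: "ollama_chat" },
    InferencePattern { method: "GET",  path: "/api/tags", protocol: "ollama_model_discovery" },
    InferencePattern { method: "POST", path: "/api/show", protocol: "ollama_model_discovery" },
];

/// Return the protocol for a matching method+path, or None
/// (e.g. GET /api/chat is rejected because only POST is listed).
fn match_pattern(method: &str, path: &str) -> Option<&'static str> {
    OLLAMA_PATTERNS
        .iter()
        .find(|p| p.method == method && p.path == path)
        .map(|p| p.protocol)
}
```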

@cosmicnet
Author

> The use of inference.local was to provide, at a minimum, a default model for all sandboxes to have access to (if configured). We're cautious about bloating the embedded sandbox inference router to support arbitrary upstream providers and models, and about turning it into a larger-scale model router that would require ongoing maintenance. We're still determining what level of routing support we should have on our roadmap.
>
> I am curious: if you need this level of routing support, have you considered setting up a dedicated proxy/router that is accessible outside the sandbox and just configuring access to it with network policies? This is a typical pattern several of our users follow.

Thanks for the feedback. This PR follows the approach outlined in #203 (option B: single record with repeated entries, alias-first selection with protocol fallback, model hint from the request body). I appreciate that was closed off citing the replacement issue #207, but that covers a different concern. System vs user inference is about who the route serves, not how many backends it can reach. This PR already accommodates that split through the system route guard and separate sandbox.inference.local endpoint.

On the external proxy: it's a valid pattern, but the overhead feels disproportionate here. This is a static alias lookup table. There's no load balancing, retries, rate limiting, or discovery. The maintenance surface is one function, one proto field, and one server method. For users with 2-3 providers, standing up a separate proxy service is a lot of ceremony for a lookup table.

More broadly, my understanding is that NemoClaw/OpenShell is positioned as an enterprise-ready platform for running AI agents securely out of the box. In that context, multi-model access feels like a baseline expectation rather than an edge case. Agents routinely need a fast cheap model for simple tasks and a more capable one for complex reasoning, or a specialised model for specific domains. If each of those requires its own external proxy and network policy, that's a significant barrier to the "out of the box" experience. Maybe I'm misunderstanding the intended scope, but it's hard to see how single-model inference serves that use case long term.

If the team has decided this doesn't belong in the embedded proxy, I can scope this down to just the Ollama native API support and Codex pattern matching (commits 1-2) and drop the multi-model routing. Happy to go either way.

@cosmicnet cosmicnet force-pushed the 203-multi-route-inference/lh branch from d887f04 to db606c1 on April 7, 2026 20:03
@cosmicnet
Author

Latest update is mostly a rebase onto the current main, plus one follow-up fix.

While applying the earlier Copilot feedback around URL handling, I ended up introducing a Codex-specific regression. The generic /v1 cleanup was fine for the normal OpenAI-style endpoints, but Codex needs slightly different path handling. This update fixes that by making the Codex rewrite explicit and scoped to the /v1/codex/* pattern instead of changing the behaviour for every provider.

I also tightened multi-model route selection so the request model hint can match either the configured alias or the configured model ID before falling back to the first protocol-compatible route. That avoids Codex and other openai_responses requests being routed to the wrong backend when multiple routes share the same protocol.

The branch has also been rebased onto the latest main, and I reran the relevant Rust test suites after the rebase.

Add pattern detection, provider profile, and validation probe for
Ollama's native /api/chat, /api/tags, and /api/show endpoints.

Proxy changes (l7/inference.rs):
- POST /api/chat -> ollama_chat protocol
- GET /api/tags -> ollama_model_discovery protocol
- POST /api/show -> ollama_model_discovery protocol

Provider profile (openshell-core/inference.rs):
- New 'ollama' provider type with default endpoint
  http://host.openshell.internal:11434
- Supports ollama_chat, ollama_model_discovery, and OpenAI-compatible
  protocols (openai_chat_completions, openai_completions, model_discovery)
- Credential lookup via OLLAMA_API_KEY, base URL via OLLAMA_BASE_URL

Validation (backend.rs):
- Ollama validation probe sends minimal /api/chat request with stream:false

Tests: 4 new tests for pattern detection (ollama chat, tags, show,
and GET /api/chat rejection).

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
- Proto: add InferenceModelEntry message with alias/provider/model fields;
  add repeated models field to ClusterInferenceConfig, Set/Get request/response
- Server: add upsert_multi_model_route() for storing multiple model entries
  under a single route slot; update resolve_route_by_name() to expand
  multi-model configs into per-alias ResolvedRoute entries
- Router: add select_route() with alias-first, protocol-fallback strategy;
  add model_hint parameter to proxy_with_candidates() variants
- Sandbox proxy: extract model field from JSON body as routing hint
- Tests: 7 new tests covering select_route, multi-model resolution, and
  bundle expansion; all 291 existing tests continue to pass

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
- Add --model-alias flag to 'inference set' for multi-model config
  (e.g. --model-alias gpt=openai/gpt-4 --model-alias claude=anthropic/claude-sonnet-4-20250514)
- Add gateway_inference_set_multi() handler in run.rs
- Update inference get/print to display multi-model entries
- Import InferenceModelEntry proto type in CLI
- Fix build_backend_url to always strip /v1 prefix for codex paths
- Add /v1/codex/* inference pattern for openai_responses protocol
- Fix backend tests to use /v1 endpoint suffix

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
…te guard

- Add timeout_secs parameter to gateway_inference_set_multi and pass
  through to SetClusterInferenceRequest
- Add print_timeout to multi-model output display
- Add timeout field to router test helper make_route (upstream added
  timeout to ResolvedRoute)
- Add system route guard: upsert_multi_model_route rejects
  route_name == sandbox-system with InvalidArgument
- Add timeout_secs: 0 to multi-model test ClusterInferenceConfig structs
- Add upsert_multi_model_route_rejects_system_route test

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
…election

When multiple routes share the same protocol (e.g. openai_responses),
select_route() only matched the model hint from the request body against
route aliases (names). If an agent sent the actual model ID (e.g.
"gpt-5.4") instead of the alias ("openai-codex"), the alias lookup
missed and the router fell back to the first protocol-compatible route,
which could be a completely different provider.

Add a second lookup pass that matches the hint against route.model before
falling back to blind protocol selection. Priority order:

  1. Alias match (route name == hint) — existing behavior
  2. Model ID match (route model == hint) — new
  3. First protocol-compatible route — existing fallback
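
The priority order can be sketched as follows (`Route` is an illustrative stand-in for the crate's ResolvedRoute; the field names are assumptions):

```rust
/// Illustrative stand-in for the router's resolved route entry.
struct Route {
    name: String,     // configured alias, e.g. "openai-codex"
    model: String,    // configured model ID, e.g. "gpt-5.4"
    protocol: String, // e.g. "openai_responses"
}

/// Three-pass selection: alias match, then model-ID match,
/// then first protocol-compatible route.
fn select_route<'a>(
    routes: &'a [Route],
    protocol: &str,
    hint: Option<&str>,
) -> Option<&'a Route> {
    if let Some(h) = hint {
        // 1. Alias match (route name == hint)
        if let Some(r) = routes.iter().find(|r| r.name == h) {
            return Some(r);
        }
        // 2. Model ID match (route model == hint)
        if let Some(r) = routes.iter().find(|r| r.model == h) {
            return Some(r);
        }
    }
    // 3. First protocol-compatible route
    routes.iter().find(|r| r.protocol == protocol)
}
```

With two routes sharing openai_responses, a hint of "gpt-5.4" now lands on the route whose configured model is "gpt-5.4" instead of falling through to whichever route happens to be first.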

Also add strip_version_prefix field to InferenceApiPattern so the codex
pattern (/v1/codex/*) can strip the /v1 proxy artifact before forwarding,
allowing backends whose base URL omits /v1 to receive the correct path.

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
@cosmicnet cosmicnet force-pushed the 203-multi-route-inference/lh branch from db606c1 to e36f9f5 on April 9, 2026 13:50
@copy-pr-bot

copy-pr-bot bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pimlock
Collaborator

pimlock commented Apr 11, 2026

Hi @cosmicnet, thanks for your patience.

I took some time today to properly review this. This PR touches a few distinct areas: Ollama support, Codex routing, and multi-model routing. For future changes, smaller PRs split by concern would be easier to review and land incrementally. It would also be helpful to see examples of the use-cases the change addresses (e.g. "without this change X is not possible, and here's how this change supports it"). Without that, it's hard to properly evaluate the solution.

> More broadly, my understanding is that NemoClaw/OpenShell is positioned as an enterprise-ready platform for running AI agents securely out of the box. In that context, multi-model access feels like a baseline expectation rather than an edge case. Agents routinely need a fast cheap model for simple tasks and a more capable one for complex reasoning, or a specialised model for specific domains. If each of those requires its own external proxy and network policy, that's a significant barrier to the "out of the box" experience. Maybe I'm misunderstanding the intended scope, but it's hard to see how single-model inference serves that use case long term.

> running AI agents securely out of the box

The path that goes through the network policy is the more secure path. inference.local is still in its early stage, with plans to have it backed by a model running within the cluster, but some prerequisites needed to support this are still being worked on.

Opening up that path to many more upstream endpoints would be considered less secure: every sandbox gets access to it, and there is less control over it.

Configuring inference via network policies gives more granular controls and better security. It's configured per-sandbox, so every sandbox can use different configuration.

> If each of those requires its own external proxy and network policy, that's a significant barrier to the "out of the box" experience. Maybe I'm misunderstanding the intended scope, but it's hard to see how single-model inference serves that use case long term.

There are some improvements to how providers work coming, which will make it easier to manage policies with providers (right now they are separate, but we are planning to include a policy with a provider, so adding a provider to a sandbox will automatically add a policy entry).

Comment on lines +47 to +52
InferenceApiPattern {
method: "POST".to_string(),
path_glob: "/v1/codex/*".to_string(),
protocol: "openai_responses".to_string(),
kind: "codex_responses".to_string(),
strip_version_prefix: true,
Collaborator


When would this happen? Do you have an example of something that fails and this is a fix for it?

I looked into how the codex is configured and when running inside of the sandbox with the inference.local configured as base_url (docs: https://developers.openai.com/codex/config-advanced, under "Custom model providers"), the request path is /responses.

My guess is that you're trying something like:

  • sandbox: codex running with base_url: inference.local/v1
  • inference is configured with chatgpt.com/backend-api/codex
  • sandbox: codex makes a request to inference.local/v1/responses
  • supervisor: the request gets parsed, v1/responses becomes the path and gets appended, producing chatgpt.com/backend-api/codex/v1/responses, which is incorrect.

Is this the problem? And the solution is to set the codex inside of the sandbox to inference.local/v1/codex, the v1 gets stripped, and the request goes to chatgpt.com/backend-api/codex/v1/responses?

I think this may be a bug in how we handle inference, it seems like when we intercept the request from the sandbox, we should either:

  • match on the path without the version prefix
  • match with the version prefix, but not include it in the request upstream

For example:

  • request to inference.local/v1/responses with inference configured for foo.com/bar should go to foo.com/bar/responses

Right now, such a thing is not possible, which feels like a real limitation. At least when dealing with OpenAI's client libraries, the base_url is expected to have /v1 as a suffix, so the request path is just /responses, not v1/responses.
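
The second option above (match with the version prefix, but drop it before forwarding upstream) could look roughly like this (`upstream_path` is a hypothetical helper, not the crate's actual build_backend_url):

```rust
/// Hypothetical sketch: join a backend base URL with the request path,
/// treating a leading /v1 as a proxy artifact to be dropped.
fn upstream_path(base: &str, request_path: &str) -> String {
    // Strip the /v1 prefix if present; otherwise forward the path as-is.
    let path = request_path.strip_prefix("/v1").unwrap_or(request_path);
    // Avoid a double slash when the base URL has a trailing '/'.
    format!("{}{}", base.trim_end_matches('/'), path)
}
```

Under this sketch, a request to inference.local/v1/responses with inference configured for foo.com/bar would be forwarded to foo.com/bar/responses, matching the example above.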


Please let me know what the use-case is for this change.

};

static OLLAMA_PROFILE: InferenceProviderProfile = InferenceProviderProfile {
provider_type: "ollama",
Collaborator


I looked closer into this, and to make this actually work you'd need to create an ollama provider as well. This code captures the Ollama-specific inference API out of the sandbox, but there is no upstream to go to.

openshell provider create --type ollama # doesn't exist

Without that, this code will capture Ollama-specific requests, but they will be rejected anyway.

Overall, the inference.local path isn't meant to support every potential inference request; those are still possible via network policies. It's very much possible today to route inference to Ollama from a sandbox: use host.openshell.internal, add it to the network policy, and then you can talk directly to that endpoint.

The original intent of the inference.local was to use a local model that is managed within the cluster. The work on this is still ongoing, but requires better GPU support to be implemented. We added the ability for inference.local to route to external models for cases where the local model could not be deployed (e.g. no GPU is available). I will take a look at our docs and make sure this is clear.
