Urgent PHI-221 Revised — Alaska-aware

Alaska Multi-Provider Routing

Wiring Alaska’s three model lanes to real backends — Studio local, Bedrock confidential, cloud fallback.
Alaska’s runtime-router.ts already has lane routing, fallback chains, and per-lane env vars. This is a config + sidecar job, not a rewrite.

1 What Alaska Already Has

runtime-router.ts — 700 lines, fully built
Alaska’s router supports three model lanes, each with independent base URL, API key, model ID, and provider type. Fallback cascading is already implemented: Cloud Premium → Cloud Balanced → Local Fast.

The router emits run.started (routing decision), run.fallback (lane failed, switching), and run.completed (metrics + citations). The UI already shows which lane served the response.
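These lifecycle events lend themselves to a discriminated union on the consumer side. A minimal TypeScript sketch: the three event names come from the router, but the payload fields and the `badgeFor` helper are hypothetical illustrations, not Alaska's actual types.

```typescript
// Hypothetical payload shapes for the router's lifecycle events.
// Only the event names (run.started / run.fallback / run.completed)
// come from runtime-router.ts; the fields are illustrative.
type RouterEvent =
  | { type: "run.started"; lane: string; model: string }   // routing decision
  | { type: "run.fallback"; from: string; to: string }     // lane failed, switching
  | { type: "run.completed"; lane: string; tokens: number }; // metrics + citations

// Turn an event into the kind of badge text the thread header shows.
function badgeFor(ev: RouterEvent): string {
  switch (ev.type) {
    case "run.started":
      return ev.lane === "local-fast" ? "Local" : "Cloud";
    case "run.fallback":
      return `Fallback: ${ev.to}`;
    case "run.completed":
      return `Served by ${ev.lane}`;
  }
}
```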

Today’s problem: All three lanes point to the same oMLX endpoint. One backend goes down → all lanes fail → Alaska is dead.
| Lane | Today | Proposed | Env var |
|---|---|---|---|
| Local Fast | oMLX / thurin-v1.1 | Same — no change | ALASKA_OPENAI_BASE_URL_LOCAL_FAST=http://192.168.87.243:8000/v1 |
| Cloud Balanced | oMLX / thurin-v1.1 (duplicate) | Bedrock Qwen3-Next-80B | ALASKA_OPENAI_BASE_URL_CLOUD_BALANCED=http://localhost:4000/v1 (LiteLLM sidecar → Bedrock) |
| Cloud Premium | oMLX / thurin-v1.1 (duplicate) | Bedrock Claude Sonnet 4.6 | ALASKA_OPENAI_BASE_URL_CLOUD_PREMIUM=http://localhost:4000/v1 (LiteLLM sidecar → Bedrock) |

2 Lane → Backend Mapping

| Lane | Backend | Model | Quality | Latency | Cost/1M tok (in/out) | Use Case |
|---|---|---|---|---|---|---|
| Local Fast | Studio oMLX | Thurin v1.1 (80B SFT) | 4.25 | ~32s | $0 | Primary for everything. PE-tuned. |
| Cloud Balanced | AWS Bedrock | Qwen3-Next-80B-A3B | ~4.2* | ~5-10s | $0.15 / $1.20 | Fallback when Studio is down. Same arch, faster. |
| Cloud Premium | AWS Bedrock | Claude Sonnet 4.6 | 4.65 | ~8s | $3 / $15 | Heavy reasoning. Memo lane. Second opinions. |

\* Bedrock Qwen not yet evaluated. Same base as Thurin v1.1 but without PE fine-tuning. Needs a v5 eval run.
Fallback chain: Cloud Premium → Cloud Balanced → Local Fast (already coded in runtime-router.ts)
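The chain reduces to an ordered retry loop. The sketch below illustrates the behavior already coded in runtime-router.ts rather than reproducing it; `completeWithFallback` and its signature are stand-ins.

```typescript
// Illustrative sketch of the cascade: try lanes in order, surface a
// fallback notification between attempts, rethrow if every lane fails.
type Lane = "cloud-premium" | "cloud-balanced" | "local-fast";

const CHAIN: Lane[] = ["cloud-premium", "cloud-balanced", "local-fast"];

async function completeWithFallback(
  call: (lane: Lane) => Promise<string>,
  onFallback: (from: Lane, to: Lane) => void = () => {},
): Promise<{ lane: Lane; text: string }> {
  let lastError: unknown;
  for (let i = 0; i < CHAIN.length; i++) {
    try {
      return { lane: CHAIN[i], text: await call(CHAIN[i]) };
    } catch (err) {
      lastError = err;
      // Analogous to the router's run.fallback event.
      if (i + 1 < CHAIN.length) onFallback(CHAIN[i], CHAIN[i + 1]);
    }
  }
  throw lastError; // every lane failed
}
```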

3 The One Missing Piece: Bedrock Auth

Why we can’t just set env vars and go
Alaska’s router speaks OpenAI API format — it sends POST /v1/chat/completions with a Bearer token. Bedrock uses AWS SigV4 request signing, not API keys. The request format is also slightly different.

Solution: LiteLLM proxy sidecar
A lightweight Python process that accepts OpenAI-format requests on localhost:4000 and translates them to Bedrock’s SigV4 format. Alaska’s router sends to LiteLLM like any other OpenAI endpoint — zero changes to Alaska code.
```yaml
# docker-compose addition (or standalone process)
litellm:
  image: ghcr.io/berriai/litellm:main-latest
  ports: ["4000:4000"]
  environment:
    AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
    AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    AWS_DEFAULT_REGION: us-east-1
  command: >
    --model bedrock/us.anthropic.claude-sonnet-4-6-20250514
    --model bedrock/us.amazon.qwen3-next-80b-a3b-v1:0
    --port 4000
```
Alternative: If running Alaska locally (not Docker), just pip install litellm && litellm --model bedrock/... as a background process. Same result, no Docker needed.
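Because LiteLLM exposes a standard OpenAI-compatible surface, a smoke test from Alaska's side is an ordinary chat-completions POST. A hedged sketch: `buildChatRequest` is a hypothetical helper, and the endpoint and placeholder key simply mirror the proposed .env values.

```typescript
// Build the OpenAI-format request Alaska's router would send; LiteLLM
// behind localhost:4000 translates it to Bedrock SigV4.
function buildChatRequest(model: string, prompt: string) {
  return {
    url: "http://127.0.0.1:4000/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Placeholder key, matching the .env below (assumption: no
        // master key is configured on the sidecar).
        Authorization: "Bearer sk-litellm",
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Usage (requires the sidecar to be running):
// const { url, init } = buildChatRequest("bedrock/us.amazon.qwen3-next-80b-a3b-v1:0", "ping");
// const res = await fetch(url, init);
```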

4 Alaska .env Changes

```
# ─── Local Fast (unchanged) ───
ALASKA_OPENAI_BASE_URL_LOCAL_FAST=http://192.168.87.243:8000/v1
ALASKA_OPENAI_MODEL_LOCAL_FAST=/Users/nb/.cache/mlx-models/thurin-v1.1
ALASKA_OPENAI_PROVIDER_LOCAL_FAST=MLX

# ─── Cloud Balanced (NEW: Bedrock Qwen via LiteLLM) ───
ALASKA_OPENAI_BASE_URL_CLOUD_BALANCED=http://127.0.0.1:4000/v1
ALASKA_OPENAI_API_KEY_CLOUD_BALANCED=sk-litellm
ALASKA_OPENAI_MODEL_CLOUD_BALANCED=bedrock/us.amazon.qwen3-next-80b-a3b-v1:0
ALASKA_OPENAI_PROVIDER_CLOUD_BALANCED=Bedrock

# ─── Cloud Premium (NEW: Bedrock Claude via LiteLLM) ───
ALASKA_OPENAI_BASE_URL_CLOUD_PREMIUM=http://127.0.0.1:4000/v1
ALASKA_OPENAI_API_KEY_CLOUD_PREMIUM=sk-litellm
ALASKA_OPENAI_MODEL_CLOUD_PREMIUM=bedrock/us.anthropic.claude-sonnet-4-6-20250514
ALASKA_OPENAI_PROVIDER_CLOUD_PREMIUM=Bedrock
```
That’s it. Alaska’s router reads these env vars, routes to LiteLLM on :4000, LiteLLM translates to Bedrock SigV4. Zero code changes to Alaska.
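For illustration, the per-lane env vars could be assembled into a config object along these lines. `laneConfig` is hypothetical; the real parsing lives inside Alaska's router.

```typescript
// Assemble one lane's settings from ALASKA_OPENAI_* env vars.
// Variable names follow this doc's convention; the helper is a sketch.
interface LaneConfig {
  baseUrl: string;
  apiKey?: string; // optional: Local Fast has no API key
  model: string;
  provider: string;
}

function laneConfig(env: Record<string, string | undefined>, lane: string): LaneConfig {
  const suffix = lane.toUpperCase().replace(/-/g, "_"); // "cloud-balanced" -> "CLOUD_BALANCED"
  const need = (key: string): string => {
    const v = env[`ALASKA_OPENAI_${key}_${suffix}`];
    if (!v) throw new Error(`missing ALASKA_OPENAI_${key}_${suffix}`);
    return v;
  };
  return {
    baseUrl: need("BASE_URL"),
    apiKey: env[`ALASKA_OPENAI_API_KEY_${suffix}`],
    model: need("MODEL"),
    provider: need("PROVIDER"),
  };
}
```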

5 What the User Sees

Normal Operation (Studio healthy)
User picks any lane — all work. Local Fast is the default for all PE lanes (Search, VDR, DD, etc.).

Cloud Premium available via model picker for Memo lane or when frontier quality needed. User consciously opts into it.

Thread header shows Local badge. Zero cost.
Studio Down (automatic fallback)
Local Fast fails → router emits run.fallback event → retries on Cloud Balanced (Bedrock Qwen).

User sees a brief Fallback: Cloud badge in the thread header. The response comes from Bedrock instead; it may lack the PE fine-tuning nuance, but it works functionally.

No interruption. No error screen.

6 Alaska vs. Workstream Context Proxy

The old workstream_context_proxy.py (290KB) handled routing for Open WebUI. Alaska replaces Open WebUI entirely — it has its own routing in runtime-router.ts.

What still matters from the old proxy:
PE context injection — Alaska handles this natively via knowledge-context.ts + project-notes-store.ts. Already built.
Budget tracking — Not yet in Alaska. Could be a lightweight middleware in the chat API route. Phase 2 work.
Studio file linking — Already in Alaska via studio-files.ts.

The old proxy becomes unnecessary once Alaska’s routing is wired. It was glue between Open WebUI and the inference backend. Alaska IS the UI + router.
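A sketch of what the Phase 2 budget-tracking middleware could look like, assuming the per-token rates from the lane table in section 2. `SpendTracker` and every name here are hypothetical, not existing Alaska code.

```typescript
// $ per 1M input/output tokens, from the lane table in section 2.
const RATES: Record<string, { in: number; out: number }> = {
  "bedrock/us.amazon.qwen3-next-80b-a3b-v1:0": { in: 0.15, out: 1.2 },
  "bedrock/us.anthropic.claude-sonnet-4-6-20250514": { in: 3, out: 15 },
};

// Lightweight accumulator for the chat API route: record each run's
// token usage and fire one alert when the daily cap is crossed.
class SpendTracker {
  private totalUsd = 0;
  constructor(
    private dailyCapUsd: number,
    private onAlert: (spentUsd: number) => void, // e.g. a Telegram notification
  ) {}

  record(model: string, inputTokens: number, outputTokens: number): number {
    const rate = RATES[model] ?? { in: 0, out: 0 }; // local lanes cost $0
    const usd = (inputTokens * rate.in + outputTokens * rate.out) / 1_000_000;
    const before = this.totalUsd;
    this.totalUsd += usd;
    if (before < this.dailyCapUsd && this.totalUsd >= this.dailyCapUsd) {
      this.onAlert(this.totalUsd);
    }
    return usd;
  }

  get spent(): number {
    return this.totalUsd;
  }
}
```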

7 Implementation Phases

01 Wire Bedrock (~1-2 hours)
  • Create AWS account + IAM user
  • Enable Bedrock model access (Qwen, Claude)
  • Install LiteLLM sidecar on Mini
  • Update Alaska .env with lane configs
  • Test: kill oMLX, confirm Bedrock fallback works
  • Test: manually pick Cloud Premium, confirm Claude responds
02 Eval + Polish (~2-3 hours)
  • Run v5 eval on Bedrock Qwen3-Next-80B
  • Compare: vanilla Bedrock vs fine-tuned Studio
  • Add cost tracking middleware to chat route
  • Daily spend alert (Telegram notification)
  • Tune fallback timeout thresholds
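Tuning the fallback timeout mostly means deciding how long a lane may hang before it counts as failed and run.fallback fires. One common pattern is racing the lane call against a deadline; `withTimeout` and the durations are illustrative, not the router's actual mechanism.

```typescript
// Race a lane call against a per-lane deadline. A timeout rejects like
// any other lane failure, so the existing fallback cascade takes over.
function withTimeout<T>(p: Promise<T>, ms: number, lane: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${lane} timed out after ${ms}ms`)),
      ms,
    );
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

// Example thresholds (assumptions): the ~32s local lane might get ~45s,
// while the ~5-10s Bedrock lanes might get ~20s before falling back.
```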
03 Bedrock RAG (Optional, ~4 hours; only if Bedrock quality is close)
  • S3 bucket for deal documents
  • Bedrock Knowledge Base + S3 Vectors
  • Wire RAG into Alaska’s context injection
  • Eval: Bedrock + RAG vs fine-tuned Thurin
  • If competitive: enables fully cloud-based mode

8 Cost Impact

| Scenario | Studio | Bedrock | LiteLLM | Total/mo |
|---|---|---|---|---|
| Studio healthy, no cloud use | $0 | $0 | $0 (idle) | $0 |
| Occasional Premium lane use | $0 | ~$5-15 | $0 | ~$5-15 |
| Studio down 2 days + Premium use | $0 | ~$15-30 | $0 | ~$15-30 |
LiteLLM is open-source, runs locally, zero cost. AWS has no baseline fee for on-demand Bedrock — pure pay-per-token.

Decisions Needed

1. AWS account: Existing AWS, or fresh account for isolation?
2. Daily cap: Max Bedrock spend before alert? ($10? $25?)
3. Default lane: Should PE lanes (VDR, DD, Memo) default to Local Fast or Cloud Premium?
4. Phase scope: Phase 1 only, or 1+2?
5. Eval first: Want to eval Bedrock Qwen before wiring, or wire first and eval after?