Urgent PHI-221 Revised — Alaska-aware

Alaska Multi-Provider Routing

Wiring Alaska’s three model lanes to real backends — Studio local, Bedrock confidential, cloud fallback.
Alaska’s runtime-router.ts already has lane routing, fallback chains, and per-lane env vars. This is a config + sidecar job, not a rewrite.

1 What Alaska Already Has

runtime-router.ts — 700 lines, fully built
Alaska’s router supports three model lanes, each with independent base URL, API key, model ID, and provider type. Fallback cascading is already implemented: Cloud Premium → Cloud Balanced → Local Fast.

The router emits run.started (routing decision), run.fallback (lane failed, switching), and run.completed (metrics + citations). The UI already shows which lane served the response.
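These lifecycle events lend themselves to a discriminated union on the consumer side. A minimal TypeScript sketch: the three event names come from the router, but the payload fields and the `badgeFor` helper are hypothetical illustrations, not Alaska's actual types.

```typescript
// Hypothetical payload shapes for the router's lifecycle events.
// Only the event names (run.started / run.fallback / run.completed)
// come from runtime-router.ts; the fields are illustrative.
type RouterEvent =
  | { type: "run.started"; lane: string; model: string }   // routing decision
  | { type: "run.fallback"; from: string; to: string }     // lane failed, switching
  | { type: "run.completed"; lane: string; tokens: number }; // metrics + citations

// Turn an event into the kind of badge text the thread header shows.
function badgeFor(ev: RouterEvent): string {
  switch (ev.type) {
    case "run.started":
      return ev.lane === "local-fast" ? "Local" : "Cloud";
    case "run.fallback":
      return `Fallback: ${ev.to}`;
    case "run.completed":
      return `Served by ${ev.lane}`;
  }
}
```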

Today’s problem: All three lanes point to the same oMLX endpoint. One backend goes down → all lanes fail → Alaska is dead.
| Lane | Today | Proposed | Env var |
|---|---|---|---|
| Local Fast | oMLX / thurin-v1.1 | Same — no change | ALASKA_OPENAI_BASE_URL_LOCAL_FAST=http://192.168.87.243:8000/v1 |
| Cloud Balanced | oMLX / thurin-v1.1 (duplicate) | Bedrock Qwen3-Next-80B | ALASKA_OPENAI_BASE_URL_CLOUD_BALANCED=http://localhost:4000/v1 (LiteLLM sidecar → Bedrock) |
| Cloud Premium | oMLX / thurin-v1.1 (duplicate) | Bedrock Claude Sonnet 4.6 | ALASKA_OPENAI_BASE_URL_CLOUD_PREMIUM=http://localhost:4000/v1 (LiteLLM sidecar → Bedrock) |

2 Lane → Backend Mapping

| Lane | Backend | Model | Quality | Latency | Cost/1M tok (in/out) | Use Case |
|---|---|---|---|---|---|---|
| Local Fast | Studio oMLX | Thurin v1.1 (80B SFT) | 4.25 | ~32s | $0 | Primary for everything. PE-tuned. |
| Cloud Balanced | AWS Bedrock | Qwen3-Next-80B-A3B | ~4.2* | ~5-10s | $0.15 / $1.20 | Fallback when Studio is down. Same arch, faster. |
| Cloud Premium | AWS Bedrock | Claude Sonnet 4.6 | 4.65 | ~8s | $3 / $15 | Heavy reasoning. Memo lane. Second opinions. |

\* Bedrock Qwen not yet evaluated. Same base as Thurin v1.1 but without PE fine-tuning. Needs a v5 eval run.
Fallback chain: Cloud Premium → Cloud Balanced → Local Fast (already coded in runtime-router.ts)
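The chain reduces to an ordered retry loop. The sketch below illustrates the behavior already coded in runtime-router.ts rather than reproducing it; `completeWithFallback` and its signature are stand-ins.

```typescript
// Illustrative sketch of the cascade: try lanes in order, surface a
// fallback notification between attempts, rethrow if every lane fails.
type Lane = "cloud-premium" | "cloud-balanced" | "local-fast";

const CHAIN: Lane[] = ["cloud-premium", "cloud-balanced", "local-fast"];

async function completeWithFallback(
  call: (lane: Lane) => Promise<string>,
  onFallback: (from: Lane, to: Lane) => void = () => {},
): Promise<{ lane: Lane; text: string }> {
  let lastError: unknown;
  for (let i = 0; i < CHAIN.length; i++) {
    try {
      return { lane: CHAIN[i], text: await call(CHAIN[i]) };
    } catch (err) {
      lastError = err;
      // Analogous to the router's run.fallback event.
      if (i + 1 < CHAIN.length) onFallback(CHAIN[i], CHAIN[i + 1]);
    }
  }
  throw lastError; // every lane failed
}
```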

3 The One Missing Piece: Bedrock Auth

Why we can’t just set env vars and go
Alaska’s router speaks OpenAI API format — it sends POST /v1/chat/completions with a Bearer token. Bedrock uses AWS SigV4 request signing, not API keys. The request format is also slightly different.

Solution: LiteLLM proxy sidecar
A lightweight Python process that accepts OpenAI-format requests on localhost:4000 and translates them to Bedrock’s SigV4 format. Alaska’s router sends to LiteLLM like any other OpenAI endpoint — zero changes to Alaska code.
```yaml
# docker-compose addition (or standalone process)
litellm:
  image: ghcr.io/berriai/litellm:main-latest
  ports: ["4000:4000"]
  environment:
    AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
    AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    AWS_DEFAULT_REGION: us-east-1
  command: >
    --model bedrock/us.anthropic.claude-sonnet-4-6-20250514
    --model bedrock/us.amazon.qwen3-next-80b-a3b-v1:0
    --port 4000
```
Alternative: If running Alaska locally (not Docker), just pip install litellm && litellm --model bedrock/... as a background process. Same result, no Docker needed.
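Because LiteLLM exposes a standard OpenAI-compatible surface, a smoke test from Alaska's side is an ordinary chat-completions POST. A hedged sketch: `buildChatRequest` is a hypothetical helper, and the endpoint and placeholder key simply mirror the proposed .env values.

```typescript
// Build the OpenAI-format request Alaska's router would send; LiteLLM
// behind localhost:4000 translates it to Bedrock SigV4.
function buildChatRequest(model: string, prompt: string) {
  return {
    url: "http://127.0.0.1:4000/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Placeholder key, matching the .env below (assumption: no
        // master key is configured on the sidecar).
        Authorization: "Bearer sk-litellm",
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Usage (requires the sidecar to be running):
// const { url, init } = buildChatRequest("bedrock/us.amazon.qwen3-next-80b-a3b-v1:0", "ping");
// const res = await fetch(url, init);
```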

4 Alaska .env Changes

```
# ─── Local Fast (unchanged) ───
ALASKA_OPENAI_BASE_URL_LOCAL_FAST=http://192.168.87.243:8000/v1
ALASKA_OPENAI_MODEL_LOCAL_FAST=/Users/nb/.cache/mlx-models/thurin-v1.1
ALASKA_OPENAI_PROVIDER_LOCAL_FAST=MLX

# ─── Cloud Balanced (NEW: Bedrock Qwen via LiteLLM) ───
ALASKA_OPENAI_BASE_URL_CLOUD_BALANCED=http://127.0.0.1:4000/v1
ALASKA_OPENAI_API_KEY_CLOUD_BALANCED=sk-litellm
ALASKA_OPENAI_MODEL_CLOUD_BALANCED=bedrock/us.amazon.qwen3-next-80b-a3b-v1:0
ALASKA_OPENAI_PROVIDER_CLOUD_BALANCED=Bedrock

# ─── Cloud Premium (NEW: Bedrock Claude via LiteLLM) ───
ALASKA_OPENAI_BASE_URL_CLOUD_PREMIUM=http://127.0.0.1:4000/v1
ALASKA_OPENAI_API_KEY_CLOUD_PREMIUM=sk-litellm
ALASKA_OPENAI_MODEL_CLOUD_PREMIUM=bedrock/us.anthropic.claude-sonnet-4-6-20250514
ALASKA_OPENAI_PROVIDER_CLOUD_PREMIUM=Bedrock
```
That’s it. Alaska’s router reads these env vars, routes to LiteLLM on :4000, LiteLLM translates to Bedrock SigV4. Zero code changes to Alaska.
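For illustration, the per-lane env vars could be assembled into a config object along these lines. `laneConfig` is hypothetical; the real parsing lives inside Alaska's router.

```typescript
// Assemble one lane's settings from ALASKA_OPENAI_* env vars.
// Variable names follow this doc's convention; the helper is a sketch.
interface LaneConfig {
  baseUrl: string;
  apiKey?: string; // optional: Local Fast has no API key
  model: string;
  provider: string;
}

function laneConfig(env: Record<string, string | undefined>, lane: string): LaneConfig {
  const suffix = lane.toUpperCase().replace(/-/g, "_"); // "cloud-balanced" -> "CLOUD_BALANCED"
  const need = (key: string): string => {
    const v = env[`ALASKA_OPENAI_${key}_${suffix}`];
    if (!v) throw new Error(`missing ALASKA_OPENAI_${key}_${suffix}`);
    return v;
  };
  return {
    baseUrl: need("BASE_URL"),
    apiKey: env[`ALASKA_OPENAI_API_KEY_${suffix}`],
    model: need("MODEL"),
    provider: need("PROVIDER"),
  };
}
```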

5 What the User Sees

Normal Operation (Studio healthy)
User picks any lane — all work. Local Fast is the default for all PE lanes (Search, VDR, DD, etc.).

Cloud Premium available via model picker for Memo lane or when frontier quality needed. User consciously opts into it.

Thread header shows Local badge. Zero cost.
Studio Down (automatic fallback)
Local Fast fails → router emits run.fallback event → retries on Cloud Balanced (Bedrock Qwen).

User sees a brief Fallback: Cloud badge in the thread header. The response comes from Bedrock instead; it may lack the PE fine-tuning nuance, but it works functionally.

No interruption. No error screen.

6 Alaska vs. Workstream Context Proxy

The old workstream_context_proxy.py (290KB) handled routing for Open WebUI. Alaska replaces Open WebUI entirely — it has its own routing in runtime-router.ts.

What still matters from the old proxy:
PE context injection — Alaska handles this natively via knowledge-context.ts + project-notes-store.ts. Already built.
Budget tracking — Not yet in Alaska. Could be a lightweight middleware in the chat API route. Phase 2 work.
Studio file linking — Already in Alaska via studio-files.ts.

The old proxy becomes unnecessary once Alaska’s routing is wired. It was glue between Open WebUI and the inference backend. Alaska IS the UI + router.
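A sketch of what the Phase 2 budget-tracking middleware could look like, assuming the per-token rates from the lane table in section 2. `SpendTracker` and every name here are hypothetical, not existing Alaska code.

```typescript
// $ per 1M input/output tokens, from the lane table in section 2.
const RATES: Record<string, { in: number; out: number }> = {
  "bedrock/us.amazon.qwen3-next-80b-a3b-v1:0": { in: 0.15, out: 1.2 },
  "bedrock/us.anthropic.claude-sonnet-4-6-20250514": { in: 3, out: 15 },
};

// Lightweight accumulator for the chat API route: record each run's
// token usage and fire one alert when the daily cap is crossed.
class SpendTracker {
  private totalUsd = 0;
  constructor(
    private dailyCapUsd: number,
    private onAlert: (spentUsd: number) => void, // e.g. a Telegram notification
  ) {}

  record(model: string, inputTokens: number, outputTokens: number): number {
    const rate = RATES[model] ?? { in: 0, out: 0 }; // local lanes cost $0
    const usd = (inputTokens * rate.in + outputTokens * rate.out) / 1_000_000;
    const before = this.totalUsd;
    this.totalUsd += usd;
    if (before < this.dailyCapUsd && this.totalUsd >= this.dailyCapUsd) {
      this.onAlert(this.totalUsd);
    }
    return usd;
  }

  get spent(): number {
    return this.totalUsd;
  }
}
```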

7 Implementation Phases

01 Wire Bedrock (~1-2 hours)
  • Create AWS account + IAM user
  • Enable Bedrock model access (Qwen, Claude)
  • Install LiteLLM sidecar on Mini
  • Update Alaska .env with lane configs
  • Test: kill oMLX, confirm Bedrock fallback works
  • Test: manually pick Cloud Premium, confirm Claude responds
02 Eval + Polish (~2-3 hours)
  • Run v5 eval on Bedrock Qwen3-Next-80B
  • Compare: vanilla Bedrock vs fine-tuned Studio
  • Add cost tracking middleware to chat route
  • Daily spend alert (Telegram notification)
  • Tune fallback timeout thresholds
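Tuning the fallback timeout mostly means deciding how long a lane may hang before it counts as failed and run.fallback fires. One common pattern is racing the lane call against a deadline; `withTimeout` and the durations are illustrative, not the router's actual mechanism.

```typescript
// Race a lane call against a per-lane deadline. A timeout rejects like
// any other lane failure, so the existing fallback cascade takes over.
function withTimeout<T>(p: Promise<T>, ms: number, lane: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${lane} timed out after ${ms}ms`)),
      ms,
    );
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

// Example thresholds (assumptions): the ~32s local lane might get ~45s,
// while the ~5-10s Bedrock lanes might get ~20s before falling back.
```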
03 Bedrock RAG (Optional, ~4 hours; only if Bedrock quality is close)
  • S3 bucket for deal documents
  • Bedrock Knowledge Base + S3 Vectors
  • Wire RAG into Alaska’s context injection
  • Eval: Bedrock + RAG vs fine-tuned Thurin
  • If competitive: enables fully cloud-based mode

8 Cost Impact

| Scenario | Studio | Bedrock | LiteLLM | Total/mo |
|---|---|---|---|---|
| Studio healthy, no cloud use | $0 | $0 | $0 (idle) | $0 |
| Occasional Premium lane use | $0 | ~$5-15 | $0 | ~$5-15 |
| Studio down 2 days + Premium use | $0 | ~$15-30 | $0 | ~$15-30 |
LiteLLM is open-source, runs locally, zero cost. AWS has no baseline fee for on-demand Bedrock — pure pay-per-token.

Decisions Needed

1. AWS account: Existing AWS, or fresh account for isolation?
2. Daily cap: Max Bedrock spend before alert? ($10? $25?)
3. Default lane: Should PE lanes (VDR, DD, Memo) default to Local Fast or Cloud Premium?
4. Phase scope: Phase 1 only, or 1+2?
5. Eval first: Want to eval Bedrock Qwen before wiring, or wire first and eval after?