Docs

Two routes. One header. Done.

The TES proxy is wire-compatible with the Anthropic Messages and OpenAI Chat Completions APIs. Point your existing client at it, pass X-TES-Token as a default header, and you're done. This page is the complete reference.

Quickstart

The integration footprint is two environment variables and one default header. Your existing call sites do not change.

.env
# Pick one base URL for the upstream you call.
ANTHROPIC_BASE_URL=https://llm.api.pentatonic.com
OPENAI_BASE_URL=https://llm.api.pentatonic.com/v1

# Your existing upstream key — TES forwards it unchanged.
ANTHROPIC_API_KEY=sk-ant-...
# or:
OPENAI_API_KEY=sk-...

# Your TES token — issued at /signup.
TES_API_KEY=tes_<clientId>_<random>

The Anthropic SDK appends /v1/messages to its base URL, so set ANTHROPIC_BASE_URL to https://llm.api.pentatonic.com (no path). The OpenAI SDK appends /chat/completions to its base URL, so set OPENAI_BASE_URL to https://llm.api.pentatonic.com/v1 (note the /v1).

Then call the model as normal

import Anthropic from "@anthropic-ai/sdk";

// Two env vars + one default header is the entire integration.
const client = new Anthropic({
  baseURL: process.env.ANTHROPIC_BASE_URL,        // https://llm.api.pentatonic.com
  defaultHeaders: { "X-TES-Token": process.env.TES_API_KEY },
});

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Refactor the auth middleware" }],
});

// Same response shape. Token count drops because TES injected the relevant
// context as a system preamble before the model ran.

Providers

The deployed proxy at llm.api.pentatonic.com exposes two routes today.

  • POST /v1/messages — forwards to api.anthropic.com. Use this when you're using the Anthropic SDK or anything that speaks Anthropic Messages.
  • POST /v1/chat/completions — forwards to api.openai.com. Use this when you're using the OpenAI SDK or any OpenAI-compatible client.
  • GET /health — no upstream; liveness check. Returns { ok: true }.

Other upstreams (MiniMax, vLLM, llama.cpp, your own inference) are supported on the Enterprise tier as dedicated upstream routing — the worker forwards /v1/chat/completions to a configured base URL per tenant. Talk to us. For self-hosted setups you can clone the SDK and point the worker's ANTHROPIC_BASE_URL / OPENAI_BASE_URL env vars at any compatible endpoint.

Authentication

Two credentials per request: your TES tenant token, and your upstream provider key. The proxy never forwards your TES token to the upstream, and never forwards your upstream key to TES.

Heads-up: API keys only — not subscription auth

Anthropic and OpenAI bind their subscription OAuth tokens (the kind Claude Code or Cursor mints from a Claude Pro / Team / Enterprise subscription) to the originating client. Forwarding those through any third-party proxy will return a 401 at the upstream's edge — by their design, not ours. Use a workspace API key for proxy traffic. Subscription users who want memory injection in Claude Code should use the Claude Code plugin instead — it goes direct to the upstream and only routes the memory-search calls through TES.

  • X-TES-Token: tes_<clientId>_<random>

    Identifies your tenant. Token state (active, revoked, rotated) is enforced by TES on every memory call — the proxy itself stores nothing. Get a token at /signup or rotate from your dashboard.

  • x-api-key: <your-anthropic-key>

    Required when calling /v1/messages. Forwarded to api.anthropic.com unchanged. Rate limits, billing, and errors remain your direct relationship with Anthropic.

  • Authorization: Bearer <your-openai-key>

    Required when calling /v1/chat/completions. Forwarded to api.openai.com unchanged.
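The split between the two credential schemes can be sketched as a small request builder. This is an illustrative sketch using the web-standard Request type (Node 18+), not part of any SDK; buildProxyRequest is a hypothetical helper name.

```typescript
// Hypothetical helper (not part of any SDK): build a Request for either
// proxy route with the credential header that upstream expects.
type Route = "messages" | "chat";

function buildProxyRequest(
  route: Route,
  body: object,
  upstreamKey: string,
  tesToken: string,
): Request {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
    "X-TES-Token": tesToken, // consumed by the proxy, never forwarded upstream
  };
  let path: string;
  if (route === "messages") {
    headers["x-api-key"] = upstreamKey;             // Anthropic scheme
    headers["anthropic-version"] = "2023-06-01";
    path = "/v1/messages";
  } else {
    headers["Authorization"] = `Bearer ${upstreamKey}`; // OpenAI scheme
    path = "/v1/chat/completions";
  }
  return new Request(`https://llm.api.pentatonic.com${path}`, {
    method: "POST",
    headers,
    body: JSON.stringify(body),
  });
}
```

The same two-header shape is what the SDK snippets above produce under the hood: one header for the proxy, one for the upstream.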

Request headers

  • X-TES-Token — required. Value: tes_<clientId>_<random>. Issued at /signup; identifies your tenant. TES authoritatively validates and authorises every memory call — the proxy itself is stateless.
  • x-api-key — required for /v1/messages. Value: your Anthropic key. Forwarded to api.anthropic.com unchanged; rate limits, billing, and errors are still your direct relationship with Anthropic.
  • Authorization — required for /v1/chat/completions. Value: Bearer <openai-key>. Forwarded to api.openai.com unchanged.
  • anthropic-version — optional. Value: 2023-06-01 (default). Whatever you set is forwarded to Anthropic; pin a specific schema version if your client requires it.
  • X-TES-Mode — optional. Value: passthrough. Skips retrieval entirely on this request; the proxy still emits a CHAT_TURN event so you have a comparison turn in your dashboard. Useful for A/B-testing the bill.

Response headers

In addition to whatever the upstream provider returns, the proxy attaches the following headers so you can audit what happened.

  • X-TES-Provider — anthropic | openai. Which upstream the proxy routed to. Always set.
  • X-TES-Memories-Injected — <integer>. How many memories the proxy folded into the system preamble for this request; 0 if retrieval found nothing useful or the request used passthrough mode.
  • X-TES-Skipped — <reason>. Present only when memory injection didn't happen; reason values are listed below. The upstream call itself is unaffected — your request still went through and you got a real model response.
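Reading the audit headers off a response is a one-liner per field. A minimal sketch using the web-standard Headers type; auditTesHeaders is a hypothetical helper, and the header names are the ones documented above.

```typescript
// Hypothetical helper: pull the proxy's audit headers off any response.
function auditTesHeaders(headers: Headers): {
  provider: string | null; // "anthropic" | "openai"; always set by the proxy
  injected: number;        // memories folded into the system preamble
  skipped: string | null;  // null unless injection was skipped
} {
  return {
    provider: headers.get("X-TES-Provider"),
    injected: Number(headers.get("X-TES-Memories-Injected") ?? 0),
    skipped: headers.get("X-TES-Skipped"),
  };
}
```

Wired into the quickstart example, this is how you'd confirm per-request that injection actually happened before attributing a token-count drop to TES.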

X-TES-Skipped reasons

  • passthrough_mode — you sent X-TES-Mode: passthrough.
  • no_user_message — the request had no user message to retrieve against (e.g. a system-only prompt).
  • tes_timeout — the TES memory search exceeded the per-request timeout (currently 10s while the SDK ships an ANN-prefilter; target ceiling 800ms).
  • tes_unreachable — network error reaching the TES tenant.
  • tes_http_<status> — TES returned a non-2xx HTTP status (e.g. tes_http_503).
  • tes_graphql:<reason> — TES returned a GraphQL error envelope; the reason is truncated.
  • tes_api_base_not_configured — internal configuration error. Reach out — this should never reach production.
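The reason values form a small grammar: five fixed strings plus two parameterised forms. If you want to branch on them (alert on tes_http_5xx, ignore passthrough_mode), a classifier sketch might look like this — classifySkip and SkipInfo are hypothetical names, not SDK exports.

```typescript
// Hypothetical classifier for X-TES-Skipped values, per the list above.
type SkipInfo =
  | { kind: "passthrough_mode" | "no_user_message" | "tes_timeout"
          | "tes_unreachable" | "tes_api_base_not_configured" }
  | { kind: "tes_http"; status: number }   // tes_http_<status>
  | { kind: "tes_graphql"; reason: string } // tes_graphql:<reason>
  | { kind: "unknown"; raw: string };

function classifySkip(raw: string): SkipInfo {
  const http = raw.match(/^tes_http_(\d+)$/);
  if (http) return { kind: "tes_http", status: Number(http[1]) };
  if (raw.startsWith("tes_graphql:")) {
    return { kind: "tes_graphql", reason: raw.slice("tes_graphql:".length) };
  }
  switch (raw) {
    case "passthrough_mode":
    case "no_user_message":
    case "tes_timeout":
    case "tes_unreachable":
    case "tes_api_base_not_configured":
      return { kind: raw };
    default:
      return { kind: "unknown", raw }; // future reasons won't break callers
  }
}
```

The unknown branch matters: treating unrecognised reasons as non-fatal keeps your handler forward-compatible if new reasons ship.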

Errors

Errors raised by the proxy itself (auth, validation, routing) use this shape:

{
  "error": {
    "message": "missing_tes_token"
  }
}

Errors raised by the upstream provider (model not found, rate limit, invalid key) are forwarded with the upstream's status code and body unchanged. The proxy never rewrites the upstream's error envelope.

  • 401 missing_tes_token — the X-TES-Token header was absent.
  • 401 invalid_tes_token_format — the token didn't start with tes_.
  • 401 invalid_tes_token_layout — the token didn't match tes_<clientId>_<random>.
  • 401 invalid_client_id_in_token — the clientId portion contained characters outside [a-zA-Z0-9-].
  • 400 missing_x_api_key (Anthropic upstream key required) — /v1/messages was called without x-api-key.
  • 400 missing_authorization_bearer (OpenAI upstream key required) — /v1/chat/completions was called without Authorization: Bearer …
  • 400 Invalid JSON body — the request body wasn't parseable JSON.
  • 404 Not Found — a path other than /v1/messages, /v1/chat/completions, or /health.
  • 405 Method Not Allowed — GET on a POST route, or any method other than POST/GET.
  • 502 Upstream provider unreachable: <message> — network error reaching api.anthropic.com or api.openai.com. An upstream's own non-2xx response is forwarded verbatim with the upstream's status.
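The four token-shape 401s above map to a simple sequence of checks. A hedged sketch of what an equivalent client-side pre-flight might look like — validateTesToken is hypothetical and not the proxy's actual implementation; the real layout rules may differ in detail.

```typescript
// Hypothetical mirror of the token-shape checks behind the 401s above.
// Returns the error message the proxy would send, or null if the shape is ok.
function validateTesToken(token: string | null): string | null {
  if (!token) return "missing_tes_token";
  if (!token.startsWith("tes_")) return "invalid_tes_token_format";
  const parts = token.split("_"); // tes_<clientId>_<random>
  if (parts.length < 3 || !parts[1] || !parts[2]) return "invalid_tes_token_layout";
  if (!/^[a-zA-Z0-9-]+$/.test(parts[1])) return "invalid_client_id_in_token";
  return null; // shape is valid; TES still decides active/revoked state
}
```

Note that passing these checks only means the shape is right — whether the token is active, revoked, or rotated is decided by TES on the memory call, not by the proxy.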

Passthrough mode

Send X-TES-Mode: passthrough on any request and the proxy will skip the retrieval step and forward your request to the upstream untouched. The CHAT_TURN event still fires so you have a comparison row in your dashboard.

curl https://llm.api.pentatonic.com/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "X-TES-Token: $TES_API_KEY" \
  -H "X-TES-Mode: passthrough" \
  -d '{ "model": "claude-sonnet-4-5", "max_tokens": 1024,
        "messages": [{"role":"user","content":"Hello"}] }'

# Response will include: X-TES-Skipped: passthrough_mode

Useful for A/B'ing your invoice — flip half your traffic to passthrough for a day, compare the two columns in the dashboard.
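"Flip half your traffic" works best when the split is deterministic, so a given session stays in one arm for the whole window. A sketch using a stable hash — shouldPassthrough is a hypothetical helper, not part of the SDK.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: deterministically assign a session to the passthrough
// arm, so each session sees consistent behaviour during the A/B window.
function shouldPassthrough(sessionId: string, fraction = 0.5): boolean {
  const digest = createHash("sha256").update(sessionId).digest();
  // First 4 bytes of the digest as an unsigned int, mapped to [0, 1).
  const bucket = digest.readUInt32BE(0) / 0x1_0000_0000;
  return bucket < fraction;
}

// Usage: add the header only for sessions in the passthrough arm.
const headers: Record<string, string> = { "X-TES-Token": process.env.TES_API_KEY ?? "" };
if (shouldPassthrough("session-123")) headers["X-TES-Mode"] = "passthrough";
```

Hashing the session id (rather than flipping a coin per request) keeps each conversation entirely in one column of the dashboard comparison.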

Streaming

Both routes support SSE streaming. Set stream: true in the request body exactly as you would calling the upstream provider directly. The proxy forwards bytes immediately — there is no buffering on the response path.

const stream = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Stream this" }],
  stream: true,
});

for await (const event of stream) {
  // identical to streaming directly from Anthropic
}

The CHAT_TURN ingest fires once after the upstream stream closes (collected in parallel via a TransformStream), so it never blocks the response.
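The collect-in-parallel trick mentioned above can be sketched as a pass-through TransformStream that copies chunks into a side buffer as they flow, then fires a callback on flush. This is a hypothetical illustration of the pattern (collectWhileForwarding is not an SDK export), shown here with web-standard streams available in Node 18+.

```typescript
// Hypothetical sketch of forward-while-collecting: chunks reach the client
// untouched and undelayed, while a side buffer accumulates them for the
// post-stream ingest step.
function collectWhileForwarding(
  upstream: ReadableStream<Uint8Array>,
  onDone: (body: string) => void,
): ReadableStream<Uint8Array> {
  const chunks: Uint8Array[] = [];
  const tap = new TransformStream<Uint8Array, Uint8Array>({
    transform(chunk, controller) {
      chunks.push(chunk);        // side-channel copy
      controller.enqueue(chunk); // forwarded immediately — no buffering
    },
    flush() {
      // Fires only after the upstream stream closes, so it never blocks
      // the response path.
      onDone(Buffer.concat(chunks).toString("utf8"));
    },
  });
  return upstream.pipeThrough(tap);
}
```

Because flush runs after the writable side closes, the ingest callback sees the complete body while the client has already received every byte.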

Limits & timeouts

  • Memory-search timeout

    Currently 10s in production while the SDK ships an ANN-prefilter-then-rerank refactor — cold-tenant searches against live TES can hit 5–10s on tenants without an HNSW index. Target ceiling is 800ms once that lands. If TES doesn't respond in time the proxy sets X-TES-Skipped: tes_timeout and forwards your request unmodified — your call still goes through, just without the preamble.

  • Retrieval shape

    Top-6 memories with similarity ≥ 0.55 are folded into the system preamble. Tunables (limit, minScore) ship in the SDK; the deployed proxy uses these defaults.

  • Tier quotas

    Free tier: 1M proxied input tokens / month. Pro: $0.50 per 1M proxied input tokens, $20/mo minimum. See /pricing for the full table.

  • Soft-fail contract

    Any TES-side error is non-fatal to your request. The proxy sets X-TES-Skipped with a reason and forwards the upstream call as if TES weren't there. We never break your request because TES is unhealthy.
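The soft-fail contract amounts to racing the memory search against a deadline and converting any failure into a skip reason instead of an exception. A hypothetical sketch — searchWithDeadline is illustrative, not the proxy's code:

```typescript
// Hypothetical sketch of the soft-fail contract: any TES-side failure
// becomes a skip reason, and the upstream call proceeds regardless.
type MemoryResult =
  | { memories: string[]; skipped?: undefined }
  | { memories: []; skipped: string };

async function searchWithDeadline(
  search: () => Promise<string[]>,
  timeoutMs: number,
): Promise<MemoryResult> {
  let timer: NodeJS.Timeout | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("tes_timeout")), timeoutMs);
  });
  try {
    const memories = await Promise.race([search(), deadline]);
    return { memories };
  } catch (err) {
    // Timeout, network error, or TES error: degrade to "no preamble",
    // never to a failed request.
    return { memories: [], skipped: err instanceof Error ? err.message : "tes_unreachable" };
  } finally {
    clearTimeout(timer); // don't leave the deadline timer running
  }
}
```

The caller then forwards the upstream request either way, setting X-TES-Skipped from the skipped field when it's present.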

JS SDK — wrap your client, get memory injection

If you'd rather not change a base URL, the JS SDK can wrap your existing Anthropic or OpenAI client and do retrieval + preamble injection in-process before forwarding to the upstream. Same behaviour as the proxy — same retrieval, same injection format, same fail-soft semantics — just running inside your app instead of at our edge.

Memory injection is on by default from SDK 0.5.8 — you don't write any memory code.

Wrap an Anthropic client
import Anthropic from "@anthropic-ai/sdk";
import { TESClient } from "@pentatonic-ai/ai-agent-sdk";

const tes = new TESClient({
  endpoint: "https://your-co.api.pentatonic.com",
  clientId: "your-co",
  apiKey:   process.env.TES_API_KEY,
});

// Wrap your existing client. Memory retrieval + injection happen
// automatically before each .messages.create call.
const anthropic = tes.wrap(new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY }));

const res = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  messages: [{ role: "user", content: "What did we decide about JWT expiry?" }],
});

// res is the unmodified Anthropic response shape — your code is unchanged.

Disable per-call with tes.wrap(client, { memory: false }). Override search knobs with memoryOpts: { limit, minScore, timeoutMs } — defaults match the proxy (6 results, 0.55 min score, 800ms timeout).

Why three integration paths if they all do the same thing?

The retrieval + injection logic lives in one shared module (@pentatonic-ai/ai-agent-sdk/memory), so the proxy, this SDK wrapper, and the OpenClaw plugin all produce the same preamble for the same query. Pick the path that fits your auth model — they're functionally identical.

Self-hosted SDK

If you don't want to use the hosted proxy, the same retrieval logic ships in the open-source SDK. Run a local memory stack (PostgreSQL + pgvector + Ollama in Docker), and your agent gets persistent memory and the same auto-retrieval behaviour without ever calling out to Pentatonic.

Local memory stack
# Set up local memory — PostgreSQL + pgvector + Ollama in Docker.
# Pulls embedding + chat models, writes config, no account required.
npx @pentatonic-ai/ai-agent-sdk memory

Or use the SDK directly to wrap your existing OpenAI / Anthropic / Workers AI client and emit observability events without any proxy in the loop. See /sdk for the wrap API and provider matrix.

Claude Code plugin

If you use Claude Code, install the plugin to get persistent memory and automatic session capture without touching any code. The plugin works against either the hosted TES platform or a local memory stack you run yourself.

In Claude Code
/plugin marketplace add Pentatonic-Ltd/ai-agent-sdk
/plugin install tes-memory@pentatonic-ai
/tes-memory:tes-setup

Setup writes config to ~/.claude/tes-memory.local.md (or ~/.claude-pentatonic/tes-memory.local.md on aliased installs). Hooks fire on SessionStart, UserPromptSubmit, PostToolUse, Stop, and SessionEnd — every turn is searchable. Full reference at /sdk/claude-code.

FAQ

  • Does TES change my response shape?

    No. The proxy returns the upstream provider's bytes unmodified. Buffered responses are forwarded as a single body; streaming responses are forwarded chunk-by-chunk via SSE with no reordering.

  • What happens if TES is unhealthy?

    Your request is forwarded to the upstream provider unmodified, with X-TES-Skipped describing the reason. We never break a customer request because TES is down — you get the upstream behaviour you were already paying for.

  • Where does the injected context come from?

    From your tenant's memory layer. The proxy calls semanticSearchMemories on TES with the last user message, and folds the top hits into the system preamble. Cold-start tenants get nothing useful injected until you've populated the memory — see the SDK below for ways to feed memory in.

  • Can I use the proxy with MiniMax, vLLM, or another OpenAI-compatible upstream?

    Today the deployed proxy at llm.api.pentatonic.com forwards /v1/messages to api.anthropic.com and /v1/chat/completions to api.openai.com. Other upstreams are on the Enterprise tier as a dedicated routing config — talk to us. The proxy is OSS, so for self-hosted use you can point ANTHROPIC_BASE_URL / OPENAI_BASE_URL env vars at any compatible endpoint.

  • Does the proxy log my prompts?

    The last user message is sent to TES for the memory search. The full request body is forwarded to your upstream — TES doesn't store the upstream response body in plain form. Ingest writes a CHAT_TURN event for analytics with the user message and assistant text against your tenant.

  • Can I rotate my TES token?

    Yes — from your dashboard. Old tokens are invalidated immediately because TES is the sole authority on token state, not the proxy.

  • Why retrieve and inject instead of caching?

    Prompt caching only helps when the same exact prefix repeats. Retrieval-then-inject works on every turn — including the first turn of a new session — because it reaches into your durable memory layer rather than relying on prefix locality.

  • Can I use my Claude Pro / Team / Cursor / Codex subscription key?

    No — and the failure is on the upstream's side, not ours. Anthropic and OpenAI bind subscription OAuth tokens to the device that minted them. Routing them through any third-party proxy returns a 401 at the upstream's edge. Use a workspace API key, or install the Claude Code plugin which goes direct to the upstream and only sends the memory search to TES.

Ship it

Two env vars away from a smaller bill.

Free tier covers 1M proxied input tokens per month. No credit card.