AI agent memory infrastructure
that saves tokens and keeps context

Your agent starts cold every session — wasting tokens. With TES, your agent starts ready. Built on persistent memory across every source you use.

Start now:

npx @pentatonic-ai/ai-agent-sdk login

Why TES

Lower token bill. Same model. Same answer.

Three reasons the proxy approach earns its keep over going direct to your provider.

01

Same SDK, same model

Keep your existing client, your model, your prompt code. The integration is one env var or one Claude Code plugin — no rewrite, no lock-in.

02

Memory injected up front

Before the model runs, TES pulls the context it would otherwise have to re-derive — and folds it into the system preamble. The answer is in front of the model.

03

You can see exactly what we did

Audited methodology. Reproducible benchmark. Per-workload split published. Compare two invoices to see the gap.

How it works

One env var. Same SDK. Half the bill.

You already pay for an LLM. You probably also pay for the same context to be re-derived every turn. TES intercepts the request, fetches the context the model would have asked for, injects it as a preamble, and forwards the call.

01

Intercept

Point your existing Anthropic, OpenAI, or MiniMax client at llm.api.pentatonic.com. We receive the request unchanged — same SDK, same model, same response shape.

# .env: point the base URL at TES, add your token
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_BASE_URL=https://llm.api.pentatonic.com
TES_API_KEY=tes_<clientId>_<random>

// Wire it on your client (one extra line)
import Anthropic from "@anthropic-ai/sdk";

// The SDK reads ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL from the environment.
const client = new Anthropic({
  defaultHeaders: { "X-TES-Token": process.env.TES_API_KEY },
});

await client.messages.create({
  model: "claude-sonnet-4",
  max_tokens: 1024,
  messages: [{ role: "user", content: "..." }],
});
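The OpenAI-flavored wiring is analogous. A minimal sketch, assuming the same llm.api.pentatonic.com host also serves the OpenAI Chat Completions route (listed on the Free tier) and accepts the same X-TES-Token header; the model name is only a placeholder and the exact base path may differ from what's shown here:

// Sketch: OpenAI SDK pointed at TES (assumptions noted above).
import OpenAI from "openai";

const openaiClient = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://llm.api.pentatonic.com", // assumption: check docs for the OpenAI-compatible base path
  defaultHeaders: { "X-TES-Token": process.env.TES_API_KEY },
});

await openaiClient.chat.completions.create({
  model: "gpt-4o", // placeholder model
  messages: [{ role: "user", content: "..." }],
});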
02

Retrieve & inject

TES retrieves the context your model would have re-derived this turn (files, prior memory, tool results) and injects it as a preamble. The answer is now in front of the model instead of being fetched through follow-up tool calls.

// What TES does on the wire (you don't write this)
const context = await retrieve({
  session: req.session,
  files: req.referenced_files,
  memory: req.user_memory,
});

req.messages = [preamble(context), ...req.messages];
// e.g. on a code-lookup turn from the published benchmark:
// → 10,050 tokens of tool/file work avoided
// → 17 follow-up tool calls eliminated
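For intuition only, here is one plausible shape of that injected preamble. The helper below is hypothetical; the real format is internal to TES and not documented on this page. It just makes "the answer is in front of the model" concrete:

// Hypothetical sketch of preamble(context): a message prepended to the turn
// so the model sees the retrieved material instead of tool-calling for it.
type RetrievedContext = { files: string[]; memory: string[]; toolResults: string[] };

function preamble(context: RetrievedContext) {
  const body = [
    "Context already retrieved for this turn (do not re-fetch):",
    ...context.files,
    ...context.memory,
    ...context.toolResults,
  ].join("\n");
  return { role: "user" as const, content: body };
}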
03

Forward & return

We forward to your chosen upstream — Anthropic, OpenAI, MiniMax, or your own endpoint — and return the response untouched. Your code reads the same shape it always has. The bill goes down.

// Response is identical to direct upstream call
{
  id: "msg_01...",
  model: "claude-sonnet-4",
  content: [{ type: "text", text: "..." }],
  usage: { input_tokens: 450 /* 10,500 with TES off */ },
  // overall benchmark median: 27.2% reduction.
  // code-lookup category median: 95.6%. See /benchmarks.
}
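If you want to verify the numbers in your own code, here is a minimal sketch reusing the client wiring from step 01. The baseline figure is an assumption you supply yourself, for example from your pre-TES logs or from a run with the X-TES-Mode: passthrough header listed on the Pro tier:

// Sketch: compare this turn's usage against a baseline you recorded.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  defaultHeaders: { "X-TES-Token": process.env.TES_API_KEY },
});

const response = await client.messages.create({
  model: "claude-sonnet-4",
  max_tokens: 1024,
  messages: [{ role: "user", content: "..." }],
});

// Baseline: what the same turn cost without TES (your own logs or a
// passthrough run). 10,500 is the example figure from the response above.
const baselineInputTokens = 10_500;
const withTes = response.usage.input_tokens; // 450 in the example above

const reduction = 1 - withTes / baselineInputTokens;
console.log(`input-token reduction: ${(reduction * 100).toFixed(1)}%`); // ≈ 95.7% for the example figures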

Use cases

Who this is for

Any team paying a meaningful LLM bill where the model is re-deriving context it's already seen. The workloads below are where retrieval-before-answer cuts token spend materially.

Stop paying to re-read the same repo every turn

Codex, Claude Code, Cursor, and every in-house coding agent burn most of their input tokens re-grepping and re-reading files the model already saw 30 seconds ago. TES caches that context the first time and injects it as a preamble on the next turn. 95% median input-token reduction on code-bound workloads in our benchmark. One env var. Same SDK. Same model. Same answer.

  • 95% median input-token reduction on code queries
  • Works with Anthropic, OpenAI, MiniMax, and any OpenAI-compatible endpoint
  • Per-dev Pro tier from $20/mo
Learn more →

Pricing

Per-token. Lower than upstream. Audit by comparing two invoices.

We charge a per-token rate that's lower than going direct to Anthropic or OpenAI, and your total token count drops because the preamble compresses every turn. You win twice. Both wins are visible on the bill.

Free

$0

Solo devs, weekend projects, and Claude Code / Cursor / Codex subscribers using the memory plugin.

  • 1M proxied input tokens / month
  • Anthropic Messages + OpenAI Chat Completions
  • Claude Code plugin — unlimited memory + sessions
  • Bring-your-own retrieval source (URLs, files)
  • Token-savings dashboard + request log
  • Soft-fail to upstream on TES error
  • Discord support
Get API key

Pro

$0.50 per 1M input tokens

$20/mo minimum

Devs and small teams paying $50–$500/mo to LLM providers.

  • Per-token meter from $0 — minimum covers light use
  • Same routes + per-tenant memory layer
  • Per-project breakdown, exportable CSV
  • X-TES-Mode: passthrough for A/B comparisons
  • Email support, 1-business-day
Start with $20/mo

Enterprise

$0.30 per 1M at volume

$1k/mo minimum, annual commit

Companies with a six- or seven-figure annual AI bill.

  • Volume per-token rate, predictable ceiling
  • Custom upstreams (MiniMax, vLLM, llama.cpp, your own inference)
  • Custom KGs / vector stores / private corpora
  • SLA, dedicated regions
  • Per-team breakdown, SSO, audit log
  • Slack channel, dedicated engineer
Talk to sales

Reference: Anthropic lists Sonnet input at $3 / 1M tokens direct. Our Pro per-token rate is $0.50 / 1M — about a sixth of that, before the compression saving. You can audit the gap by putting our invoice next to your direct upstream invoice.
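A back-of-envelope sketch of the combined effect, using only numbers published on this page (the Pro rate, the Anthropic list price, and the overall benchmark median). The monthly volume is a made-up example, output tokens and the $20/mo minimum are ignored, and actual compression varies by workload:

// Back-of-envelope: rate gap combined with median compression (input tokens only).
const directRate = 3.0;          // $ per 1M input tokens, Sonnet direct (list price)
const tesRate = 0.5;             // $ per 1M input tokens, Pro tier
const medianCompression = 0.272; // overall benchmark median input-token reduction

const monthlyInputTokensM = 100; // hypothetical example: 100M input tokens / month before TES

const directCost = monthlyInputTokensM * directRate;                     // $300
const tesCost = monthlyInputTokensM * (1 - medianCompression) * tesRate; // ≈ $36.40

console.log(`direct: $${directCost}, via TES: $${tesCost.toFixed(2)}`);
// ≈ 88% lower input-token spend at the median; code-heavy workloads compress more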

Get started

Lower your AI bill without changing your code

Free tier covers 1M proxied input tokens per month. No credit card. Pro from $20/mo, enterprise with annual commit.