AI agent memory infrastructure
that saves tokens and keeps context
Your agent starts cold every session — wasting tokens. With TES, your agent starts ready. Built on persistent memory across every source you use.
Start now with a single command:
npx @pentatonic-ai/ai-agent-sdk login
Why TES
Lower token bill. Same model. Same answer.
Three reasons the proxy approach earns its keep over the status quo.
01
Same SDK, same model
Keep your existing client, your model, your prompt code. The integration is one env var or one Claude Code plugin — no rewrite, no lock-in.
02
Memory injected up front
Before the model runs, TES pulls the context it would otherwise have to re-derive — and folds it into the system preamble. The answer is in front of the model.
03
You can see exactly what we did
Audited methodology. Reproducible benchmark. Per-workload split published. Compare two invoices to see the gap.
How it works
One env var. Same SDK. Half the bill.
You already pay for an LLM. You probably also pay for the same context to be re-derived every turn. TES intercepts the request, fetches the context the model would have asked for, injects it as a preamble, and forwards the call.
Intercept
Point your existing Anthropic, OpenAI, or MiniMax client at llm.api.pentatonic.com. We receive the request unchanged — same SDK, same model, same response shape.
# .env: point base URL at TES, add your token
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_BASE_URL=https://llm.api.pentatonic.com
TES_API_KEY=tes_<clientId>_<random>
// Wire it on your client (one extra line)
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({
defaultHeaders: { "X-TES-Token": process.env.TES_API_KEY },
});
await client.messages.create({ model, messages });
Retrieve & inject
TES retrieves the context your model would have re-derived this turn — files, prior memory, tool results — and injects it as a preamble. The answer is now in front of the model instead of being tool-called for.
// What TES does on the wire (you don't write this)
const context = await retrieve({
session: req.session,
files: req.referenced_files,
memory: req.user_memory,
});
req.messages = [preamble(context), ...req.messages];
// e.g. on a code-lookup turn from the published benchmark:
// → 10,050 tokens of tool/file work avoided
// → 17 follow-up tool calls eliminated
Forward & return
We forward to your chosen upstream — Anthropic, OpenAI, MiniMax, or your own endpoint — and return the response untouched. Your code reads the same shape it always has. The bill goes down.
// Response is identical to direct upstream call
{
id: "msg_01...",
model: "claude-sonnet-4",
content: [{ type: "text", text: "..." }],
usage: { input_tokens: 450 /* was 10,500 */ },
// overall benchmark median: 27.2% reduction.
// code-lookup category median: 95.6%. See /benchmarks.
}
Use cases
Who this is for
Any team paying a meaningful LLM bill where the model is re-deriving context it's already seen. Three workloads where retrieval-before-answer cuts token spend materially.
Stop paying to re-read the same repo every turn
Codex, Claude Code, Cursor, and every in-house coding agent burn most of their input tokens re-grepping and re-reading files the model already saw 30 seconds ago. TES caches that context the first time and injects it as a preamble on the next turn. 95% median input-token reduction on code-bound workloads in our benchmark. One env var. Same SDK. Same model. Same answer.
- 95% median input-token reduction on code queries
- Works with Anthropic, OpenAI, MiniMax, and any OpenAI-compatible endpoint
- Per-dev Pro tier from $20/mo
Two ways in
Pick the path that matches how you pay
Same memory layer underneath. Different transport.
Per-token API key
Anthropic or OpenAI workspace key. Drop-in proxy — change one env var. Bill drops directly with every compressed turn.
ANTHROPIC_BASE_URL=https://llm.api.pentatonic.com
Claude Code, Cursor, or Codex on a subscription
Hooks-based plugin runs locally. Same retrieve-and-inject as the proxy — but never touches your auth path. Your subscription terms unchanged.
/plugin install tes-memory@pentatonic-ai
Pricing
Per-token. Lower than upstream. Audit by comparing two invoices.
We charge a per-token rate lower than going direct to Anthropic or OpenAI, and your total token count drops because the preamble compresses every turn. The customer wins twice, and both wins are visible on the bill.
Free
Solo devs, weekend projects, and Claude Code / Cursor / Codex subscribers using the memory plugin.
- 1M proxied input tokens / month
- Anthropic Messages + OpenAI Chat Completions
- Claude Code plugin — unlimited memory + sessions
- Bring-your-own retrieval source (URLs, files)
- Token-savings dashboard + request log
- Soft-fail to upstream on TES error
- Discord support
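The soft-fail guarantee above can be sketched in a few lines. This is illustrative, not the proxy's actual source; `withSoftFail`, `Msg`, and `retrieveContext` are hypothetical names:

```typescript
// Illustrative sketch of "soft-fail to upstream": if the memory lookup
// throws, the original messages are forwarded untouched, so a TES-side
// error can cost you the savings but never the request.
type Msg = { role: "system" | "user" | "assistant"; content: string };

export async function withSoftFail(
  messages: Msg[],
  retrieveContext: () => Promise<string>,
): Promise<Msg[]> {
  try {
    const context = await retrieveContext();
    // Happy path: inject retrieved context as a system preamble.
    return [{ role: "system", content: context }, ...messages];
  } catch {
    // Soft-fail: forward the request exactly as received.
    return messages;
  }
}
```

The design choice is that the fallback arm does no work at all: the worst case through TES is the request you would have sent anyway.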
Pro
$20/mo minimum
Devs and small teams paying $50–$500/mo to LLM providers.
- Per-token meter from $0 — minimum covers light use
- Same routes + per-tenant memory layer
- Per-project breakdown, exportable CSV
- X-TES-Mode: passthrough for A/B comparisons
- Email support, 1-business-day
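One way to wire the A/B comparison mentioned in the Pro tier, mirroring the client setup shown earlier. The header value `passthrough` is the one named above; the helper name and exact header semantics are assumptions:

```typescript
// Build per-arm headers for an A/B run: one request with memory
// injection, one forced passthrough, then compare usage.input_tokens
// between the two responses. Pass the result as defaultHeaders to
// your existing SDK client.
export function tesHeaders(mode?: "passthrough"): Record<string, string> {
  const headers: Record<string, string> = {
    "X-TES-Token": process.env.TES_API_KEY ?? "tes_example_token",
  };
  if (mode !== undefined) {
    headers["X-TES-Mode"] = mode; // ask TES to skip retrieval and forward unchanged
  }
  return headers;
}
```

For example, `new Anthropic({ defaultHeaders: tesHeaders("passthrough") })` for the baseline arm and `new Anthropic({ defaultHeaders: tesHeaders() })` for the memory arm.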
Enterprise
$1k/mo minimum, annual commit
Companies with a six- or seven-figure annual AI bill.
- Volume per-token rate, predictable ceiling
- Custom upstreams (MiniMax, vLLM, llama.cpp, your own inference)
- Custom KGs / vector stores / private corpora
- SLA, dedicated regions
- Per-team breakdown, SSO, audit log
- Slack channel, dedicated engineer
Reference: Anthropic lists Sonnet input at $3 / 1M tokens direct. Our Pro per-token rate is $0.50 / 1M — about a sixth of that, before the compression saving. You can audit the gap by putting our invoice next to your direct upstream invoice.
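Using the rates quoted above, the two savings compound. A back-of-the-envelope sketch: the rates and the 27.2% overall median reduction come from this page, while the 100M-token monthly volume is a made-up example:

```typescript
// Direct: every input token billed at the upstream rate.
// Via TES: fewer tokens (27.2% overall median reduction) at a lower rate.
const DIRECT_RATE = 3.0;       // USD per 1M Sonnet input tokens, direct
const TES_RATE = 0.5;          // USD per 1M input tokens, TES Pro
const MEDIAN_REDUCTION = 0.272;

export function monthlyCosts(inputTokensM: number) {
  const direct = inputTokensM * DIRECT_RATE;
  const viaTes = inputTokensM * (1 - MEDIAN_REDUCTION) * TES_RATE;
  return { direct, viaTes };
}

// e.g. 100M input tokens/month: $300 direct vs ~$36.40 via TES
```

Your own split will differ by workload (code-lookup turns compress far more than the overall median), which is why the page points you at the two invoices rather than this arithmetic.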
Get started
Lower your AI bill without changing your code
Free tier covers 1M proxied input tokens per month. No credit card. Pro from $20/mo, enterprise with annual commit.