The library behind the proxy.
Wrap, observe, retrieve.
If the env-var-swap proxy isn't enough — you need fine-grained retrieval control, you want to run the memory layer locally, or you're building observability on top of an LLM client the proxy doesn't speak — install the SDK directly. Same retrieval shape, in your process.
Just want a smaller bill? Skip the SDK and use the proxy — two env vars, one default header.
import { TESClient } from "@pentatonic-ai/ai-agent-sdk";
import Anthropic from "@anthropic-ai/sdk";
// Local mode — no API key. Memory stack runs on your machine.
const tes = new TESClient();
// One line — wrap your existing client.
const ai = tes.wrap(new Anthropic());
const res = await ai.messages.create({
model: "claude-sonnet-4-5",
max_tokens: 1024,
messages: [{ role: "user", content: "Hello" }],
});
// Captured locally:
// CHAT_TURN { tokens, model, ... }
// TOOL_USE { name, args, result }
// Session linkage across turns
// 4-layer memory: episodic, semantic, procedural, working
Install
One package, four entry points
npm / pip
Library install. Wrap your existing OpenAI / Anthropic / Workers AI client, emit observability events, manage sessions manually if you need to.
npx … init
Interactive setup for the hosted TES platform. Creates a tenant, verifies your email, mints an API key, and prints the three env vars you need.
npx … memory
Runs the full memory stack locally — PostgreSQL + pgvector + Ollama in Docker. No account, no API key, no egress. Pulls embedding + chat models, writes config, and you're done.
npm install @pentatonic-ai/ai-agent-sdk
Two modes
Local for privacy. Hosted for scale.
The same SDK runs against a local Docker stack or against the hosted TES platform. Switch by changing config — the call sites don't change.
Local memory
Free forever
PostgreSQL + pgvector + Ollama on your own machine. A Pi 5 with 8 GB RAM is enough for the default models.
- 4-layer memory — episodic, semantic, procedural, working
- Multi-signal retrieval (vector + BM25 + recency + frequency; sketched below)
- HyDE query expansion
- Decay and consolidation cycles
- Wrap any LLM client — OpenAI, Anthropic, Workers AI
- Token usage, tool calls, sessions
- MIT licensed, no telemetry, no egress
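For intuition, here is a minimal sketch of how a multi-signal retrieval score can combine those four signals. The weights, field names, and normalisation are illustrative assumptions, not the SDK's internals:
interface Candidate {
  vectorSim: number;    // cosine similarity from pgvector, assumed 0..1
  bm25: number;         // BM25 relevance, assumed pre-normalised to 0..1
  lastAccess: Date;     // feeds the recency signal
  accessCount: number;  // feeds the frequency signal
}
// Illustrative weights; the SDK's real formula is not shown here.
const WEIGHTS = { vector: 0.5, bm25: 0.25, recency: 0.15, frequency: 0.1 };
function score(c: Candidate, now = new Date()): number {
  const ageDays = (now.getTime() - c.lastAccess.getTime()) / 86_400_000;
  const recency = Math.exp(-ageDays / 30);                                  // exponential decay
  const frequency = Math.min(1, Math.log1p(c.accessCount) / Math.log(100)); // log-damped
  return (
    WEIGHTS.vector * c.vectorSim +
    WEIGHTS.bm25 * c.bm25 +
    WEIGHTS.recency * recency +
    WEIGHTS.frequency * frequency
  );
}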
Hosted TES platform
Per token
Same SDK pointed at the hosted platform. Higher-dimensional embeddings, conversation analytics, team-wide shared memory. Same per-token billing as the proxy.
- Everything in local, plus:
- NV-Embed-v2 (4096-dim) embeddings + HNSW index
- Conversation analytics + dead-end detection
- Team-wide shared memory + admin dashboard
- Multi-tenancy with per-client data residency
- Same token surface as the proxy — no parallel keys
Local — no config
// No API key needed — runs locally.
const tes = new TESClient();
const ai = tes.wrap(new OpenAI());
Hosted — three env vars
const tes = new TESClient({
endpoint: process.env.TES_ENDPOINT,
clientId: process.env.TES_CLIENT_ID,
apiKey: process.env.TES_API_KEY,
});
const ai = tes.wrap(new OpenAI()); // same call sites
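One way to flip between the two modes without touching call sites is to key the config off the environment. Treating a present TES_API_KEY as the hosted signal is a convention of this sketch, not something the SDK enforces:
import { TESClient } from "@pentatonic-ai/ai-agent-sdk";
import OpenAI from "openai";
// Hosted when the env vars are set, local Docker stack otherwise.
const tes = process.env.TES_API_KEY
  ? new TESClient({
      endpoint: process.env.TES_ENDPOINT,
      clientId: process.env.TES_CLIENT_ID,
      apiKey: process.env.TES_API_KEY,
    })
  : new TESClient();
const ai = tes.wrap(new OpenAI()); // identical call sites in both modes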
Provider agnostic
Three providers, one wrap
tes.wrap() detects the client type and intercepts the right method. Switch providers without changing tracking code.
JavaScript / TypeScript
import { TESClient } from "@pentatonic-ai/ai-agent-sdk";
import OpenAI from "openai";
const tes = new TESClient({
endpoint: process.env.TES_ENDPOINT, // https://<clientId>.api.pentatonic.com
clientId: process.env.TES_CLIENT_ID,
apiKey: process.env.TES_API_KEY,
});
const ai = tes.wrap(new OpenAI(), { sessionId: "conv-123" });
const res = await ai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello" }],
});
// CHAT_TURN event emitted in the background — token usage, tool calls,
// session linkage — without changing the call site.
Python
import os
from pentatonic_ai_agent_sdk import TESClient
from openai import OpenAI
tes = TESClient(
endpoint=os.environ["TES_ENDPOINT"],
client_id=os.environ["TES_CLIENT_ID"],
api_key=os.environ["TES_API_KEY"],
)
ai = tes.wrap(OpenAI(), session_id="conv-123")
res = ai.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
)
| Provider | Detection | Intercepted method | Available in |
|---|---|---|---|
| OpenAI | client.chat.completions.create | chat.completions.create() | JS + Python |
| Anthropic | client.messages.create | messages.create() | JS + Python |
| Workers AI | client.run | run() | JS only |
All other methods on the wrapped client pass through unchanged.
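That pass-through behaviour falls out of a proxy-based design. Here is a minimal sketch of the duck-typing the table describes, shown for the Anthropic shape only (the OpenAI path nests one level deeper). Everything below is an illustration, not the SDK's source:
// Detect the provider from the properties listed in the table above.
function detectProvider(client: any): "openai" | "anthropic" | "workers-ai" | null {
  if (client?.chat?.completions?.create) return "openai";
  if (client?.messages?.create) return "anthropic";
  if (typeof client?.run === "function") return "workers-ai";
  return null;
}
// Intercept messages.create(); every other property passes through untouched.
function sketchWrap<T extends object>(client: T, onTurn: (res: unknown) => void): T {
  if (detectProvider(client) !== "anthropic") return client; // sketch covers one shape
  return new Proxy(client, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (prop === "messages" && value && typeof value.create === "function") {
        return new Proxy(value, {
          get(t, p) {
            if (p === "create") {
              return async (...args: unknown[]) => {
                const res = await t.create(...args);
                onTurn(res); // emit CHAT_TURN in the background
                return res;
              };
            }
            return Reflect.get(t, p);
          },
        });
      }
      return value;
    },
  });
}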
Two patterns
Auto-wrap or manual sessions
tes.wrap() for zero boilerplate. tes.session() for full control over multi-turn loops. Both emit the same CHAT_TURN events.
Auto-wrap
Pass your client to tes.wrap(). Tracking happens automatically — no changes to your existing call sites.
const tes = new TESClient({ endpoint, clientId, apiKey });
const ai = tes.wrap(new OpenAI());
const res = await ai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello" }],
});
// CHAT_TURN emitted in the background.
Manual session
Create a session for multi-round tool-calling loops where you want to attach metadata to individual turns.
const session = tes.session({
sessionId: "conv-123",
metadata: { userId: "u_1", channel: "api" },
});
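// rawLLMResponse and res below are placeholders: the unwrapped provider
// response from each round, and the final round's reply.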
session.record(rawLLMResponse);
await session.emitChatTurn({
userMessage: "Find items under $50",
assistantResponse: res.content[0].text,
});
console.log(session.totalUsage);
// { prompt_tokens: 142, completion_tokens: 87, ai_rounds: 3 }
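Putting record, emitToolUse, and emitChatTurn together, a multi-round tool loop might look like the sketch below. Only the session methods come from the SDK; runTool and the loop shape are illustrative assumptions:
import Anthropic from "@anthropic-ai/sdk";
import { TESClient } from "@pentatonic-ai/ai-agent-sdk";
// Hypothetical tool runner; stands in for your own dispatch logic.
declare function runTool(name: string, input: unknown): Promise<unknown>;
const anthropic = new Anthropic();
const tes = new TESClient();
const session = tes.session({ sessionId: "conv-123" });
const userMessage = "Find items under $50";
const messages: Anthropic.MessageParam[] = [{ role: "user", content: userMessage }];
// Pass your `tools` array to messages.create() in a real loop.
let res = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  messages,
});
session.record(res); // accumulate this round's token usage
while (res.stop_reason === "tool_use") {
  const toolResults: Anthropic.ToolResultBlockParam[] = [];
  for (const block of res.content) {
    if (block.type !== "tool_use") continue;
    const result = await runTool(block.name, block.input);
    await session.emitToolUse({ name: block.name, args: block.input, result });
    toolResults.push({
      type: "tool_result",
      tool_use_id: block.id,
      content: JSON.stringify(result),
    });
  }
  messages.push({ role: "assistant", content: res.content });
  messages.push({ role: "user", content: toolResults });
  res = await anthropic.messages.create({ model: "claude-sonnet-4-5", max_tokens: 1024, messages });
  session.record(res);
}
// One CHAT_TURN for the whole loop, with usage summed across rounds.
const last = res.content[0];
await session.emitChatTurn({
  userMessage,
  assistantResponse: last?.type === "text" ? last.text : "",
});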
Reference
Public surface
Zero runtime dependencies. Full source on GitHub.
new TESClient({ endpoint, clientId, apiKey, ... })
Create a client. endpoint/clientId/apiKey are required for hosted mode. Optional: userId, captureContent, maxContentLength, waitUntil.
tes.wrap(client, { sessionId?, metadata? })
Return a transparent proxy around your LLM client. Detects provider, intercepts the right method, emits CHAT_TURN events without changing your code.
tes.session({ sessionId?, metadata? })
Create a Session manually for full control over multi-round tool-calling loops.
session.record(rawResponse)
Normalise a raw provider response and accumulate token usage on the session.
session.emitChatTurn({ userMessage, assistantResponse, ... })
Emit a CHAT_TURN event with accumulated data, then reset for the next turn.
session.emitToolUse({ name, args, result? })
Emit a TOOL_USE event linked to the current session.
normalizeResponse(raw)
Standalone utility — converts any provider response into { content, model, usage, toolCalls }.
createMemorySystem({ db, embedding, llm })
Library entry point — run the local 4-layer memory stack against your own pg + Ollama. Imported from @pentatonic-ai/ai-agent-sdk/memory.
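To make the memory entry point concrete, a minimal wiring sketch. Only createMemorySystem and its { db, embedding, llm } options appear in the reference above; the option shapes, connection string, and model names below are illustrative assumptions:
import { createMemorySystem } from "@pentatonic-ai/ai-agent-sdk/memory";
// Point the 4-layer memory stack at your own infrastructure.
const memory = createMemorySystem({
  db: { connectionString: "postgres://localhost:5432/tes" }, // pg + pgvector
  embedding: { model: "nomic-embed-text" },                  // served by local Ollama
  llm: { model: "llama3.1" },                                // used for HyDE expansion
});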
Where to go from here
Pick the surface that fits
Proxy quickstart
If you don't need fine-grained control, the env-var-swap proxy is the fastest path. Two routes, one header, done.
Read →
Claude Code plugin
Drop-in marketplace plugin that wires the SDK into Claude Code's hook system. Search and store memories without writing code.
Read →
OpenClaw plugin
Context-engine plugin for OpenClaw. Memory ingest, retrieval, decay, and consolidation fire on every lifecycle event — no agent rewrites.
Read →
Memory layer
What the SDK actually retrieves from. Episodic, semantic, procedural, working — bio-inspired layers with decay and consolidation.
Read →
Open source
Start local. Scale with TES.
Install free, run locally on your own infra. When you need higher-dimensional embeddings, conversation analytics, and team-shared memory — point the same client at the hosted platform.