AI Integration · Feb 11, 2026 · 10 min read

Building a Streaming Chatbot with the Claude API: A Production Guide

Why Claude for Production Chatbots

If you are building a customer-facing chatbot in 2026, Claude is the model we default to. Not because it is the best at everything — GPT-5 still wins on multimodal, Gemini on long context — but because Claude is the model that follows instructions consistently, returns structured output reliably, and stays in character across long conversations. For brand-voice chatbots, support agents, and product assistants, those are exactly the qualities that matter.

The Anthropic API is also genuinely well-designed. Streaming is first-class. Tool use is straightforward. Prompt caching is built in and dramatically cuts costs for chatbots with consistent system prompts. The SDK is small, the docs are accurate, and the pricing is predictable. We have shipped streaming chatbots on Claude in production for over two years now, and the developer experience is the reason we keep choosing it.

Your First Streaming Call

The minimum-viable Claude streaming integration looks like this in TypeScript: import the Anthropic SDK, instantiate the client with your API key, call client.messages.stream() with your model and messages array. The returned stream is an async iterable of typed events. You handle text_delta events to append tokens to the UI as they arrive. You handle message_stop to know the response is complete.
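
A minimal sketch of that loop, assuming the official @anthropic-ai/sdk package; the model id below is a placeholder, so substitute whichever Claude model you are deploying:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const stream = client.messages.stream({
  model: 'claude-sonnet-4-6', // placeholder model id
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'What services do you offer?' }],
});

for await (const event of stream) {
  // text_delta events carry the next chunk of the assistant's reply
  if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
    process.stdout.write(event.delta.text); // append to your chat UI instead
  }
  // message_stop signals the response is complete
  if (event.type === 'message_stop') {
    // finalize the message bubble, re-enable the input box, etc.
  }
}
```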

The reason streaming matters for chatbots is perceived latency. A non-streaming response feels slow even when the underlying tokens-per-second rate is high — the user is staring at a loading spinner for 3-5 seconds. A streaming response feels instant because the first characters appear within about 500ms and the response unspools in real time. For any chat UI, streaming is non-negotiable. Build it from the start, not as a post-launch optimization.

System Prompt as Brand Voice Foundation

The system prompt is where you define the chatbot's personality, knowledge, and operating rules. It is also the most frequently underestimated component of a production chatbot. A weak system prompt produces a generic chatbot that sounds like ChatGPT. A strong system prompt produces a chatbot that sounds like your brand.

The pattern we ship: a system prompt that establishes the brand voice (specific words to use, specific words to avoid), the chatbot's role and limits (what it can help with, what it should redirect), the user context it knows (the company's services, locations, FAQs), and the format expectations (response length, tone, level of detail). Production system prompts in our codebases are typically 1,500-3,000 tokens. They are version-controlled, tested with eval suites, and updated deliberately. Treat the system prompt as code, not config.
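
As a rough skeleton (the brand, rules, and wording below are illustrative, not a real client prompt), a system prompt built on that pattern looks like this:

```typescript
// Illustrative only: the sections mirror the pattern above, the content is placeholder.
export const SYSTEM_PROMPT = `
You are the assistant for Acme Outdoors, a Canadian camping gear retailer.

Voice:
- Warm, plainspoken, concise. Say "gear", never "equipment". No emoji.

Role and limits:
- Help with product questions, order status, and store policies.
- For refunds over $200 or warranty claims, direct the user to human support.

What you know:
- Store locations, shipping times, and the FAQ content provided below.

Format:
- Keep answers under 120 words unless the user asks for more detail.
`.trim();
```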

Prompt Caching for Cost Control

Anthropic's prompt caching feature is the single biggest cost optimization for any chatbot with a long system prompt. The mechanism: mark sections of your prompt as cacheable, and on subsequent requests with the same cached content, you pay 10 percent of the normal input cost for those tokens. For a chatbot with a 3,000-token system prompt, this is the difference between a sustainable unit economics story and a runaway API bill.

The implementation: pass a cache_control: { type: 'ephemeral' } marker on the system prompt block. The cache is keyed on the exact content of the prompt up to that marker, with a TTL of around five minutes. Subsequent calls within that window pay the cached rate. For a high-traffic chatbot, the hit rate is essentially 100 percent — every conversation reuses the same system prompt. We have seen production deployments cut their Anthropic bill by 70-80 percent just by enabling caching on the system prompt. Do this from day one.
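
In the TypeScript SDK that looks roughly like this, assuming the client from the first example, the SYSTEM_PROMPT constant from the previous section, and a conversation array holding the prior turns:

```typescript
const stream = client.messages.stream({
  model: 'claude-sonnet-4-6', // placeholder model id
  max_tokens: 1024,
  // The system prompt goes in as a content block so it can carry a cache marker.
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT,
      // Everything up to and including this block is cached for ~5 minutes;
      // identical requests within that window pay the cached input rate.
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: conversation, // the running array of user/assistant turns
});
```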

Tool Use for Grounded Answers

For chatbots that need to answer questions about your product's data — current pricing, availability, account status, recent orders — you do not want the model to hallucinate. You want it to call your APIs and return real data. This is what tool use is for.

The pattern: define tools as JSON schemas describing the functions the model can call (get_product_info, check_inventory, lookup_order, etc.). Pass the tool definitions to the API alongside your prompt. The model decides when to call a tool, returns a structured tool_use block with the function name and arguments, your code executes the function and returns the result, and the model incorporates the result into its response. The user sees a single coherent answer that happens to be backed by real data.
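
A condensed sketch of that loop. The lookup_order tool and the lookupOrder helper are hypothetical names for illustration; the request shape and the tool_result block are the parts that carry over:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const tools: Anthropic.Tool[] = [
  {
    name: 'lookup_order',
    description: 'Look up the current status of a customer order by order number.',
    input_schema: {
      type: 'object',
      properties: {
        order_number: { type: 'string', description: 'Order number from the confirmation email' },
      },
      required: ['order_number'],
    },
  },
];

const first = await client.messages.create({
  model: 'claude-sonnet-4-6', // placeholder model id
  max_tokens: 1024,
  tools,
  messages, // conversation so far
});

// If Claude decided to call a tool, execute it and feed the result back.
if (first.stop_reason === 'tool_use') {
  const toolUse = first.content.find(
    (block): block is Anthropic.ToolUseBlock => block.type === 'tool_use',
  );
  if (toolUse) {
    const result = await lookupOrder(toolUse.input); // your API; return errors in the result too
    const second = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      tools,
      messages: [
        ...messages,
        { role: 'assistant', content: first.content },
        {
          role: 'user',
          content: [
            { type: 'tool_result', tool_use_id: toolUse.id, content: JSON.stringify(result) },
          ],
        },
      ],
    });
    // second.content now holds the grounded, user-facing answer.
  }
}
```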

The discipline matters. Tools need clear descriptions so the model knows when to call them. Arguments need strict schemas so the model passes valid data. Errors need to be returned as part of the tool result so the model can recover gracefully. A well-designed tool layer is the difference between a chatbot that pretends to know things and a chatbot that actually knows them.

Conversation Memory and Context Limits

Chatbots that hold long conversations need a memory strategy. Every turn appends to the messages array, and eventually you hit the model's context window. For Claude Sonnet 4.6, that is 200K tokens — enormous, but not infinite for a chatbot used for hours at a time.

The simplest pattern is sliding window: keep the system prompt, the last N user/assistant turns, and discard older messages. This works for most chatbots where context is local. For chatbots that need to reference earlier parts of long conversations, the right pattern is summarization — periodically summarize older turns into a compressed memory block that lives in the system prompt, freeing up context for the recent conversation. Anthropic has a memory tool API in beta as of 2026 that handles this automatically; for production today, building the summarization yourself is straightforward and gives you full control.
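
The sliding-window version is only a few lines. This sketch trims by turn count for simplicity; a production version would count tokens instead:

```typescript
import type Anthropic from '@anthropic-ai/sdk';

// Keep only the most recent turns. The system prompt lives outside this array,
// so it is never trimmed away.
function slidingWindow(
  messages: Anthropic.MessageParam[],
  maxTurns = 20,
): Anthropic.MessageParam[] {
  if (messages.length <= maxTurns) return messages;
  const recent = messages.slice(-maxTurns);
  // Conversations must start with a user message, so drop a leading
  // assistant message if the cut landed on one.
  return recent[0].role === 'assistant' ? recent.slice(1) : recent;
}
```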

Production Hardening — Errors, Retries, Rate Limits

Putting Claude into production means treating it like any other external dependency. Network errors happen. Rate limits get hit. Models occasionally return malformed output even with strong system prompts. Your code has to handle these cases gracefully.

The hardening pattern: wrap API calls in retry logic with exponential backoff for 5xx errors. Implement client-side rate limiting that respects the API's rate limit headers (anthropic-ratelimit-requests-remaining and friends). Validate model output against expected shapes before using it (Zod is great for this). Stream errors should fall back to a generic friendly error message in the UI — never expose API errors directly to the user. Log enough context that you can reproduce issues from your logs alone. Build observability for token usage, latency, error rates, and cost per session. These are the boring engineering details that separate a chatbot demo from a chatbot you can run in production for thousands of users a day. Build them in. Then, when the chatbot just works in front of real users, you will know exactly why.
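
As one example, a retry wrapper with exponential backoff is about a dozen lines. This sketch assumes the SDK attaches an HTTP status to thrown API errors (it does) and uses illustrative backoff numbers:

```typescript
// Retry transient failures (429 and 5xx) with exponential backoff.
// Anything else is a real bug and should surface immediately.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const status = (err as { status?: number }).status;
      const retryable = status === 429 || (status !== undefined && status >= 500);
      if (!retryable || attempt === maxAttempts - 1) throw err;
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000));
    }
  }
  throw lastError;
}

// Usage: const response = await withRetry(() => client.messages.create({ ... }));
// If it still fails, show the user a friendly fallback message, never the raw error.
```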

Ready to put this into action?

We build the digital infrastructure that turns strategy into revenue. Let's talk about what DRTYLABS can do for your business.

Get in Touch