AI/ML
Token (AI Context)
What tokens are and why they matter in AI
Definition
In the context of AI and language models, a token is the fundamental unit of text that a model reads and generates. A token can be a whole word, a part of a word, or a punctuation mark. Token counts determine API pricing (cost per input/output token), context window limits (how much text the model can consider at once), and maximum response length.
How it works
Language models do not process text as characters or whole words. They use a tokenizer that breaks text into subword units. The word "embeddings" might become two tokens: "embed" and "dings." Common words like "the" are usually one token. On average, one token is roughly 3/4 of an English word, so 1,000 tokens is approximately 750 words.
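Real tokenizers use byte-pair encodings learned from large corpora; as a rough illustration of the idea only, here is a toy greedy longest-match subword splitter with a made-up vocabulary (not any real model's tokenizer):

```python
# Toy subword tokenizer: greedily takes the longest vocabulary entry
# that matches at each position. The vocabulary here is hypothetical,
# chosen only to reproduce the "embeddings" example from the text.
VOCAB = {"the", "embed", "dings", "token", "s", " "}

def tokenize(text: str, vocab=VOCAB) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, shrinking one char at a time.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Character not covered by the vocabulary: emit it on its own.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("embeddings"))  # ['embed', 'dings']
```

Production tokenizers work on bytes rather than characters and build their merge rules from data, but the output shape is the same: a list of subword units whose count, not the word count, is what the API meters.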
Context windows define the total number of tokens a model can process in a single request - both the input (your prompt plus any context) and the output (the model's response) combined. GPT-4 Turbo supports 128K tokens. Claude supports 200K tokens. This matters because if your RAG pipeline retrieves 50 document chunks, the total token count of those chunks plus the prompt must fit within the context window.
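A minimal sketch of that budgeting step, assuming you already know the token count of the prompt and of each retrieved chunk (the function and its parameters are illustrative, not from any particular framework):

```python
def fit_chunks(prompt_tokens: int, chunk_token_counts: list[int],
               context_window: int, reserved_output: int) -> int:
    """Return how many retrieved chunks (in ranked order) fit in the
    context window alongside the prompt, reserving room for the
    model's response."""
    budget = context_window - prompt_tokens - reserved_output
    used = 0
    kept = 0
    for n in chunk_token_counts:
        if used + n > budget:
            break  # this chunk would overflow the window; stop here
        used += n
        kept += 1
    return kept

# With a 2,000-token window, a 500-token prompt, and 500 tokens reserved
# for output, only the first two chunks below fit.
print(fit_chunks(500, [300, 400, 500], 2000, 500))  # 2
```

Because chunks usually arrive ranked by relevance, truncating from the tail like this keeps the most useful context when the window is tight.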
Tokens directly impact cost. API providers charge per token - separately for input tokens (what you send) and output tokens (what the model generates). Output tokens are typically 3-4x more expensive. Optimizing token usage through better prompts, smarter context selection, and response length limits is one of the most effective ways to reduce AI operating costs.
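The cost arithmetic itself is simple. A small sketch, with per-million-token prices passed in as parameters because actual rates vary by provider and model (the figures in the example are hypothetical):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, given separate input/output prices
    quoted per million tokens (the common pricing convention)."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical rates: $3 per 1M input tokens, $15 per 1M output tokens.
cost = request_cost(input_tokens=2000, output_tokens=500,
                    input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f}")  # $0.0135
```

Multiplying a per-request figure like this by expected daily request volume is the usual way to forecast a monthly bill, and it makes the leverage obvious: trimming input tokens scales the first term directly, while capping response length bounds the more expensive second term.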
How 1Raft uses tokens
We track token usage from day one of every AI project. We build dashboards that show per-request and per-user token consumption, which lets us forecast costs accurately and identify optimization opportunities. In a commerce application, restructuring prompts to reduce input tokens by 40% saved the client $8,000/month at scale. We design prompts and retrieval strategies with token efficiency as a first-class constraint.
Related terms
AI/ML
Large Language Model (LLM)
A large language model is a neural network trained on massive text datasets to understand and generate human language. LLMs power chatbots, content generation, code assistants, and most modern AI products.
AI/ML
Prompt Engineering
Prompt engineering is the practice of crafting and optimizing the instructions given to a language model to get consistent, high-quality outputs. It is the most accessible and cost-effective way to improve AI application behavior without modifying the underlying model.
AI/ML
Model Inference
Inference is the process of using a trained AI model to generate predictions or outputs from new inputs. When you send a prompt to an LLM and get a response, that is inference. It is where compute costs, latency, and user experience are determined.
AI/ML
Transformer Architecture
The transformer is the neural network architecture behind virtually all modern language models. Introduced in 2017, it uses a mechanism called self-attention to process entire sequences of text in parallel, making it far more efficient and capable than previous approaches.
AI/ML
Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation is a technique that combines a language model with a searchable knowledge base. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents first, then generates answers grounded in that specific data.
Related services
Next Step
Need help with tokens?
We apply this in production across industries. Tell us what you are building and we will show you how it fits.