
Context Windows, Explained for Humans

See how your prompt becomes tokens, how the model predicts the next token, and why long chats hit a memory limit.

Peregian Digital Hub
Tags: AI Skills · Context Windows · LLMs

You type a prompt into ChatGPT and hit send. It feels like the model is holding the whole conversation in its head. It is not.

It reads a slice of text called the context window. That window is its working memory, and it is measured in tokens instead of words. Every message you send and every response it generates has to fit inside that window.

This article shows the full path from prompt to tokens to predictions, with small demos you can play with.

From prompt to tokens

Start with the thing you already know: a normal chat prompt. The model does not see letters or words. It sees tokens, which are small pieces of text.

Tokens are not always whole words. A long word might become two or three tokens. Punctuation and spaces count too.

If you want the exact token splits for real models, try the tiktokenizer demo.

Step 1
Type a prompt. Watch it become tokens.
Chat input

This is a normal prompt. The model will split it into tokens before it does anything else.

Tokens (simplified)
Token IDs (integer sequence)

The model only sees these integers. The vocabulary maps each ID back to a text fragment.

Color legend: word · leading space inside a token · word piece · punctuation
The demo window holds 128 tokens.

This demo uses js-tiktoken (lite) with the o200k_base vocabulary (GPT-4o family). Special tokens like <|endoftext|> are allowed if you type them. Colors are based on token IDs.

Notice the tokens that begin with a blank space. Real tokenizers often include that space inside the token instead of splitting it out. Chat role wrappers are added by the API, not typed directly.

Behind the scenes, every token is an integer ID. The model never sees the letters, only a long list of numbers. That is what a context window really is: a sequence of integers that map back to word fragments.
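If you want to reproduce this outside the demo, the core of it is only a few lines. Here is a minimal sketch using the js-tiktoken lite setup described above; treat the exact import paths as an assumption to check against the library's docs.

```ts
// Minimal sketch of Step 1: turn a prompt into token IDs and back again.
// Uses js-tiktoken (lite) with the o200k_base vocabulary, as in the demo.
import { Tiktoken } from "js-tiktoken/lite";
import o200k_base from "js-tiktoken/ranks/o200k_base";

const enc = new Tiktoken(o200k_base);

const prompt = "This is a normal prompt.";
const ids = enc.encode(prompt); // an array of integer token IDs

console.log(`${ids.length} tokens for ${prompt.length} characters`);

// Decode one ID at a time to see the text fragment each integer maps to.
// Note the leading spaces baked into many fragments.
for (const id of ids) {
  console.log(id, JSON.stringify(enc.decode([id])));
}
```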

How the model predicts the next token

Inference is the step where the model predicts the next token. It looks at the whole sequence so far and assigns a probability to every token it could emit next.

In this demo, the model always picks the highest-probability token so the behavior is easy to see.

Step 2
Predict the next token, then feed the full sequence back in.

Each step reuses the entire sequence as input. The model is not guessing in isolation.

Every new token becomes part of the next input. That is why the model feels consistent as it talks – it is always reading the full context window again.
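As a sketch, the loop the demo acts out looks something like this. predictNextToken is a hypothetical stand-in for the model, not a real API:

```ts
// Toy version of the Step 2 loop.
type TokenId = number;

// Hypothetical stand-in: given every token so far, return a probability
// for each candidate next token.
declare function predictNextToken(context: TokenId[]): Map<TokenId, number>;

function generate(promptIds: TokenId[], maxNewTokens: number): TokenId[] {
  const sequence = [...promptIds];
  for (let step = 0; step < maxNewTokens; step++) {
    // The entire sequence so far is the input at every step.
    const probs = predictNextToken(sequence);

    // Like the demo, always take the highest-probability token.
    // Real chat models usually sample instead of always picking the top one.
    const [bestId] = [...probs.entries()].sort((a, b) => b[1] - a[1])[0];

    // The new token becomes part of the next input.
    sequence.push(bestId);
  }
  return sequence;
}
```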

Every message adds to the window

The context window is not just your prompt. It includes system instructions, your messages, the assistant’s replies, and any tool outputs. It all becomes one long sequence of tokens.

Click through the turns below and watch the window fill up.

Step 3
Each new message adds tokens to the same window.
The demo window holds 200 tokens.
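Counting the window in code looks roughly like this. The message shape is illustrative, and real chat APIs also spend a few wrapper tokens per message on role markers:

```ts
// Sketch of Step 3: every message, whatever its role, lands in one sequence.
import { Tiktoken } from "js-tiktoken/lite";
import o200k_base from "js-tiktoken/ranks/o200k_base";

const enc = new Tiktoken(o200k_base);

// Illustrative conversation, not any specific API's message format.
const conversation = [
  { role: "system", content: "You are a concise assistant." },
  { role: "user", content: "Plan a two-day trip to Noosa." },
  { role: "assistant", content: "Day 1: beach and boardwalk. Day 2: hinterland." },
  { role: "user", content: "Add two cafe options with opening hours." },
];

// Simplified count: the content of every turn is tokenized into the same window.
const used = conversation.reduce(
  (total, message) => total + enc.encode(message.content).length,
  0
);

console.log(`${used} tokens already in the window`);
```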

Tool calls add tokens too

When you ask for a web search, the assistant pauses and calls a tool. The tool returns text, and that text becomes tokens inside the same window. Then the assistant continues predicting the next tokens.

Step 4
Tool calls insert new tokens into the same sequence.
  1. User message: "Find two cafe options and opening hours."
  2. Assistant decides: it chooses a search tool instead of guessing.
  3. Tool call: a search request is sent to the web.
  4. Tool result: the web page text returns to the model.
  5. Tokens added: the tool text becomes tokens in the window.
  6. Assistant continues: it generates the response using the expanded context.

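In code, the important part is that the tool result is just more text appended to the same conversation. Both helpers below are hypothetical stand-ins, not a real tool-calling API:

```ts
// Sketch of Step 4: a tool result becomes ordinary tokens in the same window.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Hypothetical stand-ins for the search tool and the model call.
declare function searchWeb(query: string): Promise<string>;
declare function callModel(messages: Message[]): Promise<string>;

async function answerWithSearch(conversation: Message[]): Promise<string> {
  // The assistant decided to search instead of guessing.
  const pageText = await searchWeb("cafes near the venue, opening hours");

  // The returned page text is appended to the conversation, so it is
  // tokenized and counted exactly like a user or assistant message.
  conversation.push({ role: "tool", content: pageText });

  // The assistant continues predicting tokens over the expanded context.
  return callModel(conversation);
}
```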

Other modalities still become tokens

PDFs, images, and voice are not magic inputs. They get converted into text or multimodal tokens before the model reasons over them. Those tokens count toward the same context window.

Step 5
Different inputs, same token budget.
Text prompt -> Tokenizer -> Tokens
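For text-like inputs such as PDFs you can see this directly: once the file's text is extracted, it is tokenized like any typed prompt. extractPdfText is a hypothetical helper standing in for a real PDF parser:

```ts
// Sketch of Step 5 for a PDF: the extracted text spends the same token budget.
import { Tiktoken } from "js-tiktoken/lite";
import o200k_base from "js-tiktoken/ranks/o200k_base";

// Hypothetical helper; in practice you would use a PDF parsing library.
declare function extractPdfText(path: string): Promise<string>;

const enc = new Tiktoken(o200k_base);

async function pdfTokenCost(path: string): Promise<number> {
  const text = await extractPdfText(path);
  // These tokens come out of the same window as your typed messages.
  return enc.encode(text).length;
}
```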

What this means for your prompts

Once you understand the window, you can work with it instead of fighting it.

  • Keep your main goal in one short sentence near the top.
  • Use bullet points for constraints so the model can re-read them quickly.
  • Summarize long threads before asking for new work.
  • Ask for compact outputs when the conversation gets long.
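Put together, those habits might look like this prompt skeleton. The wording is purely illustrative:

```ts
// Illustrative prompt skeleton following the habits above.
const prompt = `
Goal: Write a 150-word announcement for next month's workshop.

Constraints:
- Audience: local small-business owners
- Tone: friendly, no jargon
- Output: one paragraph, under 150 words

Earlier thread, summarized: we agreed on a morning session covering AI basics,
with a hands-on demo at the end.
`;
```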

The practical takeaway

You do not need to count tokens by hand. Just use these three habits:

  1. Start with a short goal sentence.
  2. Put constraints in a clean list.
  3. If the chat gets long, ask for a short summary and continue from that summary.