Every LLM has a hard context limit. GPT-4o supports 128k tokens, while smaller models may offer only 4k. Exceed the limit and the API call fails. Managing this "real estate" is critical for long conversations.
**Truncation (Sliding Window):** Delete the oldest messages as you near the limit. **Pros:** Fast, zero cost. **Cons:** The AI "forgets" how the conversation started. In a support chatbot, the AI might forget the user's name or the problem they are trying to solve.
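A minimal sketch of the sliding-window approach. The `count_tokens` helper here is a crude word count standing in for a real tokenizer (such as tiktoken); the message names and budget are illustrative.

```python
def count_tokens(message: dict) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(message["content"].split())

def truncate_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the history fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(count_tokens(m) for m in system + rest) > max_tokens:
        rest.pop(0)  # delete the oldest turn first
    return system + rest

history = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "My name is Dana and my printer is broken."},
    {"role": "assistant", "content": "Sorry to hear that, Dana. What model?"},
    {"role": "user", "content": "It is an LX-500 and it jams constantly."},
]
trimmed = truncate_history(history, max_tokens=20)
```

Note that the system message is pinned: dropping it would lose the assistant's core instructions, so only conversational turns are evicted.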
**Summarization:** When the window is about 80% full, ask a cheap model (e.g., GPT-3.5) to "Summarize the conversation so far in 100 words," then replace the old messages with that summary. This lets the AI maintain context across sessions that last for hours or days.
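The compaction trigger can be sketched like this. Here `summarize()` is a placeholder for a real call to a cheap model, and the window size, 80% threshold, and "keep the last two turns" policy are illustrative assumptions.

```python
WINDOW_LIMIT = 4096
COMPACT_AT = 0.8  # compact once the window is 80% full

def count_tokens(messages: list[dict]) -> int:
    # Crude word-count estimate; use a real tokenizer in production.
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages: list[dict]) -> str:
    # Placeholder: in production, send the messages plus the instruction
    # "Summarize the conversation so far in 100 words." to a cheap model.
    return "Summary: " + " / ".join(m["content"][:20] for m in messages)

def maybe_compact(messages: list[dict]) -> list[dict]:
    if count_tokens(messages) < WINDOW_LIMIT * COMPACT_AT:
        return messages  # still plenty of room
    head, tail = messages[:-2], messages[-2:]  # keep the last two turns verbatim
    return [{"role": "system", "content": summarize(head)}] + tail

history = [{"role": "user", "content": "word " * 1000} for _ in range(4)]
compacted = maybe_compact(history)
```

Keeping the most recent turns verbatim matters: the summary preserves the gist of the past, while the model still sees the user's latest messages word for word.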
**Scientific Fact:** LLMs recall the **beginning** and **end** of a prompt far more reliably than the middle (the "lost in the middle" effect). If you put the most important fact in the center of a 100k-token prompt, the model may miss it. Always place your most critical instructions and data at the very end of the prompt.
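One way to act on this is in how you assemble the prompt: instructions at the top, bulk context in the middle, and the question plus a restated instruction at the very end. This helper is a hypothetical sketch, not a library API.

```python
def build_prompt(instructions: str, documents: list[str], question: str) -> str:
    # Edges get the critical content; the bulk context sits in the middle,
    # where recall is weakest, so nothing essential lives only there.
    context = "\n\n".join(documents)
    return (
        f"{instructions}\n\n"
        f"--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\n"
        f"Reminder: {instructions}\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "Answer using only the context.",
    ["Doc A ...", "Doc B ..."],
    "What does Doc B say?",
)
```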
Q: "Which is more important: A huge context window or a highly accurate RAG system?"
Architect's Answer: "A highly accurate RAG system. Even with expensive 1-million-token windows, stuffing the prompt with noise makes the model less accurate. Always aim to provide the **Minimum Viable Context**: it is faster, cheaper, and yields higher-quality answers than dumping a whole book into the window."