You can't send a 1,000-page PDF to an LLM all at once — it exceeds the model's context window. You must break it into smaller pieces called **chunks**. Choosing the wrong chunking strategy can destroy the AI's ability to find the right answer.
**Fixed-Size Chunking:** Splitting every ~500 tokens. **Pros:** Simple, fast. **Cons:** It can cut a sentence or a paragraph in half, losing context. We mitigate this with **overlap** (e.g., each chunk repeats the last 10% of the previous chunk), so information at the boundaries isn't lost.
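A minimal sketch of fixed-size chunking with overlap, splitting on characters as a stand-in for tokens (a real pipeline would count tokens with the model's tokenizer); the function name and parameters are illustrative:

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap_ratio: float = 0.10) -> list[str]:
    """Split text into fixed-size chunks. Each chunk starts before the
    previous one ends, so boundary sentences appear in both chunks."""
    step = int(chunk_size * (1 - overlap_ratio))  # advance 90% of a chunk each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=500` and `overlap_ratio=0.10`, the last 50 characters of each chunk reappear as the first 50 characters of the next one.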
**Recursive Character Splitting:** The "industry standard." It tries to split at the largest possible boundary (double newline), then falls back to single newline, then space. This keeps paragraphs and sentences together as single units of meaning.
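The idea can be sketched as follows — try the coarsest separator first, and only recurse into finer ones for pieces that are still too long (this is a simplified illustration, not any particular library's implementation):

```python
def recursive_split(text: str, max_len: int = 500,
                    separators: tuple = ("\n\n", "\n", " ")) -> list[str]:
    """Split at the coarsest separator available; recurse with finer
    separators only on pieces that still exceed max_len."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separators left: fall back to a hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # piece still fits; keep accumulating
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_len:
                # A single paragraph is too long: recurse at a finer level.
                chunks.extend(recursive_split(piece, max_len, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because paragraphs are only broken apart when they individually exceed the limit, most chunks end on a paragraph boundary rather than mid-sentence.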
**Semantic Chunking:** Using an embedding model to detect when the **topic** changes. Instead of splitting on a character count, it splits when the meaning shifts. This produces the highest-quality RAG context but is more expensive to generate, since every sentence must be embedded.
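A toy sketch of the mechanism: embed adjacent sentences and start a new chunk when their similarity drops below a threshold. The bag-of-words `embed` here is a deliberately crude stand-in for a real embedding model (e.g., a sentence-transformer), and the threshold value is illustrative:

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    # Toy bag-of-words "embedding" -- replace with a real model in practice.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever similarity to the previous sentence
    drops below the threshold, i.e. the topic appears to shift."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

The cost concern is visible even in this sketch: one embedding call per sentence, versus zero model calls for fixed-size or recursive splitting.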
Q: "How do you handle tables in PDFs for RAG?"
Architect Answer: "Tables are a nightmare for standard chunkers. We use **Layout-Aware Parsing** (like Azure AI Document Intelligence). It converts tables into Markdown format. Markdown preserves the row/column relationship in text form, which LLMs are excellent at reading. Simply stripping the text from a table destroys the meaning."
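Once a layout-aware parser has extracted a table's cells (Azure AI Document Intelligence, for example, returns structured cell data), rendering them as Markdown is straightforward. A minimal sketch, with hypothetical sales data for illustration:

```python
def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render extracted table cells as a Markdown table so the
    row/column relationships survive inside a plain-text chunk."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",  # separator row
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```

For a header `["Region", "Q1 Revenue"]` and rows like `["EMEA", "$1.2M"]`, the output is a pipe-delimited Markdown table, where "EMEA" and "$1.2M" stay on the same line — exactly the row/column relationship that naive text stripping destroys.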