Cloud compute is cheap; AI compute is expensive. A single GPT-4 call can cost $0.03. If 1 million users each make one call, that's $30,000. You must be an architect of costs as much as of code.
Always track tokens programmatically. Use a library like `tiktoken` to calculate the token count *before* sending the request. If a user tries to upload a 500-page PDF, block it at the gateway to prevent a $50 bill for a single call.
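A minimal sketch of that gateway check, assuming a hypothetical per-request token budget of 8,000. It uses `tiktoken` when the package is installed and falls back to the rough ~4-characters-per-token heuristic otherwise:

```python
# Pre-flight token check at the API gateway.
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        return len(_enc.encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        # Rough rule of thumb: ~4 characters per token for English text.
        return len(text) // 4

MAX_INPUT_TOKENS = 8_000  # hypothetical per-request budget

def check_request(text: str) -> bool:
    """Return True if the request fits the token budget, else False."""
    return count_tokens(text) <= MAX_INPUT_TOKENS
```

A 500-page PDF is on the order of 250,000 tokens, so it fails the check long before it can generate a surprise bill.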
Don't use GPT-4 for everything. Route simple, high-volume tasks (classification, extraction, summarization) to a cheaper model and reserve the flagship model for complex reasoning.
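One crude way to implement that routing is a keyword gate in front of the model call. This is a sketch, not a production classifier; the model names and keyword list are illustrative assumptions:

```python
# Hypothetical model router: cheap tier for simple tasks,
# expensive tier for open-ended reasoning.
SIMPLE_KEYWORDS = {"summarize", "classify", "extract", "translate"}

def pick_model(task: str) -> str:
    """Return the model name to use for a given task description."""
    words = set(task.lower().split())
    if words & SIMPLE_KEYWORDS:
        return "gpt-4o-mini"  # assumption: cheap-tier model name
    return "gpt-4"            # flagship model for everything else
```

In practice teams often replace the keyword set with a small classifier model, but the cost logic is the same: the router itself must be far cheaper than the calls it saves.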
Q: "What is 'Prompt Caching' and how does it save money?"
Architect Answer: "In a RAG system, you often send the same 10,000-token knowledge base with every request. Prompt caching (offered by providers such as OpenAI, Azure OpenAI, and Anthropic) lets the provider keep that prefix in memory. You pay full price for the 10,000 tokens once; subsequent calls pay a steeply discounted rate on the cached prefix plus full price for the new user message. This can reduce input-token costs by up to **90%** for repetitive enterprise workloads."
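The savings are easy to verify with back-of-envelope arithmetic. The prices below are illustrative assumptions ($3.00 per million input tokens, a 90% discount on cached reads), not any provider's actual rate card:

```python
# Back-of-envelope cost comparison for prompt caching.
PRICE_PER_MTOK = 3.00    # assumed $ per 1M input tokens, uncached
CACHED_DISCOUNT = 0.90   # assumed discount on cached prefix reads

def cost(prefix_tokens: int, user_tokens: int, calls: int, cached: bool) -> float:
    """Total input cost in dollars for `calls` requests sharing one prefix."""
    if not cached:
        return (prefix_tokens + user_tokens) * calls * PRICE_PER_MTOK / 1_000_000
    # First call writes the cache at full price; later calls read the
    # prefix at a discount and pay full price only for the new message.
    first = (prefix_tokens + user_tokens) * PRICE_PER_MTOK / 1_000_000
    rest = (calls - 1) * (prefix_tokens * (1 - CACHED_DISCOUNT) + user_tokens) \
           * PRICE_PER_MTOK / 1_000_000
    return first + rest

# 10,000-token knowledge base + 100-token question, 1,000 calls:
# uncached ≈ $30.30, cached ≈ $3.33 — roughly a 90% reduction.
```

Note that real providers typically charge a small premium to *write* the cache and expire cached prefixes after minutes of inactivity, so the effective saving depends on traffic patterns.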