When you run 5,000 microservices, they generate terabytes of logs every hour. A standard ELK stack (Elasticsearch, Logstash, Kibana) that ingests those logs directly is likely to fall over under this load. You need a stream-processing architecture.
We don't keep every log in Elasticsearch forever, because it's too expensive.

- **Tier 1 (Hot):** Last 7 days in Elasticsearch for instant search.
- **Tier 2 (Warm):** Last 30 days in a compressed format (e.g., Parquet) on S3 for long-term troubleshooting.
- **Tier 3 (Cold):** 1 year in Glacier for compliance.

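As a rough illustration of the tiering rule, the sketch below maps a log's age to a tier using the 7-day, 30-day, and 1-year windows above. The function name `storage_tier`, the tier labels, and the "expired" outcome for logs older than a year are assumptions made for this example; in practice the moves between tiers would typically be handled by the storage systems' own lifecycle policies.

```python
from datetime import datetime, timedelta, timezone

# Retention windows taken from the three tiers described above.
HOT_DAYS = 7
WARM_DAYS = 30
COLD_DAYS = 365


def storage_tier(log_time: datetime, now: datetime) -> str:
    """Return the tier a log record belongs to, based on its age."""
    age = now - log_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"    # Elasticsearch, instant search
    if age <= timedelta(days=WARM_DAYS):
        return "warm"   # compressed files (e.g., Parquet) on S3
    if age <= timedelta(days=COLD_DAYS):
        return "cold"   # Glacier, compliance only
    return "expired"    # assumption: logs past one year are deleted


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
for days_old in (1, 20, 200, 400):
    print(days_old, storage_tier(now - timedelta(days=days_old), now))
```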
Never send logs directly to the database. Every microservice ships its logs to Kafka, which acts as a giant buffer that can absorb massive spikes in traffic (e.g., during an outage). The log-indexing workers then pull from Kafka at their own pace, so the logging system never becomes the cause of a production crash.
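The decoupling described above can be sketched without a real Kafka cluster. The `LogBuffer` class below is a hypothetical in-memory stand-in for a Kafka topic, not a Kafka client: producers append without waiting for the indexer, and the consumer drains the backlog in fixed-size batches at its own pace.

```python
from collections import deque
from typing import List, Tuple


class LogBuffer:
    """In-memory stand-in for a Kafka topic (illustration only)."""

    def __init__(self) -> None:
        self._queue = deque()

    def publish(self, record: str) -> None:
        # Producers only append; they never wait on the indexer.
        self._queue.append(record)

    def poll(self, max_records: int) -> List[str]:
        # The consumer decides how much to take per call.
        batch = []
        while self._queue and len(batch) < max_records:
            batch.append(self._queue.popleft())
        return batch

    def backlog(self) -> int:
        return len(self._queue)


def simulate(spike_size: int, batch_size: int) -> Tuple[List[str], int]:
    """Publish a burst of logs, then index them in batches."""
    buf = LogBuffer()
    for i in range(spike_size):       # outage: services emit a burst of logs
        buf.publish(f"error {i}")
    indexed, polls = [], 0
    while buf.backlog():              # indexer catches up at its own speed
        indexed.extend(buf.poll(batch_size))
        polls += 1
    return indexed, polls


indexed, polls = simulate(spike_size=1000, batch_size=100)
print(len(indexed), polls)
```

The point of the design is that a slow or overloaded indexer only grows the backlog; it does not slow down or crash the services that produce the logs.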
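At FAANG scale, you don't log 'Success' messages at 100%. You use **Dynamic Sampling**: you log 1% of success messages but 100% of errors. This gives you enough data to see trends while cutting log volume, and the infrastructure cost tied to it, by 90% or more when errors are a small share of traffic. For example, if 95% of messages are successes, you keep roughly 6% of the original volume (the 5% that are errors plus 1% of the rest).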
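A minimal sketch of the sampling decision, assuming each log record carries a level and a trace ID (both field names are assumptions for this example). Hashing the trace ID instead of calling a random number generator is a design choice added here, not something the text prescribes: it makes the decision deterministic, so every service keeps or drops the same traces.

```python
import hashlib

SUCCESS_SAMPLE_RATE = 0.01  # keep 1% of success messages
BUCKETS = 10_000


def should_log(level: str, trace_id: str,
               success_rate: float = SUCCESS_SAMPLE_RATE) -> bool:
    """Keep every error; keep a fixed fraction of everything else."""
    if level.upper() == "ERROR":
        return True  # errors are always logged
    # Map the trace ID to a stable bucket in [0, BUCKETS).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return bucket < success_rate * BUCKETS


kept = sum(should_log("INFO", f"trace-{i}") for i in range(100_000))
print(f"kept {kept} of 100000 success messages")
```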
Q: "How do you find 'Silent Failures' in a microservices pipeline?"
Architect Answer: "**Synthetic Monitoring and Dead Letter Analysis**. We periodically inject a 'Probe' message into the system and track it from the Gateway to the DB. If it disappears anywhere, an alert is triggered. We also monitor our **Dead Letter Queues (DLQ)**. A spike in DLQ messages is the earliest signal that a service is failing silently, even if its 'Health Check' is still green."
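Both checks in the answer above can be sketched in a few lines. The stage names between the gateway and the DB, the function names, and the spike thresholds (`factor`, `min_count`) are hypothetical values chosen for illustration; a real system would pull probe sightings and DLQ counts from its tracing and metrics backends.

```python
from typing import List, Optional, Sequence, Set

# Hypothetical pipeline stages, ordered from entry point to final write.
STAGES = ("gateway", "orders-service", "kafka", "indexer", "db")


def first_missing_stage(seen: Set[str],
                        stages: Sequence[str] = STAGES) -> Optional[str]:
    """Return the first stage the probe never reached, or None if it arrived."""
    for stage in stages:
        if stage not in seen:
            return stage
    return None


def dlq_spike(counts: List[int], factor: float = 3.0,
              min_count: int = 10) -> bool:
    """counts holds DLQ messages per interval, oldest first.

    Flags a spike when the latest interval exceeds both an absolute floor
    and a multiple of the average of the earlier intervals.
    """
    if len(counts) < 2:
        return False
    *history, current = counts
    baseline = sum(history) / len(history)
    return current >= min_count and current > factor * baseline


lost_at = first_missing_stage({"gateway", "orders-service"})
if lost_at is not None:
    print(f"ALERT: probe disappeared before reaching {lost_at}")
if dlq_spike([2, 3, 1, 2, 40]):
    print("ALERT: dead letter queue spike")
```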
You are now ready to architect systems for the largest companies on earth. Go build the future.