Understanding the Thundering Herd Problem

1. What is a Thundering Herd Problem?
Imagine you are watching a movie in a cinema hall when the fire alarm suddenly goes off and everyone panics. A short circuit has started a fire and cut the power, so the hall's 3 electronic gates stay closed. That leaves a single fire exit, and everyone rushes towards it. When 400-500 people try to escape through one door, the crush creates even more panic, and the staff struggle to get people out of the hall.
When that many people try to leave through a single door at the same time, the situation quickly gets out of control. In distributed systems, this kind of out-of-control rush is generally known as the thundering herd problem.
The Thundering Herd Problem arises when a large number of client requests try to access the same resource on the server simultaneously, increasing the load and, in the worst case, causing the server to crash.
It usually happens when:
Cache expires
A lock is released
A service becomes available
A popular event triggers traffic
Instead of spreading load gradually, all requests fire at once.
2. Simple Architecture of a Working Application
In a normal architecture, a request flows like this:
Client requests the data -> Server checks for the data in the cache -> Cache hit: the response returns very fast. (On a cache miss, the server fetches the data from the DB and also stores it in the cache.)
Everything works normally until the cache TTL expires.
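The read path described above is often called cache-aside. The following is a minimal sketch of it, not a production implementation. The names TTLCache, get_with_cache and fetch_from_db are hypothetical, and a plain in-memory dictionary stands in for a real cache such as Redis.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-key expiry (a stand-in for a real cache)."""

    def __init__(self, clock=time.monotonic):
        self._store = {}      # key -> (value, expires_at)
        self._clock = clock   # injectable so expiry can be tested without waiting

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]   # TTL has passed: treat it as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)


def get_with_cache(cache, key, fetch_from_db, ttl_seconds=300):
    """Cache-aside read. Returns (value, source) where source is 'cache' or 'db'.

    fetch_from_db is a hypothetical callable that loads the value for a key.
    Note: a cached value of None is treated as a miss in this sketch.
    """
    value = cache.get(key)
    if value is not None:
        return value, "cache"              # cache hit: fast path
    value = fetch_from_db(key)             # cache miss: go to the database
    cache.set(key, value, ttl_seconds)     # store it for the next request
    return value, "db"
```

With a single caller this works well. The trouble starts when many callers reach the miss branch at the same moment, because each of them calls fetch_from_db on its own.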
3. Where It Commonly Occurs
3.1 Caching Systems (Most Common)
Redis
API response caching
This happens specifically when the TTL expires and the entire load shifts to the DB.
3.2 Databases
Many clients querying same row
Expensive aggregation query
3.3 Load Balancers
- When servers recover and all traffic is routed back instantly
4. Cache Expiry (Real-Life Example)
Let's say you are caching IPL match data for 5 minutes, so TTL = 5 minutes = 300 seconds.
Suppose 100,000 users are watching the match score. At the 300th second the cached data expires, so all 100,000 requests hit a cache miss at the same time and go to the database. The DB load jumps suddenly and can crash the database. This is the Thundering Herd Problem.
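The expiry scenario above can be reproduced on a small scale. This is a rough simulation under stated assumptions: threads stand in for users, an empty dictionary stands in for the just-expired cache entry, and slow_db_query is a hypothetical function that sleeps to imitate an expensive query. The exact number of DB calls depends on thread scheduling, so no specific output is claimed.

```python
import threading
import time


def simulate_expiry(num_requests=50, db_latency=0.05):
    """Send num_requests threads at an empty cache and return the DB call count."""
    cache = {}                        # empty dict models the expired entry
    db_calls = 0
    counter_lock = threading.Lock()   # protects only the counter, not the cache
    start = threading.Barrier(num_requests)

    def slow_db_query():
        nonlocal db_calls
        with counter_lock:
            db_calls += 1
        time.sleep(db_latency)        # pretend the query is expensive
        return "placeholder score"

    def handle_request():
        start.wait()                  # release every request at the same instant
        if "score" not in cache:      # each thread sees the miss...
            cache["score"] = slow_db_query()   # ...and queries the DB itself

    threads = [threading.Thread(target=handle_request) for _ in range(num_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return db_calls
```

Calling simulate_expiry() typically reports many DB calls for a single key, because nothing stops the requests that arrive while the first query is still running.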
5. What actually happens (How Traffic overloads the system)
6. Why It Is Dangerous for Distributed Systems
A distributed system typically has:
Multiple app servers
Auto-scaling
Shared DB
Shared cache
6.1 Problems:
Every app server independently sees the Cache Miss.
Each server triggers the DB Call.
The load on the DB multiplies suddenly.
For instance, say you have 10 servers, each receiving approximately 100,000 requests. On a cache miss you do not get 100,000 DB hits; you get 10 x 100,000 = 1,000,000 DB hits simultaneously. This sudden amplification is very dangerous in a distributed system.
7. Difference Between a Normal Traffic Spike and a Thundering Herd
Normal Traffic Spike | Thundering Herd |
1. Traffic increases gradually. | 1. Sudden, synchronized traffic burst on the system. |
2. Caching works normally and handles the user requests. | 2. Cache entries expire or fail at the same time, causing a sudden increase in DB hits. |
3. System scales easily. | 3. System collapses. |
4. Predictable. | 4. Very hard to predict. |
Key Difference:
Normal Traffic Spike = Load gradually increases.
Thundering Herd = Synchronized burst caused by system behaviour.
8. Impact on System Components
8.1 CPU
Thread pool exhausted
Context switching increases
System thrashes
8.2 Database
Too many connections
Slow queries
Lock contention
Deadlocks possible
8.3 Cache
Stampede on rebuild
Memory pressure
Increased miss rate
8.4 Latency
Response time spikes
Timeouts
Cascading failures across services
9. Techniques to Prevent or Reduce It
9.1 Request Coalescing (Single Flight)
If the system suddenly receives 1000 requests for the same data, only the first request is allowed to fetch the data from the DB, while the other 999 requests wait for that result.
So instead of 1000 sudden DB requests, you get only 1 DB request.
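A minimal in-process sketch of request coalescing is shown below. It is an illustration of the idea, not a reference implementation: the SingleFlight class and its do method are hypothetical names, and the sketch only coalesces requests inside one process, not across several app servers.

```python
import threading


class _Call:
    """Bookkeeping for one in-flight fetch."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Collapse concurrent calls for the same key into a single execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}   # key -> _Call

    def do(self, key, fn):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call

        if not leader:
            call.done.wait()            # follower: wait for the leader's result
            if call.error is not None:
                raise call.error
            return call.result

        try:
            call.result = fn()          # leader: the only caller that hits the DB
        except Exception as exc:
            call.error = exc            # followers see the same failure
            raise
        finally:
            with self._lock:
                del self._in_flight[key]
            call.done.set()             # wake up everyone who was waiting
        return call.result
```

A caller would wrap its database fetch as flight.do("match:score", load_score), where load_score is whatever function queries the DB. Once the leader finishes, the key is removed, so a later call triggers a fresh fetch; the result is normally written to the cache so that later requests do not need one.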
9.2 Cache Locking and Mutex
If a request wants to fetch the data, then:
It first needs to acquire a lock.
Only the lock holder can rebuild the cache.
The other requests wait for the cache to be updated and are served the stale data in the meantime.
This mechanism is commonly used in Redis-based systems.
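The steps above can be sketched in a single process as follows. This is an assumption-laden illustration: threading.Lock stands in for a distributed lock (in Redis-based systems such a lock is often taken with SET key value NX EX seconds), the manual expire() call stands in for TTL expiry, and the class name StaleWhileRebuildCache and the loader callable are hypothetical.

```python
import threading


class StaleWhileRebuildCache:
    """One cached value; only the lock holder rebuilds it, others get stale data."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical function that queries the DB
        self._value = None
        self._fresh = False
        self._rebuild_lock = threading.Lock()

    def expire(self):
        """Stand-in for TTL expiry: mark the value stale but keep it around."""
        self._fresh = False

    def get(self):
        if self._fresh:
            return self._value           # normal cache hit

        # Try to become the rebuilder without blocking.
        if self._rebuild_lock.acquire(blocking=False):
            try:
                if not self._fresh:      # re-check: someone may have just rebuilt it
                    self._value = self._loader()
                    self._fresh = True
                return self._value
            finally:
                self._rebuild_lock.release()

        # Another request holds the lock and is rebuilding.
        if self._value is not None:
            return self._value           # serve the stale copy in the meantime

        # Cold start with nothing stale to serve: wait for the rebuild, then retry.
        with self._rebuild_lock:
            pass
        return self.get()
```

Serving stale data trades a little freshness for stability, which suits something like a match score that is refreshed every few minutes anyway. A real distributed lock would also need an expiry so that a crashed lock holder cannot block rebuilds forever.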
9.3 Staggered Expiry (Jitter / Randomized TTL)
Normally, all the cache keys are given the same TTL, so they all expire at the same time, which increases the DB load.
Instead of using TTL = 300 seconds for all keys,
we use TTL = 300 + random(0-30) seconds.
Now the keys have different expiry times, which prevents synchronized expiry and keeps the system stable.
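The jittered TTL formula above can be written as a small helper. This is a sketch: the function name ttl_with_jitter and its parameters are hypothetical, and it assumes whole-second TTLs.

```python
import random

BASE_TTL_SECONDS = 300     # the fixed TTL from the example above
MAX_JITTER_SECONDS = 30    # upper bound of the random extra time


def ttl_with_jitter(base=BASE_TTL_SECONDS, max_jitter=MAX_JITTER_SECONDS, rng=random):
    """Return base plus a random whole number of seconds in [0, max_jitter].

    rng can be the random module or a random.Random instance (useful for tests).
    """
    return base + rng.randint(0, max_jitter)
```

Each cache write would then use ttl_with_jitter() in place of the constant 300, so keys written in the same second expire over a 30-second window instead of all at once.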
9.4 Exponential Backoff
If a client request fails on the first attempt, the client retries after progressively longer waits:
- 1 sec, 2 sec, 4 sec, 8 sec and so on.
This prevents retries from causing a sudden increase in system load.
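The retry schedule above can be sketched like this. The function names backoff_delays and call_with_backoff are hypothetical, and the 60-second cap is an assumption added so that delays do not grow without bound.

```python
import time


def backoff_delays(max_retries, base_seconds=1.0, cap_seconds=60.0):
    """Delays before each retry: base, 2*base, 4*base, ... capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * 2 ** attempt)
            for attempt in range(max_retries)]


def call_with_backoff(operation, max_retries=4, base_seconds=1.0, sleep=time.sleep):
    """Run operation(); after each failure wait 1s, 2s, 4s, ... and try again.

    sleep is injectable so the schedule can be tested without really waiting.
    """
    for delay in backoff_delays(max_retries, base_seconds):
        try:
            return operation()
        except Exception:
            sleep(delay)       # back off before the next attempt
    return operation()         # final attempt: let any exception propagate
```

If many clients fail at the same moment, they will also retry on the same schedule, so in practice a small random jitter is often added to each delay, in the same spirit as the randomized TTL in 9.3.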
9.5 Rate Limiting
Apply Limits on:
Requests per user.
Requests per IP.
Global system throughput.
This prevents the sudden increase in traffic that would otherwise lead to system collapse.
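One common way to enforce such limits is a token bucket. The text above does not name a specific algorithm, so this choice is an assumption, and the class names TokenBucket and PerKeyLimiter are hypothetical. The sketch is single-process and not thread-safe.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self._tokens = float(capacity)   # start with a full bucket
        self._clock = clock
        self._last = clock()

    def allow(self):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1:
            self._tokens -= 1            # spend one token for this request
            return True
        return False                     # over the limit: reject or queue


class PerKeyLimiter:
    """One bucket per key, where the key is a user ID or an IP address."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self._args = (rate_per_second, capacity, clock)
        self._buckets = {}

    def allow(self, key):
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = self._buckets[key] = TokenBucket(*self._args)
        return bucket.allow()
```

A per-user or per-IP limit would use PerKeyLimiter keyed by the user ID or address, while a single shared TokenBucket can cap global throughput.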