We had a background job that processed thousands of records in parallel. Each batch ran concurrently, and we kept track of total successful and failed records.
Everything worked perfectly.
For almost a year.
Then one day, the totals started becoming… wrong.
No exceptions. No crashes. Just incorrect numbers.
The Setup
- Records processed in chunks
- Multiple chunks running concurrently
- Shared counters tracking totals
- Periodic database updates with progress
All standard parallel batch processing. And yet — totals drifted.
The Symptom
- Some runs showed fewer successful records than expected
- Re-running the same data produced different counts
- The issue appeared only in one environment
What Was Actually Happening
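Two workers read the same counter value, each increments its own copy, and each writes the result back. The second write overwrites the first, so one increment disappears.
For example, both workers read 10 and both write 11. Two records succeeded, but the counter advanced by only one.
This is the classic lost-update race condition.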
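To make the interleaving concrete, here is a minimal single-threaded replay of it. The variable names (sharedCounter, workerARead, workerBRead) are made up for illustration; each "worker" is simply written out as its separate read and write steps.

```csharp
using System;

// Deterministic replay of the lost update: no threads involved,
// just the same steps in the unlucky order.
int sharedCounter = 10;

int workerARead = sharedCounter;   // worker A reads 10
int workerBRead = sharedCounter;   // worker B reads 10, before A has written

sharedCounter = workerARead + 1;   // worker A writes 11
sharedCounter = workerBRead + 1;   // worker B also writes 11, overwriting A's update

Console.WriteLine(sharedCounter);  // prints 11, although two increments ran
```

In the real job the order of these four steps depends on thread scheduling, which is why the totals were only sometimes wrong.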
The Buggy Code
int totalSuccess = 0;
Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        totalSuccess++; // not atomic
    }
});
Why volatile Alone Doesn't Fix It
private static volatile int totalSuccess = 0;
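Note that volatile is only allowed on fields, so the counter has to move out of the method. It ensures that every read and write of the field is visible to other threads, but totalSuccess++ is still three separate steps (read, add, write), and another thread can run between them. Visibility is not atomicity.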
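A rough way to see the difference is to run both styles side by side. The sketch below is an assumption-laden demo, not the original job: it uses Volatile.Read and Volatile.Write on a local (the names volatileStyleCounter, atomicCounter, and Iterations are invented here) to spell out the read and write that ++ performs on a volatile field, next to a single Interlocked.Increment.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

const int Iterations = 1_000_000;

int volatileStyleCounter = 0; // updated with a volatile read followed by a volatile write
int atomicCounter = 0;        // updated with Interlocked.Increment

Parallel.For(0, Iterations, i =>
{
    // Read, add one, write back. Each step is visible to other threads,
    // but another thread can still run between the read and the write.
    int seen = Volatile.Read(ref volatileStyleCounter);
    Volatile.Write(ref volatileStyleCounter, seen + 1);

    // One indivisible read-modify-write.
    Interlocked.Increment(ref atomicCounter);
});

Console.WriteLine($"volatile-style: {volatileStyleCounter}"); // can fall short; varies per run
Console.WriteLine($"interlocked:    {atomicCounter}");        // always equals Iterations
```

How far the first number falls short depends on the machine and the run, which matches the "only in one environment" symptom.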
The Fix: Atomic Counters
int totalSuccess = 0;
Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        Interlocked.Increment(ref totalSuccess);
    }
});
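For a version you can run end to end, here is a small sketch that tracks both counters. The Process function below is a hypothetical stand-in (even IDs succeed, odd IDs fail), and the record set is just a range of integers; neither comes from the real job.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the real Process(record): even IDs succeed, odd IDs fail.
static bool Process(int record) => record % 2 == 0;

int[] records = Enumerable.Range(0, 100_000).ToArray();
int totalSuccess = 0;
int totalFailed = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
        Interlocked.Increment(ref totalSuccess);
    else
        Interlocked.Increment(ref totalFailed);
});

// Cheap invariant to check after every run: nothing was lost or double counted.
bool consistent = totalSuccess + totalFailed == records.Length;
Console.WriteLine($"{totalSuccess} ok, {totalFailed} failed, consistent: {consistent}");
```

Checking that the two counters add up to the number of records is an inexpensive guard that would have flagged the original bug on the first bad run.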
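Snapshot-Based Progress Reporting
Progress is written to the database once every maxConcurrency completed chunks, using a point-in-time read of each counter rather than the live variables.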
var finished = Interlocked.Increment(ref completedChunks);
if (finished % maxConcurrency == 0)
{
    var successSnapshot = Volatile.Read(ref totalSuccess);
    var failureSnapshot = Volatile.Read(ref totalFailed);
    job.TotalSuccessfulRecords = successSnapshot;
    job.TotalFailedRecords = failureSnapshot;
    await UpdateJobProgress(job);
}
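The snippet above lives inside whatever processes a single chunk. A self-contained sketch of that surrounding shape might look like the following. Everything around the snapshot logic is assumed for illustration: the SemaphoreSlim throttle, the ProcessChunk helper, the rule that every tenth record fails, and an UpdateJobProgress that appends to an in-memory list instead of calling a database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

const int maxConcurrency = 4; // hypothetical value
int totalSuccess = 0, totalFailed = 0, completedChunks = 0;
var reports = new List<(int Success, int Failed)>(); // stands in for the database
var reportLock = new object();

// Hypothetical replacement for UpdateJobProgress: records the snapshot in memory.
Task UpdateJobProgress(int success, int failed)
{
    lock (reportLock) { reports.Add((success, failed)); }
    return Task.CompletedTask;
}

async Task ProcessChunk(int[] chunk, SemaphoreSlim gate)
{
    await gate.WaitAsync(); // at most maxConcurrency chunks run at once
    try
    {
        foreach (int record in chunk)
        {
            // Hypothetical outcome rule: every tenth record fails.
            if (record % 10 != 0) Interlocked.Increment(ref totalSuccess);
            else Interlocked.Increment(ref totalFailed);
        }

        // Each chunk gets a unique value of finished, so only one chunk reports per interval.
        int finished = Interlocked.Increment(ref completedChunks);
        if (finished % maxConcurrency == 0)
        {
            int successSnapshot = Volatile.Read(ref totalSuccess);
            int failureSnapshot = Volatile.Read(ref totalFailed);
            await UpdateJobProgress(successSnapshot, failureSnapshot);
        }
    }
    finally { gate.Release(); }
}

int[][] chunks = Enumerable.Range(0, 8_000).Chunk(500).ToArray(); // 16 chunks of 500
using var gate = new SemaphoreSlim(maxConcurrency);
await Task.WhenAll(chunks.Select(c => Task.Run(() => ProcessChunk(c, gate))));

Console.WriteLine($"{reports.Count} progress reports written");
```

One caveat worth keeping in mind: the two Volatile.Read calls are not a single atomic pair, so an intermediate snapshot can include a record in one counter that a still-running chunk added between the two reads. For a progress display that is usually acceptable; the final totals should be read after all chunks have finished.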
Lessons Learned
- Thread-safe collections ≠ thread-safe logic
- ++ is not atomic
- volatile ensures visibility, not correctness
- Use Interlocked for counters
- Use snapshot reads for reporting
- Reduce shared mutable state
- Concurrency bugs are timing-dependent
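The first lesson deserves a quick illustration, since it is easy to assume a concurrent collection makes the problem go away. In this sketch (the dictionaries, the "success" key, and the Updates constant are all invented for the example), every individual call on ConcurrentDictionary is thread-safe, yet a read followed by a write is still a lost-update race. AddOrUpdate performs the update as one logical operation instead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

const int Updates = 200_000;

var racy = new ConcurrentDictionary<string, int>();
var safe = new ConcurrentDictionary<string, int>();
racy["success"] = 0;

Parallel.For(0, Updates, i =>
{
    // Two thread-safe calls (indexer get, indexer set) with a gap between them:
    // another thread can update the value in that gap, and its write is then overwritten.
    racy["success"] = racy["success"] + 1;

    // AddOrUpdate retries internally until the update is applied to the current value.
    safe.AddOrUpdate("success", 1, (key, current) => current + 1);
});

Console.WriteLine($"get-then-set: {racy["success"]}"); // can fall short; varies per run
Console.WriteLine($"AddOrUpdate:  {safe["success"]}"); // always equals Updates
```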
Takeaway
If you're running parallel batch jobs and tracking totals:
- Use atomic counters
- Take snapshot reads for reporting
- Avoid frequent shared writes
Otherwise, everything may look fine… until it doesn't.
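One way to act on "avoid frequent shared writes" is to let each worker keep a private subtotal and touch the shared counter only once, when it finishes. Parallel.ForEach has an overload with localInit and localFinally delegates for exactly this kind of aggregation. The sketch below assumes a made-up Process rule (multiples of three fail) and an integer record set.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the real Process(record): multiples of three fail.
static bool Process(int record) => record % 3 != 0;

int[] records = Enumerable.Range(0, 90_000).ToArray();
int totalSuccess = 0;

Parallel.ForEach(
    records,
    () => 0,                                    // localInit: private subtotal for each task
    (record, loopState, localSuccess) =>        // body: no shared state touched here
        Process(record) ? localSuccess + 1 : localSuccess,
    localSuccess =>                             // localFinally: one shared write per task
        Interlocked.Add(ref totalSuccess, localSuccess));

Console.WriteLine(totalSuccess); // prints 60000
```

The shared counter still needs Interlocked.Add, because several tasks can finish at the same time, but contention drops from one atomic operation per record to one per task.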