This Concurrency Bug Stayed Hidden for a Year

We had a background job that processed thousands of records in parallel. Each batch ran concurrently, and we kept track of total successful and failed records.

Everything worked perfectly.

For almost a year.

Then one day, the totals started becoming… wrong.

No exceptions. No crashes. Just incorrect numbers.

The Setup

Records processed in chunks
Multiple chunks running concurrently
Shared counters tracking totals
Periodic database updates with progress

All standard parallel batch processing. And yet the totals drifted.

The Symptom

Some runs showed fewer successful records than expected
Re-running the same data produced different counts
The issue appeared only in one environment

What Was Actually Happening

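totalSuccess++ looks like one operation, but it is three: read the current value, add one, and write the result back. Two workers can read the same value, each increment it locally, and then one write overwrites the other.

For example, both workers read 10, both compute 11, and both write 11. Two records succeeded, but the counter moved by only one.

This is a classic lost update race condition.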
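The interleaving can be replayed by hand on a single thread, which makes the outcome deterministic. This is only an illustrative sketch: LostUpdateReplay and its variable names are made up for the example and are not part of the original job.

```csharp
using System;

public static class LostUpdateReplay
{
    // Replays the racy interleaving step by step on one thread.
    public static int Replay()
    {
        int totalSuccess = 10;

        int readByWorkerA = totalSuccess;  // worker A reads 10
        int readByWorkerB = totalSuccess;  // worker B reads 10 before A writes back

        totalSuccess = readByWorkerA + 1;  // A writes 11
        totalSuccess = readByWorkerB + 1;  // B also writes 11, overwriting A's update

        return totalSuccess;               // 11, although two increments ran
    }
}
```

In the real job the same steps happen on different threads, so whether an update is lost depends on timing.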

The Buggy Code

int totalSuccess = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        totalSuccess++; // not atomic: read, add, and write are separate steps
    }
});

Why volatile Alone Doesn't Fix It

private static volatile int totalSuccess = 0;

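This ensures visibility (a worker does not keep using a stale copy of the field), but not atomicity. totalSuccess++ is still a read, an add, and a write, and another worker can slip in between any two of those steps.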
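A small side-by-side run makes the difference visible. The sketch below is a hypothetical harness (VolatileCounterDemo is not from the original job) that increments a volatile field and an Interlocked counter from the same parallel loop.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class VolatileCounterDemo
{
    private static volatile int volatileTotal;
    private static int atomicTotal;

    public static (int VolatileCount, int AtomicCount) Run(int iterations)
    {
        volatileTotal = 0;
        atomicTotal = 0;

        Parallel.For(0, iterations, _ =>
        {
            volatileTotal++;                         // still read, add, write: updates can be lost
            Interlocked.Increment(ref atomicTotal);  // one atomic read-modify-write
        });

        return (volatileTotal, atomicTotal);
    }
}
```

The atomic count always equals the number of iterations. The volatile count can never exceed it and, on a multi-core machine, usually falls short, though any single run may happen to come out right. That is why the bug can hide for so long.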

The Fix: Atomic Counters

int totalSuccess = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        Interlocked.Increment(ref totalSuccess);
    }
});

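Snapshot-Based Progress Reporting

Progress is written to the database on every maxConcurrency-th completed chunk, not after every record. The worker whose increment lands on that boundary reads each counter once and persists that snapshot.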

var finished = Interlocked.Increment(ref completedChunks);

if (finished % maxConcurrency == 0)
{
    var successSnapshot = Volatile.Read(ref totalSuccess);
    var failureSnapshot = Volatile.Read(ref totalFailed);

    job.TotalSuccessfulRecords = successSnapshot;
    job.TotalFailedRecords = failureSnapshot;

    await UpdateJobProgress(job);
}
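To show how that fragment could sit inside a complete job, here is a minimal sketch under some assumptions: chunks run as tasks throttled by a SemaphoreSlim, and the database write is replaced by an in-memory queue. ChunkedJobSketch, Reports, and ReportAsync are hypothetical names. The extra check against chunks.Count is an addition so that the last chunk always reports, even when the chunk count is not a multiple of maxConcurrency.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public sealed class ChunkedJobSketch
{
    private int _totalSuccess;
    private int _totalFailed;
    private int _completedChunks;

    // Hypothetical stand-in for UpdateJobProgress: keeps snapshots in memory.
    public ConcurrentQueue<(int Success, int Failed)> Reports { get; } =
        new ConcurrentQueue<(int Success, int Failed)>();

    public int TotalSuccess => Volatile.Read(ref _totalSuccess);
    public int TotalFailed => Volatile.Read(ref _totalFailed);

    public async Task RunAsync(
        IReadOnlyList<int[]> chunks, int maxConcurrency, Func<int, bool> process)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        var tasks = chunks.Select(chunk => Task.Run(async () =>
        {
            await gate.WaitAsync();   // at most maxConcurrency chunks run at once
            try
            {
                foreach (var record in chunk)
                {
                    if (process(record)) Interlocked.Increment(ref _totalSuccess);
                    else Interlocked.Increment(ref _totalFailed);
                }

                var finished = Interlocked.Increment(ref _completedChunks);

                // Report every maxConcurrency chunks, and always after the last one.
                if (finished % maxConcurrency == 0 || finished == chunks.Count)
                {
                    await ReportAsync();
                }
            }
            finally
            {
                gate.Release();
            }
        })).ToList();

        await Task.WhenAll(tasks);
    }

    private Task ReportAsync()
    {
        // Each counter is read exactly once; the pair is a point-in-time approximation.
        Reports.Enqueue((Volatile.Read(ref _totalSuccess), Volatile.Read(ref _totalFailed)));
        return Task.CompletedTask;
    }
}
```

Two caveats apply to this sketch. The two reads are not taken together, so a mid-run snapshot can be slightly out of step between success and failure. Reports from different workers can also land out of order, so if an older snapshot must never overwrite a newer one, the write itself needs to be serialized.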

Lessons Learned

Thread-safe collections ≠ thread-safe logic
++ is not atomic
volatile ensures visibility, not correctness
Use Interlocked for counters
Use snapshot reads for reporting
Reduce shared mutable state
Concurrency bugs are timing dependent
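
One way to act on "reduce shared mutable state" is to let each worker keep a private subtotal and touch the shared counter only once, when it finishes. The sketch below uses the Parallel.ForEach overload with localInit and localFinally. It is an alternative to the per-record Interlocked.Increment shown earlier, not what the original job did, and LocalSubtotalSketch is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class LocalSubtotalSketch
{
    public static int CountSuccesses(IEnumerable<int> records, Func<int, bool> process)
    {
        int totalSuccess = 0;

        Parallel.ForEach<int, int>(
            records,
            () => 0,                                   // localInit: each worker task starts at zero
            (record, loopState, localSuccess) =>
                process(record) ? localSuccess + 1 : localSuccess,  // no shared write per record
            localSuccess =>
                Interlocked.Add(ref totalSuccess, localSuccess));   // one shared write per worker task

        return totalSuccess;
    }
}
```

The trade-off is that the shared total only moves when a worker task finishes, so this suits final totals better than live progress.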

Takeaway

If you're running parallel batch jobs and tracking totals:

Use atomic counters
Take snapshot reads for reporting
Avoid frequent shared writes

Otherwise, everything may look fine… until it doesn't.

#concurrency #multithreading #csharp #dotnet #performance