Executive Framing
Most push benchmarks fail before the system fails: they optimize for throughput charts and miss failure mechanics. In production, the decisive signal is not how fast the service can publish at steady state, but how it degrades when scheduling contention, queue amplification, and reconnect pressure hit at the same time.
These conclusions are drawn from controlled experimental runs under CPU-capped container deployments with consistent workload generation. The focus here is on behavioral patterns under saturation rather than absolute throughput numbers.
The most costly mistake is optimizing for maximum throughput while ignoring p95 and p99 delivery behavior during throttling and reconnect storms.
Experimental Setup Characteristics
- Constrained CPU limits: deliberate cgroup throttling to force scheduler contention.
- Controlled load phases: both steady-state and burst scenarios exercised under identical control settings.
- Primary observability lens: latency distribution, queue depth evolution, and reconnect behavior under pressure.
- Explicit objective: observe failure mechanics under saturation, not compare protocol feature sets.
Runtime Architecture and Failure Mechanics
In async Python services, push delivery is bounded by event-loop scheduling fairness and per-connection write-backpressure. Under CPU throttling, cooperative tasks lose cadence; heartbeat handlers, flush loops, and reconnect logic contend for loop time and widen tail latency.
Directionally, SSE tended to expose saturation earlier through visible queue pressure and flush jitter, while WebSocket paths were more prone to state-heavy contention once connection management and keepalive work accumulated. The operational result is a different failure shape: SSE usually bends with increasing latency, while WebSocket is more likely to step into abrupt instability if backpressure discipline is weak.
- Connection lifecycle: setup, steady-state writes, slow-consumer detection, and teardown must remain explicit.
- Event-loop behavior: small scheduler slips compound into p99 expansion when large connection sets share one loop.
- Memory model: bounded queues cap resident memory; unbounded buffering converts transient bursts into sustained instability.
These effects are directional and repeatable across stacks, even when absolute numbers differ by runtime or host profile.
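The scheduler-slip effect described above can be observed directly. A minimal stdlib-only probe (function name and sampling parameters are illustrative, not part of the experimental harness) measures how late a cooperative sleep actually wakes up; under CPU throttling this lag is what compounds into p99 expansion:

```python
import asyncio
import time


async def loop_lag_probe(interval: float = 0.05, samples: int = 10) -> dict:
    """Measure event-loop scheduling slip: how late a sleep resumes."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        # Anything beyond `interval` is time the loop spent elsewhere.
        lag_ms = max((time.perf_counter() - start - interval) * 1000, 0.0)
        lags.append(lag_ms)
    lags.sort()
    return {"p50_ms": lags[len(lags) // 2], "max_ms": lags[-1]}


stats = asyncio.run(loop_lag_probe())
```

Run as a background task in a loaded service, the gap between `p50_ms` and `max_ms` is an early proxy for the tail expansion the benchmarks surfaced.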
Resource Cost Analysis: CPU, Memory, Connection State
CPU and memory costs are tightly coupled through backpressure pathways. As CPU availability tightens, drain rate falls. If ingress remains unchanged, queue depth amplifies, memory pressure rises, allocator overhead increases, and tail latency degrades in a feedback loop.
- CPU-bound phase: serialization and network write scheduling dominate.
- Memory-bound phase: buffered messages and per-connection state dominate.
- State overhead: protocols with heavier lifecycle bookkeeping amplify both costs at high connection counts.
In containerized environments, this transition can happen quickly because cgroup limits make backlog growth visible as OOM risk instead of a slow degradation.
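The feedback loop above can be made concrete with a toy per-second backlog model (all rates and limits are illustrative placeholders, not measured values): when throttling cuts the drain rate below ingress, backlog grows linearly until the bound is hit, after which a bounded queue sheds the excess instead of growing toward OOM:

```python
def backlog_trajectory(ingress_per_s: float, drain_per_s: float,
                       seconds: int, max_queue: float):
    """Per-second backlog model: growth when drain falls below ingress."""
    backlog = 0.0
    dropped = 0.0
    history = []
    for _ in range(seconds):
        backlog += ingress_per_s - drain_per_s
        if backlog > max_queue:
            dropped += backlog - max_queue  # Bounded queue sheds the excess.
            backlog = max_queue
        backlog = max(backlog, 0.0)
        history.append(backlog)
    return history, dropped


# Healthy phase: drain matches ingress, backlog stays flat.
steady, steady_drops = backlog_trajectory(1000, 1000, 10, 5000)

# Throttled phase: drain falls to 600 msg/s against 1000 msg/s ingress;
# backlog climbs 400/s until the cap, then drops begin.
throttled, throttle_drops = backlog_trajectory(1000, 600, 20, 5000)
```

The unbounded variant of this model is the OOM path: remove the cap and `backlog` grows without limit for as long as the throttle lasts.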
Behavior Under Scale
Scale failures are usually phased. Systems typically pass through three stages before full collapse: tail-latency expansion while medians still look healthy, queue amplification with rising memory pressure, and finally reconnect-driven instability once clients start cycling connections.
A recurring failure narrative was pods reporting acceptable CPU utilization while p99 delivery latency expanded sharply due to event-loop scheduling contention and queue drain lag. Health checks still passed, but user-visible latency was already outside safe bounds.
Autoscaling helps only if the metrics it consumes make saturation visible. Scaling on CPU alone is insufficient for push systems; latency-tail and queue-pressure signals must participate, or scaling reacts only after degradation is already user-visible.
Operational Risks and When Not to Use This Approach
A benchmarking strategy centered on one-way push behavior is inappropriate when product requirements are truly duplex and low-latency in both directions.
- Do not generalize one-way benchmark conclusions to interactive bidirectional workloads.
- Do not run protocol tests without reconnection and slow-consumer scenarios.
- Do not rely on averages; tail-latency and drop behavior must be primary signals.
- Do not keep queue depth unbounded to preserve nominal throughput during bursts.
Decision Matrix
| Production Context | Primary Risk | Recommended Bias | Operational Reason |
|---|---|---|---|
| One-way event fanout under tight memory limits | Buffer growth and OOM | SSE with bounded queues | Simpler lifecycle and easier memory control |
| Interactive duplex control channel | Round-trip semantics and protocol mismatch | WebSocket | Native bidirectional framing |
| Bursty traffic with strict p99 SLO | Scheduler jitter and backlog cascades | Protocol + explicit shedding policy | Tail protection matters more than peak throughput |
| Kubernetes with aggressive CPU limits | Throttle-induced latency amplification | Scale on latency + queue depth | CPU-only signals miss early degradation |
Monitoring and SLO Implications
Push workloads need saturation-aware observability with explicit tail-latency guardrails.
- Delivery latency p50, p95, p99, segmented by protocol and pod, tracked at the client-visible boundary
- Queue pressure: queue depth per connection, enqueue drops, write timeout rate, slow-consumer counts
- Connection health: active sockets, reconnect rate, disconnect reasons, handshake failure rate
- Resource signals: memory working set, OOM events, CPU throttled time, restart loops
SLO policy should encode degradation behavior as first-class control logic: acceptable drop strategy, max reconnect churn, and escalation triggers when p99 rises with queue amplification.
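A minimal sketch of encoding that policy as data plus a decision function; all thresholds and action names below are illustrative placeholders, not production values:

```python
from dataclasses import dataclass


@dataclass
class DegradationPolicy:
    max_drop_ratio: float = 0.01        # Acceptable enqueue-drop share per window.
    max_reconnects_per_min: int = 120   # Reconnect churn ceiling.
    p99_escalation_ms: float = 500.0    # Escalate when p99 crosses this while queues grow.

    def escalate(self, drop_ratio: float, reconnects_per_min: int,
                 p99_ms: float, queue_growing: bool) -> str:
        if drop_ratio > self.max_drop_ratio:
            return "shed-and-alert"
        if reconnects_per_min > self.max_reconnects_per_min:
            return "throttle-reconnects"
        if p99_ms > self.p99_escalation_ms and queue_growing:
            return "page"
        return "ok"


policy = DegradationPolicy()
action = policy.escalate(drop_ratio=0.0, reconnects_per_min=10,
                         p99_ms=600.0, queue_growing=True)
```

Keeping the policy in one object makes the degradation contract reviewable and testable, rather than scattered across ad hoc alert rules.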
Minimal Async Pattern for Bounded Delivery Under CPU Pressure
This pattern favors bounded memory and controlled shedding instead of unbounded buffering.
```python
import asyncio
import time
from collections import deque

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()


class DeliveryState:
    def __init__(self, max_queue: int = 512):
        self.q: asyncio.Queue = asyncio.Queue(maxsize=max_queue)
        self.dropped = 0
        self.delivery_ms = deque(maxlen=2048)


state = DeliveryState()


async def publish(event: str) -> None:
    if state.q.full():
        _ = state.q.get_nowait()  # Drop oldest to cap memory.
        state.dropped += 1
    state.q.put_nowait((time.perf_counter(), event))


async def stream(request: Request):
    while True:
        if await request.is_disconnected():
            break
        try:
            # Bounded wait so the disconnect check above still runs
            # when the queue is idle.
            ts, event = await asyncio.wait_for(state.q.get(), timeout=5.0)
        except asyncio.TimeoutError:
            continue
        state.delivery_ms.append((time.perf_counter() - ts) * 1000)
        yield f"data: {event}\n\n"  # SSE frame: data line plus blank line.


@app.get("/events")
async def events(request: Request):
    return StreamingResponse(stream(request), media_type="text/event-stream")
```
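The shedding behavior can be exercised without the HTTP layer. A stdlib-only simulation (names and sizes are hypothetical) pairs a burst publisher with a deliberately slow consumer, using the same drop-oldest discipline:

```python
import asyncio
import time


async def demo(n_events: int = 200, max_queue: int = 32):
    q: asyncio.Queue = asyncio.Queue(maxsize=max_queue)
    dropped = 0

    def publish(event: str) -> None:
        nonlocal dropped
        if q.full():
            q.get_nowait()  # Drop oldest to cap memory.
            dropped += 1
        q.put_nowait((time.perf_counter(), event))

    async def slow_consumer(count: int) -> list:
        latencies = []
        for _ in range(count):
            ts, _event = await q.get()
            latencies.append((time.perf_counter() - ts) * 1000)
            await asyncio.sleep(0.001)  # Simulated slow write path.
        return latencies

    for i in range(n_events):
        publish(f"evt-{i}")  # Burst arrives faster than the consumer drains.
    drained = await slow_consumer(q.qsize())
    return dropped, len(drained)


dropped, delivered = asyncio.run(demo())
```

With a burst of 200 events into a 32-slot queue, the oldest 168 are shed and only the newest 32 are delivered, with delivery latency bounded by queue depth rather than burst size.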
Engineering Conclusion
Benchmarking under CPU limits is only useful when it exposes failure mechanics: tail-latency expansion, backpressure collapse, and memory instability under reconnect pressure. Systems that look efficient at median latency can still fail operationally once saturation begins.
Treat protocol choice as part of runtime control strategy. Prefer architectures that make saturation visible, keep buffers bounded, and recover predictably after pressure drops.