Executive Framing
Most push benchmarks fail before the system fails: they optimize for throughput charts and miss failure mechanics. In production, the decisive signal is not how fast the service can publish at steady state, but how it degrades when scheduling contention, queue amplification, and reconnect pressure hit at the same time.
These conclusions are drawn from controlled experimental runs under CPU-capped container deployments with consistent workload generation. The focus here is on behavioral patterns under saturation rather than absolute throughput numbers.
The most costly mistake is optimizing for maximum throughput while ignoring p95 and p99 delivery behavior during throttling and reconnect storms.
Experimental Setup Characteristics
- Constrained CPU limits: deliberate cgroup throttling to force scheduler contention.
- Controlled load phases: both steady-state and burst scenarios exercised under identical control settings.
- Primary observability lens: latency distribution, queue depth evolution, and reconnect behavior under pressure.
- Explicit objective: observe failure mechanics under saturation, not compare protocol feature sets.
Runtime Architecture and Failure Mechanics
In async Python services, push delivery is bounded by event-loop scheduling fairness and per-connection write-backpressure. Under CPU throttling, cooperative tasks lose cadence; heartbeat handlers, flush loops, and reconnect logic contend for loop time and widen tail latency.
Directionally, SSE tended to expose saturation earlier through visible queue pressure and flush jitter, while WebSocket paths were more prone to state-heavy contention once connection management and keepalive work accumulated. The operational result is a different failure shape: SSE usually bends with increasing latency, while WebSocket is more likely to step into abrupt instability if backpressure discipline is weak.
- Connection lifecycle: setup, steady-state writes, slow-consumer detection, and teardown must remain explicit.
- Event-loop behavior: small scheduler slips compound into p99 expansion when large connection sets share one loop.
- Memory model: bounded queues cap resident memory; unbounded buffering converts transient bursts into sustained instability.
These effects are directional and repeatable across stacks, even when absolute numbers differ by runtime or host profile.
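The scheduler-slip effect described above can be observed directly. A minimal stdlib-only probe (function name and sampling parameters are illustrative, not part of the experimental harness) measures how late a cooperative sleep actually wakes up; under CPU throttling this lag is what compounds into p99 expansion:

```python
import asyncio
import time


async def loop_lag_probe(interval: float = 0.05, samples: int = 10) -> dict:
    """Measure event-loop scheduling slip: how late a sleep resumes."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        # Anything beyond `interval` is time the loop spent elsewhere.
        lag_ms = max((time.perf_counter() - start - interval) * 1000, 0.0)
        lags.append(lag_ms)
    lags.sort()
    return {"p50_ms": lags[len(lags) // 2], "max_ms": lags[-1]}


stats = asyncio.run(loop_lag_probe())
```

Run as a background task in a loaded service, the gap between `p50_ms` and `max_ms` is an early proxy for the tail expansion the benchmarks surfaced.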
Resource Cost Analysis: CPU, Memory, Connection State
CPU and memory costs are tightly coupled through backpressure pathways. As CPU availability tightens, drain rate falls. If ingress remains unchanged, queue depth amplifies, memory pressure rises, allocator overhead increases, and tail latency degrades in a feedback loop.
- CPU-bound phase: serialization and network write scheduling dominate.
- Memory-bound phase: buffered messages and per-connection state dominate.
- State overhead: protocols with heavier lifecycle bookkeeping amplify both costs at high connection counts.
In containerized environments, this transition can happen quickly because cgroup limits make backlog growth visible as OOM risk instead of a slow degradation.
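The feedback loop above can be made concrete with a toy per-second backlog model (all rates and limits are illustrative placeholders, not measured values): when throttling cuts the drain rate below ingress, backlog grows linearly until the bound is hit, after which a bounded queue sheds the excess instead of growing toward OOM:

```python
def backlog_trajectory(ingress_per_s: float, drain_per_s: float,
                       seconds: int, max_queue: float):
    """Per-second backlog model: growth when drain falls below ingress."""
    backlog = 0.0
    dropped = 0.0
    history = []
    for _ in range(seconds):
        backlog += ingress_per_s - drain_per_s
        if backlog > max_queue:
            dropped += backlog - max_queue  # Bounded queue sheds the excess.
            backlog = max_queue
        backlog = max(backlog, 0.0)
        history.append(backlog)
    return history, dropped


# Healthy phase: drain matches ingress, backlog stays flat.
steady, steady_drops = backlog_trajectory(1000, 1000, 10, 5000)

# Throttled phase: drain falls to 600 msg/s against 1000 msg/s ingress;
# backlog climbs 400/s until the cap, then drops begin.
throttled, throttle_drops = backlog_trajectory(1000, 600, 20, 5000)
```

The unbounded variant of this model is the OOM path: remove the cap and `backlog` grows without limit for as long as the throttle lasts.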
Behavior Under Scale
Scale failures are usually phased. Systems typically pass through three stages before full collapse: tail-latency expansion while medians still look healthy, queue amplification with rising memory pressure, and finally reconnect-driven instability once clients start cycling connections.
A recurring failure narrative was pods reporting acceptable CPU utilization while p99 delivery latency expanded sharply due to event-loop scheduling contention and queue drain lag. Health checks still passed, but user-visible latency was already outside safe bounds.
Autoscaling helps only if the metrics it consumes make saturation visible. Scaling on CPU alone is insufficient for push systems; latency-tail and queue-pressure signals must participate, or scaling reacts only after degradation is already user-visible.
Operational Risks and When Not to Use This Approach
A benchmarking strategy centered on one-way push behavior is inappropriate when product requirements are truly duplex and low-latency in both directions.
- Do not generalize one-way benchmark conclusions to interactive bidirectional workloads.
- Do not run protocol tests without reconnection and slow-consumer scenarios.
- Do not rely on averages; tail-latency and drop behavior must be primary signals.
- Do not keep queue depth unbounded to preserve nominal throughput during bursts.
Decision Matrix
| Production Context | Primary Risk | Recommended Bias | Operational Reason |
|---|---|---|---|
| One-way event fanout under tight memory limits | Buffer growth and OOM | SSE with bounded queues | Simpler lifecycle and easier memory control |
| Interactive duplex control channel | Round-trip semantics and protocol mismatch | WebSocket | Native bidirectional framing |
| Bursty traffic with strict p99 SLO | Scheduler jitter and backlog cascades | Protocol + explicit shedding policy | Tail protection matters more than peak throughput |
| Kubernetes with aggressive CPU limits | Throttle-induced latency amplification | Scale on latency + queue depth | CPU-only signals miss early degradation |
Monitoring and SLO Implications
Push workloads need saturation-aware observability with explicit tail-latency guardrails.
- Delivery latency p50, p95, p99, segmented by protocol and pod, tracked at the client-visible boundary
- Queue pressure: queue depth per connection, enqueue drops, write timeout rate, slow-consumer counts
- Connection health: active sockets, reconnect rate, disconnect reasons, handshake failure rate
- Resource signals: memory working set, OOM events, CPU throttled time, restart loops
SLO policy should encode degradation behavior as first-class control logic: acceptable drop strategy, max reconnect churn, and escalation triggers when p99 rises with queue amplification.
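A minimal sketch of encoding that policy as data plus a decision function; all thresholds and action names below are illustrative placeholders, not production values:

```python
from dataclasses import dataclass


@dataclass
class DegradationPolicy:
    max_drop_ratio: float = 0.01        # Acceptable enqueue-drop share per window.
    max_reconnects_per_min: int = 120   # Reconnect churn ceiling.
    p99_escalation_ms: float = 500.0    # Escalate when p99 crosses this while queues grow.

    def escalate(self, drop_ratio: float, reconnects_per_min: int,
                 p99_ms: float, queue_growing: bool) -> str:
        if drop_ratio > self.max_drop_ratio:
            return "shed-and-alert"
        if reconnects_per_min > self.max_reconnects_per_min:
            return "throttle-reconnects"
        if p99_ms > self.p99_escalation_ms and queue_growing:
            return "page"
        return "ok"


policy = DegradationPolicy()
action = policy.escalate(drop_ratio=0.0, reconnects_per_min=10,
                         p99_ms=600.0, queue_growing=True)
```

Keeping the policy in one object makes the degradation contract reviewable and testable, rather than scattered across ad hoc alert rules.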
Minimal Async Pattern for Bounded Delivery Under CPU Pressure
This pattern favors bounded memory and controlled shedding instead of unbounded buffering.
```python
import asyncio
import time
from collections import deque

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()


class DeliveryState:
    def __init__(self, max_queue: int = 512):
        self.q: asyncio.Queue = asyncio.Queue(maxsize=max_queue)
        self.dropped = 0
        self.delivery_ms = deque(maxlen=2048)


state = DeliveryState()


async def publish(event: str) -> None:
    if state.q.full():
        _ = state.q.get_nowait()  # Drop oldest to cap memory.
        state.dropped += 1
    state.q.put_nowait((time.perf_counter(), event))


async def stream(request: Request):
    while True:
        if await request.is_disconnected():
            break
        try:
            # Bounded wait so the disconnect check above still runs
            # when the queue is idle.
            ts, event = await asyncio.wait_for(state.q.get(), timeout=5.0)
        except asyncio.TimeoutError:
            continue
        state.delivery_ms.append((time.perf_counter() - ts) * 1000)
        yield f"data: {event}\n\n"  # SSE frame: data line plus blank line.


@app.get("/events")
async def events(request: Request):
    return StreamingResponse(stream(request), media_type="text/event-stream")
```
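The shedding behavior can be exercised without the HTTP layer. A stdlib-only simulation (names and sizes are hypothetical) pairs a burst publisher with a deliberately slow consumer, using the same drop-oldest discipline:

```python
import asyncio
import time


async def demo(n_events: int = 200, max_queue: int = 32):
    q: asyncio.Queue = asyncio.Queue(maxsize=max_queue)
    dropped = 0

    def publish(event: str) -> None:
        nonlocal dropped
        if q.full():
            q.get_nowait()  # Drop oldest to cap memory.
            dropped += 1
        q.put_nowait((time.perf_counter(), event))

    async def slow_consumer(count: int) -> list:
        latencies = []
        for _ in range(count):
            ts, _event = await q.get()
            latencies.append((time.perf_counter() - ts) * 1000)
            await asyncio.sleep(0.001)  # Simulated slow write path.
        return latencies

    for i in range(n_events):
        publish(f"evt-{i}")  # Burst arrives faster than the consumer drains.
    drained = await slow_consumer(q.qsize())
    return dropped, len(drained)


dropped, delivered = asyncio.run(demo())
```

With a burst of 200 events into a 32-slot queue, the oldest 168 are shed and only the newest 32 are delivered, with delivery latency bounded by queue depth rather than burst size.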
Engineering Conclusion
Benchmarking under CPU limits is only useful when it exposes failure mechanics: tail-latency expansion, backpressure collapse, and memory instability under reconnect pressure. Systems that look efficient at median latency can still fail operationally once saturation begins.
Treat protocol choice as part of runtime control strategy. Prefer architectures that make saturation visible, keep buffers bounded, and recover predictably after pressure drops.