Async Batch Processing for High-Volume Logs

When a station group pushes a full broadcast day of traffic — millions of spot placements, avails, billing records, and make-good adjustments — through a synchronous, one-record-at-a-time ingestion loop, the pipeline stalls on network latency, exhausts memory on eager loads, and misses the scheduling-submission window before playout. Non-blocking batch processing is the phase of the Avion & Avstar Ingestion Pipelines that turns that fragile choke point into a deterministic, auditable component: it decouples I/O-bound file reads and API dispatch from CPU-bound validation, applies explicit concurrency limits, and guarantees that a retried or replayed push never double-books a break. This guide is written for the Python automation builders who own the pipeline and the traffic managers who need to trust its throughput and audit trail.

Concept & Data Model

Async batch processing sits between the parsing and validation stages upstream and the scheduling-submission handoff downstream. It never mutates the meaning of a record; it controls how fast and how safely validated records reach the automation platform. Three entities define the domain.

Record — a single normalized traffic line already resolved to the canonical spot schema: one airing of one creative, carrying spot_id, timezone-aware air_datetime, duration_frames, break_id, isci, and a normalized billing_code. Records enter the processor as an async stream, never as a fully materialized list.
Batch — a bounded chunk of records (typically 2,000–10,000) assembled for a single API submission. The batch is the unit of concurrency, idempotency, and retry. It carries a deterministic batch_key derived from the source_hash of its members so the same batch can be replayed without creating duplicate placements.
Dead-letter entry — a batch (or a single record) that failed validation, exhausted its retries, or arrived while the circuit breaker was open. It is serialized with full context — original payload, error path, retry count, correlation ID, timestamp — and routed to a durable dead-letter queue (DLQ) rather than silently dropped.

The batch envelope submitted to the automation platform is a strict contract. The fields below are the minimum the processor must attach before dispatch; the ingestion boundary guarantees the per-record types via Pydantic traffic-data validators before a record is eligible to join a batch.

Field	Type	Constraint	Role in dispatch
`batch_key`	`str`	SHA-256 hex, deterministic	Idempotency key; deduplicates a replayed batch
`records`	`list[dict]`	1 ≤ len ≤ `batch_size`	The validated payloads to schedule
`record_count`	`int`	`> 0`, equals `len(records)`	Cross-check against server-side accept count
`broadcast_day`	`date`	single day per batch	Prevents cross-day collisions on retry
`retry_count`	`int`	`>= 0`	Drives backoff and DLQ routing
`submitted_at`	`datetime`	UTC, timezone-aware	Audit receipt timestamp
`source_digest`	`str`	SHA-256 hex over member hashes	Lineage back to the originating export lines
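
As a rough illustration of the contract in the table above, the envelope can be modeled as a frozen dataclass that checks each constraint on construction. This is a standard-library sketch only: the class name `BatchEnvelope`, the `batch_size` default, and the regular expression are assumptions made for the example, not part of the platform's API.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime, timedelta, timezone

_SHA256_HEX = re.compile(r"^[0-9a-f]{64}$")


@dataclass(frozen=True)
class BatchEnvelope:
    """Hypothetical stdlib model of the dispatch envelope fields."""

    batch_key: str
    records: list[dict]
    record_count: int
    broadcast_day: date
    source_digest: str
    submitted_at: datetime
    retry_count: int = 0
    batch_size: int = 5000  # assumed ceiling, mirroring the configured batch size

    def __post_init__(self) -> None:
        for name in ("batch_key", "source_digest"):
            if not _SHA256_HEX.match(getattr(self, name)):
                raise ValueError(f"{name} must be 64 lowercase hex characters")
        if not 1 <= len(self.records) <= self.batch_size:
            raise ValueError("records must hold between 1 and batch_size items")
        if self.record_count != len(self.records):
            raise ValueError("record_count must equal len(records)")
        if self.retry_count < 0:
            raise ValueError("retry_count must be >= 0")
        # A naive datetime has utcoffset() None, which also fails this check.
        if self.submitted_at.utcoffset() != timedelta(0):
            raise ValueError("submitted_at must be timezone-aware UTC")
```

Failing fast at construction means a malformed envelope never reaches the network call.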

A producer-consumer topology backed by a bounded queue is what makes this safe under load. The file reader (producer) streams records into an asyncio.Queue; a fixed pool of worker coroutines (consumers), gated by a semaphore, drains the queue and dispatches batches. When consumers fall behind, the bounded queue fills and naturally applies backpressure to the producer instead of letting memory grow without limit.

Figure — Producer-consumer topology where the file reader feeds a bounded queue that applies backpressure, a semaphore-limited worker pool drains it, and validated batches are dispatched to the Avstar API.
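
The backpressure behavior can be seen in isolation with a small sketch. The function names and the 1 ms sleep standing in for an API call are assumptions for the demo; the point is only that `queue.put` suspends the producer whenever the bounded queue is full, so the queue depth never exceeds `maxsize`.

```python
import asyncio


async def producer(queue: asyncio.Queue, n_batches: int, high_water: list[int]) -> None:
    for i in range(n_batches):
        await queue.put(i)                      # suspends while the queue is full
        high_water[0] = max(high_water[0], queue.qsize())
    await queue.put(None)                       # sentinel: no more batches


async def consumer(queue: asyncio.Queue, seen: list[int]) -> None:
    while (item := await queue.get()) is not None:
        await asyncio.sleep(0.001)              # stand-in for a slow API dispatch
        seen.append(item)


async def run_demo(n_batches: int = 50, maxsize: int = 4) -> tuple[int, int]:
    """Return (deepest queue size observed, number of batches consumed)."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    high_water: list[int] = [0]
    seen: list[int] = []
    await asyncio.gather(producer(queue, n_batches, high_water), consumer(queue, seen))
    return high_water[0], len(seen)
```

Calling `asyncio.run(run_demo())` processes every batch while the observed queue depth stays at or below `maxsize`, however slow the consumer is.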

Implementation Approach

The design turns on four decisions, each a trade-off between throughput and safety.

Bounded queue vs. unbounded gather. The naive approach collects every batch coroutine and awaits one large asyncio.gather. On a multi-station day that schedules thousands of coroutines and holds every batch in memory at once. A bounded asyncio.Queue caps in-flight work: producers block when the queue is full, so memory stays bounded by roughly queue_maxsize × batch_size queued records regardless of file size. For multi-station aggregation the same backpressure is better provided by a durable message broker, as described in the ingestion overview: the broker owns the buffer and lets validation and submission workers scale independently.

Semaphore-bounded concurrency, not “as many as possible.” Broadcast automation APIs publish tight connection-pool and rate limits. An asyncio.Semaphore sized to match the target’s concurrency ceiling prevents connection exhaustion and 429 storms. Session reuse, timeout tuning, and multipart upload strategy for the underlying HTTP client are covered in the child guide, Optimizing Asyncio for Traffic File Uploads; credential lifecycle and limit-aware dispatch live in Avstar API Authentication and Rate Limits.
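
A minimal sketch of the semaphore gate, assuming a hypothetical `bounded_dispatch` helper and a short sleep in place of the real HTTP round trip. It counts how many fake submissions are in flight at once, which shows that the peak never exceeds the configured ceiling.

```python
import asyncio


async def bounded_dispatch(n_batches: int, ceiling: int) -> int:
    """Run n_batches fake submissions and return the peak number in flight."""
    semaphore = asyncio.Semaphore(ceiling)
    in_flight = 0
    peak = 0

    async def submit(_: int) -> None:
        nonlocal in_flight, peak
        async with semaphore:                   # at most `ceiling` holders at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)          # stand-in for the HTTP round trip
            in_flight -= 1

    # gather is acceptable here only because the demo is tiny; the pipeline
    # itself feeds workers from a bounded queue instead.
    await asyncio.gather(*(submit(i) for i in range(n_batches)))
    return peak
```

Sizing `ceiling` to the target's published connection limit is the design choice that matters; the value itself has to come from the platform's documentation.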

Streaming (lazy) evaluation over eager loading. Records are yielded from disk or broker one at a time, accumulated only until the batch threshold is met, dispatched, and released. When legacy cleaning forces pandas into the path, chunked iteration (pd.read_csv(chunksize=...)) with explicit dtype downcasting — or a PyArrow-backed frame — keeps the heap bounded. Raw tokenization and header handling belong upstream in parsing Avion export formats, not inside the dispatch loop.
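
The accumulate, dispatch, release cycle can be sketched as an async generator. The names `batched_stream` and `fake_records` are illustrative, and the integer stream stands in for parsed records; only one partial batch is held at any time.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def batched_stream(source: AsyncIterator[T], batch_size: int) -> AsyncIterator[list[T]]:
    """Yield lists of up to batch_size items without materializing the source."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: list[T] = []
    async for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []                          # drop the reference so the old batch can be freed
    if batch:
        yield batch                             # final partial batch


async def fake_records(n: int) -> AsyncIterator[int]:
    """Hypothetical stand-in for a file or broker reader."""
    for i in range(n):
        yield i


async def collect_sizes(n: int, batch_size: int) -> list[int]:
    """Consume the stream and report the size of each batch that was yielded."""
    return [len(b) async for b in batched_stream(fake_records(n), batch_size)]
```

For ten records and a batch size of four, `asyncio.run(collect_sizes(10, 4))` yields batches of 4, 4, and 2.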

Idempotency as the default, not an add-on. Every batch carries a deterministic batch_key, sent as the Idempotency-Key header. Provided the platform deduplicates on that key, a retried push, a replayed broadcast day, or a duplicated worker cannot create a second billing record. That safeguard protects revenue attribution and, by extension, billing-code normalization downstream.
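
To make the replay property concrete, the sketch below pairs an order-independent key with a toy server that honors it. `FakeIngestServer` is purely hypothetical and says nothing about how the real platform stores keys; it only illustrates why the same members must always hash to the same key.

```python
import hashlib


def batch_key(source_hashes: list[str]) -> str:
    """Order-independent SHA-256 over the members' source hashes."""
    seed = "|".join(sorted(source_hashes))
    return hashlib.sha256(seed.encode("utf-8")).hexdigest()


class FakeIngestServer:
    """Hypothetical server that honors an idempotency key; not the Avstar API."""

    def __init__(self) -> None:
        self.receipts: dict[str, int] = {}
        self.placements = 0

    def submit(self, key: str, record_count: int) -> int:
        if key in self.receipts:                # replay: return the stored result
            return self.receipts[key]
        self.placements += record_count         # first delivery books the spots
        self.receipts[key] = record_count
        return record_count
```

Because the hashes are sorted before digesting, a replay that reads the same lines in a different order still produces the same key, while adding or removing a member produces a different one.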

Production Python Implementation

The following module is deployable as-is. It streams records, assembles bounded batches, validates each record with Pydantic at the boundary, gates concurrency with a semaphore, retries with exponential backoff and jitter, trips a circuit breaker on sustained failure, and routes exhausted batches to a DLQ. Every log line follows the traffic-ops pattern timestamp | level | module | message and carries the spot_id or batch_key so an incident traces back to a single airing.

Figure — Batch lifecycle state machine: a validated record is queued and dispatched under the semaphore, committing to a receipt on success or looping through retry-with-backoff until max_retries sends it to the dead-letter queue. The circuit breaker (lower panel) cycles closed → open → half-open → closed and, while open, short-circuits dispatch straight to the DLQ.

python

import asyncio
import hashlib
import json
import logging
import random
import time
from datetime import date, datetime, timezone
from pathlib import Path
from typing import AsyncIterator

import aiohttp
from pydantic import AwareDatetime, BaseModel, Field, ValidationError

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("avstar.batch")


class TrafficRecord(BaseModel):
    """Canonical spot record accepted into a dispatch batch."""

    spot_id: str = Field(min_length=1)
    air_datetime: AwareDatetime                  # rejects naive datetimes; normalize to UTC upstream
    duration_frames: int = Field(gt=0)           # zero-duration avails are rejected
    break_id: str = Field(min_length=1)
    isci: str = Field(min_length=8, max_length=20)
    billing_code: str = Field(min_length=4)
    source_hash: str = Field(min_length=64, max_length=64)  # SHA-256 hex


class BatchConfig(BaseModel):
    max_concurrency: int = 15                    # match the Avstar connection ceiling
    batch_size: int = 5000
    queue_maxsize: int = 40                       # bounded => backpressure
    max_retries: int = 4
    breaker_threshold: int = 8                    # consecutive failures before OPEN
    breaker_cooldown: float = 30.0
    timeout_total: float = 30.0
    timeout_connect: float = 5.0
    endpoint: str = "https://api.traffic-system.example/v2/ingest"


class CircuitBreaker:
    """Halts dispatch to a degraded endpoint instead of hammering it."""

    def __init__(self, threshold: int, cooldown: float) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0
        self.state = "closed"

    def allow(self) -> bool:
        if self.state == "open" and time.monotonic() - self.opened_at >= self.cooldown:
            self.state = "half-open"
            logger.info("circuit breaker HALF-OPEN | admitting probe batch")
        return self.state != "open"

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
            logger.warning("circuit breaker OPEN | failures=%d", self.failures)


class AsyncBatchProcessor:
    def __init__(self, config: BatchConfig, dlq_path: Path = Path("dlq/")) -> None:
        self.cfg = config
        self.dlq_path = dlq_path
        self.dlq_path.mkdir(parents=True, exist_ok=True)
        self.semaphore = asyncio.Semaphore(config.max_concurrency)
        self.breaker = CircuitBreaker(config.breaker_threshold, config.breaker_cooldown)
        self.session: aiohttp.ClientSession | None = None

    async def __aenter__(self) -> "AsyncBatchProcessor":
        timeout = aiohttp.ClientTimeout(
            total=self.cfg.timeout_total, connect=self.cfg.timeout_connect
        )
        self.session = aiohttp.ClientSession(timeout=timeout)
        return self

    async def __aexit__(self, exc_type: object, exc: object, tb: object) -> None:
        if self.session:
            await self.session.close()

    @staticmethod
    def _batch_key(records: list[TrafficRecord]) -> str:
        """Deterministic key so a replayed batch cannot double-book a break."""
        seed = "|".join(sorted(r.source_hash for r in records))
        return hashlib.sha256(seed.encode("utf-8")).hexdigest()

    async def process_stream(self, records: AsyncIterator[dict[str, object]]) -> None:
        """Producer + consumers with a bounded queue for backpressure."""
        queue: asyncio.Queue[list[TrafficRecord] | None] = asyncio.Queue(
            maxsize=self.cfg.queue_maxsize
        )
        workers = [
            asyncio.create_task(self._worker(queue))
            for _ in range(self.cfg.max_concurrency)
        ]
        await self._produce(records, queue)
        for _ in workers:
            await queue.put(None)             # poison pill per worker
        await asyncio.gather(*workers)

    async def _produce(
        self,
        records: AsyncIterator[dict[str, object]],
        queue: asyncio.Queue[list[TrafficRecord] | None],
    ) -> None:
        batch: list[TrafficRecord] = []
        async for raw in records:
            try:
                batch.append(TrafficRecord.model_validate(raw))
            except ValidationError as err:
                # Reject at the boundary; never let a malformed record enter a batch.
                await self._to_dlq({"raw": raw}, reason=str(err), spot_id=str(raw.get("spot_id")))
                continue
            if len(batch) >= self.cfg.batch_size:
                await queue.put(batch)        # blocks when the queue is full -> backpressure
                batch = []
        if batch:
            await queue.put(batch)

    async def _worker(self, queue: asyncio.Queue[list[TrafficRecord] | None]) -> None:
        while True:
            batch = await queue.get()
            if batch is None:
                queue.task_done()
                return
            try:
                await self._dispatch(batch)
            finally:
                queue.task_done()

    async def _dispatch(self, batch: list[TrafficRecord]) -> None:
        key = self._batch_key(batch)
        for attempt in range(1, self.cfg.max_retries + 1):
            if not self.breaker.allow():
                await self._to_dlq(self._envelope(batch, key, attempt),
                                   reason="circuit_breaker_open", spot_id=batch[0].spot_id)
                return
            try:
                async with self.semaphore:    # hold a slot only for the HTTP call
                    await self._submit(self._envelope(batch, key, attempt))
            except (aiohttp.ClientError, asyncio.TimeoutError) as err:
                self.breaker.record_failure()
                logger.warning("attempt failed batch_key=%s attempt=%d err=%s",
                               key[:12], attempt, err)
                if attempt < self.cfg.max_retries:
                    # Back off outside the semaphore so the slot is free for other batches.
                    backoff = min(2 ** attempt, 30) + random.uniform(0, 1)  # jitter
                    logger.info("retry batch_key=%s backoff=%.1fs", key[:12], backoff)
                    await asyncio.sleep(backoff)
            else:
                self.breaker.record_success()
                logger.info("dispatched batch_key=%s records=%d spot_id=%s",
                            key[:12], len(batch), batch[0].spot_id)
                return
        await self._to_dlq(self._envelope(batch, key, self.cfg.max_retries),
                           reason="max_retries_exhausted", spot_id=batch[0].spot_id)

    def _envelope(self, batch: list[TrafficRecord], key: str, attempt: int) -> dict[str, object]:
        return {
            "batch_key": key,
            # Assumption: batch_key is already the SHA-256 over member source hashes,
            # so it doubles as the lineage digest required by the envelope contract.
            "source_digest": key,
            "record_count": len(batch),
            "broadcast_day": batch[0].air_datetime.astimezone(timezone.utc).date().isoformat(),
            "retry_count": attempt - 1,
            "submitted_at": datetime.now(timezone.utc).isoformat(),
            "records": [r.model_dump(mode="json") for r in batch],
        }

    async def _submit(self, envelope: dict[str, object]) -> None:
        if self.session is None:
            raise RuntimeError("HTTP session not initialized")
        headers = {"Idempotency-Key": str(envelope["batch_key"])}
        async with self.session.post(self.cfg.endpoint, json=envelope, headers=headers) as resp:
            resp.raise_for_status()

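    async def _to_dlq(self, payload: dict[str, object], reason: str, spot_id: str) -> None:
        entry = {"reason": reason, "quarantined_at": datetime.now(timezone.utc).isoformat(), **payload}
        # spot_id may come from a rejected record, so keep the filename filesystem-safe.
        safe_id = "".join(c if c.isalnum() or c in "-_" else "_" for c in spot_id)
        target = self.dlq_path / f"{safe_id}-{time.time_ns()}.json"
        body = json.dumps(entry, indent=2, default=str)
        await asyncio.to_thread(target.write_text, body)  # keep file I/O off the event loop
        logger.error("dead-letter spot_id=%s reason=%s file=%s", spot_id, reason, target)
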
The processor is intentionally small and typed end to end: Pydantic rejects malformed records before they can join a batch, the semaphore bounds outbound connections, the circuit breaker protects a degraded endpoint, and every terminal failure lands in the DLQ with enough context to replay it. Wrap it in an async with AsyncBatchProcessor(BatchConfig()) as proc: block and feed it an async record iterator sourced from the parsing stage.
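
One possible shape for that async record iterator is sketched below. It assumes the parsing stage writes one JSON object per line (JSON Lines), which the source does not specify; `iter_jsonl` and `_read_chunk` are hypothetical names. Blocking file reads are pushed to a worker thread so the event loop stays free.

```python
import asyncio
import json
from pathlib import Path
from typing import AsyncIterator


def _read_chunk(handle, read_lines: int) -> list[str]:
    """Blocking helper: read up to read_lines lines from an open text file."""
    lines: list[str] = []
    for _ in range(read_lines):
        line = handle.readline()
        if not line:                            # empty string means end of file
            break
        lines.append(line)
    return lines


async def iter_jsonl(path: Path, read_lines: int = 1000) -> AsyncIterator[dict[str, object]]:
    """Yield one dict per non-blank line, reading chunks in a worker thread."""
    handle = await asyncio.to_thread(path.open, "r", encoding="utf-8")
    try:
        while True:
            chunk = await asyncio.to_thread(_read_chunk, handle, read_lines)
            if not chunk:
                return
            for line in chunk:
                if line.strip():                # skip blank lines
                    yield json.loads(line)
    finally:
        await asyncio.to_thread(handle.close)
```

Inside the async with block this could be wired up as `await proc.process_stream(iter_jsonl(Path("day.jsonl")))`, where the file name is a placeholder.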

Validation & Edge Cases

Batch boundaries are where broadcast-specific correctness quietly breaks. Handle these before they reach playout.

Timezone offsets and DST. air_datetime must be timezone-aware and normalized to UTC upstream; the processor rejects naive datetimes so a batch cannot mix a pre- and post-DST local time and silently reorder a break. The broadcast_day on the envelope is derived in UTC from the first record of the batch, so records must be grouped by day before batching to keep one day per batch.
Sports overruns and late revisions. A live overrun produces revised placements that arrive after the initial day was batched. Never overwrite a submitted batch; emit the revision as an appended batch with its own batch_key, and let the automation platform reconcile by idempotency key. Full-log replacement on a revision is how stations lose as-run integrity.
Preemption tiers and make-good. Preempted spots must not be dropped from a batch — they carry contractual lineage. Flag the displaced record and hand it to make-good routing for preemptions rather than letting it vanish between chunks.
Competitive separation across chunks. Separation is an ordering constraint evaluated over a full break. Splitting a break across two batches can hide a violation, so chunk on break boundaries — never mid-break. Avail and break modeling follows avails mapping strategies for linear TV.
Zero-duration and degenerate avails. duration_frames is constrained > 0; a zero- or negative-duration record is rejected to the DLQ instead of overrunning or collapsing a break.
Partial batch failure. If the endpoint accepts N−1 of N records, treat the whole batch as failed and retry under the same idempotency key; a correctly idempotent server discards the already-accepted members. Never re-submit a manually trimmed batch — that changes the batch_key and breaks lineage.
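
The rule to chunk on break boundaries can be sketched as follows. This is an illustration under two stated assumptions: records arrive ordered so that each break's spots are contiguous, and a break larger than `batch_size` is emitted alone as an oversized batch rather than split. `chunk_on_breaks` is a hypothetical helper, not part of the module above.

```python
from itertools import groupby
from typing import Iterable, Iterator


def chunk_on_breaks(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Group records into batches without splitting any break_id across two batches."""
    batch: list[dict] = []
    for _, spots in groupby(records, key=lambda r: r["break_id"]):
        group = list(spots)
        if batch and len(batch) + len(group) > batch_size:
            yield batch                         # flush before the break would be split
            batch = []
        batch.extend(group)
    if batch:
        yield batch
```

With breaks of 3, 2, and 4 spots and a batch size of 5, the first two breaks share a batch and the third starts a new one, so separation checks always see a whole break.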

Integration Points

This stage is one link in a linear chain, and its contracts on both sides are explicit.

Upstream (ingestion). The record iterator is produced by parsing Avion export formats and gated by the Pydantic validators; the processor re-validates at the boundary as a defensive check rather than trusting the caller. Where a message broker sits between stages, the processor consumes a topic instead of a file and the queue’s backpressure is replaced by broker backpressure.

Downstream (scheduling submission). The batch envelope is the API contract with the automation platform. A minimal accepted response confirms the idempotency key and the server-side accept count so the processor can cross-check record_count:

json

{
  "batch_key": "9f2c1a…",
  "accepted": 5000,
  "rejected": 0,
  "receipt_id": "avstar-2026-07-03-000412",
  "status": "committed"
}
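
A cross-check along these lines might look like the sketch below. It assumes the receipt echoes the full batch_key and uses the field names from the sample response; `check_receipt` is a hypothetical helper and the list of checks is not exhaustive.

```python
def check_receipt(envelope: dict, receipt: dict) -> list[str]:
    """Return a list of mismatches between a sent envelope and its receipt."""
    problems: list[str] = []
    if receipt.get("batch_key") != envelope["batch_key"]:
        problems.append("batch_key mismatch")
    if receipt.get("status") != "committed":
        problems.append(f"unexpected status {receipt.get('status')!r}")
    accepted = receipt.get("accepted", 0)
    rejected = receipt.get("rejected", 0)
    if accepted != envelope["record_count"]:
        problems.append(f"accepted {accepted} != record_count {envelope['record_count']}")
    if rejected:
        problems.append(f"{rejected} records rejected")
    return problems
```

An empty list means the receipt agrees with what was sent; anything else is a reason to treat the batch as failed and retry under the same key.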

Credential rotation, token caching, and 429 handling on this leg are owned by Avstar API Authentication and Rate Limits; session-timeout recovery under sustained load is detailed in handling Avstar session timeouts in Python. When any direct database read is needed for reconciliation instead of the API, it must pass through the controls in security boundaries for traffic database access.

Compliance & Audit Considerations

Batching is where much of the audit evidence is created, so compliance is a first-class requirement, not a downstream concern.

FCC political file integrity. Political and issue-advertising spots must reach the public inspection file with advertiser, purchaser, rate, and schedule intact. The processor must never normalize those attributes away during batching; billing handling for political inventory follows standardizing billing codes across traffic systems.
Immutable, hash-chained receipts. Each batch’s batch_key, source_digest, submission timestamp, and the server receipt_id are appended to an append-only ledger. Chaining each receipt’s hash to the prior one lets an operator prove months later that the log that aired matched the log that was booked.
Idempotency for billing reconciliation. The deterministic batch_key guarantees a retried push cannot create a duplicate billing record — the guarantee that makes as-run reconciliation and revenue recognition trustworthy.
SOC 2 / structured observability. Every batch emits structured timestamp | level | module | message logs carrying spot_id and batch_key into centralized aggregation, supporting the traceability that SOC 2 and ISO 27001 audits require. DLQ entries are retained for the financial audit window, then archived to cold storage.
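
The hash-chained ledger can be illustrated with a few lines of standard-library code. This is a sketch under assumptions: the ledger is an in-memory list rather than durable storage, `GENESIS` is an arbitrary anchor value, and the entry layout is invented for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed anchor value for the first entry


def _digest(prev_hash: str, receipt: dict) -> str:
    """Hash the previous entry's hash together with a canonical receipt body."""
    body = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_receipt(ledger: list[dict], receipt: dict) -> dict:
    """Append a receipt whose hash covers its content and the prior entry's hash."""
    prev_hash = ledger[-1]["entry_hash"] if ledger else GENESIS
    entry = {
        "prev_hash": prev_hash,
        "receipt": receipt,
        "entry_hash": _digest(prev_hash, receipt),
    }
    ledger.append(entry)
    return entry


def verify_chain(ledger: list[dict]) -> bool:
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        expected = _digest(prev_hash, entry["receipt"])
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
            return False
        prev_hash = entry["entry_hash"]
    return True
```

Because each hash depends on every earlier entry, altering one historical receipt invalidates all entries after it, which is what makes the ledger useful as audit evidence.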

Troubleshooting & Common Errors

Error pattern	Root cause	Remediation
Unbounded memory growth / OOM	Eager load of the full day or an unbounded `gather` of every batch coroutine	Use the bounded `asyncio.Queue`; stream records lazily and cap `queue_maxsize`
Event loop stalls, throughput collapses	A blocking, synchronous call (sync DB driver, `pd.read_csv` on the loop) inside a coroutine	Move blocking work to `run_in_executor` or a thread pool; keep the loop I/O-only
`429 Too Many Requests` storms	Concurrency exceeds the Avstar connection ceiling; all workers retry in lockstep	Size the `Semaphore` to the ceiling; add exponential backoff with jitter (already in `_dispatch`)
Duplicate billing records after a retry	Batch re-submitted without a stable idempotency key	Always send the deterministic `batch_key` / `Idempotency-Key` header; never trim a batch before retry
Circuit breaker stuck OPEN, DLQ filling	Endpoint genuinely degraded, or `breaker_threshold` set too low for normal jitter	Inspect DLQ context and endpoint health; tune `breaker_threshold` / `breaker_cooldown`, then replay the DLQ once the probe batch succeeds
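
For the event-loop stall row, a minimal sketch of offloading a blocking call is shown below. It uses `asyncio.to_thread`, a convenience wrapper that runs the function in the loop's default thread pool; `blocking_parse` and the heartbeat task are stand-ins invented for the demo.

```python
import asyncio
import time


def blocking_parse(seconds: float) -> str:
    """Stand-in for a synchronous call such as a sync DB driver or pd.read_csv."""
    time.sleep(seconds)
    return "parsed"


async def heartbeat(ticks: list[float], interval: float, count: int) -> None:
    """Record loop wake-ups; long gaps appear if something blocks the loop."""
    for _ in range(count):
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def run_offloaded() -> tuple[str, int]:
    """Run the blocking call in a thread while the heartbeat keeps ticking."""
    ticks: list[float] = []
    beat = asyncio.create_task(heartbeat(ticks, 0.01, 5))
    result = await asyncio.to_thread(blocking_parse, 0.05)  # loop stays free
    await beat
    return result, len(ticks)
```

Calling `blocking_parse` directly inside the coroutine would freeze the heartbeat for the full duration of the call, which is the stall pattern described in the table.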

Every terminal failure produces a structured DLQ entry (original payload, reason, quarantine timestamp, spot_id) so an operator can diagnose and replay without reconstructing state from logs alone.

Related

Optimizing Asyncio for Traffic File Uploads — production aiohttp session reuse, timeout tuning, and multipart upload strategy for the dispatch leg.
Schema Validation with Pydantic for Traffic Data — the boundary validators that guarantee record types before a record can join a batch.
Parsing Avion Export Formats — tokenization and header handling that produce the async record stream this stage consumes.
Avstar API Authentication and Rate Limits — credential lifecycle, token caching, and throttling for the scheduling-submission handoff.
Avion & Avstar Ingestion Pipelines — the parent workflow this batching stage plugs into, from extraction through archival.
