Step-by-Step Avstar CSV to JSON Conversion

This guide walks through the exact operational task of turning a raw Avstar CSV export into a strictly typed, API-ready JSON payload — encoding-safe, schema-validated, and rate-limit-aware — without a traffic desk touching a spreadsheet. It is the extraction-and-serialization step inside Parsing Avion Export Formats, which itself sits within the broader Avion & Avstar Ingestion Pipelines architecture. Getting this conversion deterministic and auditable matters because every downstream airing, billing reconciliation, and as-run report inherits whatever the converter produces: a silently truncated cp1252 read or a misaligned duration here becomes a missed spot or a revenue gap later. The workflow below is written for broadcast traffic managers who need to trust the output, and for the Python engineers who build and operate the converter under legacy-format constraints.

The conversion resolves three recurring realities in one pass: inconsistent legacy CSV encodings, strict broadcast compliance rules (duration alignment, timecode format, billing integrity), and memory limits when a daily run-of-station log runs to multiple gigabytes. Decoupling the work into streaming, validation, and serialization stages keeps a malformed row from ever blocking a valid one, and hands every failure a durable audit record.

Operational Architecture & Data Flow

Before writing conversion logic, map the transformation onto your ingestion topology. Avstar exports are comma-delimited and carry spot IDs, advertiser metadata, airtime codes, duration, clearance flags, and billing rates. Piping those columns straight into a downstream API without normalization causes schema drift, silent truncation, and rate-limit violations. The converter therefore isolates four discrete stages so each class of failure routes to a different queue:

Stream-based CSV ingestion with explicit encoding fallbacks and constant-memory chunking.
Schema validation that enforces broadcast rules through a Pydantic validator for traffic data.
Quarantine and retry logic that isolates bad rows as JSON Lines with a full audit trail.
Async batch serialization and dispatch aligned with Avstar API authentication and rate limits.

Figure — Four-stage CSV-to-JSON conversion: streaming the CSV with encoding fallbacks, Pydantic validation, then async batching and API dispatch for valid records while invalid rows are quarantined as JSON Lines.

Prerequisites

Confirm the following before running the converter in a live traffic environment:

Python 3.12+ (the code uses datetime.UTC, added in 3.11, and built-in generic annotations such as tuple[str, ...] and dict[str, str]).
pydantic==2.9.* — v2 field constraints, field_validator, and model_dump(mode="json").
httpx==0.27.* — async HTTP client with connection pooling.
A writable quarantine directory for JSON Lines audit output (e.g. quarantine/).
A downstream scheduler API endpoint plus a bearer token scoped to spot ingestion — see Avstar API authentication and rate limits for token lifetime and Retry-After handling.
Read access to the source Avstar export share; treat those files as untrusted input.
Agreement on the canonical target shape with the downstream owner, mapped to the broadcast spot schema and metadata.

Step-by-Step Implementation

Step 1 — Define the compliance schema

Goal: reject non-compliant rows at the moment of construction so no downstream system ever sees a malformed spot. Durations must align to broadcast increment boundaries, airtime must match HH:MM:SS or the frame-accurate HH:MM:SS:FF, and identifier fields must obey fixed length and character-set rules. Deeper duration-boundary logic (SMPTE frame rates, sub-second tolerances) lives in validating spot durations against broadcast standards.

python

from decimal import Decimal
from pydantic import BaseModel, Field, field_validator, ConfigDict
import re

class AvstarSpotRecord(BaseModel):
    model_config = ConfigDict(extra="forbid", populate_by_name=True, str_strip_whitespace=True)

    spot_id: str = Field(..., min_length=6, max_length=16, pattern=r"^[A-Z0-9]+$")
    advertiser_id: str = Field(..., min_length=4, max_length=12)
    campaign_code: str = Field(..., min_length=3, max_length=10)
    airtime: str = Field(..., pattern=r"^\d{2}:\d{2}:\d{2}(:\d{2})?$")
    duration_sec: int = Field(..., ge=1, le=300)
    clearance_flag: bool = Field(default=False)
    billing_rate: Decimal = Field(..., ge=0, max_digits=12, decimal_places=2)
    market_syndication: str = Field(default="LOCAL", pattern=r"^(LOCAL|NATIONAL|SYNDICATED)$")

    @field_validator("airtime")
    @classmethod
    def validate_timecode_format(cls, v: str) -> str:
        # Reject impossible clock values (25:99:99) that a loose regex would pass.
        if not re.match(r"^(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d(?::[0-5]\d)?$", v):
            raise ValueError("Invalid airtime format. Expected HH:MM:SS or HH:MM:SS:FF")
        return v

    @field_validator("duration_sec")
    @classmethod
    def enforce_broadcast_alignment(cls, v: int) -> int:
        # Broadcast logs align spot lengths to 5-second increments (:15, :30, :60).
        if v % 5 != 0:
            raise ValueError("Duration must align to 5-second broadcast boundaries")
        return v

extra="forbid" is deliberate: when Avstar silently adds an undocumented column in a future export, the row fails loudly here instead of leaking an unmapped field into the scheduler. Normalizing the raw billing_rate and codes to the canonical set is handled downstream by standardizing billing codes across traffic systems.
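To see why the schema layers a strict validator on top of the loose field pattern, the two airtime regexes and the duration rule can be exercised on their own with the standard library. This is a minimal sketch: the patterns are copied from the model above, while the names LOOSE_AIRTIME, STRICT_AIRTIME, and is_aligned are illustrative and not part of the converter.

```python
import re

# Patterns copied from AvstarSpotRecord; the constant names are illustrative.
LOOSE_AIRTIME = re.compile(r"^\d{2}:\d{2}:\d{2}(:\d{2})?$")
STRICT_AIRTIME = re.compile(r"^(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d(?::[0-5]\d)?$")


def is_aligned(duration_sec: int, increment: int = 5) -> bool:
    """Return True when a spot length lands on the broadcast increment."""
    return duration_sec % increment == 0


# The loose pattern only checks digit counts; the strict one checks clock ranges.
for value in ("06:30:00", "06:30:00:12", "25:99:99", "6:30:00"):
    loose = bool(LOOSE_AIRTIME.match(value))
    strict = bool(STRICT_AIRTIME.match(value))
    print(f"{value:<12} loose={loose} strict={strict}")

print(is_aligned(30), is_aligned(32))
```

The impossible clock value 25:99:99 satisfies the loose pattern but not the strict one, which is the gap the field_validator closes.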

Step 2 — Stream the CSV with encoding fallbacks

Goal: consume a multi-gigabyte log at constant memory while surviving the Windows-1252 and ISO-8859-1 artifacts legacy exporters inject. Python’s csv module yields rows lazily, which pairs cleanly with an ordered encoding-fallback list.

python

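import csv
from pathlib import Path
from typing import Iterator

ENCODING_FALLBACKS: tuple[str, ...] = ("utf-8-sig", "utf-8", "cp1252", "iso-8859-1")

def resolve_encoding(file_path: Path, chunk_chars: int = 1_048_576) -> str:
    """Return the first fallback that decodes the whole file, reading in bounded chunks."""
    for encoding in ENCODING_FALLBACKS:
        try:
            with open(file_path, "r", encoding=encoding, newline="") as fh:
                while fh.read(chunk_chars):
                    pass
            return encoding  # decoded cleanly to EOF; stop trying fallbacks
        except UnicodeError:  # UnicodeDecodeError is a subclass
            continue
    raise RuntimeError(f"Failed to decode CSV with fallbacks: {file_path}")

def stream_avstar_csv(file_path: Path) -> Iterator[dict[str, str]]:
    """Yield raw CSV rows as dicts, resolving the encoding automatically."""
    # Settle the encoding before yielding anything: retrying a fallback after a
    # mid-file decode error would re-yield rows the consumer has already seen.
    encoding = resolve_encoding(file_path)
    with open(file_path, "r", encoding=encoding, newline="") as fh:
        reader = csv.DictReader(fh)
        if not reader.fieldnames:
            raise ValueError("Empty or malformed CSV header")
        yield from reader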

utf-8-sig leads the list so the Byte Order Mark that Avstar’s export utility prepends is stripped rather than parsed into the first header name. Expected result: the generator holds one row in memory at a time regardless of file size.
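The effect of the Byte Order Mark on the first header can be reproduced in a few lines. The sketch below is illustrative only: the two-column sample bytes and the helper name header_names are assumptions made for the demonstration, not part of the converter.

```python
import csv
import io

# A tiny in-memory export that starts with the UTF-8 BOM (EF BB BF).
RAW = b"\xef\xbb\xbfspot_id,duration_sec\r\nKTLA0007,30\r\n"


def header_names(data: bytes, encoding: str) -> list[str]:
    """Decode the bytes with the given encoding and return the CSV header names."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.DictReader(text).fieldnames or [])


plain = header_names(RAW, "utf-8")          # BOM survives as U+FEFF
stripped = header_names(RAW, "utf-8-sig")   # BOM removed by the codec

print(repr(plain[0]))
print(repr(stripped[0]))
```

Under plain utf-8 the first key is no longer spot_id, so a lookup such as row["spot_id"] fails for every row; under utf-8-sig the header reads as intended.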

Step 3 — Validate and quarantine

Goal: turn each raw dict into a validated AvstarSpotRecord, and divert every rejection to a durable JSON Lines quarantine keyed by spot_id — never dropping a row on the floor.

python

import json
import logging
import sys
from datetime import UTC, datetime
from pathlib import Path
from typing import Generator, Iterator

def configure_audit_logger() -> logging.Logger:
    """Pipe-delimited traffic-ops format: timestamp | level | module | message."""
    logger = logging.getLogger("avstar_converter")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # a second call must not attach a duplicate handler
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s"))
        logger.addHandler(handler)
    return logger

logger = configure_audit_logger()

def validate_and_quarantine(
    raw_rows: Iterator[dict[str, str]],
    quarantine_path: Path,
) -> Generator[AvstarSpotRecord, None, None]:
    """Yield validated models; append failures to a JSON Lines quarantine file."""
    with open(quarantine_path, "a", encoding="utf-8") as qf:
        for idx, row in enumerate(raw_rows, start=1):
            # DictReader fills short rows with None, so fall back on any falsy value.
            spot_id = row.get("spot_id") or "UNKNOWN"
            try:
                record = AvstarSpotRecord.model_validate(row)
            except Exception as exc:  # ValidationError and coercion failures
                audit_entry = {
                    "timestamp": datetime.now(UTC).isoformat(),
                    "row_index": idx,
                    "spot_id": spot_id,
                    "raw_data": row,
                    "error_type": type(exc).__name__,
                    "error_message": str(exc),
                    "pipeline_stage": "validation",
                }
                qf.write(json.dumps(audit_entry, ensure_ascii=False) + "\n")
                logger.warning("spot=%s row=%d quarantined: %s", spot_id, idx, exc)
                continue
            # Yield outside the try block so an error raised by the consumer
            # is never misfiled as a validation failure.
            logger.info("spot=%s row=%d validated ok", record.spot_id, idx)
            yield record

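Each quarantine line captures the exact raw payload, an error classification, and a UTC timestamp, so a post-mortem can replay a corrected row deterministically. Expected log lines (the quarantine message is abbreviated here; Pydantic's full ValidationError text spans several lines):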

text

2026-07-03 09:14:02,118 | INFO | avstar_converter | spot=KTLA0007 row=1 validated ok
2026-07-03 09:14:02,119 | WARNING | avstar_converter | spot=KTLA0008 row=2 quarantined: Duration must align to 5-second broadcast boundaries
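Because the quarantine file is JSON Lines, a post-mortem can read it back lazily and group rejections before anyone edits the source. The following is a rough sketch under stated assumptions: the field names match the audit entry written in Step 3, but read_quarantine, the sample entries, and their error messages are hand-written for illustration.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_quarantine(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON Lines one entry at a time, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


# Hand-written sample entries; in practice, pass an open quarantine file instead.
SAMPLE_LINES = [
    json.dumps({
        "timestamp": "2026-07-03T09:14:02+00:00", "row_index": 2,
        "spot_id": "KTLA0008",
        "raw_data": {"spot_id": "KTLA0008", "duration_sec": "32"},
        "error_type": "ValidationError",
        "error_message": "duration not aligned (sample text)",
        "pipeline_stage": "validation",
    }),
    "",
    json.dumps({
        "timestamp": "2026-07-03T09:14:03+00:00", "row_index": 5,
        "spot_id": "UNKNOWN",
        "raw_data": {"spot_id": "", "duration_sec": "30"},
        "error_type": "ValidationError",
        "error_message": "spot_id too short (sample text)",
        "pipeline_stage": "validation",
    }),
]

entries = list(read_quarantine(SAMPLE_LINES))
by_error = Counter(entry["error_type"] for entry in entries)
replay_queue = [e["raw_data"] for e in entries if e["pipeline_stage"] == "validation"]

print(by_error)
print(replay_queue)
```

The raw_data payloads in replay_queue are what a corrected run would feed back through model_validate once the source values are fixed.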

Step 4 — Serialize and dispatch in async batches

Goal: serialize validated records to JSON and POST them to the scheduler with bounded concurrency so a large log never exhausts the API rate ceiling. Tuning batch size and concurrency for sustained throughput is covered in async batch processing for high-volume logs.

python

import asyncio
import httpx
from typing import Iterator

async def dispatch_batch(
    records: list[AvstarSpotRecord],
    api_endpoint: str,
    client: httpx.AsyncClient,
    semaphore: asyncio.Semaphore,
) -> dict:
    """Serialize one batch and POST it under a concurrency cap."""
    payload = [r.model_dump(mode="json") for r in records]
    async with semaphore:
        response = await client.post(api_endpoint, json=payload)
        response.raise_for_status()
        logger.info("spot_batch=%d dispatched status=%d", len(records), response.status_code)
        return response.json()

async def run_async_pipeline(
    validated_records: Iterator[AvstarSpotRecord],
    api_endpoint: str,
    auth_token: str,
    batch_size: int = 500,
    max_concurrency: int = 4,
) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrency)
    headers = {"Authorization": f"Bearer {auth_token}", "Content-Type": "application/json"}
    tasks: list[asyncio.Task] = []
    pending: set[asyncio.Task] = set()
    batch: list[AvstarSpotRecord] = []

    async with httpx.AsyncClient(timeout=30.0, headers=headers) as client:
        for record in validated_records:
            batch.append(record)
            if len(batch) >= batch_size:
                # Backpressure: this loop never awaits on its own, so without a wait
                # every batch in the log would queue in memory before the first POST.
                if len(pending) >= max_concurrency:
                    _, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
                # Rebind (not .clear()) so the in-flight task keeps its own list.
                task = asyncio.create_task(dispatch_batch(batch, api_endpoint, client, semaphore))
                tasks.append(task)
                pending.add(task)
                batch = []
        if batch:
            tasks.append(asyncio.create_task(dispatch_batch(batch, api_endpoint, client, semaphore)))
        return await asyncio.gather(*tasks)

A single reused httpx.AsyncClient keeps the connection pool warm across batches, while the asyncio.Semaphore caps in-flight POSTs at max_concurrency (4 by default); set that value below the rate ceiling published for your scheduler API. Reassigning batch = [] rather than calling .clear() hands each scheduled task its own list, so later appends never mutate an in-flight payload, a classic source of duplicated or dropped spots. A batch size of 500 trades per-request network overhead against the memory held per batch. Expected result: run_async_pipeline returns one response dict per dispatched batch, in dispatch order.
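The batching step can also be looked at in isolation from the HTTP layer. The sketch below is an assumption-labelled alternative rather than the converter's own code: iter_batches is a hypothetical helper that groups any iterable into fixed-size lists with itertools.islice, holding at most one batch at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_batches(records: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of up to batch_size items without reading ahead."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    # islice pulls only batch_size items per pass, so memory stays bounded.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# 1,200 stand-in records split at the default batch size of 500.
sizes = [len(batch) for batch in iter_batches(range(1200), 500)]
print(sizes)  # [500, 500, 200]
```

Each yielded list is a fresh object, which gives the same isolation between in-flight payloads that rebinding batch = [] provides.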

Verification & Testing

Prove the converter end-to-end with a small fixture that exercises one clean row and one boundary violation. The clean row must round-trip to canonical JSON; the bad row must land in quarantine, not in the output.

python

import csv
import io

import pytest
from pydantic import ValidationError

FIXTURE = (
    "spot_id,advertiser_id,campaign_code,airtime,duration_sec,clearance_flag,billing_rate,market_syndication\n"
    "KTLA0007,ACME01,SUMMER24,06:30:00,30,true,1250.00,LOCAL\n"   # valid
    "KTLA0008,ACME01,SUMMER24,06:30:00,32,true,1250.00,LOCAL\n"   # 32s breaks 5s alignment
)

def _rows() -> list[dict[str, str]]:
    return list(csv.DictReader(io.StringIO(FIXTURE)))

# 1) A compliant row deserializes to canonical, JSON-safe output.
good = AvstarSpotRecord.model_validate(_rows()[0])
assert good.spot_id == "KTLA0007"
assert good.duration_sec == 30
assert good.model_dump(mode="json")["billing_rate"] == "1250.00"  # Decimal -> string

# 2) The misaligned duration is rejected before it can reach the scheduler.
with pytest.raises(ValidationError):
    AvstarSpotRecord.model_validate(_rows()[1])

For an integration check, run validate_and_quarantine(stream_avstar_csv(path), Path("quarantine.jsonl")) against a fixture file and assert the quarantine file contains exactly one line whose spot_id is KTLA0008. Because dispatch is deterministic given fixed input, snapshot the serialized batch and diff it in CI to catch schema drift early.
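One way to make the snapshot diff stable is to serialize the batch canonically before comparing it. This is a minimal sketch, not the project's CI tooling: canonical_json and snapshot_digest are hypothetical helpers, and the daypart field is invented purely to simulate schema drift.

```python
import hashlib
import json


def canonical_json(payload: list[dict]) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal text."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def snapshot_digest(payload: list[dict]) -> str:
    """SHA-256 of the canonical form, convenient for a one-line CI comparison."""
    return hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()


baseline = [{"spot_id": "KTLA0007", "duration_sec": 30, "billing_rate": "1250.00"}]
reordered = [{"billing_rate": "1250.00", "spot_id": "KTLA0007", "duration_sec": 30}]
# "daypart" is a made-up extra field standing in for an unexpected schema change.
drifted = [{"spot_id": "KTLA0007", "duration_sec": 30, "billing_rate": "1250.00", "daypart": "AM"}]

print(canonical_json(baseline))
print(snapshot_digest(baseline) == snapshot_digest(reordered))  # key order is ignored
print(snapshot_digest(baseline) == snapshot_digest(drifted))    # new field is detected
```

Storing the canonical text rather than only the digest keeps the CI diff readable when a field does change.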

Edge Cases & Failure Handling

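Hidden BOM or mixed encoding. A leading Byte Order Mark does not raise an error under plain utf-8; it decodes to U+FEFF and corrupts the first header name, which is why utf-8-sig stays first in ENCODING_FALLBACKS. A file that switches encoding mid-stream is a different failure: under utf-8 it raises UnicodeDecodeError partway through the read. If a file is genuinely mixed, open it with errors="replace" (for example open(path, encoding="utf-8", errors="replace")) so undecodable bytes become U+FFFD, and route rows containing that replacement character to quarantine rather than aborting the whole log.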

Billing-code or type mismatch. When Avstar exports a spot length as the string :30 instead of the integer 30, Pydantic raises a ValidationError and the row is quarantined with pipeline_stage="validation". Add a narrow pre-coercion layer (strip a leading colon, cast to int) before model_validate, or fix the export template — do not loosen the schema, which would let genuinely bad durations through.
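A narrow pre-coercion layer of that kind might look like the sketch below. The helper name precoerce_duration is an assumption, and it deliberately leaves the value as a digit string so that model_validate still performs the integer cast and the alignment check.

```python
def precoerce_duration(row: dict[str, str]) -> dict[str, str]:
    """Return a copy of the row with a ':30'-style duration rewritten as '30'."""
    cleaned = dict(row)  # never mutate the raw row; quarantine needs the original
    raw = (cleaned.get("duration_sec") or "").strip()
    digits = raw[1:]
    # Only touch the exact ':<digits>' shape; anything else is left for the schema to reject.
    if raw.startswith(":") and digits.isascii() and digits.isdigit():
        cleaned["duration_sec"] = digits
    return cleaned


original = {"spot_id": "KTLA0007", "duration_sec": ":30"}
print(precoerce_duration(original))
print(precoerce_duration({"spot_id": "KTLA0009", "duration_sec": ":3x"}))
```

Because the function only rewrites one precise pattern, a value such as :3x or 32 still reaches the schema unchanged and is quarantined on its merits.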

API rate-limit trip (HTTP 429). A stall at high volume is usually an unhandled 429 Too Many Requests. Lower max_concurrency, honor the Retry-After header, and add exponential backoff around dispatch_batch; if downstream error rates exceed a set threshold, open a circuit breaker and route remaining batches to a dead-letter queue for manual traffic-desk review instead of hammering the API.
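The wait between retries can be computed by a small pure function that is easy to test apart from the HTTP client. The sketch below rests on several assumptions: retry_delay is a hypothetical helper, it handles only the delay-seconds form of Retry-After (not the HTTP-date form), and the base, cap, and jitter values are placeholders to tune against your scheduler API.

```python
import random
from typing import Optional


def retry_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
    jitter: bool = True,
) -> float:
    """Seconds to sleep before retry number `attempt` (0-based)."""
    if retry_after is not None:
        value = retry_after.strip()
        # Honor the server's hint when it is a plain number of seconds.
        if value.isascii() and value.isdigit():
            return float(value)
    # Otherwise back off exponentially, never beyond the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Random jitter spreads retries so concurrent batches do not resend in lockstep.
    return random.uniform(0.0, ceiling) if jitter else ceiling


print(retry_delay(0, "7"))                 # server hint wins: 7.0
print(retry_delay(3, None, jitter=False))  # 1 * 2**3 = 8.0
print(retry_delay(10, jitter=False))       # capped at 60.0
```

A caller would wrap dispatch_batch in a loop that sleeps for this many seconds with asyncio.sleep after each 429 and gives up after a fixed number of attempts.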

FAQ

Why does the converter default to utf-8-sig instead of plain utf-8?

Avstar’s Windows-based export utility prepends a Byte Order Mark to many CSV files. Plain utf-8 decodes the BOM into a stray character on the first header, so DictReader produces a garbled first field name and every row misses that column. utf-8-sig strips the BOM transparently, which is why it leads the fallback list in Parsing Avion Export Formats.

What happens to rows that fail validation — are they lost?

No. Every rejection is appended to a JSON Lines quarantine file with the raw payload, error type, spot_id, and a UTC timestamp. That record is enough to correct the source and replay the row deterministically. The deeper duration rules that cause many rejections are documented in validating spot durations against broadcast standards.

How do I stop the async dispatch from tripping the Avstar API rate limit?

Cap concurrency with the asyncio.Semaphore, keep batch size at ~500, and honor Retry-After on any 429. Token lifetime, refresh, and per-endpoint ceilings are covered in Avstar API authentication and rate limits, and throughput tuning in async batch processing for high-volume logs.

Can I run this converter on a multi-gigabyte daily log without exhausting RAM?

Yes. The pipeline is generator-based end to end: stream_avstar_csv yields one row at a time, validation is lazy, and records accumulate only in batches of batch_size (500 by default) awaiting dispatch. Avoid calling list() on the stream, cap batch_size, and make sure the dispatch loop waits for in-flight batches before queuing more, so accumulation stays bounded.

Related

Parsing Avion Export Formats: the extraction boundary this converter belongs to, covering delimiter resolution, fixed-width maps, and time normalization.
Schema validation with Pydantic for traffic data: the Pydantic validator layer that enforces broadcast business rules on every parsed record.
Avstar API authentication and rate limits: token handling and Retry-After discipline for the dispatch stage.
