Stream nodes
Common node categories include:
- Ingress / parser – Deserialize Avro, Protobuf, JSON, or CSV; validate envelopes and headers.
- Filter and route – Drop test traffic, split by tenant, or branch poison messages to a DLQ path.
- Map and enrich – Join reference data from a low-latency cache or warehouse snapshot table (with staleness you accept).
- Aggregate – Compute counts, sums, or distinct estimates over windows.
- Sink – Write to warehouses, lakes, queues, or APIs with batching tuned for the destination.
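A filter-and-route node like the one above can be sketched as a small function with two outputs: the main path and a DLQ branch for poison messages. This is a minimal illustration; the `route`, `main`, and `dlq` names and the `tenant` field are assumptions, not part of any specific framework.

```python
import json

def route(raw: bytes, main: list, dlq: list) -> None:
    """Drop test traffic; branch malformed records to a DLQ path."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        dlq.append(raw)          # poison message -> dead letter path, job keeps running
        return
    if event.get("tenant") == "test":
        return                   # drop test traffic silently
    main.append(event)
```

In a real graph the two lists would be separate output streams or topics; the point is that a malformed record diverts rather than failing the node.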
Windowing
Windows bound unbounded streams into finite chunks for aggregation:
- Tumbling
- Sliding
- Session
Tumbling windows are fixed, non-overlapping intervals (for example, 1-minute buckets). Every event belongs to exactly one window.
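Because tumbling windows partition the timeline, window assignment is a single modulo operation. A minimal sketch (the function name and the sample events are illustrative):

```python
from collections import defaultdict

def tumbling_window_start(ts_ms: int, size_ms: int) -> int:
    """Return the start of the tumbling window containing ts_ms."""
    return ts_ms - (ts_ms % size_ms)

# Count events per 1-minute tumbling window.
events = [(5_000, "a"), (59_999, "b"), (60_000, "c"), (125_000, "d")]
counts = defaultdict(int)
for ts, _ in events:
    counts[tumbling_window_start(ts, 60_000)] += 1
# "a" and "b" land in window 0; "c" opens window 60_000; "d" falls in 120_000.
```

Note that every timestamp maps to exactly one window start, which is what makes tumbling aggregates simple to merge and replay.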
Real-time transforms
Transforms in streaming graphs should be associative and deterministic where state is involved:
- Prefer idempotent sinks with natural keys so at-least-once delivery does not duplicate business facts.
- Use side outputs for bad records instead of failing the whole job when a single malformed event arrives.
- Cap state growth: evict TTL’d keys from enrichment caches; avoid unbounded distinct unless you intend approximate algorithms.
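The first point above, an idempotent sink keyed on a natural key, can be sketched with an upsert: redelivering the same record overwrites rather than duplicates. The table name, columns, and in-memory SQLite are assumptions for illustration only.

```python
import sqlite3

# Hypothetical sink table keyed on the natural key order_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL, updated_at INTEGER)"
)

def sink(record: dict) -> None:
    # INSERT OR REPLACE makes the write idempotent on the primary key,
    # so at-least-once redelivery leaves exactly one row per business fact.
    conn.execute(
        "INSERT OR REPLACE INTO orders (order_id, amount, updated_at) "
        "VALUES (:order_id, :amount, :updated_at)",
        record,
    )

# Delivering the same record twice leaves a single row.
sink({"order_id": "o-1", "amount": 9.5, "updated_at": 100})
sink({"order_id": "o-1", "amount": 9.5, "updated_at": 100})
```

The same pattern applies to warehouse `MERGE` statements or key-value puts: the natural key, not the delivery attempt, determines the row.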
Stateful joins
Joining two high-volume streams requires keyed state and retention policies. Size state from peak key cardinality, not average.
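A minimal sketch of such a join, assuming per-key buffered state on each side and a time-based retention window (the `RETENTION_MS` constant and function names are illustrative):

```python
from collections import defaultdict

RETENTION_MS = 60_000  # assumed retention window for both sides

left_state = defaultdict(list)   # key -> [(ts, payload)]
right_state = defaultdict(list)

def join(side: str, key: str, ts: int, payload: dict, out: list) -> None:
    """Buffer the event on its own side; emit pairs against the other side."""
    own, other = (left_state, right_state) if side == "left" else (right_state, left_state)
    own[key].append((ts, payload))
    for other_ts, other_payload in other[key]:
        if abs(ts - other_ts) <= RETENTION_MS:
            out.append((key, payload, other_payload))

def evict(state: dict, now: int) -> None:
    """Retention keeps state bounded; size it for peak key cardinality."""
    for key in list(state):
        state[key] = [(ts, p) for ts, p in state[key] if now - ts <= RETENTION_MS]
        if not state[key]:
            del state[key]
```

The eviction pass is what the retention policy buys you: without it, every key ever seen stays resident, and peak cardinality, not average, sets the memory ceiling.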
Complex event processing
Patterns across multiple event types may need sequence matchers; test with recorded topics before production cutover.
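A sequence matcher can be as small as a per-key state machine. This sketch detects a hypothetical `login` followed by `purchase` for the same user within a window; the event types and `WINDOW_MS` are assumptions for illustration.

```python
WINDOW_MS = 300_000  # assumed: 5-minute match window

pending: dict = {}  # user -> timestamp of the unmatched "login"

def on_event(user: str, etype: str, ts: int, matches: list) -> None:
    """Per-key two-state matcher: login arms the pattern, purchase completes it."""
    if etype == "login":
        pending[user] = ts
    elif etype == "purchase" and user in pending:
        if ts - pending.pop(user) <= WINDOW_MS:
            matches.append((user, ts))
```

Replaying a recorded topic through a matcher like this, and diffing the matches against expectations, is the kind of pre-cutover test the paragraph above describes.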
Testing and promotion
Replay captured samples in a staging environment, compare outputs to a reference batch pipeline for the same day, then promote. Pair stream jobs with alerts on lag and error percentage.
Related topics
CDC sources
Database and bus ingestion options.
Dead letter queue
Handle persistently bad stream records.