Filter
Filter keeps rows that match a boolean expression.
Configuration:
- Expression: A predicate evaluated per row (for example, order_status = 'paid' and order_total > 0).
- Null handling: Decide how NULL comparisons behave; explicit IS NULL checks avoid surprises.
Example: Keep rows where event_name is purchase_completed and currency is USD.
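The Filter example can be sketched in plain Python. This is a minimal illustration, not the node's implementation: the row layout and the helper name `is_usd_purchase` are assumptions, and `None` stands in for NULL. The explicit `None` checks mirror IS NULL handling, so a row with a missing value is dropped deliberately rather than by accident.

```python
# Hypothetical rows; None plays the role of NULL.
rows = [
    {"event_id": "e1", "event_name": "purchase_completed", "currency": "USD"},
    {"event_id": "e2", "event_name": "purchase_completed", "currency": "EUR"},
    {"event_id": "e3", "event_name": "purchase_completed", "currency": None},
    {"event_id": "e4", "event_name": None, "currency": "USD"},
    {"event_id": "e5", "event_name": "cart_viewed", "currency": "USD"},
]


def is_usd_purchase(row):
    """Predicate: event_name = 'purchase_completed' AND currency = 'USD'."""
    name = row.get("event_name")
    currency = row.get("currency")
    # Explicit NULL handling: a missing value never satisfies the predicate.
    if name is None or currency is None:
        return False
    return name == "purchase_completed" and currency == "USD"


filtered = [row for row in rows if is_usd_purchase(row)]
kept_ids = [row["event_id"] for row in filtered]
```

Only the row with both values present and matching passes. Rows e3 and e4 are rejected by the explicit NULL check, not by the comparison itself.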
Sample
Sample takes a random or stratified subset of rows for exploration, testing, or cost control.
Configuration:
- Sample rate or row limit: Fixed fraction (10%) or max rows (50_000).
- Seed (when available): Reproducible samples for tests.
- Stratification keys (when available): Preserve rare segment representation.
Example: Sample 5% of transactions stratified by country so evaluation metrics are not dominated by one region.
When to use: Development pipelines, QA harnesses, or staged rollouts where full scans are unnecessary.
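A seeded, stratified sample can be sketched as follows. The function name `stratified_sample`, the synthetic transaction data, and the rule of keeping at least one row per stratum are assumptions made for illustration; the node's actual rounding and minimum-row behavior may differ.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, rate, seed):
    """Sample roughly `rate` of the rows within each value of `key`."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample = []
    for value in sorted(groups):  # sorted so iteration order is deterministic
        members = groups[value]
        # Assumption: keep at least one row per stratum so rare segments survive.
        k = max(1, round(len(members) * rate))
        sample.extend(rng.sample(members, k))
    return sample


# Synthetic data: one dominant country and two rare ones.
transactions = (
    [{"txn_id": f"us-{i}", "country": "US"} for i in range(1000)]
    + [{"txn_id": f"de-{i}", "country": "DE"} for i in range(40)]
    + [{"txn_id": f"nz-{i}", "country": "NZ"} for i in range(3)]
)

sample = stratified_sample(transactions, "country", rate=0.05, seed=42)
```

With a 5% rate, this sketch draws 50 US rows, 2 DE rows, and 1 NZ row. A plain 5% random sample could easily contain no NZ rows at all.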
Sort
Sort orders rows by one or more keys.
Configuration:
- Sort keys: Columns with ascending or descending order.
- Nulls first/last: Explicit placement prevents flaky joins or window partitions.
- Stability: Pair sort with a tie-breaker column (such as event_id) when duplicates on sort keys exist.
Example: For data keyed by (customer_id, effective_date), sort by those columns so merge operators see deterministic runs and logs are easier to diff.
When to use: Before nodes that assume order (some window setups, certain file writers), or when breaking ties for deduplication.
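The three Sort settings (keys, null placement, tie-breaker) can be sketched with Python's built-in `sorted`. The rows and the helper `sort_key` are hypothetical; nulls-last placement is chosen here only to show that the choice is explicit.

```python
# Hypothetical rows; None plays the role of NULL in effective_date.
rows = [
    {"customer_id": 2, "effective_date": "2024-01-05", "event_id": "e4"},
    {"customer_id": 1, "effective_date": None, "event_id": "e3"},
    {"customer_id": 1, "effective_date": "2024-01-01", "event_id": "e2"},
    {"customer_id": 1, "effective_date": "2024-01-01", "event_id": "e1"},
]


def sort_key(row):
    date = row["effective_date"]
    return (
        row["customer_id"],  # primary key, ascending
        date is None,        # False sorts before True, so NULLs go last
        date or "",          # placeholder keeps the tuples comparable
        row["event_id"],     # tie-breaker for duplicate sort keys
    )


ordered = sorted(rows, key=sort_key)
order = [row["event_id"] for row in ordered]
```

Rows e1 and e2 share the same customer_id and effective_date. Without the event_id tie-breaker, their relative order would depend on input order, which can differ between runs.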
Unique (deduplicate)
Unique removes duplicate rows, either full-row duplicates or duplicates by a key subset.
Configuration:
- Key columns: Deduplicate on user_id while keeping the first or last row per key according to sort order.
- Keep policy: First vs last requires an upstream Sort when order matters.
- Hash vs key: Full-row unique is simple to reason about; key-based unique matches business keys.
Example: Several rows may share the same event_id. Sort by ingest_timestamp descending, then Unique on event_id keeping the first row to retain the latest version.
When to use: After merges of streams, before cardinality-sensitive aggregations, or prior to loads that enforce primary keys.
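The Sort-then-Unique combination for keeping the latest version per event_id can be sketched as below. The helper `unique_keep_first` and the sample rows are hypothetical, and the integer timestamps are illustrative only.

```python
# Hypothetical rows: each event_id appears twice with different ingest times.
rows = [
    {"event_id": "a", "ingest_timestamp": 100, "status": "pending"},
    {"event_id": "b", "ingest_timestamp": 105, "status": "paid"},
    {"event_id": "a", "ingest_timestamp": 120, "status": "paid"},
    {"event_id": "b", "ingest_timestamp": 101, "status": "pending"},
]


def unique_keep_first(rows, key):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    result = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            result.append(row)
    return result


# Sort newest first, so "first row per key" means "latest version".
by_latest = sorted(rows, key=lambda r: r["ingest_timestamp"], reverse=True)
latest = unique_keep_first(by_latest, "event_id")
latest_by_id = {r["event_id"]: r["ingest_timestamp"] for r in latest}
```

Running `unique_keep_first` on the unsorted rows would keep the timestamp-100 version of event a, which is why the keep policy depends on the upstream sort.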
Patterns on the canvas
Shrink early
Place Filter and column projection (where available) close to the source to save compute on joins.
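A small sketch of why early filtering helps: both orderings below give the same paid orders, but the early variant sends fewer and narrower rows into the join. The data, the `join_customers` helper, and the lookup count as a stand-in for compute cost are all assumptions made for illustration.

```python
# Hypothetical source tables.
orders = [
    {"order_id": 1, "customer_id": 10, "order_status": "paid", "note": "gift"},
    {"order_id": 2, "customer_id": 11, "order_status": "cancelled", "note": ""},
    {"order_id": 3, "customer_id": 10, "order_status": "paid", "note": ""},
    {"order_id": 4, "customer_id": 12, "order_status": "pending", "note": ""},
]
customers = {10: "Ada", 11: "Lin", 12: "Sam"}


def join_customers(rows):
    """Attach customer_name to each row; also report how many lookups ran."""
    joined = [{**r, "customer_name": customers[r["customer_id"]]} for r in rows]
    return joined, len(rows)


# Shrink early: filter, project to the needed columns, then join.
narrowed = [
    {"order_id": r["order_id"], "customer_id": r["customer_id"]}
    for r in orders
    if r["order_status"] == "paid"
]
early, early_lookups = join_customers(narrowed)

# Filter late: join every row, then filter.
all_joined, late_lookups = join_customers(orders)
late = [r for r in all_joined if r["order_status"] == "paid"]
```

Here the early variant performs 2 lookups instead of 4 and carries 2 columns into the join instead of 4.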
Sort before deterministic dedupe
When duplicate resolution depends on time or version columns, Sort then Unique.
Related nodes
Column transforms
Change types, derive fields, and apply window logic.
Aggregation
Roll up after you have narrowed and cleaned rows.