Data quality nodes guard correctness before data reaches consumers or production systems. You express expectations as rules, measure distributions, detect sensitive fields, synthesize test data, and notify owners when checks fail.

Validation

Validation applies schema and rule checks to incoming rows. Configuration:
  • Schema constraints: Required columns, non-null keys, type compatibility.
  • Business rules: Domain checks (amount >= 0, status IN (...)), regex patterns, referential checks against small lookup tables when supported.
  • On failure: Fail the run, route bad rows to a reject path, or only count violations (configured per node).
Typical use: Block loads when primary keys are duplicated or when mandatory GDPR consent flags are missing.
Keep validation close to landing for external feeds, and close to publication for curated marts—two layers catch different classes of defects.
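The split between schema constraints, business rules, and failure handling can be sketched in Python. This is a minimal illustration, not the node's actual API; the rule set, column names, and reject-path shape are assumptions:

```python
import re

# Illustrative rules: non-negative amount, allowed statuses, email pattern.
RULES = {
    "amount": lambda v: v is not None and v >= 0,
    "status": lambda v: v in {"new", "paid", "cancelled"},
    "email": lambda v: v is None or re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", v),
}
REQUIRED = {"id", "amount", "status"}  # schema constraint: mandatory columns

def validate(rows):
    """Split rows into (good, rejects); each reject carries the failed rule names."""
    good, rejects = [], []
    for row in rows:
        missing = REQUIRED - row.keys()
        failed = [c for c, check in RULES.items() if c in row and not check(row[c])]
        if missing or failed:
            rejects.append((row, sorted(missing) + failed))
        else:
            good.append(row)
    return good, rejects

good, rejects = validate([
    {"id": 1, "amount": 10.0, "status": "paid"},
    {"id": 2, "amount": -5.0, "status": "paid"},  # violates amount >= 0
])
```

Routing rejects to a separate path instead of raising keeps the run alive while still making violations visible downstream, which mirrors the "reject path" failure mode above.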

Data Profiling

Data Profiling computes statistics on columns: null rates, distinct counts, min/max, histograms, and inferred patterns. Configuration:
  • Sample size: Full scan vs sample for large tables.
  • Columns: All vs selected sensitive dimensions.
  • Output destination: Inline report vs persisted profile table for trend charts.
Typical use: After a vendor schema change, compare the null rate of the email column week over week to catch silent breakage.
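The core statistics are simple to state precisely. A rough sketch of a single-column profile (null rate, distinct count, min/max), assuming rows arrive as dicts; the function name and output keys are illustrative:

```python
def profile(rows, column):
    """Basic column profile: null rate, distinct count, min/max."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

rows = [{"email": "a@x.com"}, {"email": None},
        {"email": "b@x.com"}, {"email": None}]
stats = profile(rows, "email")  # null_rate 0.5, distinct 2
```

Persisting one such dict per column per run is what makes the week-over-week comparison in the typical use above possible.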

PII Detection

PII Detection scans text and structured fields for likely personal data using patterns and models (capabilities vary by plan). Configuration:
  • Detector packs: Email, phone, government IDs, payment artifacts.
  • Masking vs tagging: Flag columns for governance, or mask values before logging.
  • Locale: Tune detectors for country-specific ID formats.
Typical use: Before writing to a sandbox bucket, confirm no unexpected credit-card-like tokens appear in free-text notes columns.
Automated detection is probabilistic. Combine with policy tags and access controls—not solely automated redaction—for regulated data.
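A pattern-based detector pack can be sketched as named regexes run over free text. The patterns below are deliberately simplified examples, not the product's detectors, and they illustrate why the results are probabilistic (a 16-digit invoice number would also fire the card detector):

```python
import re

# Illustrative detector pack; real detectors add checksums and model scoring.
DETECTORS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
    "credit_card_like": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan(text):
    """Return the names of detectors that fire on a free-text value."""
    return [name for name, pattern in DETECTORS.items() if pattern.search(text)]

hits = scan("Customer wrote back from jane@example.com, card 4111 1111 1111 1111")
```

A tagging-only node would record `hits` against the column for governance; a masking node would substitute the matched spans before the value reaches logs.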

Row Generator

Row Generator produces synthetic or templated rows for tests and demos. Configuration:
  • Row count and seed for reproducibility.
  • Column generators: Ranges, enumerations, UUIDs, Faker-style patterns when available.
  • Schema target: Match production schema to test downstream nodes without copying real customer data.
Typical use: CI pipelines that run integration tests against ephemeral warehouses.
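Seeded generation is what makes such CI runs reproducible: the same seed yields the same rows on every run. A minimal sketch, assuming a target schema of id/amount/status (the column set and value ranges are invented for illustration):

```python
import random
import uuid

def generate_rows(count, seed=42):
    """Deterministic synthetic rows: same seed, same output."""
    rng = random.Random(seed)
    statuses = ["new", "paid", "cancelled"]
    return [
        {
            "id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "amount": round(rng.uniform(0, 500), 2),
            "status": rng.choice(statuses),
        }
        for _ in range(count)
    ]

rows = generate_rows(3)
```

Deriving the UUIDs from the seeded generator (rather than `uuid.uuid4()`) keeps even the keys stable across runs, so downstream join tests produce identical results every time.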

Notification

Notification sends alerts when upstream conditions trigger—often paired with validation or error paths. Configuration:
  • Channels: Email, Slack, Microsoft Teams, PagerDuty—per integration.
  • Message template: Include pipeline name, run URL, row counts, and top error messages.
  • Severity: Route SLA breaches to paging; send warnings to a team channel only.
Typical use: When Validation rejects more than 1% of rows, notify the data owner with a sample of offending keys.
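The threshold-plus-sample pattern from the typical use can be sketched as a message builder; channel delivery itself goes through the configured integrations, so this only shows the gating and templating logic, with invented function and field names:

```python
def reject_alert(pipeline, run_url, total, rejects, threshold=0.01, sample=5):
    """Build an alert message when the reject rate exceeds the threshold, else None."""
    rate = len(rejects) / total if total else 0
    if rate <= threshold:
        return None  # below threshold: stay quiet
    keys = ", ".join(str(r["id"]) for r in rejects[:sample])
    return (
        f"[{pipeline}] validation rejected {len(rejects)}/{total} rows "
        f"({rate:.1%}). Sample keys: {keys}. Run: {run_url}"
    )

msg = reject_alert("orders_load", "https://example.com/runs/123",
                   total=100, rejects=[{"id": 7}, {"id": 42}, {"id": 99}])
```

Including the run URL and a bounded sample of offending keys gives the recipient enough context to triage without flooding the channel with the full reject set.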
1. Profile after structural parsers: Run Data Profiling once files parse successfully; skip expensive profiling on known-bad batches.
2. Validate before expensive joins: Fail fast on key violations before paying for large shuffle operations.
3. Notify with context: Include variable values (environment, batch id) so on-call engineers can reproduce the issue quickly.

Governance

Contracts, catalog, and policy alignment.

Observability

Dashboards, diagnostics, and DLQ patterns.