<evan.rosa/>

Deep Dive: Enterprise-Grade Architecture for Fresh, Fast, Scalable Data

In the enterprise, “it works” isn’t enough. You need “it works at scale, fast, and without breaking.”

In today’s article, we discuss the design patterns, infrastructure components, and operational best practices that make a modern data platform enterprise-grade. We will cover:

  1. Key challenges and patterns
  2. High-level architecture blueprint
  3. In-the-weeds patterns for ingestion, processing, and serving
  4. Governance, testing, and observability

1. Enterprise Challenges & Proven Patterns

| Challenge | Pattern / Technique | Benefit |
| --- | --- | --- |
| High-volume writes & schema drift | Schema evolution + append-only writes (Iceberg/Delta) | Safe evolution, no backfills on write |
| Late-arriving & out-of-order events | Event-time watermarks in Flink/Beam | Correct windowed aggregations & dedupe |
| Duplicate or missing records | Idempotent upserts (MERGE INTO) + CDC | Guaranteed consistency without full rewrites |
| Expensive full-table scans | Partitioning + clustering (time, tenant key) | Pruned queries, faster reads |
| Mixing batch & streaming logic | Kappa vs. Lambda; prefer Kappa (stream-first) | Unified code path, simpler maintenance |
| Multi-team collaboration | Environment isolation + model versioning | Safe deploys, parallel work, clear lineage |

2. Architecture Blueprint

  1. Ingestion

    • CDC: Debezium or Fivetran captures DML changes from OLTP databases and streams them to Kafka.
    • API/webhook events: Producers push directly into Kafka.
  2. StreamCompute (Kappa)

    • Flink consumes raw topics, applies watermarks and windowed dedupe, and enriches events with lookup state (e.g., reference tables).
    • Writes out an append-only Bronze layer in Iceberg for audit and replay (a minimal sketch follows this list).
  3. BatchCompute

    • SQLMesh picks up the Bronze tables and Silver streaming outputs to run incremental models:
      • Silver tables: cleaned, typed, deduped.
      • Gold tables/views: dimensional models, aggregated down to business granularity.
  4. Serving

    • BI tools hit Gold materialized views for sub-second dashboards.
    • ML pipelines read Silver for feature engineering.
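
To make the Bronze write concrete, here is a minimal Flink SQL sketch of the Kafka-to-Iceberg append path. The topic, table, and column names (raw_events, bronze_events, event_ts) are hypothetical, and connector options are trimmed to the essentials:

    -- Hypothetical Kafka source; the WATERMARK clause declares event time
    -- with 5 minutes of allowed lateness.
    CREATE TABLE raw_events (
      event_id   STRING,
      tenant_id  STRING,
      payload    STRING,
      event_ts   TIMESTAMP(3),
      WATERMARK FOR event_ts AS event_ts - INTERVAL '5' MINUTE
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'raw_events',
      'properties.bootstrap.servers' = 'kafka:9092',
      'scan.startup.mode' = 'latest-offset',
      'format' = 'json'
    );

    -- Append-only write into the Bronze table (assumed to be an Iceberg
    -- table registered in the catalog): no updates, no deletes, so late
    -- data simply lands as new files and is merged later.
    INSERT INTO bronze_events
    SELECT event_id, tenant_id, payload, event_ts
    FROM raw_events;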

3. In-the-Weeds Design Patterns

A. Append-Only Bronze + Periodic Merge

  • Write: every event → new Iceberg file, partitioned by date/tenant.
  • Compact & Merge (Silver process):
    -- Spark SQL on Iceberg/Delta: fold one Bronze partition into Silver.
    -- Matching event_ids are updated in place; new ones are inserted.
    MERGE INTO silver.table AS S
    USING (SELECT * FROM bronze.table WHERE _partition = '2025-04-29') AS B
      ON S.event_id = B.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;
    
  • Benefits: high-throughput writes, backfill late data, audit trail.

B. Stateful Streaming with Watermarks

  • Set the watermark to event time minus the allowed lateness (e.g., 5 minutes).
  • Deduplicate in Flink:
    // Watermarks are assigned on the raw stream before keying;
    // assumes Event exposes an epoch-millis getTimestamp().
    events
      .assignTimestampsAndWatermarks(
          WatermarkStrategy
              .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
              .withTimestampAssigner((event, ts) -> event.getTimestamp()))
      .keyBy(Event::getId)
      .process(new DeduplicateFunction())
    
  • Outcome: deduplicated, order-corrected streams with effectively exactly-once results.
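
The same dedupe can also be expressed declaratively in Flink SQL. A minimal sketch, assuming the hypothetical raw_events source from Section 2 with its declared watermark:

    -- Flink SQL deduplication: keep the first row seen per event_id,
    -- ordered by event time.
    SELECT event_id, tenant_id, payload, event_ts
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (
               PARTITION BY event_id
               ORDER BY event_ts ASC
             ) AS row_num
      FROM raw_events
    )
    WHERE row_num = 1;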

C. Kappa Architecture

  • Single code path: all transformations expressed in Flink + SQLMesh, no separate batch pipeline.
  • Replay: to rebuild Silver/Gold, replay Kafka from offset 0 through the same Flink jobs and SQLMesh models.
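
With the Flink SQL source sketched earlier, replay is just a connector setting. A hedged example, reusing the hypothetical raw_events table:

    -- Same schema and connector as raw_events, but reading from the
    -- earliest offset so the full history is replayed. Flink's LIKE
    -- clause defaults to INCLUDING ALL OVERWRITING OPTIONS.
    CREATE TABLE raw_events_replay
    WITH ('scan.startup.mode' = 'earliest-offset')
    LIKE raw_events;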

D. Partitioning & Clustering

  • Partition by:
    • ingestion_date (daily/hourly folders)
    • tenant_id (multi-tenant isolation)
  • Cluster / Z-order on high-cardinality columns (user_id, product_id) to accelerate selective queries.
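
A hedged Iceberg DDL sketch (Spark SQL syntax); the bronze.events table and its columns are illustrative:

    -- Hidden partitioning on event day plus tenant: queries filtering on
    -- event_ts or tenant_id prune files automatically.
    CREATE TABLE bronze.events (
      event_id   STRING,
      tenant_id  STRING,
      user_id    STRING,
      product_id STRING,
      payload    STRING,
      event_ts   TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts), tenant_id);

    -- Periodic compaction that also Z-orders files on high-cardinality
    -- columns (prefix with your Iceberg catalog name if required).
    CALL system.rewrite_data_files(
      table      => 'bronze.events',
      strategy   => 'sort',
      sort_order => 'zorder(user_id, product_id)'
    );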

E. CQRS for Serving Layer

  • Command (writes): flow through Bronze/Silver → main tables.
  • Query (reads): pre-computed Gold views or caches (Redis, Pinot) for ultra-low latency.
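
As an illustration, the query side can be a pre-aggregated Gold model that SQLMesh keeps incrementally up to date. A minimal sketch with hypothetical names (gold.daily_orders, silver.orders):

    MODEL (
      name gold.daily_orders,
      kind INCREMENTAL_BY_TIME_RANGE (
        time_column order_date
      )
    );

    -- Pre-computed read model: dashboards scan this instead of Silver.
    SELECT
      order_date,
      tenant_id,
      COUNT(*)    AS order_count,
      SUM(amount) AS gross_revenue
    FROM silver.orders
    WHERE order_date BETWEEN @start_date AND @end_date
    GROUP BY order_date, tenant_id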

4. Governance, Testing & Observability

Data Contracts

  • Schema Registry enforces field types and evolution policies.
  • Kafka Schema Registry + Iceberg schema validation.
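
On the lake side, Iceberg keeps compatible evolution cheap because schema changes are metadata-only. For example, adding an optional column never rewrites existing data files (session_id is an illustrative name):

    -- Metadata-only change: old files stay valid, and readers see NULL
    -- for the new column on historical rows.
    ALTER TABLE bronze.events ADD COLUMN session_id STRING;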

Automated Testing

  • Unit tests for SQLMesh models (schema diff, row count).
  • Integration tests: run mini-pipelines in CI (using Docker Compose).
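
SQLMesh also supports audits as data-quality gates: a failing audit blocks the model from promoting. A minimal sketch (assert_no_null_ids is a hypothetical audit name):

    AUDIT (
      name assert_no_null_ids
    );

    -- The audit fails if this query returns any rows.
    SELECT *
    FROM @this_model
    WHERE event_id IS NULL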

Monitoring & Alerting

  • Metrics: Kafka consumer lag, Flink checkpoint durations, Iceberg file counts.
  • Dashboards: Grafana for infra, custom “data health” dashboards in BI.
  • Alerts: SLA misses (job runtime), data quality (null rate spikes).

Conclusion

Building an enterprise-grade data platform means more than picking tools; it’s about patterns and processes:

  • Append-only Bronze for speed & audit
  • Stateful streaming for freshness & correctness
  • Incremental batch for reliability & governance
  • CQRS & materialized views for performance

With these building blocks, your pipelines will stay fast, fresh, and fault-tolerant—even at millions of events per minute.


📩 Need expert guidance? Let’s talk