Archos Labs
The Execution Layer

How to Improve Data Pipeline Reliability

Rob Angeles · 5 min read
[Image: illustration of a failing data pipeline that requires reliability design]

Why data pipeline reliability depends on observability and redundancy by design—not frantic fixes after a breakdown

Data teams borrow the language of factories: assets, pipelines, flow. But when a supply chain fails, companies don’t just swap out a node. They redesign for visibility, resilience, and alternative routing. Data pipelines rarely get the same treatment.

Why failure is not an exception, it’s the default

Pipeline reliability shows in what stays upright under pressure, not in a run where nothing happened to break. Yet most pipelines aren’t designed to withstand stress. They’re built for flow: all green lights, no branches, no detours. Failure gets handled after the fact, when engineers scramble to patch, log, and monitor.

This isn’t acceptable in physical supply chains. A single delayed shipment triggers contingency plans. Distribution centers reroute. Materials get sourced from multiple vendors. Downtime is calculated in dollars. Redundancy and observability aren’t bolted on. They’re designed in.

Data platforms need a similar architecture. Volume alone doesn’t shield a system from fragility. In fact, the more a data stack scales, the more likely it is to fragment—moving parts copied, transformed, and handed off across teams, tools, formats. The potential for silent failure compounds at each handoff.

Monte Carlo’s 2024 Data Observability Report found that 87% of data leaders estimated each incident costs more than $100,000 in lost trust, productivity, or revenue. One Snowflake customer reported losing a full day’s downstream analytics after a transformation step quietly dropped a required column due to schema drift. No alert fired. No fallback existed. The only discovery came when customers noticed dashboards were empty.

Most teams believe their pipeline is reliable because the last run succeeded. That’s not reliability. That’s survivorship.

Observability fails when it only watches outputs

Adding alerts, dashboards, and checks after deployment is like tracking shipments only at delivery. You know the result, but not the path. Traditional pipeline monitoring tools focus on endpoint success—if a table loads without error, the system considers it healthy.

But success masks corruption. A job can run green while inserting nulls where values matter. A step can complete without error even if input volume drops by 40%. When observability exists only at the boundary, it misses the internal state. Errors rot undetected.
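A minimal sketch of an in-pipeline check that would flag both cases above, assuming a batch arrives as a list of dicts and a baseline row count is known from earlier runs. The names `check_batch` and `BatchReport`, the field names, and the thresholds are illustrative assumptions, not the API of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class BatchReport:
    row_count: int
    null_rates: dict
    problems: list = field(default_factory=list)

    @property
    def trusted(self) -> bool:
        # A batch is trusted only if no check raised a problem.
        return not self.problems


def check_batch(rows, required_fields, baseline_rows,
                max_null_rate=0.01, max_volume_drop=0.4):
    """Inspect a batch for silent corruption before it is loaded."""
    count = len(rows)
    null_rates = {}
    problems = []

    # Volume check: compare against the expected (baseline) row count.
    if baseline_rows > 0:
        drop = 1 - count / baseline_rows
        if drop >= max_volume_drop:
            problems.append(f"volume dropped {drop:.0%} vs baseline")

    # Null check: a missing key counts the same as an explicit None.
    for name in required_fields:
        nulls = sum(1 for r in rows if r.get(name) is None)
        rate = nulls / count if count else 1.0
        null_rates[name] = rate
        if rate > max_null_rate:
            problems.append(f"{name}: {rate:.0%} null")

    return BatchReport(count, null_rates, problems)
```

A step that calls `check_batch` before writing can refuse to load, or route to a quarantine table, whenever `report.trusted` is false, even though the job itself raised no error.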

PagerDuty’s engineering team treats observability as a first-order design input rather than an add-on. Every data pipeline component reports on its own health, latency, and throughput in real time. Their platform assumes partial visibility is as dangerous as no visibility. This orientation let them shift from reactive ticket-swapping to proactive debugging across platform teams.
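To make "every component reports on its own health, latency, and throughput" concrete, here is a small sketch of a step wrapper. It is not PagerDuty's implementation; `StepMetrics`, `track`, and the metric names are hypothetical, and a real system would send these records to a metrics backend rather than keep them in a list.

```python
import time
from contextlib import contextmanager


class StepMetrics:
    """Collects one record per step run: status, row counts, latency, throughput."""

    def __init__(self):
        self.records = []

    @contextmanager
    def track(self, step_name):
        stats = {"step": step_name, "rows_in": 0, "rows_out": 0}
        start = time.perf_counter()
        try:
            yield stats  # the step fills in rows_in / rows_out as it works
            stats["status"] = "ok"
        except Exception as exc:
            stats["status"] = "error"
            stats["error"] = repr(exc)
            raise  # report the failure, but never swallow it
        finally:
            elapsed = time.perf_counter() - start
            stats["latency_s"] = elapsed
            stats["rows_per_s"] = stats["rows_out"] / elapsed if elapsed > 0 else 0.0
            self.records.append(stats)


# Usage: a toy dedupe step that reports its own internal state.
metrics = StepMetrics()
with metrics.track("dedupe_orders") as stats:
    rows = [{"id": 1}, {"id": 1}, {"id": 2}]
    stats["rows_in"] = len(rows)
    unique = list({r["id"]: r for r in rows}.values())
    stats["rows_out"] = len(unique)
```

Because each record carries row counts alongside status, a step that "succeeds" while emitting far fewer rows than it received is visible in the metrics rather than hidden behind a green run.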

The right question isn’t “What failed?” It’s “What state is this component in—and should it be trusted?”

Redundancy prevents single points of silence

Resilient supply chains avoid depending on any single link. If one transporter shuts down, others kick in. Routes shift. Warehouses become buffers. They don’t rely on a perfect day. Data pipelines, meanwhile, often hang on one brittle warehouse or ETL job. If it stalls or misfires, everything downstream inherits the failure or, worse, continues quietly with corrupted data.

Redundancy in this context isn’t duplication for the sake of coverage. It’s intentional design of checkpoints, versioning, and alternate resolutions. At Airbnb, the data platform team built fallback mechanisms into their core pipeline framework. For example, when a table needed for reporting lags beyond tolerance, the system can switch to a backup source configured to maintain schema parity but update hourly instead of in real time. Users get slightly stale data, not silence.

This isn’t a BI trick. It’s architecture. The failover is triggered by health metrics and historical expectations, not arbitrary rules. The system treats expected freshness as a graded input to decision making, not as a single pass/fail flag.
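The freshness-based failover described above can be sketched roughly as follows. This is not Airbnb's framework; `Source`, `choose_source`, and the tolerance values are assumptions made for illustration, with each source carrying its own freshness tolerance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Source:
    name: str
    last_updated: datetime
    max_lag: timedelta  # freshness tolerance for this source

    def lag(self, now):
        return now - self.last_updated

    def is_fresh(self, now):
        return self.lag(now) <= self.max_lag


def choose_source(primary, backup, now=None):
    """Return (source, reason); raise if neither source is within tolerance."""
    now = now or datetime.now(timezone.utc)
    if primary.is_fresh(now):
        return primary, "primary within tolerance"
    if backup.is_fresh(now):
        return backup, f"primary lagging by {primary.lag(now)}; serving backup"
    # Neither source can be trusted: fail loudly instead of serving silence.
    raise RuntimeError("no source within freshness tolerance")


# Usage: the real-time table is 45 minutes behind a 15-minute tolerance,
# so the hourly backup (within its 2-hour tolerance) is served instead.
now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
realtime = Source("orders_realtime", now - timedelta(minutes=45), timedelta(minutes=15))
hourly = Source("orders_hourly", now - timedelta(minutes=30), timedelta(hours=2))
chosen, reason = choose_source(realtime, hourly, now)
```

Returning the reason alongside the chosen source lets downstream consumers surface that they are reading slightly stale data, rather than hiding the switch.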

Compare that with a pipeline that automatically reruns failed steps without observability context. A single retry can double the damage if trust signals don’t inform routing.
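One way to let trust signals inform retries is to check upstream health before every attempt. The sketch below assumes a hypothetical `inputs_trusted` callable that summarizes upstream checks; the function name and the result shape are illustrative only.

```python
def run_with_guarded_retry(step, inputs_trusted, max_attempts=3):
    """Retry `step` only while `inputs_trusted()` reports healthy upstream data."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        if not inputs_trusted():
            # Re-running on bad inputs would only repeat or spread the damage.
            return {"status": "halted", "attempts": attempt - 1,
                    "reason": "upstream inputs not trusted"}
        try:
            return {"status": "ok", "attempts": attempt, "result": step()}
        except Exception as exc:
            last_error = repr(exc)
    return {"status": "failed", "attempts": max_attempts, "reason": last_error}


# Usage: a step that fails once, then succeeds on the second attempt.
calls = {"n": 0}

def flaky_load():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("warehouse timeout")
    return "loaded"

outcome = run_with_guarded_retry(flaky_load, inputs_trusted=lambda: True)
```

The design choice here is that a transient error is retried, while untrusted inputs stop the step entirely and return a distinct "halted" status that can be routed to a human or a fallback.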

Start with system-level questions, not tools

Pipeline reliability is not a tooling problem. It’s an architecture symptom. You can add Great Expectations, Monte Carlo, or Datafold to your stack—but if your system wasn’t designed to detect, respond, and reroute under stress, the tools will only surface known breakpoints.

Redesign begins when teams shift the frame from “Did my pipeline run?” to “Can the system tolerate failure in any one node without blinding me to it?” Ask questions that expose silent dependencies:

  • If a transformation step emits columns in a different order, will downstream steps catch it?
  • If a job silently skips a record group, who gets notified?
  • If a lookup table fails to refresh, how long until dashboards reflect that rot?
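The first question can be answered in code with a schema contract checked at each handoff. The sketch below assumes schemas are passed as ordered lists of (column name, type name) pairs; `check_schema` and the example columns are hypothetical.

```python
def check_schema(expected, actual):
    """Compare two ordered lists of (name, type_name) pairs; return a list of issues."""
    expected_map = dict(expected)
    actual_map = dict(actual)
    issues = []

    # Missing or retyped columns (for example, dropped by schema drift).
    for name, type_name in expected_map.items():
        if name not in actual_map:
            issues.append(f"missing column: {name}")
        elif actual_map[name] != type_name:
            issues.append(f"type changed: {name} {type_name} -> {actual_map[name]}")

    # Columns nobody agreed on.
    for name in actual_map:
        if name not in expected_map:
            issues.append(f"unexpected column: {name}")

    # Order check over the columns both sides share.
    expected_order = [n for n, _ in expected if n in actual_map]
    actual_order = [n for n, _ in actual if n in expected_map]
    if expected_order != actual_order:
        issues.append("column order changed")

    return issues


# Usage: the upstream step reordered two columns and dropped a third.
contract = [("id", "int"), ("amount", "float"), ("region", "str")]
emitted = [("amount", "float"), ("id", "int")]
issues = check_schema(contract, emitted)
```

An empty list means the handoff matches the contract; anything else can block the downstream step or trigger a notification before a dashboard goes empty.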

Design answers, not manual responses. Build fallback sources by default. Require every component to report internal state metrics, not just success booleans. Design pipelines to fail in contained ways, not as invisible sinkholes.

A reliable pipeline doesn’t just move data. It protects trust.

Written by

Rob Angeles

Most consulting engagements split the thinking from the doing. Rob doesn't. Principal Consultant at Archos Labs, he owns the full stack — assessment, architecture, delivery — across retail, financial services, healthcare, and government.