Build to Launch

The Data Engineering Mindset Every AI Builder Needs

Why the decisions you make before your first real user will define everything that comes after

Jenny Ouyang and Pipeline to Insights
Mar 09, 2026
Cross-posted by Build to Launch
"I wrote about the importance of data engineering for AI builders and how decisions around data flow, quality, and monitoring can impact everything once real users arrive. Also, don’t forget to check out Jenny’s work on Build to Launch. I really enjoyed this collaboration. Hope you enjoy it too."
- Pipeline to Insights

You know what nobody warns you about when you vibe code your first real product?

The model works. The interface works. The demo is gorgeous. Then users show up, and the data underneath starts falling apart.

I’ve written about this moment a lot. In my production-ready guide, I broke down the gap between “it works on my machine” and “it works for 500 users.”

In the engineering practices piece, I documented the security disasters I found in my own AI-generated code: plaintext passwords, exposed API keys, the works.

Building 12 progressively complex Claude Code projects, I kept hitting the same wall. Things broke not because the AI messed up, but because the data flowing through the system was wrong in ways I couldn’t see.

Most of us spend 80% of our energy on the model, the prompt, the interface. Almost none on the data path. And when something goes wrong in production, the data path is where the answer almost always lives.

With AI’s assistance, anyone can build. But data infrastructure thinking? That’s a different discipline, and one most AI builders are flying blind on.

That’s where today’s guest comes in.

Erfan is a Senior Data Engineer and Architect who writes Pipeline to Insights, one of the clearest data engineering newsletters I’ve found. 9,400+ subscribers learning how data systems work under the hood, without the academic overhead.

We’ve already collaborated once. I wrote a guest piece on their newsletter about the tension between vibe coding speed and engineering fundamentals. Erfan wanted to bring the other side of that conversation to you: the data thinking that keeps AI products from breaking when real users show up.

What I appreciate about Erfan’s perspective is how approachable he makes the material. He takes concepts like data contracts, observability, and schema validation and makes them concrete. Not “memorize this framework.” More like “this is what breaks, and this is how to catch it early.”

We both agree on this: the AI model is rarely your problem. The foundation underneath it is.

If you want to explore Erfan’s world, here are three starting points:

  1. The Data Engineer’s GitHub Portfolio (2026 Edition)

  2. API Fundamentals for Data Engineers

  3. From Application Support to Data Engineer: A Real Transition Story


In this piece, Erfan walks through three decisions every AI builder should make before their first real user shows up: how data moves through your system, what data quality means for your specific product, and how to know when something breaks before your users tell you.

Here’s what he has to share:

[Illustration: a data engineer examining the three foundational layers beneath an AI interface: data flow, quality validation, and monitoring.]

Hi, I’m a Senior Data Engineer/Architect passionate about sharing insights and lessons from my journey in data and AI. I write about data engineering fundamentals, hands-on tutorials, and practical advice to help data professionals build strong foundations and grow their careers.

Imagine you’ve spent weeks building an AI product. The demo is clean, the model is sharp, and stakeholders are excited. You push it to production and wait for feedback.

Then users arrive. Slowly, things start to break.

An AI assistant that answered questions brilliantly in your tests starts hallucinating responses that never made sense. A search feature starts showing irrelevant results. An automation that worked perfectly in testing is now confidently processing the wrong data.

You investigate, and here is what you find:

  • The training data had duplicate records you never caught.

  • A field that fed your model silently changed its meaning three weeks ago.

  • You have no monitoring in place to tell you when something drifted.

  • You can’t even trace where the bad data came from.

The model was never the problem. The data infrastructure underneath it was.

The Intersection of Data and AI

“AI systems rely on both the model and the data. Preparing and adapting data thoughtfully is key to building effective AI solutions.” — Andrew Ng

Most AI builders spend 80% of their time on model choice, prompt engineering, and fine-tuning. Very few think about the data until something breaks in production. By then, fixing it is expensive, time-consuming, and embarrassing.

This post is for builders who want to get ahead of that. I’ll cover the three foundational data engineering decisions that will either save you or hurt you: data flow, data quality, and monitoring.

What we’ll cover:

  • Why data flow design matters before you write a single line of code

  • The data quality dimensions every AI builder needs to understand

  • How to think about monitoring before your first user ever shows up


Data Flow: Design It Before You Need It

Data seems simple: input goes in, the model works on it, and output comes out. That mental model breaks when something goes wrong, because you can’t see what happened in the middle.

Data flow is the path your data takes from its source to your model and beyond. Every transformation, every storage layer, every handoff. If you don’t consciously design that flow upfront, you inherit a mess that grows harder to fix with every new user.

What does a data flow answer?

At a minimum:

  • Where does data enter the system? (user inputs, APIs, databases, event streams)

  • How does it get transformed before reaching the model? (cleaning, formatting, combining sources)

  • Where do inputs and outputs get stored? (for debugging, retraining, and auditing)

  • Who or what consumes the outputs downstream? (dashboards, other features, notifications)

This might sound like over-engineering for an early-stage product. It isn’t. The cost of retrofitting traceability into an undocumented flow is far higher than designing it right the first time.

The most important habit: log everything from day one.

Early in my career, I joined a team dealing with a recurring data pipeline issue. Upstream data was arriving late or incomplete, and we were only discovering it after spending 7 to 8 hours processing pipelines downstream. The root cause? There were no checks or logging at the ingestion stage. By the time we knew something was wrong, the damage was already done.

The same thing happens in AI products, but the impact is bigger. If your model is serving users and you have no record of what inputs it received, you can’t debug unexpected outputs. You can’t trace a bad answer back to a bad data record. You can’t tell whether the problem is your model or your data.

Log inputs and outputs from day one. Not for analytics later. Because you cannot debug what you cannot see.
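As a minimal sketch of that habit, the snippet below appends every model call to a JSON Lines file and tags it with a trace ID, so a bad output can be looked up later. The file name, the `log_model_call` and `find_call` helpers, and the record fields are all assumptions made for illustration; a real product would more likely write to a database or a logging service.

```python
import json
import time
import uuid
from pathlib import Path
from typing import Optional

LOG_PATH = Path("model_calls.jsonl")  # hypothetical log location


def log_model_call(user_input: str, model_output: str, log_path: Path = LOG_PATH) -> str:
    """Append one input/output pair to a JSON Lines file and return its trace ID."""
    trace_id = str(uuid.uuid4())
    record = {
        "trace_id": trace_id,
        "timestamp": time.time(),
        "input": user_input,
        "output": model_output,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return trace_id


def find_call(trace_id: str, log_path: Path = LOG_PATH) -> Optional[dict]:
    """Look up a logged call by trace ID, e.g. when a user reports a bad output."""
    if not log_path.exists():
        return None
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["trace_id"] == trace_id:
                return record
    return None
```

Returning the trace ID to the caller is the design choice that matters here: if it travels with the response, a user report can be tied back to the exact input that produced it.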

This challenge goes deeper than logging. Without good metadata (clear documentation of what your data means, where it comes from, and how it’s structured), even your own systems can become a mystery over time. To understand how metadata supports reliable data systems, see “Metadata: What it is and why do we need it?”

Questions to ask when designing your data flow:

  • If a user reports a bad output, can I trace it back to the specific input that caused it?

  • If a schema changes upstream, will I know before it breaks my model?

  • Am I storing enough context alongside model outputs to audit them later?

  • Do I have a clear picture of where data enters, transforms, and lands?

Think of your data flow like a shipping tracking system. When a package is lost in transit, you need a record of every step it took to find where it went wrong. If you only know where it started and where it was supposed to end up, you’re guessing.


Data Quality: Why “Clean Data” Misses the Point

A definition worth remembering, from Data Contracts: Developing Production-Grade Pipelines at Scale by Chad Sanderson, Mark Freeman, and B. E. Schmidt:

Data quality doesn’t mean perfect data. It means understanding whether your data is good enough for what you’re building, and knowing how accurate it is at any given moment.

Put differently, the standard is fitness for purpose, not perfection.

For AI builders, this distinction matters. A dataset doesn’t need to be perfect to train or serve a model. But you need to understand its problems so you can work around them. And if those problems change suddenly, you need to know immediately.

The data quality dimensions that matter for AI products:

The DAMA-DMBOK framework defines several key dimensions of data quality. Here are the ones that have the most direct impact on AI systems:

Validity

Does the data conform to expected formats and business rules?

AI Impact: A field that accepts free text where a structured value is expected will silently corrupt your model’s inputs. Example: a product category field that should contain “Electronics” starts receiving values like “elec.” and “consumer tech”, and your model now treats these as different categories.
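One way to catch this at the door is a small validity check, sketched below. The allowed set, the alias table, and the `normalize_category` name are hypothetical; the point is only that unknown values fail loudly instead of flowing into the model.

```python
ALLOWED_CATEGORIES = {"Electronics", "Clothing", "Home"}  # hypothetical allowed values

# Hypothetical mapping of known variants to their canonical value
CATEGORY_ALIASES = {"elec.": "Electronics", "consumer tech": "Electronics"}


def normalize_category(raw: str) -> str:
    """Return the canonical category, or raise ValueError for unknown values."""
    value = raw.strip()
    if value in ALLOWED_CATEGORIES:
        return value
    alias = CATEGORY_ALIASES.get(value.lower())
    if alias is not None:
        return alias
    raise ValueError(f"Unexpected category: {raw!r}")
```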

Completeness

Is all required data present?

AI Impact: Missing values don’t just create nulls; they create silent bias. If certain user segments have systematically missing data, your model learns to make decisions without that signal, and performs worse for those users.
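To see whether missing data is concentrated in particular segments, it can help to compute the null rate per segment rather than one global number. The sketch below assumes rows are plain dictionaries; the `plan` and `age` fields and the sample rows are made up for illustration.

```python
from collections import defaultdict


def null_rate_by_segment(rows, segment_key, field):
    """Return {segment: fraction of rows where `field` is missing or None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        segment = row.get(segment_key, "unknown")
        totals[segment] += 1
        if row.get(field) is None:
            missing[segment] += 1
    return {segment: missing[segment] / totals[segment] for segment in totals}


# Illustrative rows: the "free" segment is missing `age` far more often
rows = [
    {"plan": "free", "age": None},
    {"plan": "free", "age": None},
    {"plan": "free", "age": 31},
    {"plan": "paid", "age": 45},
    {"plan": "paid", "age": 29},
]
rates = null_rate_by_segment(rows, "plan", "age")
```

A large gap between segments is the signal worth alerting on, because a single overall null rate would hide it.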

Timeliness

Is data refreshed according to business expectations?

AI Impact: A recommendation model trained on data that’s 30 days old in a fast-moving context (news, e-commerce, user behaviour) is already outdated before it serves its first user. Timeliness isn’t just a pipeline concern; it directly affects model relevance.

Uniqueness

Does each record appear only once?

AI Impact: Duplicate records are one of the most common and most damaging data quality issues for AI. They inflate the presence of certain patterns in training data, biasing the model toward outcomes associated with the duplicated records. A transaction that appears twice teaches the model it happened twice.
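A minimal uniqueness check might look like the sketch below, which keeps the first record per key and reports how many duplicates it dropped. It assumes each record has a reliable identifier (here a hypothetical `transaction_id`); without one, deduplication needs a more careful matching rule.

```python
def deduplicate(records, key):
    """Keep the first record for each key value; return (unique_records, duplicate_count)."""
    seen = set()
    unique = []
    duplicates = 0
    for record in records:
        k = record[key]
        if k in seen:
            duplicates += 1
            continue
        seen.add(k)
        unique.append(record)
    return unique, duplicates


# Illustrative data: transaction t1 was loaded twice
transactions = [
    {"transaction_id": "t1", "amount": 20.0},
    {"transaction_id": "t2", "amount": 35.0},
    {"transaction_id": "t1", "amount": 20.0},
]
unique_transactions, dropped = deduplicate(transactions, "transaction_id")
```

Tracking the duplicate count over time is as useful as removing the duplicates, since a sudden jump usually points at a pipeline change upstream.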

Consistency (Semantic Validity)

Does data mean the same thing across systems and over time?

AI Impact: This is where AI systems are most vulnerable. A column called distance_traveled that silently switches from kilometres to miles doesn’t fail loudly. It just slowly corrupts your model’s understanding of the world. These semantic violations are extremely hard to catch without proper monitoring and documentation.
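One rough way to surface a change like this is to compare the median of each new batch against a historical baseline. The sketch below is an assumption-laden heuristic, not a standard method: the `scale_shift` name, the 25% tolerance, and the sample values are all illustrative, and a real shift in user behaviour would trip the same alarm.

```python
from statistics import median


def scale_shift(baseline, current, tolerance=0.25):
    """Flag a batch whose median differs from the baseline median by more than `tolerance`.

    Returns (ratio, is_suspicious), where ratio = median(current) / median(baseline).
    """
    base_median = median(baseline)
    if base_median == 0:
        raise ValueError("baseline median is zero; ratio is undefined")
    ratio = median(current) / base_median
    return ratio, abs(ratio - 1.0) > tolerance


km_values = [10.0, 12.0, 15.0, 20.0, 25.0]        # historical values in kilometres
mile_values = [v * 0.621371 for v in km_values]   # same trips, silently reported in miles
ratio, suspicious = scale_shift(km_values, mile_values)
```

A ratio close to 0.62 or 1.61 is a hint that a kilometre/mile switch happened, but the check only tells you where to look; the documentation and the data owner tell you what actually changed.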

Want to learn more about data quality and how to implement it in practice? Check out my Data Quality & Governance series.

The 5 Pillars of Trusted Data

If you’re looking for a more memorable way to frame everything above, dlthub summarises data quality into five pillars that map well across all the dimensions we’ve just covered:

  1. Structural Integrity: Is the data technically correct? Data matches expected schema, data types, and required fields so pipelines don’t break and tables stay usable.

  2. Semantic Validity: Does the data make sense from a business perspective? Values follow real-world rules and logic, valid ranges, correct formats, meaningful statuses.

  3. Uniqueness and Relationships: Is the data consistent with itself? No duplicates, maintained key relationships, and accurate historical records.

  4. Privacy and Governance: Is the data safe and compliant? Sensitive information is protected, masked, or removed, and usage aligns with legal and governance requirements.

  5. Operational Health: Is the data arriving reliably and on time? Even accurate data is low quality if it’s late, incomplete, or if pipelines fail silently.

Common Data Quality Problems That Break AI Systems

Breaking schema changes: When a column is renamed, a data type changes, or a field is removed, AI models that depend on that structure fail, sometimes loudly and sometimes silently. Example: the column user_id is renamed to customer_id. Your pipeline keeps running, but it’s now joining on nothing.

Missed data SLAs: If data is expected to arrive hourly and it stops, a model serving real-time recommendations is now working from stale data. Example: expected events per hour dropped from 1,000 to 0, and no one noticed for six hours.

Data duplication: A transaction is loaded via batch into one table and simultaneously streamed via CDC (change data capture) into another. Now your training set has every transaction twice, and your model thinks popular items are twice as popular as they are.

A Practical Starting Point: Data Contracts

You don’t need a full data quality platform to get started. The most valuable thing you can do early is define a data contract for your most critical data sources, a document (or code) that specifies:

  • What fields are expected

  • What data types they should have

  • What values are valid or invalid

  • What the acceptable null rate is

  • Who owns this data and who should be notified if it changes

This single artefact, even if it’s a YAML file or a simple table, will save you hours of debugging when something inevitably breaks.
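To make the idea concrete, here is one possible shape for a contract expressed in code, with a small checker for a batch of rows. Everything in it is hypothetical: the `orders` source, the field names, the owner address, and the `check_contract` function are assumptions for illustration, and dedicated tooling would cover far more cases.

```python
CONTRACT = {  # hypothetical contract for an "orders" source
    "owner": "orders-team@example.com",  # who to notify when a check fails
    "fields": {
        "order_id": {"type": str, "max_null_rate": 0.0},
        "amount": {"type": float, "max_null_rate": 0.0},
        "category": {"type": str, "max_null_rate": 0.1,
                     "allowed": {"Electronics", "Clothing", "Home"}},
    },
}


def check_contract(rows, contract):
    """Return a list of human-readable violations; an empty list means the batch passes."""
    violations = []
    for name, spec in contract["fields"].items():
        # A field that is absent from a row counts as null
        values = [row.get(name) for row in rows]
        nulls = sum(v is None for v in values)
        null_rate = nulls / len(values) if values else 0.0
        if null_rate > spec["max_null_rate"]:
            violations.append(
                f"{name}: null rate {null_rate:.2f} exceeds {spec['max_null_rate']:.2f}"
            )
        # Report only the first type or value problem per field
        for v in values:
            if v is None:
                continue
            if not isinstance(v, spec["type"]):
                violations.append(
                    f"{name}: expected {spec['type'].__name__}, got {type(v).__name__}"
                )
                break
            if "allowed" in spec and v not in spec["allowed"]:
                violations.append(f"{name}: invalid value {v!r}")
                break
    return violations
```

Because a missing field is treated as null, a renamed column shows up as a 100% null rate instead of passing unnoticed.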

Data contracts don’t exist on their own. They’re part of a broader data governance practice that includes accountability, standardisation, security, and discoverability. Once you know what to check, the next question is: where in your pipeline should those checks live? Data engineers use several established patterns to protect production data: WAP (Write–Audit–Publish), AWAP, TAP, and the Signal Table Pattern. We cover both the foundations of data quality and governance and these practical design patterns in our series here.
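As a rough illustration of the Write–Audit–Publish idea, the sketch below stages a batch, runs an audit against the staged copy, and only then appends it to production. In-memory lists stand in for staging and production tables, and the function names are hypothetical; real implementations typically rely on staging tables, table swaps, or branches in the storage layer.

```python
def write_audit_publish(batch, audit, production):
    """Stage a batch, audit it, and append to `production` only if the audit passes.

    `audit` is any callable returning a list of problems (empty list = pass).
    Returns the list of problems found.
    """
    staging = list(batch)            # Write: land the batch away from production
    problems = audit(staging)        # Audit: run checks against the staged copy
    if not problems:
        production.extend(staging)   # Publish: promote only audited data
    return problems


def no_negative_amounts(rows):
    """Example audit: report rows whose amount is negative."""
    return [f"negative amount in row {i}" for i, r in enumerate(rows) if r["amount"] < 0]


production_table = []
write_audit_publish([{"amount": 10.0}], no_negative_amounts, production_table)
rejected = write_audit_publish([{"amount": -5.0}], no_negative_amounts, production_table)
```

The second batch never reaches `production_table`, which is the whole point of the pattern: consumers only ever read data that has already passed its checks.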

Key questions to ask when designing data quality checks:

  • What are the most critical data sources my model depends on?

  • Do I have a data contract for each of them?

  • What happens to my model if a key field changes or goes missing?


Monitoring: Don’t Wait for Users to Report Problems

In software engineering, observability means adding enough signals (logs, metrics, and traces) so you can understand what a system is doing and diagnose problems quickly.

Data observability applies the same idea to data pipelines.

Think of it like a fitness tracker for your data system.

Instead of tracking heart rate or steps, it tracks whether your data is:

  • fresh

  • complete

  • structurally correct

  • flowing through the pipeline as expected

When something looks wrong, you get an alert before it reaches your model or your users.

Data quality is the goal: delivering data that is clean, complete, fresh, and trustworthy. Data observability provides the sensors, dashboards, and alerts that show when quality slips, and where to fix it.

Most AI builders skip monitoring entirely until something breaks.

By the time the issue is noticed:

  • users are already seeing incorrect outputs

  • model predictions have already degraded

  • trust in the system is already damaged

Now the team is forced into reactive debugging mode, trying to trace the problem across pipelines, transformations, and models.

The mindset shift is simple:

Monitoring is part of the product, not something added later.

What to Monitor in your Data/AI Systems

A useful way to think about observability in your systems is to monitor three layers of signals:

  1. Inputs (what your model receives)

  2. Outputs (what your model produces)

  3. Pipeline health (how the system behaves)

Together, these signals tell you whether the data, the model, or the infrastructure is moving away from expectations.

  1. Input Distribution Monitoring

Your model assumes incoming data looks similar to the data it was trained on.

When this assumption breaks, model performance drops, even if the model itself hasn’t changed.

This is known as data drift.

For example:

  • a feature suddenly contains more null values

  • a categorical field starts receiving new unseen values

  • the statistical distribution shifts significantly

These changes are often subtle and easy to miss without monitoring.

What to track

  • statistical distribution of key input features

  • input volume over time

  • null rates and missing values

  • format violations or schema changes
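For a categorical input, two of these signals can be checked with very little code: values the baseline has never seen, and a rise in the null rate. The sketch below is one simple way to do that; the `input_drift_report` name, the 5-point margin, and the sample values are assumptions, and numeric features would need a proper distribution comparison instead.

```python
def input_drift_report(baseline, current, null_rate_margin=0.05):
    """Compare a batch of categorical values against a baseline sample.

    Returns unseen values and whether the null rate rose by more than `null_rate_margin`.
    """
    def null_rate(values):
        return sum(v is None for v in values) / len(values) if values else 0.0

    known = {v for v in baseline if v is not None}
    unseen = sorted({v for v in current if v is not None} - known)
    base_rate, cur_rate = null_rate(baseline), null_rate(current)
    return {
        "unseen_values": unseen,
        "baseline_null_rate": base_rate,
        "current_null_rate": cur_rate,
        "null_rate_alert": cur_rate - base_rate > null_rate_margin,
    }


baseline = ["Electronics", "Home", "Clothing", "Home"]
current = ["Electronics", None, "consumer tech", None]
report = input_drift_report(baseline, current)
```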

  2. Output Monitoring

Even if inputs appear normal, model outputs can still drift.

This usually happens when the real world changes faster than the model learns.

Over time you may start seeing:

  • unusual prediction patterns

  • unexpected class distributions

  • more low-confidence predictions

Without monitoring, these issues remain invisible until users complain.

What to track

  • distribution of prediction classes or values

  • rate of low-confidence predictions

  • user feedback signals (if available)
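The first two items can be summarised from a log of predictions, as in the sketch below. It assumes each prediction is stored as a `(label, confidence)` pair and uses an arbitrary 0.6 confidence threshold; both are illustrative choices, not a recommendation.

```python
from collections import Counter


def output_summary(predictions, confidence_threshold=0.6):
    """Summarise (label, confidence) pairs: class shares and low-confidence rate."""
    if not predictions:
        return {"class_share": {}, "low_confidence_rate": 0.0}
    counts = Counter(label for label, _ in predictions)
    total = len(predictions)
    low = sum(conf < confidence_threshold for _, conf in predictions)
    return {
        "class_share": {label: n / total for label, n in counts.items()},
        "low_confidence_rate": low / total,
    }


# Illustrative predictions from a hypothetical spam classifier
preds = [("spam", 0.95), ("ham", 0.55), ("ham", 0.80), ("ham", 0.40)]
summary = output_summary(preds)
```

Comparing today’s summary with last week’s is usually enough to spot a class share or a low-confidence rate that has moved.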

  3. Pipeline Health Monitoring

Sometimes the first signal of a problem isn’t the data itself; it’s the pipeline behaviour.

Late jobs, failed transformations, or missing upstream data often appear before data quality problems become visible.

If ingestion breaks, every downstream model will eventually feel the impact.

What to track

  • job completion status

  • pipeline latency

  • data freshness (time since last successful load)

  • volume anomalies (for example, events per hour suddenly dropping)

You don’t need a full observability platform to get started.

Simple Monitoring Setup

For an early-stage AI system, three simple signals already provide huge value:

  • Data freshness: When was the last successful data load? Alert if it’s older than your SLA.

  • Volume checks: How many records arrived in the last hour? Alert if it drops below a threshold.

  • Schema validation: Does incoming data still match the expected structure?

Even a small dashboard with these three metrics dramatically reduces your mean time to detect (MTTD) when something breaks.
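The three signals above fit in a few small functions. The sketch below is a starting point under stated assumptions: the function names, the one-hour SLA, the minimum of 100 records, and the `user_id` schema are all hypothetical, and wiring the results to an alerting channel is left out.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_load: datetime, now: datetime, sla: timedelta) -> bool:
    """Data freshness: True if the last successful load is older than the SLA."""
    return now - last_load > sla


def volume_too_low(records_last_hour: int, minimum: int) -> bool:
    """Volume check: True if fewer records arrived than the minimum threshold."""
    return records_last_hour < minimum


def schema_problems(record: dict, expected: dict) -> list:
    """Schema validation: list missing fields and fields with the wrong type."""
    problems = []
    for field, expected_type in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


# Illustrative run: last load 3 hours ago, no records this hour, user_id renamed upstream
now = datetime(2026, 3, 9, 12, 0, tzinfo=timezone.utc)
alerts = {
    "stale": is_stale(now - timedelta(hours=3), now, sla=timedelta(hours=1)),
    "low_volume": volume_too_low(records_last_hour=0, minimum=100),
    "schema": schema_problems({"customer_id": "u1"}, {"user_id": str}),
}
```

Passing `now` in as an argument, instead of reading the clock inside the function, keeps the freshness check easy to test.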

And in production AI systems, something will always break eventually.

The goal isn’t perfection.

The goal is seeing problems early enough to fix them before your users notice.


Conclusion

Every AI demo looks great. The model performs, the outputs are impressive, and the potential is obvious. Then real users arrive, and the data underneath starts to show its cracks.

The good news is that most of these cracks are preventable. Not through complex infrastructure or expensive tooling, but through three habits applied early:

  • Design your data flow before you need it, so you can debug when you have to.

  • Understand your data quality dimensions, so you know what trust actually means for your specific data.

  • Monitor proactively, so users aren’t the ones who tell you something broke.

These aren’t data engineering luxuries. They’re the minimum viable infrastructure for any AI system that’s going to serve real users in the real world.

Start small. Start now. Your future self, and your future users, will thank you.

What I love about Erfan’s piece is how it reframes production failures as data failures. Most of us instinctively blame the model when something goes wrong. We fine-tune, re-prompt, adjust parameters. Erfan’s point is simpler and more uncomfortable: check the data first.

The self-assessment checklist above is worth bookmarking. Next time something breaks in one of your AI builds, start there before you reach for the prompt editor.

Have you hit a production problem that turned out to be a data problem? I’d love to hear what broke and how you found it. Drop it in the comments.

— Jenny

A guest post by
Pipeline to Insights
Helping you understand data engineering in simple words, with tutorials, interview prep, and tips to break in and become a better data engineer.

© 2026 Jenny Ouyang