Itamar Ben Hemo
MAR 25, 2024
7 min read

Having moved from co-founding a data engineering consultancy to launching a company that empowers data analysts to build end-to-end data pipelines without relying on data engineers, I find the build versus buy dilemma particularly intriguing.

On the one hand, my engineering background appreciates the autonomy and tailored control that comes with building a solution from the ground up. Every aspect can be finely tuned to fit your exact needs. Yet, as a founder, I recognize the pressing reality that organizations often can’t afford the time it takes for a small team of engineers to construct a solution from scratch.

In today’s fast-paced environment, speed to value is paramount. Waiting for a custom-built solution could mean missed opportunities and delayed benefits. Moreover, dedicating engineers to lengthy development projects detracts from their ability to focus on high-value analytics and engineering tasks that could propel the project forward more efficiently.

Data pipelines serve as the backbone of modern data infrastructure and drive decision-making, business insights, and innovation. With the advent of big data and real-time analytics, it’s more important than ever that data pipelines serve a variety of use cases and scale to meet ever-growing needs. The number one question we hear when talking with data engineers is: Should we build or should we buy our data pipelines?

In this article, we delve into the complexities of the build vs. buy decision-making process, examine the challenges of building and maintaining data pipelines internally, and weigh the benefits and drawbacks of opting for third-party solutions.

Should You Build Your Data Pipelines?

Historically, organizations viewed building data pipelines internally as a viable option due to the control and customization it offers over the entire end-to-end data pipeline design and functionality. However, I learned the hard way that constructing data pipelines is time-consuming, and the real challenge lies in maintaining them over time.

Your organization will likely deal with numerous data sources, each with its own schema, necessitating the maintenance of various connections as APIs and sources evolve. Additionally, it becomes overwhelming trying to keep up with the pace of ad hoc requests for new data sets. As an engineer, I despised telling internal stakeholders, “It’ll take a few weeks to fulfill that request.”

Ultimately, this hindered our organization’s efforts to become data-driven. What those outside the engineering team failed to grasp was that our small data team had to handle a multitude of scenarios:

  • Network outages from external APIs resulting in API downtime
  • Dealing with APIs that impose rate limits on requests
  • Discovering bugs in our code on the data ingestion side and having to backfill missing data
  • Schema changes in databases that disrupt downstream transformations
  • Handling changing volumes of data and related infrastructure implications 
  • Optimizing our pipelines for data quality and easy monitoring
  • Ensuring the data is secured when traveling over the web and cloud infrastructure 
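To make one of these scenarios concrete, here is a minimal sketch of the retry logic a team ends up writing for every rate-limited source. The function names and the shape of the `fetch` callable are hypothetical, for illustration only.

```python
import time

# Hypothetical ingestion helper: retry a fetch when the source API
# rate-limits us (HTTP 429), backing off exponentially between attempts.
def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        status, payload = fetch()  # fetch returns (http_status, data)
        if status == 429:
            # Rate limited: wait longer on each successive attempt.
            time.sleep(base_delay * 2 ** attempt)
            continue
        if status == 200:
            return payload
        raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError("rate limit never cleared after retries")
```

Multiply this kind of defensive code across every connector, and the maintenance surface grows quickly.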

The harsh reality is that data engineers in smaller teams must wear multiple hats. This brings us back to the initial question: Should you build or buy your data pipelines?

Despite co-founding a software solution that automates the process of building and maintaining end-to-end data pipelines, my answer remains the same: it depends.

If you are part of a larger data team, can allocate engineering resources solely to building and maintaining data pipelines, and have already developed custom solutions throughout your data stack, feel free to build your data pipelines. In this scenario, you likely have the in-house resources to scale this solution to meet stakeholders’ needs and strict deadlines.

Costs Associated With Building Data Pipelines

When we talk about costs, there are three types to consider: setup costs, maintenance costs, and opportunity costs.

Your setup costs are the initial costs of making your data flow from your data sources into your destination of choice. If you elect to build your data pipelines, the setup costs will include the following:

  • The pay of data engineers allocating their time to the project. According to Glassdoor, the average pay of a data engineer in the United States is $144,389. If one or more engineers spend several months exclusively developing end-to-end data pipelines, the cumulative cost becomes evident.
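A quick back-of-envelope calculation shows how that salary figure translates into setup cost. The salary comes from the article; the team size and project duration below are assumptions for illustration.

```python
# Average US data engineer pay (Glassdoor figure cited above).
AVG_SALARY = 144_389

def build_setup_cost(engineers, months, salary=AVG_SALARY):
    # Prorate annual salary by months spent exclusively on the build.
    return round(engineers * months * salary / 12)

# e.g. two engineers spending four months on pipelines:
cost = build_setup_cost(engineers=2, months=4)  # roughly $96,000 in salary alone
```

And that is before infrastructure, tooling, and the maintenance costs that follow.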

Maintenance costs refer to the expenditures associated with ensuring the functionality of your data pipelines. When opting to construct these pipelines, you assume responsibility for resolving bugs, updating pipelines to accommodate schema changes, and addressing security concerns as they arise. Returning to our earlier discussion on data engineering salaries, if your pipelines cater to complex use cases, a small team of data engineers will likely be necessary to manage them diligently. This team will be tasked with promptly addressing issues and ensuring the smooth operation of data pipelines.
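One recurring maintenance chore from the paragraph above is catching upstream schema changes before they break downstream transformations. A minimal sketch of such a check follows; the column names are hypothetical.

```python
# Columns the downstream transformations expect (illustrative names).
EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "created_at"}

def schema_drift(incoming_columns):
    """Compare an incoming batch's columns against the expected schema."""
    incoming = set(incoming_columns)
    return {
        "missing": sorted(EXPECTED_COLUMNS - incoming),  # dropped or renamed upstream
        "added": sorted(incoming - EXPECTED_COLUMNS),    # new fields to map
    }
```

Running a check like this on every load is one more piece of code someone on the team has to own.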

One often overlooked cost is the opportunity cost linked with pipeline construction. Opportunity costs to consider include:

  • Diversion of attention from other analytical responsibilities, which can lead to employee frustration, particularly in understaffed organizations where employees are expected to accomplish more with fewer resources.
  • Allocation of existing resources to pipeline construction and maintenance. It’s crucial to assess where employees are being redirected from and whether they possess the requisite expertise.
  • Increased complexity as additional data sources are incorporated. How can low latency be ensured for real-time analytics?
  • Time-to-value associated with pipeline construction and maintenance. Balancing prolonged development cycles, high maintenance overhead and the necessity to keep pace with evolving technologies and APIs poses significant challenges in promptly deriving value from custom solutions.
  • Dependency on the tribal knowledge of departing engineers who built these systems. What contingency plans are in place for system operation and maintenance in their absence?

In essence, costs can accumulate swiftly without notice. While the initial expense may seem economical, the true cost of diverting data engineers from core, profit-driving projects becomes apparent over time.

Should You Buy Your Data Pipelines?

Stakeholders depend on data teams to promptly provide data for strategic decision-making and now even for new GenAI initiatives. With the growing demand for data, engineers require scalable solutions that save time and resources.

To meet those needs, organizations are turning to low-code and no-code development tools. Gartner estimates that in 2026, developers outside formal IT departments will account for at least 80% of the user base for low-code development tools, up from 60% in 2021.

Like any decision in technology, there are tradeoffs between buying and building data pipeline solutions. Below is a list of the pros and cons of buying a data pipeline solution.

Here are some benefits of purchasing a third-party data pipeline offering:

  • Get Things Done Faster: Ready-made data pipeline solutions can tackle most of your company’s needs quickly. This means you can achieve your goals faster, without unnecessary delays.
  • Focus on What Matters: With a data pipeline tool, you can free up your time to concentrate on essential engineering tasks, while your business teams dive into reporting without waiting around for data. Maintenance is an inherent cost of any data solution; when you purchase one, the provider assumes responsibility for maintenance tasks and any accrued technical debt. This alleviates the maintenance burden on your team, allowing them to focus on value-creating initiatives rather than routine upkeep.
  • Say Goodbye to API Headaches: Keeping up with changes in connectors, like API expirations, can be a real pain for data engineers. But some data pipeline tools come with pre-built connectors and ongoing maintenance, so you don’t have to worry about keeping up with the changes yourself. 
  • Reliable Support: When you invest in a data pipeline tool, you also get access to dedicated support. They’ll be there to guide you through any hiccups along the way, leaving you free to focus on your main objectives.
  • Grow Without Limits: As your company expands, so does the need for more connectors. Trying to handle this with an in-house data pipeline can become a headache. But with third-party SaaS tools, you can rest easy knowing your data pipeline can handle the growth. Plus, these tools typically support a wide range of connectors, so you’re covered no matter how big you get!
  • Ease of Use for Non-Technical Users: It’s typical for the needs of a data pipeline to evolve, especially as data consumers request new datasets to support fresh use cases. With an in-house data pipeline, accommodating these changes usually involves data engineers revisiting and adjusting the pipeline as needed. Conversely, commercial pipelines frequently provide a user-friendly web-based interface, empowering data consumers to make alterations to data collection methods, transformation processes, and destination settings without requiring engineering assistance. This level of accessibility grants business teams greater autonomy over data and enables data engineers to reallocate their time more efficiently.

However, the following trade-offs should also be taken into consideration:

  • Lack of Flexibility: There may be times when your team or company has very specific use cases or requirements that a third party can’t meet out of the box. These can often be handled with a workaround or by working with the vendor’s support team, but it’s important to assess a tool’s flexibility before committing to a solution that won’t scale.
  • Diminished Control: Although a purchased solution may currently offer all the functionality you require, future scenarios might necessitate new use cases or minor adjustments, such as personal preferences. While many vendors offer some degree of customization for your pipelines, achieving this often necessitates custom development within a framework supported by the vendor. If your organization utilizes several custom data sources, such as proprietary REST APIs, it’s important to assess the level of effort required to integrate and support these sources effectively.
  • Vendor Lock-In: Choosing any tool, whether built or bought, inevitably entails a degree of vendor lock-in. With a purchased solution, this lock-in may be more pronounced as you are committed to paying monthly invoices and may be bound by a multi-year agreement, thereby prolonging your dependence on the vendor.

Costs Associated With Buying Data Pipelines

Purchasing tooling for data pipelines can add up quickly, primarily due to licensing fees. Acquiring these tools doesn’t eliminate the need for engineering involvement. While it may require less time than integrating open-source tools or creating custom solutions, it doesn’t reduce the time investment to zero, even with top-tier tools.

That said, buying modern tooling can shorten data pipeline development time from weeks to just a few hours or days.

I like to break licensing costs down in two ways:

  • Modern Data Stack: Piecing together data ingestion, transformation, orchestration, and activation tooling to form an end-to-end data pipeline. This approach still requires engineering time to stitch all the tools together and ultimately slows down the creation of data pipelines.
  • Modern Data Platform: Utilizing a modern data platform to handle all the functionalities of building end-to-end data pipelines. This approach dramatically improves the developer experience and removes the need to switch back and forth between tools to create end-to-end data pipelines.

Let’s break down how this might look across an organization.

Source: Enterprise Strategy Group, a division of TechTarget, Inc.
Tool                       MDS Cost    MDP Cost
Ingestion                  $80,000     $65,000
Transformation             $20,000
Orchestration              $5,000
Activation (Reverse ETL)   $25,000
Total                      $130,000    $65,000
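The arithmetic behind the table is simple but worth making explicit. The figures are the illustrative numbers above, not vendor quotes.

```python
# Illustrative annual licensing costs for a modern data stack (MDS),
# taken from the table above.
mds = {
    "ingestion": 80_000,
    "transformation": 20_000,
    "orchestration": 5_000,
    "activation": 25_000,  # reverse ETL
}
# A modern data platform (MDP) covers all four functions in one license.
mdp_platform = 65_000

mds_total = sum(mds.values())        # $130,000 across four separate tools
savings = mds_total - mdp_platform   # $65,000 saved with a platform approach
```

Half the licensing spend, and one vendor relationship instead of four.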

The takeaway is that a modern data platform involves licensing fewer tools and is ultimately less of a headache to manage over time. Modern data stack technologies are linked together in a linear chain, which naturally creates pressure points in terms of integration and manpower. A lot of resources are required to serve insights to the entire business.

Tech-wise, upstream processes enable downstream processes, so if one link in the chain fails, nothing downstream can function or update properly. That demands a lot of workarounds, and the process does not scale well.

In 2024, organizations are maintaining smaller data teams while being asked to deliver faster, which justifies an investment in data even when budgets face tight review in the post-Zero Interest Rate Policy (ZIRP) era.

I might be biased, but I recommend taking a platform approach if you opt to buy your data pipelines. A platform approach to data pipelines consolidates the various functions that a small data team would need to build end-to-end data pipelines, blending simplicity with the flexibility to adapt to specific use cases. With a platform approach, data teams don’t have to worry about setting up, maintaining, and linking together various data tools, and they gain centralized visibility over their pipelines, simplifying root-cause analysis when something goes wrong.

Finding the Right Data Pipeline Solution for Your Business

The build vs. buy decision in data pipeline development is multifaceted, requiring careful consideration of technical, financial, and strategic factors. I often advise data teams to adopt a strategic approach when navigating the build vs. buy dilemma. I recommend starting this process by:

  • Conducting a thorough analysis of what you require in a solution, whether a custom in-house build or a third-party off-the-shelf option.
  • Evaluating the available options and the long-term implications of each, with a focus on time to value.
  • Collaborating with data, IT, and business stakeholders to align with organizational goals and priorities.
  • Giving the tool of choice a quick trial. If the tool will deliver time and resource savings, a trial should be able to prove it.

In my experience, I see the value in internally building data pipelines for unique use cases. However, data teams I talk to consistently share how significant the investment and ongoing maintenance of this approach can be. By contrast, third-party solutions can provide cost savings and be implemented faster, resulting in faster time to value.

With little maintenance involved, and given that data engineers’ salaries often exceed these tools’ license costs, the result is not just faster time to value but, just as important, faster time to ROI. There are trade-offs (depending on the solution) in flexibility and control, but these options allow you to focus on deriving insights rather than managing infrastructure.

At the end of the day, it comes down to your organization’s unique needs and the resources available to make the choice that aligns with your short- and long-term goals.

By adopting a strategic approach and leveraging available expertise, organizations can navigate this decision-making process effectively and build robust data infrastructure that drives business success.
