Chen Cuello
JAN 11, 2023

The Rise of Data Mesh

Data mesh has been gaining popularity among large organizations that know the value of their data but realize that extracting that value is not as easy as pooling everything into one data lake and hiring experts to handle it. An Accenture study found that only 32% of firms achieved tangible and measurable value from their data analytics investments. Gartner estimates that poor data quality costs companies an average of $12.9 million annually.

Data mesh is primarily a change in the responsibility for provisioning data throughout the organization. Rather than having one centralized data team that handles data storage, analytics, and reporting, different teams throughout the organization take on the responsibility of provisioning data assets, similar to how they might provide services through APIs. By moving the responsibility to the departments, the data mesh architecture lets the real domain experts provide data they consider to be reliable and useful. 

For example, the customer support team would be responsible for the provision of data services regarding the number of complaints for each product the company offers. The data about customer complaints would be used by the manufacturing and product development departments. The manufacturing department might want to use this data in combination with data about the components and component providers to find out if there is a common supplier for the products with a high volume of complaints.
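The manufacturing department's question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the product names, supplier names, complaint counts, and the function name are all hypothetical, and a real data product would be served from a warehouse or API rather than an in-memory dictionary.

```python
from collections import Counter

# Hypothetical data product published by the customer support domain:
# number of complaints per product.
complaints_by_product = {"kettle": 120, "toaster": 8, "blender": 95, "mixer": 5}

# Hypothetical data owned by the manufacturing domain:
# component suppliers per product.
suppliers_by_product = {
    "kettle": {"Acme Heating", "BoltCo"},
    "toaster": {"BoltCo", "WireWorks"},
    "blender": {"Acme Heating", "MotorMax"},
    "mixer": {"MotorMax", "WireWorks"},
}


def suppliers_shared_by_high_complaint_products(complaints, suppliers, threshold):
    """Return suppliers common to every product at or above the complaint threshold."""
    high = [product for product, count in complaints.items() if count >= threshold]
    if len(high) < 2:
        # A "common" supplier only makes sense across at least two products.
        return []
    counts = Counter(s for product in high for s in suppliers.get(product, ()))
    return sorted(s for s, n in counts.items() if n == len(high))


print(suppliers_shared_by_high_complaint_products(
    complaints_by_product, suppliers_by_product, threshold=50
))
```

The point of the sketch is that the manufacturing team only needs the support team's published counts, not access to the underlying ticketing system.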

The product team might use this data in combination with location data and marketing funnels to find out if the product is being marketed in a way that sets up unrealistic expectations. In this case, one set of data is being used in different combinations by different recipients. When the customer support team is cognizant of how the data is used, they can provide a better service.

For product developers, it’s easy to understand the correlation between suppliers and the defects in a product line, but a centralized data team would have to identify the problem, learn how each department works, and only then perform the data analysis. Furthermore, a data team might spend more time figuring out which source of data is most reliable, whereas the customer support team knows exactly how to interpret the data they’ve managed themselves.

Why Are Companies Moving from Data Lake to Data Mesh?

Data mesh architecture is the latest in decentralized approaches to enterprise software infrastructure. Just as companies are migrating to microservices for their software deployment, moving to a data mesh offers the potential benefits of decentralization.

The centralized model of a data lake places tremendous responsibility on a single team of data engineers and professionals. In this approach, a centralized data team looks at the company’s KPIs and challenges and creates data structures that collect the data needed to influence the company’s key performance metrics.

While this can succeed, it requires that the data engineers also become experts in various areas of the company’s operations. Furthermore, there’s the issue of waste. Much data is outdated, of poor quality, or simply not relevant to actual business needs. Data has to be extracted, transformed, and stored in a data lake (or even a dedicated data warehouse on top of it) before anyone can tell whether it is useful, so when it turns out not to be, a large amount of data cleansing effort and storage space has been wasted.

The bottom line is that companies have invested a tremendous amount of resources into data lakes, and in most cases, it simply hasn’t paid off. Zhamak Dehghani, who coined the term data mesh, goes through the logic behind the movement towards the data mesh in her 2019 article outlining its components.

Tooling Implications: Core Principles of a Data Mesh

Data mesh includes four pillars: domain ownership, data as a product, self-service data infrastructure platform, and federated governance. Each one of these pillars has implications for the tooling that organizations need to consider when moving to a data mesh.

Decentralized Ownership: Domain-Owned

In a data mesh, each data product is owned by the business unit that produces that data. 

Companies moving to a data mesh architecture will need to determine what is meant by a domain. While a domain could be any business unit or group, to maintain a data mesh, each business unit needs to have specific personnel tasked with creating and maintaining the data products, as well as answering the needs of the other business units. 

In this architecture, rather than a direct reporting relationship between the data team and the business leaders, the different business units need to communicate with one another. This may require clarity in the organizational structure or formalized procedures for making requests for data products that are not already on offer.

In other words, the company may still maintain a data cloud infrastructure such as Snowflake, or S3 with Redshift, but in a way that makes each department responsible for its own partition of the data lake. Similarly, other products, such as the data extraction and transformation systems, need to be provisioned as services to the different groups within the enterprise. To service the data mesh, it’s important to choose tooling that takes a modular approach and caters to the different departments, with capabilities such as per-environment (per-domain) access and support for an unlimited number of users.

Data as a Product

The principle of data as a product states that the data should be intrinsically discoverable and usable. This is likely to spawn a class of products for data package discovery. API libraries and discovery systems became an important part of enterprise microservices architecture and remain one of the main ways in which companies maintain their systems.

Similarly, data mesh architecture requires standardization of the way in which business units deliver data products. Aggua is an example of a tool in this area, merging a data catalog with the data lineage experience to allow for faster discovery of data products and to build the trust needed to use them.

The other aspect of data products is usability, which requires companies to create policies for standardizing the data compilation, modeling, and packaging processes. Security and authentication are key parts of this. Maintaining security standards such as SOC 2 and using current versions of OAuth is a requirement of doing business. Different data products may include personal data, confidential company information, or other types of sensitive data. Just like any product, data products need to be secured so that they don’t expose sensitive data to the wrong users.

Each type of data has specific privacy and security requirements attached. To ensure that the policies are set correctly, companies need to maintain detailed metadata that flags sensitive data. Tools in the data stack should include capabilities for flagging or masking sensitive data, adding and maintaining metadata, and properly managing the different types of data. In a modular data mesh architecture, it’s possible that some of the people handling the data may have less expertise in privacy and security. Therefore, it’s important to automate the security and privacy protocols using tools that provide built-in privacy-preserving mechanisms.
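As a rough illustration of metadata-driven masking, the Python sketch below flags columns as sensitive in a schema and masks them automatically unless the consumer is explicitly cleared. The `ColumnMeta` class, the `mask_record` function, and the field names are hypothetical and do not represent any particular tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """Hypothetical column-level metadata maintained by the owning domain."""
    name: str
    sensitive: bool = False


def mask_record(record, schema, allowed_sensitive=False):
    """Return a copy of the record with flagged columns masked.

    Columns are masked based on metadata alone, so the person publishing
    the data does not have to remember which fields are sensitive.
    """
    masked = {}
    for column in schema:
        value = record.get(column.name)
        if column.sensitive and not allowed_sensitive and value is not None:
            masked[column.name] = "***"
        else:
            masked[column.name] = value
    return masked


schema = [ColumnMeta("ticket_id"), ColumnMeta("email", sensitive=True)]
record = {"ticket_id": 1, "email": "customer@example.com"}

print(mask_record(record, schema))
print(mask_record(record, schema, allowed_sensitive=True))
```

Because the rule lives in the metadata rather than in each pipeline, a domain team with little security expertise still gets the masking behavior by default.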

Finally, in modern software product development, support for CI/CD and version control is an industry standard. The technologies used to build data products should support the same practices and align with the same standards.

Self-Service Platform Requirements

In a data mesh environment, the IT and data teams focus on setting up self-service infrastructure as platforms that best serve the company’s requirements. The data processes should be simple enough that teams can build new products and stay consistent across the organization, but flexible enough to offer different users the right tools for their specific needs.

A key role of the self-service platform in a data mesh is to act as a single access point for data producers and data consumers. For the technology, this means the interface has to support polyglot data, that is, expose its data in many different forms, so that all types of data consumers can use it.
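One way to picture polyglot output is a single access function that serves the same rows in whichever format the consumer asks for. The following Python sketch uses only the standard library; the `serve` function, the format registry, and the sample rows are hypothetical, and a real platform would more likely expose tables, files, and APIs than strings.

```python
import csv
import io
import json


def to_json(rows):
    """Serialize rows as a JSON array with stable key order."""
    return json.dumps(rows, sort_keys=True)


def to_csv(rows):
    """Serialize rows as CSV with columns in alphabetical order."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical registry of the forms this data product is exposed in.
FORMATS = {"json": to_json, "csv": to_csv}


def serve(rows, fmt):
    """Single access point: same data, format chosen by the consumer."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return FORMATS[fmt](rows)


rows = [{"product": "kettle", "complaints": 120}]
print(serve(rows, "json"))
print(serve(rows, "csv"))
```

Adding a new output form means registering one more serializer, without changing how producers publish the data.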

For the technology, the main implication of self-serve data infrastructure as a platform is scalability. To truly enable product creators to self-serve, the data platform must be able to scale without the intervention of an IT professional. This kind of dynamic scaling is most readily achieved with a fully managed SaaS platform, such as Snowflake for the data lake or warehouse and Rivery for data integration.

Another key technology consideration for enabling self-service is ease of use. Ease of use comes from an intuitive user experience, but also from how much skill and knowledge users need before they can use the platform. For example, relying on industry standards for data wrangling, such as ANSI SQL and Python, can shorten the learning curve for users onboarding onto a new platform, in contrast to making them learn tools and functionality that exist only within that platform. Some tools also provide templates (for example, starter kits) that not only speed up development but also enable less skilled users to self-serve and develop advanced data products.

Finally, to meet the demand for flexibility, the chosen technology has to offer ways to go beyond strictly no-code solutions. It’s extremely rare to find a no-code technology that meets the needs of different users across different domains. Combining no-code with low-code or managed-code options lets users keep their self-service freedom while still meeting the needs of their data products.

Federated Governance

Federated governance entails the security, compliance, availability, quality, standardization, and provenance of the data. In terms of data quality and provenance, the distributed model provides clear advantages. In terms of security and compliance, with self-service tools, every team can implement security appropriately depending on the sensitivity of the data. In this realm, again, the data producers and users are the best qualified to know the level of security and compliance required for each data product. Enterprises will also need analytics and usage tracking to understand the value of the data products and ensure their proper management. 
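To make the "federated" part concrete, here is a minimal Python sketch in which a small set of organization-wide rules is defined once, while each domain fills in the details for its own data products. The required fields, the sensitivity levels, and the `policy_violations` function are all assumptions made for illustration, not an established standard.

```python
# Hypothetical organization-wide rules, agreed on centrally.
REQUIRED_FIELDS = ("name", "owner_domain", "sensitivity")
ALLOWED_SENSITIVITY = {"public", "internal", "confidential"}


def policy_violations(descriptor):
    """Check one data product descriptor against the global rules.

    The domain decides the values (for example, how sensitive its data is);
    the central policy only checks that the decision was made and recorded.
    """
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS
        if not descriptor.get(field)
    ]
    sensitivity = descriptor.get("sensitivity")
    if sensitivity and sensitivity not in ALLOWED_SENSITIVITY:
        problems.append(f"unknown sensitivity: {sensitivity}")
    return problems


complaints_product = {
    "name": "complaints_by_product",
    "owner_domain": "customer_support",
    "sensitivity": "internal",
}
print(policy_violations(complaints_product))
print(policy_violations({"name": "draft_product", "sensitivity": "secret"}))
```

A check like this could run whenever a domain publishes or updates a data product, which keeps standards consistent without routing every decision through a central team.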

In this environment, products such as Rivery are particularly suitable. Rivery serves many of the outward-facing teams, such as marketing and sales, and provides an easily consumable set of tools and policies that can be customized for each group. The multi-tenancy features and domain-level access allow a level of flexibility for different user groups to access exactly what they need.

Rivery addresses the need for federated governance with built-in multi-tenant capabilities and user access controls.

Starting the Data Mesh Journey

Moving from a centralized data analytics approach to a data mesh approach is like any other digital transformation in that it takes time as well as a shift in mindset. Some data team members may become more integrated with the IT group, while others will work within the different business units on the data needs of each domain. The domain members responsible for data products will need training in the new processes as well as in how to create products that are useful to the rest of the organization.

Wrapping up

Moving to a data mesh is one of the ways that enterprises are improving their data-driven decision-making. The data mesh approach improves the relevance of the data provided within the organization. Furthermore, companies that adopt this approach increase the direct interactions among departments, breaking down company silos.

If you’re looking for a data integration platform that’s perfectly aligned with the data mesh approach, stop right here. With tooling designed for flexibility to serve multi-purpose applications and a multi-tenant ETL approach, Rivery fits in with any data mesh strategy.
