Itamar Ben Hemo
AUG 27, 2024

The last two decades of data engineering have witnessed significant technological shifts, from the rise of big data to the rapid growth of cloud computing. While these advancements have propelled companies to become data-driven, the disruption that generative AI (GenAI) is bringing to the data industry may prove to be the most significant shift yet.

In the data industry, AI enhances business intelligence (BI), making it more efficient, effective, and accessible to a broader range of users. AI-powered BI tools can automate many of the manual tasks involved in BI, such as data preparation, cleansing, and analysis. In addition, AI can power other analytical operations and process automation, generating greater productivity from data.

However, AI needs access to reliable data to drive these advancements, placing data engineers at the forefront of this movement. With GenAI transforming the world and the use cases for AI continuing to grow daily, it's no surprise that a recent study by McKinsey found that 40% of organizations plan to increase their spending on AI.

The role of data engineering is changing before our eyes. The rise of AI is already leading data engineers to ingest data from new sources (mainly unstructured data), transform data in new ways (beyond just modeling it for analytics), and load data into new targets (vector databases).

Throughout this article, we will explore how the role of the data engineer is evolving and how GenAI can fit into the data engineering workflow, and offer a glimpse into the future of data engineering shaped by AI.

Will AI Replace Data Engineers? 

No, AI won’t replace data engineers—at least not in the foreseeable future.

However, the role of data engineers is evolving as AI continues to transform the data landscape. While AI excels at automating routine tasks, data engineering encompasses much more than that. It demands critical thinking, a deep understanding of stakeholder needs and business goals, and strategic planning—skills that AI cannot fully replicate.

As AI tools become more sophisticated, data engineers will increasingly focus on higher-level responsibilities. For instance, they’ll need to ensure data quality, manage data governance, and create scalable architectures that can accommodate AI applications. This requires not just technical skills but also strong communication and collaboration abilities, as data engineers work closely with data scientists, analysts, and business leaders to align data strategies with organizational objectives.

Moreover, data engineers play a vital role in interpreting the results produced by AI systems. They must understand the context in which AI models operate and assess the implications of their outputs. For example, when an AI model identifies trends or anomalies, it’s the data engineer’s responsibility to validate those findings and integrate them into actionable business strategies.

No matter how advanced AI tools become, data engineers remain essential for building and maintaining AI applications. Skills like critical thinking, contextual decision-making, and a strong grasp of business needs are irreplaceable and often more valuable than many data engineers realize. As AI continues to evolve, the demand for skilled data engineers who can bridge the gap between technology and business will only grow, ensuring their crucial role in the data ecosystem for years to come.

How GenAI Will Help Data Engineers

1. Generating Data Transformation Code 

One of the most significant benefits of GenAI for data engineers is its ability to automate code generation, reportedly cutting coding time by as much as 45 to 50%. This shift is driven by text-to-SQL tools and assistants that generate code from plain-language prompts, without requiring a complex user interface.

By generating data transformation SQL from a plain-language prompt, GenAI can cut the time to transform and test data from days to minutes.
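The prompt-to-SQL flow boils down to assembling the table schema and the request into a prompt for an LLM. The sketch below shows one simplistic way to build such a prompt; the function name, prompt wording, and schema are hypothetical, not any specific product's API.

```python
# Hypothetical sketch: building a text-to-SQL prompt from a table schema.
# The LLM call itself is omitted; this shows only the prompt assembly step.

def build_transform_prompt(table: str, columns: dict[str, str], instruction: str) -> str:
    """Assemble an LLM prompt asking for a SQL transformation on `table`."""
    schema = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"You are a SQL assistant. Table `{table}` has columns: {schema}.\n"
        f"Write a single SQL query that does the following:\n{instruction}\n"
        "Return only the SQL, no explanation."
    )

prompt = build_transform_prompt(
    "orders",
    {"order_id": "INT", "amount": "DECIMAL", "created_at": "TIMESTAMP"},
    "Total revenue per calendar month, newest month first.",
)
print(prompt)
```

The returned string would then be sent to whichever model your tooling uses, and the generated SQL treated as a draft to review and test.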

2. Translating SQL Dialects

I've witnessed how often data engineers face the challenge of working with multiple SQL dialects across database systems. It's complicated and time-consuming, draining productivity and wasting days.

However, GenAI can bridge these gaps by translating queries from one SQL dialect to another. This functionality helps organizations in diverse technological environments. 

Here’s how:

  • Automated query translation: Tools powered by GenAI, such as Google Gemini (formerly Bard), can automatically translate SQL queries written in one dialect (e.g., MySQL) to another (e.g., PostgreSQL). 
  • Seamless integration: By translating SQL dialects automatically, GenAI ensures that data from different sources can be integrated smoothly. This is particularly beneficial in environments where data is stored across multiple database systems—each with its own SQL dialect. For example, most companies leverage multi-cloud environments as each cloud provider has its strengths. 
  • Error reduction: Manual translation of SQL queries is prone to syntax errors and inconsistencies. GenAI minimizes these risks by providing accurate and consistent translations, which helps maintain data integrity and reliability.
  • Enhanced collaboration: In organizations where teams use different database systems, GenAI’s ability to translate SQL dialects facilitates better cooperation. Therefore, your team can share queries and insights without worrying about compatibility issues, which builds a more cohesive work environment.
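To make the error-reduction point concrete, here is a deliberately tiny sketch of a few mechanical MySQL-to-PostgreSQL rewrites (identifier quoting, `IFNULL` vs. `COALESCE`, `LIMIT offset, count` syntax). A real GenAI translator handles far more than regex rules can, but these examples show exactly the kind of fiddly, error-prone edits worth automating.

```python
import re

# Illustrative subset of MySQL -> PostgreSQL rewrites; not a complete translator.

def mysql_to_postgres(sql: str) -> str:
    sql = sql.replace("`", '"')                       # backtick -> double-quote identifiers
    sql = re.sub(r"\bIFNULL\(", "COALESCE(", sql)     # MySQL IFNULL -> standard COALESCE
    # MySQL `LIMIT offset, count` -> PostgreSQL `LIMIT count OFFSET offset`
    sql = re.sub(r"\bLIMIT\s+(\d+)\s*,\s*(\d+)", r"LIMIT \2 OFFSET \1", sql)
    return sql

print(mysql_to_postgres("SELECT IFNULL(`name`, 'n/a') FROM `users` LIMIT 10, 5"))
# -> SELECT COALESCE("name", 'n/a') FROM "users" LIMIT 5 OFFSET 10
```

Even in this toy version, getting the `LIMIT` argument order backwards would silently return the wrong rows, which is precisely the class of mistake manual translation invites.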

This can also accelerate migrations from one database to another. For example, if you use Rivery to migrate data from Google BigQuery to Snowflake, Rivery handles the automated data movement as well as the data type mapping between the two. Then, using GenAI, you can easily adapt any downstream transformations currently written in BigQuery's SQL dialect to Snowflake's before orchestrating them in Rivery.

3. Generating Documentation on a Dataset

Comprehensive documentation is essential for maintaining data quality and ensuring accessibility and understanding across all stakeholders. I've experienced many problems when documentation is wrong or incomplete. 

Thankfully, GenAI can streamline the creation of detailed documentation by automatically providing descriptions of datasets, their structures, and usage guidelines. This saves time and improves collaboration within your organization.

Automated documentation can also include guidelines for effective dataset utilization, such as query examples, best practices, and access and permissions information. It showcases sample queries to illustrate how to derive meaningful insights from the data while offering recommendations on data handling practices and security considerations. 

As data volumes and complexities increase, manual documentation maintenance becomes increasingly challenging. However, GenAI seamlessly scales to accommodate large and intricate datasets. This ensures documentation remains relevant and accurate by promptly reflecting continuous data changes and updates.
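The mechanical core of dataset documentation, rendering a schema into readable reference material, can be sketched in a few lines. A GenAI tool would go further and infer the descriptions themselves; here the function name and column metadata are invented for illustration.

```python
# Minimal sketch: render a dataset schema as a Markdown reference table.
# A GenAI documenter would additionally draft descriptions and sample queries.

def document_dataset(name: str, columns: list[dict]) -> str:
    lines = [
        f"# Dataset: {name}",
        "",
        "| Column | Type | Description |",
        "| --- | --- | --- |",
    ]
    for col in columns:
        lines.append(f"| {col['name']} | {col['type']} | {col.get('description', 'TODO')} |")
    return "\n".join(lines)

doc = document_dataset("orders", [
    {"name": "order_id", "type": "INT", "description": "Primary key"},
    {"name": "amount", "type": "DECIMAL"},
])
print(doc)
```

Columns missing a description surface as `TODO`, which is exactly where a GenAI pass, fed the surrounding context, can fill in the gaps.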

4. Generating a Schema from JSON

Another powerful use case for GenAI is generating schemas from JSON data. JSON is popular for its flexibility, but it often lacks the structure needed for efficient data processing. For instance, data extracted from certain sources may include fields with mixed data stored in JSON arrays. These arrays often contain keys representing “nested” fields. 

To make storage and querying easier in a relational database, it’s often necessary to extract these nested fields into separate columns. Some cloud data warehouses, like BigQuery and Snowflake, offer built-in functions to “unnest” or “flatten” these fields. However, in other warehouses, this process can be more manual. With Rivery, you can use a logic River to unnest a JSON field and leverage GenAI to quickly generate the logic needed for this step, saving time compared to doing it manually.
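The unnesting step itself is easy to picture with a small example. The sketch below flattens nested JSON objects into dotted column names, the kind of logic you might ask GenAI to generate for a logic step rather than writing by hand; the sample record is made up.

```python
import json

# Sketch: flatten nested JSON objects into column-style keys (e.g. "address.city"),
# so each nested field can land in its own relational column.

def flatten(record: dict, prefix: str = "") -> dict:
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

raw = json.loads('{"id": 1, "address": {"city": "Austin", "geo": {"lat": 30.27}}}')
print(flatten(raw))
# -> {'id': 1, 'address.city': 'Austin', 'address.geo.lat': 30.27}
```

Handling JSON arrays (exploding them into rows) is the other half of the job, and is where warehouse-native `UNNEST`/`FLATTEN` functions or generated logic earn their keep.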

5. Personalized Data Recommendations 

Personalized data recommendations leverage GenAI’s capabilities to analyze user behavior, preferences, and historical interactions with data to deliver tailored recommendations. 

This presents several advantages for data engineers:

  • User-Centric Data Discovery: GenAI analyzes user profiles, search history, and data consumption patterns to understand individual preferences and interests. This user-centric approach to data discovery ensures that users receive personalized recommendations tailored to unique requirements.
  • Context-Aware Recommendation Generation: GenAI considers contextual factors such as user roles, projects, and business objectives when generating data recommendations. This enhances the relevance and usefulness of recommendations, leading to increased user engagement and satisfaction.
  • Adaptive Recommendation Algorithms: GenAI continuously learns from user interactions and feedback to refine its recommendation algorithms. This adaptive approach ensures that recommendations remain relevant and effective. 
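As a deliberately simplistic illustration of user-centric data discovery, the sketch below ranks catalog datasets by tag overlap with a user's interaction history. A production recommender would use learned embeddings and feedback loops; all names and tags here are hypothetical.

```python
# Toy recommender: rank datasets by how many tags they share with the
# user's history. Stands in for the far richer signals GenAI would use.

def recommend(history_tags: set[str], catalog: dict[str, set[str]], k: int = 2) -> list[str]:
    scored = sorted(
        catalog.items(),
        key=lambda item: len(item[1] & history_tags),  # overlap with user history
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

catalog = {
    "sales_daily": {"sales", "finance", "daily"},
    "web_logs": {"clickstream", "web"},
    "revenue_by_region": {"sales", "finance", "region"},
}
print(recommend({"sales", "finance"}, catalog))
```

The adaptive piece described above corresponds to updating the scoring function from feedback, rather than keeping it fixed as this toy does.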

6. Automated Data Quality Assurance

Ensuring high-quality data is critical in data engineering to maintain the integrity and reliability of analytics and decision-making processes. However, my experience with data shows how time-consuming quality assurance can be. 

I’m pleased that GenAI can automate data quality assurance processes, providing many advantages for data engineers. 

GenAI can automatically conduct comprehensive data profiling to identify anomalies, inconsistencies, and errors within datasets. Automated data profiling and cleansing enhance data accuracy and reliability, mitigating the risk of erroneous insights and decisions.

Moreover, GenAI effortlessly generates validation rules based on historical data patterns and business requirements. These rules are applied to incoming data streams or batch processes to ensure data consistency and compliance with predefined standards. 
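Generated validation rules ultimately compile down to checks applied row by row. The sketch below shows that shape with two invented rules (a non-negative amount and an allowed country list); in practice GenAI would derive the rules and thresholds from historical data patterns.

```python
# Illustrative validation rules of the sort GenAI can derive from history.
# Rule names and thresholds are invented for this example.

RULES = {
    "amount": lambda v: v is not None and v >= 0,
    "country": lambda v: v in {"US", "DE", "IL"},
}

def validate(rows: list[dict]) -> list[tuple[int, str]]:
    """Return (row_index, column) pairs that violate a rule."""
    violations = []
    for i, row in enumerate(rows):
        for column, rule in RULES.items():
            if not rule(row.get(column)):
                violations.append((i, column))
    return violations

rows = [{"amount": 10.0, "country": "US"}, {"amount": -5.0, "country": "FR"}]
print(validate(rows))  # -> [(1, 'amount'), (1, 'country')]
```

The same check function works on a batch or, applied per record, on a stream, matching the dual deployment described above.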

7. Automated Data Source Integration

For data source connections, copilots use AI to automatically detect, configure, and optimize the integration process. They generate the necessary code, handle authentication, and manage data flows, minimizing manual setup and trial and error coding against various APIs. This allows data engineers to connect to data sources more quickly, leading to faster access to insights. 

At Rivery, we designed a Copilot to generate data pipeline configuration in the form of a human-readable YAML file. This makes it easy for any data engineer not only to validate what the Rivery Copilot generated but also to modify it as needed.

The Possible Challenges of GenAI and Data 

Although GenAI will positively transform the data engineer’s role, there could be pitfalls. Data engineers should be wary of these risks. 

Here are some things to consider: 

  • Correctness: One of the most significant risks of GenAI is incorrect output due to weak context or model limitations, resulting in SQL, Python, or other code that doesn't run as expected. Data engineers must therefore test all code generated by GenAI, treating it as a starting point. 
  • Staleness: Another major issue is staleness, since GenAI models are not trained on your latest data or schemas. You can mitigate this by providing more context, such as documentation, but the model's context window limits how many tokens a prompt can accept. 
  • Security: Security is fundamental when handling large datasets. However, GenAI-generated code may cause security vulnerabilities due to the model’s low awareness regarding secure coding practices. 
  • Compliance: Ensuring compliance with relevant legal and regulatory requirements can be tricky with GenAI. You’ll need legal experts to review the code to ensure it adheres to laws and regulations, such as data protection laws (e.g., GDPR, CCPA).

A survey conducted in the latter half of 2023 involving 334 Chief Data Officers (CDOs) and data leaders, sponsored by Amazon Web Services and the MIT Chief Data Officer/Information Quality Symposium, revealed widespread enthusiasm for generative AI. 

However, interviews with these executives highlighted that significant preparation is still needed to embrace this technology fully, especially when it comes to avoiding the pitfalls above. 

The Future of Data Engineering with GenAI

As generative AI continues to evolve, its integration into data engineering will become increasingly refined.

One thing is certain: with the rise of GenAI, data engineers are at the forefront of organizations seeking to build AI applications that unlock new efficiencies, expand product offerings, and enhance customer experiences.

The main challenge data engineers face is bringing in the right data in the appropriate structure for AI applications. Fortunately, there are several options available to achieve this:

Build your own AI architecture: Data engineers can assemble various tools to form their AI data architecture. This includes gathering and loading data, using data integration tools or open-source libraries and Python code, into dedicated vector databases like Pinecone, or storing data in vector format within PostgreSQL (e.g., with the pgvector extension). This approach enables scalable and efficient retrieval and analysis of large volumes of unstructured data. Vector databases are essential for applications that require similarity searches, such as image retrieval systems and document search engines.
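The similarity search at the heart of a vector database can be shown in miniature: embed items as vectors, then return the stored item whose vector is closest to the query by cosine similarity. The three-dimensional vectors and document names below are toy stand-ins; real systems like Pinecone or pgvector use high-dimensional embeddings and approximate indexes at scale.

```python
import math

# Toy in-memory "vector store": exact cosine-similarity search over 3-D vectors.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

def nearest(query: list[float]) -> str:
    """Return the stored document whose vector is most similar to the query."""
    return max(store, key=lambda name: cosine(store[name], query))

print(nearest([0.9, 0.1, 0.0]))  # -> doc_a
```

Swapping the exact scan for an approximate nearest-neighbor index is what makes this workable over millions of embeddings.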

Lean on your existing Cloud Data Warehouse /Lake: Another option is to load data into cloud data warehouses or lakes, leveraging existing data infrastructure along with the LLM or vector capabilities of these platforms. For instance, data can be loaded into Snowflake using a tool like Rivery to extract and transform it for use with large language models (LLMs). This allows for analysis using Snowflake’s Cortex functions on relevant and up-to-date data through simple SQL queries.

Leverage an AI abstraction service: You can also utilize services like Amazon Q, a fully abstracted solution that manages everything from the RAG workflow to the underlying infrastructure (eliminating the need to choose a vector database) down to the user interface (i.e., a web chatbot). To use Amazon Q, you only need to load data into the platform. This is where Rivery comes in, managing the loading of data into a dedicated S3 bucket and triggering an Amazon Q sync to ensure the freshest data is available to users. 

Use dedicated AI Applications: Consider leveraging specialized AI applications for specific use cases; the market for dedicated AI tools grows by the day. Kapa.ai, for example, targets customer support. If you need specific AI capabilities without extensive development time, such tools offer pre-built data integrations matched to their use case (for Kapa.ai, built-in integrations with documentation websites, ticketing systems such as Zendesk, Slack channels, and more). By deploying AI-powered chatbots using Kapa.ai's templates, companies can automate responses to common queries, access customer records, and provide real-time troubleshooting assistance. The platform's pre-trained models can also be fine-tuned with customer-specific data to enhance response accuracy.

Moving Forward


As data engineers evolve, their role will shift from merely managing data pipelines to becoming strategic partners who harness AI-driven insights to foster business innovation.

By embracing generative AI (GenAI), your team can unlock new opportunities, enhance efficiency, and maximize the value derived from data. The future of data engineering is rooted in the seamless integration of AI technologies, and those who adapt will be at the forefront of this exciting new era.

At Rivery, we are fully committed to this transformation by incorporating generative AI into our workflows and, most importantly, our product. We've reimagined the next generation of data pipelines to be AI-powered. With our Copilot, you can easily connect to any data source that has a REST API endpoint. This means that even if we don't offer a native connector for a specific source, we can still help you access all your data effortlessly, regardless of your level of data engineering expertise. While Rivery is the first data integration platform with such capability, we are not stopping there and have grand plans to continue incorporating AI to help our users do more with their data.
