Taylor McGrath
APR 10, 2020
5 min read

Syncing data sources with your cloud data warehouse is essential for data congruity and data analysis. But how can you accomplish this without all the hassle?

In this Rivery Tip, we’ll outline how to keep your cloud data warehouse synced with your data sources, all within a matter of clicks.

There are two parts to this solution. First, you must conduct a historical backfill to sync your cloud data warehouse with all the existing data from a source. And second, you have to configure incremental loading to ensure that your cloud data warehouse remains up to date.

So, let’s dive in!

Perform a Historical Backfill – Using an API Connection

A historical backfill allows you to pull all the existing data from a source into a cloud data warehouse. This “backfills” the cloud DWH with all the data, both past and present, from the source.

To perform a historical backfill, you must first establish the entity or report that you want to pull into your cloud data warehouse. You can do this from the Source tab for any Data Source to Target River that uses an API.

The Please choose time period option allows you to select a predefined or custom date range to demarcate the historical backfill. These ranges are sometimes dictated by the API itself, since certain APIs do not hold data indefinitely.

Let’s say that you want to load Facebook Ads data, starting at the beginning of 2019:

  • Select Custom Date Range in the time period selection dropdown.
  • Choose January 1, 2019 as the Start Date.
  • Specify an End Date, or leave it empty. When the End Date is left empty, Rivery will assign the current date.
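The date-range behavior above can be sketched as a small helper. This is purely illustrative; the function name and signature are hypothetical, not part of Rivery's platform:

```python
from datetime import date

def backfill_window(start_date, end_date=None):
    """Resolve the historical-backfill window: an empty End Date
    defaults to the current date, mirroring the behavior above."""
    return start_date, end_date or date.today()

# Facebook Ads backfill starting at the beginning of 2019,
# with no End Date specified:
start, end = backfill_window(date(2019, 1, 1))
```

Here `end` resolves to the current date, so the backfill covers everything from January 1, 2019 through today.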

Configure Incremental Loading – Using an API Connection

Incremental loads update your cloud data warehouse with the current source data. In Rivery, once the historical backfill is complete, no action is necessary to configure incremental loads.

The Start Date will automatically change to the date of the last pull, and the End Date will default to the current day. The incremental load will resume where it left off, and pull data through the present date.
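The rolling window can be pictured with a toy state object. Rivery tracks this bookkeeping internally; the class below is only an illustration of the logic, not its actual implementation:

```python
from datetime import date

class IncrementalState:
    """Toy bookkeeping for the rolling incremental window:
    each run starts where the last pull ended and runs through today."""
    def __init__(self, backfill_start):
        self.last_pull = backfill_start

    def next_window(self):
        start, end = self.last_pull, date.today()
        self.last_pull = end  # the next run resumes from here
        return start, end
```

Each call to `next_window` returns a window whose start is the previous window's end, so no data is skipped and no manual reconfiguration is needed.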

You will, however, need to set the right Loading Mode to conform with the loading requirements of your cloud data warehouse. For incremental loading, we recommend Append Only or Upsert-Merge.

Append Only simply adds all the pulled data into the target tables, without overwriting any past data.

Upsert-Merge replaces matching rows, keeps unmatched rows, and adds new rows. This mode requires you to define a primary key in the Column Mapping tab.

In Column Mapping, click the key symbol next to the fields you want to use as a key. By setting a field as your primary key, Rivery will check the source data values of this field (data being pulled in each run) against the target values of this field (data already stored in your target table).

If a source key value does not match any value in the target, the corresponding record will be inserted (appended) into the target table.

If a source key matches a value that already exists in the target table, the source record will replace the existing target record.

That’s how Upsert-Merge only adds new or changed data into your cloud data warehouse.
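The match/replace/insert logic can be sketched in a few lines. This is a simplified in-memory illustration of the general upsert-merge technique, not Rivery's warehouse implementation:

```python
def upsert_merge(target_rows, source_rows, key):
    """Toy upsert-merge keyed on a primary key: matching target rows
    are replaced, unmatched target rows are kept, and source rows
    with new keys are appended."""
    merged = {row[key]: row for row in target_rows}
    for row in source_rows:
        merged[row[key]] = row  # replace on match, insert on miss
    return list(merged.values())

target = [{"id": 1, "clicks": 10}, {"id": 2, "clicks": 5}]
source = [{"id": 2, "clicks": 7}, {"id": 3, "clicks": 1}]
```

Merging `source` into `target` on `"id"` keeps row 1 untouched, replaces row 2 with the newer values, and appends row 3.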

Historical Backfilling and Incremental Loading – Using a Database Connection

The previous examples dealt with API data sources. Here’s how to perform historical backfilling and incremental loading with a database.

When connecting a database as your source, use Multi-Tables as your River mode to ingest data from multiple tables at once.

This is the best option for historically backfilling an entire database into your cloud data warehouse.

In the Target tab, set your default loading mode to ‘Upsert-Merge’. If the tables do not already exist in your target, Rivery will create them automatically.

In the Mapping step, there is a list of schemas in your source database. Select them to see the metadata of the underlying tables.

Click on the Edit button on the far right of the table name to see the metadata for an individual table.

Here you can define the primary key for Upsert-Merge, just like in the previous section. Load tables incrementally by clicking the Table Settings tab and selecting ‘Incremental’ as your extract mode.

Then define the field that will create your increments. This can be either a date field or an integer used as a running number.
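Conceptually, an incremental extract filters the source table on the increment field. The sketch below builds such a query as a string; the function name, table, and column are hypothetical examples, not Rivery internals:

```python
def incremental_extract_sql(table, increment_field, last_value):
    """Illustrative incremental extract: pull only rows beyond the
    last loaded value of the increment field, which can be a date
    column or a monotonically increasing integer."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {increment_field} > {last_value!r}"
    )

sql = incremental_extract_sql("orders", "updated_at", "2019-12-31")
```

Each run then only extracts rows newer than the high-water mark of the previous run, rather than re-reading the whole table.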

Once everything is set up:

  • The first execution will pull the entire history and create the tables in the target data warehouse.
  • Future loads will abide by the conditions set in the ‘Loading Mode’ and ‘Extract Method’ configurations.

That’s how you can configure a database source for both historical backfill and subsequent incremental loads all at once.

Maintain Congruity with Your Data Sources – In a Few Easy Clicks

Data congruity is essential for accurate and effective data analysis. With Rivery, teams can perform historical backfills and incremental loading in a few simple steps, whether the data source is an API or a database.

But we’re just getting started here. Keep your eye out for more Rivery Tips in the near future!

Simple Solutions for Complex Data Pipelines

Rivery's SaaS ELT platform provides a unified solution for data pipelines, workflow orchestration, and data operations. Some of Rivery's features and capabilities:
  • Completely Automated SaaS Platform: Get set up and start connecting data in the Rivery platform in just a few minutes, with little to no maintenance required.
  • 200+ Native Connectors: Instantly connect to applications, databases, file storage options, and data warehouses with our fully-managed and always up-to-date connectors, including BigQuery, Redshift, Shopify, Snowflake, Amazon S3, Firebolt, Databricks, Salesforce, MySQL, PostgreSQL, and REST API, to name just a few.
  • Python Support: Have a data source that requires custom code? With Rivery’s native Python support, you can pull data from any system, no matter how complex the need.
  • 1-Click Data Apps: With Rivery Kits, deploy complete, production-level workflow templates in minutes with data models, pipelines, transformations, table schemas, and orchestration logic already defined for you based on best practices.
  • Data Development Lifecycle Support: Separate walled-off environments for each stage of your development, from dev and staging to production, making it easier to move fast without breaking things. Get version control, API, & CLI included.
  • Solution-Led Support: Consistently rated the best support by G2, receive engineering-led assistance from Rivery to facilitate all your data needs.

Minimize the firefighting.
Maximize ROI on pipelines.
