Data Source To Target Overview
- Updated On 01 Dec 2020
- 7 Minutes To Read
A Data Source to Target river enables the ingestion of data from a source connector into a cloud target. In this article, we'll walk through the steps involved in creating, scheduling, and monitoring a Data Source to Target river.
To create a Data Source to Target river, click 'Create New River' and select 'Data Source to Target' as the river type.
The first step of the river is to give it a name and (optional) description. You can add this river to a group as a means of organization in your Rivery account.
The Source step is where you select the data source you would like to ingest data from and fill out respective parameters to further customize this ingestion.
You are first given a list of supported Rivery connectors to choose from as your source.
Regardless of which data source you choose, you'll first have to choose or define your connection. If you've already made a connection to the source in the 'Connections' tab, you should see this connection available in the dropdown. If you haven't yet made a connection, you can create one on the fly by selecting 'Create Connection.'
Depending on which data source you choose, the Source tab will provide different options for customizing and shaping the data to be pulled into your target.
For example, when connecting to an API connector such as Facebook Ads, you'll be presented with parameters and options that reflect what the API offers. For example, in Facebook Ads, you are first given the option to select which Facebook Ad report you would like to pull into your target.
For connecting to a relational database source , the Source tab will look a bit different. Instead of pulling pre-defined reports and selecting accounts, campaigns, date ranges, etc. we are prompted to select how we would like to pull data from our source database.
Selecting Time Periods
Depending on what type of source connector you are using, you will have different options for the range of data you wish to pull.
In most API connections, there will we an option to select a time period to pull data from:
In the above dropdown you'll find pre-canned options for time period to pull, as well as a custom date range option.
A note about the custom date range option: Upon the first run with a custom date range, the river will pull data from the Start Date indicated to the End Date indicated (if End Date is left empty, it will pull data from the Start Date until the current time). The next time the river runs, the Start Date will be updated to reflect the date/time of the latest run, and the End Date will be turned empty. Thus, the river load will pick up where it left off from the previous load.
'Last Days Back' option
In some source connections, there is a 'Last Days Back' option. This means that in addition to the time period specified, Rivery will add X number of retroactive days to the data pull. For example, if the river time period is set to 'Yesterday' and 'Last Days Back' = 1, then the river will pull both yesterday's and the prior day's data.
In our RDBMS sources (MySQL, SQL Server, Oracle, PostgreSQL), as well as some others (Salesforce, Netsuite, to name a couple) there is an option to set the Extract Method to load the data incrementally.
What this means is that instead of extracting all of the data in a given table or report, you can identify a field that will serve as your increment. Most commonly this is a timestamp or ID field of some sort. For example, if I set my river to load incrementally based on a field in my table called DATE_LAST_MODIFIED, then each time the river runs, Rivery will check to see what the largest DATE_LAST_MODIFIED value was from the previous run, and then only take records with a greater DATE_LAST_MODIFIED to be sent to the target data warehouse.
It is important to note that the "Start Date" for the incremental field is inclusive, while a set "End Date" defaults to exclusive, meaning that for the following scenario:
Start Date = 2020/01/10 00:00:00,
End Date = 2020/01/11 03:00:00,
The query in the backend would have the following range:
start_date >= 2020/01/10 00:00:00
end_date < 2020/01/11 03:00:00
There is an option to specify that the end value should be included in these increments:
which would then change the above query range to look like:
start_date >= 2020/01/10 00:00:00
end_date <= 2020/01/11 03:00:00
The incremental loading method can save time and RPUs as since it is only processed the delta between runs, the data volume will be presumably lower than extracting all data every run.
Once you have configured the Source tab to your requirements, move on to the Target tab.
In the Target tab, you will first be prompted to select your cloud target. This is where the data coming from the source will land.
Currently, Rivery supports the following cloud data warehouses and cloud file storage as targets:
Regardless of what target you choose, you will need to set the connection. Again, you can either choose an existing connection or create one on the fly.
If your target is a cloud data warehouse, the Target tab will look something like this:
You'll have to define a database, schema, and target table. This is where the data pulled from your selected source will land. Rivery will automatically detect available databases and schemas for you to choose.
Next, choose the loading mode you require:
- Overwrite: Replace the target table completely with the data from this run. If the target table does not yet exist, it will be created automatically.
- Upsert-Merge: Only add new or changed records to the target table based on a defined key in the Column Mapping.
- Append Only: Union the data from this load onto the existing target table.
If you choose to use Upsert-Merge functionality, you can define your merge key by clicking in the Column Mapping (the next tab).
Upsert-Merge also has the ability to Filter on Logical Keys. This allows for deduplication within the Source records (Existing Target records will not be deduplicated). This can be done by adding an expression (in Target Database syntax) to the 'Filter Order Expression' field. This acts an ORDER BY function in SQL with the River only pulling the first record.
If your target is cloud storage, the parameters will look like this:
You'll need to identify the desired bucket and file path for the data to land. The default is the bucket used to set up the connection.
The Column Mapping tab enables the auto-detection of a table schema and allows for additional customization and edits to the metadata of the target table before the data is loaded.
To auto-detect the schema outlined from the Source tab you configured, select 'Auto-Mapping.' The corresponding fields and their data types will populate.
If Upsert-Merge was selected as your loading mode in the Target tab, you can define the primary key to use to define a match for when data from the source should be merged. You can do this by clicking the key symbol next to the desired key field so that it is colored yellow. If you click more than one field to be the key, then this will act as a combination key, meaning that the combination of the two field values will act as the primary key in the upsert-merge.
In addition to setting a primary key, you can edit the target field name, data type, and mode of any field. This enables renaming fields and casting data types during the load process itself.
The 'Cluster Key' option can be used to define a partition field.
The 'Expression' option allows for SQL syntax to be used to either define a new field or alter the definition of an existing field. The SQL syntax to be used is that of the target data warehouse you have selected.
Once you have configured the Column Mapping step of your river, you can either run it on demand (click the 'Run' button on the bottom of the screen) or schedule it.
Click 'Schedule Me' to pop open the scheduling options. For more custom scheduling frequencies, you can enter a Cron expression.
If you would like to be notified upon failure or warning, simply toggle this to be true and either enter your email address, or edit the existing variable on the 'Variables' page.
Once the river is running, click into the Activities tab to monitor the progress. For rivers pulling multiple entities at once, you can group the run statuses by individual run or by target table.
If an error is encountered, you can drill down into the error message of the river.
By drilling into an individual run, you can see the time it took to run, as well as the RPU amount the execution used.