Loading data files to an analytics database in the cloud is a common and essential task for anyone responsible for data and insights.
Regardless of the source of the file and its content, loading it to a database is necessary to facilitate data analysis.
It is much easier to filter and manipulate data with simple queries than with more limited tools such as Excel – and when it comes to joining and aligning data from multiple sources, a robust database is the obvious choice.
Loading data files can be done with code (Python, Java, etc.) that opens the file, connects to the database, and uploads the data to the table.
This task might seem like a straightforward line of code, but in reality it's a tricky job that involves many difficulties and sub-tasks.
This is one of the key pains Rivery helps solve, enabling BI and data teams to aggregate data and load it to the cloud within 5 minutes of configuring a river – without the need for code.
Simply tell Rivery where the file is located and where and when you would like the data to be loaded, and your data will be in the cloud, ready for analysis.
However, there are certain things that need to be taken into account when loading data files to any database in the cloud.
The list below represents some of the challenges BI analysts face and how we’re solving them:
Often the files you are about to load contain duplicate data.
There could be many reasons for this, but it's important to avoid duplications in your tables. Imagine if duplicates in your database led to wrong calculations on business metrics such as revenue or spend…
Duplication in the files you are about to load might be due to files containing overlapping data over a specific period of time.
For example, each individual file could be a report covering the past 7 days. Loading these to a database without paying attention to duplications would produce a table that contains each record 7 times!
The best way to exclude duplications is by using your database engine, after the initial data is loaded to the table.
With a simple query, you will be able to rank all records based on the key of your data – giving you the ability to exclude any duplicated row. This way, your data will be left clean without any duplicates.
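The ranking approach described above can be sketched roughly as follows, using SQLite as a lightweight stand-in for a cloud warehouse. The table and column names (`staging`, `order_id`, `loaded_at`) are hypothetical examples, not Rivery's actual schema:

```python
import sqlite3

# Stand-in warehouse: a staging table holding raw, possibly duplicated rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (order_id INTEGER, amount REAL, loaded_at TEXT)")
conn.executemany(
    "INSERT INTO staging VALUES (?, ?, ?)",
    [
        (1, 10.0, "2021-01-01"),
        (1, 10.0, "2021-01-02"),  # duplicate of order 1 from an overlapping file
        (2, 25.0, "2021-01-02"),
    ],
)

# Rank records by the key of the data and keep only the latest copy of each.
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_id, amount, loaded_at FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id ORDER BY loaded_at DESC
               ) AS rn
        FROM staging
    )
    WHERE rn = 1
""")

rows = conn.execute("SELECT order_id, amount FROM orders ORDER BY order_id").fetchall()
print(rows)  # [(1, 10.0), (2, 25.0)] – each key appears exactly once
```

The same `ROW_NUMBER() OVER (PARTITION BY …)` pattern works in most cloud warehouses, though the exact syntax for materializing the deduplicated result varies by engine.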
Data files are usually routinely added to cover a determined period of time, so the process of loading files to a database is carried out at whatever frequency the business needs.
For example, files with website user data might be available every half an hour, while more comprehensive reports on campaign performance might be available on a daily basis.
There is obviously no need to manually execute routine uploads that could run automatically every minute or hour. Instead, data analysts and BI professionals have to act like the conductor of an orchestra, creating a harmonious flow that runs in sync with the required frequency.
This can be done with Python code that wraps or triggers the loading process. The code can be triggered by a basic Linux command or a cloud job scheduler such as Rundeck.
To simplify this step, we added scheduling as a key feature within Rivery – helping users configure their data flow without code. It’s as simple as choosing when and how often you’d like your data flows to run.
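As a rough illustration of the frequency logic a scheduler has to handle, the sketch below computes the run times implied by a configured interval. The helper names and the half-hourly/daily intervals are illustrative assumptions taken from the examples above, not Rivery's internals:

```python
from datetime import datetime, timedelta

def due_runs(start: datetime, end: datetime, every_minutes: int):
    """Return every run time in [start, end) at the configured frequency.
    Hypothetical helper: a real scheduler (cron, Rundeck, etc.) would
    trigger the loading job at each of these moments."""
    runs = []
    t = start
    while t < end:
        runs.append(t)
        t += timedelta(minutes=every_minutes)
    return runs

# Half-hourly website-data files vs. daily campaign reports, over one day:
day_start = datetime(2021, 6, 1)
day_end = day_start + timedelta(days=1)
print(len(due_runs(day_start, day_end, 30)))    # 48 runs per day
print(len(due_runs(day_start, day_end, 1440)))  # 1 run per day
```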
So you have managed to pull the data from the files and are about to load it into your target table. Will you append all the records to the table? Overwrite it? Or merge the data from the files with the data already in your target table?
Each loading method can be executed in multiple ways, and each database has its own best practices. Loading to a target table is important, and it's key to understand that loading data from files doesn't end with pulling the data. That's where it begins…
Rivery supports appending data to a target table, overwriting it, and merging it – performing each of these loading methods according to the best practices of your target database. Users just choose how they'd like to load the data, and Rivery makes it happen.
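To make the three loading methods concrete, here is a minimal sketch against SQLite. The table name and the merge-on-key logic are illustrative assumptions; the warehouse-specific equivalent would be `MERGE` or an upsert statement in your target database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO target VALUES (1, 'old')")

def load(conn, rows, mode):
    """Hypothetical loader showing the three loading methods."""
    if mode == "overwrite":
        conn.execute("DELETE FROM target")  # discard existing data first
        conn.executemany("INSERT INTO target VALUES (?, ?)", rows)
    elif mode == "append":
        conn.executemany("INSERT INTO target VALUES (?, ?)", rows)
    elif mode == "merge":
        # Upsert: update existing keys, insert new ones.
        conn.executemany(
            "INSERT INTO target VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET value = excluded.value",
            rows,
        )

load(conn, [(1, "new"), (2, "fresh")], "merge")
rows = conn.execute("SELECT id, value FROM target ORDER BY id").fetchall()
print(rows)  # [(1, 'new'), (2, 'fresh')] – key 1 updated, key 2 inserted
```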
Manipulating data can be a necessity when there’s a discrepancy between the way data appears in the source files and how it should look in your target table.
For example, files which contain sensitive or private information may need to be loaded with changes that protect the information in the raw data.
This means that the process which loads data should be able to skip, exclude, or manipulate the value of the data in certain columns.
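A skip/hash transformation of this kind can be sketched as below. The column names (`ssn`, `email`) and the masking policy are hypothetical examples of sensitive fields, not a prescribed scheme:

```python
import csv
import hashlib
import io

# Hypothetical policy: drop the `ssn` column entirely, hash `email` so the
# value can still be joined on but not read.
SKIP_COLUMNS = {"ssn"}
HASH_COLUMNS = {"email"}

def transform_row(row: dict) -> dict:
    """Skip, exclude, or manipulate column values before loading."""
    out = {}
    for col, val in row.items():
        if col in SKIP_COLUMNS:
            continue  # excluded from the target table
        if col in HASH_COLUMNS:
            val = hashlib.sha256(val.encode()).hexdigest()[:12]
        out[col] = val
    return out

raw_file = io.StringIO("name,email,ssn\nAda,ada@example.com,123-45-6789\n")
rows = [transform_row(r) for r in csv.DictReader(raw_file)]
print(list(rows[0]))  # ['name', 'email'] – ssn never reaches the database
```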
At the most basic level, ahead of loading any data, you need a table in the database to store the data in the files.
The columns in that table must have the right types to fit the source data. For example, it would be impossible to load string values into integer or boolean columns.
Going over the data in the files and determining the type of each column can be an exhausting task, especially with large data sets.
To solve this tedious problem, Rivery automatically detects the type of each data column and uses that mapping to create a table in the database.
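A simplified version of such type detection might look like the sketch below, which infers a type per column from sample values and emits a `CREATE TABLE` statement. The type names and precedence rules are a minimal assumption, not Rivery's actual mapping logic:

```python
def infer_type(values):
    """Infer a column type from sample string values (simplified sketch)."""
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    if all(is_int(v) for v in values):
        return "INTEGER"
    if all(is_float(v) for v in values):
        return "FLOAT"
    if all(v.lower() in ("true", "false") for v in values):
        return "BOOLEAN"
    return "VARCHAR"  # fall back to the most permissive type

def create_table_sql(name, columns):
    """Build a CREATE TABLE statement from column-name -> sample-values."""
    cols = ", ".join(f"{c} {infer_type(vs)}" for c, vs in columns.items())
    return f"CREATE TABLE {name} ({cols})"

sample = {"user_id": ["1", "2"], "score": ["3.5", "4.0"], "country": ["US", "FR"]}
print(create_table_sql("events", sample))
# CREATE TABLE events (user_id INTEGER, score FLOAT, country VARCHAR)
```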
The volume of the data is an issue which must be addressed when loading files.
The resources required to load thousands of files are very different from those needed to load small amounts of data. When managing large amounts of data, it is important to take into account factors such as the disk size and memory of the machine that runs the process.
This process can be broken down by loading the data gradually, in batches. The upside of a platform like Rivery is that regardless of the size or number of the files, there's a single, reliable loading process.
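Gradual loading usually means reading and inserting rows in fixed-size batches so memory stays bounded regardless of file size. A minimal sketch, with an assumed two-column table and an arbitrarily small batch size for illustration:

```python
import csv
import io
import sqlite3

def load_in_batches(reader, conn, batch_size=2):
    """Insert rows in fixed-size batches so memory use stays bounded
    no matter how large the source file is. Returns the row count."""
    batch, total = [], 0
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO t VALUES (?, ?)", batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO t VALUES (?, ?)", batch)
        total += len(batch)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
data = io.StringIO("1,a\n2,b\n3,c\n4,d\n5,e\n")
total = load_in_batches(csv.reader(data), conn)
print(total)  # 5 rows loaded, never more than batch_size held at once
```

In production the batch size would be thousands of rows, and warehouses typically prefer bulk-copy commands over row inserts, but the bounded-memory shape is the same.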
In some cases, in addition to loading data to target tables, it's useful to store the data files in the cloud for further analysis or archiving.
The way those files are organized matters, and creating the right structure for them is important.
Partitioning can be done on a daily basis, creating a folder for each day with the files that were uploaded on that day, but it can also be done at different frequencies – from months to minutes.
Because this step might be easily overlooked, we added partitioning by default to Rivery, so your data is always safely organized and archived.
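The folder-per-period scheme described above can be sketched as a small path-building helper. The bucket name and the `year/month/day` layout are illustrative assumptions, not a fixed convention:

```python
from datetime import datetime

def partition_path(base: str, ts: datetime, freq: str = "day") -> str:
    """Build a storage path that partitions files by upload time.
    Hypothetical helper: one folder per day by default, but the same
    idea works for coarser or finer frequencies."""
    patterns = {
        "month": "%Y/%m",
        "day": "%Y/%m/%d",
        "hour": "%Y/%m/%d/%H",
        "minute": "%Y/%m/%d/%H/%M",
    }
    return f"{base}/{ts.strftime(patterns[freq])}"

ts = datetime(2021, 6, 1, 14, 30)
print(partition_path("s3://archive/files", ts))          # s3://archive/files/2021/06/01
print(partition_path("s3://archive/files", ts, "hour"))  # s3://archive/files/2021/06/01/14
```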