For data teams big and small, machine learning models are vital to analyzing and deriving insights from datasets of all kinds. In recent years, Google BigQuery has made building and deploying machine learning models easier and more accessible for data-driven teams.

 

With recently added support for custom BigQuery SQL queries, Rivery’s Logic Rivers can now run machine learning models in Google’s cloud data warehouse directly from the platform. Using Rivery, teams can automate the deployment of machine learning models in BigQuery, augmenting their data pipelines with predictive insights.

 

Here’s a step-by-step guide on how to pull this off, based on an example from a famous open source dataset.

 

Machine Learning in BigQuery: What You Need to Know

 

Teams can create and execute machine learning models in BigQuery using standard SQL. BigQuery makes machine learning more democratized by allowing any team to build models using existing SQL tools and skills. BigQuery currently supports the following popular machine learning models:

 

  • Linear regression
  • Binary Logistic regression
  • Multiclass logistic regression
  • K-Means clustering
  • Matrix factorization
  • K-means clustering
  • Time-series forecasting
  • Boosted Tree/XGBoost based classification/regression
  • Deep Neural Networks (DNN)
  • Custom Tensorflow ML models created in languages such as Python

 

Teams can take advantage of the computation and storage capabilities available in the cloud by using Rivery to push their data to BigQuery, resulting in a more efficient process to build, train, and deploy ML models.

 

Step-by-Step Example: Pima Diabetes Dataset

 

In this example, we’ll use the Rivery platform, BigQuery, and binary logistic regression, to predict a set of patients’ risk for diabetes using the famous Pima diabetes dataset. The Pima dataset details whether or not a patient will develop diabetes, based on certain diagnostic health measurements.

 

Let’s get started!

 

1. Create two Rivers, with both Types set to Data Source to Target, to transfer the Pima diabetes dataset from Google Sheets into the BigQuery database.

2. One of the Rivers should bring in the data for the patients that you wish to predict the diabetes risk for, in the format shown below:

3. Create a Logic River. Within the Logic River, create a container that concurrently calls both Rivers created in Step 1. This represents the extraction and loading steps of the ELT workflow.

4. Add another Step to the Logic River, outside of the container in Step 3. Choose “SQL/Script” and “BigQuery”. Select the same BigQuery database connection from the River in Step 1.

5. Split the dataset into a training set (to train the model) and a testing set (to validate the model). Generally, teams use an 80/20 train/test split for ML models. Create a Logic Step that executes the ML model using 80% of the available data. This represents the training set. Running the River at this juncture will generate a BigQuery logistic regression model with the name specified in the project directory.

6. Evaluate the performance of the model by running the 20% of the data that was set aside for testing. Create another Logic Step. Select “SQL/Script” and “BigQuery”. Set the “Target” to a table name of your choosing.

7. When you run this River, you’ll see technical details about the ML model’s performance within a table in the BigQuery project directory.

8. See the model in action: predict if patients are at a high risk for developing diabetes. Create another Logic Step with the same settings as in Step 6. Set the “Target” table to a table name of your choosing. In this example, the target table is “diabetes_new_patient_data_predictions”.

9. Create one more Logic Step with the same settings as in Step 6 to join the patient_ids to the predictions of the model. A simple join in another Logic Step can accomplish this. Select the “Target” and save this final predictions table.

10. Run the River. In BigQuery, preview the table named in Step 9. This will show the model’s predictions for each patient. “1” means likely to develop diabetes, “0” means unlikely.

11. Here’s what the finished ELT pipeline should look like:

Rivery Automates the Deployment of ML Models in BigQuery

 

Rivery’s Logic Rivers automate the training, testing, evaluation, and deployment of machine learning models in BigQuery. This enables teams to unlock a whole new universe of predictive insights with their BigQuery data, leading to deeper, more insightful analysis that provides an edge over the competition.

 

Of course, the ML prediction model based on the Pima dataset represents just one use case. But whether it’s predicting customer purchases based on engagement, or recommending movies based on viewing history, the possibilities for machine learning models are endless. And Rivery is there to make sure you get the most out of them.

 

For a more detailed version of this implementation, including specific SQL query syntax used in the creation, validation, and deployment of this model, please visit the Rivery Community.