Introducing Azure Machine Learning Pipelines

In Azure Machine Learning, you define a sequence of data transformation and machine learning tasks as a pipeline. A pipeline includes all of the steps required to import and process the data and to train a machine learning model.

Each pipeline step (called a module) is an independent unit of work that can run on any valid compute instance or cluster. The pipeline manages the flow of execution from one module to the next.

Most pipelines begin with a dataset from which the training data is imported. The data then passes through a succession of data transformation modules until the result is fed into a machine learning algorithm to train a model.
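
To make this concrete, here is a minimal sketch of a two-step pipeline defined with the Python SDK (v1, the azureml-sdk package) instead of the graphical designer. The compute target name "cpu-cluster" and the scripts prep.py and train.py are hypothetical names used only for illustration.

```python
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Connect to the workspace described by a local config.json file.
ws = Workspace.from_config()

# Each step runs a script on a compute target ("cpu-cluster" is hypothetical).
prep_step = PythonScriptStep(
    name="prepare-data",
    script_name="prep.py",        # hypothetical data preparation script
    source_directory=".",
    compute_target="cpu-cluster",
)
train_step = PythonScriptStep(
    name="train-model",
    script_name="train.py",       # hypothetical training script
    source_directory=".",
    compute_target="cpu-cluster",
)

# The pipeline manages execution order: train only after preparation finishes.
train_step.run_after(prep_step)

pipeline = Pipeline(workspace=ws, steps=[train_step])
Experiment(ws, "demo-pipeline").submit(pipeline)
```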

The Azure Machine Learning Designer

Azure Machine Learning Designer is a graphical environment for creating pipelines. It allows for no-code machine learning development through a drag-and-drop interface.

The designer includes a wide range of predefined modules for data loading and transformation, model training, and validation. There are also modules for running custom Python, R, and SQL scripts.

Standard pipeline modules

The Azure Machine Learning Designer includes modules for transforming data, training models, generating predictions (scoring), and comparing those predictions with the actual data labels.

The general approach for each pipeline is the same (the sketch after this list mirrors these steps in code):
  • Import data from a dataset.
  • Transform any data columns that need additional processing to prepare them for training.
  • Select a training algorithm. The designer supports a large selection of algorithms for regression, classification, and clustering.
  • Train the model by fitting the training algorithm to the training data.
  • Use the trained model to generate predictions for a held-out subset of records in the dataset.
  • Compare the predictions to the actual labels to evaluate the effectiveness of the model.
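
As a plain-code illustration of these same six steps, here is a short scikit-learn sketch using the California Housing data mentioned at the end of this section. It is not designer code; it only mirrors the flow of a typical designer pipeline.

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Import data from a dataset.
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# 2. Transform columns to prepare them for training (here: standard scaling).
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# 3 + 4. Select a regression algorithm and fit it to the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# 5. Generate predictions for a held-out subset of records.
predictions = model.predict(X_test)

# 6. Compare predictions with the actual labels to evaluate the model.
print("Mean squared error:", mean_squared_error(y_test, predictions))
```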

We will explore these pipeline steps in more detail in the upcoming lessons and assignments.

Advanced pipeline modules

The pipeline designer includes many modules that provide common data transformations. However, you may sometimes want to implement a custom transformation using your own SQL, Python, or R code.

To support these scenarios, the designer includes the following advanced modules:
  • Apply SQL Transformation: uses a SQL statement to transform one or more columns in the dataset.
  • Execute Python Script: runs a custom Python function that can process up to two input dataframes and must return one or two output dataframes (see the sketch after this list).
  • Create Python Model: lets you provide Python code that produces a fully trained model, in case you need a training setup that the graphical designer does not support.
  • Execute R Script: runs a custom R function that processes up to two input dataframes and returns one or two output dataframes.
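
As an example of the Execute Python Script convention, here is a minimal sketch. The module invokes a function named azureml_main that receives up to two pandas dataframes (one per input port) and returns a sequence of one or two dataframes. The column name MedInc is a hypothetical example.

```python
import numpy as np
import pandas as pd

# Entry point called by the Execute Python Script module; dataframe1 and
# dataframe2 correspond to the module's two optional input ports.
def azureml_main(dataframe1: pd.DataFrame = None, dataframe2: pd.DataFrame = None):
    # Example transformation: add a log-scaled copy of a numeric column.
    # The column name "MedInc" is hypothetical; adapt it to your dataset.
    if "MedInc" in dataframe1.columns:
        dataframe1["MedInc_log"] = np.log1p(dataframe1["MedInc"])
    # Return a sequence of one or two dataframes, one per output port.
    return dataframe1,
```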

In the upcoming assignment, we're going to build a pipeline that transforms the California Housing data and prepares it for training a machine learning model.