Amazon SageMaker AI by Example
An introduction using a simple machine learning problem.
This article walks through the first exercise, Linear regression, of the excellent Machine Learning Crash Course tutorial, but implements it in Amazon SageMaker AI instead.
While it is assumed that the reader has already completed the Linear regression module, we will be going back through it to illustrate the concepts using Amazon SageMaker AI.
This article is organized around the typical machine learning (ML) workflow as described in the document Overview of machine learning with Amazon SageMaker AI.
Prerequisites
If you wish to follow along, you will need to have completed the instructions provided in the Guide to getting set up with Amazon SageMaker AI.
Fetch
From Overview of machine learning with Amazon SageMaker AI.
To train a model, you need example data. The type of data that you need depends on the business problem that you want the model to solve. This relates to the inferences that you want the model to generate. For example, if you want to create a model that predicts a number from an input image of a handwritten digit. To train this model, you need example images of handwritten numbers.
The exercise uses the following data (download).
The dataset for this exercise is derived from the City of Chicago Taxi Trips dataset. The data for this exercise is a subset of the Taxi Trips data, and focuses on a two-day period in May of 2022.
Because we want to keep a copy of the original data file, we create an Amazon S3 bucket and upload the downloaded data file, chicago_taxi_train.csv. Here my bucket is named amazon-sagemaker-ai-by-example.
note: Remember that bucket names must be globally unique.
Clean
From Overview of machine learning with Amazon SageMaker AI.
To improve model training, inspect the data and clean it, as needed. For example, if your data has a country name attribute with values United States and US, you can edit the data to be consistent.
Here we use SageMaker Canvas as described in Recommendations for choosing the right data preparation tool in SageMaker AI.
SageMaker Canvas is a visual low-code environment for building, training, and deploying machine learning models in SageMaker AI. Its integrated Data Wrangler tool allows users to combine, transform, and clean datasets through point-and-click interactions.
We choose the Canvas application from the Amazon SageMaker AI service screen in the AWS Management Console; we open it using the appropriate user profile.
note: As a cost savings, it is important to log out of the Canvas application (not just close the window) when we are done using it for the day.
Using the Data Wrangler tool, we select Import and prepare > Tabular. We select the Amazon S3 data source and browse to our bucket and then the data file we uploaded.
We complete the import to create a new data flow.
As in the exercise, we start by previewing the data by selecting the Data tab in our newly created data flow.
note: Was not happy with the non-descriptive default name of the data flow; renamed it from the Data flows tab to the name of the exercise (Linear regression).
The exercise next has us generate statistics of the data, e.g., mean, min, max. We can do the same by selecting the Analyses tab and creating a Table Summary analysis (we named it based on the date and time for no particular reason other than that it is guaranteed to be unique).
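For reference, the Table Summary statistics that Canvas produces are ordinary descriptive statistics. A minimal sketch in plain Python (the fare values here are invented for illustration; the real column comes from chicago_taxi_train.csv):

```python
import statistics

# Invented sample of FARE values, standing in for the real column.
fares = [12.25, 7.75, 31.50, 9.00, 14.25]

summary = {
    "count": len(fares),
    "mean": statistics.mean(fares),
    "std": statistics.stdev(fares),  # sample standard deviation
    "min": min(fares),
    "max": max(fares),
}
print(summary)
```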
It turns out that in this exercise, there is no need to clean the data.
Prepare
From Overview of machine learning with Amazon SageMaker AI.
To improve performance, you might perform additional data transformations. For example, you might choose to combine attributes for a model that predicts the conditions that require de-icing an aircraft. Instead of using temperature and humidity attributes separately, you can combine those attributes into a new attribute to get a better model.
Next we remind ourselves what we are looking to predict (aka, the label).
In this Colab you will use a real dataset to train a model to predict the fare of a taxi ride in Chicago, Illinois.
The next step in the exercise.
An important part of machine learning is determining which features correlate with the label. If you have ever taken a taxi ride before, your experience is probably telling you that the fare is typically associated with the distance traveled and the duration of the trip. But, is there a way for you to learn more about how well these features correlate to the fare (label)?
In this step, you will use a correlation matrix to identify features whose values correlate well with the label.
From the Analyses tab, we press the plus (+) button and create a Feature Correlation analysis; as we are working toward a linear regression model, we select a Correlation type of linear (and name it using our date-time pattern).
The analysis indicates that the TRIP_MILES feature is the most correlated with the FARE label.
Scrolling down, we see our correlation matrix, in which we can see other correlated features, e.g., TRIP_SECONDS.
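The "linear" correlation type Canvas offers here is the Pearson correlation coefficient. As a sketch of what the analysis computes, with invented sample rows standing in for the real taxi data:

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient, the "linear" correlation type
    # selected in the Canvas Feature Correlation analysis.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example rows; the real columns come from the Chicago taxi data.
trip_miles = [1.0, 2.5, 4.0, 8.0, 12.0]
fare = [5.0, 9.0, 14.0, 26.0, 38.5]
print(pearson_r(trip_miles, fare))  # close to 1.0: strongly correlated
```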
Looking ahead in the exercise, we will want to transform the TRIP_SECONDS column to TRIP_MINUTES (mostly because minutes make more sense when thinking about taxi rides). Also, because we will only be using three columns (FARE, TRIP_MILES, and the new TRIP_MINUTES) we will go ahead and drop the rest of the columns.
We navigate to the Data tab in our data flow and, in the right menu (you may have to open it), select Add transform > Custom formula. Complete as shown and press the Add button.
We then select Add transform > Manage columns (drop all the columns except FARE, TRIP_MILES, and TRIP_MINUTES).
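The two transforms above amount to deriving TRIP_MINUTES from TRIP_SECONDS and keeping only the three columns of interest. A stdlib-only sketch of the same preparation (the rows and extra column here are invented; only the column names mirror the dataset):

```python
import csv
import io

# Stand-in for a few rows of chicago_taxi_train.csv; values are invented.
raw = """FARE,TRIP_MILES,TRIP_SECONDS,TIP_RATE
9.5,2.3,780,0.15
31.0,11.2,2400,0.20
"""

rows = list(csv.DictReader(io.StringIO(raw)))
prepared = [
    {
        "FARE": float(row["FARE"]),
        "TRIP_MILES": float(row["TRIP_MILES"]),
        # Same formula as the Canvas custom transform: seconds / 60.
        "TRIP_MINUTES": float(row["TRIP_SECONDS"]) / 60,
    }
    for row in rows  # all other columns are dropped
]
print(prepared)
```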
At this point, our data flow looks something like this.
From the exercise.
Sometimes it is helpful to visualize relationships between features in a dataset; one way to do this is with a pair plot. A pair plot generates a grid of pairwise plots to visualize the relationship of each feature with all other features all in one place.
Here we will produce two pair plots, each with FARE (the label) as the y (vertical) value and one of TRIP_MILES or TRIP_MINUTES (the features) as the x (horizontal) value.
From the Data flow tab in our data flow, we press the plus (+) button in the last (Drop column) step in our transformation pipeline and select Get data insights. We create the first pair plot using the values shown.
and then
Train
From Overview of machine learning with Amazon SageMaker AI.
To train a model, you need an algorithm or a pre-trained base model. The algorithm you choose depends on a number of factors. For a built-in solution, you can use one of the algorithms that SageMaker provides.
From the exercise.
In this step you will train a model to predict the cost of the fare using a single feature. Earlier, you saw that TRIP_MILES (distance) correlates most strongly with the FARE, so let’s start with TRIP_MILES as the feature for your first training run.
In this case (linear regression with a single feature), the training's output is two parameters (bias and weight) used in the formula y′ = b + w₁x₁ to predict the FARE (y′) from the TRIP_MILES (x₁).
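This single-feature linear model can be sketched directly; the bias and weight values below are made up for illustration (the real ones come out of training):

```python
def predict_fare(trip_miles, bias, weight):
    # y' = b + w1 * x1  (single-feature linear regression)
    return bias + weight * trip_miles

# Illustrative parameter values only, not trained values.
bias, weight = 2.0, 2.25
print(predict_fare(10.0, bias, weight))  # predicted fare for a 10-mile trip
```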
Looking at the document Built-in algorithms and pretrained models in Amazon SageMaker, we can quickly determine that we want to use the Linear Learner Algorithm.
Linear models are supervised learning algorithms used for solving either classification or regression problems. For input, you give the model labeled examples (x, y). x is a high-dimensional vector and y is a numeric label. For binary classification problems, the label must be either 0 or 1. For multiclass classification problems, the labels must be from 0 to num_classes - 1. For regression problems, y is a real number. The algorithm learns a linear function, or, for classification problems, a linear threshold function, and maps a vector x to an approximation of the label y.
From the Data flow tab in our data flow, we press the plus (+) button in the last (Drop column) step in our transformation pipeline and select Create model.
Under the hood, two things are created; a dataset and a model.
note: In hindsight, I should have used more descriptive names for these.
Navigating to the My Models tool and opening up the newly created model, we start by dropping the TRIP_MINUTES column as our first model will only use the single TRIP_MILES feature.
In the Model type panel, we select Configure model and observe that we are using a Numeric model type.
Select the Objective metric tab and observe that we are optimizing on MSE (Mean Squared Error); comparing this to the code in the exercise, we can see that we are aligned here.
Select the Training method and algorithms. We will leave the default setting of Auto here.
Canvas selects the algorithms that are most relevant to your dataset and the best range of hyperparameters to tune model candidates. The best-performing model candidate is chosen.
That being said, it is interesting to observe that there are a number of algorithms for the numeric model type, each optimized for a particular situation.
It is interesting here to observe that by using Canvas, we do not have to optimize the hyperparameters, e.g., learning rate, epochs, or batch sizes, as we had to in the exercise; it is all done for us automatically.
Back on the model screen, we select Quick build to train the model; there is an option for Standard build that can take up to four hours.
Evaluate
From Overview of machine learning with Amazon SageMaker AI.
After you train your model, you evaluate it to determine whether the accuracy of the inferences is acceptable.
From the exercise.
During training, you should see the root mean square error (RMSE) in the output. The units for RMSE are the same as the units for the label (dollars). In other words, you can use the RMSE to determine how far off, on average, the predicted fares are in dollars from the observed values.
From the Analyze tab on the model screen, we can see the RMSE.
From the Scoring tab of the Analyze tab, we can see a visualization of the predicted vs. actual values.
We can also use the model to make predictions by selecting the Predict tab.
note: One non-obvious aspect of this is that even though this particular model does not depend on TRIP_MINUTES, the underlying dataset has this column (so you can edit it, but to no effect).
Train / Evaluate with Two Features
From the exercise.
The model you trained with the feature TOTAL_MILES demonstrates fairly strong predictive power, but is it possible to do better? In this step, try training the model with two features, TRIP_MILES and TRIP_MINUTES, to see if you can improve the model.
Using the My Models tool, we press the New model button, naming it appropriately, e.g., linear_regression_two_features, and leaving the default Predictive Analysis option set. We select the dataset that we created earlier. We set the target column to be FARE. Finally, we select Quick build to train the model.
As before, we can evaluate the model; since Canvas automates the selection of models and hyperparameters to tune model candidates, there is not much for us to analyze here.
Deploy
From Overview of machine learning with Amazon SageMaker AI.
You traditionally re-engineer a model before you integrate it with your application and deploy it. With SageMaker AI hosting services, you can deploy your model independently, which decouples it from your application code.
From the first model we generated (which predicts FARE from the TRIP_MILES feature), we select the Deploy tab. Press the Create Deployment button and name the deployment appropriately, e.g., linear-regression-one-feature. We also select the smallest instance type, ml.t2.medium, and only a single instance count. Press the Deploy button.
note: Coming from an infrastructure background, the deployment options of instance type and an instance count feels a bit crude, e.g., no autoscaling options, etc.
note: Was also a bit surprised that it took over 20 minutes to deploy; would have thought this would have taken seconds (the model is already trained).
Monitor
From Overview of machine learning with Amazon SageMaker AI.
Machine learning is a continuous cycle. After deploying a model, you monitor the inferences, collect more high-quality data, and evaluate the model to identify drift. You then increase the accuracy of your inferences by updating your training data to include the newly collected high-quality data. As more example data becomes available, you continue retraining your model to increase accuracy.
From the deployment details, we can monitor some aspects of the deployment, e.g., Average predictions per day.
note: Coming from an infrastructure background, this feels like an insufficient amount of monitoring.
We can see among other things a deployment url and sample code.
note: Did spend a few minutes validating that the sample code works (it did); one thing that messed me up was that the EndpointName is not the URL but rather just the name of the deployment.
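That gotcha shows up in any hand-rolled invocation. A minimal sketch using boto3 (the endpoint name and the text/csv payload shape are assumptions based on this model's single TRIP_MILES input; only build_csv_payload runs without AWS credentials):

```python
def build_csv_payload(trip_miles):
    # The deployed model takes the single TRIP_MILES feature, sent as a
    # text/csv row (an assumption mirroring the Canvas sample code).
    return f"{trip_miles}\n"

def predict_fare(endpoint_name, trip_miles):
    # EndpointName is the deployment *name* from Canvas, e.g.
    # "linear-regression-one-feature", NOT the endpoint URL.
    import boto3  # assumes boto3 is installed and AWS credentials are set

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=build_csv_payload(trip_miles),
    )
    return response["Body"].read().decode("utf-8")
```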
Much easier, however, is that we can test the deployment from the Test deployment tab.
Whew. All done.