Mar 14, 2023

Use F# and ML.NET to predict New York taxi fares

Building machine learning apps has never been easier!

And that's because we have ML.NET, Microsoft’s new machine learning library. It can run linear regression, logistic regression, clustering, deep learning, and many other machine learning algorithms.

But did you know that the F# language is the perfect choice for developing machine learning applications with ML.NET?

The F# language is just perfect for machine learning. It’s a functional-first programming language based on OCaml and inspired by Python, Haskell, Scala, and Erlang, and it also supports object-oriented and imperative code. It has a powerful syntax and lots of built-in types and functions for processing data.

Check out the following F# code fragment that trains a machine learning model to predict taxi fares in New York City, and then uses the fully trained model to predict a single trip for a passenger paying with a credit card.

Look how compact the syntax is. F# machine learning code is elegant, concise, and beautiful:

A nice feature of F# is its type inference. In many cases you can leave out type names or generic type arguments and the compiler will just figure them out on its own. You can see an example of that in the screenshot where I initialize taxiTripSample without having to specify a type name.
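The way the compiler figures out types on its own can be seen in a small sketch that does not need ML.NET. The Trip record, its fields, and the distancePerPassenger function are hypothetical names chosen for illustration:

```fsharp
// Hypothetical record type, for illustration only.
type Trip = { Distance : float32; Passengers : int }

// No type name is needed here: the compiler infers Trip from the field names.
let sample = { Distance = 3.75f; Passengers = 1 }

// The parameter and return types are inferred too: trip is a Trip because of
// the field access, and the result is a float32.
let distancePerPassenger trip = trip.Distance / float32 trip.Passengers

printfn "%.2f" (distancePerPassenger sample)
```

The same mechanism is what lets a record value be written without naming its type anywhere.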

And check out the very cool pipe operator |> that allows me to create a chain of functions that operate on a data stream. In the screenshot, I use this feature to initialize metrics by piping a test dataset into the fully trained machine learning model, collect predictions, and then pipe those predictions into the Evaluate function to compute evaluation metrics. All in a single line of code!
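As a minimal sketch of what the pipe operator does, here are two hypothetical helper functions (twice and addOne, not part of the taxi code) called once with nested parentheses and once with a pipeline:

```fsharp
// Hypothetical helper functions, for illustration only.
let twice x = x * 2
let addOne x = x + 1

// Nested calls read inside-out...
let nested = addOne (twice 5)

// ...while the pipe operator reads left to right: x |> f is the same as f x.
let piped = 5 |> twice |> addOne

printfn "%d %d" nested piped   // prints "11 11"
```

Both expressions compute the same value. The piped form simply lists the steps in the order they happen.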

With tricks like this, F# code is often very compact without sacrificing readability. On average my F# code is about 30% more compact than comparable C# code.

But let’s take a look at that taxi fare prediction case in more detail.

Did you know that the NYC Taxi & Limousine Commission keeps meticulous records of all taxi trips in the New York City area?

I’m going to grab their data file for December 2018. This is a CSV file with 8,173,233 records that looks like this:

There are a lot of columns with interesting information in this data file, but I will only be focusing on the following:

Column 0: The data provider vendor ID
Column 3: Number of passengers
Column 4: Trip distance
Column 5: The rate code (standard, JFK, Newark, …)
Column 9: Payment type (credit card, cash, …)
Column 10: Fare amount

I’ll build a machine learning model in F# that will use columns 0, 3, 4, 5, and 9 as input, and use them to predict the taxi fare for every trip. Then I’ll compare the predicted fares with the actual taxi fares in column 10 and evaluate the accuracy of my model.

And I will use .NET Core to build my app.

Here’s how to set up a new F# console project in .NET Core:

$ dotnet new console --language F# --output PricePrediction
$ cd PricePrediction

Now I need to install the following packages:

$ dotnet add package Microsoft.ML
$ dotnet add package Microsoft.ML.FastTree

This will install the ML.NET NuGet package and support for fast decision tree learning algorithms. I’ll be using a decision tree to make my taxi fare predictions.

Now you are ready to add some classes. You’ll need one to hold a taxi trip, and one to hold your model predictions.

I will modify Program.fs like this:

open System
open Microsoft.ML
open Microsoft.ML.Data

/// The TaxiTrip type represents a single taxi trip.
[<CLIMutable>]
type TaxiTrip = {
    [<LoadColumn(0)>] VendorId : string
    [<LoadColumn(5)>] RateCode : string
    [<LoadColumn(3)>] PassengerCount : float32
    [<LoadColumn(4)>] TripDistance : float32
    [<LoadColumn(9)>] PaymentType : string
    [<LoadColumn(10)>] [<ColumnName("Label")>] FareAmount : float32
}

/// The TaxiTripFarePrediction type represents a single fare prediction.
[<CLIMutable>]
type TaxiTripFarePrediction = {
    [<ColumnName("Score")>] FareAmount : float32
}

// the rest of the code goes here...

The TaxiTrip type holds one single taxi trip. Note how each field is tagged with a LoadColumn attribute that tells the CSV data loading code which column to import data from.

I’m also declaring a TaxiTripFarePrediction type which will hold a single fare prediction.

Note the CLIMutable attribute that tells F# that I want a ‘C#-style’ class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes.

Also note that I don’t need the mutable keyword on any of the fields. By default F# records are immutable and the compiler will prevent us from assigning to any field after the record has been instantiated. The CLIMutable attribute leaves that behavior in place for my own F# code and only adds the default constructor and property setters that ML.NET needs behind the scenes.
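The effect of the CLIMutable attribute can be checked with reflection. This is a sketch using two hypothetical record types (WithAttribute and WithoutAttribute) and two hypothetical helpers. It only inspects the compiled types and does not involve ML.NET:

```fsharp
// Hypothetical record types, for illustration only.
[<CLIMutable>]
type WithAttribute = { Fare : float32 }

type WithoutAttribute = { Amount : float32 }

// True when the compiled type exposes a public parameterless constructor.
let hasDefaultConstructor (t : System.Type) =
    not (isNull (t.GetConstructor(System.Type.EmptyTypes)))

// True when the named property has a setter.
let hasSetter (t : System.Type) (name : string) =
    t.GetProperty(name).CanWrite

printfn "WithAttribute:    default constructor = %b, setter = %b"
    (hasDefaultConstructor typeof<WithAttribute>)
    (hasSetter typeof<WithAttribute> "Fare")
printfn "WithoutAttribute: default constructor = %b, setter = %b"
    (hasDefaultConstructor typeof<WithoutAttribute>)
    (hasSetter typeof<WithoutAttribute> "Amount")
```

Libraries that create objects through reflection typically rely on exactly these two things: a parameterless constructor and writable properties.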

I am loading all data columns as float32, except VendorId, RateCode and PaymentType. These columns hold numeric values but I will load them as string fields.

I need to do this because RateCode is an enumeration with the following values:

1 = standard
2 = JFK
3 = Newark
4 = Nassau
5 = negotiated
6 = group

And PaymentType is defined as follows:

1 = Credit card
2 = Cash
3 = No charge
4 = Dispute
5 = Unknown
6 = Voided trip

These actual numbers don’t mean anything in this context. And I certainly don’t want the machine learning model to start believing that a trip to Newark is three times as important as a standard fare.

So converting these values to strings is a perfect trick to show the model that VendorId, RateCode and PaymentType are just labels, and the underlying numbers don’t mean anything.

Now I need to load the training data in memory:

// file path to the data file (Path.Combine works on any operating system)
let dataPath = System.IO.Path.Combine(Environment.CurrentDirectory, "yellow_tripdata_2018-12_small.csv")

/// The main application entry point.
[<EntryPoint>]
let main argv =

    // create the machine learning context
    let context = new MLContext()

    // load the data
    let dataView = context.Data.LoadFromTextFile<TaxiTrip>(dataPath, hasHeader = true, separatorChar = ',')

    // split into a training and test partition
    let partitions = context.Data.TrainTestSplit(dataView, testFraction = 0.2)

    // the rest of the code goes here...

    0 // return value

This code calls LoadFromTextFile to load the CSV data into memory. Note the TaxiTrip type that tells the function which type to use to load the data.

There is only one single data file, so I’m calling TrainTestSplit to set up a training partition with 80% of the data and a test partition with the remaining 20% of the data.

You often see this 80/20 split in data science. It’s a very common approach to train and test a model.
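To illustrate the idea of an 80/20 split, here is a minimal sketch on a plain F# list. The trainTestSplit function below is a hypothetical helper, not the ML.NET method of the same name, and it simply cuts the list in order instead of assigning rows at random:

```fsharp
// Split a list into a training part and a test part.
// testFraction is the share of rows held out for testing.
// Simplification: rows are taken in order, with no shuffling.
let trainTestSplit (testFraction : float) (rows : 'a list) =
    let testCount = int (System.Math.Round(float rows.Length * testFraction))
    let trainCount = rows.Length - testCount
    List.splitAt trainCount rows

// Ten made-up row ids stand in for taxi trips.
let train, test = trainTestSplit 0.2 [ 1 .. 10 ]

printfn "train: %d rows, test: %d rows" train.Length test.Length
```

The model only ever sees the training rows, so the test rows give an honest estimate of how it behaves on unseen trips.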

Now I’m ready to start building the machine learning model:

// set up a learning pipeline
let pipeline = 
    EstimatorChain()

        // one-hot encode all text features
        .Append(context.Transforms.Categorical.OneHotEncoding("VendorId"))
        .Append(context.Transforms.Categorical.OneHotEncoding("RateCode"))
        .Append(context.Transforms.Categorical.OneHotEncoding("PaymentType"))

        // combine all input features into a single column 
        .Append(context.Transforms.Concatenate("Features", "VendorId", "RateCode", "PaymentType", "PassengerCount", "TripDistance"))

        // cache the data to speed up training
        .AppendCacheCheckpoint(context)

        // use the fast tree learner 
        .Append(context.Regression.Trainers.FastTree())

// train the model
let model = partitions.TrainSet |> pipeline.Fit

// the rest of the code goes here...

Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.

This pipeline has the following components:

A group of three OneHotEncodings to perform one-hot encoding on the three columns that contain enumerative data: VendorId, RateCode, and PaymentType. This is a required step because I don’t want the machine learning model to treat the enumerative data as numeric values.
Concatenate which combines all input data columns into a single column called Features. This is a required step because ML.NET can only train on a single input column.
AppendCacheCheckpoint which caches all data in memory to speed up the training process.
A final FastTree regression learner which will train the model to make accurate predictions.
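The first two steps can be sketched in plain F# to show what the data looks like after encoding and concatenation. This is a conceptual illustration only. The oneHotEncoder and features functions are hypothetical, and the real ML.NET transforms may order and store the values differently:

```fsharp
// Build a one-hot encoder from the labels seen in the training data.
// Each distinct label gets its own position in the output vector.
let oneHotEncoder (seen : string list) =
    let categories = List.distinct seen
    fun (label : string) ->
        categories |> List.map (fun c -> if c = label then 1.0f else 0.0f)

// Rate codes and payment types as strings, as in the TaxiTrip type.
let encodeRateCode    = oneHotEncoder [ "1"; "2"; "3"; "1" ]
let encodePaymentType = oneHotEncoder [ "1"; "2"; "1" ]

// Concatenate the encoded columns and the numeric columns into one feature vector.
let features rateCode paymentType passengerCount tripDistance =
    encodeRateCode rateCode
    @ encodePaymentType paymentType
    @ [ passengerCount; tripDistance ]

// A JFK trip (rate code "2") paid by credit card ("1"), 1 passenger, 3.75 miles.
printfn "%A" (features "2" "1" 1.0f 3.75f)
```

In the resulting vector no rate code is "bigger" than another: each code only switches its own position on or off.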

The FastTreeRegressionTrainer is a very nice training algorithm that uses gradient boosting, a machine learning technique for regression problems.

A gradient boosting algorithm builds up a collection of weak regression models. It starts out with a weak model that tries to predict the taxi fare. Then it adds a second model that attempts to correct the error in the first model. And then it adds a third model, and so on.

The result is a fairly strong prediction model that is actually just an ensemble of weaker prediction models stacked on top of each other.
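Here is a toy sketch of that idea in plain F#. It is not how FastTree is implemented: the weak models here are one-split "stumps" on a single input, the data is made up, and the names Stump, fitStump, predict, and boost are hypothetical. It only shows how each new model is trained on the errors left by the previous ones:

```fsharp
// A weak model: split the input at a threshold and predict a constant on each side.
type Stump = { Threshold : float; Left : float; Right : float }

let predictStump s x = if x <= s.Threshold then s.Left else s.Right

let mean xs = if List.isEmpty xs then 0.0 else List.average xs

// Fit the stump that minimises the squared error against the given targets.
let fitStump (xs : float list) (targets : float list) =
    let data = List.zip xs targets
    xs
    |> List.distinct
    |> List.map (fun t ->
        let left  = data |> List.filter (fun (x, _) -> x <= t) |> List.map snd |> mean
        let right = data |> List.filter (fun (x, _) -> x > t) |> List.map snd |> mean
        { Threshold = t; Left = left; Right = right })
    |> List.minBy (fun s ->
        data |> List.sumBy (fun (x, y) -> (y - predictStump s x) ** 2.0))

// Ensemble prediction: the sum of all stumps, each scaled by the learning rate.
let predict learningRate stumps x =
    stumps |> List.sumBy (fun s -> learningRate * predictStump s x)

// Add stumps one at a time, each trained on the errors of the ensemble so far.
let boost rounds learningRate xs ys =
    (List.empty, [ 1 .. rounds ])
    ||> List.fold (fun stumps _ ->
        let residuals = List.map2 (fun x y -> y - predict learningRate stumps x) xs ys
        fitStump xs residuals :: stumps)

// Made-up trip distances (miles) and fares (dollars), for illustration only.
let distances = [ 1.0; 2.0; 3.0; 4.0; 5.0 ]
let fares     = [ 6.0; 8.5; 11.0; 13.5; 16.0 ]

// Mean squared error of an ensemble on the training data.
let trainingError stumps =
    List.map2 (fun x y -> (y - predict 0.5 stumps x) ** 2.0) distances fares
    |> List.average

printfn "Error after 1 round:   %f" (trainingError (boost 1 0.5 distances fares))
printfn "Error after 50 rounds: %f" (trainingError (boost 50 0.5 distances fares))
```

The training error shrinks as more stumps are added, because every new stump corrects part of what the earlier ones got wrong.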

With the pipeline fully assembled, I train the model on the training partition by piping the TrainSet into the pipeline.Fit function.

I now have a fully trained model. So next, I’m going to grab the test data, predict the taxi fare for each trip, and calculate the accuracy of my model:

// get regression metrics to score the model
let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate

// show the metrics
printfn "Model metrics:"
printfn "  RMSE: %f" metrics.RootMeanSquaredError
printfn "  MSE: %f" metrics.MeanSquaredError
printfn "  MAE: %f" metrics.MeanAbsoluteError

// the rest of the code goes here...

This code pipes the TestSet into the model.Transform function to generate predictions for every single taxi trip in the test partition. I then pipe these predictions into the Evaluate function to compare them to the actual taxi fares and automatically calculate these metrics:

RootMeanSquaredError: this is the root mean squared error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE is closely related to the length of a vector in n-dimensional space made up of the error in each individual prediction: it is that length divided by the square root of the number of predictions n.
MeanAbsoluteError: this is the mean absolute prediction error or MAE value, expressed in dollars.
MeanSquaredError: this is the mean squared error, or MSE value. Note that RMSE and MSE are related: RMSE is the square root of MSE.
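The three metrics are easy to compute by hand, which helps to see how they relate. This sketch uses a handful of made-up fares, not values from the taxi dataset:

```fsharp
// Made-up actual and predicted fares in dollars, for illustration only.
let actual    = [ 10.0; 15.5; 7.0; 22.0 ]
let predicted = [ 12.0; 14.5; 7.0; 19.0 ]

// Prediction error for every trip: 2.0, -1.0, 0.0, -3.0.
let errors = List.map2 (fun a p -> p - a) actual predicted

let mae  = errors |> List.averageBy abs                 // mean absolute error
let mse  = errors |> List.averageBy (fun e -> e * e)    // mean squared error
let rmse = sqrt mse                                     // root mean squared error

printfn "MAE:  %f" mae    // 1.5
printfn "MSE:  %f" mse    // 3.5
printfn "RMSE: %f" rmse
```

Because the errors are squared before averaging, one large miss raises MSE and RMSE far more than it raises MAE.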

To wrap up, let’s use the model to make a prediction.

Imagine that I’m going to take a standard taxi trip, I cover a distance of 3.75 miles, I am the only passenger, and I pay by credit card. What would my fare be?

Here’s how to make that prediction:

// create a prediction engine for one single prediction
let engine = context.Model.CreatePredictionEngine model

let taxiTripSample = {
    VendorId = "VTS"
    RateCode = "1"
    PassengerCount = 1.0f
    TripDistance = 3.75f
    PaymentType = "CRD"
    FareAmount = 0.0f // To predict. Actual/Observed = 15.5
}

// make the prediction
let prediction = taxiTripSample |> engine.Predict

// show the prediction
printfn "\r"
printfn "Single prediction:"
printfn "  Predicted fare: %f" prediction.FareAmount

I’m using the CreatePredictionEngine method to set up a prediction engine. This is a type that can make predictions for individual data records.

Next, I set up a sample with all the details of my taxi trip and pipe it into the Predict function to make a single prediction.

The trip should cost anywhere between $13.50 and $18.50, depending on the trip duration (which depends on the time of day). Will the model predict a fare in this range?

Let’s find out. I’m going to run the code like this:

$ dotnet run

And this is what I see:

I get an RMSE value of 58.95 and a Mean Absolute Error (MAE) value of 2.12. This means that my predictions are off by only 2 dollars and 12 cents on average.

How about that!

And according to the model, my 19-minute trip will cost me $16.54. This prediction is nicely in the range of $13.50 to $18.50 for real-life trip fares.

Machine Learning With F# and ML.NET

This benchmark is part of my online training course Machine Learning with F# and ML.NET that teaches developers how to build machine learning applications in F# with Microsoft's ML.NET library.

I made this training course after I had already completed a similar machine learning course on C# with ML.NET, and I was looking for an excuse to start learning the F# language.

After I started porting over my C# code examples to F#, I noticed that the new F# code was often a lot shorter and easier to read than the corresponding C# code. In my opinion, that makes F# the ideal language for building machine learning apps.

Anyway, check out the training if you like. It will get you up to speed on the ML.NET library and you'll learn the basics of regression, classification, clustering, gradient descent, logistic regression, decision trees, and much more.