• Dec 20, 2023

Predict New York Taxi Fares With Python and ML.NET

    There are many popular machine learning libraries for Python. There’s TensorFlow, scikit-learn, Theano, Caffe, and many others.

    • Column 0: The data provider vendor ID

    • Column 3: Number of passengers

    • Column 4: Trip distance

    • Column 5: The rate code (standard, JFK, Newark, …)

    • Column 9: Payment type (credit card, cash, …)

    • Column 10: Fare amount

    $ mkdir TaxiFarePrediction
    $ cd TaxiFarePrediction

    And install the NimbusML package:

    $ pip install nimbusml

    And now I’ll launch the Visual Studio Code editor to start building the app:

    $ code Program.py

    I will need a couple of import statements:

    import pandas as pd
    import numpy as np
    
    from sklearn.model_selection import train_test_split
    from nimbusml import Pipeline, Role
    from nimbusml.preprocessing.schema import TypeConverter
    from nimbusml.preprocessing.schema import ColumnConcatenator
    from nimbusml.feature_extraction.categorical import OneHotVectorizer
    from nimbusml.ensemble import FastTreesRegressor

    I use Pandas DataFrames to import data from CSV files and process it for training. I also need NumPy because Pandas depends on it.

    # load the file
    dataFrame = pd.read_csv("yellow_tripdata_2018-12.csv", 
                            sep=',', 
                            header=0)
    
    # create train and test partitions
    trainData, testData = train_test_split(dataFrame, test_size=0.2, random_state=42, shuffle=True)
    
    # the rest of the code goes here...

    This code calls read_csv from the Pandas package to load the CSV data into a new DataFrame. Note the header=0 argument that tells the function to pull the column headers from the first line.
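
    To make the 80/20 partitioning concrete, here is a minimal stdlib-only sketch of a shuffled split. It is an illustration of the idea, not scikit-learn's implementation; the helper name split_rows is hypothetical, and a given seed will not reproduce the exact rows that train_test_split selects.

```python
import random


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows, then split off a test_size fraction for testing."""
    shuffled = list(rows)
    # a seeded generator makes the shuffle repeatable between runs
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    # first n_test rows become the test partition, the rest is for training
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten taxi trips
train, test = split_rows(rows)
print("train rows:", len(train))
print("test rows:", len(test))
```

    Every row ends up in exactly one partition, so the model is later evaluated on trips it has never seen during training.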

    # build a machine learning pipeline
    pipeline = Pipeline([
        TypeConverter(columns = ["passenger_count", "trip_distance"], result_type = "R4"),
        OneHotVectorizer() << ["VendorID", "RatecodeID", "payment_type"],
        ColumnConcatenator() << {"Feature":["VendorID", "RatecodeID", "payment_type", "passenger_count", "trip_distance"]},
        FastTreesRegressor() << {Role.Label:"total_amount", Role.Feature:"Feature"}
    ])
    
    # train the model
    pipeline.fit(trainData)
    
    # the rest of the code goes here...

    Machine learning models in ML.NET are built with Pipelines, which are sequences of data-loading, transformation, and learning components.

    • A TypeConverter that converts the passenger_count and trip_distance columns to R4, which means a 32-bit floating point number or a single. I need this conversion because Pandas will load floating point data as R8 (64-bit floating point numbers or doubles), and ML.NET cannot deal with that datatype.

    • A OneHotVectorizer that performs one-hot encoding on the three columns that contain enumerative data: VendorID, RatecodeID, and payment_type. This is a required step because I don’t want the machine learning model to treat these columns as numeric values.

    • A ColumnConcatenator which combines all input data columns into a single column called Feature. This is a required step because ML.NET can only train on a single input column.

    • A final FastTreesRegressor learner which will analyze the Feature column to try and predict the total_amount.
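
    The three transformation steps above can be sketched in plain Python to show what happens to a single trip. This is a hand-rolled illustration, not how NimbusML implements these components. The helper names are hypothetical, and the category lists are assumptions: the real OneHotVectorizer learns its categories from the training data.

```python
import struct


def to_float32(value):
    """Round-trip a 64-bit Python float through 32-bit storage (R8 to R4)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


def one_hot(value, categories):
    """Return a list of zeros with a single 1.0 at the position of value."""
    return [1.0 if value == c else 0.0 for c in categories]


# Assumed category lists, for illustration only
VENDOR_IDS = [1, 2]
RATE_CODES = [1, 2, 3, 4, 5, 6]
PAYMENT_TYPES = [1, 2, 3, 4, 5, 6]


def build_features(trip):
    """Concatenate the encoded and converted columns into one feature vector."""
    features = []
    features += one_hot(trip["VendorID"], VENDOR_IDS)
    features += one_hot(trip["RatecodeID"], RATE_CODES)
    features += one_hot(trip["payment_type"], PAYMENT_TYPES)
    features.append(to_float32(trip["passenger_count"]))
    features.append(to_float32(trip["trip_distance"]))
    return features


trip = {"VendorID": 1, "RatecodeID": 2, "payment_type": 1,
        "passenger_count": 1, "trip_distance": 3.75}
features = build_features(trip)
print(len(features), "values:", features)
```

    Because each category gets its own 0/1 slot, the learner cannot mistake rate code 2 for being "twice" rate code 1, which is the reason for one-hot encoding these columns.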

    • 1 = standard

    • 2 = JFK

    • 3 = Newark

    • 4 = Nassau

    • 5 = negotiated

    • 6 = group

    • 1 = Credit card

    • 2 = Cash

    • 3 = No charge

    • 4 = Dispute

    • 5 = Unknown

    • 6 = Voided trip

    # evaluate the model and report metrics
    metrics, _ = pipeline.test(testData)
    print("\nEvaluation metrics:")
    print("  RMSE: ", metrics["RMS(avg)"][0])
    print("  MSE: ", metrics["L2(avg)"][0])
    print("  MAE: ", metrics["L1(avg)"][0])
    
    # the rest of the code goes here...

    This code calls the pipeline's test function with the testData partition. It generates a prediction for every taxi trip in the test partition and compares each one to the actual taxi fare.

    • RMS: this is the root mean squared error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate regression models and rate their accuracy. Geometrically, if the errors of the n individual predictions form a vector in n-dimensional space, RMSE is the length of that vector divided by the square root of n. Like MAE, it is expressed in dollars.

    • L1: this is the mean absolute prediction error or MAE value, expressed in dollars.

    • L2: this is the mean squared error, or MSE value. Note that RMSE and MSE are related: RMSE is the square root of MSE.
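
    The three metrics are easy to compute by hand, which helps when interpreting the numbers. Here is a minimal sketch using made-up fares; the helper name regression_metrics and the sample values are hypothetical and are not output from the model.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE and RMSE for two equally long lists of fares."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n   # L1: mean absolute error
    mse = sum(e * e for e in errors) / n    # L2: mean squared error
    rmse = math.sqrt(mse)                   # RMS: square root of MSE
    return {"MAE": mae, "MSE": mse, "RMSE": rmse}


# made-up fares in dollars, for illustration only
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(name, round(value, 4))
```

    Squaring the errors makes RMSE more sensitive to large misses than MAE, so RMSE is never smaller than MAE on the same predictions.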

    # set up a trip sample
    tripSample = pd.DataFrame(  [[1, 1, 1, 1.0, 3.75]],
                                columns = ["VendorID", "RatecodeID", "payment_type", "passenger_count", "trip_distance"])
    
    # predict fare for trip sample
    prediction = pipeline.predict(tripSample)
    print("\nSingle trip prediction:")
    print("  Fare:", prediction["Score"][0])

    This code sets up a new DataFrame with the details of my taxi trip. Note that I have to provide the data and the column names separately.

    $ python ./Program.py


    Machine Learning With Python and ML.NET

    This code example is part of my online training course Machine Learning with Python and ML.NET that teaches developers how to build machine learning applications in Python with Microsoft's ML.NET library.

    I made this training course after I had already completed a similar machine learning course in C#, and I started wondering if it would be possible to use the ML.NET library in Python apps.

    After a bit of research, I discovered the NimbusML library and I started porting my C# code over to Python. The whole process went quite smoothly and I decided to share what I had discovered in a new training course.

    Anyway, check it out if you like. The course will get you up to speed on NimbusML and ML.NET and you'll learn the basics of regression, classification, clustering, gradient descent, logistic regression, decision trees, and much more.

