- Aug 22, 2023
Predict Heart Disease With F# And ML.NET
In this article, I am going to build an F# app with ML.NET and .NET Core that reads medical data and predicts whether a patient is at risk of heart disease. I will show you how to do this in only 120 lines of code.
ML.NET is Microsoft’s new machine learning library. It can run linear regression, logistic regression, clustering, deep learning, and many other machine learning algorithms.
.NET Core is Microsoft’s multi-platform .NET framework that runs on Windows, macOS, and Linux. It’s the future of cross-platform .NET development.
And F# is a perfect language for machine learning. It’s a functional-first programming language based on OCaml and inspired by Python, Haskell, Scala, and Erlang. It has a powerful, concise syntax and lots of built-in functions for processing data.
The first thing I need for my app is a data file with patients, their medical info, and their heart disease risk assessment. I will use the famous UCI Heart Disease Dataset, which contains real-life data from 303 patients.
The training data is a CSV file with 14 columns of information:
Age
Sex: 1 = male, 0 = female
Chest Pain Type: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
Resting blood pressure in mm Hg on admission to the hospital
Serum cholesterol in mg/dl
Fasting blood sugar > 120 mg/dl: 1 = true; 0 = false
Resting EKG results: 0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria
Maximum heart rate achieved
Exercise induced angina: 1 = yes; 0 = no
ST depression induced by exercise relative to rest
Slope of the peak exercise ST segment: 1 = up-sloping, 2 = flat, 3 = down-sloping
Number of major vessels (0–3) colored by fluoroscopy
Thallium heart scan results: 3 = normal, 6 = fixed defect, 7 = reversible defect
Diagnosis of heart disease: 0 = normal risk, 1–4 = elevated risk
The first 13 columns are patient diagnostic information, and the last column is the diagnosis: 0 means a healthy patient, and 1–4 mean an elevated risk of heart disease.
I will build a binary classification machine learning model that reads in all 13 columns of patient information, and then makes a prediction for the heart disease risk.
Let’s get started. Here’s how to set up a new console project in .NET Core:
$ dotnet new console --language F# --output Heart
$ cd Heart
Next, I need to install the ML.NET NuGet packages:
$ dotnet add package Microsoft.ML
$ dotnet add package Microsoft.ML.FastTree
Now I’m ready to add types. I’ll need one to hold patient info, and one to hold my model predictions.
I will replace the contents of the Program.fs file with this:
open System
open System.IO
open Microsoft.ML
open Microsoft.ML.Data
/// The HeartData record holds one single heart data record.
[<CLIMutable>]
type HeartData = {
    [<LoadColumn(0)>] Age : float32
    [<LoadColumn(1)>] Sex : float32
    [<LoadColumn(2)>] Cp : float32
    [<LoadColumn(3)>] TrestBps : float32
    [<LoadColumn(4)>] Chol : float32
    [<LoadColumn(5)>] Fbs : float32
    [<LoadColumn(6)>] RestEcg : float32
    [<LoadColumn(7)>] Thalac : float32
    [<LoadColumn(8)>] Exang : float32
    [<LoadColumn(9)>] OldPeak : float32
    [<LoadColumn(10)>] Slope : float32
    [<LoadColumn(11)>] Ca : float32
    [<LoadColumn(12)>] Thal : float32
    [<LoadColumn(13)>] Diagnosis : float32
}

/// The HeartPrediction record contains a single heart data prediction.
[<CLIMutable>]
type HeartPrediction = {
    [<ColumnName("PredictedLabel")>] Prediction : bool
    Probability : float32
    Score : float32
}
// the rest of the code goes here....
The HeartData record holds one single patient record. Note how each field is tagged with a LoadColumn attribute that tells the CSV data-loading code which column to import data from.
There’s also a HeartPrediction record which will hold a single heart disease prediction. There’s a Boolean Prediction, a Probability value, and the Score the model will assign to the prediction.
Note the CLIMutable attribute that tells F# that we want a ‘C#-style’ class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes.
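To see what CLIMutable does in practice, here’s a minimal sketch (using a hypothetical Point record, not part of the app) that demonstrates the generated parameterless constructor, which reflection-based libraries like ML.NET rely on:

```fsharp
open System

// A plain F# record normally has no parameterless constructor.
// CLIMutable adds one, plus property setters, behind the scenes.
[<CLIMutable>]
type Point = { X : float; Y : float }

// Reflection-based code (like ML.NET's data loader) can now
// instantiate the record without supplying any field values:
let p = Activator.CreateInstance<Point>()
printfn "%A" p  // all fields default to zero
```

The record still looks completely normal in ordinary F# code; the mutability only surfaces when the type is accessed through .NET reflection.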
Now look at the final Diagnosis column in the data file. The label is an integer value between 0–4, with 0 meaning ‘no risk’ and 1–4 meaning ‘elevated risk’.
But I’m building a Binary Classifier which means my model needs to be trained on Boolean labels.
So I have to somehow convert the ‘raw’ numeric label (stored in the Diagnosis field) to a Boolean value.
To set that up, I’ll need a helper type:
/// The ToLabel type is a helper type for a column transformation.
[<CLIMutable>]
type ToLabel = {
    mutable Label : bool
}
// the rest of the code goes here....
The ToLabel type contains the label converted to a Boolean value. I will set up that conversion in a minute.
Also note the mutable keyword. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The mutable keyword tells the compiler to create a mutable type instead and allow property assignments after construction.
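Here’s a tiny sketch (with a hypothetical Flag record) showing the difference the mutable keyword makes:

```fsharp
// A record with a single field marked mutable.
type Flag = { mutable Label : bool }

let flag = { Label = false }

// Assignment after construction compiles only because Label is mutable;
// without the keyword, the line below would be a compile error.
flag.Label <- true
printfn "%b" flag.Label  // true
```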
Now I’m going to load the training data in memory:
/// file path to the data file (assumes os = windows!)
let dataPath = sprintf "%s\\processed.cleveland.data.csv" Environment.CurrentDirectory

/// The main application entry point.
[<EntryPoint>]
let main argv =

    // set up a machine learning context
    let context = new MLContext()

    // load training and test data
    let data = context.Data.LoadFromTextFile<HeartData>(dataPath, hasHeader = false, separatorChar = ',')

    // split the data into a training and test partition
    let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)

    // the rest of the code goes here....

    0 // return value
This code uses the LoadFromTextFile method to load the CSV data directly into memory. The field annotations I set up earlier tell the function how to store the loaded data in the HeartData record.
The TrainTestSplit function then splits the data into a training partition with 80% of the data and a test partition with 20% of the data.
Now I’m ready to start building the machine learning model:
    // set up a training pipeline
    let pipeline =
        EstimatorChain()

            // step 1: convert the label value to a boolean
            .Append(
                context.Transforms.CustomMapping(
                    Action<HeartData, ToLabel>(fun input output -> output.Label <- input.Diagnosis > 0.0f),
                    "LabelMapping"))

            // step 2: concatenate all feature columns
            .Append(context.Transforms.Concatenate("Features", "Age", "Sex", "Cp", "TrestBps", "Chol", "Fbs", "RestEcg", "Thalac", "Exang", "OldPeak", "Slope", "Ca", "Thal"))

            // step 3: set up a fast tree learner
            .Append(context.BinaryClassification.Trainers.FastTree())

    // train the model
    let model = partitions.TrainSet |> pipeline.Fit

    // the rest of the code goes here....
Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
This pipeline has the following components:
A CustomMapping that transforms the numeric label to a Boolean value. I define a 0 value as healthy, and anything above 0 as an elevated risk.
Concatenate which combines all input data columns into a single column called ‘Features’. This is a required step because ML.NET can only train on a single input column.
A FastTree classification learner which will train the model to make accurate predictions.
The FastTreeBinaryClassificationTrainer is a very nice training algorithm that uses gradient boosting, a machine learning technique for classification problems.
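Conceptually, a boosted tree ensemble adds up the outputs of many small decision trees, each one trained to correct the mistakes of the trees before it. Here’s a hand-rolled sketch of that idea; the trees, thresholds, and learning rate are made up for illustration, whereas FastTree learns them all from the data:

```fsharp
// Each "tree" here is just a function from a feature value to a score.
// In a real boosted ensemble, every tree is fitted to the residual
// errors of the trees built before it.
let trees : (float -> float) list =
    [ (fun age -> if age > 55.0 then 1.0 else -1.0)
      (fun age -> if age > 45.0 then 0.5 else -0.5) ]

// The ensemble's raw score is a learning-rate-scaled sum of tree outputs...
let learningRate = 0.2
let rawScore x = trees |> List.sumBy (fun tree -> learningRate * tree x)

// ...which a sigmoid squashes into a probability for binary classification.
let probability x = 1.0 / (1.0 + exp (-rawScore x))

printfn "%f" (probability 60.0)  // above 0.5: both stumps vote positive
```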
With the pipeline fully assembled, I can train the model by piping the TrainSet into the Fit function.
I now have a fully trained model. So now it’s time to take the test partition, predict the diagnosis for each patient, and calculate the accuracy metrics of the model:
    // make predictions and compare with the ground truth
    let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate

    // report the results
    printfn "Model metrics:"
    printfn "  Accuracy:          %f" metrics.Accuracy
    printfn "  Auc:               %f" metrics.AreaUnderRocCurve
    printfn "  Auprc:             %f" metrics.AreaUnderPrecisionRecallCurve
    printfn "  F1Score:           %f" metrics.F1Score
    printfn "  LogLoss:           %f" metrics.LogLoss
    printfn "  LogLossReduction:  %f" metrics.LogLossReduction
    printfn "  PositivePrecision: %f" metrics.PositivePrecision
    printfn "  PositiveRecall:    %f" metrics.PositiveRecall
    printfn "  NegativePrecision: %f" metrics.NegativePrecision
    printfn "  NegativeRecall:    %f" metrics.NegativeRecall

    // the rest of the code goes here....
This code pipes the TestSet into model.Transform to generate a prediction for every patient in the set, and then pipes the predictions into Evaluate to compare these predictions to the ground truth and automatically calculate all evaluation metrics:
Accuracy: this is the number of correct predictions divided by the total number of predictions.
AreaUnderRocCurve: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good.
AreaUnderPrecisionRecallCurve: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive.
F1Score: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive.
LogLoss: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes.
LogLossReduction: this metric is also called the Reduction in Information Gain (RIG). It expresses the reduction in log loss compared to a baseline model that simply predicts the class distribution, i.e. how much better the model’s predictions are than random chance.
PositivePrecision: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high.
PositiveRecall: also called ‘Recall’, this is the fraction of positive predictions out of all positive cases. This is a good metric to use when the cost of a false negative is high.
NegativePrecision: this is the fraction of negative predictions that are correct.
NegativeRecall: this is the fraction of negative predictions out of all negative cases.
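To make these definitions concrete, here’s how the main metrics fall out of a confusion matrix. The counts below are hypothetical (not taken from the actual run), chosen only to illustrate the arithmetic:

```fsharp
// Hypothetical confusion-matrix counts for illustration:
// tp = true positives, fp = false positives,
// tn = true negatives, fn = false negatives
let tp, fp, tn, fn = 36.0, 6.0, 52.0, 14.0

let accuracy          = (tp + tn) / (tp + fp + tn + fn)
let positivePrecision = tp / (tp + fp)  // fraction of positive predictions that are correct
let positiveRecall    = tp / (tp + fn)  // fraction of all positive cases that were found
let negativePrecision = tn / (tn + fn)
let negativeRecall    = tn / (tn + fp)
let f1Score           = 2.0 * positivePrecision * positiveRecall
                            / (positivePrecision + positiveRecall)

printfn "Accuracy: %f  Precision: %f  Recall: %f" accuracy positivePrecision positiveRecall
```

Note how precision and recall use the same numerator but divide by different totals: precision penalizes false positives, recall penalizes false negatives.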
When monitoring heart disease, I definitely want to avoid false negatives because I don’t want to be sending high-risk patients home and telling them everything is okay.
I also want to avoid false positives, but they are a lot better than a false negative because later tests would probably discover that the patient is healthy after all.
To wrap up, I’m going to create a new patient record and ask the model to make a prediction:
    // set up a prediction engine
    let predictionEngine = context.Model.CreatePredictionEngine model

    // create a sample patient
    let sample = {
        Age = 36.0f
        Sex = 1.0f
        Cp = 4.0f
        TrestBps = 145.0f
        Chol = 210.0f
        Fbs = 0.0f
        RestEcg = 2.0f
        Thalac = 148.0f
        Exang = 1.0f
        OldPeak = 1.9f
        Slope = 2.0f
        Ca = 1.0f
        Thal = 7.0f
        Diagnosis = 0.0f // unused
    }

    // make the prediction
    let prediction = sample |> predictionEngine.Predict

    // report the results
    printfn "\r"
    printfn "Single prediction:"
    printfn "  Prediction: %s" (if prediction.Prediction then "Elevated heart disease risk" else "Normal heart disease risk")
    printfn "  Probability: %f" prediction.Probability
This code uses the CreatePredictionEngine method to set up a prediction engine, and then creates a new patient record for a 36-year-old male with asymptomatic chest pain and a bunch of other medical info.
The code then pipes the patient record into the Predict function and displays the diagnosis.
What’s the model going to predict?
Time to find out. I’ll run my code like this:
$ dotnet run
This is what my code looks like running in the terminal:
These results nicely illustrate how to evaluate a binary classifier. I get an accuracy of 0.8, which means that my model is correct 80% of the time.
My precision is 0.86 which means that 86% of all elevated risk predictions made by the model are correct. In the remaining 14% of cases, the model predicted elevated risk in patients that were actually healthy.
The recall is 0.72, which means that out of all positive cases, my model correctly identified only 72%. The remaining 28% are high-risk heart patients who were told that everything is fine and they can go home.
That’s obviously very bad, and it clearly shows how important the recall metric is in cases where we want to avoid false negatives at all costs.
I’m getting an AUC of 0.82 which is a good start. It means this model has good predictive ability.
Finally, my model is 99% confident that my 36-year-old male patient with asymptomatic chest pain is at high risk of heart disease.
Looks like we caught that one in time!
The dataset has 164 normal cases and 139 elevated-risk cases. That’s a nicely balanced set, which means I can safely use the AUC, Precision, and Recall metrics. There’s no need to fall back on the alternate AUPRC and F1Score metrics for imbalanced datasets.
Machine Learning With F# and ML.NET
This article is part of my online training course Machine Learning with F# and ML.NET that teaches developers how to build machine learning applications in F# with Microsoft's ML.NET library.
I made this training course after I had already completed a similar machine learning course on C# with ML.NET, and I was looking for an excuse to start learning the F# language.
After I started porting over my C# code examples to F#, I noticed that the new F# code was often a lot shorter and easier to read than the corresponding C# code. In my opinion, that makes F# the ideal language for building machine learning apps.
Anyway, check out the training if you like. It will get you up to speed on the ML.NET library and you'll learn the basics of regression, classification, clustering, gradient descent, logistic regression, decision trees, and much more.