Detecting Breast Cancer — here’s how I did it.

💥 SHOUTOUT: I used this tutorial from Computer Science to help make this model. It was so helpful and I definitely recommend checking the video out and showing some ❤️ to their channel!!

Currently, there is a 1 in 8 chance that an American woman will develop breast cancer. That’s about a 13% chance.

About 1 in 39 women die from breast cancer.

Where does detection play a role in this?

Inflammatory breast cancer can be difficult to diagnose. Often, there is no lump that can be felt during a physical exam or seen in a screening mammogram. In addition, most women diagnosed with inflammatory breast cancer have dense breast tissue, which makes cancer detection in a screening mammogram more difficult.

What also interested me about breast cancer is the number of psychological conditions that arise from it, just like with many other types of cancer. The diagnosis of breast cancer commonly triggers states of distress such as anxiety, depression, fatigue, pain, sexuality concerns, and self-blame.

Imagine, all of a sudden, you learn that you have stage 4 breast cancer. Imagine both the physical and mental impacts that one piece of news would have on your entire state.

The question is — by detecting breast cancer early on, would we be able to mitigate some of these latter-stage effects?

With AI, this is possible.

We can create a model that we can train using breast cancer data, in order for us to be able to detect breast cancer.

In this article, I am going to walk you through how I built this model, so that you can build one yourself and feel inspired to create more projects that will impact people :)

Setting up your workspace

In order to get started, you will need to choose an IDE — the place where you will write your code.

IDE = Integrated development environment = software for building applications

Examples of IDEs and coding environments include PyCharm, Jupyter, and the tools bundled with Anaconda.

Personally, I used Google Colab notebooks. I highly recommend using this platform because:

  1. You don’t need to download any separate applications to your computer. If you have a Google account, you’re fine.
  2. Your projects will save automatically to your Google Drive. Everything will be easy to access and in one place.
  3. It’s great if you’re just starting out as a developer and only know the basics. Everything is really clear to use and it’s easy to follow.

Getting Data

As I mentioned before, we are going to train our model using previous breast cancer data. For the model, I retrieved my data set from Kaggle. This is how it looked:

Data sets 😍

This is what it should look like once you download it. Name the file “data” and save it on your desktop or somewhere easy to find. Make sure that it saves as a “.csv” file.

Number palooza 😍

If you zoom in, you can see that there are various column headers representing various data (such as diagnosis, texture_mean, radius_mean, perimeter_mean, etc.). In the diagnosis column, the letters M and B represent whether the tumour is malignant or benign.

Malignant = cancerous tumours whose cells grow uncontrollably and spread locally and/or to distant sites

Benign = tumour that is a mass of abnormal cells that can’t move into neighbouring tissue or spread to other parts of the body

I’ll show you how to upload this data into our project a bit later.

Importing libraries

Now we’re ready to start actually coding! Get your developer hats on 🎩

Tip: Use # to make notes within your code. It helps you understand and reminds you why you’re doing specific actions. This is incredibly helpful if someone else is looking at your code, or if you’re revisiting it after a while.

For this model, we’ll need to:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

What are libraries in python?

A python library is a set of useful functions that eliminate the need for you to code from scratch. They simplify the programming process and remove the need to rewrite commonly used commands.

The libraries I used were numpy, pandas, matplotlib.pyplot, and seaborn.

NumPy = Used for working with arrays. Has a large collection of high-level mathematical functions to operate on these arrays.

Pandas = Used for data manipulation and analysis. Offers data structures and operations for manipulating numerical tables and time series. (this will be helpful when we start analyzing our data)

Matplotlib.pyplot = Matplotlib is a multi-platform data visualization library built on NumPy arrays, and pyplot is a module in matplotlib. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

Seaborn = A data visualization library based on matplotlib. Provides a high-level interface for drawing attractive and informative statistical graphics. (this will be helpful when we need to put our data into graphs)

Loading the Data

# Load the data
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('data.csv')
df.head(7)

When you run this cell, you should immediately get a request to choose a file. Upload the data that we just downloaded from Kaggle.

In this section (and in the ones to come) you’ll notice that we use “df” a lot. “df” is a Pandas DataFrame (remember we imported the Pandas library as “pd” earlier): a 2D, mutable, tabular data structure with labeled axes (rows and columns). It lets us work with our data in table form.

“df.head” is a way in which the computer will spit back out the first 7 rows of our data set.

There are multiple different ways we will use “df” in this project.

Cleaning Data

Before we use the data, we need to clean it so that our model is effective.

We need to clear any empty values and zeroes, and count the number of both malignant and benign tumours.

We’re going to start off using the Pandas DataFrame to do some of these actions:

# Count the number of rows and columns in the data set
df.shape

# Count the number of empty (NaN, NAN, na) values in each column
df.isna().sum()

# Drop every column that contains a missing value
df = df.dropna(axis=1)

# Get the new count of the number of rows and columns
df.shape

# Get a count of the number of Malignant (M) or Benign (B) cells
df['diagnosis'].value_counts()

In my data set, I had no missing values, so when I got the new count of the number of rows and columns, I still had the original number of rows and columns (569, 32).
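To see what these DataFrame calls do without the full Kaggle file, here is a minimal sketch on a tiny made-up table. The column names mirror the real data set, but the values and the `empty_col` column are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three tumours and one column that is entirely empty
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 11.42, 12.45],
    "empty_col": [np.nan, np.nan, np.nan],
})

shape_before = toy.shape                       # (rows, columns)
missing_per_column = toy.isna().sum()          # NaN count in each column
cleaned = toy.dropna(axis=1)                   # drop columns containing NaN
shape_after = cleaned.shape
counts = cleaned["diagnosis"].value_counts()   # how many M and how many B

print(shape_before, shape_after)
print(missing_per_column)
print(counts)
```

Only the empty column disappears, so the row count stays the same while the column count drops by one.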

For my

# Encode the categorical data values (B becomes 0, M becomes 1)
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
# Assign by column name so the column is stored as integers
df['diagnosis'] = labelencoder_Y.fit_transform(df['diagnosis'].values)

This step encodes the categorical values in the diagnosis column as numbers (B becomes 0 and M becomes 1), because the models need numeric labels. I also created a pair plot so that we could see the data more clearly.
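As a rough picture of what the label encoder does, the sketch below reproduces the idea in plain Python: sort the distinct labels and replace each one with its position. The `encode_labels` helper is hypothetical and written for this article; it is not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))   # alphabetical order of distinct labels
    lookup = {label: index for index, label in enumerate(classes)}
    return [lookup[label] for label in labels], classes

# Hypothetical diagnosis column
encoded, classes = encode_labels(["M", "B", "B", "M", "B"])
print(classes)
print(encoded)
```

Because "B" sorts before "M", benign ends up as 0 and malignant as 1.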

# Create a pair plot
sns.pairplot(df.iloc[:,1:5], hue='diagnosis')

This is what my data plots looked like:

More data 😍

We then need to get the correlations between the columns, and then put it into a visual representation.

I used Pandas DataFrame “.head” again to spit out the first 5 rows of our now clean data, and then used the property “iloc” to select all of the rows and the columns at positions 1 through 11 (the diagnosis column plus the next ten feature columns) from our data set.

To visualize the correlation, we are going to use the matplotlib library (plt). The .figure function is used to create a new figure. Here we will be using the “figsize” parameter, which sets the width and height in inches.

A heatmap in the seaborn library (sns) is a two-dimensional graphical representation of data where the individual values that are contained in a matrix are represented as colors. It contains values representing various shades of the same colour for each value to be plotted.

The colour bar next to the chart shows which shades correspond to which values (with seaborn’s default colour map, the lighter shades represent higher values). The “annot” attribute here allows text to be written on each cell. If we set annot to True, the correlation value will be written on every cell (which is what we want).

df.head(5)
df.iloc[:,1:12].corr()
plt.figure(figsize=(10,10))
sns.heatmap(df.iloc[:,1:12].corr(), annot=True)

This is what you should get:
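To make the numbers printed in the heatmap cells concrete, here is a minimal sketch of the Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. The `pearson` helper and the measurements are hypothetical, made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements for four tumours
radius = [10.0, 12.0, 14.0, 16.0]
perimeter = [63.0, 75.5, 88.0, 100.5]    # rises in step with radius
smoothness = [0.12, 0.09, 0.11, 0.10]    # only loosely related to radius

r_perimeter = pearson(radius, perimeter)
r_smoothness = pearson(radius, smoothness)
print(round(r_perimeter, 3), round(r_smoothness, 3))
```

A value near 1 means two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.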

Building the actual model

Now that we’ve cleaned our data, we’re ready to start building the actual model.

We first need to split the data into an independent (X) data set, which holds the features, and a dependent (Y) data set, which holds the diagnosis, using Pandas DataFrame and “iloc” to select the specific values that we’ll need from our data set.

Our next step is to import train_test_split from sklearn.model_selection. train_test_split is a function in Sklearn model selection that splits data arrays into two subsets: one for training data and the other for testing data. Sklearn will make random partitions for the two subsets.

This makes our job easier in some ways. If we didn’t use this library and this function, we would need to divide the dataset manually. Why make our job harder? Work smart, play harder kiddos 😎

I then created 4 variables (X_train, X_test, Y_train, Y_test) and set them equal to train_test_split.

X = df.iloc[:,2:31].values
Y = df.iloc[:,1].values
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

You’re probably wondering — why do we need to split the data into training and testing sets?

It’s so that we can verify how well our model performs on unseen data. Because both sets come from the same data set, we also minimize the effects of data discrepancies.

You’re also probably wondering — what does test_size and random_state mean?

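test_size controls the size of the testing data set. Here our value is 0.25, which means that 25% of our data will be a part of our testing data set. The number you set for your test_size needs to be in-between 0.0 and 1.0 (because you can’t have more than 100% of your data going into your testing data set). random_state sets the seed for the random shuffling that happens before the split, so using the same number (here 0) gives you the same split every time you run the code.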
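To show the idea behind a seeded random split, here is a simplified sketch in plain Python. The `simple_split` helper is hypothetical and is not how scikit-learn implements `train_test_split`; it only illustrates shuffling with a fixed seed and then cutting off a test portion.

```python
import random

def simple_split(rows, test_size=0.25, seed=0):
    """Shuffle rows with a fixed seed, then cut off a test portion."""
    shuffled = rows[:]                        # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)     # same seed gives same order
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

rows = list(range(20))                        # stand-in for 20 patients
train, test = simple_split(rows, test_size=0.25, seed=0)
train_again, test_again = simple_split(rows, test_size=0.25, seed=0)

print(len(train), len(test))
print(test == test_again)
```

Running the split twice with the same seed gives identical subsets, which is what makes results reproducible.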

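Next, we need to scale the data. Scaling transforms every feature onto a comparable scale, so that features with large values (such as area) do not dominate features with small values (such as smoothness). We import StandardScaler from Sklearn for this; it rescales each feature to have a mean of 0 and a standard deviation of 1, rather than squeezing the values between 0 and 1.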

Explanation: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.

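from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Learn the mean and standard deviation from the training data, then scale it
X_train = sc.fit_transform(X_train)
# Scale the testing data with the parameters learned from the training data
X_test = sc.transform(X_test)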

The function fit_transform allows us to scale the training data and learn the scaling parameters of that data.
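The arithmetic behind standardization can be sketched in plain Python. The `fit_scaler` and `apply_scaler` helpers and the feature values below are hypothetical; the sketch assumes the population standard deviation, which is what StandardScaler uses.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation of a feature."""
    return statistics.fmean(values), statistics.pstdev(values)

def apply_scaler(values, mean, std):
    """Subtract the mean, then divide by the standard deviation."""
    return [(v - mean) / std for v in values]

# Hypothetical values of one feature
train_feature = [10.0, 12.0, 14.0, 16.0, 18.0]
test_feature = [11.0, 20.0]

mean, std = fit_scaler(train_feature)               # learned from training data only
train_scaled = apply_scaler(train_feature, mean, std)
test_scaled = apply_scaler(test_feature, mean, std)  # reuse the same parameters

print([round(v, 2) for v in train_scaled])
print([round(v, 2) for v in test_scaled])
```

The scaled values can be negative or larger than 1; what standardization guarantees is that the training feature ends up with mean 0 and standard deviation 1.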

Creating a function for the models

To start off, we need to define a function called models that takes our training data, X_train and Y_train.

def models(X_train, Y_train):

    # Logistic regression
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)

Logistic regression is used to examine the association of independent variables (categorical or continuous) with one dependent variable. This is useful in this project because we need to detect whether the patient has breast cancer (the dependent variable) using multiple different categories of data like smoothness, area, etc. (the independent variables).

Logistic regression is a predictive analysis algorithm and is used to predict the probability of a target variable (in our case, whether the patient has breast cancer). It limits its output to a probability between 0 and 1, which is then turned into a class label. That is why our output is a series of 0s and 1s: 0 means the patient does not have breast cancer and 1 means they do.

The purpose of setting random_state=0 is so that we get the same results every time we execute this.
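The function that keeps logistic regression’s output between 0 and 1 is the sigmoid. Here is a minimal sketch; the `predict_label` helper, the 0.5 threshold, and the input values are assumptions chosen for illustration, not code taken from scikit-learn.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Turn the probability into a 0/1 class label."""
    return 1 if sigmoid(z) > threshold else 0

# Hypothetical weighted sums of a patient's features
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
```

Large negative inputs land close to 0 (benign), large positive inputs land close to 1 (malignant), and an input of 0 sits exactly on the 0.5 boundary.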

Now, let’s move onto the Decision tree 🌲

Decision trees are used for classification and regression. They predict the value of a target variable by learning simple decision rules inferred from the data features.

The goal of a Decision Tree is to create a training model that can predict the class/value of the target variable (whether the patient has breast cancer or not) by learning simple decision rules inferred from the training data.

In order to use a Decision tree for this model, we will need to import it from the Sklearn library.

    # Decision Tree
    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion = 'entropy', random_state=0)
    tree.fit(X_train, Y_train)

I then created a variable called “tree” and set it equal to the DecisionTreeClassifier that we imported prior.

The next step is to set the parameters for our Decision Tree → “criterion” determines how the impurity of a split will be measured. The value we set to equal the criterion (entropy) acts as the metric for impurity.

Okay — but what does impurity mean?

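Impurity measures how mixed the class labels are within a group of samples. One common way to think about it is how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset (this is the Gini impurity). Entropy, the criterion we chose, measures the same idea using information theory: it is 0 when every sample in the group has the same label and highest when the labels are evenly split.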
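Both impurity measures are short formulas, sketched below in plain Python. The `entropy` and `gini` helpers and the label lists are hypothetical examples written for this article.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def gini(labels):
    """Gini impurity: chance of mislabeling a randomly drawn sample."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

pure = [0, 0, 0, 0]     # every tumour benign
mixed = [0, 0, 1, 1]    # half benign, half malignant

print(entropy(pure), entropy(mixed))
print(gini(pure), gini(mixed))
```

A decision tree prefers splits that move its groups from the “mixed” end toward the “pure” end.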

To train this model, we are going to use the “.fit” command on our training data (X_train, Y_train).

Let’s move onto our Random Forest Classifier 🌳🌳🌳

    # Random Forest Classifier
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    forest.fit(X_train, Y_train)

The process of making our Random Forest Classifier is pretty similar to the Logistic Regression and Decision Tree. The new parameter that we added to this model is the “n_estimators”.

This parameter specifies the number of trees in the forest of the model. Here the value we set is 10 — which means that 10 different decision trees will be constructed in the random forest.

Note-to-self: More trees generally make the model’s performance better and more stable, with diminishing returns (but it makes your code slower)
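A simplified way to picture how a forest combines its trees is majority voting, sketched below. This is an illustration only: the `majority_vote` helper and the votes are hypothetical, and scikit-learn’s RandomForestClassifier actually averages the trees’ predicted probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label predicted by most of the trees."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions from 10 trees for one patient (1 = malignant)
tree_votes = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)
```

Individual trees can be wrong, but their mistakes tend to be outvoted when most of the trees agree.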

We obviously want to see how accurate these models we made are, so in order to do that — we simply just need to use the print command in order to print the models’ accuracy.

The accuracy is represented by the “score” function.

    # Print each model's accuracy on the training data
    print('[0]Logistic Regression Training Accuracy:', log.score(X_train, Y_train))
    print('[1]Decision Tree Classifier Training Accuracy:', tree.score(X_train, Y_train))
    print('[2]Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train))

    return log, tree, forest

In the last line, we returned all of our models.

What does returning mean?

A return statement is used to end the execution of the function call and “returns” the result (value of the expression following the return keyword) to the caller.
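When a function returns several values at once, Python packs them into a tuple, and each one can be picked out by its position. The sketch below uses a hypothetical `build_names` function with plain strings standing in for the trained models.

```python
def build_names():
    """Return three values at once; Python packs them into a tuple."""
    first, second, third = "log", "tree", "forest"
    return first, second, third

result = build_names()
print(result[0])      # the first returned value
print(result[2])      # the third returned value
print(len(result))    # how many values came back
```

This is why the returned models can later be reached with an index such as 0, 1, or 2.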

Our next step would be to visually see how well our model performs on our training data:

model = models(X_train, Y_train)

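The results that I got showed me that the Decision Tree Classifier was the most accurate on the training data, as it displayed 100% accuracy. Keep in mind that a perfect score on data the model has already seen can be a sign of overfitting, which is why we check the testing data next.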

Testing our Model on the Testing Data

We are going to use a confusion matrix in order to do this.

What is a confusion matrix you may ask?

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class.

Basically, the confusion matrix is going to allow us to compare the actual values in our testing data with the values our model would have predicted.

from sklearn.metrics import confusion_matrix

P.S: The name confusion matrix stems from the fact that this matrix makes it easy to see whether the model is confusing two classes (ex. whether the model is labelling things with the wrong labels).
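To see where the four numbers in a confusion matrix come from, here is a minimal sketch that counts them by hand. The `confusion_counts` helper and the label lists are hypothetical; the layout (rows for actual classes, columns for predicted classes) follows the one scikit-learn uses.

```python
def confusion_counts(actual, predicted):
    """2x2 matrix: rows are actual classes, columns are predicted classes."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Hypothetical labels for eight patients (0 = benign, 1 = malignant)
actual    = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

cm_demo = confusion_counts(actual, predicted)
print(cm_demo)
```

The diagonal holds the correct predictions, and the two off-diagonal cells hold the cases where the model confused one class for the other.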

In order to use the confusion matrix to see how accurate our model is, we are going to create a variable called “cm” and input our testing data (Y_test) and our model’s prediction.

We’re going to put the model at position 0 (which is the Logistic Regression model), and put in its prediction for the testing data (X_test).

cm = confusion_matrix(Y_test, model[0].predict(X_test))print(cm)

This gives us the model’s predictions for the testing data, laid out against the actual values.

When we print cm, it will show us the TP, TN, FP and FN (True positive, true negative, false positive and false negative).

What do these values mean?

True Positive = We predicted yes, and the patient did have the disease.

True Negative = We predicted no, and the patient didn’t have the disease.

False Positive = We predicted yes, but the patient didn’t have the disease.

False Negative = We predicted no, but the patient did have the disease.

These are the values that I got:

TP = 50

TN = 86

FN = 3

FP = 4

These values are going to help us calculate how accurate our model is on our testing data.

So now underneath the line of code where we defined what “cm” was going to be, we are going to create variables for each of these categories.

The 0s and 1s tell us the position of each value in the matrix: the first index is the actual class (the row) and the second is the predicted class (the column), where 0 means benign and 1 means malignant.

cm = confusion_matrix(Y_test, model[0].predict(X_test))
TN = cm[0][0]
TP = cm[1][1]
FN = cm[1][0]
FP = cm[0][1]
print(cm)
print('Testing Accuracy = ', (TP + TN) / (TP + TN + FN + FP))
print()

We are then going to print the “Testing Accuracy” which we can find with this formula:

(TP + TN)/ (TP + TN + FN + FP)

The value that I got for the testing accuracy was around 94%.

If you’re curious and want to know how you could find solely the FP rate (the rate at which our model predicts the patient has the disease when they actually don’t), the TN rate or the FN rate, I found a thread that has the formulas you would use:
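The standard definitions of these rates can be sketched as a small function. The `rates` helper is hypothetical, and the counts are read off the Logistic Regression matrix reported in this article ([[84 6] [2 51]]), assuming the layout [[TN, FP], [FN, TP]].

```python
def rates(tn, fp, fn, tp):
    """Accuracy and error rates from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),   # healthy patients flagged as sick
        "true_negative_rate": tn / (tn + fp),    # healthy patients correctly cleared
        "false_negative_rate": fn / (fn + tp),   # sick patients the model missed
    }

result = rates(tn=84, fp=6, fn=2, tp=51)
for name, value in result.items():
    print(name, round(value, 3))
```

In a medical setting the false negative rate deserves particular attention, because it counts the patients who have the disease but were told they don’t.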

So now that we found out the accuracy of our Logistic Regression model, we should probably also find the accuracy for our Decision Tree and Random Forest Classifier 🌳

In order to do that, we’re going to make some changes to our previous code:

for i in range(len(model)):
    print('Model ', i)
    cm = confusion_matrix(Y_test, model[i].predict(X_test))

We are going to create a loop by using “i in range”. This basically means that we’re going to do something n times.

Now, what is this something?

We are going to be calculating the accuracy of the models n times.

But how many times are we going to calculate this? What does n represent?

Our n is going to equal the length of our model, which is 3 because we have 3 different models. That means i goes from 0 to 2.

    # Rows are actual classes, columns are predicted classes (0 = benign, 1 = malignant)
    TN = cm[0][0]
    TP = cm[1][1]
    FN = cm[1][0]
    FP = cm[0][1]
    print(cm)
    print('Testing Accuracy = ', (TP + TN) / (TP + TN + FN + FP))
    print()

This code is pretty much the same as what we had before, and we’re going to include all of this in the same loop.

The only thing different is the “print ()”.

This is what I got after running this cell:

Model 0 [[84 6] [ 2 51]] Testing Accuracy = 0.9440559440559441 (Logistic Regression model has 94% accuracy on testing data)

Model 1 [[84 6] [ 1 52]] Testing Accuracy = 0.951048951048951 (Decision Tree model has 95% accuracy on testing data)

Model 2 [[87 3] [ 2 51]] Testing Accuracy = 0.965034965034965 (Random Forest Classifier model has 96% accuracy on testing data)

Therefore our Random Forest Classifier is the most effective model 🌳🌳

The Final Product

You’ve made it! We’re at the final stage of building our model.

Now you’re going to see for yourself what this model can do 😎

We’re going to start off by creating a variable called “pred” and set it equal to model[2].predict(X_test)

We’re using model[2] because our Random Forest Classifier was our most accurate model and if you scroll up to when we found out the accuracy of each of our models — we saw that Model 2 represented the Random Forest Classifier.

The “.predict(X_test)” is going to allow the model to predict the values (meaning whether the patient has cancer or not) on the testing data from its features.

pred = model[2].predict(X_test)
print(pred)
print()
print(Y_test)

Lastly, after we print this prediction, we’re going to also print the Y_test values which are the actual values of the patients showing whether they actually have cancer or not.

These are the results I got:

Our model’s prediction:

[1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1]

The actual values:

[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1]

As you can see, there are some places where our model’s prediction is wrong, which means our model is not 100% accurate/perfect.
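Spotting the differences between two long rows of 0s and 1s by eye is hard, so here is a minimal sketch that lists the positions where they disagree. The `compare` helper and the short label lists are hypothetical stand-ins for the real prediction and Y_test arrays.

```python
def compare(predicted, actual):
    """Return the positions where the two label lists disagree."""
    return [i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]

# Hypothetical short label lists (0 = benign, 1 = malignant)
predicted = [1, 0, 0, 1, 0, 1]
actual    = [1, 0, 1, 1, 0, 0]

wrong = compare(predicted, actual)
accuracy = 1 - len(wrong) / len(actual)
print(wrong)
print(round(accuracy, 3))
```

The same function works on the full arrays, since it only needs two sequences of equal length.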

But saying that, we could tweak some of the parameters or test the Logistic Regression/Decision Tree to increase the accuracy to get closer to 100%.

If you got to the end of this article, thanks for reading! I really appreciate it.

Even if you didn’t end up following the tutorial, I hope you came to some sort of appreciation for the impact that AI can have in healthcare.

By leveraging AI, we can improve the diagnosis of some of the world’s leading causes of death.

If you liked this article, please leave it some claps and connect with me on LinkedIn!

Feel free to email me at if you have any questions :)

Until next time ✌🏽

On a path to impact billions. Yeah, billions.
