9 in 10 adults don’t know they have Chronic Kidney Disease — we can fix that.

Christina Vadivelu
Apr 19, 2021

More than 1 in 7 US adults (that’s 15%, or 37 million people) are estimated to have chronic kidney disease (CKD).

That’s close to the number of casualties in World War I.

9 in 10 adults with CKD don’t know they have CKD.

About 2 in 5 adults with severe CKD don’t know they have CKD.

What is CKD anyways?

CKD means your kidneys are damaged and can’t filter blood the way they should. The disease is called “chronic” because the damage to your kidneys happens slowly over a long period of time. This damage can cause wastes to build up in your body.

It’s a common condition often associated with getting older. It’s also proven to be more common in people who are Black or of South Asian origin.

If you don’t know you have it, how are you supposed to figure out how to treat it?

We can solve this issue using artificial intelligence to predict CKD.

I’m going to walk you through how to build a program that classifies patients as having CKD or not using Artificial Neural Networks.

What is a Neural Network?

Check out this article I wrote outlining the basics of a neural network, and a brief explanation of the various types (ANNs, CNNs, and RNNs) 😎

Before we keep going, I’d like to give a HUGE SHOUTOUT to this video by Computer Science! They guided me through the whole process of building this model while explaining it thoroughly.

Alright, let’s get back to our regular programming (haha, see what I did there 😉)

We’ll need to start with an IDE. I used Google Colab and I definitely recommend it, but use whichever environment you prefer. The world is your oyster 🐚

We’re going to start by importing some of the necessary libraries for this model. If you’re new to Python libraries, they’re basically collections of pre-written code that handle tasks we would otherwise have to code manually. By importing these libraries, we’ll save some time.

import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras as k
from keras.models import Sequential, load_model
from keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

Next, we’re going to input the data that we’re going to use. I went on Kaggle and found a dataset that I’ll link here.

Data 😍

I saved the file as “kidney_disease.csv” and loaded that into my program.

Then, using the pandas DataFrame, I printed the first 5 rows of my data and got the number of rows and columns.
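That step might look something like this (a minimal sketch, assuming the CSV sits in your working directory):

import pandas as pd

# Load the Kaggle data set saved as kidney_disease.csv
df = pd.read_csv('kidney_disease.csv')

# Print the first 5 rows and the (rows, columns) shape
print(df.head())
print(df.shape)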

Cleaning the Data

When we look at our data, we have a bunch of columns that are not that relevant to our model.

We could keep these columns and create this model without deleting the irrelevant columns — but we would be compromising the accuracy of our model.

In other words, our model wouldn’t be that good at predicting who has chronic kidney disease and who doesn’t.

The columns that are needed by this model are:

sg = urinary specific gravity = a measure of the concentration of solutes in the urine (measures the ratio of urine density compared with water density and provides information on the kidney’s ability to concentrate urine)

al = albumin = helps keep fluid in your bloodstream so it doesn’t leak into other tissues

sc = serum creatinine = helps us estimate how quickly the kidneys filter blood (glomerular filtration rate)

hemo = amount of hemoglobin (protein in RBCs that carry oxygen throughout the body) in the patient’s blood

pcv = packed cell volume = measure of the ratio of the volume occupied by the red cells to the volume of whole blood in a sample of capillary, venous, or arterial blood (used to estimate the need for certain blood transfusions and monitor the response to treatment)

wbcc = # of white blood cells

rbcc = # of red blood cells

htn = hypertension (high blood pressure = when the force of the blood against your artery walls is high enough that it may cause health problems)

classification = whether the patient has CKD or not

In order to keep these columns, I first made a variable called “columns_to_retain”.

We’re now going to drop the rest of the columns, using a list comprehension with an if condition to pick out every column that isn’t in that list.

columns_to_retain = ['sg', 'al', 'sc', 'hemo', 'pcv', 'wbcc', 'rbcc', 'htn', 'classification']

# Drop every column that isn't in the keep-list, then drop rows with missing values
df = df.drop([col for col in df.columns if col not in columns_to_retain], axis=1)
df = df.dropna(axis=0)

We use the pandas drop() function to remove the irrelevant columns. You might also have noticed the axis argument.

What is the axis?

When we set the axis=0, it means that we want to manipulate the horizontal axis (or in a data set → the rows of the data). When we set the axis=1 (you guessed it!), it means that we want to manipulate the vertical axis (the columns in our data set).

When using the pandas DataFrame to drop columns or rows, if we specify axis=1 we will remove columns, and if we specify axis=0 we will remove rows. We then need to use the pandas DataFrame dropna() function to remove the rows with Na values to ensure that our model performs at the highest accuracy.
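Here’s a tiny toy DataFrame (made-up values, just for illustration) showing the difference:

import pandas as pd

# Toy frame: column 'b' holds a missing value in row 1
toy = pd.DataFrame({'a': [1, 2], 'b': [3, None]})

print(toy.drop(['b'], axis=1))  # axis=1 removes the column 'b'
print(toy.drop([0], axis=0))    # axis=0 removes the row with index 0
print(toy.dropna(axis=0))       # drops row 1, which contains the NaN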

You also may have noticed that some of our columns contain data that aren’t numbers (e.g., the “htn” column has either “yes” or “no” as values).

Our model won’t be able to read words, which means that we need to convert those words into numbers. We could go through the data by hand and identify each column with non-numeric values, but a loop with an if check does it for us much faster.

Here’s what it will look like:

for column in df.columns:
    # Skip columns that are already numeric
    if np.issubdtype(df[column].dtype, np.number):
        continue
    # Encode the remaining (text) columns as integers
    df[column] = LabelEncoder().fit_transform(df[column])

In the first line, we create a loop over each of the DataFrame’s columns.

In the second line, we check whether the column’s data type (.dtype) is already numeric (np.issubdtype(df[column].dtype, np.number)); if it is, we just continue on to the next column.

But if it doesn’t equal a number, then we need to transform that column:

To transform df[column], we use LabelEncoder (which, if you remember, we imported from the sklearn library). Its .fit_transform() function allows us to transform this data.

But what is label encoding?

Label encoding refers to converting text labels into numeric form so that the data is machine-readable.
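For example, encoding a toy “yes”/“no” column (sklearn assigns the integers alphabetically, so “no” becomes 0 and “yes” becomes 1):

from sklearn.preprocessing import LabelEncoder

print(LabelEncoder().fit_transform(['yes', 'no', 'no', 'yes']))
# -> [1 0 0 1]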

I then used the “df.head()” command to print the first 5 rows of the (now cleaned) data set to make sure that everything looked right.

Here, we can see that in the “htn” (hypertension) column, we have either 1s or 0s. If we compare this to our original data, we can see that 0 = no and 1= yes (meaning that 0 = patient doesn’t have hypertension and 1 = patient has hypertension).

Separating the data

Now that our data is cleaned, we want to split the data into 2 sets so that we can build and (later) train our model. We are going to split it into an X dataset (which will be our features) and a Y dataset (which will be the answer — whether our patient has CKD or not).

How do we classify X and Y such that X is only the features data and Y is only the classification data?

Well for our X variable, we basically need all the data except the classification column. Therefore, we can use the df.drop() function to remove the classification label and set axis=1 to specify that we need to remove the classification column.

For our Y variable, we only need the classification column, so we can just set y = df['classification'].

This is what it should look like:

X = df.drop(['classification'], axis=1)
y = df['classification']

At the moment, our features cover very different ranges of values, some much larger than others. If we build the model on this raw data, our machine learning algorithm may let the features with larger numeric ranges dominate those with smaller ranges. This compromises the effectiveness of our model.

So in order to counteract this, we are going to scale our feature data.

What is scaling?

Scaling allows us to transform the data so that the features are within a specific range (like from 0 to 1).

We can use the MinMaxScaler (we imported this from the Sklearn library earlier) in order to do this. For each value in our X independent data set, MinMaxScaler subtracts the minimum value in the feature and then divides it by the range.
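As a quick worked example (with made-up numbers), a value of 4.2 in a column that ranges from 1.0 to 5.0 scales to (4.2 - 1.0) / (5.0 - 1.0) = 0.8:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column: minimum 1.0, maximum 5.0
toy = np.array([[1.0], [4.2], [5.0]])
print(MinMaxScaler().fit_transform(toy).ravel())
# -> [0.  0.8 1. ]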

x_scaler = MinMaxScaler()
x_scaler.fit(X)

As you can see, I created a variable called x_scaler, set it equal to MinMaxScaler(), and then fitted the scaler to the X (feature) data set so it learns each column’s minimum and maximum.

column_names = X.columns
X[column_names] = x_scaler.transform(X)

Then I made a variable called column_names and set it equal to all of the columns in our X feature data set. I set X[column_names] equal to x_scaler.transform(X) so that all of the data in the X columns would be scaled.

Now we’re going to split our data so that we have some for training and some for testing. I split it into 80% training and 20% testing by doing the following:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle=True)

Setting shuffle=True tells train_test_split to randomize the order of the rows before splitting, so the training and test sets aren’t biased by the original ordering of the data.

Building the Neural Network

We’re going to build an Artificial Neural Network.

Brief explanation: neural networks are computational models inspired by an animal’s central nervous system. They are intended to simulate the behaviour of biological systems composed of “neurons”, and they are capable of machine learning as well as pattern recognition.

To build our ANN we’ll need to:

model = Sequential()
model.add(Dense(256, input_dim=len(X.columns),
                kernel_initializer=k.initializers.random_normal(seed=13),
                activation='relu'))
model.add(Dense(1, activation='hard_sigmoid'))

Sequential() = used for a plain stack of layers where each layer has exactly one input tensor and one output tensor

model.add → allows us to create the layers of the model

Dense → a layer that feeds all outputs from the previous layer to all of its neurons, each neuron providing one output to the next layer (Dense(14) has 14 neurons, Dense(459) has 459 neurons)

input_dim (or input shape) → the shape of the input you send to the first hidden layer. It must match the shape of your training data (hence why we set it equal to len(X.columns), the number of feature columns).

kernel_initializer → defines the way to set the initial random weights of the Keras (the neural network library we imported previously) layers

seed → used to initialize the random number generator

activation='relu' → stands for “Rectified Linear Unit”, an activation function that usually achieves better performance and generalization in deep learning compared to the sigmoid activation function. It takes a single number as input and returns 0 if the input is negative, and the input itself otherwise.
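To see ReLU in action, here’s a tiny sketch of the function it computes:

import numpy as np

def relu(x):
    # 0 for negative inputs, the input itself otherwise
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.5])))  # -> [0.  0.  3.5]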

Our next step is compiling the model:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The purpose of compiling is to set the “optimizer” and “loss” function for “training”.

Here you can see that I set the loss equal to binary cross-entropy (loss function). For the purpose of time, I’m not going to go into depth here explaining binary cross-entropy, but I found a very good article explaining the concept that I encourage you to check out :)

I then set the optimizer equal to Adam. Adam is an optimization algorithm (a popular replacement for classical stochastic gradient descent) that will train our model. It performs gradient-based optimization of stochastic objective functions.

Wait — what’s an optimization algorithm?

It’s a procedure that is executed repeatedly by comparing various solutions until an optimum or satisfactory solution is found.

And hold up — stochastic objective functions?

This just refers to a bunch of methods for minimizing/maximizing an objective function (i.e., the function that gives us the numerical value of what we’re looking for — in other words, the objective) when randomness is present. I recommend checking this paper out if you want to dive deeper into stochastic optimization.

Relating this all back to our model, this allows us to minimize/maximize our function that will tell us whether the patient has CKD or not. We need this because we have such a huge range of values in our feature data.

Training the Model

I created a variable called “history” and set that equal to model.fit(), which trains the model on our data and records how well it fits along the way.

But what does this mean?

Now that we’ve built our model, we need some way to evaluate how accurate it is. Model fitting allows us to see how well our model does in relation to similar data on which it was trained.

But how will we know if our model is well-fitted?

If the differences between the observed values and the model’s predicted values are small and unbiased — it indicates that our model is well-fitted.

The parameters for model.fit are going to be our data (X_train and y_train), our epochs will equal 2000, and batch_size=X_train.shape[0].

history = model.fit(X_train, y_train, epochs = 2000, batch_size= X_train.shape[0])

The batch size controls the accuracy of the estimate of the error gradient when training our ANN; since we set batch_size to X_train.shape[0] (the number of training rows), each update uses the entire training set as a single batch. Our epochs are the number of cycles through the full training dataset that we want our model to go through.

Run this cell, and get a coffee as our model goes through training ☕️

Last Steps

If you scroll down to the 2000th epoch, you should see that our model has a loss of 0.0043 and an accuracy of 0.9956.

99.56% is basically 100%…right?

Don’t forget to also save the model so we don’t lose all of our hard work 😉 :

model.save('ckd.model')
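And since we imported load_model at the very top, we can bring the saved model back later without retraining it:

from keras.models import load_model

# Reload the trained model from the file we just saved
model = load_model('ckd.model')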

I’m also going to show you how we can visually track our model’s accuracy and loss through each cycle. Basically, I’m going to show you the model’s whole training process at a glance in a single graph.

We’re going to use a library we imported earlier — Matplotlib — to help us.

Below, you can see that I used the plot commands to plot both the accuracy and loss from our model’s training process, as well as add titles to our graph and axes.

plt.plot(history.history['accuracy'])
plt.plot(history.history['loss'])
plt.title('model accuracy & loss')
plt.ylabel('accuracy and loss')
plt.xlabel('epoch')
plt.legend(['accuracy', 'loss'])  # label the two curves
plt.show()
Our accuracy and loss 😍

To see how well our model performs at predicting whether or not a patient has CKD, we’re going to:

  • Print the shape of our training data and testing data (the shape just means how many columns/rows there are)
  • Show the predicted and actual values of our model
print('shape of training data:', X_train.shape)
print('shape of test data:', X_test.shape)

This will give us the number of columns and rows in our training data and test data (80% and 20% of our original data, respectively).
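If you also want a single number for test-set performance, Keras’s evaluate() method returns the loss and accuracy on held-out data in one call. A quick sketch:

# Loss and accuracy on the 20% test split
loss, accuracy = model.evaluate(X_test, y_test)
print('test loss:', loss)
print('test accuracy:', accuracy)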

Then to show the predicted and actual values, we’re going to create a variable called “pred”:

pred = model.predict(X_test)
pred = [1 if y >= 0.5 else 0 for y in pred]

print('Original ; {0}'.format(", ".join(str(x) for x in y_test)))
print('Predicted ; {0}'.format(", ".join(str(x) for x in pred)))

The function “model.predict” provides us with a probability between 0 and 1 for whatever is in the brackets (in our case, X_test).

In the second line, what we’re basically saying is if the patient’s probability of having CKD is equal to or above 50%, then round that up to 100% — meaning a 1 would appear. If the patient’s probability of having CKD is lower than 50%, then round down to 0% — meaning a 0 would appear.
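As a toy example with made-up probabilities:

# Illustrative values only, not real model output
probs = [0.93, 0.12, 0.50, 0.49]
print([1 if p >= 0.5 else 0 for p in probs])  # -> [1, 0, 1, 0]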

In the last lines we’re just printing the original data (what was in our data set) and what our model would have predicted based on the features of the original data.

You’re probably wondering what .format() and .join(str(x)) mean — am I right?

If you go back to our original data and look at the classification column, you would see that we have words, or a sequence of characters, which in Python we call a “string”.

format() allows us to put one or more placeholders (defined by curly brackets — in our case, it’s the 0) into a string. join() takes a collection of items and joins them into one string. str() converts values to string form so that they can be combined with other strings.
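Putting those three together on a toy list:

values = [0, 1, 1]
print('Original ; {0}'.format(", ".join(str(x) for x in values)))
# -> Original ; 0, 1, 1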

I got the following values for my original and predicted:

Original ; 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0
Predicted ; 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0

And with that — we’re done building our model! Woohoo 🎉

I hope that this article exposed you to the gigantic impact that artificial intelligence is already having, and may yet have, in medicine! The possibilities are truly endless ✨

If you have any questions, feel free to leave a comment

Until next time ✌🏽

Linkedin

Medium

Github
