The Power of Prediction: Logistic Regression in Machine Learning
Predicting the future is a tempting prospect, and some people have even turned it into a lucrative business. However, let's set psychics aside for the moment and shift our focus to more science-based methods of prediction. While reality is complex, data science allows us to simplify it and categorize information along binary and non-binary dimensions.
This time, our emphasis is on the binary aspect. Binary refers to something having two values, typically 0 or 1, such as yes or no, black or white, and so forth. Logistic regression is a statistical method used to classify items into one of two groups. For instance, by gathering customer data on factors like age, income, and gender, we can attempt to predict the outcome variable, which in this case is the choice between Pepsi or Coke. Our outcome variable is binary since customers have the option to choose either Pepsi or Coke.
If we construct a logistic regression model that encompasses variables such as age, gender, income, and the choice between Pepsi and Coke as the outcome variable, we can forecast the decisions of new customers based on these same factors. Similarly, we can use this approach to predict various outcomes, such as whether a patient will survive after heart surgery, if an email will be categorized as spam, or if a person will purchase a ticket for a new movie.
If the model is well fitted, we can predict the outcome with relatively high accuracy. Conversely, if the model is poorly fitted, our predictions will be no better than random chance; flipping a coin would yield equally reliable predictions.
Before delving into our example, there's one more crucial aspect to address. In machine learning literature, two terms frequently appear: training set and test set. Both are derived from our original dataset. For instance, if our original dataset comprises 1000 observations, we can randomly select 700 observations to form the training set, leaving the remaining 300 observations as the test set. The training set is employed to construct a model, where we assess model assumptions, identify variables significantly linked to the outcome, and ultimately establish a final model.
However, the critical step is to evaluate whether our model is capable of predicting future results. For this purpose, we utilize the test set, which consists of observations not previously used in the model. This allows us to assess the extent to which our predicted results align with the actual outcomes.
To make this idea concrete, let's take a look at our example. We have a dataset with several patient characteristics and the outcome variable `DEATH_EVENT`, which records whether a patient is alive or dead.
Before we begin, we load all the libraries we are going to use. It's not mandatory, but we can also suppress scientific notation in printed output by using `options()`.
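A minimal preamble might look like the following; the exact packages are assumptions based on the functions used later in this walkthrough (`select()`, `varImp()`, `confusionMatrix()`, `ggplot()`):

```r
# Packages assumed from the functions used later in this walkthrough.
library(dplyr)    # select()
library(ggplot2)  # importance plot
library(caret)    # varImp(), confusionMatrix()

# Optional: suppress scientific notation in printed output.
options(scipen = 999)
```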
Our original dataset comprises 299 observations and 13 variables. Prior to commencing the model-building process, it is necessary to partition the original dataset into training and test sets.
For the sake of reproducibility, it is advisable to use the `set.seed()` function. Any numerical value will do; I am going to use 42, but you can choose any number you like.
We want our training set to consist of 70% of the observations from our original dataset, so we set the probabilities to 0.7 and 0.3. You can also set them to 0.8 and 0.2, or 0.9 and 0.1, if you want your training set to be 80% or 90%, respectively. Remember, though, that the training set should be larger than the test set.
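A sketch of the split, using a simulated stand-in for the heart-failure data so the snippet runs on its own (the column names here are assumptions; in practice you would read the real data from a file):

```r
set.seed(42)  # makes the random split reproducible

# Simulated stand-in for the 299-row dataset; replace with your own data.
df <- data.frame(age         = runif(299, 40, 95),
                 smoking     = rbinom(299, 1, 0.3),
                 DEATH_EVENT = rbinom(299, 1, 0.32))

# Assign each row to the training set with probability 0.7.
in_train  <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, prob = c(0.7, 0.3))
train_set <- df[in_train, ]
test_set  <- df[!in_train, ]
```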
We want to predict whether a patient will die or not, so our outcome variable takes two values: 0 (alive) and 1 (dead). To build a logistic regression model, we use the `glm()` function from base R, and we'll call the result `model.one`. The `data = ` argument specifies the dataset to be used, which in this case is our training set (`train_set`). To fit a logistic model rather than a linear one, we need to add `family = 'binomial'`.
Finally, we want to indicate what our outcome variable is and which variables we want to use as predictors. If you want to use only several variables from the dataframe, such as age, smoking status, and creatinine level, you can write `DEATH_EVENT ~ age + smoking + creatinine...`.
However, if you want all variables to be included in the model, instead of writing out each one you can use a simple dot (`.`). The dot indicates that every variable except `DEATH_EVENT` will be used as a predictor. The `summary()` function shows the properties of the model.
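Put together, the model call might look like this, fitted here on simulated training data since the original file isn't shown (the column names are assumptions):

```r
set.seed(42)
# Simulated training data; the column names are assumptions.
train_set <- data.frame(age              = runif(200, 40, 95),
                        smoking          = rbinom(200, 1, 0.3),
                        serum_creatinine = rnorm(200, 1.4, 0.5),
                        DEATH_EVENT      = rbinom(200, 1, 0.32))

# The dot means: use every column except DEATH_EVENT as a predictor.
model.one <- glm(DEATH_EVENT ~ ., data = train_set, family = "binomial")
summary(model.one)  # coefficients, standard errors, z-values, p-values
```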
When the model is ready, we can use the `varImp()` function to assess the importance of each variable. Importance can be interpreted as the contribution of each variable or the magnitude of its explanatory ability. The 'Overall' column expresses the importance value for each variable. We can save this information as an object called 'importance'.
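For a `glm` object, caret's `varImp()` reports the absolute value of each coefficient's z-statistic. A sketch, continuing on simulated data:

```r
library(caret)

set.seed(42)
# Simulated training data; column names are assumptions.
train_set <- data.frame(age         = runif(200, 40, 95),
                        smoking     = rbinom(200, 1, 0.3),
                        DEATH_EVENT = rbinom(200, 1, 0.32))
model.one <- glm(DEATH_EVENT ~ ., data = train_set, family = "binomial")

# For glm objects, varImp() reports the absolute z-statistic per predictor.
importance <- varImp(model.one)
importance  # one row per predictor, with an 'Overall' column
```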
Since we have only a few variables, it is relatively easy to compare them. However, when dealing with a larger number of variables, it can be useful to present their importance using a visually appealing plot. To create such a plot, we will first construct a dataframe with variable names and their respective importance values. We can extract the variable names using the `dplyr` package. Taking our `test_set`, we select every variable except the outcome variable using `select(-DEATH_EVENT)`. We then extract only the column names using the `colnames()` function.
Now, we need to combine variable names and variable importance. We can achieve this by using the `data.frame()` function and save the result as 'var.importance'. Here's the code for it:
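One way to write it, shown here on a single simulated frame so the snippet is self-contained (the column names are assumptions):

```r
library(caret)
library(dplyr)

set.seed(42)
# Simulated stand-in for the test set; column names are assumptions.
test_set  <- data.frame(age         = runif(82, 40, 95),
                        smoking     = rbinom(82, 1, 0.3),
                        DEATH_EVENT = rbinom(82, 1, 0.32))
model.one  <- glm(DEATH_EVENT ~ ., data = test_set, family = "binomial")
importance <- varImp(model.one)

# Variable names: every column of the test set except the outcome.
var_names <- test_set %>% select(-DEATH_EVENT) %>% colnames()

# Pair each name with its importance score.
var.importance <- data.frame(variable   = var_names,
                             Importance = importance$Overall)
```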
This code creates a dataframe 'var.importance' with two columns: 'variable', containing the variable names, and 'Importance', containing the corresponding importance values.
It’s not mandatory, but it looks better if variable importance is represented as a percentage. Therefore, we can add this variable to our dataset and call it fraction. Here's how you can do it:
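A sketch of the conversion (the importance scores below are made up for illustration):

```r
# Importance scores from the previous step (values here are made up).
var.importance <- data.frame(variable   = c("age", "smoking", "serum_creatinine"),
                             Importance = c(2.1, 0.8, 1.6))

# Express each score as a percentage of the total importance.
var.importance$fraction <- var.importance$Importance /
  sum(var.importance$Importance) * 100
```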
We can use the ggplot2 package to create a nice plot.
This part of the code establishes the overall structure of the plot. `geom_segment()` is used to add thin lines to the plot. However, the plot doesn't look visually appealing yet. To enhance it, we need to add points with the `geom_point()` function and label each point using the `geom_text()` function.
To ensure neat presentation, we can control the display of decimal places in the labels. For instance, you can use aes(label = round(fraction, 0)) within geom_text() to display numbers as integers. If you prefer one decimal place, you can change it to round(fraction, 1). Keep in mind that adjusting the decimal places might necessitate changes to the font size or the size of points for optimal visualization.
To enhance the plot's informativeness, we can modify the axis labels, changing the x-axis label to ‘Variable’ and the y-axis label to ‘Importance %’. Additionally, we can improve the visual representation by using coord_flip() to swap the x and y axes. To create a cleaner background, theme_minimal() can be employed.
The final code for creating the plot may look like this:
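A sketch of the full plot, built on a `var.importance` frame like the one above (the percentages here are made up; sizes and colours are personal choices, not requirements):

```r
library(ggplot2)

# Made-up percentages standing in for the computed 'fraction' column.
var.importance <- data.frame(variable = c("age", "smoking", "serum_creatinine"),
                             fraction = c(47, 18, 35))

p <- ggplot(var.importance, aes(x = variable, y = fraction)) +
  geom_segment(aes(xend = variable, y = 0, yend = fraction)) +    # thin stems
  geom_point(size = 9, colour = "steelblue") +
  geom_text(aes(label = round(fraction, 0)), colour = "white", size = 3) +
  labs(x = "Variable", y = "Importance %") +
  coord_flip() +       # swap axes: horizontal layout reads better
  theme_minimal()      # clean background
p
```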
Now, let's return to the model. After creating a model based on the training set, it's time to test it using our test set. We employ the `predict()` function, specifying the model to be tested, the new dataset (our test set), and the type of prediction. By default, `predict()` returns values on the linear-predictor (log-odds) scale, so we set the type to `'response'` to obtain probabilities. The results of this operation will be saved in an object named `probabilities`.
The test set comprises 82 observations from our original dataframe, so the `predict()` function calculates exactly 82 probabilities, one for each patient. Given that our outcome variable is binary, we can classify probabilities of 0.5 or lower as 0 and those above 0.5 as 1. These predictions will be saved in an object named `pred`.
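The two steps together, again on simulated data so the snippet is self-contained (column names and the 217/82 split are stand-ins):

```r
set.seed(42)
# Simulated stand-in for the original 299-row dataset.
df <- data.frame(age         = runif(299, 40, 95),
                 smoking     = rbinom(299, 1, 0.3),
                 DEATH_EVENT = rbinom(299, 1, 0.32))
train_set <- df[1:217, ]     # 217 training rows
test_set  <- df[218:299, ]   # 82 held-out rows
model.one <- glm(DEATH_EVENT ~ ., data = train_set, family = "binomial")

# type = "response" returns probabilities rather than log-odds.
probabilities <- predict(model.one, newdata = test_set, type = "response")

# Apply the 0.5 cut-off to turn probabilities into class labels.
pred <- ifelse(probabilities > 0.5, 1, 0)
```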
We can compare the real outcome, which is DEATH_EVENT in our test set, with our predicted outcome (pred).
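A quick way to line the two up (the values below are made up for illustration):

```r
actual <- c(0, 1, 1, 0, 1)   # made-up observed outcomes
pred   <- c(0, 1, 0, 0, 1)   # made-up predictions

# Side-by-side comparison of observed vs predicted outcomes.
comparison <- data.frame(actual = actual, predicted = pred)
comparison$correct <- comparison$actual == comparison$predicted
comparison
```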
We see that some observations were classified correctly (green rectangle), while others were predicted incorrectly (red rectangle). It's important to note that models provide estimations, and achieving 100% accuracy is rarely possible.
To assess the performance of our model, we can use the `confusionMatrix()` function. This function works with factor variables, so we need to wrap `pred` and `test_set$DEATH_EVENT` in the `factor()` function to ensure that the predicted and actual values are treated as factors.
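A sketch with simulated labels; caret's `confusionMatrix()` expects two factors with matching levels:

```r
library(caret)

set.seed(42)
actual <- rbinom(82, 1, 0.32)                          # simulated true labels
pred   <- ifelse(runif(82) < 0.2, 1 - actual, actual)  # mostly correct, for illustration

# Both arguments must be factors with the same levels.
cm <- confusionMatrix(factor(pred,   levels = c(0, 1)),
                      factor(actual, levels = c(0, 1)))
cm$overall["Accuracy"]  # proportion of correct predictions
```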
We observe that the accuracy of this model is calculated to be 0.83. Accuracy is defined as the proportion of correct predictions over the total number of predictions. In this case, we have 68 correct predictions (green rectangles) out of 82 total predictions, so the accuracy is 68/82 ≈ 0.83. A higher accuracy value indicates better model performance.