Linear vs. Logistic Regression: Unraveling the Mysteries of Predictive Analysis
Introduction
Have you ever been curious about how real estate companies accurately estimate a home's value, or how epidemiologists predict the likelihood of heart disease? The secret lies in the realms of linear and logistic regression. While these two methods are quite different from each other, they are critical tools across various industries for predicting either continuous or categorical variables.
At its core, regression is a statistical method embraced by professionals from diverse fields for predictive analysis. It's the key to unlocking the strength of the relationship between a target variable (our dependent variable) and one or more influential factors (the independent variables). Let's put this into perspective: imagine we're trying to predict the likelihood of heart disease. We collect data on a range of factors such as age, cholesterol levels, blood pressure, lifestyle habits, and family medical history. In this scenario, the likelihood of developing heart disease is the dependent variable, and the various health and lifestyle factors are the independent variables. By using the independent variables (the features of the patient), we can make predictive guesses about the dependent variable (the likelihood of heart disease).
In the following sections, I'll delve deeper into linear and logistic regression, highlighting their differences and practical applications. Additionally, I'll provide examples showcasing how these methods can be utilized with machine learning tools like Python's scikit-learn library. My goal is for you to not only enjoy this read but also gain valuable insights by the end of it.
Linear Regression
What is Linear Regression?
Well, as you might have guessed, linear regression is a form of regression analysis. To be more precise, linear regression is a statistical method for predicting the behavior of a dependent variable that holds a continuous value. Simply put, continuous means that the variable can take on any value within a certain range; think of a rotating, adjustable light dimmer, in contrast to a light switch that only takes on the values of 'On' or 'Off'.
In linear regression, the central task is to predict the value of the dependent variable based on its relationship with the independent variable(s). The relationship between these variables is typically represented by a linear equation, which manifests itself as the 'line of best fit' on a graph. This line, or regression line, is calculated to minimize the errors (distances) between the predicted values and the actual data points. In the plot below, I used Python's scikit-learn library to help visualize this definition. The regression line represents the best-fit linear relationship between the independent (X-axis) and dependent (Y-axis) variables in the dataset.
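If you'd like to produce a plot like this yourself, here's a minimal sketch using synthetic data I generated purely for illustration; it fits a line of best fit with scikit-learn and draws it with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data with a roughly linear trend plus some noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 1.5 * X.ravel() + 4 + rng.normal(0, 2, size=50)

# Fit the model and compute predictions along the line of best fit
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Sort by X so the line is drawn left to right
order = X.ravel().argsort()
plt.scatter(X, y, marker="x", color="black", label="Data points")
plt.plot(X.ravel()[order], y_pred[order], color="red", label="Regression line")
plt.xlabel("Independent variable (X)")
plt.ylabel("Dependent variable (y)")
plt.legend()
plt.show()
```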
It's important to note that linear regression comes in two main variants: simple and multiple linear regression. The difference between the two is the number of independent variables used to predict the outcome of the dependent variable. While a simple linear regression model uses only a single independent variable to make predictions, a multiple linear regression model uses two or more.
Mathematical Formula
The formula that maps the linear relationship between the predictor and response variables can be expressed by the following equation:
$$y = a_0 + a_1x_1 + a_2x_2 + \dots + a_nx_n$$
$y$ denotes the dependent variable
$a_0$ denotes the intercept: the value of $y$ when every predictor is zero
$x_i$ denotes the $i$-th predictor variable
$a_i$ denotes the coefficient of $x_i$: the average change in $y$ when $x_i$ increases by 1, holding the other predictors constant
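To make the notation concrete, here's a minimal sketch, again on synthetic data I made up for illustration, showing how a fitted scikit-learn model exposes $a_0$ through its intercept_ attribute and the remaining $a_i$ coefficients through coef_:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 2 + 3*x1 - 1.5*x2, plus a little noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))
y = 2 + 3 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)

print(model.intercept_)  # a_0, recovered close to 2
print(model.coef_)       # [a_1, a_2], recovered close to [3, -1.5]
```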
Linear Regression Metrics
In order to evaluate the performance of a linear regression model, there are several evaluation metrics we can use. Below are the metrics typically used to test linear regression model performance; a short scikit-learn sketch after the list shows how to compute them:
Mean Absolute Error (MAE): The MAE metric provides the average absolute difference between the model's predicted values and the actual values. In other words, it gives the average error produced by the model. MAE is useful when you need a single summary statistic, and it is more robust to outliers than squared-error metrics.
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^n|y_i-\hat{y}_i|$$
Mean Squared Error (MSE): The MSE metric is similar to the MAE except that the averages are calculated on the squared differences. Due to the squaring, MSE penalizes larger errors more severely than smaller errors. This metric is useful when large errors are highly undesirable.
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
Root Mean Squared Error (RMSE): The RMSE metric is the square root of the MSE. This metric essentially has the same uses as the MSE, but taking the square root scales the errors back to the target variable's units. Doing this allows us to better understand the magnitude of the errors in the context of the data we're working with.
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
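As promised, here's a minimal sketch computing all three metrics with scikit-learn; the y_true and y_pred arrays are hypothetical stand-ins for a model's actual and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted values from some fitted model
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is simply the square root of MSE

print(f"MAE:  {mae:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
```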
Real-World Example Project
Let's walk through a possible real-world example where a linear regression model may be useful. We'll be analyzing (fictitious) customer data from an E-commerce company to help the company decide whether to focus its efforts on its website or its mobile app. To provide a better experience, I created a Google Colab notebook where you can follow along with the project; you can find the entire project here.
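To give you a flavor of the workflow before you open the notebook, here's a rough sketch; the file name and column names ('Time on App', 'Time on Website', 'Length of Membership', 'Yearly Amount Spent') are placeholders for illustration, not necessarily what the actual dataset uses:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical file and column names -- adjust to match your dataset
df = pd.read_csv("ecommerce_customers.csv")
X = df[["Time on App", "Time on Website", "Length of Membership"]]
y = df["Yearly Amount Spent"]

# Hold out 30% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Comparing coefficients hints at which platform drives spending more
print(pd.Series(model.coef_, index=X.columns))
```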
Logistic Regression
What is Logistic Regression?
Now, if you're like me when I first learned about logistic regression, you might be thinking that it's just another statistical method that uses regression to predict a target variable. While that is technically true, logistic regression is used primarily to solve classification problems. These classification problems usually involve classifying the outcome of a binary event. Some examples include: spam email or not spam email, cat image or dog image, and so on. While linear regression aims to predict the value of a continuous variable, here we're trying to classify something into a certain category.
On a graph, logistic regression's output lies between 0 and 1, as the model is created to predict the outcome of a binary event; it predicts the success or failure of specific events. Below is a visualization I created of a logistic regression model. The graph shows the probability of a binary outcome (1 or 0) as a function of a single feature. The blue line represents the logistic regression curve, indicating the probability of belonging to one class (e.g., class 1) as the feature value changes. The black 'x' marks are the individual data points, showing their actual class (0 or 1) based on the feature value. As you can see, the logistic regression model provides a probability between 0 and 1, smoothly transitioning around the decision boundary.
Although I mainly describe logistic regression as a function of only one feature, it is more commonplace to use multiple independent variables to predict an outcome. Below, I created a 3D visualization that depicts the relationship between multiple features and the outcome.
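Under the hood, logistic regression passes the same linear combination of features we saw in linear regression through the logistic (sigmoid) function, which squashes any real number into the 0-to-1 range:
$$p = \frac{1}{1 + e^{-(a_0 + a_1x_1 + a_2x_2 + \dots + a_nx_n)}}$$
Here, $p$ is the predicted probability of belonging to the positive class, and the coefficients $a_i$ are learned from the data, just as in linear regression.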
Logistic Regression Metrics
Similar to linear regression, there are certain metrics used to help evaluate the effectiveness of a logistic regression model, although these metrics differ significantly. Below are the typical metrics used; as before, a short scikit-learn sketch follows the list:
Accuracy: This is the most intuitive performance metric used. It is simply the number of correct predictions divided by the total number of observations. This is helpful for measuring the proportion of correctly identified cases in a balanced dataset.
$$\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Number of Cases}}$$
Precision: This ratio is the number of correct positive guesses divided by the total number of positive predictions. This is typically used when the cost of false positives is high (e.g. spam detection).
$$\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$$
Recall: Recall is the ratio of correctly guessed positive values to all the actual positive samples. This is a helpful metric when the cost of false negatives is high (e.g. disease detection).
$$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$$
F-1 Score: The F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.
$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
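As promised, here's a minimal sketch computing all four metrics with scikit-learn, using made-up labels and predictions purely for illustration:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score
)

# Hypothetical true labels and model predictions for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 Score: ", f1_score(y_true, y_pred))
```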
Real-World Example Project
Similar to our linear regression project, the link will take you to a project that demonstrates the power of logistic regression in a machine learning context. We will be using synthetic advertising data to determine whether or not a particular customer clicked on an ad. You can find the entire project here.
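As with the linear example, here's a rough sketch of the workflow; the file name and column names ('Daily Time Spent on Site', 'Age', 'Area Income', 'Clicked on Ad') are illustrative assumptions rather than a description of the actual notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical file and column names -- adjust to match your dataset
df = pd.read_csv("advertising.csv")
X = df[["Daily Time Spent on Site", "Age", "Area Income"]]
y = df["Clicked on Ad"]  # binary target: 1 = clicked, 0 = did not

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# classification_report prints precision, recall, and F1 per class
print(classification_report(y_test, model.predict(X_test)))
```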
Conclusion
Well, there you have it! Let's recap what we learned about these two statistical methods. While the two models share some common ground in the way they rely on relationships between variables to make predictions, their natures are actually quite different. Linear regression is suited to continuous values such as house prices or temperatures. Logistic regression, on the other hand, deals with categorical outcomes, sorting values into categories such as 'yes or no' and 'win or lose'. Determining which one to use is going to be highly dependent on your use case and your specific needs.
My goal is to share the insights I gather along the way, honestly and transparently. I recognize that my understanding is still growing, and I'm not an authority on the subject. If anyone reading this notices any inaccuracies or has additional insights, please do share any feedback! I believe that we all have something to learn and contribute, and I'm here to engage in a constructive and collaborative learning process.
Remember, the world is vast and constantly evolving, and there's always something new to learn. So, keep learning, keep questioning, and most importantly, keep sharing your experiences and insights!