Scikit-learn (sklearn) is one of Python's most widely used machine learning packages. It offers efficient tools for machine learning and statistical modeling, such as classification, regression, clustering, and dimensionality reduction, through a consistent Python interface. The package is written mostly in Python and builds on NumPy, SciPy, and Matplotlib. In this article, you’ll learn more about sklearn linear regression.

What Is Sklearn Linear Regression?

Scikit-learn is a Python package that makes it easier to apply a variety of Machine Learning (ML) algorithms for predictive data analysis, such as linear regression.

Linear regression is the process of finding the straight line that best fits a set of scattered data points.

The line can then be extended to predict new data points. Because it is simple and easy to interpret, linear regression is a fundamental Machine Learning method.


Sklearn Linear Regression Concepts

When working with scikit-learn's linear regression approach, you will encounter the following fundamental concepts:

  • Best Fit: The straight line in a plot that minimizes the deviation between itself and the scattered data points
  • Coefficient: Also known as a parameter, the factor by which a variable is multiplied. In linear regression, a coefficient represents the change in the response variable for a one-unit change in the corresponding independent variable
  • Coefficient of Determination: Also called R-squared (R2), the proportion of the variance in the dependent variable that the model explains. In simple linear regression, it equals the square of the correlation coefficient. It is used to describe the degree of fit
  • Correlation: The measurable strength and direction of association between two variables, often known as the 'degree of correlation.' The values range from -1.0 to 1.0
  • Dependent Feature: The variable represented as y in the slope equation y=ax+b. Also referred to as an Output or a Response
  • Estimated Regression Line: The straight line that best fits a set of scattered data points
  • Independent Feature: The variable represented by the letter x in the slope equation y=ax+b. Also referred to as an Input or a Predictor
  • Intercept: The point where the regression line intersects the Y-axis, indicated by the letter b in the slope equation y=ax+b
  • Least Squares: A method for calculating the best fit to data by minimizing the sum of the squares of the differences between observed and estimated values
  • Mean: The average of a group of numbers. In linear regression, the mean of the response for given input values is modeled as a linear function of those inputs
  • OLS (Ordinary Least Squares Regression): The most common method of fitting a linear regression; the term is often used interchangeably with Linear Regression
  • Residual: The vertical distance between a data point and the regression line, that is, the actual value minus the predicted value
  • Regression: An estimate of how a variable is expected to change in relation to changes in other variables
  • Regression Model: The fitted formula used to approximate the relationship between the variables
  • Response Variables: This category covers both the Predicted Response (the value predicted by the regression) and the Actual Response (the actual value of the data point)
  • Slope: The steepness of a regression line. The linear relationship between two variables may be defined using slope and intercept: y=ax+b
  • Simple linear regression: A linear regression with a single independent variable
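
Several of these concepts can be seen together in a few lines of NumPy. The following is a minimal sketch on invented toy data (the five points are an assumption made for illustration, not taken from any dataset in this article). It fits y=ax+b by least squares, computes the residuals, and checks that in simple linear regression the coefficient of determination equals the squared correlation coefficient.

```python
import numpy as np

# Hypothetical toy data: five (x, y) points lying roughly along a line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares fit of y = a*x + b; for degree 1, np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

# Predicted response and residuals (actual value minus predicted value).
y_hat = slope * x + intercept
residuals = y - y_hat

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"R2={r_squared:.4f}, squared correlation={r ** 2:.4f}")
```

The two values on the last line agree only because there is a single independent variable; with several predictors, R2 is no longer the square of any one pairwise correlation.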

Sklearn Linear Regression Prerequisites

Before working with linear regression in Scikit-learn (sklearn), it is important to have a basic understanding of the following concepts:

  • Linear algebra: Linear regression involves solving a system of linear equations, so it is important to have a basic understanding of linear algebra, including concepts such as matrices, vectors, and matrix multiplication.
  • Statistics: Understanding basic statistical concepts such as mean, variance, and the standard deviation is essential for working with linear regression models.
  • Python programming: Scikit-learn is a Python library, so a basic understanding of Python programming is necessary to work with it.
  • NumPy: NumPy is a fundamental package for scientific computing in Python and is used extensively in scikit-learn. It is important to have a basic understanding of NumPy arrays and operations.
  • Pandas: Pandas is another essential package for data manipulation and analysis in Python. It is used to read and preprocess data for use in scikit-learn.
  • Data visualization: It is important to visualize and explore data before building a linear regression model. Matplotlib and Seaborn are popular data visualization packages in Python.

Once you understand these concepts well, you can start learning and working with linear regression in Scikit-learn.

How to Create a Sklearn Linear Regression Model

Step 1: Importing All the Required Libraries

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

Step 2: Reading the Dataset

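%cd C:\Users\Dev\Desktop\Kaggle\Salinity

# Changing the working directory to the location of the dataset

# (%cd is a Jupyter/IPython magic command, not plain Python; the path is specific to the author's machine)

df = pd.read_csv('bottle.csv')

df_binary = df[['Salnty', 'T_degC']].copy()

# Taking only the selected two attributes from the dataset

# (.copy() creates an independent DataFrame, which avoids pandas' SettingWithCopyWarning on later edits)

df_binary.columns = ['Sal', 'Temp']

# Renaming the columns for easier writing of the code

df_binary.head()

# Displaying the first five rows along with the column names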


Step 3: Exploring the Data Scatter

sns.lmplot(x ="Sal", y ="Temp", data = df_binary, order = 2, ci = None)

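# Plotting the data scatter with a fitted trend curve (order = 2 fits a second-degree polynomial; ci = None hides the confidence band)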

Step 4: Data Cleaning

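# Filling missing values by carrying the last valid observation forward

# (DataFrame.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions)

df_binary = df_binary.ffill()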
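
Forward filling copies the last valid value downward, so a missing value in the very first row has nothing to copy from and stays missing. The sketch below illustrates this on a small invented frame (the values are assumptions; only the column names mirror the ones used in this article) and shows why dropping the remaining NaN rows afterwards is still useful.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in the first row of 'Sal'.
df = pd.DataFrame({
    "Sal": [np.nan, 33.4, np.nan, 33.8],
    "Temp": [10.5, np.nan, 10.1, 9.9],
})

# Forward fill: each NaN takes the last valid value above it, if there is one.
filled = df.ffill()

# The leading NaN in 'Sal' survives the forward fill, so drop what is left.
cleaned = filled.dropna()

print(filled)
print(cleaned)
```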

Step 5: Training Our Model

df_binary = df_binary.dropna()

# Dropping any rows that still contain NaN values before building the arrays

X = np.array(df_binary['Sal']).reshape(-1, 1)

y = np.array(df_binary['Temp']).reshape(-1, 1)

# Separating the data into independent and dependent variables

# Converting each column into a NumPy array and reshaping it to two dimensions,

# since scikit-learn expects X to have the shape (n_samples, n_features)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Splitting the data into training and testing data

regr = LinearRegression()

regr.fit(X_train, y_train)

print(regr.score(X_test, y_test))

# Printing the R2 score of the model on the test data

Step 6: Exploring Our Results

y_pred = regr.predict(X_test)

plt.scatter(X_test, y_test, color ='b')

plt.plot(X_test, y_pred, color ='k')

plt.show()

# Plotting the test data (blue points) against the model's predictions (black line)

Our model's low R2 score indicates that the linear model did not fit the data very well. This suggests that the full dataset is poorly described by a straight line. However, a linear regressor may fit well if only a portion of the dataset is considered. Let us investigate that option.

Step 7: Working With a Smaller Dataset

df_binary500 = df_binary.iloc[:500].copy()

# Selecting the first 500 rows of the data

sns.lmplot(x ="Sal", y ="Temp", data = df_binary500, order = 2, ci = None)

We can observe that the first 500 rows follow a roughly linear trend. We continue in the same manner as before.

df_binary500 = df_binary500.ffill()

df_binary500 = df_binary500.dropna()

X = np.array(df_binary500['Sal']).reshape(-1, 1)

y = np.array(df_binary500['Temp']).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

regr = LinearRegression()

regr.fit(X_train, y_train)

print(regr.score(X_test, y_test))

y_pred = regr.predict(X_test)

plt.scatter(X_test, y_test, color ='b')

plt.plot(X_test, y_pred, color ='k')

plt.show()


Linear Regression Theory

Linear Regression is a supervised learning algorithm for predicting continuous values based on input variables. This algorithm establishes a linear relationship between the independent variables (input variables, features, or predictors) and the dependent variable (output variable or target variable).

The algorithm finds the best-fit line that minimizes the sum of squared errors between the predicted values and actual values. This line is called the regression line or best-fit line. The equation of this line is of the form:

y = β0 + β1 * x1 + β2 * x2 + … + βn * xn

where y is the dependent variable, x1, x2, …, xn are the independent variables, β0 is the intercept, and β1, β2, …, βn are the coefficients.

The goal of the Linear Regression algorithm is to estimate the values of these coefficients (β0, β1, β2, …, βn) in such a way that the sum of squared errors is minimized. This process is called the Ordinary Least Squares (OLS) method.
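
To make the OLS idea concrete, here is a minimal sketch that solves the least squares problem directly with NumPy and compares the result with scikit-learn. The data is synthetic and the generating coefficients (1, 2, -3) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept (β0) is estimated with the other coefficients.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least squares problem: minimize ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# LinearRegression should arrive at the same estimates.
model = LinearRegression().fit(X, y)

print("lstsq intercept and coefficients:  ", beta)
print("sklearn intercept and coefficients:", model.intercept_, model.coef_)
```

Both approaches minimize the same sum of squared errors, so the estimates should match up to floating-point precision.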

The scikit-learn library in Python implements Linear Regression through the LinearRegression class. This class allows us to fit a linear model to a dataset, predict new values, and evaluate the model's performance.

To use the LinearRegression class, we first need to import it from the sklearn.linear_model module. We can then create an instance of the class and call its fit method to train the model on a dataset. Finally, we can use the predict method to generate predictions on new data.
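
As a minimal sketch of this workflow, the example below fits a model to four invented points that lie exactly on y = 2x + 1 (an assumption made so the result is easy to check) and then predicts two new values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data lying exactly on the line y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Create the estimator and train it.
model = LinearRegression()
model.fit(X, y)

# Predict for inputs the model has not seen.
new_X = np.array([[5.0], [6.0]])
predictions = model.predict(new_X)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predictions:", predictions)
```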

In addition to the basic Linear Regression algorithm, scikit-learn also supports variants that can handle more complex data, such as polynomial regression, ridge regression, and Lasso regression. Polynomial regression adds powers of the input features so that a linear model can fit curved relationships, while ridge and Lasso regression add a penalty on the size of the coefficients to reduce overfitting and improve generalization.
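
The sketch below shows one way these variants might be set up. The quadratic data, the polynomial degree, and the alpha values are all assumptions chosen for illustration, not recommended settings. Polynomial regression is built here by chaining PolynomialFeatures with a linear model in a pipeline.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data with a curved relationship: y = 0.5*x^2 - x + 2 plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + 2 + rng.normal(scale=0.2, size=200)


def quadratic_features():
    # Expands x into [x, x^2]; the bias column is left to the regressor's intercept.
    return PolynomialFeatures(degree=2, include_bias=False)


# A plain straight-line fit cannot capture the curvature.
linear = LinearRegression().fit(X, y)

# Polynomial regression: feature expansion followed by ordinary least squares.
poly = make_pipeline(quadratic_features(), LinearRegression()).fit(X, y)

# Ridge adds an L2 penalty and Lasso an L1 penalty; the alpha values are arbitrary.
ridge = make_pipeline(quadratic_features(), Ridge(alpha=1.0)).fit(X, y)
lasso = make_pipeline(quadratic_features(), Lasso(alpha=0.1)).fit(X, y)

for name, fitted in [("linear", linear), ("poly", poly), ("ridge", ridge), ("lasso", lasso)]:
    print(name, round(fitted.score(X, y), 3))
```

The scores are computed on the training data only to keep the sketch short; a real comparison would use a held-out test set.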

Evaluating the Model

Once we have trained a Linear Regression model using a dataset, we must evaluate its performance to determine how well it can predict new values. There are several metrics that we can use to evaluate the performance of a Linear Regression model:

  1. Mean Squared Error (MSE): This is one of the most commonly used metrics for evaluating a Linear Regression model. It measures the average of the squared differences between the predicted values and the actual values. A lower MSE indicates better performance.
  2. Root Mean Squared Error (RMSE): This is the square root of the MSE and provides a more interpretable metric since it is in the same units as the target variable.
  3. R-squared (R2): This metric measures the proportion of variance in the target variable explained by the model. An R2 score of 1 indicates a perfect fit, while a score of 0 indicates that the model is no better than predicting the mean value of the target variable. On test data, the score can also be negative when the model performs worse than predicting the mean.
  4. Mean Absolute Error (MAE): This metric measures the average of the absolute differences between the predicted and actual values. It is less sensitive to outliers than the MSE.

To evaluate a Linear Regression model using these metrics, we can use the functions in scikit-learn's sklearn.metrics module, or the score method of the LinearRegression class, which returns the R2 score. For example, to compute the R2 score on a test set, we can do the following:

from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score

# Train the model

model = LinearRegression()

model.fit(X_train, y_train)

# Evaluate the model on the test set

y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)

print("R2 score:", r2)
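
The other metrics can be computed in the same way with functions from sklearn.metrics. The following self-contained sketch uses invented data (the line y = 3x + 5 plus noise is an assumption), so its variable names are independent of the ones used earlier in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: y = 3x + 5 plus Gaussian noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(scale=1.0, size=300)

# random_state fixes the split so that the numbers are reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as the target variable
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)  # identical to model.score(X_test, y_test)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```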

In addition to these metrics, we can visualize the model's performance using various plots, such as scatter plots of the predicted values versus the actual values, residual plots, and Q-Q plots. These plots can help us identify patterns or outliers in the data the model may not have captured.

Multiple Linear Regression

Multiple Linear Regression (MLR) is a statistical technique that analyzes the relationship between one dependent variable and multiple independent variables. It extends Simple Linear Regression (SLR), in which only one independent variable is used to predict the dependent variable.

In Multiple Linear Regression, a linear relationship is assumed between the dependent and independent variables. The goal is to estimate the linear equation coefficients that best describe this relationship.
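
A minimal sketch of a multiple linear regression with scikit-learn follows. The dataset is synthetic: the column names (area, rooms, age, price) and the generating coefficients are invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset with three independent variables.
rng = np.random.default_rng(7)
data = pd.DataFrame({
    "area": rng.uniform(50, 200, size=100),
    "rooms": rng.integers(1, 6, size=100),
    "age": rng.uniform(0, 40, size=100),
})

# The dependent variable is built from known coefficients plus noise,
# so the estimates can be compared with the values used to generate it.
data["price"] = (
    2.0 * data["area"]
    + 10.0 * data["rooms"]
    - 1.5 * data["age"]
    + 20.0
    + rng.normal(scale=5.0, size=100)
)

features = ["area", "rooms", "age"]
model = LinearRegression().fit(data[features], data["price"])

# One estimated coefficient per independent variable, plus the intercept.
for name, coef in zip(features, model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {model.intercept_:.2f}")
```

Each coefficient estimates the change in the dependent variable for a one-unit change in that feature while the other features are held fixed.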

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of analyzing a dataset to summarize its main characteristics and gain insights into its underlying patterns and relationships. EDA aims to uncover valuable information, detect anomalies and outliers, and identify any issues or biases that may affect the data quality.

The following are some common steps involved in EDA:

  1. Data collection: Gathering and collecting data from various sources.
  2. Data cleaning: Checking the data for missing values, outliers, and inconsistencies and handling them appropriately. It may involve imputing missing values, removing outliers, or transforming the data.
  3. Data visualization: Creating visual representations of the data using graphs, charts, and other visual aids. It helps to identify patterns and relationships in the data and gain insights into its characteristics.
  4. Data exploration: Analyzing the data to identify trends, relationships, and patterns. It may involve computing summary statistics such as mean, median, standard deviation, and correlation coefficients.
  5. Data modeling: Building statistical or machine learning models to make predictions or draw conclusions from the data.
  6. Data communication: Presenting the analysis results to stakeholders clearly and concisely.
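
The cleaning and exploration steps above can be sketched with a few pandas calls. The frame below is invented for illustration; only the column names Sal and Temp mirror the ones used earlier in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one missing value per column.
df = pd.DataFrame({
    "Sal": [33.4, 33.5, np.nan, 33.8, 34.0, 34.1],
    "Temp": [10.5, 10.4, 10.1, np.nan, 9.2, 9.0],
})

# Data cleaning: count the missing values in each column.
missing_per_column = df.isna().sum()

# Data exploration: summary statistics (count, mean, std, quartiles, min, max).
summary = df.describe()

# Pairwise correlation coefficients; rows with a missing value are skipped per pair.
correlation = df.corr()

print(missing_per_column)
print(summary)
print(correlation)
```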

Making Predictions with the Multivariate Regression Model

Multivariate regression, as the term is used here, is a statistical model that predicts the values of a dependent variable based on the values of multiple independent variables. The model assumes a linear relationship between the dependent and independent variables.

Once we have fit the multivariate regression model on our training data, we can use it to make predictions on new data. To make predictions, we need to provide values for each independent variable and use the regression model coefficients to calculate the dependent variable's predicted value.

Here are the steps to make predictions with a multivariate regression model:

  • Prepare the new data: We must prepare the new data by creating a design matrix with the same columns, in the same order, as the design matrix used to train the model. The values of the independent variables should be the values for which we want to make predictions.
  • Load the regression model: We need to load the model we previously trained on our training data.
  • Predict the dependent variable: We can now use the regression model object's predict() method to predict the dependent variable's values for the new data. The predict() method takes the new design matrix as input and returns the predicted values of the dependent variable.
  • Interpret the results: Once we have the predicted values of the dependent variable, we can interpret them to gain insights into the relationships between the independent and dependent variables. We can compare the predicted values with the actual values to evaluate the model's accuracy and identify any areas where the model may need improvement.
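
These steps can be sketched as follows. This is one possible approach under stated assumptions: the data and column names (x1, x2, y) are invented, and joblib is used to save and reload the model, which is only one of several ways to persist a fitted scikit-learn estimator.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data following y = 4 + 1.5*x1 - 2*x2 exactly (no noise).
rng = np.random.default_rng(3)
train = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
train["y"] = 4.0 + 1.5 * train["x1"] - 2.0 * train["x2"]

features = ["x1", "x2"]
model = LinearRegression().fit(train[features], train["y"])

# Save the trained model, then load it back as a separate object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# Prepare the new data with the same columns, in the same order, as in training.
new_data = pd.DataFrame({"x1": [0.0, 1.0], "x2": [0.0, 1.0]})
predictions = loaded.predict(new_data[features])

# The same values computed by hand from the intercept and coefficients.
manual = loaded.intercept_ + new_data[features].to_numpy() @ loaded.coef_

print(predictions)
print(manual)
```

Computing the predictions by hand shows what predict() does for a linear model: it multiplies each input by its coefficient, sums the results, and adds the intercept.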


Conclusion

Enroll in Simplilearn’s PG in Data Science to learn more about applying Python and to become a better Python and data professional. This Post Graduation in Data Science program by Economic Times is ranked number 1 in the world; it offers over a dozen tools, skills, and concepts, and includes seminars by Purdue academics and IBM professionals, as well as private hackathons and IBM Ask Me Anything sessions.
