Introduction to Random Forest in R

Machine learning has become one of the hottest technologies these days, and companies use machine learning algorithms in various applications to solve business problems. They generally use them for classification, regression, and clustering problems. Some of the more popular algorithms include linear regression, logistic regression, decision trees, random forest, KNN, and SVM.

In this article, we’ll cover the random forest algorithm in R from the ground up. The random forest algorithm is derived from the decision tree algorithm and consists of multiple decision trees—which is how it got its name. Tin Kam Ho created the first algorithm for random decision forests. 

What is a Random Forest? 

Random forest is a popular supervised machine learning algorithm—used for both classification and regression problems. It is based on the concept of ensemble learning, which enables users to combine multiple classifiers to solve a complex problem and to also improve the performance of the model.

The random forest algorithm builds multiple decision trees and collects the prediction from each tree. For classification, the final result is the class that receives the majority of votes; for regression, it is the average of the trees’ predictions.

The following is an example of what a random forest classifier in general looks like:

[Figure: a random forest classifier built from a training set]

The training data is divided into several random subsets, each containing different values. A separate decision tree model is built from each subset. When a test set is evaluated, every tree produces an output, and a vote is carried out to find the result with the highest frequency, which becomes the final prediction.


Random Forest Algorithm Features

  • Often provides higher accuracy than a single decision tree
  • Gives estimates of which variables are important in the classification
  • Offers methods for handling missing data, and the generated forests can be saved for future use with other data
  • Computes proximities between pairs of cases that can be used in clustering, locating outliers, or to give interesting views of the data

How Does a Random Forest Work? 

Before understanding how a random forest algorithm works, first, let’s learn more about how a decision tree works with the following example: 

Suppose you want to predict whether a person will buy a phone or not based on the phone’s features. For that, you can build a simple decision tree.

[Figure: decision tree for the phone purchase, based on price, RAM, and internal storage]

In this decision tree, the parent/root node and the internal nodes represent the phone’s features, while the leaf nodes are the outputs. The edges represent the connections between the nodes based on the values from the features. Based on the price, RAM, and internal storage, consumers can decide whether they want to purchase the phone. The problem with this decision tree is that you only have limited information, which may not always provide accurate results.

Using a random forest model will improve your results, as it brings diversity into the model by building several trees from different features.

[Figure: three decision trees that make up the random forest]

We have created three different decision trees to build a random forest model.

Now, suppose a new phone is launched with specific features, and you want to decide whether to buy that phone or not.

[Figure: features of the newly launched phone]

Let’s transfer this data to our random forest model and confirm the model’s output.

[Figure: output of each decision tree for the new phone]

The first two trees predict that you should buy the phone, while the third decision tree predicts that you should not. With two of the three votes in favor, our model predicts that you should buy the newly launched phone.

Assumptions for the Random Forest Algorithm

  • The feature variables of the dataset should contain actual values, so that the classifier can predict accurate results rather than guess. Missing values should be handled before training the model.
  • The predictions from each tree must have very low correlations.

Steps to Build a Random Forest

  • Randomly select “k” features from the total “m” features, where k < m
  • Among the “k” features, calculate the node “d” using the best split point
  • Split the node into daughter nodes using the best split
  • Repeat the previous steps until “l” number of nodes has been reached
  • Build the forest by repeating all the steps “n” times to create “n” trees
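As a rough illustration of the steps above, the following sketch grows a small forest with the `rpart` package on R’s built-in `iris` data. The dataset, the values of `n_trees` and `k`, and the use of `rpart` are assumptions made only for this example. Note that it draws the random feature subset once per tree rather than at every node, which is a simplification of the procedure described above.

```r
# Minimal sketch: a hand-built forest of rpart trees (not the randomForest package).
library(rpart)  # recommended package that ships with standard R installations

set.seed(42)
n_trees  <- 25                                  # "n": number of trees (hypothetical choice)
k        <- 2                                   # "k": features sampled per tree, k < m
features <- setdiff(names(iris), "Species")     # the "m" = 4 candidate features

forest <- lapply(seq_len(n_trees), function(i) {
  rows <- sample(nrow(iris), replace = TRUE)    # bootstrap sample of the rows
  vars <- sample(features, k)                   # random subset of k features
  # rpart picks the best split points among the chosen features
  rpart(reformulate(vars, response = "Species"),
        data = iris[rows, ], method = "class")
})

# Each element of the list is one fitted tree that can predict on its own
print(predict(forest[[1]], newdata = iris[1:3, ], type = "class"))
```

Sampling both rows and features is what keeps the individual trees different from one another.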

After the random forest trees and classifiers are created, predictions can be made using the following steps:

  • Run the test data through the rules of each decision tree to predict the outcome and then store that predicted target outcome
  • Calculate the votes for each of the predicted targets
  • The most highly voted predicted target is the final prediction 
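The voting steps above can be sketched in base R as follows. The matrix of tree predictions is made up for illustration, and `majority_vote` is a hypothetical helper name, not a function from any package.

```r
# Each column holds one tree's predicted class for the same three test rows
# (hypothetical predictions, for illustration only).
tree_predictions <- cbind(
  tree1 = c("buy",  "skip", "buy"),
  tree2 = c("buy",  "skip", "skip"),
  tree3 = c("skip", "buy",  "skip")
)

# Count the votes for each predicted target and return the most frequent one.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]  # on a tie, the alphabetically first class wins
}

# Apply the vote row by row to get one final prediction per test row
final_prediction <- apply(tree_predictions, 1, majority_vote)
print(final_prediction)
```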

Random Forest Applications


Random forest classifiers have a plethora of applications in the market today. Let’s go ahead and look at a few of them:

  1. In banking, it is used to detect fraudulent customers
  2. In healthcare, it is used to analyze patients’ symptoms and diagnose diseases
  3. In ecommerce, it is used to build recommendation lists that predict purchases based on customer activity
  4. In finance, it is used to analyze stock market trends and predict profit or loss

Let’s now look at a few of the terms we need to know in order to understand the random forest algorithm. 

Terminologies in the Random Forest Algorithm

Before we start working with R, we need to understand a few terms that are used in random forest algorithms:

1. Variance - A measure of how much the model’s predictions change when the training data changes.

2. Bagging - A variance-reducing method (short for bootstrap aggregation) that trains each model on a random subsample of the training data, drawn with replacement.

3. Out-of-bag (OOB) error estimate - The random forest classifier is trained using bootstrap aggregation, where each new tree is fit on a bootstrap sample of the training observations. The out-of-bag (OOB) error is the average prediction error for each training observation, calculated using only the trees whose bootstrap sample did not contain that observation. This enables the random forest classifier to be validated while it is being trained.
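To see why out-of-bag observations are always available, the sketch below draws a single bootstrap sample in base R and counts the rows that were never picked. The training-set size `n` is a hypothetical value chosen for illustration.

```r
set.seed(1)
n <- 10000                                      # hypothetical training-set size

in_bag <- sample(n, size = n, replace = TRUE)   # bootstrap sample of row indices
oob    <- setdiff(seq_len(n), in_bag)           # rows never drawn: out-of-bag for this tree

oob_fraction <- length(oob) / n
print(oob_fraction)                             # close to exp(-1), roughly 0.37
```

Because roughly a third of the rows are left out of each tree, every observation ends up with a set of trees that never saw it, and those trees can be used to estimate the error without a separate validation set.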

4. Information gain - Used to determine which feature/attribute gives us the maximum information about a class. It is based on the concept of entropy, which is the degree of uncertainty, impurity, or disorder. It aims to reduce the level of entropy, starting from the root node to the leaf nodes. 

The formula for entropy is shown below:

$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

Where p_i represents the probability of class i among the c classes in the set S, and E(S) represents the entropy.
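Entropy and information gain as described above can be sketched in base R, assuming the usual base-2 Shannon entropy. The function names `entropy` and `information_gain` and the toy labels are hypothetical, chosen only for this example.

```r
# Entropy of a vector of class labels, in bits (base-2 logarithm assumed).
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions
  p <- p[p > 0]                         # skip empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain: parent entropy minus the weighted entropy of the child groups.
information_gain <- function(labels, split_by) {
  groups        <- split(labels, split_by)
  weights       <- lengths(groups) / length(labels)
  child_entropy <- sapply(groups, entropy)
  entropy(labels) - sum(weights * child_entropy)
}

labels        <- c("buy", "buy", "skip", "skip")
perfect_split <- c("low", "low", "high", "high")   # separates the classes completely
useless_split <- c("low", "high", "low", "high")   # leaves both groups mixed

print(entropy(labels))
print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

A split that separates the classes completely removes all of the uncertainty, while a split that leaves each group as mixed as the parent gains nothing.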

5. Gini index - The Gini index, or Gini impurity, measures the probability of a randomly chosen element being incorrectly classified. The Gini index varies between zero and one, where zero denotes that all elements belong to a single class, and values close to one denote that the elements are randomly distributed across many classes. With only two classes, the maximum is 0.5, which denotes elements equally distributed between the two classes.

The Gini index formula is shown below:

$$Gini = 1 - \sum_{i=1}^{c} p_i^2$$

Where p_i is the probability of an object being classified to class i, and c is the number of classes.
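A minimal base R sketch of the Gini index follows, assuming the standard definition of one minus the sum of squared class proportions. The function name `gini` and the toy label vectors are hypothetical.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_i^2).
gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions p_i
  1 - sum(p^2)
}

pure        <- c("a", "a", "a", "a")    # only one class present
two_classes <- c("a", "a", "b", "b")    # evenly split between two classes
four_classes <- c("a", "b", "c", "d")   # evenly split between four classes

print(gini(pure))
print(gini(two_classes))
print(gini(four_classes))
```

The value rises as the elements are spread evenly over more classes, which is why a tree prefers splits that produce child nodes with a low Gini index.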

Let’s now look at how we can implement the random forest algorithm.


Use case: Predicting the Quality of Wine

The following use case shows how this algorithm can be used to predict the quality of the wine based on certain features—such as chloride content, alcohol content, sugar content, pH value, etc. 

To do this, we have randomly assigned the variables to our root node and the internal nodes.

Usually, with decision trees or random forest algorithms, the splits at the root node and the internal nodes are chosen using Gini index/Gini impurity values.

1. We have the first decision tree, which is going to take chlorides and alcohol content into consideration. If the chloride value is less than 0.08 and the alcohol content is greater than six, then the quality is high (in this case, it’s eight). Otherwise, the quality is five. This decision tree is shown below:

[Figure: decision tree based on chlorides and alcohol content]

2. Our second decision tree will be split based on pH and sulphate content. If the sulphate value is less than 0.6 and the pH is less than 3.5, then the quality is six. Otherwise, it is five. The decision tree is shown below:

[Figure: decision tree based on sulphates and pH]

3. Our last decision tree will be split based on sugar and chloride content. If sugar is less than 2.5 and the chloride content is less than 0.08, then we get the quality of the wine to be five. Otherwise, it’s four. The decision tree is shown below:

[Figure: decision tree based on sugar and chloride content]

Two out of three decision trees above indicate the quality of our wine to be five, so the forest predicts the same.
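The three trees described above can be written as plain R functions to show how the vote is taken. The thresholds come from the descriptions above, while `wine_sample` and its feature values are hypothetical, chosen only so that the example runs.

```r
# The three hand-built trees from the use case, written as simple rules.
tree1 <- function(w) if (w$chlorides < 0.08 && w$alcohol > 6) 8 else 5
tree2 <- function(w) if (w$sulphates < 0.6 && w$pH < 3.5) 6 else 5
tree3 <- function(w) if (w$sugar < 2.5 && w$chlorides < 0.08) 5 else 4

# A hypothetical wine; these values are illustrative, not taken from the dataset.
wine_sample <- list(chlorides = 0.09, alcohol = 9.5,
                    sulphates = 0.7, pH = 3.3, sugar = 2.0)

# Collect one vote per tree, then take the most frequent quality score
votes <- c(tree1(wine_sample), tree2(wine_sample), tree3(wine_sample))
forest_prediction <- as.numeric(names(which.max(table(votes))))

print(votes)
print(forest_prediction)
```

For this sample, the first two trees vote for a quality of five and the third votes for four, so the forest settles on five.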


In this demo, we will run an R program to predict the wine’s quality. The image shown below is the dataset that holds all attribute values required to predict the wine’s quality.

[Figure: preview of the wine quality dataset]

So, let’s get coding!

wine <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"), header = TRUE, sep = ";") # This command is used to load the dataset

head(wine) # Display the head and dimensions of wine dataset

dim(wine)

barplot(table(wine$quality)) # Bar plot of the number of wines at each quality score

[Figure: bar plot of wine quality scores]

# Now, we group the quality scores into three taste categories (bad: below 5, normal: 5 or 6, good: above 6) and convert them into a factor

wine$taste <- ifelse(wine$quality < 5, "bad", "good")

wine$taste[wine$quality == 5] <- "normal"

wine$taste[wine$quality == 6] <- "normal"

wine$taste <- as.factor(wine$taste)

str(wine$taste)

barplot(table(wine$taste)) # Bar plot of the number of wines in each taste category

table(wine$taste) 

# Next, we need to split the data into training and testing. 80% for training, 20% for testing.


set.seed(123)

samp <- sample(nrow(wine), 0.8 * nrow(wine))

train <- wine[samp, ]

test <- wine[-samp, ]

# Moving on to data visualization

library(ggplot2)

# Scatter plot of fixed acidity against volatile acidity, colored by taste
ggplot(wine, aes(fixed.acidity, volatile.acidity)) + geom_point(aes(color = taste))

[Figure: scatter plot of fixed acidity against volatile acidity, colored by taste]

# Stacked histogram of alcohol content, filled by taste
ggplot(wine, aes(alcohol)) + geom_histogram(aes(fill = taste), color = 'black', bins = 50)


dim(train)

dim(test)  # Checks the dimensions of training and testing dataset


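install.packages('randomForest') # Install the randomForest package (package names are case-sensitive)

library(randomForest)            # Load the randomForest package


# Now that we have installed and loaded the randomForest package, let’s build the random forest model
# ntree is the number of trees to grow; mtry is the number of features tried at each split

model <- randomForest(taste ~ . - quality, data = train, ntree = 1000, mtry = 5)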

model

model$confusion
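A fitted forest also exposes its out-of-bag error and variable importance. The sketch below shows one way to inspect them. It assumes the `randomForest` package is installed and uses R’s built-in `iris` data so that it runs without downloading anything. The object name `rf` and the choice of 200 trees are illustrative.

```r
library(randomForest)  # assumes the package is already installed

set.seed(123)
# importance = TRUE asks the forest to also record permutation-based importance
rf <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

# err.rate has one row per tree; the last row is the OOB error of the full forest
oob_error <- rf$err.rate[rf$ntree, "OOB"]
print(oob_error)

# Importance scores for each feature, as a table and as a plot
print(importance(rf))
varImpPlot(rf)
```

The same accessors should work on the `model` object built from the wine data, where they indicate which chemical properties contribute most to the predicted taste.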


# The next step is to validate our model using the test data

prediction <- predict(model, newdata = test)

table(prediction, test$taste)

prediction



# Now, let’s display the predicted vs. the actual values


results <- data.frame(pred = prediction, real = test$taste) # Keeps the class labels; cbind() would reduce the factors to integer codes

results

View(results)


# Finally, let’s calculate the accuracy of the model

sum(prediction == test$taste) / nrow(test) # Proportion of test wines whose taste was predicted correctly

[Figure: results table and accuracy output]

You can see that this model’s accuracy is 90 percent, which is great. Now we have automated the process of predicting wine quality. This brings us to the end of this demo on random forest. 
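Overall accuracy can hide how well each class is predicted, especially when one class is much more common than the others. The base R sketch below derives both the overall accuracy and the per-class recall from a confusion matrix. The `actual` and `predicted` vectors are made-up values for illustration and are not output from the model above.

```r
classes   <- c("bad", "good", "normal")
# Hypothetical labels, for illustration only
actual    <- factor(c("bad", "good", "normal", "normal",
                      "good", "normal", "bad", "normal"), levels = classes)
predicted <- factor(c("normal", "good", "normal", "normal",
                      "normal", "normal", "bad", "normal"), levels = classes)

# Rows are predictions, columns are the true classes
confusion <- table(predicted = predicted, actual = actual)

accuracy <- sum(diag(confusion)) / sum(confusion)   # share of correct predictions
recall   <- diag(confusion) / colSums(confusion)    # share of each true class recovered

print(confusion)
print(accuracy)
print(recall)
```

In this toy example, the frequent "normal" class is always recovered while the rarer classes are missed half of the time, which the single accuracy figure does not reveal.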

Conclusion

After reading this article, you have learned more about the random forest algorithm, including how it works, the terms associated with it, and its various real-world applications. We also included a demo, where we built a random forest model to predict wine quality. We worked in RStudio for this demo, where we went over different commands, packages, and data visualization methods in R.


About the Author

Shruti M

Shruti is an engineer and a technophile. She works on several trending technologies. Her hobbies include reading, dancing and learning new languages. Currently, she is learning the Japanese language.
