Random Forest for Beginners

Susanna Han
3 min read · Nov 5, 2020
Photo by Michael Benz on Unsplash

What is Random Forest?

Random Forest is an ensemble algorithm that builds many decision trees, each on a random subset of the data and features, and combines their outputs to make predictions: a majority vote for classification, an average for regression.

What is a Decision Tree?

Decision trees have a flowchart-like structure where each internal node represents a feature or attribute, each branch represents a decision rule, and each leaf represents an outcome.

Decision tree diagram from GeeksforGeeks

This method is a supervised learning method that is used for classification and regression. As we can see in the example above, you start at the top and work your way down.

The very top of the tree is called the “root node”; in this diagram, that node is labeled “WINDY”.

Internal nodes are any nodes in a tree that have child nodes. (OUTLOOK, TEMPERATURE, HUMIDITY)

Leaf nodes are any nodes that do not have child nodes. (YES/NO)

Therefore, a random forest is simply the majority vote over the outcomes of multiple decision trees.
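To make the flowchart idea concrete, here is a minimal sketch of a single decision tree in scikit-learn. The tiny weather dataset and its encodings below are made up for illustration, loosely following the diagram above:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data. Each row: [outlook (0=sunny, 1=overcast, 2=rainy),
# windy (0=no, 1=yes)]; labels: 1 = play outside, 0 = stay in.
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

print(tree.predict([[1, 0]]))  # classify a new day: overcast, not windy
```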

Why use random forest?

Random Forest has the power to handle thousands of input variables and high-dimensional data. Because it works for both classification and regression, its versatility is a big advantage. Since a random forest consists of multiple decision trees, it generally categorizes and generalizes better than a single tree alone. It can also handle missing values (in implementations that support them) and is far less prone to overfitting than a single tree. Most importantly, it does most of the work for you! Random Forest handles both tasks, though it is especially popular for classification.

How does it work?

The way Random Forest learns is called Bootstrap Aggregation, or “bagging.” Each tree is trained on a random sample of the data drawn with replacement, so the same data point can appear multiple times in one tree’s sample. To make a prediction, every tree runs the input through its own model: for regression, the trees’ predictions are averaged; for classification, each tree votes for a class and the classification with the most votes wins.
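Here is a rough, hand-rolled sketch of that idea: each tree gets its own bootstrap sample (drawn with replacement) and the forest’s answer is a majority vote. The toy dataset and the tree count of 25 are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):                        # 25 trees, chosen arbitrarily
    idx = rng.integers(0, len(X), len(X))  # bootstrap: sample with replacement
    t = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    t.fit(X[idx], y[idx])
    trees.append(t)

# Majority vote: each tree classifies, the most common class wins.
votes = np.array([t.predict(X[:5]) for t in trees])
prediction = (votes.mean(axis=0) > 0.5).astype(int)
print(prediction)
```

In practice you would just use RandomForestClassifier, which does all of this internally; the loop above only shows the mechanics.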

What are the parameters?

  • n_estimators — the number of trees in the forest that will be used in the model.
  • criterion (“gini” or “entropy”) — the measurement used to judge the quality of a split. Gini measures impurity and entropy calculates information gain.
  • max_depth — this caps how deep each tree is able to grow, i.e. the number of levels from root to leaf.
  • min_samples_split — the minimum number of samples required to split an internal node.
  • random_state — seeds the random number generator so the bootstrapping and feature sampling are reproducible. (These parameters appear together in the sketch after this list.)
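
These parameters map directly onto scikit-learn’s RandomForestClassifier. A minimal sketch, with values that are assumptions for illustration rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    criterion="gini",     # or "entropy" for information gain
    max_depth=5,          # cap on how deep each tree may grow
    min_samples_split=2,  # minimum samples needed to split a node
    random_state=42,      # seed for reproducible sampling
)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # mean accuracy on held-out data
```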

When using Random Forest you don’t have much control over what goes on inside the model. What you can do is switch up the parameters, the number of trees, the size of the decision trees, and so on, to see what the best combination of parameters is for your model.

How do you find the best parameters?

GridSearchCV and RandomizedSearchCV are both utilities from sklearn (in sklearn.model_selection). RandomizedSearchCV tries random combinations of the different parameters and reports the best one it finds for the model. GridSearchCV, by contrast, loops through every combination in the hyperparameter grid, fitting the model to the training set and cross-validating each set of hyperparameters. Both will spit out the best-found parameters for the given input.
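A quick sketch of both searches; the parameter grid below is an assumption, so pick ranges that suit your own data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}

# Grid search: tries every combination in the grid, cross-validating each.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)

# Randomized search: samples a fixed number of random combinations.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_)
```

Grid search is exhaustive and slow on large grids; randomized search trades completeness for speed by sampling only n_iter combinations.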

Overall, Random Forest is a popular and widely used algorithm for classifying and modeling non-linear data. It is easy to use, and the algorithm, along with some optimization methods, does most of the work for you!
