Complete Guide to Decision Trees

Pratima Rathore
22 min read · Jun 22, 2020


This article will give you all the information you need to understand decision trees. Let's start our journey…

Decision Tree

A decision tree, as the term suggests, uses a tree-like model to make predictions. It resembles an upside-down tree and follows a process similar to the one you use to make decisions in real life, i.e., by asking a series of questions to arrive at a decision.

A decision tree splits the data into multiple sets, and each of these sets is then further split into subsets to arrive at a decision.

  • It can be used to model highly non-linear data.
  • Decision trees can be used for supervised AND unsupervised learning. Although a decision tree is by definition a supervised learning algorithm that needs a target variable, the same idea can be adapted for unsupervised tasks, like clustering.
  • Decision trees are non-parametric, which means no specific data distribution is assumed. They easily handle both continuous and categorical variables.
  • Decision trees are among the best algorithms for independent-variable selection. They quantify the importance of each feature by calculating the reduction in impurity it achieves at a node: a feature that produces a large reduction in impurity is important, while one that produces only a small reduction is less important.
  • Decision trees handle missing values as easily as any normal value of the variable.
  • As a rule of thumb, if interpretability by layman is what you are looking for in a model, then decision trees should be at the top of your list.
  • In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.
  • A decision tree is scale-invariant. It does not require normalisation, as it only has to compare the values within an attribute, and it handles multicollinearity better.
  • They can identify complex relationships and work well in certain cases where you cannot fit a single linear relationship between the target and feature variables. This is where regression with decision trees comes into the picture.
  • In decision trees, you do not have to treat missing values, outliers and multicollinearity before proceeding with model building

Disjunctive Normal Form

Decision trees follow a Sum of Products (SOP) representation. In the classic examples ("Can we accept the new job offer?" and "Use computer daily?"), a prediction is read off by traversing from the root node to a leaf node.

It is a sum-of-products representation: the Sum of Products (SOP) form is also known as Disjunctive Normal Form. For a class, every branch from the root of the tree to a leaf node of that class is a conjunction (product) of attribute values, and the different branches ending in that class form a disjunction (sum).
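
As a toy illustration (the rules below are hypothetical, not taken from any dataset), the "accept the new job offer?" decision can be written in DNF: each AND-clause corresponds to one root-to-leaf path ending in "Accept", and OR-ing the paths gives the full rule for that class.

    # Hypothetical DNF rules for "Can we accept the new job offer?":
    # each conjunction (and-clause) is one root-to-leaf path ending in the class "Accept",
    # and the disjunction (or) of those paths is the full decision rule for that class.
    def accept_offer(salary, commute_hours, offers_free_coffee):
        return (
            (salary >= 50000 and commute_hours <= 1 and offers_free_coffee)  # path 1
            or (salary >= 80000 and commute_hours <= 2)                      # path 2
        )

    print(accept_offer(salary=60000, commute_hours=0.5, offers_free_coffee=True))   # True
    print(accept_offer(salary=45000, commute_hours=0.5, offers_free_coffee=True))   # False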

Building a Decision Tree

  • The first and top node of a decision tree is called the root node; the arrows in a decision tree always point away from this node. A node that cannot be further classified or split is called a leaf/terminal node; the arrows always point towards these nodes. The intermediate nodes between the root and the leaf nodes are called internal/decision nodes.

  • Each node -> an attribute (or feature)
  • Each branch -> a rule (or decision)
  • Each leaf -> an outcome

  • The depth of a Tree is defined by the number of levels, not including the root node.
  • They use a layered splitting process, where at each layer they try to split the data into two or more groups, so that data that fall into the same group are most similar to each other (homogeneity), and groups are as different as possible from each other (heterogeneity).
  • The splitting can be binary (which splits each node into at most two sub-groups, and tries to find the optimal partitioning), or multiway (which splits each node into multiple sub-groups, using as many partitions as existing distinct values).

Assumptions while creating a Decision Tree:

  • At the beginning, the whole training set is considered as the root.
  • Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to building the model.

Discretisation is the process of transforming continuous variables into discrete variables by creating a set of contiguous intervals that span the range of variable values.

  • Records are distributed recursively on the basis of attribute values.
  • The order in which attributes are placed as the root or internal nodes of the tree is decided using a statistical approach.

So, constructing a decision tree involves the following steps:

  • Recursive binary splitting/partitioning the data into smaller subsets
  • Selecting the best rule from a variable/ attribute for the split
  • Applying the split based on the rules obtained from the attributes
  • Repeating the process for the subsets obtained
  • Continuing the process until the stopping criterion is reached
  • Assigning the majority class/average value as the prediction

The trees follow a top-down greedy approach known as recursive binary splitting:

  • The decision tree building process is a top-down approach: it starts at the top with the whole dataset and gradually splits the data into smaller and smaller subsets.
  • The reason we call the process greedy is because it does not take into account what will happen in the next two or three steps. This means that the process is not holistic in nature, as it only aims to gain an immediate result that is derived after splitting the data at a particular node based on a certain rule of the attribute.
  • Because of this greedy approach, the tree is a high-variance model: the entire structure of the tree can change with small variations in the input data, which in turn changes the splits and the final decisions altogether.

Measures for selecting the best split

During the decision-making process, multiple features participate, so it becomes essential to consider the relevance and consequence of each feature, assigning the most appropriate feature to the root node and then splitting the nodes downward. Several splitting criteria exist to address questions such as:

  • Which feature should be placed at the root node?
  • Which features are most suitable to serve as internal nodes or leaf nodes?
  • How should the tree be divided?
  • How do we measure the accuracy of a split? and many more.

Decision trees use several criteria to decide how to split a node into two or more sub-nodes.

  • The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say that purity of the node increases with respect to the target variable.
  • A decision tree evaluates splits on all available variables and then selects the split that results in the most homogeneous sub-nodes. The choice of criterion also depends on the type of target variable. Let's look at the most commonly used splitting criteria in decision trees:

Entropy

  • In layman's terms, entropy is simply a measure of disorder. Basically, it measures the impurity or randomness in the data points.
  • A high degree of disorder means a high level of impurity (low purity). Entropy usually lies between 0 and 1, although, depending on the number of groups or classes in the data set, it can be larger than 1; a larger value signifies the same thing, i.e. a higher level of disorder.
  • Entropy is lowest (no disorder) at the extremes, where a node contains only one class, and maximum (high disorder) in the middle, where the classes are evenly mixed, as in the sketch below.
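
As a minimal illustration (plain Python, hypothetical class counts), entropy is 0 for a pure node and maximal when the classes are evenly mixed:

    import math

    def entropy(class_counts):
        # Shannon entropy (in bits) of a node, given the count of each class
        total = sum(class_counts)
        probs = [c / total for c in class_counts if c > 0]
        return -sum(p * math.log2(p) for p in probs)

    print(entropy([10, 0]))    # 0.0   -> pure node, no disorder
    print(entropy([5, 5]))     # 1.0   -> evenly mixed, maximum disorder (binary case)
    print(entropy([4, 3, 3]))  # ~1.57 -> can exceed 1 with more than two classes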

Gini Index

  • Gini Index, also known as Gini impurity, measures the probability that a randomly chosen element would be classified incorrectly if it were labelled according to the class distribution of the node.
  • Like entropy, the Gini index varies between 0 and 1, where 0 expresses perfect purity, i.e. all the elements belong to a single class, and values close to 1 indicate a random distribution of elements across many classes. For a two-class problem, the maximum value is 0.5, which corresponds to an equal distribution of elements over the two classes.
  • While building the decision tree, the split with the lowest value of the Gini Index (impurity) is preferred.
  • The Classification and Regression Tree (CART) algorithm uses the Gini Index to generate binary splits.
  • In addition, decision tree algorithms use Information Gain to decide how to split a node, and the Gini Index or Entropy serves as the impurity measure with which the Information Gain is computed.

Steps to Calculate Gini for a split

  • 1. Calculate the Gini score for each sub-node using the sum of squares of the probabilities of success and failure (p² + q²). (Note that this is a purity score; the Gini impurity is 1 − (p² + q²).)
  • 2. Calculate the Gini score for the split as the weighted average of the Gini scores of its sub-nodes.

For example:

Split on Gender:

Gini for sub-node Female = (0.2)(0.2)+(0.8)(0.8)=0.68

Gini for sub-node Male = (0.65)(0.65)+(0.35)(0.35)=0.55

Calculate weighted Gini for Split Gender = (10/30)0.68+(20/30)0.55 = 0.59

Split on Class:

Gini for sub-node Class IX = (0.43)(0.43)+(0.57)(0.57)=0.51

Gini for sub-node Class X = (0.56)(0.56)+(0.44)(0.44)=0.51

Calculate weighted Gini for Split Class = (14/30)0.51+(16/30)0.51 = 0.51

You can see that the Gini score for the split on Gender is higher than for the split on Class (i.e. its impurity is lower); hence, the node will be split on Gender.
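
A short sketch (plain Python) that reproduces the calculation above; note that it uses the same purity score p² + q², so the split with the higher weighted score, i.e. the lower impurity, wins:

    def gini_score(p):
        # Purity score p^2 + q^2 used above; the Gini impurity would be 1 minus this
        q = 1 - p
        return p ** 2 + q ** 2

    def weighted_split_score(nodes):
        # nodes: list of (sub-node size, probability of the positive class in that sub-node)
        total = sum(size for size, _ in nodes)
        return sum(size / total * gini_score(p) for size, p in nodes)

    # Split on Gender: Female (10 students, p = 0.2) and Male (20 students, p = 0.65)
    print(round(weighted_split_score([(10, 0.2), (20, 0.65)]), 2))   # 0.59
    # Split on Class: IX (14 students, p = 0.43) and X (16 students, p = 0.56)
    print(round(weighted_split_score([(14, 0.43), (16, 0.56)]), 2))  # 0.51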

Misclassification error

The misclassification error is simply the fraction of training observations in a region that do not belong to the most common class.

Unfortunately, this is not sensitive enough for tree-growing; in practice, we use Gini or entropy.

  • Gini is computationally easier and faster to calculate.
  • The Gini index is an intermediate measure between entropy and the classification error.
  • Entropy and the Gini index are similar numerically; the Gini index is roughly half the entropy, with a maximum of 0.5 (for a binary split) compared to entropy's maximum of 1.

The change in impurity or the purity gain is given by the difference of impurity post-split from impurity pre-split, i.e.,

Δ Impurity = Impurity (pre-split) − Impurity (post-split)

The post-split impurity is calculated by finding the weighted average of two child nodes. The split that results in maximum gain is chosen as the best split.

Information Gain

  • Information gain (IG) is a statistical property that measures how well a given attribute separates the training examples according to their target classification.
  • Information gain is why impurity is so important. Once we derive the impurity of the dataset, we can see how much information is gained as we go down the tree and measure the impurity of the nodes.
  • Constructing a decision tree is all about finding the attribute that returns the highest information gain, i.e. the most homogeneous branches; a small sketch of the calculation follows.
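
A minimal sketch (plain Python, hypothetical class counts): the information gain of a split is the parent node's entropy minus the weighted entropy of its children.

    import math

    def entropy(class_counts):
        total = sum(class_counts)
        return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

    def information_gain(parent_counts, children_counts):
        # parent_counts: class counts at the parent; children_counts: class counts per child
        n = sum(parent_counts)
        weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
        return entropy(parent_counts) - weighted

    # Parent with 15 vs 15 examples, split into children [12, 3] and [3, 12] (hypothetical)
    print(round(information_gain([15, 15], [[12, 3], [3, 12]]), 3))  # ~0.278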

Gain ratio

  • Information gain is biased towards choosing attributes with a large number of values as root nodes. It means it prefers the attribute with a large number of distinct values.
  • Gain ratio overcomes the problem with information gain by taking into account the number of branches that would result before making the split. It corrects information gain by taking the intrinsic information of a split into account.
  • C4.5, an improvement of ID3, uses Gain ratio which is a modification of Information gain that reduces its bias and is usually the best option.

The gain ratio is used in order to penalize features that may take on a lot of possible values. If a feature takes on many possible values, it becomes plausible that splitting on it produces branches that each point to a single class, simply because there are only one or two data points with that value anyway. In other words, the only reason we would get low entropy for splitting on that feature is that it can take on so many values, and therefore many of those values point to a single label. A decision tree algorithm would then end up splitting on something like "ID #" and wrongly conclude that it had gained a huge amount of information.

If the attribute leads to a large number of splits, SplitInfo takes a higher value, which makes the Gain Ratio lower. It thus penalizes attributes that produce many splits (i.e. attributes with a large number of unique values), as in the sketch below.
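
A sketch of the idea (plain Python): the split information is the entropy of the branch proportions, so it grows with the number of branches and shrinks the gain ratio of highly fragmented splits. The 0.278 gain figure is just the hypothetical value from the previous sketch.

    import math

    def split_info(branch_sizes):
        # Entropy of the branch proportions (the "intrinsic information" of the split)
        n = sum(branch_sizes)
        return -sum(s / n * math.log2(s / n) for s in branch_sizes if s > 0)

    def gain_ratio(info_gain, branch_sizes):
        return info_gain / split_info(branch_sizes)

    # Same information gain, two balanced branches vs. ten tiny ones
    print(round(gain_ratio(0.278, [15, 15]), 3))  # 0.278 / 1.0  = 0.278
    print(round(gain_ratio(0.278, [3] * 10), 3))  # 0.278 / 3.32 ≈ 0.084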

Reduction in Variance

  • Reduction in variance is an algorithm used for continuous target variables (regression problems).
  • This algorithm uses the standard formula of variance to choose the best split.
  • The split with the lower weighted variance is selected as the criterion to split the population:

Variance = Σ(X − X̄)² / n, where X̄ is the mean of the values, X is an actual value and n is the number of values.

Steps to calculate Variance:

  • Calculate variance for each node.
  • Calculate the variance of each split as the weighted average of the variances of its nodes; the split with the larger variance reduction is preferred (see the sketch below).
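
A minimal sketch (plain Python, hypothetical target values): the candidate split with the largest drop from the parent's variance, i.e. the lowest weighted child variance, is preferred.

    def variance(values):
        # Population variance: sum((x - mean)^2) / n
        mean = sum(values) / len(values)
        return sum((x - mean) ** 2 for x in values) / len(values)

    def variance_reduction(parent, children):
        # children: list of lists, each holding the target values falling into one sub-node
        n = len(parent)
        weighted = sum(len(c) / n * variance(c) for c in children)
        return variance(parent) - weighted

    parent = [10, 12, 11, 30, 32, 31]
    print(round(variance_reduction(parent, [[10, 12, 11], [30, 32, 31]]), 2))  # 100.0 -> good split
    print(round(variance_reduction(parent, [[10, 30, 11], [12, 32, 31]]), 2))  # 16.0  -> poor split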

Chi-Square

Chi-square is an algorithm that finds the statistical significance of the differences between the sub-nodes and the parent node. It is measured as the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.

  • It works with categorical target variable “Success” or “Failure”.
  • It can perform two or more splits.
  • The higher the value of Chi-Square, the higher the statistical significance of the difference between the sub-node and the parent node.
  • The Chi-Square of each node is calculated using the formula:
  • Chi-square = √((Actual − Expected)² / Expected)
  • This criterion is the basis of the tree generated by CHAID (Chi-square Automatic Interaction Detector); a small sketch follows.
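
A minimal sketch of the node-level calculation (plain Python, hypothetical counts), following the formula above:

    def chi_square(actual, expected):
        # Sum over classes of sqrt((actual - expected)^2 / expected), per the formula above
        return sum(((a - e) ** 2 / e) ** 0.5 for a, e in zip(actual, expected))

    # Hypothetical sub-node of 20 records; parent distribution is 50/50, so expected = [10, 10]
    print(round(chi_square([2, 18], [10, 10]), 2))  # 5.06 -> strong deviation, significant split
    print(round(chi_square([9, 11], [10, 10]), 2))  # 0.63 -> weak deviation, not significant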

Comparison

  • Information Gain: biased towards multi-valued attributes.
  • Gain Ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.

Main DTs algorithms

How do DTs know which features to select and how to split the data?

Each node in the DT acts as a test case for some condition, and each branch descending from that node corresponds to one of the possible answers to that test case. This decision of making splits heavily affects the Tree’s accuracy and performance, and for that decision, DTs can use different algorithms that differ in the possible structure of the Tree (e.g. the number of splits per node), the criteria on how to perform the splits, and when to stop splitting.

Types of decision Trees include:

  • ID3 (Iterative Dichotomiser 3)
  • C4.5 (successor of ID3)
  • CART (Classification And Regression Tree)
  • CHAID (CHi-squared Automatic Interaction Detector)
  • MARS → multivariate adaptive regression splines (extends decision trees to handle numerical data better)
  • Conditional Inference Trees

CART

  • CART is a DT algorithm that produces binary Classification or Regression Trees, depending on whether the dependent (or target) variable is categorical or numeric, respectively.
  • It handles data in its raw form (no preprocessing needed), and can use the same variables more than once in different parts of the same DT, which may uncover complex interdependencies between sets of variables.

Classification Trees are used when the dependent variable is categorical.

A data set is completely homogeneous if it contains only a single class label.

The CART algorithm uses a metric called Gini Impurity to create decision points for classification tasks. When all observations belong to the same label, there is a perfect classification and a Gini Impurity value of 0 (the minimum). On the other hand, when the observations are equally distributed between two labels, we face the worst possible split and a Gini Impurity value of 0.5.

Building a classification tree:

It begins with the original set S as the root node.

On each iteration, the algorithm iterates through every unused attribute of the set S and calculates the Entropy (H) and Information Gain (IG) of that attribute.

It then selects the attribute which has the smallest entropy or, equivalently, the largest information gain.

The set S is then split by the selected attribute to produce a subset of the data.

The algorithm continues to recurse on each subset, considering only attributes never selected before, until every branch of the tree ends in a leaf node or a stopping criterion is triggered: for example, all instances in the subset belong to a single value of y, the maximum tree depth has been reached, the number of cases in the node is less than the minimum number of cases allowed for parent nodes, or the best splitting criterion does not exceed a certain threshold.
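
As a concrete sketch, here is a classification tree fitted with sklearn's CART implementation; the Iris dataset and the hyperparameter values are illustrative choices, not part of the original example:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

    # criterion="gini" is the default; max_depth is an illustrative stopping criterion
    clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
    clf.fit(X_train, y_train)

    print("test accuracy:", clf.score(X_test, y_test))
    print(export_text(clf, feature_names=data.feature_names))  # the learned decision rules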

Regression Trees:

Used when the dependent variable is continuous.

A data set is completely homogeneous if its variance is as small as possible

The CART algorithm looks for splits that minimize the Least Squares Deviation (LSD), also called the weighted mean squared error (MSE), choosing the partition that minimizes this quantity over all possible options. The LSD/MSE metric (sometimes referred to as "variance reduction") minimizes the sum of squared distances (deviations) between the observed values and the predicted values, i.e. it chooses the predictions so that the sum of squared residuals is minimized.

Building a Regression tree:

Calculate the MSE of the target variable.

Split the data set based on different rules obtained from the attributes and calculate the MSE for each of these nodes.

The resulting MSE is subtracted from the MSE before the split. This result is called the MSE reduction.

The attribute with the largest MSE reduction is chosen for the decision node.

The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches, until you get significantly low MSE and the node becomes as homogeneous as possible.

Finally, when no further splitting is required, the node is assigned as a leaf, and the average of its target values is used as the final prediction (when more than one instance ends up in the leaf). A short sketch with sklearn follows.
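
A matching regression sketch with sklearn (the noisy sine data and the depth limit are illustrative): the tree minimizes MSE at each split and predicts the mean target value of the leaf an input falls into.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Noisy sine wave as an illustrative non-linear regression problem
    rng = np.random.RandomState(0)
    X = np.sort(5 * rng.rand(200, 1), axis=0)
    y = np.sin(X).ravel() + 0.1 * rng.randn(200)

    # The default criterion minimizes the mean squared error; max_depth limits tree growth
    reg = DecisionTreeRegressor(max_depth=4, random_state=0)
    reg.fit(X, y)

    print(reg.predict([[2.0]]))  # the mean target value of the leaf that x = 2.0 falls into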

CART doesn't use an internal performance measure for tree selection. Instead, DT performance is always measured by evaluating every tree through testing (using new data which the DT has never seen before) or performing cross-validation (dividing the dataset into k folds and testing on each fold).

CHAID

  • The Chi-squared Automatic Interaction Detection (CHAID) algorithm is one of the oldest DT methods; it produces multiway DTs (splits can have more than two branches) and is suitable for classification and regression tasks.
  • CHAID, a technique whose original intent was to detect interaction between variables (i.e., “combination” variables), recursively partitions a population into separate and distinct groups, which are defined by a set of independent (predictor) variables, such that the CHAID Objective is met: the variance of the dependent (target) variable is minimized within the groups, and maximized across the groups.

When building Classification Trees

CHAID relies on the Chi-square independence tests to determine the best split at each step.

Chi-square tests check if there is a relationship between two variables, and are applied at each stage of the DT to ensure that each branch is significantly associated with a statistically significant predictor of the response variable.

Additionally, categories of each predictor are merged if they are not significantly different between each other, with respect to the dependent variable.

In the case of Regression Trees

CHAID relies on F-tests (instead of Chi-square tests) to calculate the difference between two population means.

If the F-test is significant, a new partition (child node) is created (which means that the partition is statistically different from the parent node).

On the other hand, if the result of the F-test between target means is not significant, the categories are merged into a single node.

  • CHAID does not replace missing values and handles them as a single class which may merge with another class if appropriate.
  • It also produces DTs that tend to be wider rather than deeper (multiway characteristic), which may be unrealistically short and hard to relate to real business conditions. Additionally, it has no pruning function.

ID3

  • The Iterative Dichotomiser 3 (ID3) is a DT algorithm that is mainly used to produce Classification Trees.
  • It has not proved very effective at building Regression Trees in its raw form (although techniques such as discretizing numerical attributes into intervals can improve its performance on regression tasks).
  • For the splitting process, ID3 uses the maximum Information Gain (IG) or minimum Entropy (H) metric to select the most useful attributes for classification.
  • ID3 will always try to maximize this metric, which means that the attribute with the highest Information Gain will split first.
  • But ID3 has some disadvantages: it can’t handle numeric attributes nor missing values, which can represent serious limitations.
  • ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with entropy greater than zero needs further splitting.

C4.5

  • C4.5 is the successor of ID3 and represents an improvement in several aspects.
  • C4.5 can handle both continuous and categorical data, making it suitable to generate Regression and Classification Trees.
  • Additionally, it can deal with missing values by ignoring instances that include non-existing data.
  • Unlike ID3 (which uses Information Gain as splitting criteria), C4.5 uses Gain Ratio for its splitting process.
  • Additionally, C4.5 includes a technique called windowing, which was originally developed to overcome the memory limitations of earlier computers.

Windowing means that the algorithm randomly selects a subset of the training data (called a "window") and builds a DT from that selection. This DT is then used to classify the remaining training data, and if it performs a correct classification, the DT is finished. Otherwise, all the misclassified data points are added to the window, and the cycle repeats until every instance in the training set is correctly classified by the current DT. This technique generally results in DTs that are more accurate than those produced by the standard process, thanks to the use of randomization, since it captures all the "rare" instances together with sufficient "ordinary" cases.

  • Another capability of C4.5 is that it can prune DTs.
  • C4.5’s pruning method is based on estimating the error rate of every internal node, and replacing it with a leaf node if the estimated error of the leaf is lower. In simple terms, if the algorithm estimates that the DT will be more accurate if the “children” of a node are deleted and that node is made a leaf node, then C4.5 will delete those children.

C5.0 is Quinlan’s latest version release under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.

Controlling Overfitting

Decision trees have a strong tendency to overfit the data, which is a serious problem, so you need to pay attention to the size of the tree: a very large tree can end up with leaves that each cater to a single data point.

If there is no limit set on a decision tree, it will give you 100% accuracy on the training data set because, in the worst case, it will end up making one leaf for each observation. This hurts accuracy when predicting samples that are not part of the training set.

The following is a summary of the disadvantages of decision trees:

  • They tend to overfit the data. If allowed to grow with no check on its complexity, a decision tree will keep splitting until it has correctly classified (or rather, mugged up) all the data points in the training set.
  • They tend to be quite unstable, which is an implication of overfitting. A few changes in the data can considerably change a tree.
  • Generally, a single tree gives lower prediction accuracy than many other machine learning algorithms (it is a high-variance model).

There are two broad strategies to control overfitting in decision trees: truncation and pruning.

  • Truncation — Stop the tree while it is still growing so that it may not end up with leaves containing very few data points. Note that truncation is also known as pre-pruning.
  • Pruning — Let the tree grow to any complexity. Then, cut the branches of the tree in a bottom-up fashion, starting from the leaves. It is more common to use pruning strategies to avoid overfitting in practical implementations.

Ensemble techniques such as Random Forest can also be used to overcome overfitting in decision trees.

Truncation or Pre-Pruning in Decision Tree

Though there are various ways to truncate, the DecisionTreeClassifier() function in sklearn provides the following hyperparameters which you can control:

  • criterion (Gini/IG or entropy): It defines the homogeneity metric to measure the quality of a split. Sklearn supports “Gini” criteria for Gini Index & “entropy” for Information Gain. By default, it takes the value of “Gini”.
  • max_features: It defines the no. of features to consider when looking for the best split. We can input integer, float, string & None value.

If an integer is given, it is used as the number of features to consider at each split.

If a float is given, it is interpreted as the fraction of features to consider at each split.

If "auto" or "sqrt" is given, then max_features = sqrt(n_features).

If "log2" is given, then max_features = log2(n_features).

If None, then max_features = n_features. By default, it takes the value None.

  • max_depth: The max_depth parameter denotes the maximum depth of the tree.

It can take any integer value or None. If None, nodes are expanded until all leaves contain just one data point (leading to overfitting) or until all leaves contain fewer than min_samples_split samples. By default, it is None.

  • min_samples_split: This tells about the minimum no. of samples required to split an internal node.

If an integer value is given, it is used directly as the minimum number of samples required to split.

If a float is given, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split. By default, it takes the value 2.

  • min_samples_leaf: The minimum number of samples required to be at a leaf node.

If an integer value is given, it is used directly as the minimum number of samples per leaf.

If a float is given, it is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples per leaf. By default, it takes the value 1.

  • min_impurity_decrease: A node will be split only if the split induces a decrease in impurity greater than or equal to this value. By default, it takes 0.0. (A sketch combining these hyperparameters follows.)
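
Putting these hyperparameters together, a sketch of a truncated (pre-pruned) tree; the specific values are illustrative, not recommendations:

    from sklearn.tree import DecisionTreeClassifier

    # All of the truncation hyperparameters described above, with illustrative values
    tree = DecisionTreeClassifier(
        criterion="entropy",          # homogeneity metric ("gini" is the default)
        max_depth=5,                  # stop growing beyond 5 levels
        min_samples_split=20,         # a node needs at least 20 samples to be split
        min_samples_leaf=10,          # every leaf must keep at least 10 samples
        max_features="sqrt",          # consider sqrt(n_features) features at each split
        min_impurity_decrease=0.001,  # split only if impurity drops by at least this much
    )
    # tree.fit(X_train, y_train) would then grow the truncated tree on your own data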

How do you choose (tune) the optimal values of these hyperparameters? Two common tools are:

k-fold cross-validation

  • Specifically, you can apply the k-fold cross-validation technique, where you divide the training data into k folds/groups of samples. If k = 5, you use k-1 folds to build the model and test it on the k-th fold. It is important to remember that k-fold cross-validation is applied only to the training data; the test data is used for the final evaluation. The extra step performed for cross-validation is that the training data itself is divided into train and validation parts, and this split keeps changing across the k folds so that the model generalises better.

The training data is divided into 5 blocks or folds; each time, 4 blocks are used as training data and the remaining one is used as validation data. Once the training process is complete, you move on to model evaluation on the held-out test data.
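
A small sketch of 5-fold cross-validation with sklearn (the Iris data and the depth value are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # 5-fold CV: each fold takes a turn as the validation block while the other 4 train the model
    scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=42), X, y, cv=5)
    print(scores)         # one accuracy score per fold
    print(scores.mean())  # averaged estimate of generalisation performance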

GridSearchCV()

Instead of randomly selecting hyperparameter values, a better approach is to use an algorithm that automatically finds the best parameters for a particular model. Grid search is one such algorithm: with GridSearchCV() you can try out different combinations of hyperparameters, which greatly eases the process of figuring out the best values.
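
A sketch of tuning the truncation hyperparameters with GridSearchCV (the grid values and dataset are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    param_grid = {
        "max_depth": [2, 3, 5, 10],
        "min_samples_leaf": [1, 5, 10],
        "criterion": ["gini", "entropy"],
    }
    search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_)  # the combination with the best cross-validated score
    print(search.best_score_)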

Pruning of Decision Trees

  • Tree pruning is a method of chopping off parts of a tree once it is fully grown.
  • It is a bottom-up approach used to solve the problem of overfitting.
  • In pruning, you chop off the tree branches; this results in a decrease in tree complexity and depth. It also helps in reducing overfitting.

There are various methods for the pruning:

  • Reduced Error Pruning (REP)
  • Cost-Complexity Pruning (CCP)
  • Minimum Error Pruning (MEP)
  • Critical Value Pruning (CVP)

Reduced Error Pruning

Reduced error pruning is the simplest and most understandable method of decision tree pruning. It considers each decision node in the tree as a candidate for pruning; pruning a node consists of removing the subtree rooted at that node and making it a leaf node.

Before proceeding, please note that the data set is divided into three parts: the training set, the validation set and the test set. The validation set is used for pruning the tree or tuning hyperparameters, i.e. after deciding on a set of hyperparameters for a tree, you check the accuracy of the tree on the validation set. The test set is used to provide an unbiased estimate of accuracy on future unseen examples.

  • If the accuracy of the pruned tree is higher than the accuracy of the original tree (on the validation set), then you keep that branch chopped. Remember that the validation set is the third part of the data set, the first and second being the training and test set.
  • The advantage of this method is its linear computational complexity. However, when the validation set is much smaller than the training set, this method may lead to over-pruning.

Cost-Complexity Pruning

  • Minimal cost-complexity pruning is an algorithm that recursively finds the node with the "weakest link" and prunes it. The weakest link is characterized by an effective alpha; the nodes with the smallest effective alpha are pruned first.
  • This algorithm is parameterized by α, known as the complexity parameter, which defines the cost-complexity measure of a given tree: R_α(T) = R(T) + α|T|, where R(T) is the total impurity of the tree's terminal nodes and |T| is the number of terminal nodes.

For more details: https://medium.com/swlh/post-pruning-decision-trees-using-python-b5d4bcda8e23
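
In sklearn, cost-complexity pruning is exposed through the ccp_alpha parameter and the cost_complexity_pruning_path method; a sketch with an illustrative dataset and a held-out set for choosing alpha:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Effective alphas along the pruning path: larger alpha means more aggressive pruning
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

    # Grow one tree per alpha and keep the one that scores best on the held-out data
    best = max(
        (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas if a >= 0),
        key=lambda t: t.score(X_test, y_test),
    )
    print(best.get_n_leaves(), best.score(X_test, y_test))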

Decision trees and outliers

Assuming that your outliers represent a tiny proportion of your dataset, it is very unlikely that a leaf will be created to isolate them (at least in the first steps of building your decision tree), because you will gain very little information on the complementary subset. However, if you have a large number of outliers (but then, are they really outliers?) and they tend to have the same outcome, chances are that a leaf will be created to isolate them.

On the other hand, null values should be treated whether it is through replacement, transformation or deletion from your observations. This would depend on your dataset.


That's all for now…

If you liked the article, show your support by clapping for this article. This article is basically a compilation of many articles from Medium, Analytics Vidhya, upGrad material, etc.

If you are also learning machine learning like me, follow me for more articles. Let's go on this trip together :)

You can also follow me on LinkedIn.
