Handling Class Imbalance – An Ensemble Approach: Majority Voting on Minority Samples


Class imbalance is a major problem for machine learning models, arising when the total number of samples in one class is far lower than in another. The problem is extremely common in practice and can be observed in various disciplines, including fraud detection, anomaly detection, medical diagnosis, oil spillage detection, and facial recognition.

Existing Approaches

Existing approaches can be classified into four categories:

  1. Generating synthetic samples of the minority class (existing packages such as SMOTE in R).
  2. Oversampling: repeatedly sampling the minority class so it has more effect on the machine learning algorithm (ROSE in R).
  3. Undersampling: removing some of the majority class so it has less effect on the machine learning algorithm (ROSE in R).
  4. Hybrid: a mix of oversampling and undersampling.

The above methods operate on the number of samples used to train the model; a sketch of the first three follows.
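For illustration, here is a minimal sketch of those sampling strategies using Python's imbalanced-learn package. The post cites R packages; imbalanced-learn is an assumed Python counterpart and is not part of the original experiments.

```python
# Sketch of the three sampling strategies with imbalanced-learn
# (assumed Python counterpart to the R packages named above).
from collections import Counter

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Toy imbalanced dataset: roughly 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
print("original:", Counter(y))

# 1. Synthetic minority samples (SMOTE).
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print("SMOTE:", Counter(y_sm))

# 2. Oversampling by repeating minority samples.
X_ov, y_ov = RandomOverSampler(random_state=42).fit_resample(X, y)
print("oversampled:", Counter(y_ov))

# 3. Undersampling by dropping majority samples.
X_un, y_un = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("undersampled:", Counter(y_un))
```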

There are a few algorithms that impose an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class. The handling of class penalties or weights is often specialized to the learning algorithm (e.g., SVM).

A few algorithms handle the data at the sampling level as part of model building (DecisionTreeClassifier, RandomForestClassifier, etc.), as in the sketch below.
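As a hedged illustration of these two ideas in scikit-learn (these snippets are not taken from the linked repository):

```python
# Cost-sensitive learning via scikit-learn's class_weight parameter.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# SVM: penalize mistakes on the minority class (label 1) ten times more.
svm = SVC(class_weight={0: 1, 1: 10})

# Random forest: re-weight the classes inside each bootstrap sample.
rf = RandomForestClassifier(class_weight="balanced_subsample", random_state=42)
```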

Each method has its own advantages and disadvantages. In this blog, a novel approach is proposed to handle class imbalance. The methodology has been derived from practical analysis of various class-imbalance model implementations and their results. The approach focuses on ensembling models trained on undersampled subsets, which gives more weight to the minority samples while still covering most of the majority samples.

Proposed Approach

The proposed ensembling method is majority voting on minority samples, through which we can get better results (sensitivity/specificity) on the minority samples.

Architecture

Training:

  1. Separate the samples based on class:
    a) Dataset 1 – all samples belonging to the majority class.
    b) Dataset 2 – all samples belonging to the minority class.
  2. Draw subsets from the majority class without repetition (i.e., sample without replacement).
  3. Merge each majority subset drawn in step 2 with the full minority dataset.
  4. Repeat step 3 for every subset and build one model per merged dataset.

    Note: The minority samples are repeated in every merged dataset, whereas the majority samples are not. A sketch of this phase follows.
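A minimal sketch of the training phase, assuming scikit-learn and NumPy arrays for X and y. The name train_ensemble and all other identifiers here are illustrative, not taken from the linked repository.

```python
# Illustrative training phase: split the majority class into
# non-overlapping subsets, pair each with ALL minority samples,
# and fit one model per merged dataset.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(X, y, minority_label=1, n_subsets=5, base_model=None):
    base_model = base_model or DecisionTreeClassifier(random_state=42)
    min_idx = np.where(y == minority_label)[0]   # Dataset 2 (minority)
    maj_idx = np.where(y != minority_label)[0]   # Dataset 1 (majority)
    rng = np.random.default_rng(42)
    rng.shuffle(maj_idx)
    models = []
    # Step 2: majority subsets without repetition (no replacement).
    for subset in np.array_split(maj_idx, n_subsets):
        # Step 3: merge the majority subset with the full minority set.
        idx = np.concatenate([subset, min_idx])
        # Step 4: one model per merged dataset.
        models.append(clone(base_model).fit(X[idx], y[idx]))
    return models
```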

Model predictions:

  1. Apply each of the models to obtain its predictions.
  2. Separate the majority-sample and minority-sample predictions, and take a majority vote on the minority samples across the multiple predictions generated.
  3. We now have predictions for both the majority and the minority samples.
[Figure: training-phase architecture diagram]
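The voting step can be sketched as follows. Again, predict_ensemble is an assumed helper name; it consumes the list of models returned by the train_ensemble sketch above.

```python
# Illustrative majority-voting step over the ensemble's predictions.
import numpy as np

def predict_ensemble(models, X):
    # Stack per-model predictions: shape (n_models, n_samples).
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    # Majority vote per sample: the most frequent label across models.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```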

Testing:

  1. Test data should be passed through all the models that were built in the training phase. The majority-voting classifier, i.e., the mode of all the predictions, should be considered the final test prediction.
[Figure: testing-phase architecture diagram]
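An illustrative end-to-end usage of the two sketches above on a synthetic imbalanced dataset (not one of the datasets used in this post):

```python
# End-to-end usage of the illustrative train_ensemble / predict_ensemble
# sketches on synthetic data.
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = train_ensemble(X_train, y_train, n_subsets=5)
y_pred = predict_ensemble(models, X_test)
print("minority-class recall on test data:", recall_score(y_test, y_pred))
```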

Implementation:

This architecture was tested on a variety of datasets that exhibit class imbalance.

The following are some of the considerations/assumptions made while deploying the above architecture:

  1. Only binary classification is considered.
  2. A dataset is treated as imbalanced when the minority class accounts for less than 15% of the samples.
  3. The majority samples are drawn into different subsets so that each merged training set has roughly a (70-75)-(25-30) majority-to-minority ratio.
  4. Benchmark metrics are taken from the base models with default parameters.
  5. The train and test datasets are common across all experiments conducted on a particular dataset.
  6. The datasets used are from the banking and insurance domains.

The class ratios of the datasets (in percentages) are as follows:

  1. Insurance data (dimensions: 9678 × 87)
  2. Bankruptcy (dimensions: 34402 × 30)
  3. Credit card fraud (dimensions: 284807 × 31)

A few models were applied with their default parameters to the datasets above.

Note that in a few models the class_weight parameter was also used.

Results:

Below are the results of applying the basic models to each dataset.

Dataset-1

Note that the highest recall obtained was 12.87, for the decision tree and for the random forest with class_weight="balanced_subsample". After applying the proposed method, the results are as follows:

Note: recall increased from 12.87 to 46.21.

(You can notice in the code that we have not used any model with class_weight.)

If we continue the experiment by including these types of models, we can expect some further improvement.

Dataset-2

After applying the proposed method, the results are as follows:

Note: recall increased from 31.42 to 46.34.

Dataset-3

Here, the highest recall is 82.40.

After applying the proposed method, the results are as follows:

Please note that when the models are run multiple times there will be slight changes in the results, as the samples change every time.

Additionally, the result of undersampling down to the minority-sample size is given below.

But you can see a problem with this approach: when we undersample, we miss the other distributions present in the data. This is overcome by the proposed architecture.

Insurance data (row 0: accuracy on train data; row 1: recall on train data; row 2: accuracy on test data; row 3: recall on test data)

As an endnote: there is still a lot of scope to improve the results by continuing to experiment with various parameter changes.

Code can be found at the following link: https://github.com/kspkoyyada/HandlingClassImbalance/tree/master/HandlingClassImbalance
