
Learning from the Imbalanced Data Sets


 

Description

Data in real-world tasks are usually imbalanced, i.e., some classes have many more instances than others. The level of imbalance (the ratio of the size of the majority class to that of the minority class) can be as large as 10^6. Learning algorithms that do not account for class imbalance tend to be overwhelmed by the majority class and to ignore the minority one.
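To make the ratio above concrete, here is a minimal sketch in plain Python. The function names and the toy label list are illustrative assumptions, not part of the original page. It computes the imbalance ratio and the accuracy of a trivial classifier that always predicts the majority class, which shows how a learner can look good while ignoring the minority class entirely.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical toy data: 990 negative and 10 positive instances.
labels = ["neg"] * 990 + ["pos"] * 10

ratio = imbalance_ratio(labels)                 # 990 / 10
baseline = majority_baseline_accuracy(labels)   # 990 / 1000
print(ratio, baseline)
```

On this toy data the trivial classifier reaches 99% accuracy without ever detecting a positive instance.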

Previous Workshops

Based on our understanding of the class imbalance problem, the following topics of discussion are proposed (the list is not exhaustive):

  • sampling (under-, over-, progressive, active)

  • post-processing of learned models

  • accounting for class imbalance via inductive bias

  • one-sided learning

  • handling uncertainty of target distribution and misclassification costs

  • handling varying amounts (class dependent) of label noise
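As an illustration of the first topic in the list above, the following is a minimal sketch of random under-sampling and random over-sampling using only the Python standard library. The helper names, the `seed` parameter, and the toy data are assumptions made for this example. Progressive and active sampling, which the list also mentions, are not shown.

```python
import random
from collections import Counter, defaultdict

def group_by_class(samples, labels):
    """Collect samples into one list per class label."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return groups

def random_undersample(samples, labels, seed=0):
    """Randomly discard instances until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = min(len(g) for g in groups.values())
    out = []
    for y, g in groups.items():
        out.extend((x, y) for x in rng.sample(g, target))
    rng.shuffle(out)
    return out

def random_oversample(samples, labels, seed=0):
    """Randomly duplicate instances until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = max(len(g) for g in groups.values())
    out = []
    for y, g in groups.items():
        extra = [rng.choice(g) for _ in range(target - len(g))]
        out.extend((x, y) for x in g + extra)
    rng.shuffle(out)
    return out

# Hypothetical toy data: 20 majority instances and 4 minority instances.
samples = list(range(24))
labels = [0] * 20 + [1] * 4

under = random_undersample(samples, labels)
over = random_oversample(samples, labels)
print(Counter(y for _, y in under), Counter(y for _, y in over))
```

Under-sampling throws away majority-class data, while over-sampling by duplication adds no new information and can encourage overfitting, so which one helps depends on the task.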

Several observations were made and certain issues were explored in particular depth. First, it was observed that a large number of applications suffer from the class imbalance problem. A distinction was nonetheless drawn between the small-sample problem and the class imbalance problem, and it was remarked that although smart sampling can sometimes help, it is not always possible.

Among the issues that received a lot of attention was the problem of evaluating learning algorithms in the presence of class imbalance. It was emphasized that common evaluation measures can yield misleading conclusions. More accurate measures include ROC curves and cost curves. An evaluation measure was also proposed for the case where data from only one class are available.

The other issues concerned the design of learning algorithms. It was shown that concept-learning methods can use a one-sided approach that focuses on either the majority or the minority class. If both classes are used, however, it is useful to avoid fragmentation of the minority class. Another important issue concerned the close connection between the class imbalance problem and cost-sensitive learning. Finally, creating a classifier that performs well across a range of costs and priors was declared to be an important goal.
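The point about misleading evaluation measures can be sketched in a few lines of plain Python. The functions and toy data below are illustrative assumptions, not taken from the workshop. The example compares accuracy with the area under the ROC curve (AUC), computed here through its rank-based interpretation: the probability that a randomly chosen positive instance scores higher than a randomly chosen negative one, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is quadratic in the data size; it is meant for
    illustration only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A degenerate model that gives every instance the same score
# and therefore always predicts the majority class.
constant_scores = [0.0] * 100
constant_pred = [0] * 100

acc = accuracy(y_true, constant_pred)
auc = roc_auc(y_true, constant_scores)
print(acc, auc)
```

The degenerate model scores 0.95 on accuracy but only 0.5 on AUC, the value expected from random ranking, which is why ranking-based measures are often preferred when classes are imbalanced.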

Special Issue

Paper List

Researchers in This Field


Last Modified: 2007-04-05 by Xu-Ying Liu
