Learning from Imbalanced Data Sets
Description
Data in real-world tasks are usually imbalanced, i.e.,
some classes have many more instances than others. The level of imbalance
(the ratio of the size of the majority class to that of the minority class)
can be as high as 10^6. Learning algorithms that do not account for class
imbalance tend to be overwhelmed by the majority class and ignore the
minority one.
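As a rough illustration of the imbalance level defined above, the following sketch computes the ratio for a list of labels, along with the accuracy of a trivial classifier that always predicts the majority class. The function names and the toy data (990 negatives, 10 positives) are assumptions made for this example only.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical toy data set: 990 majority-class and 10 minority-class instances.
labels = [0] * 990 + [1] * 10

ratio = imbalance_ratio(labels)
baseline = majority_baseline_accuracy(labels)
print(f"imbalance ratio: {ratio}, majority-only accuracy: {baseline}")
```

Even at this modest ratio, ignoring the minority class entirely already scores very well on plain accuracy, which is one way to see why a learner can be overwhelmed by the majority class.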
Previous Workshops
Based on our understanding of the class imbalance problem,
the following topics of discussion are proposed (the list is not exhaustive):
- sampling (under-, over-, progressive, active)
- post-processing of learned models
- accounting for class imbalance via inductive bias
- one-sided learning
- handling uncertainty of target distribution and misclassification costs
- handling varying amounts (class dependent) of label noise
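The simplest forms of the first topic, random under- and over-sampling, can be sketched as follows. This is a minimal illustration using only the standard library, not a method taken from the workshops; the helper names, the `seed` parameter, and the toy data are assumptions. Progressive and active sampling are not covered here.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(samples, labels):
    """Collect the samples belonging to each class label."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return groups


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = min(len(g) for g in groups.values())
    pairs = []
    for y, g in groups.items():
        pairs.extend((x, y) for x in rng.sample(g, target))
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


def random_oversample(samples, labels, seed=0):
    """Duplicate random instances of each class up to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = max(len(g) for g in groups.values())
    pairs = []
    for y, g in groups.items():
        extra = [rng.choice(g) for _ in range(target - len(g))]
        pairs.extend((x, y) for x in g + extra)
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


# Hypothetical toy data: instance ids 0..999, the last 10 are the minority class.
samples = list(range(1000))
labels = [0] * 990 + [1] * 10

_, under_labels = random_undersample(samples, labels)
_, over_labels = random_oversample(samples, labels)
print(Counter(under_labels), Counter(over_labels))
```

Under-sampling discards majority-class data, while over-sampling by duplication adds no new information and can encourage overfitting, so neither is a complete remedy on its own.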
Several observations were made and certain issues were
explored in particular depth. First, it was observed that a large number of
applications suffer from the class imbalance problem. A distinction was
nonetheless drawn between the small-sample problem and the imbalance problem,
and it was remarked that although smart sampling can sometimes help, it is
not always possible. Among the issues that received a lot of attention was
the problem of evaluating learning algorithms in the case of class
imbalances. It was emphasized that the use of common evaluation measures can
yield misleading conclusions. More appropriate measures include ROC curves and
cost curves. An evaluation measure was also proposed for the case where only
data from one class is available. The other issues concerned the design of
learning algorithms. It was shown that concept-learning methods can use a
one-sided approach focusing on either the majority or the minority class. If
both classes are used, however, avoiding fragmentation in the minority class
is useful. Another important issue concerned the close connection between
the class imbalance problem and cost-sensitive learning. Finally, the goal
of creating a classifier that performs well across a range of costs/priors
was declared to be an important one.
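To make the point about misleading evaluation measures concrete, the sketch below compares plain accuracy with the area under the ROC curve (AUC) for a degenerate model that assigns the same score to every instance. AUC is computed here by its pairwise-ranking interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The function names and toy data are assumptions for illustration, and cost curves are not shown.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    This pairwise loop is quadratic and meant only for small examples.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 990 negatives, 10 positives.
y_true = [0] * 990 + [1] * 10

# A degenerate model: the same score for everything, always predicting class 0.
constant_scores = [0.0] * 1000
constant_preds = [0] * 1000

acc = accuracy(y_true, constant_preds)
auc = roc_auc(y_true, constant_scores)
print(f"accuracy: {acc}, AUC: {auc}")
```

The constant model looks strong on accuracy yet has an AUC of chance level, because AUC depends only on how positives are ranked relative to negatives and not on the class prior or a single decision threshold. That independence is also why ROC-based measures fit the stated goal of performing well across a range of costs and priors.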
Special Issue
Paper List
Researchers in This Field
Last Modified:
2007-04-05 by Xu-Ying Liu