
Text Data for Multi-Instance Learning

1. Summary

This package contains the text data for multi-instance learning, which has been used in:

 ATTN:   You are free to use the package (for academic purposes only) at your own risk. For other purposes, please contact Prof. Zhi-Hua Zhou.

Download:   [datafile] (1.49Mb)


2. Details

The twenty text categorization data sets were derived from the 20 Newsgroups corpus, which is widely used in text categorization. Fifty positive and fifty negative bags were generated for each of the 20 news categories. In each positive bag, 3% of the posts are drawn at random from the target category; the remaining instances (and all instances in negative bags) are drawn uniformly at random from the other categories. Each instance is a post represented by its top 200 TF-IDF features.
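The bag construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the script used to build the package: the function name make_bag, the toy pool of post identifiers, the rule that a positive bag holds at least one target post, and the choice to pick a non-target category uniformly before picking a post from it are all hypothetical. Bag sizes are passed in, because the page does not say how they were chosen.

```python
import random


def make_bag(pool, target, size, positive, rng, pos_frac=0.03):
    """Return (instances, bag_label, instance_labels) for one synthetic bag.

    pool maps a category name to a list of posts. pos_frac follows the 3%
    figure on this page; the "at least one target post" rule is an assumption.
    """
    others = [c for c in pool if c != target]
    n_pos = max(1, round(pos_frac * size)) if positive else 0

    # Target-category posts appear only in positive bags.
    instances = rng.sample(pool[target], n_pos)
    labels = [1] * n_pos

    # Remaining posts: choose a non-target category uniformly, then a post
    # from it (assumed reading of "uniformly drawn from other categories").
    for _ in range(size - n_pos):
        cat = rng.choice(others)
        instances.append(rng.choice(pool[cat]))
        labels.append(0)

    # Shuffle so that target posts are not always at the front of the bag.
    order = list(range(size))
    rng.shuffle(order)
    return [instances[i] for i in order], int(positive), [labels[i] for i in order]


# Toy pool: 4 categories with 50 placeholder post identifiers each.
rng = random.Random(0)
pool = {f"cat{k}": [f"cat{k}/post{j}" for j in range(50)] for k in range(4)}
bag, bag_label, inst_labels = make_bag(pool, "cat0", size=40, positive=True, rng=rng)
```

In the real data each instance would be a 200-dimensional TF-IDF vector rather than a string identifier; strings are used here only to make the origin of each post easy to inspect.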

Following are specific characteristics of the resultant text data:

Number of examples: 2,000
Number of classes: 20
Number of features: 200
Instances per bag:
    min: 8
    max: 84
    mean±std.: 40.07±15.27

After the data are loaded into the MATLAB environment, the i-th bag is stored in bags{i,1}, its associated label in bags{i,2}, and the labels of its instances in bags{i,3}.
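As a rough illustration of this layout, the sketch below mirrors the three columns as Python tuples (instances, bag label, instance labels) and computes the instances-per-bag summary. The function name bag_size_stats and the toy bags are hypothetical, and the population standard deviation is an assumption, since the page does not say which estimator produced the reported figure.

```python
from statistics import mean, pstdev


def bag_size_stats(bags):
    """Summarise instances per bag.

    Each entry of bags is (instances, bag_label, instance_labels), mirroring
    bags{i,1}, bags{i,2} and bags{i,3}.
    """
    sizes = [len(instances) for instances, _, _ in bags]
    return {
        "min": min(sizes),
        "max": max(sizes),
        "mean": mean(sizes),
        "std": pstdev(sizes),  # population std; the page's estimator is unknown
    }


# Toy stand-in for the MATLAB cell array: three bags with 2, 4 and 6 instances,
# each instance being a 200-dimensional feature vector of zeros.
toy_bags = []
for n, label in [(2, 0), (4, 1), (6, 0)]:
    instances = [[0.0] * 200 for _ in range(n)]
    inst_labels = [label] + [0] * (n - 1)
    toy_bags.append((instances, label, inst_labels))

stats = bag_size_stats(toy_bags)
```

Applied to the actual bags, the same summary should be comparable with the min, max and mean±std values listed above, up to the choice of standard-deviation estimator.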

In the package we also provide detailed results of the miGraph approach (see our ICML'09 paper) on the data.

ATTN2:       This package was developed by Ms. Yu-Yin Sun. For any problems concerning the package, please feel free to contact Ms. Sun.


Contact LAMDA: (email) (tel) +86-025-89681608 © LAMDA, 2016