Text Data for Multi-Instance Multi-Label Learning



1. Summary

This package contains text data derived from the Reuters-21578 collection for multi-instance multi-label learning, which has been used in:

 

ATTN: You are free to use this package (for academic purposes only) at your own risk. An acknowledgment or citation of the above paper is required. For other purposes, please contact Prof. Zhi-Hua Zhou (zhouzh@nju.edu.cn).



    Download: datafile (141 KB)

2. Details



The text data is derived from the widely studied Reuters-21578 collection [1]. The seven most frequent categories are considered. After removing documents whose label sets or main texts are empty, 8,866 documents are retained, of which only 3.37% are associated with more than one class label. After randomly removing some documents with only one label, a text categorization data set containing 2,000 documents is obtained; around 15% of the documents in the resultant data set have multiple labels, and the average number of labels per document is 1.15 ± 0.37. Each document is represented as a bag of instances using the sliding window technique [2], where each instance corresponds to a text segment enclosed in one sliding window of size 50 (overlapping by 25 words). "Function words" on the SMART stop-list [3] are removed from the vocabulary and the remaining words are stemmed. Instances in the bags adopt the "bag-of-words" representation based on term frequency [1]. Without loss of effectiveness, dimensionality reduction is performed by retaining the top 2% of words with the highest document frequency [4]. Thereafter, each instance is represented as a 243-dimensional feature vector.
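The windowing step above can be sketched as follows. This is an illustrative Python sketch, not the original preprocessing code; the handling of the final partial window and of documents shorter than one window are assumptions.

```python
def make_bag(words, window=50, overlap=25):
    """Split a token list into overlapping segments (instances).

    Each instance covers `window` tokens; consecutive windows overlap
    by `overlap` tokens, so the stride is `window - overlap`.
    A document shorter than one window yields a single instance
    (an assumption; the original paper does not specify this case).
    """
    stride = window - overlap
    if len(words) <= window:
        return [words]
    bag = []
    # Stop once the remaining tail is fully covered by the last window.
    for start in range(0, len(words) - overlap, stride):
        bag.append(words[start:start + window])
    return bag

# A 120-token document yields 4 overlapping instances:
# tokens 0-49, 25-74, 50-99, and 75-119.
doc = [f"w{i}" for i in range(120)]
bag = make_bag(doc)
```

Each resulting segment would then be mapped to a term-frequency vector over the reduced 243-word vocabulary.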

The specific characteristics of the resultant text data are:

Number of examples: 2,000
Number of classes: 7
Number of features: 243
Instances per bag:
    min: 2
    max: 26
    mean ± std. deviation: 3.56 ± 2.71
Labels per example (k):
    k=1: 1,701
    k=2: 290
    k=3: 9
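As a sanity check, the label-count distribution above reproduces the summary statistics quoted earlier (about 15% multi-label documents, 1.15 ± 0.37 labels per document):

```python
import math

# Label-count distribution from the table above
counts = {1: 1701, 2: 290, 3: 9}

total = sum(counts.values())                                   # 2,000 examples
mean = sum(k * n for k, n in counts.items()) / total           # ≈ 1.15 labels/example
std = math.sqrt(sum(n * (k - mean) ** 2 for k, n in counts.items()) / total)  # ≈ 0.37
multi = (counts[2] + counts[3]) / total                        # ≈ 0.15 multi-label fraction
```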
   

After reading the data into the MATLAB environment, the i-th text document (in bag representation) is stored in bags{i,1}, with its associated labels in target(:,i). For illustration, suppose target(:,i)' equals [1 -1 -1 1 -1 -1 1]; this means that the i-th text document belongs to the 1st, 4th and 7th classes but does not belong to the remaining classes.
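If the data is instead processed outside MATLAB, one column of the target matrix can be decoded as follows. This is a hypothetical Python helper matching the +1/-1 label convention described above, not part of the distributed package:

```python
def positive_classes(target_col):
    """Return the 1-based indices of the classes a document belongs to.

    `target_col` is one column of the `target` matrix described above,
    where +1 marks class membership and -1 marks non-membership.
    """
    return [j + 1 for j, v in enumerate(target_col) if v == 1]

# The example column from the text: classes 1, 4 and 7.
positive_classes([1, -1, -1, 1, -1, -1, 1])  # → [1, 4, 7]
```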



[1] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1): 1-47, 2002.

[2] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In: Advances in Neural Information Processing Systems, pp. 561-568. MIT Press, Cambridge, MA, 2003.

[3] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Pennsylvania, 1989.

[4] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning, pp. 412-420, Nashville, TN, 1997.


   



File: miml-text-data.rar (141.84 KB)

Contact LAMDA: (email) contact@lamda.nju.edu.cn (tel) +86-25-89685926. © LAMDA, 2016