Data for Multi-Instance Learning Based Web Index Recommendation
1. Summary
This package contains two parts:
- The "original" part contains 113 web index pages and their links.
Since every web index page has many links, this part is quite large, about 126 MB (30.9 MB after compression).
- The "processed" part contains 9 data sets for multi-instance learning.
This part is small, about 5.67 MB (1.36 MB after compression).
The web index pages are mainly from:
The data set has been used in: Z.-H. Zhou, K. Jiang, and M. Li. Multi-instance learning based web mining. Applied Intelligence, in press.
ATTN: You are free to use the package (for academic purposes only) at your own risk. Before publishing your results, however, please send a copy to:
Prof. Z.-H. Zhou
National Laboratory for Novel Software Technology,
Nanjing University, Mailbox 419,
Hankou Road 22,
Nanjing 210093, China
E-mail: zhouzh@nju.edu.cn or zhouzh@lamda.nju.edu.cn
URL: http://lamda.nju.edu.cn/zhouzh/
Download: [datafile] (30.2 MB)
2. Details
The 113 web index pages are labeled by 9 volunteers according to their interests. Therefore there are 9 data sets. If the volunteer is interested in at least one linked page of the index, then the web index page is labeled as positive. Otherwise the index page is labeled as negative. There is no label for the linked pages.
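The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the package provides no labels for the linked pages, so the per-page interest flags and the function name label_index_page are hypothetical, and 1/0 is an assumed encoding for positive/negative.

```python
def label_index_page(linked_page_interest):
    """Label a web index page (bag) from hypothetical per-link interest flags.

    Returns 1 (positive) if the volunteer is interested in at least one
    linked page, and 0 (negative) otherwise.
    """
    return 1 if any(linked_page_interest) else 0


# One interesting linked page is enough to make the index page positive.
print(label_index_page([False, True, False]))
# No interesting linked page makes the index page negative.
print(label_index_page([False, False, False]))
```

Because only the index pages carry labels, a learner sees the bag label but never the per-link flags used in this sketch.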
For each of the 9 data sets, 75 web index pages are randomly selected as training examples, while the remaining 38 pages are used as test examples.
The training and test sets are named v1.train, v1.test, ...
The class distributions of the data sets are:
---------------------------------------------------------------------
data set   positive in   negative in   positive in   negative in
           train set     train set     test set      test set
---------------------------------------------------------------------
v1         17            58            4             34
v2         18            57            3             35
v3         14            61            7             31
v4         56            19            33            5
v5         62            13            27            11
v6         60            15            29            9
v7         39            36            16            22
v8         35            40            20            18
v9         37            38            18            20
---------------------------------------------------------------------
In the "original" part of the package, there is a page that shows the IDs of the web index pages. In the "processed" part of the package, the 1st line of each .train or .test file shows the IDs of the web index pages that are included in the file.
In the .train and .test files, the web index pages are represented in multi-instance form. For example:
e01 {i11,i12,...,i1n}, ..., {im1,im2,...,imn},1.
where "e01" means that this is the 1st example (or bag) of the file, "{i11,i12,...,i1n}" is the 1st instance of e01, "i1j" is the value of the 1st instance of e01 on the jth attribute, and the final "1" means that this is a positive example.
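A line in this form could be read with a short Python sketch such as the one below. It assumes that each example sits on a single line, that instances are delimited by braces, that attribute values are comma-separated, and that the label is the integer after the last brace. The function name parse_bag is hypothetical, and the exact layout of the real files (including how negative examples are marked) should be checked before relying on it.

```python
import re

# Matches the body of one brace-delimited instance, e.g. "{a,b,c}" -> "a,b,c".
INSTANCE_RE = re.compile(r"\{([^{}]*)\}")


def parse_bag(line):
    """Parse one example line into (bag_id, instances, label).

    Attribute values are kept as raw strings, since their exact encoding
    in the files is not assumed here.
    """
    bag_id, rest = line.strip().split(None, 1)
    instances = [
        [value.strip() for value in body.split(",")]
        for body in INSTANCE_RE.findall(rest)
    ]
    # The label is assumed to be the integer following the last closing brace.
    tail = rest[rest.rfind("}") + 1:]
    match = re.search(r"-?\d+", tail)
    if match is None:
        raise ValueError("no label found in line: %r" % line)
    return bag_id, instances, int(match.group())


bag_id, instances, label = parse_bag("e01 {a,b,c}, {d,e,f},1.")
print(bag_id, len(instances), label)
```

Keeping the values as strings leaves the choice of how to turn terms into numeric features to the user.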
The examples contain different numbers of instances. The largest example is the 18th, which comprises 200 instances. The smallest example is the 90th, which comprises only 4 instances. On average, each example contains 30.29 instances (3423/113).
Each instance is described by 20 attributes that are the 1st to 20th most frequent terms appearing in the corresponding linked page. Note that it is not necessary to use all these attributes. The frequencies of the terms are included in the brackets following the terms.
When counting the occurrences of the frequent terms, 77 trivial terms (the stoplist) are ignored:
{', a, about, also, am, an, and, are, as, at, b, be, been, but, by,
can, com, could, didn't, do, doesn't, don't, during, for, from, had,
has, have, he, her, here, him, his, i, if, in, is, it, just, m, me,
might, no, not, of, on, or, our, out, over, she, so, still, td, that,
the, their, them, there, they, this, to, too, us, was, we, were, what,
where, when, who, whose, will, with, would, you, your}
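As an illustration of how such attributes might be derived, the Python sketch below counts terms in a page's text, skips the 77 stoplist terms, and returns the most frequent ones with their frequencies. The tokenization rule (lowercased runs of letters and apostrophes) and the function name top_terms are assumptions; the original preprocessing may have differed.

```python
import re
from collections import Counter

# The 77 stoplist terms listed above.
STOPLIST = set("""' a about also am an and are as at b be been but by
can com could didn't do doesn't don't during for from had
has have he her here him his i if in is it just m me
might no not of on or our out over she so still td that
the their them there they this to too us was we were what
where when who whose will with would you your""".split())


def top_terms(text, k=20):
    """Return up to k (term, frequency) pairs, most frequent first.

    Tokenization is an assumption: lowercase runs of letters/apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPLIST)
    return counts.most_common(k)


sample = "The learning of web pages: web mining and web learning."
print(top_terms(sample, k=2))
```

With k=20 this yields one instance's worth of attributes for a linked page. Pages with fewer than 20 distinct non-stoplist terms produce a shorter list in this sketch.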
Moreover, to filter out links to advertisements or to other index pages, a linked page is considered an instance of an example only if its corresponding link in the index page contains at least four terms.
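A minimal Python sketch of this filter follows. It assumes that the "terms" of a link are the whitespace-separated words of its text in the index page; the function name keep_link and that reading of "terms" are assumptions rather than the documented procedure.

```python
def keep_link(link_text, min_terms=4):
    """Keep a link only if its text has at least min_terms words.

    Assumption: terms are the whitespace-separated words of the link text.
    """
    return len(link_text.split()) >= min_terms


links = [
    "Click here",
    "Multi-instance learning for web mining",
]
# Only links that pass the filter would contribute an instance to the bag.
print([text for text in links if keep_link(text)])
```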