Knowledge Discovery and Opinion Mining: May 2008

Saturday, May 10, 2008

Experiment Databases

A repository with results on data mining experiments

The Experiment Databases tool makes empirical results from AI data mining experiments more accessible and reusable. The site hosts an SQL query engine that retrieves results from mining experiments recorded in its repository. These can then be easily studied and compared to other similar experiments, something that could otherwise take long hours of sifting through results in articles and research reports.

At a high level, the relational data model consists of an experiment, a learner, a data set, a machine where the test is run, and results on a model and evaluation method. There is also scope for storing the parameters of each learner, and for ensemble experiments.

Queries are written in standard SQL over the documented data model, and useful syntax like desc "table" is also available.

Check out some examples of queries here.
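
To give a feel for what querying such a model looks like, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and accuracy values are all made up for illustration; the real Experiment Databases schema is richer and named differently, so treat this as an assumption-laden toy rather than the site's actual model.

```python
import sqlite3

# Hypothetical, heavily simplified schema: NOT the real Experiment Databases model.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE learner (lid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dataset (did INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE experiment (
    eid INTEGER PRIMARY KEY,
    lid INTEGER REFERENCES learner(lid),
    did INTEGER REFERENCES dataset(did),
    accuracy REAL
);
""")

conn.executemany("INSERT INTO learner VALUES (?, ?)",
                 [(1, "NaiveBayes"), (2, "DecisionTree")])
conn.executemany("INSERT INTO dataset VALUES (?, ?)",
                 [(1, "toy-a"), (2, "toy-b")])
# Invented accuracy values, for illustration only.
conn.executemany("INSERT INTO experiment VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 0.80), (2, 1, 2, 0.70),
                  (3, 2, 1, 0.90), (4, 2, 2, 0.70)])

# Compare learners by their mean accuracy across all recorded experiments.
query = """
SELECT l.name, AVG(e.accuracy) AS mean_accuracy
FROM experiment e
JOIN learner l ON l.lid = e.lid
GROUP BY l.name
ORDER BY l.name
"""
rows = conn.execute(query).fetchall()
for name, mean_accuracy in rows:
    print(f"{name}: {mean_accuracy:.2f}")
```

The join-and-aggregate pattern is the point here: once experiments, learners and data sets live in separate tables, comparing learners across data sets becomes a single query instead of a trawl through papers.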

Quick Survey on Text Data Sets

Here is a list of text data sets available on the web, with some comments on their content.

1. UCI's Machine Learning Repository - A huge list of data sets covering a variety of topics and formats, with a searchable interface. There are 8 textual data sets available, including the popular Reuters-21578 and 20 Newsgroups data sets for text classification.

2. TechTC Data Set is a repository of text documents to be used in classification problems. It is the end result of the framework proposed by (Davidov et al., 2004) and is composed of documents crawled from the Open Directory Project. The data set comes either in a convenient text format, stripped of HTML tags, or as feature vectors, which can be more readily processed by classifiers. There are 300 labelled sets subdivided into positive and negative categories.

3. Sentiment Analysis. The polarity data set was made available by the Cornell NLP Department, and contains lots of resources for performing sentiment analysis on text data. The data is a set of movie reviews extracted from IMDb and labelled according to the author's opinion.

4. The Splog Data Set at the eBiquity Group - A data set composed of labelled blog websites for detecting spam blogs, or splogs: blogs with machine-generated content aimed at raising PageRank status or posting ads.
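
As an aside on the "feature vector" form mentioned for TechTC above, the sketch below shows one common way of turning raw text into such a vector: a plain bag-of-words count. This is only an illustration using the Python standard library; the function names and the two toy documents are my own, and it does not reproduce TechTC's actual file format or preprocessing.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(documents):
    """Map every distinct token to a column index, in alphabetical order."""
    terms = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}

def to_feature_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    # Dicts preserve insertion order, so columns follow the vocabulary order.
    return [counts.get(term, 0) for term in vocabulary]

# Toy documents, invented for illustration.
documents = ["Data mining finds patterns", "Text mining mines text"]
vocabulary = build_vocabulary(documents)
vectors = [to_feature_vector(doc, vocabulary) for doc in documents]

print(list(vocabulary))
for vector in vectors:
    print(vector)
```

Each document becomes a fixed-length list of counts, which is the kind of input most off-the-shelf classifiers expect.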

A full list of the data sets I am keeping track of is available on this delicious bookmark:
http://del.icio.us/bohana/dataset