Saturday, May 10, 2008

Quick Survey on Text Data Sets

Here is a list of text data sets available on the web, with some comments on their content.

1. UCI's Machine Learning Repository - A large collection of data sets covering a variety of topics and formats, with a searchable interface. There are 8 textual data sets available, including the popular Reuters-21578 and 20 Newsgroups data sets for text classification.

2. TechTC Data Set - A repository of text documents for use in classification problems. It is the end result of the framework proposed by Davidov et al. (2004) and is composed of documents crawled from the Open Directory Project. The data set comes either as plain text, stripped of HTML tags, or as feature vectors that can be more readily processed by classifiers. There are 300 labelled sets, subdivided into positive and negative categories.

3. Sentiment Analysis - The polarity data set was made available by the Cornell NLP Department, and contains many resources for performing sentiment analysis on text data. The data is a set of movie reviews extracted from IMDB and labelled according to the author's opinion.

4. The Splog Data Set at the eBiquity Group - A data set composed of labelled blog websites for detecting spam blogs, or splogs: blogs with machine-generated content aimed at raising PageRank status or posting ads.
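
Several of the sets above label each document with one of two classes (positive/negative, spam/not spam). As a rough illustration of how such data could be read into memory for a classifier, here is a minimal Python sketch. It assumes a hypothetical layout that none of the sets above is guaranteed to use: one directory per label (here "pos" and "neg") holding one plain-text file per document. The function name and the directory names are my own placeholders, so adjust them to whatever the downloaded archive actually contains.

```python
from pathlib import Path


def load_labelled_documents(root, labels=("pos", "neg")):
    """Return a list of (text, label) pairs read from root/<label>/*.txt.

    The directory layout and label names are assumptions made for
    illustration; check the archive you downloaded before relying on them.
    """
    documents = []
    for label in labels:
        # sorted() keeps the document order stable between runs
        for path in sorted(Path(root, label).glob("*.txt")):
            # errors="replace" avoids crashing on stray non-UTF-8 bytes
            text = path.read_text(encoding="utf-8", errors="replace")
            documents.append((text, label))
    return documents
```

Calling load_labelled_documents("some_dataset_dir") would then give a list of (text, label) pairs that can be split into training and test portions.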

The full list of data sets I am keeping track of is available at this delicious bookmark:
http://del.icio.us/bohana/dataset
