Saturday, May 10, 2008

Quick Survey on Text Data Sets

Here is a list of text data sets available on the web, with some comments on their content.

1. UCI's Machine Learning Repository - A huge list of data sets covering a variety of topics and formats, browsable through a searchable interface. There are 8 textual data sets available, including the popular Reuters-21578 and 20 Newsgroups data sets for text classification.

2. TechTC Data Set is a repository of text documents for use in classification problems. It is the end result of the framework proposed by Davidov et al. (2004) and is composed of documents crawled from the Open Directory Project. The data set comes either as plain text, stripped of HTML tags, or as feature vectors that classifiers can process more readily. There are 300 labelled sets, subdivided into positive and negative categories.

3. Sentiment Analysis. The polarity data set was made available by the Cornell NLP Department and contains many resources for performing sentiment analysis on text data. The data is a set of movie reviews extracted from IMDB and labelled according to the author's opinion.

4. The Splog Data Set at the eBiquity Group - A data set of labelled blog websites for detecting spam blogs, or splogs: blogs with machine-generated content aimed at raising PageRank status or posting ads.

A full list of the data sets I am keeping track of is available on this delicious bookmark:
http://del.icio.us/bohana/dataset
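
Several of the data sets above come as plain text files grouped by label. As a rough sketch of how such a collection might be read into memory for a classifier, the snippet below assumes one sub-directory per label (here "pos" and "neg") containing one .txt file per document. That layout, the function name load_labelled_texts and the label names are assumptions made for illustration, so check how each data set is actually packaged before relying on it.

```python
from pathlib import Path


def load_labelled_texts(root, labels=("pos", "neg")):
    """Return a list of (text, label) pairs read from root/<label>/*.txt.

    Assumes a hypothetical layout with one sub-directory per label and
    one plain-text document per file.
    """
    samples = []
    for label in labels:
        # Sort the file names so the order is the same on every platform.
        for path in sorted(Path(root, label).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            samples.append((text, label))
    return samples
```

Calling load_labelled_texts on the directory where a data set was unpacked would then give a list of (text, label) pairs that can be split into training and test sets.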

Sunday, November 25, 2007

Visual Complexity

A stunning array of projects in knowledge visualization, many of them exploring techniques for visualizing textual data and its relations to other data sources. As a follow-on from many mining exercises, visualization opens up far wider choices for humans to relate to the original raw content.

Sunday, October 14, 2007

Text Mining Video Lectures

Taken from Ljubljana's Summer School on Semantic Web '05. More lectures are available at http://videolectures.net

A lecture on applications of document summarization for generating semantic information in the form of graphs representing a given relationship on the text set.

Learning Semantic Sub-graphs for Document Summarization, Marko Grobelnik


Information Extraction Lecture

Information Extraction, Ronen Feldman

And here's one that I made: a mind map of the above lecture: