Saturday, April 6, 2013

Online Tools for Sentiment Classification - Part II: The AIRC Sentiment Analyzer

The DIT AI Research Center (AIRC) Sentiment Analyzer is a web interface demonstrating some of our Sentiment Analysis research. It is available at:
http://airc-sentiment.org/
The Analyzer classifies the sentiment of text input using a lexicon-based classifier (discussed in an earlier post here), and allows some tuning of parameters so you can test different behaviors: users can select which part-of-speech tags and which lexicon to use, and enable or disable negation detection.



The work is based on our research presented in:
Ohana, Bruno, Brendan Tierney, and S. Delany. "Domain Independent Sentiment Classification with Many Lexicons." Advanced Information Networking and Applications (WAINA), 2011 IEEE Workshops of International Conference on. IEEE, 2011. (PDF)

Sunday, March 31, 2013

Online Tools for Sentiment Classification - Part I

Seeing Sentiment Analysis in action is a good way of getting a feel for what the tasks are about, and what techniques are in use today. In this post I survey some of the research interfaces available online which you can use on your favorite piece of text. 

Christopher Potts Tutorial - Text Scoring Demo


This tutorial demo by Stanford Professor Christopher Potts features an engine that analyses text and classifies sentiment according to hits from different sentiment lexicons.

http://sentiment.christopherpotts.net/textscores/

His tutorial page also links to code demos in Python showcasing a number of techniques for Sentiment Analysis. 


TweetFeel

TweetFeel is a polished UI for quickly inspecting sentiment on Twitter posts, but it is less a research project than a demo of a commercial product (there are not many details of the underlying techniques here). My tests revealed much dismay on the Twittersphere at the Ireland squad's recent results in the WC qualifiers :(

http://www.tweetfeel.com/#ireland


Sentimentor



Sentimentor is a project from Univ. of Brighton's James Spencer. The output of this tool is a visual display of how each term was scored with respect to sentiment orientation. 

The method and code behind this interface are available upon request from the author.



Sentiment140

Sentiment140 is a project from Stanford University students Alec Go, Richa Bhayani, and Lei Huang, based on their work on using machine learning algorithms to classify Twitter sentiment (requires a Twitter account).

SentiWordNet



SentiWordNet is a sentiment lexicon built from WordNet. It assigns sentiment orientation to words by expanding an initial seed set through inspection of term relationships and term glosses. The UI allows querying for a term, giving back numeric scores as a tuple (positive, negative, objective).


The lexicon is described in this paper (see here for v1.0), and is also available for download.

Other Resources




Sunday, November 11, 2012

At the ICCBR 2012 Conference

Back in September I took part in the ICCBR 2012 conference in Lyon. I had a chance to present our research in Case-Based methods for sentiment classification as a main conference paper, and also take part in the doctoral consortium. All in all, a great experience.

Our research paper focuses on the following question (which is also a key component of my PhD research): How can case-based methods help in cross-domain sentiment classification? Broadly speaking, the key motivators for exploring these methods are (a) the fact that there are countless strategies for applying a lexicon to extract sentiment orientation from a document - not to mention the wide choice of sentiment lexicons available - coupled with (b) the fact that no single method is likely to win in all domains and all types of documents one might encounter. Thus, it may be advantageous to somehow reuse the knowledge of which methods worked on past predictions when attempting to classify documents on unseen domains. Enter Case-Based Reasoning!

This idea, along with a sentiment classification experiment across domains, is discussed in our paper: A Case-Based Approach to Cross Domain Sentiment Classification (full paper here). I have also posted the PPT of my presentation at the conference.





Sunday, September 25, 2011

Sentiment Classification at RCOMM 2011

Earlier this year I gave a presentation on Sentiment Classification at the 2011 RapidMiner User Conference in Dublin. I have posted the slides on Slideshare.

There is an extended experiment based on what has been discussed in this blog, but now running on RapidMiner 5. The original word vector model is extended using features derived from a sentiment lexicon using RM's data manipulation processes.

And here's the full article for the above presentation with a more detailed discussion on the results obtained.

Wednesday, February 2, 2011

Sentiment Classification and Opinion Lexicons


Lexicons are a big part of my current research in opinion mining. Aside from the potential of helping supervised learning methods, they can be applied to unsupervised techniques - an appealing idea for research whose goal is domain independence. An opinion lexicon is a database that associates terms with opinion information - normally in the form of a numeric score indicating a term's positive or negative bias.

My dissertation was an investigation into how lexicons perform on sentiment classification of film reviews - this work was later expanded and incorporated into a chapter in the book "Knowledge Discovery Practices and Applications in Data Mining - Trends and New Domains".
A shorter version of this research was presented at Dublin's IT&T 2009 and is available here.

The lexicon used here was SentiWordNet. Built from WordNet, SentiWordNet leverages WordNet's semantic relationships like synonyms and antonyms, and term glosses to expand a set of seeded words into a much larger lexicon. It can be tried online here. (also see Esuli and Sebastiani's SentiWordNet paper).

Using SentiWordNet for sentiment classification involves scanning a document for relevant terms and matching available information from the lexicon according to part of speech. There are some interesting NLP challenges involved here: we first run the text through a part-of-speech tagger to determine whether terms are adjectives, verbs, etc. Then negation detection is performed to identify parts of the text affected by a negating statement (e.g. "not good" as opposed to "good"). Finally, the document is scored based on the terms found and whether they are negated. The overall approach is given below.
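As a rough illustration of these steps, and not the code used in the research, here is a minimal Python sketch. It assumes the text has already been part-of-speech tagged, uses a tiny invented lexicon in place of SentiWordNet, and treats negation with a simple fixed window; the names TOY_LEXICON, NEGATORS and NEGATION_WINDOW are hypothetical.

```python
# Toy lexicon keyed by (term, part of speech); the scores are invented for illustration.
TOY_LEXICON = {
    ("good", "ADJ"): 0.6,
    ("terrific", "ADJ"): 0.9,
    ("lame", "ADJ"): -0.7,
    ("poorly", "ADV"): -0.5,
}
NEGATORS = {"not", "no", "never"}
NEGATION_WINDOW = 2  # assumption: a negator affects the next two tokens


def score_document(tagged_tokens, lexicon=TOY_LEXICON, allowed_pos=("ADJ", "ADV")):
    """Sum lexicon scores over (token, pos) pairs, flipping the sign of negated terms."""
    total = 0.0
    negate_left = 0  # number of upcoming tokens still inside a negation scope
    for token, pos in tagged_tokens:
        word = token.lower()
        if word in NEGATORS:
            negate_left = NEGATION_WINDOW
            continue
        # Only terms with a selected part of speech contribute to the score.
        score = lexicon.get((word, pos), 0.0) if pos in allowed_pos else 0.0
        if negate_left > 0:
            score = -score
            negate_left -= 1
        total += score
    return total


def classify(tagged_tokens):
    """Label a document by the sign of its total score."""
    return "positive" if score_document(tagged_tokens) >= 0 else "negative"
```

With this sketch, the tagged sentence "The film was not good" scores negatively because "good" falls inside the negation window, while "good" on its own scores positively.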


Resources

Tuesday, September 8, 2009

Parameter Testing - Letting RapidMiner Do The Hard Work

In a previous post we discussed an example of how to perform text classification in RapidMiner, and we used a data set of film reviews against several word vector schemes to classify documents according to their overall positive or negative sentiment. In this tutorial we show how to look for better results by using RapidMiner's parameter testing feature, and we evaluate the effects of feature selection on the original classification scheme.

Parameter Testing
There are many factors that come into play in determining the performance of a classification task: for example, tuning parameters on the classification algorithm, the use of outlier detection, feature selection and feature generation can all affect the end result. In general, it is hard to know a priori which combination of parameters will be the optimal one for a given data set or class of problem, and testing several possibilities of parameter values is the only way to better understand their influence and find a better fit.

The number of possible combinations for tuning a classification task, however, grows fast, and testing them manually can become tedious very quickly. This is where parameterization can help. In RapidMiner, under Meta -> Parameter operators, we'll find several parameter optimization schemes:
  • Parameter iterator
  • Grid Parameter Optimization
And also algorithms that implement more sophisticated parameter searching schemes:
  • QuadraticParameterOptimization
  • EvolutionaryParameterOptimization
Feature Selection
We would like to test the effect of feature selection on our previous sentiment classification experiment. Recall that our word vector for the sentiment classifier generated some 2012 features based on unigram terms found in the source documents, after removal of stop words and stemming. Now, we can apply a scheme for filtering out uncorrelated features before we train the classifier algorithm.

Step 1 - Attribute Weighting and Selection
RapidMiner comes with a wealth of methods for performing feature selection. We extend the sentiment classification example by applying a weighting scheme to the attributes of the feature vector. Then, the top K highest weighted attributes (which we hope are the ones most correlated with the labels) are chosen for training a classifier algorithm.

We introduce a pre-processing step to the training algorithm by introducing 2 operators:
  • InfoGainRatioWeighting - Calculates numeric weights for each attribute based on information gain in relation to the positive/negative labels.
  • AttributeWeightSelection - Filters attributes based on their associated numeric weights. This operator will take as input the example set containing data from our feature vector, and the result of applying the previous InfoGain weights to the example set. There are several criteria to choose from, and we will use the "top k" most relevant attributes.
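The weighting and selection steps can be sketched outside RapidMiner as follows. This is a minimal illustration that assumes binary (present/absent) features and is not a reproduction of the two operators' internals; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def info_gain_ratio(feature, labels):
    """Information gain of the labels given the feature, divided by the feature's entropy."""
    split_info = entropy(feature)
    if split_info == 0:  # a constant feature carries no information
        return 0.0
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += len(subset) / len(labels) * entropy(subset)
    return (entropy(labels) - conditional) / split_info


def top_k_features(columns, labels, k):
    """Return the names of the k highest-weighted features."""
    weights = {name: info_gain_ratio(col, labels) for name, col in columns.items()}
    return sorted(weights, key=weights.get, reverse=True)[:k]
```

Here `columns` maps each feature name to its column of values across all documents, and `labels` holds the positive/negative label of each document in the same order.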


Fixing Random Seed
By default, RapidMiner will use a dynamic seed whenever randomization is needed, for instance, when sampling the data set for cross-validation. To make sure our experimental results are repeatable on every run, we can fix our random seed by assigning it a specific value. This should be done on the "root" and "cross validation" operators.
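The same idea applies in any tool that samples data at random. As a small Python sketch (the function name and seed value are arbitrary choices for illustration), a private generator created from a fixed seed produces the same shuffle on every run:

```python
import random


def shuffled_indices(n, seed=None):
    """Return a shuffled list of 0..n-1; a fixed seed gives the same order on every run."""
    rng = random.Random(seed)  # private generator, independent of global random state
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


# Two calls with the same seed produce identical orderings.
first = shuffled_indices(10, seed=2008)
second = shuffled_indices(10, seed=2008)
```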

Feature Selection by Feature Weights
Right now the project is ready to run. Let's see how it fares by keeping, say, only the top 100 features according to the weighting scheme and applying those features to train the same classifier algorithm as before. Right click on the InfoGainRatioWeighting operator and add a "BreakPoint After" stop. When running the experiment, we can see the state of the execution process right after this step has run. At that stage, the attribute weights have been created. We can have a look at which ones were given the highest weights, giving an indication of the features most correlated with the positive or negative sentiment label:


In this weighting scheme we notice some familiar terms we would expect to correlate with a good or bad film review. Terms such as "lame", "poorly" and "terrific" score highly. Also, we notice some more unexpected predictors, such as the term "portray", which appears to be relevant to classification on the domain of films.


Once the experiment is complete, we see that in this data set, reducing the total number of features from 2012 to just 100 yielded an average accuracy of 81%. This is worse than having the experiment run with all the features (84.05% in our previous experiment), suggesting that pruning the data set to only 100 features might be too severe and could be leaving out many terms that are good predictors. The question then is: is removing potentially uncorrelated features of any benefit to sentiment classification in this experiment?

Step 2 - Parameterization
We'll apply the GridParameterOptimization operator to test from a list of potential parameter combinations, based on accuracy criteria. The operator is added to the project just after the data set read step (ExampleSource), and the remainder of the operators are included as part of its subtree.



From this point on, the parameters affecting the behavior of the operators can be added to the parameter search scheme. Determining which combination works best is based upon results obtained from the "Main Criterion" in the Performance Evaluator operator. In our case, the criterion is accuracy. The operator is configured by selecting the attributes we wish to test, and the values each attribute will take. In our example the experiment compares the results for selecting the k topmost relevant features according to seven different values of k:


Results
In our experiment, the average classification accuracy improved to 85.80% when using the k = 800 topmost weighted features. This is better than our original baseline of 84.05%, and has the added benefit of using fewer features, therefore reducing the footprint necessary to train and run the algorithm. Not bad for a day's work, considering the tool did most of the work :-).

Further improvements could naturally be obtained by testing k at more granular increments, or including other factors such as support vector machine parameters. It is important however to bear in mind that adding more testing instances will result in a much larger search space, thus increasing the time needed to tune the experiment. For instance, searching for 50 values of k on the feature selection approach, combined with 10 possible values for tuning parameters on the classifier would result in 50 x 10 = 500 iterations. The numbers can add up quickly.
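To make the arithmetic concrete, here is a short Python sketch that enumerates such a grid. The specific values of k and of the classifier parameter are hypothetical; only their counts (50 and 10) come from the example above.

```python
from itertools import product

k_values = range(40, 2001, 40)                  # 50 hypothetical values of k
svm_c_values = [2 ** e for e in range(-5, 5)]   # 10 hypothetical values of a classifier parameter

# Every (k, C) pair is one full training and evaluation run.
grid = list(product(k_values, svm_c_values))
print(len(grid))  # 500
```

Adding a third parameter with, say, 5 candidate values would multiply the total again, to 2500 runs.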

Other Approaches
The GridParameterOptimization operator is quite straightforward: just iterate over a list of parameter combinations and retrieve the one that optimizes a particular performance criterion, in our case accuracy. Finding the best combination of parameters relates to a more general problem of search and optimization, and many more sophisticated strategies have been proposed in the literature, some of which are also present in RapidMiner, such as the QuadraticParameterOptimization operator, and the EvolutionaryParameterOptimization operator, which implements a genetic algorithm for searching parameter combinations.
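The exhaustive strategy is simple enough to sketch in a few lines of Python. This is an illustration of the general idea rather than RapidMiner's implementation; `toy_score` is a hypothetical stand-in for a full cross-validation run, and its values are not experimental results.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical scoring function standing in for "train, cross-validate, return accuracy".
def toy_score(k, c):
    return -abs(k - 300) - abs(c - 1.0)


best_params, best_score = grid_search(
    {"k": [100, 300, 500], "c": [0.1, 1.0, 10.0]}, toy_score
)
```

The more sophisticated strategies differ mainly in how they choose which combinations to evaluate, instead of visiting all of them.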

Further Reading
For those interested in reading more on some of the subjects briefly touched upon in this tutorial, there is a good discussion of parameter search in the context of data mining in the book Principles of Data Mining by Hand, Mannila and Smyth.

Feature selection applied to text mining has been investigated by a number of authors, and a good overview of the topic can be found in the work of Sebastiani, 2002 (retrievable here).

Finally, an approach that applies feature selection techniques to the problem of sentiment classification can be seen in the work of Abbasi et al, 2008.

Friday, July 11, 2008

Digital Memories - A Google TechTalk



In this video, Steve Whittaker from Sheffield University talks about recent research in the area of Digital Memories - storing information about personal events spanning an entire lifetime in digital format, with obvious opportunities for mining all kinds of interesting patterns. It can also open up a particular branch of mining applications specifically geared at providing personalized, easy to interpret results to end users.

This video has a very interesting slide on research results showing how we tend to apply counterproductive strategies for dealing with information overload (an idea I had already heard about on Merlin Mann's techtalk on Inbox Zero) - the more folders for categorizing stuff you have, the more folders with little use (2 or fewer items) are created. Likewise, the time spent in figuring out bookmark categories is not time well spent since over 40% of them are never used. It appears a better strategy is to just archive off this information and retrieve it later by searching.

Monday, June 23, 2008

Opinion Mining with RapidMiner - A Quick Experiment

UPDATE: I noticed some images were lost from my original post from 2008 (which I no longer have). While the contents below are still valid, they are a bit old now. For a more up-to-date description of sentiment classification with RapidMiner, a similar experiment is detailed in my RCOMM presentation, and the full paper is available here.


In this post I'll use the polarity data set from Bo Pang / Lillian Lee to perform a text classification experiment in RapidMiner.

RapidMiner (formerly Yale) is an open source data mining and knowledge discovery tool written in Java, incorporating most well known mining algorithms for classification, clustering and regression; it also contains plugins for specialized tasks such as text mining and analysis of streamed data. RapidMiner is a GUI based tool, but mining tasks can also be scripted for batch mode processing. In addition to its numerous operators, RapidMiner also includes the data mining library from the WEKA Toolkit.

The polarity data set is a set of film reviews from IMDB, which were labelled based on author feedback: positive or negative. There are 1000 labelled documents for each class, and the data is presented in plain text format. This data set has been employed to analyse the performance of opinion mining techniques. This data set can be downloaded from here.

RapidMiner Setup
Get RapidMiner here, and don't forget the plugin for text mining. The text mining plugin contains tasks specially designed to assist in the preparation of text documents for mining tasks, such as tokenization, stop word removal and stemming. RapidMiner plugins are Java libraries that need to be added to the lib\plugins subdirectory under the installation location.

A word on the JRE

RapidMiner ships with a pre-configured script for loading its command line and GUI versions in the JVM. It is worth spending a few moments checking the JRE startup parameters, as larger data sets are likely to hit a memory allocation ceiling. Configuring the JRE for server-side execution (Java Hotspot) is likely to help as well. On the script used for starting up RapidMiner (e.g. RapidminerGUI or RapidMinerGUI.bat under the scripts subdirectory):
- Configure the MAX_JAVA_MEMORY variable to the amount of memory allocated to the JVM. The example below sets it to 1 GB:
MAX_JAVA_MEMORY=1024

- Add the "-server" flag to the JVM startup line on the startup script being used.

Step 1: From Text to Word Vector
Here we'll create a word vector data set based on a set of documents. The word vector set can then be reused and applied to different classifiers.

The TextInput operator receives a set of tokenized documents, generates a word vector and passes it on to the ExampleSetWriter operator for outputting to a file. This example was based on one of the samples from the RapidMiner Text Plugin.

To add labelled sets on the TextInput operator, simply select the subdirectories where the labelled data is stored using the texts parameter. Here we add the pos and neg labels, mapping them to the respective directories where the documents were created.
TextInput will also create a special field in the word vector output file that identifies each vector with its original document. This is controlled by the id_attribute_type parameter: a long or short text description based on the document file name, or a unique sequential ID.

Operator Choices
We would like to experiment with different types of word vectors, and assess their impact on the classification task. The nested operators under TextInput and their setup are briefly described here. We follow the execution sequence of the operators:

PorterStemmer - Executes the English Porter stemming algorithm on the document set. Stemming is a technique that reduces words to their common root, or stem. This operator takes no parameters.

TokenLengthFilter - Removes tokens based on string length. We use a minimum string length of 2 characters. This is our preference, as a higher length filter could remove important sentiment information such as "ok" or "no".

StopWordFilterFile - Removes stop words based on a list given in a file. RapidMiner also implements an EnglishStopWord operator, however we would like to preserve some potentially useful sentiment information such as "ok" and "not", and thus used a scaled down version based on this stop word list.

StringTokenizer - Final step before building the word vector, receives modified text documents from previous steps and builds a series of term tokens.
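For readers who want to try similar preprocessing outside RapidMiner, here is a minimal Python sketch of tokenization, length filtering and stop word removal. The stop word list is a hypothetical scaled-down example, the tokenizer is a simple regular expression, and stemming is left out to keep the sketch self-contained; in Python it could be added with a library such as NLTK, which provides a Porter stemmer.

```python
import re

# Hypothetical scaled-down stop word list; "ok", "no" and "not" are deliberately absent.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "was", "it", "this"}
MIN_TOKEN_LENGTH = 2


def preprocess(text, stop_words=STOP_WORDS, min_length=MIN_TOKEN_LENGTH):
    """Lowercase and tokenize text, then drop short tokens and stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if len(t) >= min_length]
    return [t for t in tokens if t not in stop_words]


tokens = preprocess("The plot was ok, not a bad film")
print(tokens)  # ['plot', 'ok', 'not', 'bad', 'film']
```

Note that "ok" and "not" survive both filters, which is the behavior the operator setup above is aiming for.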

There is clearly an argument for getting rid of stemming and word filtering altogether and performing the experiment using each potential word as a feature. The final word vector however would be far larger and the process more time consuming (one test run without stemming and with word length greater than 3 generated over 25K features). On the basis that we'd like to perform a quick experiment to demonstrate the features of the plugin, for now we'll keep the filtering in.

It is also worth mentioning the n-gram tokenizer operator, not used in this test, which generates a list of n-grams based on words occurring in the text. A 2-gram - or bigram - tokenizer generates all two-word sequences found in the text. n-grams have the potential of retaining more information regarding opinion polarity - e.g. the words "not nice" become the "not_nice" bigram, which can then be treated as a feature by the classifier. This however comes at the risk of classifier overfitting, since it would require a far larger volume of examples to train on all possible relevant n-grams for larger values of n, not to mention the hit in execution time due to a much larger feature space. We thus leave it out for this experiment.
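A small Python sketch of n-gram generation, independent of the RapidMiner operator and using a hypothetical function name, shows how the feature is formed:

```python
def ngrams(tokens, n=2):
    """Return the n-grams of a token list, each joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(["not", "nice", "at", "all"]))  # ['not_nice', 'nice_at', 'at_all']
```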

Word Vectors
The TextInput operator is capable of generating several types of word vectors. We create 3 different examples for our test:
  • Binary Occurrence: Term receives 1 if present in document, 0 otherwise.
  • Term Frequency: Value is based on the normalized number of occurrences of term in document.
  • TFIDF: Calculated based on word frequency in document and in the entire corpus.
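The three weighting schemes can be sketched in Python as below. This assumes term frequency is normalized by document length and uses the common tf x log(N / df) form of TF-IDF; the exact formulas used by TextInput may differ, and every document is assumed to be non-empty.

```python
import math
from collections import Counter


def word_vectors(docs):
    """Build binary, term-frequency and TF-IDF vectors for tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    binary, tf, tfidf = [], [], []
    for doc in docs:
        counts = Counter(doc)
        binary.append([1 if counts[t] else 0 for t in vocab])
        tf_row = [counts[t] / len(doc) for t in vocab]
        tf.append(tf_row)
        tfidf.append([f * math.log(n_docs / doc_freq[t]) for f, t in zip(tf_row, vocab)])
    return vocab, binary, tf, tfidf


vocab, binary, tf, tfidf = word_vectors([["good", "good", "film"], ["bad", "film"]])
```

In this toy corpus the term "film" appears in every document, so its TF-IDF weight is zero, while "good" keeps a positive weight in the first document.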

In the TextInput operator we also perform some term pruning, by removing the least and most frequent terms in the document set. We set our thresholds at terms appearing in at least 50 documents, and at most 1970 documents, out of a corpus of 2000 documents.
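Pruning by document frequency can be sketched in Python as follows; the function name is hypothetical, and the thresholds are passed in as absolute document counts, as in the setup above.

```python
from collections import Counter


def prune_terms(docs, min_docs, max_docs):
    """Keep only terms whose document frequency lies in [min_docs, max_docs]."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, df in doc_freq.items() if min_docs <= df <= max_docs}
    return [[t for t in doc if t in keep] for doc in docs]


toy_docs = [["film", "great"], ["film", "dull"], ["film", "great", "plot"]]
pruned = prune_terms(toy_docs, min_docs=2, max_docs=2)
```

In the toy corpus, "film" is dropped for appearing in every document and "dull" and "plot" for appearing in only one. For the experiment described here the call would be `prune_terms(docs, 50, 1970)`.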

Running the Task
Executing the task will generate 2 output files determined by the ExampleSetWriter operator:
  • Word vector set (.dat)
  • Attribute description file (.aml)
The final word vector contains 2012 features, plus 2 special attributes recording the label and document name.


Step 2: Training and Cross-Validation
We will employ the Support Vector Machines learner to train a model based on samples from the word vector set we just created. We will use 3-fold cross-validation to compare the results obtained.
In our experiment, we apply a Linear SVM with the exact same configuration using the 3 types of word vectors obtained from the previous step: Binary, Term Frequency and TFIDF. All the hard work will be done by the XValidation process, which encapsulates the process of selecting folds from the data set and iterating through the classification execution steps.

The first step on our RapidMiner experiment is reading the word vector from disk. This is the task of the ExampleSource process.

Then, we start learning/running the classifier with cross-validation. We have configured ours with 3 folds, meaning the process will run 3 times, each time training on two thirds of the data set and applying the model to the remaining third. Each iteration holds out a different fold as the test set.
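A minimal Python sketch of standard k-fold splitting, in which each fold is held out once as the test set and the remaining folds are used for training, looks like this. The function name and seed are arbitrary, and this is not RapidMiner's sampling code.

```python
import random


def k_fold_splits(n_examples, k=3, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)       # fixed seed keeps the folds repeatable
    folds = [indices[i::k] for i in range(k)]  # k folds of roughly equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(2000, k=3))
```

For the 2000 documents used here, each of the three iterations trains on roughly 1333 vectors and tests on the remaining 667 or so.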

The XValidation process takes in a series of sub-processes used in its iterations. First, the learner algorithm to be used. As mentioned earlier, we are using a Linear C-SVC SVM, and at this stage not a lot of tweaking has been done on its parameters.

Then, an OperatorChain is used to actually perform the execution of the classification experiment. It links together the ModelApplier process, which applies the trained model to the input vectors, and a PerformanceEvaluator task, which calculates standard performance metrics on the classification run.
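The two metrics reported later in this post, accuracy and AUC, can be computed by hand as in the Python sketch below. The function names are hypothetical; AUC is calculated here through its pairwise-ranking interpretation, which is practical only for small examples.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores, positive="pos"):
    """Area under the ROC curve: the probability that a randomly chosen positive
    example is scored higher than a randomly chosen negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y = ["pos", "pos", "neg", "neg"]
print(accuracy(y, ["pos", "neg", "neg", "neg"]))  # 0.75
print(auc(y, [0.9, 0.4, 0.6, 0.1]))               # 0.75
```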

That's it. We then run the experiment, only changing the input vector each time, and compare the results.

Results

The classification process took around 30 minutes on my home PC (Windows XP / Intel Celeron) for each run with a different data set. The results are summarized in the table below; the Binary word vector was the best configuration:

Data Set          Average Accuracy   Avg. AUC
Binary            84.05%             0.920
Term Frequency    82.7%              0.907
TFIDF             82.4%              0.914

And here is the ROC curve for the best result:

Saturday, May 10, 2008

Experiment Databases

A repository with results on data mining experiments


The Experiment Databases tool makes empirical results from AI data mining experiments more accessible and reusable. The site hosts a SQL query engine that retrieves results from mining experiments recorded in its repository. These can then be easily studied and compared to other similar experiments - something that could otherwise take long hours of sifting through results from articles and research reports.

At a high level, the relational data model consists of an experiment, a learner, a data set, a machine where the test is run, and results on a model and evaluation method. There is also scope for storing parameters of each learner, and ensemble experiments.

Queries are written in standard SQL over the documented data model, and useful syntax like desc "table" is also available.
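To give a feel for this style of query, here is a self-contained Python sketch using an in-memory SQLite database. The schema, table names, column names and values are all invented for illustration and do not reflect the actual Experiment Databases data model or its contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified schema: one row per experiment,
# linked to the learner and data set it used.
conn.executescript("""
    CREATE TABLE learner (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE experiment (
        id INTEGER PRIMARY KEY,
        learner_id INTEGER REFERENCES learner(id),
        dataset_id INTEGER REFERENCES dataset(id),
        accuracy REAL
    );
    INSERT INTO learner VALUES (1, 'learner_a'), (2, 'learner_b');
    INSERT INTO dataset VALUES (1, 'dataset_x');
    INSERT INTO experiment VALUES (1, 1, 1, 0.5), (2, 2, 1, 0.75);
""")

# Compare learners on the same data set, best result first.
rows = conn.execute("""
    SELECT l.name, d.name, e.accuracy
    FROM experiment e
    JOIN learner l ON l.id = e.learner_id
    JOIN dataset d ON d.id = e.dataset_id
    ORDER BY e.accuracy DESC
""").fetchall()
```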

Check out some examples of queries here.

Quick Survey on Text Data Sets

Here is a list of text data sets available on the web, with some comments on their content.

1. UCI's Machine Learning Repository - A huge list of data sets on a variety of topics and formats in a searchable interface. There are 8 textual data sets available, including the popular Reuters 21578 and 20 NewsGroups data sets for text classification.

2. TechTC Data Set is a repository of text documents to be used in classification problems. It is the end result of the framework proposed by (Davidov et al, 2004) and is composed of crawled documents from the Open Directory project. The data set comes in a convenient text format, stripped of HTML tags, or as a feature vector which can be more readily processed by classifiers. There are 300 labelled sets subdivided into positive and negative categories.

3. Sentiment Analysis. The polarity data set was made available by the Cornell NLP Department, and contains lots of resources for performing sentiment analysis on text data. The data is a set of movie reviews extracted from IMDB and labelled according to the author's opinion.

4. The Splog Data Set at the eBiquity Group - A data set comprised of labelled blog websites for performing detection of spam blogs - or splogs: blogs with machine generated content aimed at raising pagerank status or posting ads.

A full list of the data sets I am keeping track of is available on this delicious bookmark:
http://del.icio.us/bohana/dataset