UPDATE: I noticed that some images from my original 2008 post were lost (I no longer have them). While the content below is still valid, it is a bit dated now. For a more up-to-date description of sentiment classification with RapidMiner, a similar experiment is detailed in my RCOMM presentation, and the full paper is available here.
In this post I'll use the polarity data set from Bo Pang / Lillian Lee to perform a text classification experiment in RapidMiner.
RapidMiner (formerly YALE) is an open-source data mining and knowledge discovery tool written in Java, incorporating most well-known mining algorithms for classification, clustering and regression; it also contains plugins for specialized tasks such as text mining and analysis of streamed data. RapidMiner is a GUI-based tool, but mining tasks can also be scripted for batch-mode processing. In addition to its wide choice of operators, RapidMiner also includes the data mining library from the WEKA toolkit.
The polarity data set is a collection of film reviews from IMDB, each labelled as positive or negative based on the rating given by the author. There are 1000 labelled documents for each class, and the data is presented in plain text format. This data set has been widely employed to analyse the performance of opinion mining techniques, and can be downloaded from here.
RapidMiner Setup
Get RapidMiner here, and don't forget the text mining plugin. The plugin contains operators specially designed to assist in the preparation of text documents for mining tasks, such as tokenization, stop word removal and stemming. RapidMiner plugins are Java libraries that need to be added to the lib\plugins subdirectory under the installation location.
A word on the JRE
RapidMiner ships with pre-configured scripts for launching its command-line and GUI versions in the JVM. It is worth spending a few moments checking the JRE startup parameters, as larger data sets are likely to hit a memory allocation ceiling; configuring the JRE for server-side execution (Java HotSpot) is likely to help as well. In the script used for starting RapidMiner (e.g. RapidMinerGUI or RapidMinerGUI.bat under the scripts subdirectory):
- Configure the MAX_JAVA_MEMORY variable to the amount of memory allocated to the JVM. The example below sets it to 1 GB:
MAX_JAVA_MEMORY=1024
- Add the "-server" flag to the JVM startup line on the startup script being used.
Step 1: From Text to Word Vector
Here we'll create a word vector data set based on a set of documents. The word vector set can then be reused and applied to different classifiers.
The TextInput operator receives a set of tokenized documents, generates a word vector and passes it on to the ExampleSetWriter operator, which writes it to a file. This example is based on one of the samples from the RapidMiner Text Plugin.
To add labelled sets to the TextInput operator, simply select the subdirectories where the labelled data is stored using the texts parameter. Here we add the pos and neg labels, mapping them to the directories containing the respective documents.
TextInput will also create a special field in the word vector output file that links each vector back to its original document; this is controlled by the id_attribute_type parameter, which can be a long or short text description based on the document file name, or a unique sequential ID.
Operator Choices
We would like to experiment with different types of word vectors and assess their impact on the classification task. The nested operators under TextInput and their setup are briefly described here, following the execution sequence of the operators (a plain-Python sketch of the equivalent pipeline follows the list):
PorterStemmer - Executes the English Porter stemming algorithm on the document set. Stemming is a technique that reduces words to their common root, or stem. This operator takes no parameters.
TokenLengthFilter - Removes tokens based on string length. We use a minimum length of 2 characters; a higher threshold could remove important sentiment information such as "ok" or "no".
StopWordFilterFile - Removes stop words based on a list given in a file. RapidMiner also implements an EnglishStopWord operator, but since we would like to preserve potentially useful sentiment information such as "ok" and "not", we used a scaled-down version based on this stop word list.
StringTokenizer - The final step before building the word vector: receives the modified text documents from the previous steps and splits them into a series of term tokens.
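For illustration, here is a conceptually equivalent pipeline in plain Python. This is a sketch, not the RapidMiner operators themselves: NLTK's PorterStemmer stands in for the PorterStemmer operator, the stop word set is a tiny illustrative subset rather than the actual list used in the experiment, and the steps are rearranged so tokenization comes first, as a plain-string implementation requires.

import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}  # illustrative subset
stemmer = PorterStemmer()

def preprocess(text, min_length=2):
    tokens = re.findall(r"[a-z]+", text.lower())          # StringTokenizer
    tokens = [t for t in tokens if len(t) >= min_length]  # TokenLengthFilter (keeps "ok", "no")
    tokens = [t for t in tokens if t not in STOP_WORDS]   # StopWordFilterFile
    return [stemmer.stem(t) for t in tokens]              # PorterStemmer

print(preprocess("The acting was not OK, and the plot was no better."))
# ['act', 'wa', 'not', 'ok', 'plot', 'wa', 'no', 'better']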
There is clearly an argument for dropping stemming and word filtering altogether and performing the experiment using each potential word as a feature. The final word vector, however, would be far larger and the process more time-consuming (one test run without stemming and with a word length filter greater than 3 generated over 25K features). Since we'd like a quick experiment that demonstrates the features of the plugin, for now we'll keep the filtering in.
It is also worth mentioning the n-gram tokenizer operator, not used in this test, which generates a list of n-grams based on words occurring in the text. A 2-gram, or bigram, tokenizer generates all possible two-word sequences found in the text. n-grams have the potential of retaining more information regarding opinion polarity - e.g. the words "not nice" become the "not_nice" bigram, which the classifier can then treat as a feature (see the sketch below). This comes at the risk of overfitting, however: for larger values of n, a far larger volume of examples would be required to train on all relevant n-grams, not to mention the hit in execution time due to a much larger feature space. We thus leave it out of this experiment.
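To make the bigram idea concrete, a quick sketch (ngrams here is a hypothetical helper, not a RapidMiner operator):

def ngrams(tokens, n=2):
    # Join each run of n consecutive tokens into a single feature.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["not", "nice", "at", "all"]))
# ['not_nice', 'nice_at', 'at_all']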
Word Vectors
The TextInput operator is capable of generating several types of word vectors. We create 3 different versions for our test:
- Binary Occurrence: the term receives 1 if present in the document, 0 otherwise.
- Term Frequency: the value is based on the normalized number of occurrences of the term in the document.
- TFIDF: calculated based on the term's frequency in the document and in the entire corpus.
In the TextInput operator we also perform some term pruning, removing the least and most frequent terms in the document set. We set our thresholds to keep terms appearing in at least 50 and at most 1970 of the 2000 documents in the corpus.
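The sketch below shows the three weighting schemes using scikit-learn, purely as an illustration: sklearn's exact formulas differ in detail from RapidMiner's, but the idea is the same. On the full 2000-document corpus, min_df=50 and max_df=1970 would mirror the pruning thresholds above.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["a great and moving film", "a dull and boring film"]  # toy corpus

binary_vec = CountVectorizer(binary=True)           # 1 if the term occurs in the document, else 0
tf_vec = TfidfVectorizer(use_idf=False, norm="l1")  # normalized term frequency
tfidf_vec = TfidfVectorizer()                       # term frequency scaled by inverse document frequency

print(tfidf_vec.fit_transform(docs).toarray())
print(tfidf_vec.get_feature_names_out())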
Running the Task
Executing the task generates two output files, as determined by the ExampleSetWriter operator:
- Word vector set (.dat)
- Attribute description file (.aml)
The final word vector contains 2012 features, plus 2 special attributes recording the label and the document name.
Step 2: Training and Cross-Validation
We will employ the Support Vector Machines learner to train a model on the word vector set we just created, and use 3-fold cross-validation to compare the results obtained.
In our experiment, we apply a linear SVM with the exact same configuration to the 3 types of word vectors obtained from the previous step: Binary, Term Frequency and TFIDF. All the hard work is done by the XValidation operator, which encapsulates the process of selecting folds from the data set and iterating through the classification steps.
The first step in our RapidMiner experiment is reading the word vector from disk. This is the task of the ExampleSource operator.
Then we train and run the classifier with cross-validation. We have configured ours with 3 folds, meaning the process runs 3 times: in each run, two of the folds are used for training and the model is applied to the vectors in the remaining fold, so that each fold serves once as the test set.
The XValidation operator takes in a series of sub-processes used in its iterations. First, the learner algorithm to be used: as mentioned earlier, we are using a linear C-SVC SVM, and at this stage not a lot of tweaking has been done on its parameters.
Then an OperatorChain performs the actual classification run. It links together the ModelApplier operator, which applies the trained model to the input vectors, and a PerformanceEvaluator task, which calculates standard performance metrics for the classification run.
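A minimal scikit-learn equivalent of this setup, as a sketch only: LinearSVC plays the role of the linear C-SVC learner, cross_val_score bundles what XValidation, ModelApplier and PerformanceEvaluator do together, and the toy corpus and labels are made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

docs = ["a great and moving film", "simply wonderful acting",
        "great fun throughout", "a dull and boring film",
        "a boring waste of time", "not nice at all"]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

X = TfidfVectorizer().fit_transform(docs)  # the word vectors, as in Step 1

# 3-fold cross-validation: train on two folds, test on the third, three times.
scores = cross_val_score(LinearSVC(C=1.0), X, labels, cv=3, scoring="accuracy")
print(scores, scores.mean())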
That's it. We then run the experiment, changing only the input vector each time, and compare the results.
Results
The classification process took around 30 minutes on my home PC (Windows XP / Intel Celeron) for each run with a different data set. The results are summarized in the table below, with the best configuration in blue:
Comments

Hi,
first of all: thank you for this great introduction to sentiment classification with freely available software. I just downloaded RapidMiner myself and I think it is great - although I am not sure if I will ever get all of its possibilities to work. Have you thought about publishing text mining and sentiment mining processes for RapidMiner on a regular basis? I think many readers would highly appreciate such tutorials, and it would give your blog a nice practical twist. Just an idea.
Best regards,
Peter
Hi Peter,
Thanks a lot for your comments. Glad the post could be of help. Indeed, as I progress in my research I am sure to put up more RapidMiner work in tutorial format. I think it is a great tool that deserves a lot more exposure.
B.
Just want to say thank you - I came here looking for text mining tutorials for RapidMiner, and what you have written is perfect. More please!
Hi Bruno,
Your tool looks great!! I tried to do opinion mining using your tool. I have installed it and copied the Text Mining Plug-in into the lib\plugins folder, but I am not able to do opinion mining. I would be grateful if you could guide me through it.
Thanks and regards,
Kasi
Hi,
First of all, let me thank you for such wonderful help material on text mining. Indeed, I found it to be rich in detail and a very good starting point for beginners like us. Looking forward to more from your side.
However, one query I had for you: why are the pos and neg comments kept in separate txt files? Couldn't they be kept in a single file, i.e. one txt file containing a column for the rating and one for the comments?
Thanks again,
Ramkumar
In the case of keywords or bigrams, use Support Vector Machines. They are really great for high-dimensionality jobs...
Thank you for the very informative post. I am interested in learning RapidMiner, but I find the official documentation to be very poor.
I am trying to run your example in RapidMiner 5 and it seems that the names of some operators have changed. Which operator is the equivalent of ExampleSetWriter in RapidMiner 5?
Thanks,
Kostas
Here's a series of videos on RapidMiner (5):
http://www.youtube.com/user/VancouverData/
Thanks. I am looking for ways to get GA data from Google Analytics into RapidMiner. This will be very helpful. Thanks for sharing.