In this post I'll use the polarity data set from Bo Pang and Lillian Lee to perform a text classification experiment in RapidMiner.
RapidMiner (formerly Yale) is an open source data mining and knowledge discovery tool written in Java. It incorporates most well-known mining algorithms for classification, clustering and regression, and it also has plugins for specialized tasks such as text mining and analysis of streamed data. RapidMiner is a GUI-based tool, but mining tasks can also be scripted for batch mode processing. In addition to its wide choice of operators, RapidMiner also includes the data mining library from the
WEKA Toolkit.
The polarity data set is a set of film reviews from
IMDB, which were labelled as positive or negative based on author feedback. There are 1000 labelled documents for each class, and the data is presented in plain text format. The data set has been used to analyse the performance of opinion mining techniques, and can be downloaded from
here.
RapidMiner Setup
Get RapidMiner
here, and don't forget the
plugin for text mining. The text mining plugin contains tasks specially designed to assist in the preparation of text documents for mining tasks, such as tokenization, stop word removal and stemming. RapidMiner plugins are Java libraries that need to be added to the
lib\plugins subdirectory under the installation location.
A word on the JRE
RapidMiner ships with a pre-configured script for loading its command line and GUI versions in the JVM. It is worth spending a few moments checking the JRE startup parameters, as larger data sets are likely to hit a memory allocation ceiling. Configuring the JRE for server-side execution (Java HotSpot) is likely to help as well. In the script used for starting up RapidMiner (e.g. RapidMinerGUI or RapidMinerGUI.bat under the
scripts subdirectory):
- Set the MAX_JAVA_MEMORY variable to the amount of memory, in MB, allocated to the JVM. The example below sets it to 1 GB:
MAX_JAVA_MEMORY=1024
- Add the "-server" flag to the JVM startup line on the startup script being used.
Step 1: From Text to Word Vector
Here we'll create a word vector data set based on a set of documents. The word vector set can then be reused and applied to different classifiers.
The
TextInput operator receives a set of tokenized documents, generates a word vector and passes it on to the
ExampleSetWriter operator for outputting to a file. This example was based on one of the samples from the RapidMiner Text Plugin.
To add labelled sets to the
TextInput operator, simply select the subdirectories where the labelled data is stored using the
texts parameter. Here we add the
pos and
neg labels, mapping them to the respective directories where the documents were created.
TextInput also creates a special field in the word vector output file that identifies each vector with its original document. Its format is set by the
id_attribute_type parameter:
long or
short text description based on document file name, or a unique sequential
ID.
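To make the directory-to-label mapping concrete, here is a minimal Python sketch of the same idea outside RapidMiner. It is not the plugin's code: the `load_labelled_texts` function, the `.txt` extension and the throwaway demo directories are all assumptions made for illustration. Each example carries its file name as an identifier, much like the id attribute described above.

```python
import tempfile
from pathlib import Path

def load_labelled_texts(label_dirs):
    """Read every .txt file under each directory and tag it with that directory's label."""
    examples = []
    for label, directory in label_dirs.items():
        for path in sorted(Path(directory).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            # (document id, label, raw text)
            examples.append((path.name, label, text))
    return examples

# Demo on a throwaway directory tree standing in for the real pos/neg folders.
with tempfile.TemporaryDirectory() as root:
    for label, text in [("pos", "a great film"), ("neg", "a dull film")]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / "review_0.txt").write_text(text, encoding="utf-8")
    examples = load_labelled_texts({"pos": Path(root) / "pos", "neg": Path(root) / "neg"})

print([(name, label) for name, label, _ in examples])
```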
Operator Choices
We would like to experiment with different types of word vectors, and assess their impact on the classification task. The nested operators under TextInput and their setup are briefly described here. We follow the execution sequence of the operators:
PorterStemmer - Executes the English Porter stemming algorithm on the document set.
Stemming is a technique that reduces words to their common root, or
stem. This operator takes no parameters.
TokenLengthFilter - Removes tokens based on string length. We use a minimum string length of 2 characters. This is our preference, as a higher length filter could remove important sentiment information such as "ok" or "no".
StopWordFilterFile - Removes stop words based on a list given in a file. RapidMiner also implements an EnglishStopWord operator. However, we would like to preserve some potentially useful sentiment information such as "ok" and "not", and thus used a scaled-down version based on
this stop word list.
StringTokenizer - Final step before building the word vector. It receives the modified text documents from the previous steps and builds a series of term tokens.
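As a rough illustration of what the filtering steps do to a single review, here is a minimal Python sketch. It is not how the plugin is implemented: the `preprocess` function, the regex tokenizer and the tiny stop word set are assumptions for illustration. It tokenizes first for simplicity, and stemming is left out to keep it short.

```python
import re

# Hypothetical, scaled-down stop word list; "ok", "no" and "not" are deliberately kept.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "it", "this", "was"}

def preprocess(text, min_length=2, stop_words=STOP_WORDS):
    """Lowercase and tokenize the text, then drop short tokens and stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if len(t) >= min_length]
    return [t for t in tokens if t not in stop_words]

print(preprocess("This film was OK, but it is not a classic."))
```

With a minimum length of 2, "ok" survives the length filter, and "not" survives because it is absent from the stop word set.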
There is clearly an argument for getting rid of stemming and word filtering altogether and performing the experiment using each potential word as a feature. The final word vector, however, would be far larger and the process more time consuming (one test run without stemming and word length greater than 3 generated over 25K features). On the basis that we'd like to perform a quick experiment to demonstrate the features of the plugin, for now we'll keep the filtering in.
It is also worth mentioning the n-gram tokenizer operator, not used in this test, which generates a list of n-grams based on words occurring in the text. A 2-gram - or bigram - tokenizer generates all two-word sequences found in the text. n-grams have the potential of retaining more information regarding opinion polarity - e.g. the words "not nice" become the "not_nice" bigram, which can then be treated as a feature by the classifier. This, however, comes at the risk of classifier overfitting, since it would require a far larger volume of examples to train on all possible relevant n-grams, for larger values of
n, not to mention the hit in execution time due to a much larger feature space. We thus leave it out for this experiment.
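For reference, generating word n-grams is simple to sketch in Python. The `ngrams` helper below is a hypothetical illustration, not the RapidMiner operator; it joins adjacent tokens with an underscore, following the "not_nice" example above.

```python
def ngrams(tokens, n=2):
    """Return all contiguous n-word sequences, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "plot", "was", "not", "nice"]
print(ngrams(tokens))
```

Five tokens yield four bigrams, and the number of distinct n-grams across a corpus grows quickly with n, which is where the larger feature space comes from.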
Word Vectors
The
TextInput operator is capable of generating several types of word vectors. We create 3 different examples for our test:
- Binary Occurrence: The term receives 1 if present in the document, 0 otherwise.
- Term Frequency: Value is based on the normalized number of occurrences of the term in the document.
- TFIDF: Calculated based on the word frequency in the document and in the entire corpus.
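The three weighting schemes can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the plugin's exact formulas: term frequency is normalized here by document length, and TF-IDF uses the common tf * log(N / df) form. The `word_vectors` function is hypothetical, and RapidMiner's own normalization may differ.

```python
import math
from collections import Counter

def word_vectors(docs):
    """Build binary, TF and TF-IDF vectors for a list of tokenized documents."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in docs for t in set(doc))
    binary, tf, tfidf = [], [], []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        binary.append([1 if counts[t] else 0 for t in vocab])
        tf_row = [counts[t] / total for t in vocab]
        tf.append(tf_row)
        # Assumed TF-IDF form: tf * log(N / df).
        tfidf.append([f * math.log(n_docs / df[t]) for f, t in zip(tf_row, vocab)])
    return vocab, binary, tf, tfidf

docs = [["good", "good", "film"], ["bad", "film"]]
vocab, binary, tf, tfidf = word_vectors(docs)
print(vocab)
print(binary)
print(tfidf)
```

Note that "film" appears in every document of the toy corpus, so its TF-IDF weight is zero under this formula: a term found everywhere carries no discriminating information.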
In the
TextInput operator we also perform some term pruning, by removing the least and most frequent terms in the document set. We set our thresholds at terms appearing in at least 50 documents, and at most 1970 documents, out of a corpus of 2000 documents.
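Pruning by document frequency amounts to counting, for each term, how many documents contain it and keeping only the terms inside the thresholds. A minimal Python sketch follows; the `prune_vocabulary` function is hypothetical and its defaults simply mirror the thresholds quoted above.

```python
from collections import Counter

def prune_vocabulary(docs, min_docs=50, max_docs=1970):
    """Keep terms whose document frequency lies within [min_docs, max_docs]."""
    df = Counter(t for doc in docs for t in set(doc))
    return sorted(t for t, n in df.items() if min_docs <= n <= max_docs)

# Toy corpus with scaled-down thresholds.
docs = [["film", "good"], ["film", "bad"], ["film", "good", "rare"]]
print(prune_vocabulary(docs, min_docs=2, max_docs=2))
```

In the toy corpus "film" is dropped for being in every document, while "bad" and "rare" are dropped for appearing only once.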
Running the Task
Executing the task will generate 2 output files determined by the ExampleSetWriter operator:
- Word vector set (.dat)
- Attribute description file (.aml)
The final word vector contains
2012 features, plus 2 special attributes recording the label and document name.
Step 2: Training and Cross-Validation
We will employ the Support Vector Machines learner to train a model based on samples from the word vector set we just created. We will use the 3-fold cross-validation method to compare the results obtained.
In our experiment, we apply a Linear SVM with the exact same configuration using the 3 types of word vectors obtained from the previous step: Binary, Term Frequency and TFIDF. All the hard work will be done by the XValidation process, which encapsulates the process of selecting folds from the data set and iterating through the classification execution steps.
The first step on our RapidMiner experiment is reading the word vector from disk. This is the task of the
ExampleSource process.
Then, we start learning/running the classifier with cross-validation. We have configured ours with 3
folds, meaning the process will run 3 times. Each run trains on 2/3 of the data set and applies the model to the remaining third, with a different fold held out as the test set each time.
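The fold bookkeeping of standard k-fold cross-validation, where each fold is held out once for testing and the remaining folds are used for training, can be sketched in Python as follows. The `k_fold_splits` function is a hypothetical illustration using plain shuffled sampling; RapidMiner's own sampling options are not reproduced here.

```python
import random

def k_fold_splits(n_examples, k=3, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_splits(2000, k=3):
    print(len(train), len(test))
```

Every document ends up in the test set exactly once, so the averaged metrics cover the whole corpus.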
The
XValidation process takes in a series of sub-processes used in its iterations. The first is the learner algorithm to be used. As mentioned earlier, we are using a Linear C-SVC SVM, and at this stage not a lot of tweaking has been done on its parameters.
Then, an
OperatorChain is used to actually perform the execution of the classification experiment. It links together the
ModelApplier process - which applies the trained model to the input vectors - and a
PerformanceEvaluator task which calculates standard performance metrics on the classification run.
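The two metrics reported later in this post, accuracy and AUC, are easy to compute by hand. The Python sketch below is an illustration only, not the PerformanceEvaluator code: the `accuracy` and `auc` functions and the toy scores are assumptions. AUC is computed here as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half.

```python
def accuracy(labels, predictions):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def auc(labels, scores, positive="pos"):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = ["pos", "pos", "neg", "neg"]
scores = [0.9, 0.4, 0.6, 0.1]  # made-up classifier confidences for the "pos" class
predictions = ["pos" if s >= 0.5 else "neg" for s in scores]
print(accuracy(labels, predictions), auc(labels, scores))
```

Accuracy depends on the decision threshold, while AUC only depends on how the scores rank the documents, which is why the two can order classifiers differently.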
That's it. We then run the experiment, only changing the input vector each time, and compare the results.
Results
The classification process took around 30 minutes on my home PC (Windows XP / Intel Celeron) for each run with a different data set. The results are summarized in the table below; the best configuration is the Binary word vector:
| Data Set | Average Accuracy | Avg. AUC |
| --- | --- | --- |
| Binary | 84.05% | 0.920 |
| Term Frequency | 82.7% | 0.907 |
| TFIDF | 82.4% | 0.914 |
And here is the ROC curve for the best result: