Tutorial: Sentiment Analysis
This tutorial assumes that you have a working installation of NeoPulse® AI Studio. If you do not have AI Studio, please go to our Developer Portal and get one!
Sentiment Analysis with the IMDB dataset.
One of the standard classification tasks is the IMDB sentiment analysis task. The data set for this task is described in detail here. This step-by-step tutorial shows you how to use NeoPulse® AI Studio to build an accurate AI model that predicts the sentiment of movie reviews.
All of the files described in this tutorial are available in the following directory on your AI Studio instance:
1. Prepare the data
The data preparation for this tutorial has already been done. The test and training data have been combined in the file /DM-Dash/examples/sentiment/data.csv, which consists of 12,500 negative and 12,500 positive movie reviews. The two column headers are "Label" and "Review", and the labels are either 0 or 1, corresponding to negative or positive reviews, respectively.
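Before training, it can be worth confirming that a file has the layout described above. The following is a minimal Python sketch and is not part of NeoPulse® AI Studio; the helper name count_labels and the three sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file):
    """Count how many rows carry each value in the "Label" column."""
    reader = csv.DictReader(csv_file)
    missing = {"Label", "Review"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {sorted(missing)}")
    return Counter(row["Label"] for row in reader)


# Tiny made-up sample with the same two column headers as data.csv.
sample = io.StringIO(
    "Label,Review\n"
    '0,"Dull plot, wooden acting."\n'
    "1,A joy from start to finish.\n"
    "1,Sharp script and a great cast.\n"
)
counts = count_labels(sample)
print(counts)
```

To check the real file, pass open("/DM-Dash/examples/sentiment/data.csv", newline="") in place of the in-memory sample. Note that the csv module reads labels as strings, so the keys are "0" and "1".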
There is an additional file, /DM-Dash/examples/sentiment/query_data.csv, with a single column labeled "Review" that contains five unlabeled reviews for the query example.
2. Write an NML script.
Now let's write an NML script to train a model using this data. We'll begin with the simplest possible NML script, and then move on to some more advanced ones:
oracle("mode") = "classification"

source:
    bind = "/DM-Dash/examples/sentiment/data.csv" ;
    input: x ~ from "Review" -> text:  -> TextDataGenerator: [nb_words = 20000] ;
    output: y ~ from "Label" -> flat:  -> FlatDataGenerator:  ;
    params: validation_split = 0.5 ;

architecture:
    input: x ~ text:  ;
    output: y ~ flat:  ;
    x -> auto -> y ;

train:
    compile: optimizer = auto, loss = auto, metrics = ['accuracy'] ;
    run: epochs = 4 ;
    dashboard: ;
This is a file that a non-expert might write. Let's examine each section in detail.
AI oracle hints
At the top of the script, we begin by giving the AI oracle an indirect hint about the problem. Since this is a classification problem we use:
oracle("mode") = "classification"
In the source construct, we tell the compiler where to find the data, how to access it from disk, how we want it processed, and how to split it into training and validation sets. The construct has four blocks (bind, input, output, and params), each terminated by a semicolon.
In the bind block, we tell the compiler where to find the data:
bind = "/DM-Dash/examples/sentiment/data.csv" ;
In the input block, we tell the compiler what the input to the model is, and how to access it from disk. Here, we define a variable x that comes from the "Review" column of the data file. Using a neuralflow indicator (->), we connect that variable to its data type and define the shape of the data. In this case, we pad or truncate the text to two hundred words. Then, using a second neuralflow indicator, we connect that data type to a TextDataGenerator, which streams and pre-processes the data into the correct shape.
input: x ~ from "Review" -> text:  -> TextDataGenerator: [nb_words = 20000] ;
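The idea of padding or truncating every review to a fixed length can be sketched in plain Python. This is only an illustration of the concept, not how TextDataGenerator is implemented; the function name pad_or_truncate, the "<pad>" token, and padding at the end rather than the start are all assumptions.

```python
def pad_or_truncate(words, length=200, pad_token="<pad>"):
    """Return exactly `length` items: cut long inputs, pad short ones at the end."""
    clipped = words[:length]
    return clipped + [pad_token] * (length - len(clipped))


short_review = "a joy from start to finish".split()
long_review = ["word"] * 350  # stand-in for a review longer than 200 words

fixed_short = pad_or_truncate(short_review)
fixed_long = pad_or_truncate(long_review)
print(len(fixed_short), len(fixed_long))
```

Giving every example the same length is what lets the reviews be batched into a single array of fixed shape.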
In the output block, we tell the compiler what we intend to predict. In this case, we take the class label from the column "Label", and the model produces two numbers: the probabilities that the review is negative and positive, respectively.
output: y ~ from "Label" -> flat:  -> FlatDataGenerator:  ;
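One common way to represent a two-class label and a two-number prediction is one-hot encoding plus a softmax. The sketch below assumes that representation purely for illustration; the tutorial does not say how FlatDataGenerator or the oracle encode the output, and the raw scores are made up.

```python
import math


def one_hot(label, num_classes=2):
    """Encode label 0 as [1.0, 0.0] and label 1 as [0.0, 1.0]."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


target = one_hot(1)               # a positive review
prediction = softmax([0.3, 2.1])  # made-up raw scores: [negative, positive]
print(target, prediction)
```

Here the second probability is the larger one, so the predicted class would be 1 (positive).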
In the params block, we set one parameter to tell the compiler that half the data is for training and half is for validation.
params: validation_split = 0.5 ;
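As a rough picture of what a validation split does, the sketch below holds out a fraction of the rows. It is an assumption-laden illustration: the helper split_rows is hypothetical, and whether NeoPulse® shuffles the rows first or takes the held-out rows from the end is not stated in this tutorial.

```python
def split_rows(rows, validation_split=0.5):
    """Hold out the last `validation_split` fraction of rows for validation."""
    n_val = int(len(rows) * validation_split)
    n_train = len(rows) - n_val
    return rows[:n_train], rows[n_train:]


rows = list(range(10))  # stand-in for ten labeled reviews
train_rows, val_rows = split_rows(rows, validation_split=0.5)
print(len(train_rows), len(val_rows))
```

With the 25,000 reviews in data.csv, a split of 0.5 leaves 12,500 rows for training and 12,500 for validation.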
In the architecture construct, we define the input, output, and neuralflow blocks of the model we want to build.
The input block defines the data type and shape that will be used as input by the model:
input: x ~ text:  ;
The output block defines the data type and shape that the model will learn to produce:
output: y ~ flat:  ;
The neuralflow block defines the model. In this case, we've used the auto keyword to allow the oracle to choose an appropriate architecture for this task.
x -> auto -> y ;
In the train construct, we tell the compiler how to train the model.
In the compile block, we use the auto keyword to tell the oracle to choose an optimizer and a loss function, and we ask for the accuracy of the model to be measured.
compile: optimizer = auto, loss = auto, metrics = ['accuracy'] ;
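For reference, classification accuracy is simply the fraction of examples whose predicted label matches the true label. A minimal sketch with made-up labels (the function name accuracy is illustrative and is not a NeoPulse® API):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("inputs must have the same length")
    if not true_labels:
        raise ValueError("inputs must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])  # three of four match
print(score)
```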
In the run block, we tell the compiler to make four passes (epochs) over the training data.
run: epochs = 4 ;
Finally, we conclude with the dashboard block:
dashboard: ;
3. Submit the model for training
Now that we've written our NML script, we're ready to train the model. The CLI makes it easy to compile and start training the model:
$ neopulse train -p sentiment_auto -f /DM-Dash/examples/sentiment/sentiment_full_auto.nml
You can then check the status of the model:
$ neopulse list sentiment_auto
And visualize the training process:
$ neopulse visualize sentiment_auto
This script as written generates four models.
4. Query the model
After the training has finished, let's check that we can query the model:
$ neopulse query sentiment_auto -f /DM-Dash/examples/sentiment/query_data.csv
Check the query status:
$ neopulse results
See the results:
$ neopulse results -q <query_ID> -S
5. Trim and export the model:
After we've tested that we can query the model, we'll trim the project to remove unnecessary models. We want to keep the model with the best accuracy on the validation data, so we trim on the val_acc metric:
$ neopulse trim sentiment_auto --metric=val_acc
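Conceptually, selecting by validation accuracy means keeping the candidate with the highest val_acc. The sketch below illustrates that selection only; the model names and accuracy values are invented, and it does not reflect how the trim command is implemented.

```python
def best_model(models, metric="val_acc"):
    """Return the entry with the highest value for `metric`."""
    return max(models, key=lambda m: m[metric])


# Hypothetical records; the names and numbers are invented for illustration.
models = [
    {"name": "model_a", "val_acc": 0.84},
    {"name": "model_b", "val_acc": 0.87},
    {"name": "model_c", "val_acc": 0.88},
    {"name": "model_d", "val_acc": 0.86},
]
best = best_model(models)
print(best["name"])
```

Selecting on validation accuracy rather than training accuracy favors the model that generalizes best to reviews it was not trained on.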
Now we can export the model for use with NeoPulse® Query Runtime:
$ neopulse export sentiment_auto --metric=val_acc
The exported model package can be found at /DM-Dash/exports/model_id.zip. This package can be moved to any machine where NeoPulse® Query Runtime is running, and loaded using the CLI:
$ neopulse import -p sentiment -f model_id.zip
6. Other scripts
Three other scripts have been included in the examples directory. They show how to give direct hints to the AI oracle using the auto keyword.