Source construct

The source construct contains all the information about the data set used to train the AI model. This information is segmented into four blocks, as shown in this example:

source:
    bind = "/DM-Dash/sentiment/data.csv" ;
    input:
        x ~ from "Review"
            -> text: [100] -> TextDataGenerator: [nb_words = 20000] ;
    output:
        y ~ from "Label" -> flat: [1] -> FlatDataGenerator: [] ;
    params:
        batch_size = 64,
        shuffle_init = True;

Blocks

There are four necessary blocks in the source construct.

  • bind: This specifies the full path to the .csv file that contains the training data. The first line of the .csv file should contain column labels.
bind = "/DM-Dash/path/to/my/data.csv" ;
  • input: This specifies what the input data is and assigns a variable name that will be used in the architecture construct to define a neuralflow. The syntax for this declaration is:
input: variable ~ from "column_label" -> data_type: [description] -> DataGenerator: [options] ;

Valid data_type and DataGenerator values and options are described in detail below.

An example of text input:

input:
    x ~ from "Review"
        -> text: [100]
        -> TextDataGenerator: [nb_words = 20000] ;

NOTE: NML is agnostic to white space. The same declaration can appear on one line:

input: x ~ from "Review" -> text: [100] -> TextDataGenerator: [nb_words = 20000] ;
  • output: This is identical in syntax to the input declaration, but corresponds to the output data.

An example of vector output:

output:
    y ~ from "Label"
        -> flat: [2]
        -> FlatDataGenerator: [] ;
  • params: This is a comma-separated list of parameters. Available parameters, with their default values:
    • batch_size = 64 - The mini-batch size.
    • validation_split = 0.1 - Fraction of the data to withhold from training and use for validation.
    • shuffle = True - Shuffle the data after each training epoch.
    • shuffle_init = False - Shuffle the data before the first training epoch.
    • number_validation = None - If set, the number of samples to withhold for validation; overrides the validation_split parameter.

Example:

params: shuffle = False, batch_size = 32 ;
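For instance, assuming number_validation specifies a sample count, a params block that withholds exactly 500 samples for validation (an illustrative value, not a default) might read:

params:
    batch_size = 32,
    number_validation = 500,
    shuffle = True,
    shuffle_init = True ;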

Data

AI Studio provides a simple interface for interacting with the five most common data types: flat vectors, text, images, audio, and video. AI Studio uses DataGenerators to stream data from disk rather than holding everything in memory. These DataGenerators can also perform limited pre-processing for each data type.

Vectors

With enough pre-processing, everything can be represented as a single vector of numbers. Consider this input block:

input: x ~ from "Data" -> flat: [10] -> FlatDataGenerator: [] ;

The column "Data" contains 10 numbers separated by "|".

The FlatDataGenerator can also automatically fill classification vectors from integer values.

Consider the following output block:

output: y ~ from "Class" -> flat: [10] -> FlatDataGenerator: [] ;

The column "Class" contains an integer between 0 and 9.

In this situation, the FlatDataGenerator automatically translates the classification integer into a one-hot vector: a vector of zeroes with a one in the position given by the integer. For example, a "Class" value of 3 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].

NOTE: The FlatDataGenerator takes no pre-processing arguments at this time.

Text

Consider this input block:

input:
    x ~ from "Review"
        -> text: [100]
        -> TextDataGenerator: [nb_words = 20000] ;

The column "Review" of the .csv file should contain the text.

Text data types need to declare the number of words in each piece of data used for training. This declaration tells the compiler to expect a vector of 100 words.

The TextDataGenerator takes care of pre-processing the raw text into that vector by filtering, tokenizing, padding, or truncating. It has one mandatory argument, nb_words, which sets the size of the dictionary. Additional arguments can be added in a comma-separated list:

  • Arguments, with their default values:
    • filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' - A single-quoted (') list of characters to filter out.
    • lower = True - Ignore capitalization.
    • split = " " - Word separator.
    • padding = 'pre' or 'post' - Where to pad if there aren't enough words in the data.
    • truncating = 'pre' or 'post' - Where to truncate if there are too many words in the data.
    • value = 0.0 - The value used for padding.
    • char_level = False - If True, encode at the character level rather than the word level.
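Putting several of these options together, an input block that disables lowercasing and pads and truncates at the end of each review might look like the following sketch (the values here are illustrative, not defaults):

input:
    x ~ from "Review"
        -> text: [200]
        -> TextDataGenerator: [nb_words = 10000,
                               lower = False,
                               padding = 'post',
                               truncating = 'post'] ;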

Images

Consider this input block:

input:
    x ~ from "Image"
        -> image: [shape=[32,32],channels=3]
        -> ImageDataGenerator: [rescale=0.00392156862] ;

The column "Image" should contain the path to an image file.

Image data types need to declare their shape = [height, width] and number of channels (1 or 3).

The ImageDataGenerator takes care of loading the image and downsampling it to the specified shape. It also makes available a number of pre-processing options for real-time data augmentation.

  • Arguments, with their default values.
    • featurewise_center = False - Set input mean to 0 over the dataset, feature-wise.
    • samplewise_center = False - Set each sample mean to 0.
    • featurewise_std_normalization = False - Divide inputs by std of the dataset, feature-wise.
    • samplewise_std_normalization = False - Divide each input by its standard deviation.
    • zca_whitening = False - Apply ZCA whitening.
    • rotation_range = 0 [integer] - Degree range for random rotations.
    • width_shift_range = 0.0 [float] (fraction of total width) - Range for random horizontal shifts.
    • height_shift_range = 0.0 [float] (fraction of total height) - Range for random vertical shifts.
    • shear_range = 0.0 [float] - Shear intensity (shear angle in counter-clockwise direction, in radians).
    • zoom_range = 0.0 [float] or [lower, upper]. Range for random zoom. If a float, [lower, upper] = [1-zoom_range, 1+zoom_range].
    • channel_shift_range = 0.0 [float] - Range for random channel shifts.
    • fill_mode = 'nearest' - One of {"constant", "nearest", "reflect" or "wrap"}. Points outside the boundaries of the input are filled according to the given mode.
    • cval = 0.0 [float or integer]. Value used for points outside the boundaries when fill_mode = "constant".
    • horizontal_flip = False - Randomly flip inputs horizontally.
    • vertical_flip = False - Randomly flip inputs vertically.
    • rescale = None - Multiply the data by the value provided (before applying any other transformation).
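As a sketch, an image input that rescales pixel values to [0, 1] and applies modest random augmentation could be written as follows (the augmentation values are illustrative choices, not defaults):

input:
    x ~ from "Image"
        -> image: [shape=[64,64], channels=3]
        -> ImageDataGenerator: [rescale = 0.00392156862,
                                rotation_range = 15,
                                width_shift_range = 0.1,
                                height_shift_range = 0.1,
                                horizontal_flip = True] ;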

Audio

As of version 1.2.0, NeoPulse™ AI Studio can take audio files as input. Consider the following input block:

input:
    x ~ from "Audio"
        -> audio: [maxlen = 1366, nbands = 96]
        -> AudioDataGenerator: [] ;

The column "Audio" should contain the path to an audio file (.mp3, .ogg, .wav, .au, .aiff, .flac).

Audio data types need to specify two parameters: maxlen and nbands, corresponding to the number of timesteps and the number of Fourier bands to include in the spectrogram, respectively.

The AudioDataGenerator takes care of preprocessing and extracting features from audio files.

  • Arguments, with their default values.
    • feature = 'mfcc' or 'melspectrogram' or 'spectrogram'. The audio feature map to calculate.
    • frame_length = None - The length of the window, in seconds, over which to perform the Fast Fourier Transform; used to calculate the number of timesteps in each FFT window. The default value None sets the number of timesteps to 2048.
    • frame_stride = None - The stride between successive FFT windows, in seconds; used to calculate the number of timesteps to stride between FFTs. The default value None sets the stride to 512 timesteps.
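For example, an input block that selects the mel-spectrogram feature with explicit window and stride times might look like the following sketch (the frame values are illustrative, not defaults):

input:
    x ~ from "Audio"
        -> audio: [maxlen = 1366, nbands = 96]
        -> AudioDataGenerator: [feature = 'melspectrogram',
                                frame_length = 0.025,
                                frame_stride = 0.010] ;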

Video

As of version 1.2.0, NeoPulse™ AI Studio can take video files as input. Any file format supported by the libavcodec library can be used by AI Studio.

Consider the following input block:

input:
    x ~ from "Video"
        -> video: [shape=[80, 80], channels=3, seqlength=32]
        -> ImageDataGenerator: [rescale = 0.003921568627451] ;

The column "Video" in the .csv file should contain the path to a video file.

Every video will be split into frames. When specifying an input of type video, you must declare its shape = [height, width], number of color channels (1 or 3), and seqlength, the number of frames the video should be split into.

NOTE: Be careful with your choice of seqlength, as every video in the training set will be decomposed into that many frames. This can take up a large amount of space on the hard disk.

After decomposing the video into frames, the ImageDataGenerator is used to pre-process and downsample the images as described above.
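Putting the pieces together, a complete source construct for a video classification task might read as follows. This is an illustrative sketch: the file path, column labels, and parameter values are hypothetical, not defaults.

source:
    bind = "/DM-Dash/video/data.csv" ;
    input:
        x ~ from "Video"
            -> video: [shape=[80, 80], channels=3, seqlength=32]
            -> ImageDataGenerator: [rescale = 0.003921568627451] ;
    output:
        y ~ from "Label"
            -> flat: [2]
            -> FlatDataGenerator: [] ;
    params:
        batch_size = 16 ;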