This notebook focuses on NLP techniques combined with Keras-built neural networks. The idea is to complete an end-to-end project and get hands-on practice with the best approaches to text processing with neural networks. The tutorial gives a clear picture of how to prepare text data for a neural network with Keras and how to actually implement and run the model.

Project description: predict whether a film review is positive or negative. The dataset is a set of IMDB reviews labeled as positive/negative.

It is inspired by the Deep Learning for NLP crash course by Dr. Jason Brownlee.

Import libraries

import pandas as pd
import glob
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from numpy import asarray
from numpy import zeros
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import gc

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers.embeddings import Embedding
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dropout, Dense, MaxPooling1D
from keras.callbacks import TensorBoard

Get Timestamps

Define a function to display the time spent

import datetime

def display_time_spent():
    # Relies on a global start_time set just before the timed block
    end_time = datetime.datetime.now()
    time_spent = end_time - start_time

    # total_seconds() is used instead of .seconds so durations over a day don't wrap
    hours, remainder = divmod(int(time_spent.total_seconds()), 3600)
    minutes, seconds = divmod(remainder, 60)
    duration_formatted = '%d:%02d:%02d' % (hours, minutes, seconds)
    print('Wall time: {}'.format(duration_formatted))

Put this at the start of the block you want to time

start_time = datetime.datetime.now()

Put this at the end of the block

display_time_spent()
Wall time: 0:00:00

Read in Data

The dataset is a collection of 1000 positive and 1000 negative IMDB reviews. It can be downloaded here.

# Define a function that reads a file and returns its contents as a string
def read_file(file):
    # A context manager ensures the file is closed (the original f.close()
    # after return was unreachable)
    with open(file) as f:
        return f.read()

# Read in positive reviews
positive_files = glob.glob('nlp_keras_embedding/data/txt_sentoken/pos/cv*.txt')
positive_reviews_list = [ read_file(file) for file in positive_files ]
labels = [1]*len(positive_reviews_list)
reviews_positive_df = pd.DataFrame(data={'review': positive_reviews_list, 'label': labels})

# Read in negative reviews
negative_files = glob.glob('nlp_keras_embedding/data/txt_sentoken/neg/cv*.txt')
negative_reviews_list = [ read_file(file) for file in negative_files ]
labels = [0]*len(negative_reviews_list)
reviews_negative_df = pd.DataFrame(data={'review': negative_reviews_list, 'label': labels})

# Concatenate the dataframes into one
reviews_df = pd.concat([reviews_positive_df,reviews_negative_df], ignore_index=True)

reviews_df.head()
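Since the frame was built from 1000 positive and 1000 negative files, a quick sanity check (my addition, not in the original write-up) confirms the class balance:

# Sanity check: expect 1000 reviews per label and 2000 rows total
print(reviews_df['label'].value_counts())
print(reviews_df.shape)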

Split the dataset

labels = reviews_df['label']
dataset = reviews_df['review']
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(dataset, labels, test_size=0.1, random_state=4)
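With test_size=0.1 on 2000 reviews, the split should leave 1800 samples for training and 200 for testing; a quick check (my addition):

# Verify the 90/10 split sizes
print(X_train.shape, X_test.shape)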

Vectorize and Preprocess text

Make a bag-of-words matrix representation and fit the vectorizer on the training set. I've experimented with different vectorizers; it turns out a binary CountVectorizer shows the best results with NNs.

Notice the word preprocessing options passed to CountVectorizer:

  • the token pattern requires tokens of at least 3 characters (lowercase letters, digits, or the "-" and "_" signs)
  • English stopwords are filtered out
  • n-grams are generated from 1 to 3 tokens each
  • the minimum word occurrence is 3, so min_df=3

CountVectorizer also lowercases the text automatically; a toy example after the code below shows the effect of these settings.
# fit vectorizer
vectorizer = CountVectorizer(binary=True, min_df=3, ngram_range=(1,3), token_pattern=r'(?u)\b[a-z0-9\-_]{3,}\b', stop_words='english')
#vectorizer = CountVectorizer(binary=True, min_df=3, ngram_range=(1,3), token_pattern='(?u)\\b[a-z0-9\-\_][a-z0-9\-\_]+\\b')
#vectorizer = TfidfVectorizer(min_df=3, ngram_range=(1,3))
# tokenize and build vocab
vectorizer.fit(X_train)
# summarize
#print(vectorizer.vocabulary_)
CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=3,
        ngram_range=(1, 3), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b[a-z0-9\\-\\_]{3,}\\b',
        tokenizer=None, vocabulary=None)
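To make those options concrete, here is a toy sketch (my addition, not part of the original notebook; min_df is dropped on purpose because nothing occurs 3 times in two made-up sentences):

# Toy demo of the tokenization settings above on two invented documents
toy_docs = ['the plot was so-so but the acting is top_notch',
            'such a dull plot']
toy_vec = CountVectorizer(binary=True, ngram_range=(1, 3),
                          token_pattern=r'(?u)\b[a-z0-9\-_]{3,}\b',
                          stop_words='english')
toy_vec.fit(toy_docs)
# e.g. 'so-so' and 'top_notch' survive; stopwords and short tokens like 'a', 'is' do not
print(sorted(toy_vec.vocabulary_))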

Transform the train and test datasets into sparse matrices, and then into arrays. (We need the data in array form to pass it to the Keras NN layer.)

X_train_vec = vectorizer.transform(X_train)
X_train_arr = X_train_vec.toarray()
# summarize encoded vector
print(X_train_arr.shape)
(1800, 34402)

Test dataset:
X_test_vec = vectorizer.transform(X_test)
X_test_arr = X_test_vec.toarray()
# summarize encoded vector
print(X_test_arr.shape)
(200, 34402)

So we have quite a large array to feed into our neural network. The train dataset is 1800 rows (samples) of 34k features! Let's see further how the net learns from this sparse representation.
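A side note of mine: calling .toarray() materializes the full dense matrix, which at 1800 × 34402 int64 values is roughly half a gigabyte. A quick sketch to check, with a smaller dtype as an option if RAM is tight:

# How much memory the dense train matrix takes
print('%.1f MB' % (X_train_arr.nbytes / 1024 ** 2))
# If RAM is tight, the same 0/1 values fit in a smaller dtype:
# X_train_arr = X_train_vec.toarray().astype('float32')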

Get the size of the word space (the number of features). This length will be used as the input dimension of our NN model.

n_words = X_train_arr.shape[1]

Define a Model

Define Neural Network Architecture and compile.

A cursory exploration showed that a 2-layer architecture with 100 neurons each and 0.1 dropout regularization gives the best results.

# define NN model
model = Sequential()
model.add(Dense(100, input_shape=(n_words,), activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
# set up TensorBoard logging and compile the model
tensorBoardCallback = TensorBoard(log_dir='./logs', write_graph=True)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
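To double-check the architecture (an optional step, not in the original run), print the layer shapes and parameter counts:

# The first Dense layer alone has n_words * 100 + 100 parameters
model.summary()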

Train and evaluate

Fit the model and evaluate.

start_time = datetime.datetime.now()

# fit network
model.fit(X_train_arr, y_train, epochs=10, callbacks=[tensorBoardCallback], verbose=2)

display_time_spent()
Epoch 1/10
 - 3s - loss: 0.4847 - acc: 0.7694
Epoch 2/10
 - 3s - loss: 0.0281 - acc: 0.9956
Epoch 3/10
 - 3s - loss: 0.0023 - acc: 1.0000
Epoch 4/10
 - 3s - loss: 8.0274e-04 - acc: 1.0000
Epoch 5/10
 - 3s - loss: 4.5116e-04 - acc: 1.0000
Epoch 6/10
 - 3s - loss: 2.7223e-04 - acc: 1.0000
Epoch 7/10
 - 3s - loss: 1.6565e-04 - acc: 1.0000
Epoch 8/10
 - 3s - loss: 1.2852e-04 - acc: 1.0000
Epoch 9/10
 - 3s - loss: 8.2725e-05 - acc: 1.0000
Epoch 10/10
 - 3s - loss: 5.5967e-05 - acc: 1.0000


Wall time: 0:01:26
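Training accuracy reaches 1.0 by epoch 3, so the net clearly memorizes the training set. If you want to watch generalization during training, a variation I'd suggest (not from the original run) is to pass a validation split to fit:

# Variation: hold out 10% of the training data to monitor val_loss/val_acc per epoch
model.fit(X_train_arr, y_train, epochs=10, validation_split=0.1,
          callbacks=[tensorBoardCallback], verbose=2)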

Evaluate:

start_time = datetime.datetime.now()

# evaluate
loss, acc = model.evaluate(X_test_arr, y_test, verbose=0)
print('Test Accuracy: %f' % (acc*100))

display_time_spent()
Test Accuracy: 91.000000
Wall time: 0:00:00
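The wrap-up below lists prediction; for completeness, here is a minimal sketch (my addition) of labeling unseen reviews with the fitted vectorizer and model:

# Predict labels for new raw reviews: vectorize first, then threshold
# the sigmoid output at 0.5 (new_reviews is a hypothetical list of strings)
new_reviews = ['a wonderful, moving film', 'a dull and predictable mess']
new_arr = vectorizer.transform(new_reviews).toarray()
probs = model.predict(new_arr)
print((probs > 0.5).astype(int).ravel())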

Wrap up

OK, nice! A simple neural network with just 2 layers of 100 neurons each gives very good results predicting from a relatively large bag of words!

So far, this tutorial covers in full:

  • text preprocessing and vectorization
  • dataset preparation
  • NN model implementation with Keras
  • prediction and evaluation

Let me know your ideas, additions, and approaches to this problem. I'd be happy to hear from you and answer any questions.