In this lesson, Andrew Trask, the author of Grokking Deep Learning,
will walk you through using neural networks for sentiment analysis.
In particular, you'll build a network that classifies movie reviews
as positive or negative just based on their text!
Framing The Problem
Lesson: Curate a Dataset
Let's start by curating a dataset.
Neural networks by themselves can't really do anything.
All a neural network really does is search for direct or indirect correlation between two datasets.
So in order for a neural network to learn anything, we have to present it with two meaningful datasets.
The first dataset must represent what we know.
And the second dataset must represent what we want to know,
what we want the neural net to be able to tell us.
As the network trains, it's going to search for correlation between these two data sets,
so that eventually it can take one and learn to predict the other.
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt','r') # What we know!
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r') # What we WANT to know!
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()
Here we're going to load a set of IMDB movie reviews into a list.
g = open('reviews.txt','r') # What we know!
So these are movie reviews that people uploaded to the site IMDB
g = open('labels.txt','r') # What we WANT to know!
These labels come with those reviews, as people have labeled them with one to five stars.
In this case we've bucketed them into just positive reviews, being
higher than three stars, and negative reviews, being lower.
So, we have 25,000 reviews.
In [2]: len(reviews)
Out[2]: 25000
In [3]: reviews[0]
Out[3]: 'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life such as teachers . my years in the teaching profession lead me to believe that bromwell high s satire is much closer to reality than is teachers . the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn t '
In [4]: labels[0]
Out[4]: 'POSITIVE'
Here's an example of one of those reviews, which is a positive review;
it comes with a positive label. So, this is our dataset. Actually it's two datasets.
So, we have the reviews, which are what we know and what we will know in the future,
and the labels, which are what we want to know. We're going to try to train a neural network
to take a review as input and accurately predict its label (negative or positive),
so that when we see more human-generated text in the future, in theory,
our neural net will be able to classify it.
Lesson: Develop a Predictive Theory
The first thing we want to do when we encounter a dataset like this is develop a predictive theory.
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)
labels.txt : reviews.txt
NEGATIVE : this movie is terrible but it has some good effects . ...
POSITIVE : adrian pasdar is excellent is this film . he makes a fascinating woman . ...
NEGATIVE : comment this movie is impossible . is terrible very improbable bad interpretat...
POSITIVE : excellent episode movie ala pulp fiction . days suicides . it doesnt get more...
NEGATIVE : if you haven t seen this it s terrible . it is pure trash . i saw this about ...
POSITIVE : this schiffer guy is a real genius the movie is of excellent quality and both e...
Now, a predictive theory is really about saying, okay, if I were the neural net,
and I was going to try to figure out how to look for correlation in the dataset, where would I look?
The best thing that I like to do when developing a predictive theory is just take a look at the dataset
and try to figure out if I can solve this problem myself.
And then sort of look inward and say,
okay, what am I using under the hood to understand
whether this had a positive or negative sentiment?
So already I'm starting to get a feel.
It seems pretty obvious to me that these are really polarized examples.
But what I'm going to be looking for is, okay, what in this is creating a correlation between
my reviews dataset and my labels dataset?
In its native format, each review is a list of characters, right? So when I actually load it in,
it's just a list of, I guess in this case, 26-plus different characters.
Is there correlation in its current state?
Well, I don't really think that the letter M or the letter T has much predictive power.
Right, so we have M in negative examples and we have M in positive examples. It doesn't
really help us, so I don't think that would be a good source. So the native state it's in right
now is probably not very good. Now let's consider the opposite end of the
spectrum, where we take the entire review as the unit of this dataset.
Well, it is very predictive. I mean, this review, every time we saw it, it was a negative example.
Unfortunately, we only saw it once, and I think I can expect that most reviews we
see in the future are going to be relatively original.
We're going to see some people say this movie was terrible, or this movie was great,
or really straightforward things like that. But most reviews have nuance.
They have a particular choice of words and sequence that's just not going to be
duplicated very often. So training a neural net on the entire review might not work that well in
the real world, because we just don't see the same review very often.
So, great correlation but poor generalization.
What about something in between characters and the full review?
So, I noticed that in NEGATIVE examples we see words like terrible, and improbable,
and terrible again, and trash: individual words that might have some correlation with
these POSITIVE and NEGATIVE labels, in contrast to excellent, or fascinating, or excellent quality.
So maybe it's actually just the counts of the different kinds of words that occur
in these reviews. I think that's a better theory. Certainly better than characters and
certainly better than the reviews as a whole.
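As a small illustration of the trade-off described above, here is a minimal sketch on made-up data. The two training reviews and the new review are hypothetical, not taken from reviews.txt. It compares matching on the whole review with matching on individual words.

```python
# Hypothetical toy data for illustration only
train = {
    "this movie is terrible": "NEGATIVE",
    "this movie is excellent": "POSITIVE",
}
new_review = "that film was terrible"

# Whole-review granularity: only an exact match counts, so an original
# review is never found in the training data.
seen_whole = new_review in train

# Word granularity: record which labels each word has appeared with.
words_by_label = {}
for text, label in train.items():
    for word in text.split(" "):
        words_by_label.setdefault(word, set()).add(label)

# Words of the new review that were seen with exactly one label.
informative = {
    word: words_by_label[word]
    for word in new_review.split(" ")
    if word in words_by_label and len(words_by_label[word]) == 1
}

print(seen_whole)
print(informative)
```

The whole review has never been seen before, but the word "terrible" has, and only with a NEGATIVE label. That is the kind of signal the word-count theory relies on.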
But before we just run off creating a neural net, I find that it's best to do a
quick validation, right? So, this is something that we think is true, given the theory that we have.
But before we actually go and do everything, we should see if what we have is naively predictive,
right?
Now what I typically do here is I just count. Or I formulate a count-based heuristic
to try to see, okay, does this phenomenon seem to happen more for this label than it does
for that label? Right, is it a good [INAUDIBLE]
So the first project that I would like you to tackle, and then I will show you how I
tackle it, is to just think about how you would take this dataset and validate our theory
that words are predictive of labels.
So go ahead and take a few minutes and take a crack at it, and see if you can
come up with a way of showing that it either is or is not predictive.
Mini Project 1 Solution
Project 1: Quick Theory Validation
Presumably you took a stab at validating our theory that words are predictive of labels.
So now I'm going to show you how I would attack this problem.
from collections import Counter
import numpy as np
So I've got a couple of go-to tools I always like to use. From collections, import Counter.
We're going to be counting words, and I find that the Counter object is just so fast and so
much easier than using dictionaries. And I'll show you how to use it.
So the first thing that we're going to do is count the words that
show up in positive reviews and the words that show up in negative reviews.
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()
Create an empty Counter and it acts a little bit like a dictionary.
And we'll also keep total_counts so it's cached.
So they act like a dictionary, but you don't have to actually create the keys first.
You can just start incrementing them, as if every key you use were
already there. You'll see what I mean in the code below.
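As a standalone illustration of that behavior, the short sketch below uses a made-up sentence. It only relies on the standard library's collections.Counter.

```python
from collections import Counter

counts = Counter()
for word in "good movie good cast".split(" "):
    # No KeyError here: a key that is not present yet is treated as 0
    counts[word] += 1

print(counts["good"])          # count for a word we have seen
print(counts["unseen"])        # a missing key reads as 0
print("unseen" in counts)      # reading a missing key does not add it
print(counts.most_common(1))   # most frequent (word, count) pairs first
```

With a plain dict, the same loop would need a check such as `if word not in counts` or `counts.get(word, 0)` before incrementing.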
for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1
This takes a second to run because we have 25,000 reviews.
The next thing we're going to do is just take a look at it.
positive_counts.most_common()
So the Counter gives you this nice little convenience function. I can say positive_counts.most_common(),
and for whatever words you've counted, the most frequent ones come first.
This doesn't really tell me if these are indicative of things that are positive.
These are just telling me whether they're frequent words or not.
So what we need to do is something that's called normalization.
We're not really interested in what's the most frequent positive word.
We're interested in the word that is most frequently positive versus negative, right?
negative_counts.most_common()
Because if I look at the negative counts, it's the same words, right?
So we want to come up with some sort of ratio that is more comparative
between these two lists, as opposed to just these two lists by themselves.
So to speed things up a little bit, I'm going to show you how I would calculate this ratio,
pos_neg_ratios = Counter()

for term,cnt in list(total_counts.most_common()):
    if(cnt > 100):
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio

for word,ratio in pos_neg_ratios.most_common():
    if(ratio > 1):
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log((1 / (ratio+0.01)))
which I also put into a Counter. And if we look at the positive-to-negative ratios,
words with a positive ratio look kind of like this.
# words most frequently seen in a review with a "POSITIVE" label
pos_neg_ratios.most_common()
So we're starting to see a little bit of signal. These are mostly names, and what I'm going to guess
is that, since these are movie reviews, people have some favorite actors
and they like to talk about them positively. So I guess it's probably good
if your last name is Caruso or Gino or something like that, right?
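To see what the log step in the ratio code does, here is a minimal sketch that applies the same two formulas to a few made-up counts. The function name log_ratio and the example counts are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_ratio(pos_count, neg_count):
    """Same arithmetic as the ratio code above, for one word."""
    ratio = pos_count / float(neg_count + 1)
    if ratio > 1:
        return np.log(ratio)
    return -np.log(1 / (ratio + 0.01))

# Hypothetical counts: mostly positive, balanced, mostly negative
for pos, neg in [(400, 99), (100, 99), (100, 399)]:
    print(pos, neg, round(float(log_ratio(pos, neg)), 3))
```

A word seen four times as often in positive reviews lands at roughly +1.39, a word seen four times as often in negative reviews lands at roughly -1.35, and a balanced word lands near 0. Without the log, those same words would sit at 4, 0.25 and 1, which is much harder to compare by eye.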
# words most frequently seen in a review with a "NEGATIVE" label
list(reversed(pos_neg_ratios.most_common()))[0:30]
I'm going to guess these are not very well-favored actors.
But I'm also guessing that my theory about word correlation is right.
So maybe this isn't true, but actors' names happen to show up.
If this actor's name was only mentioned once, or I guess at least ten times,
and it was only in positive reviews, it might show up here.
When we're looking for correlation, we want things that happen very frequently
and have an affinity somewhere.
For somebody that's just mentioned once, 100% of the mentions will be positive,
but that's not really indicative of being a positive feature.
So let's up this to 50 and check it out.
if(cnt > 50):
I see a bunch of names, ooh, excellently. So we see delightfully, okay.
Well, let's up this a little bit more. As you can see, I'm investigating the data.
I'm taking a look, looking for patterns, refining how I'm looking, and just
trying to get a feel for what the data is like. Wow, so now I'm really seeing stuff.
I see a few names, I see flawless, superbly, perfection, astaire, captures, wonderful.
Okay, so now I'm really seeing words that I would expect to be positive words
being positively indicative of these labels. Let's see, if I look for negative, how does that look?
So at this point, I'm feeling pretty good about the theory.
It's clear that the words that I would expect to be predictive seem to be predictive,
or at least correlated with the labels that I think they should be correlated with.
So in the next section,
we're going to be talking about how we can leverage this predictive theory
to create input and output data,
so our network can refine this correlative power into a classifier.
Transforming Text Into Numbers
from IPython.display import Image

review = "This was a horrible, terrible movie."

Image(filename='sentiment_network.png')
Now that we have validated our theory that individual words inside a review are
predictive of that review's positive/negative label, it's time to transform our datasets
into numbers in a way that respects this theory and this belief, so that our neural network can
search for correlation in this particular way. What we want to do is present
the words as input to the neural network in such a way that it can look for
correlation to make the correct positive or negative prediction on the output.
The most natural way to start here is to simply count each word and feed those counts
in as inputs to the neural network. Pretty simple. It's well defined, and I think it
should correlate with the thing I want to predict. Now, as far as predicting
positive/negative, obviously we know neural nets can't predict the word "positive".
Well, some more advanced ones can, but that's not what we're trying to do here. Instead, we're
going to represent positiveness and negativeness as a number: positive is the number 1 and
negative is 0. Now the reason that we're doing this in one neuron, and
giving it two sides that the network has to decide between, is that we
know that positive and negative are mutually exclusive. We're not going to train our network
to ever say that a review was both positive and negative, and by modeling it this way we make
these two labels mutually exclusive.
This reduces the number of ways that the neural network can make a mistake, which
reduces the amount that it has to learn and actually helps it learn this particular pattern.
Some other setups, for example, have five different output labels
for the different ratings. In the IMDB dataset, for example, a review can have up to five stars:
you can give 1 star, 2, 3 stars, up to five stars. And it turns out that sometimes this can actually
hurt and make it more difficult to predict, if the network has to predict which star rating was most likely,
because it allows it to make double predictions, say three and four, where
four is incorrect and three is correct, but they share a lot of signal, which creates ambiguity in the network.
But in this case, because we only have two labels, we can force the network to choose
between the two of them, reducing the number of ways that it can make a mistake. One of the themes
throughout this tutorial is going to be making the prediction as easy as possible,
framing the problem in such a way that it's as easy as possible for the neural net to make this prediction.
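As a rough sketch of why one output neuron makes the two labels mutually exclusive: if the single output is read as "how positive", then "how negative" is whatever is left over, so the two can never both be high. The helper label_from_output and the input values are hypothetical, and the 0.5 cutoff is an assumption here.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def label_from_output(p):
    # Assumed decision rule: above 0.5 reads as positive, otherwise negative
    return "POSITIVE" if p > 0.5 else "NEGATIVE"

# Hypothetical pre-activation values for two reviews
for raw in (1.2, -1.2):
    p_positive = sigmoid(raw)
    p_negative = 1 - p_positive   # implied by the same single neuron
    print(round(float(p_positive), 3), round(float(p_negative), 3),
          label_from_output(p_positive))
```

With separate output neurons per label, nothing in the architecture by itself would stop two of them from being high at the same time.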
So what do we need, and what is Project 2 going to be about?
Project 2 is about setting up two functions that take our input and output data and transform them
into the appropriate representation: a binary 1 or 0 on the output,
and the counts on the input. So in this case, the first function I want you to build takes a review,
extracts the words from the review, counts them, and puts those counts into a vector.
Now that vector has to be constant length; it needs to be the length of the vocabulary.
And then I want you to create another function that just maps positive/negative to a 1 or 0.
So go ahead and create those functions, and then I'll show you how I create them
and we can compare notes.
Transforming Text to Numbers
-- Project2: Creating the Input/Output Data
from IPython.display import Image

review = "This was a horrible, terrible movie."

Image(filename='sentiment_network.png')

review = "The movie was excellent"

Image(filename='sentiment_network_pos.png')
Project 2: Creating the Input/Output Data
vocab = set(total_counts.keys())
vocab_size = len(vocab)
print(vocab_size)
import numpy as np

layer_0 = np.zeros((1,vocab_size))
layer_0
from IPython.display import Image
Image(filename='sentiment_network.png')
Right, so in this project we're going to create our input and output data.
So, for input data, we're going to count all the words that happen in a review,
and then we're going to put them into a fixed-length vector,
where each place in the vector is for one of the words in our vocabulary.
So the first thing we do is count our vocabulary.
Looks like we have just over 74,000 words.
vocab = set(total_counts.keys())
vocab_size = len(vocab)
print(vocab_size)
74074
Now, we're going to create our empty vector.
It's generally good practice to pre-allocate this vector
as something that's empty and then edit it as you go,
because one of the more expensive things you can do in computer science is allocate new memory.
So we don't want to have to create this new vector from scratch every time that we use it.
So we're going to create an empty one, and then we're going to create a function
that modifies this vector with the proper counts.
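The difference between editing a pre-allocated vector and creating a new one each time can be sketched as follows. The vector length of 5 and the variable names are made up for illustration; the real vector would be vocab_size long.

```python
import numpy as np

layer = np.zeros((1, 5))     # allocated once, up front
original = layer             # keep a second reference to the same array

layer[0][2] += 3             # pretend we counted a word three times
layer *= 0                   # in-place reset: same array, values cleared

reused = layer is original   # still the very same object
fresh = np.zeros((1, 5))     # by contrast, this allocates a new array
is_new = fresh is not original

print(reused, is_new, layer)
```

The in-place `*= 0` keeps reusing the memory that was allocated at the start, which is the pattern the preceding paragraph recommends.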
So, the first thing we need to do is decide which place in this vector goes to each word,
and create a variable that allows us to look that up.
word2index = {}

for i,word in enumerate(vocab):
    word2index[word] = i
word2index
Now, it doesn't really matter which place we put each word in;
horrible can be down here, or it could be up there,
as long as we stick with whatever we choose, right?
So I'm going to create a dictionary that allows us to look up every word that's in our vocabulary
according to the place that it has in that vocabulary.
def update_input_layer(review):

    global layer_0

    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

update_input_layer(reviews[0])
And then we're going to create our function. So layer_0 here is the global variable.
We're going to clear out the old counts. Then we're going to iterate through each word in our review,
and we're going to look up the position in that vector and increment it,
so that there's a count for each word.
layer_0
array([[ 18., 0., 0., ..., 0., 0., 0.]])
Actually one of the words, presumably the empty one when I tokenized it, happened 18 times.
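One plausible source of such an "empty" word is the way `split(" ")` treats consecutive or trailing spaces. The sketch below uses a made-up sentence; it is not a check of the actual reviews.txt file.

```python
from collections import Counter

# Hypothetical text with a double space in the middle and a trailing space
sample = "great movie .  loved it . "

tokens = sample.split(" ")   # splits on every single space
counts = Counter(tokens)

print(tokens)
print(counts[""])            # how many empty-string tokens were produced
print(sample.split())        # no argument: runs of whitespace are collapsed
```

Calling `split()` with no argument avoids the empty tokens, but it would also change the vocabulary, so this is only an observation and not a change to the approach used here.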
def get_target_for_label(label):
    if(label == 'POSITIVE'):
        return 1
    else:
        return 0
So get_target_for_label seems to work, so labels[0] was positive, and labels[1] I think was negative.
labels[0]
'POSITIVE'
This is our input and output dataset, and I hope that you created variables that look a lot like this.
The nice thing about what we're doing here, and I guess the thing to take away,
is mostly this efficiency piece, right?
So when you're creating these vectors, try not to allocate completely new vectors for your data.
The second thing that we're also not doing is pre-generating the entire dataset, right?
Because that would be a matrix that is 74,000 by, how many training examples?
25,000, so vocab_size * 25,000 = 74,074 * 25,000 = 1,851,850,000 values,
which is a lot of stuff to store on your machine when, in reality,
we can populate each vector pretty easily.
And most of the values are zeroes, and the vectors are pretty quick to generate when we need them.
So this is just generally good practice for creating your dataset without filling up the RAM on your laptop.
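To put a rough number on that, the arithmetic below estimates the size of the full pre-generated matrix. It assumes 8 bytes per value, which is NumPy's default float64 as used by np.zeros; a smaller dtype would shrink the figure proportionally.

```python
vocab_size = 74074        # vocabulary size printed earlier
num_reviews = 25000       # number of reviews
bytes_per_value = 8       # assumption: float64, the np.zeros default

entries = vocab_size * num_reviews
dense_bytes = entries * bytes_per_value
dense_gib = dense_bytes / 1024**3

single_row_bytes = vocab_size * bytes_per_value

print(entries)                 # total values in the full matrix
print(round(dense_gib, 1))     # size of the full matrix in GiB
print(single_row_bytes)        # size of one reusable input vector in bytes
```

Under that assumption the full matrix would need on the order of 14 GiB, while the single reusable input vector needs well under a megabyte.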
So, that's our input and output dataset. Those are the things to watch out for:
don't allocate too much memory at once, and don't create new variables all the time.
These are the forms that we're going to use in our neural net, right?
So in the next section we're going to be talking about how we're going to put
this together into our neural network.
Building A Neural Network
So in this section we're going to take everything we've learned and
we're going to build our first neural network to train over the datasets that we just created.
Now what I'd like for you to do for this project is to start with your neural net from the last chapter,
I guess the last module that you did, where you built a basic neural network for predicting on a structured dataset.
Then I would like you to take this three-layer neural network and remove the non-linearity in the hidden layer.
I'll show you why later.
Then I would like for you to use the functions that we created above to generate the training data on the fly.
So a review and a label go in, and they're converted into the two vectors that we need for the input and output data.
And then a forward pass and a backprop pass happen,
so that the data is generated and trained on the fly.
The next thing I would like for you to do is create a function for pre-processing the data,
so that all of these vocabulary variables and word2index variables are variables of this class,
and everything is self-contained in that class.
And then modify the train function to actually train over the entire corpus,
instead of just on one inputs and targets list.
So, that's what I would like for you to do.
You can start either with the shell
that was presented at the beginning of last week's chapter or
with the complete neural net that you finished with last time.
Now if you do need help, obviously the first thing to do is to go re-watch the previous week's
Udacity lectures. Make sure you're familiar with backpropagation and gradient descent,
the error measure that we're using, and also how to modify backprop to get rid of the non-linearity.
They give a comprehensive review of forward prop, backprop, error gradients, and stochastic gradient descent.
In a moment I'll show you how I put this network together,
and then we'll talk about the different changes that we made.
Project 3: Building a Neural Network
- Start with your neural network from the last chapter
- 3 layer neural network
- no non-linearity in hidden layer
- use our functions to create the training data
- create a "pre_process_data" function to create vocabulary for our training data generating functions
- modify "train" to train over the entire corpus
import time
import sys
import numpy as np

# Let's tweak our network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews,labels,hidden_nodes = 10, learning_rate = 0.1):

        # set our random number generator
        np.random.seed(1)

        self.pre_process_data(reviews, labels)

        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):

        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)
        self.review_vocab = list(review_vocab)

        label_vocab = set()
        for label in labels:
            label_vocab.add(label)

        self.label_vocab = list(label_vocab)

        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)

        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i

        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i

    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))

        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,
                                            (self.hidden_nodes, self.output_nodes))

        self.learning_rate = learning_rate

        self.layer_0 = np.zeros((1,input_nodes))

    def update_input_layer(self,review):

        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            if(word in self.word2index.keys()):
                self.layer_0[0][self.word2index[word]] += 1

    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0

    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)

    def train(self, training_reviews, training_labels):

        assert(len(training_reviews) == len(training_labels))

        correct_so_far = 0

        start = time.time()

        for i in range(len(training_reviews)):

            review = training_reviews[i]
            label = training_labels[i]

            #### Implement the forward pass here ####
            ### Forward pass ###

            # Input Layer
            self.update_input_layer(review)

            # Hidden layer
            layer_1 = self.layer_0.dot(self.weights_0_1)

            # Output layer
            layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))

            #### Implement the backward pass here ####
            ### Backward pass ###

            # TODO: Output error
            layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the difference between desired target and actual output.
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # TODO: Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # TODO: Update the weights
            self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate # update input-to-hidden weights with gradient descent step

            if(np.abs(layer_2_error) < 0.5):
                correct_so_far += 1

            reviews_per_second = i / float(time.time() - start)

            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")

    def test(self, testing_reviews, testing_labels):

        correct = 0

        start = time.time()

        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1

            reviews_per_second = i / float(time.time() - start)

            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + "% #Correct:" + str(correct) + " #Tested:" + str(i+1) + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")

    def run(self, review):

        # Input Layer
        self.update_input_layer(review.lower())

        # Hidden layer
        layer_1 = self.layer_0.dot(self.weights_0_1)

        # Output layer
        layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))

        if(layer_2[0] > 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
All right, so in project three we're going to build our neural network
that's going to predict whether a review has positive or negative sentiment by
using the counts of words that are inside of our review.
Now the first change I made was to create a pre_process_data function
that just brings in all the little snippets that we built and tested above:
word2index, the different vocabularies, and the vocabulary sizes.
All of the variables that we used in our training dataset generation logic,
I wanted to have in a pre-processing function,
so that it's all self-contained in the variables of this class.
self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)
The next thing I did was split off the stuff that was in here into an init_network function,
just to keep things clean. This also needs to know the number of input nodes and
the number of output nodes,
which are based on the size of the unique vocabulary in our reviews
and the number of labels that we have.
So it's nice to have these in separate functions,
just so that you can read the progression that's here.
It's cleaner that way, and I kind of like it.
    def update_input_layer(self,review):

        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            if(word in self.word2index.keys()):
                self.layer_0[0][self.word2index[word]] += 1

    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0
And the next thing I did was take update_input_layer and get_target_for_label,
which are the functions that we played with before,
and move them into the class, just so
that it's all self-contained and together.
That makes this class portable, so I can use it somewhere else.
All right so, now on the training method, this is where most of the action is, right?
So the first thing I checked was that,
assert(len(training_reviews) == len(training_labels))
the number of training reviews we have is the same as the number of labels.
So on the off chance someone inputs something that doesn't line up correctly,
we want to let people know,
so we don't see weird behavior around that.
correct_so_far = 0
And the next thing we're going to do is initialize correct_so_far,
so we can keep track of how many predictions we get right and wrong while we're training.
This is a useful metric that I like to watch to understand
how the neural net is doing during the training process, right?
Is it getting better, is it not getting better at all, is it getting worse?
These things are the basics of understanding how you're doing and
then being able to adjust for that.
review = training_reviews[i]
label = training_labels[i]
Now we select a review and a label out of our training reviews, and we update the input layer.
This is the same as previously, where we were propagating from layer zero to layer
one, or from your input layer to your hidden layer. However, in this case,
we have to generate our input data first before we can do this propagation.
#### Implement the forward pass here ####
### Forward pass ###

# Input Layer
self.update_input_layer(review)

# Hidden layer
layer_1 = self.layer_0.dot(self.weights_0_1)

# Output layer
layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
So now we generate the hidden layer the same way as before, except without the nonlinearity,
and the last layer we generate with the nonlinearity.
So that's our forward propagation step.
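The forward pass can be traced by hand on a tiny example. The sketch below uses a made-up 3-word vocabulary and 2 hidden nodes; the weights are initialized the same way as in init_network (zeros for the first matrix, normal noise for the second), and the count values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(1)

layer_0 = np.array([[2.0, 0.0, 1.0]])             # hypothetical word counts
weights_0_1 = np.zeros((3, 2))                    # input -> hidden, starts at zero
weights_1_2 = np.random.normal(0.0, 1.0, (2, 1))  # hidden -> output

layer_1 = layer_0.dot(weights_0_1)                # linear hidden layer, no sigmoid
layer_2 = sigmoid(layer_1.dot(weights_1_2))       # sigmoid only on the output

print(layer_1.shape, layer_2.shape)
print(layer_2)
```

Because the first weight matrix starts at zero, the hidden layer is all zeros and the very first prediction is exactly 0.5, whatever the review contains. The network only moves away from that once training has updated weights_0_1.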
#### Implement the backward pass here ####
### Backward pass ###

# TODO: Output error
layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the difference between desired target and actual output.
layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

# TODO: Backpropagated error
layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

# TODO: Update the weights
self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate # update input-to-hidden weights with gradient descent step
In our back propagation step, the first thing we do is ask:
how much did we miss by?
This is where we use the get_target_for_label function that we created.
We say our prediction minus our target, and then, because we have a nonlinearity on this layer,
our layer 2 delta has to multiply by the derivative of that function,
which is sigmoid times 1 minus sigmoid.
And then we continue to back propagate in this way.
Now a thing that you see we skip here is that, because there's no nonlinearity on layer one,
we don't do this multiplication step here, unlike before, because this is a linear layer.
So we don't actually have to adjust for the slope of the nonlinearity.
Once we have our layer two and layer one deltas, we're ready to update our weights,
which we do here in the exact same way that we did in our previous neural network.
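One way to convince yourself that these deltas are right is a finite-difference gradient check. The sketch below assumes the loss being minimized is half the squared error, 0.5 * (prediction - target)^2, which is the loss these delta formulas correspond to; the lesson does not name the loss explicitly, and the toy sizes and weights here are made up.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(layer_0, w01, w12):
    layer_1 = layer_0.dot(w01)              # linear hidden layer
    layer_2 = sigmoid(layer_1.dot(w12))     # sigmoid output layer
    return layer_1, layer_2

def loss(layer_0, w01, w12, target):
    # Assumed loss: half the squared error
    _, layer_2 = forward(layer_0, w01, w12)
    return 0.5 * float((layer_2[0, 0] - target) ** 2)

rng = np.random.default_rng(1)
layer_0 = np.array([[1.0, 0.0, 1.0, 1.0]])  # toy binary input
w01 = rng.normal(0, 0.5, (4, 3))
w12 = rng.normal(0, 0.5, (3, 1))
target = 1

# Analytic gradients, following the backward pass from the lesson
layer_1, layer_2 = forward(layer_0, w01, w12)
layer_2_error = layer_2 - target
layer_2_delta = layer_2_error * layer_2 * (1 - layer_2)
layer_1_delta = layer_2_delta.dot(w12.T)    # no nonlinearity, so delta == error
grad_w12 = layer_1.T.dot(layer_2_delta)
grad_w01 = layer_0.T.dot(layer_1_delta)

def numeric_grad(which, row, col, eps=1e-6):
    # Central difference on a single weight
    plus_01, minus_01 = w01.copy(), w01.copy()
    plus_12, minus_12 = w12.copy(), w12.copy()
    if which == "w01":
        plus_01[row, col] += eps
        minus_01[row, col] -= eps
    else:
        plus_12[row, col] += eps
        minus_12[row, col] -= eps
    up = loss(layer_0, plus_01, plus_12, target)
    down = loss(layer_0, minus_01, minus_12, target)
    return (up - down) / (2 * eps)

diff_w12 = abs(numeric_grad("w12", 0, 0) - grad_w12[0, 0])
diff_w01 = abs(numeric_grad("w01", 0, 1) - grad_w01[0, 1])
print(diff_w12, diff_w01)
```

Both differences should come out tiny, which is a quick sanity check that skipping the derivative on the linear hidden layer is correct rather than an oversight.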
if(np.abs(layer_2_error) < 0.5):
    correct_so_far += 1

reviews_per_second = i / float(time.time() - start)

sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4]
                 + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5]
                 + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1)
                 + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
if(i % 2500 == 0):
    print("")
And then I add a little bit of logic just to kind of log our progress,
as well as see how fast we're training and how many predictions we got correct.
Now how am I deciding whether we got something correct or not?
What I'm looking at is the absolute value of our prediction, or,
excuse me, the absolute value of the error of our prediction.
So up here we calculate the difference between what our prediction should be and what it was.
And so I said, if it predicts exactly 0.5, well, it didn't get it right; it's totally ambiguous.
It's kind of halfway between positive and negative, it didn't pick either.
But if it's closer to the right prediction, well then this error measure will be less than 0.5.
And so that's how I can kind of see how many classifications we got right,
as opposed to just the loss, by just counting this on the fly and then logging as we go.
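As a small illustration of that rule, here is a sketch with a hypothetical helper, is_correct, applied to a few made-up predictions. The only thing taken from the lesson is the test itself: the absolute error has to be strictly less than 0.5.

```python
import numpy as np

def is_correct(prediction, target):
    # Correct only if the prediction is strictly closer to the target than 0.5
    return bool(np.abs(prediction - target) < 0.5)

# (prediction, target) pairs, invented for illustration
examples = [(0.9, 1), (0.3, 1), (0.2, 0), (0.5, 0)]
results = [is_correct(p, t) for p, t in examples]
print(results)
```

Note that a prediction of exactly 0.5 counts as a miss for either label, which matches the "totally ambiguous" case described above.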
Now the other thing I want to be able to do here is test it,
which is really just a matter of taking the forward propagation logic and the evaluation logic
and putting that in one function, which I did here.
And then I add another one for running, where we can put in a text review and
it converts that text into input data, forward propagates, and gives a POSITIVE or NEGATIVE label.
So we can test it on the whole dataset, or we can kind of throw in some examples and see whether we like it.
So now that we've got this, let's go ahead and create one first, and then validate it.
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)
Here, I'm actually selecting the first 24,000 reviews to train on.
And I'm just going to go ahead and say that the last 1,000 reviews can be our test dataset.
So I'm going to kind of continue to do that. You could pick a different training test split,
there's actually another 25,000 in the IMDB data set you could use.
But, just for the sake of making it easy, I think we're just going to go with this.
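The split itself is just list slicing. A minimal sketch, using placeholder strings in place of the real reviews (the stand-in data is an assumption; the two slices are the ones used in this lesson):

```python
# Stand-in data: 25,000 placeholder "reviews"
reviews = ["review %d" % i for i in range(25000)]

train_reviews = reviews[:-1000]   # everything except the last 1,000
test_reviews = reviews[-1000:]    # only the last 1,000

print(len(train_reviews), len(test_reviews))
```

Because both slices use the same boundary, the two sets never overlap, so the test accuracy is measured on reviews the network has not trained on.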
So, we're going to initialize it this way, this is actually our default learning rate.
The other thing I like to do before we get started is actually test it.
So our weights are initialized randomly right now, so it shouldn't really predict well at all.
# evaluate our model before training (just to show how horrible it is)
mlp.test(reviews[-1000:],labels[-1000:])
Progress:99.9% Speed(reviews/sec):587.5% #Correct:500 #Tested:1000 Testing Accuracy:50.0%
So in this case, testing accuracy is exactly 50%, which makes sense:
if you just guessed between positive and negative randomly, then you should get 50% accuracy,
and that's actually what we see here.
Which is a good place to start. Especially when you have a neural net with only two predictions,
I really like to see it start off not being biased towards one way.
Like if I initialize my weights in such a way that it always predicts one way
or always predicts another, or it doesn't get any of them right,
then I kind of scratch my head.
Here, it just doesn't seem to have any real predictive power at the moment.
So, now we're going to try to train our network.
# train the network
mlp.train(reviews[:-1000],labels[:-1000])
Something I threw in here a little bit later is that every 2,500 predictions it will print a new line,
so we can not just see what the progress is now, but we can kind of see it change over time.
So now when I'm watching it train there's a few things I'm looking at.
First is speed, trying to kind of gauge, how long am I going to be sitting here?
And then I'm also looking at the training accuracy.
So now if you look, so far it's actually not predicting particularly well.
It's doing just worse than random, which is sort of worse than it was doing before.
At this point, when we're 14% of the way through the training dataset
and it hasn't learned anything yet, and it's actually doing worse than that,
I'm really starting to go, okay, something is probably wrong here.
There are a few types of neural nets where at this point it actually does continue to predict randomly,
especially in reinforcement learning; however, on this dataset we are looking at direct correlation.
I should be seeing some change right here, so I'm just going to go ahead
and quit this out. We could wait longer, but I just don't think that it's going to be a good idea to do that,
so we're going to go ahead and hit stop.
The natural thing for me to do here is think, okay, so the learning rate's too high, right?
So, when things are behaving like this, maybe it's diverging, who knows.
So let's go ahead and adjust this learning rate to be lower. A good way to feel things out is to first
move by orders of magnitude, so I'm going to divide it by ten, reinitialize the network,
and train again. It bounces around a little bit, and I'm starting to see kind of the same behavior.
It's not really getting better. Now we'll just train for a second and then kind of talk about why I was lowering the learning rate, right?
So the learning rate, if you remember from before, is the step size; it's how big of a jump it tries to take to reduce the error.
Probably a standard reason why this kind of thing happens is that
it's overshooting, so it's ending up not really any closer to solving the problem than when it started because it's going too far.
Undershooting means the network trains very, very slowly, but it does tend to make progress.
This, to me, could be training very, very slowly, but it just doesn't look like it's training at all.
It's just camping out right near 50%, and so that's really concerning.
And so, we're at 20% here; I should be seeing something at this point.
So we're going to cancel this and we're going to go again.
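Overshooting and undershooting are easiest to see on a toy problem. The sketch below is not the sentiment network; it assumes a one-dimensional error curve, error = x squared, and a hypothetical helper, descend, that takes plain gradient descent steps on it with different learning rates.

```python
def descend(learning_rate, steps=20, x=1.0):
    # Minimize error = x**2, whose gradient is 2 * x, starting from x = 1.0
    for _ in range(steps):
        gradient = 2 * x
        x -= learning_rate * gradient
    return x

overshoot = descend(1.1)      # step too big: jumps past the minimum and grows
undershoot = descend(0.001)   # step too small: barely moves in 20 steps
reasonable = descend(0.1)     # shrinks steadily toward the minimum at 0

print(overshoot, undershoot, reasonable)
```

With the large rate the value ends up farther from zero than where it started, with the tiny rate it has hardly moved, and with the middle rate it is already close to zero. A real network is much messier than this, which is part of why lowering the rate alone did not fix the training run above.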
So check this out.
Eventually, these types of metrics become really entertaining to watch.
I mean, I'm actually still kind of surprised that it is not really happening.
Here we go, okay, so it's starting to learn a little bit, so this is a good sign, right?
So it's starting to find correlation, but it's still going pretty slow. Not only
is this pretty slow as far as reviews per second,
it's only processing 100 reviews per second, but it's also not converging very quickly, right?
And I can keep knocking down the learning rate, but the truth is, the more you knock down the learning rate,
the slower the learning happens, right?
Whereas before it was overshooting, now I'd still have to continue knocking it down.
So one thing that I could do here is continue to tweak the learning rate,
and I could spend all day trying to do that, and I would get incremental improvements.
But we're so early on, we haven't refined anything, so this just poses some big framing questions
that we need to really re-evaluate in our neural network. Say, hey,
can we frame this problem so that the correlation is a little more clear, right?
So right now, I'm going back, I'm thinking, okay, so, up here.
This is our setup, right, we're counting the words and putting them in here,
and then it's making a prediction.
What about this is so difficult for this thing that it's taking this long?
So it is converging, but it's just not going very quickly.
Is there anything that we can do to make it more obvious for the network, for
it to identify the words that we validated earlier (well, not those lists, those were the raw counts)
up here, right, so it finds these words more easily? So there are two things that I typically do here.
One is I start changing stuff and see if stuff works, and the other one is I dig deeper into exactly
what's going on here, take a look at a few training examples,
and make sure that the pattern that I think should be in there is actually showing up, or maybe
I have a mistake in my logic. Nine times out of ten, when something's not training correctly it means
that there's something simple in here that I got backwards, more than a big complicated change.
But sometimes it needs to be a big complicated change. It's still training really, really slowly.
So I mean, if we extrapolated this, you know, things can train fast in the beginning and
then slow down and taper off. So I mean, I don't really see this getting much past 61, 62,
something in that kind of range, even if we keep training.
What is the signal in my training data, and what is the noise in my training data?
And that's going to be kind of a topic of the next section, which we're going to analyze,
and then try to see if we can get this training to happen faster.
So feel free to let this kind of train all the way, I don't think it'll get too much past this.
And I really feel that we're going to be able to build a better classifier here in a minute.
Understanding Neural Noise
Okay, so in this section we're going to talk about noise versus signal.
Now, a neural net's job is to look for correlation, and neural nets can be very, very good at that.
However, once again, this comes back to framing the problem
so that the neural nets have the most advantage and can train
and understand the most complicated patterns possible.
And so what we saw in our last section is that this thing really wasn't training very quickly.
It seemed like there's a lot more signal in this: excellent, versus terrible, versus moving, versus, you know,
some of these other words that are really positive or negative.
It just seems like if I as a human was looking at this text I could predict better than 60% accuracy using just the words.
So before we start getting fancy with crazy regularization or fancy things in the neural net, what we want to do is go back to the data.
The data is where all of the gold is.
Like the neural net is just the backhoe that we're going to dig all the gold out of the ground with.
But if we're not finding much gold, especially in the beginning, it's probably not a problem with our backhoe.
It's probably a problem with where we're digging, the dirt that we're choosing,
how we're manipulating it. And so, what I want to talk about here is noise versus signal because that's the whole game.
The whole game is saying we have a ton of data, we know there's a pattern in here, we want the neural net to be able to find it.
Now earlier, by kind of sheer luck, when we created this update_input_layer,
def update_input_layer(review):
    global layer_0
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

update_input_layer(reviews[0])
I saw an 18 here. And at the time, it kind of ticked in the back of my mind: wow, that's really high.
Consider what it's like if there was an 18 right here, right?
So what this is, this forward propagation, it's a weighted sum, so there are four weights coming out of each node in this input layer, right?
One input gets multiplied by these four, another gets multiplied by these four, and then those two vectors are summed here.
So it's a weighted sum, well, it's actually a weighted sum of weighted sums, but whatever.
This vector is characteristic of horrible. So when I say vector, I mean the list of weights, so this weight,
this weight, and this weight, have a certain impact on these four nodes.
You know it's funny, they kind of interplay with each other, right?
So how high this number is affects how dominantly these weights control this hidden layer.
And these weights control how dominantly this input affects this hidden layer.
So they're multiplied by each other and it's an associative thing,
so that they both kind of interplay with each other in that way.
However, if this is multiplied by 18, and this is multiplied by 1, this is going to be the dominant one.
I mean, this vector is basically going to be exactly the same as these four weights.
For horrible, multiplied by 18, as a percentage of the amount of energy
that's in these nodes, it's going to be mostly this word. And so I was looking at this and was going,
okay, which one is being weighted by 18? Well, if we look at the word index,
there were 18 of them. One of them is probably fine, and the neural net can sort that out.
But, you know, it's seeing mostly nothing in this vector, and then a word over here,
a word over here, very softly, right? You know, I look at kind of the rest of things.
So at first I'm thinking, okay, maybe a tokenization error, but then there's a whole bunch of periods in here,
so period happens a bunch. I wonder what the distribution is? So for a single review, review_counter = Counter(), right?
review_counter = Counter()
for word in reviews[0].split(" "):
    review_counter[word] += 1

review_counter.most_common()
Look at that. The dominant words, these have nothing to do with sentiment.
So there's going to be some standard words down here, but insightful, it's right there.
But when you look at this, most of this review is completely irrelevant filler words like the, to, I, is, of, a.
And this weighting is causing them to have a dominant effect in the hidden layer,
and the hidden layer is all that the output layer gets to use to try to make a prediction.
So if this hidden layer doesn't have rich information, the output layer is going to struggle.
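To put a rough number on that dominance, here is a sketch on a made-up one-line review (not one from the dataset). It measures what share of the total input each word contributes under count weighting, and what share it would contribute if every present word were simply a 1.

```python
from collections import Counter
import numpy as np

# Invented toy review, tokenized on spaces like the reviews in this lesson
review = "the movie . the plot . the acting . insightful . the end ."
counts = Counter(review.split(" "))
vocab = sorted(counts)

count_input = np.array([counts[w] for w in vocab], dtype=float)
binary_input = (count_input > 0).astype(float)

# Fraction of the total input magnitude carried by each word
count_share = dict(zip(vocab, count_input / count_input.sum()))
binary_share = dict(zip(vocab, binary_input / binary_input.sum()))

filler_share = count_share["."] + count_share["the"]
print(filler_share, count_share["insightful"], binary_share["insightful"])
```

In this toy case the period and "the" together carry well over half of the count-weighted input, while "insightful" carries about twice as much of the input under the binary encoding as it does under counts.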
So now I'm sitting here going, okay, wait, we decided to do counts earlier.
Maybe counts was a bad idea, because the counts don't highlight the signal.
When I'm looking at these counts it seems like they highlight the noise.
And when I say highlight, what I mean is weight it most heavily.
Neural nets they're just weights and functions.
Like you take this, these set of values, you re-weight them, right,
into these four nodes, and then you run a function.
In this case we don't do a function here, but it's a linear function.
And then we re-weight them again, and we do a function, and that's our prediction, right?
So if our weighting is off or how we're creating our input data, it's going to make it really hard to find the signal.
That's noise. That means that the way that we're framing the problem is adding a significant amount of noise.
Because in these weights, the neural net has to learn to be like: to, I, high, is, of, a, quiet down.
I need to hear insightful. I need to hear welcome. I need to hear other positive words, because this is a positive review.
I don't know, if it's a negative review we could look too,
but this neural net is trying to quiet down all the words that aren't relevant
and listen more attentively to the words that are relevant.
But we're not helping it by causing the heaviest weighting to go to the things that are most frequent.
And so I think that we should try eliminating this. So in project four I think we're going to try that.
So let me go ahead and describe what that's going to be.
Project 4: Reducing Noise in our Input Data
import time
import sys
import numpy as np

# Let's tweak our network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews,labels,hidden_nodes = 10, learning_rate = 0.1):
        # set our random number generator
        np.random.seed(1)

        self.pre_process_data(reviews, labels)

        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)
        self.review_vocab = list(review_vocab)

        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        self.label_vocab = list(label_vocab)

        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)

        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i

        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i

    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))

        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,
                                            (self.hidden_nodes, self.output_nodes))

        self.learning_rate = learning_rate

        self.layer_0 = np.zeros((1,input_nodes))

    def update_input_layer(self,review):
        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            if(word in self.word2index.keys()):
                self.layer_0[0][self.word2index[word]] = 1

    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0

    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)

    def train(self, training_reviews, training_labels):
        assert(len(training_reviews) == len(training_labels))

        correct_so_far = 0

        start = time.time()

        for i in range(len(training_reviews)):
            review = training_reviews[i]
            label = training_labels[i]

            ### Forward pass ###
            # Input layer
            self.update_input_layer(review)
            # Hidden layer
            layer_1 = self.layer_0.dot(self.weights_0_1)
            # Output layer
            layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))

            ### Backward pass ###
            # Output error: the difference between the actual output and the desired target
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)
            # Backpropagated error: errors propagated to the hidden layer
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T)
            # Hidden layer gradients: no nonlinearity, so the delta is the same as the error
            layer_1_delta = layer_1_error
            # Update the weights with a gradient descent step
            self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate  # hidden-to-output weights
            self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate  # input-to-hidden weights

            if(np.abs(layer_2_error) < 0.5):
                correct_so_far += 1

            reviews_per_second = i / float(time.time() - start)

            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4]
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5]
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1)
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")

    def test(self, testing_reviews, testing_labels):
        correct = 0

        start = time.time()

        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1

            reviews_per_second = i / float(time.time() - start)

            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + "% #Correct:" + str(correct) + " #Tested:" + str(i+1)
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")

    def run(self, review):
        # Input layer
        self.update_input_layer(review.lower())
        # Hidden layer
        layer_1 = self.layer_0.dot(self.weights_0_1)
        # Output layer
        layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))

        if(layer_2[0] > 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
And we're going to take this network and we're going to say, okay,
how can we modify this network so that we don't weight it by these counts anymore?
Well, if we don't weight it by its counts anymore, then that means that this would always be a one or a zero.
So at this point we're changing it so that it's just a representation of which vocabulary words are present.
If we did that, that should work actually, because then, okay, so the period and I will still be in here,
and the neural net has to decide which words are most important.
But it doesn't have to say, okay, period times 27, so 27 times the weights for period going into the hidden layer.
That's a lot of signal to push back down, whereas if we just say ones and zeros, and do a binary representation,
that should be a lot less noisy and a lot easier for the neural net to figure out.
Now I think in here, that's actually going to be pretty easy to change, and I'll just do that project right here.
So it's in our update_input_layer.
def update_input_layer(self,review):
    # clear out previous state, reset the layer to be all 0s
    self.layer_0 *= 0
    for word in review.split(" "):
        if(word in self.word2index.keys()):
            self.layer_0[0][self.word2index[word]] = 1
Before, we were incrementing it. If we just get rid of that plus, we set it equal to one, so each value in layer zero is set to one if that vocabulary term exists in the review.
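Side by side, the change looks like this. The sketch uses a hypothetical four-word vocabulary and two standalone helper functions, encode_counts and encode_binary, rather than the class method, so the only difference between them is the += 1 versus the = 1.

```python
import numpy as np

# Hypothetical toy vocabulary; the real word2index is built from the reviews
word2index = {"the": 0, ".": 1, "great": 2, "boring": 3}

def encode_counts(review):
    layer_0 = np.zeros((1, len(word2index)))
    for word in review.split(" "):
        if word in word2index:
            layer_0[0][word2index[word]] += 1   # old behavior: count occurrences
    return layer_0

def encode_binary(review):
    layer_0 = np.zeros((1, len(word2index)))
    for word in review.split(" "):
        if word in word2index:
            layer_0[0][word2index[word]] = 1    # new behavior: mark presence only
    return layer_0

review = "the film . the cast . great ."
print(encode_counts(review))
print(encode_binary(review))
```

With counts, the period gets three times the weight of "great"; with the binary version, every word that appears carries the same weight going into the hidden layer.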
Okay, so let's rebuild that. And then let's grab our training code from up here. We'll do our original one.
And we need our train one too.
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)
mlp.train(reviews[:-1000],labels[:-1000])
And this is basically our new class, and hit train.
It's already at 60% after 2% of progress. This is amazing progress, look at that.
So we eliminated a lot of our noise, right, by getting rid of this weighting.
And the neural net was able to find correlation so much faster.
And we're only 9% into training, look at that, 70%.
See, this is what increasing the signal and reducing the noise is all about.
It's about making it more obvious for your neural net so that it can get to work at handling the signal and
then combining it in interesting ways and looking for more difficult patterns.
And you just kind of get rid of the noise in your training data.
We could have spent days and days and days up here just tweaking our little alpha,
just moving it around, lowering it down, trying to get the training to happen slowly,
but in reality we can have a big fat alpha and make huge steps and progress really quickly
if we just get rid of this really silly noise,
because we're trying to train our neural nets to do interesting stuff. Interesting stuff is not ignoring periods.
Interesting stuff is identifying which words are relevant, identifying which combinations of words are relevant.
That's what we want our neural net to do: finding interesting representations, doing interesting things in this hidden layer,
to really understand the vocabulary of what's being said in the review.
The next thing is, I would like for this to be training a lot faster.
So the next thing that I would like for us to be able to do is kind of take a look inside the neural net,
understanding what's going on, and see if we can kind of crank out a little bit more speed.
But for now, I'm going to let this go ahead and train.
Understanding Inefficiencies in our Network
So in the last section, we optimized our neural network to better find the correlation in our dataset
by removing some distracting noise.
And the neural network attended to the signal so much better.
It reached 83% accuracy on the training data, and the testing accuracy was up to 85%.
# evaluate our model after training
mlp.test(reviews[-1000:],labels[-1000:])
Progress:99.9% Speed(reviews/sec):832.7% #Correct:851 #Tested:1000 Testing Accuracy:85.1%
So we probably could have kept training it and squeezed out a little more,
but we're going to keep looking at this one-iteration benchmark to see how fast we can get the neural net to train.
I mean, this is up from 60% before, so this is a huge gain
in accuracy and the speed of training for our neural network, and that was a lot of progress.
However, the actual raw computational speed, the number of seconds that it takes to do a full pass, is still pretty slow.
What I want to be able to do in here is attack this network and say, okay,
what is this thing doing that is wasteful on the computation side?
So before we had kind of wasteful data, and now I want to say what is wasteful inside this neural net.
You can do a lot of things on the theory side and try to say, okay, how can it learn faster.
But the truth is, the issue before was the learning. It was just taking a really long time.
So we also could have tried to optimize the computational side so that it just trains so much faster
that it's still able to learn what we want it to learn.
The faster you can get your neural net to train, to be honest,
the longer you'll let it train before you get bored, and you'll find more interesting stuff.
And with neural nets, you can kind of just keep training. There's no natural finish.
It's unlike probabilistic graphical models, or many of them, anyways,
where you do a discrete count of lots of different things,
and then when it's done, it's done. In actuality, neural nets can kind of just keep training, right?
But the faster you can get it to train, the more data that you can put into it, the stronger it can be.
So what we're going to do here is we're going to analyze what's going on in our network and look for
things that we can shave out that are going to allow our neural net to go faster.
And now there's a first one that kind of stands out to me.
We're creating a really big vector for layer 0. It's 74,000-and-70-something values, right?
And only a handful of them are being turned on to 1. Now why does this matter?
Well, this forward propagation step is a weighted sum.
We take this 1, we multiply it by these weights, we add it into layer 1.
Then, we take the next one, 0, we multiply it by these weights and we add that into layer 1.
We take a zero and we multiply it by these weights, and then we add the result of that.
That means every time there's a zero, when we take this vector and do a big matrix multiplication to create our layer one,
all these zeros aren't doing anything, because zero times anything is still just zero.
So, zero times this vector added into layer one doesn't change layer one from what it was before.
So that, to me, is like the biggest source of inefficiency in this network.
To kind of show you, computationally, and sort of prove to you that this is the case, check this out.
So we have kind of a fake layer 0 that only has 10 values; we're going to picture it here.
And then we're going to say, okay, layer zero:
layer_0[4] = 1, to kind of pretend that we put a few words in here.
Now we're looking at layer zero again, it looks like that, right?
weights_0_1 = np.random.randn(10,5)
So, weights_0_1, we're going to say this is just a random weight matrix.
And then we're going to say, okay,
layer_0.dot(weights_0_1)
Okay, so that's the output.
Out[94]:
array([-0.10503756, 0.44222989, 0.24392938, -0.55961832, 0.21389503])
Now, what if instead we only summed these vectors in here, right?
So we just said, okay, 1 times this goes in here. So if we have these two indices,
indices = [4,9]
we have to have a new layer, right?
layer_1 = np.zeros(5)
So layer 1 equals np.zeros, so it's empty, and it's got 5 values.
for index in indices:
    layer_1 += (weights_0_1[index])
layer_1
Out[104]:
array([-0.10503756, 0.44222989, 0.24392938, -0.55961832, 0.21389503])
Exactly the same values, look at that. And the cool thing here is we only actually worked with part of this matrix.
So if this is, you know, two words out of 70,000 words, then we just saved having to,
you know, perform this operation and this sum with the 69,000-some other words.
That should be a pretty great savings.
Now, we'll see how much it actually works out to in the end, but that should be really positive.
I'll be curious to see how that kind of works out.
Now let's take a look at the neural net again, look for some more efficiency.
Now the other thing that's inefficient is that one times anything is just itself, so this whole one-times thing is kind of a waste.
So what if instead we change this to just be a sum? One times, we can just eliminate that, right?
So we got rid of this multiplication, we got rid of doing these altogether,
and we're still getting the same hidden state
that we were getting when we did this full dot product, this full matrix or vector-matrix multiplication.
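Both shortcuts (skip the zeros, skip the multiply by one) can be checked together at a larger scale. This is a minimal sketch with made-up sizes and a made-up list of word indices; it only verifies that summing the selected weight rows gives the same hidden layer as the full dot product, and makes no claim about how much faster it is.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_nodes = 10_000, 10       # hypothetical sizes
weights_0_1 = rng.normal(0, 0.1, (vocab_size, hidden_nodes))

indices = [4, 9, 1234]                      # positions of the words in a review

# Dense version: build the full input vector and do the whole dot product
layer_0 = np.zeros((1, vocab_size))
layer_0[0, indices] = 1
dense_layer_1 = layer_0.dot(weights_0_1)

# Sparse version: just add up the weight rows for the words that are present
sparse_layer_1 = np.zeros((1, hidden_nodes))
for index in indices:
    sparse_layer_1 += weights_0_1[index]

same = np.allclose(dense_layer_1, sparse_layer_1)
print(same)
```

The dense version touches all 10,000 rows of the weight matrix, while the sparse version touches only three, and the hidden layer is the same either way.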
I'm really liking this. I think this is a ton to build upon.
And most of the weights are over here, right? There's only four weights that go from the hidden to the output.
Well in our case it's hidden layer size.
I think we have a bigger layer, but most of the computation is here.
74,000 by whatever hidden layer size; this is the beefy part of training and running our neural net.
So that brings us to kind of the next project.
So project five is about installing this into the neural network from before.