NLP using Hierarchical Temporal Memory – Generating Esperanto-ish gibberish

Can a bunch of simplified neocortex neurons learn the rules of Esperanto?

I have been learning Esperanto through Duolingo lately, and I have also been wanting to test how well Hierarchical Temporal Memory actually works. So why not combine the two and make a fun weekend project!

TL;DR

It doesn’t work well. My guess is that it performs close to, or about as well as, a single-layer RNN. It would be cool to see HTM beat an LSTM/GRU on this task, but that isn’t the point. I’m here to have fun.

Why Esperanto

Besides the fact that I’ve been learning it lately, Esperanto has some very desirable properties for any NLP task:

  1. Esperanto’s grammar is perfectly regular, with very clear rules
  2. Part of speech is embedded in the word itself
    • All nouns end in “o”
    • All adjectives end in “a”
    • All present-tense verbs end in “as”
    • etc…

This makes it easy to tell whether the algorithm is actually learning or just spitting out random crap.

Hierarchical Temporal Memory

What is Hierarchical Temporal Memory anyway, you ask? HTM is a machine learning algorithm developed by Numenta, based on a simplified model of how the neocortex may work. (Note the word may here. It really is how the neocortex may work; a lot of guesswork is involved.) The basic assumptions of HTM are that 1) neurons communicate in binary and their activity is sparse, 2) neurons learn based on their local environment, and 3) time is a critical part of learning.

Binary? For real? Well… yes. In HTM every neuron accepts an array of 1s and 0s, performs computation on what it has received, and outputs 1s and 0s. All data going into an HTM model has to be encoded into some sort of binary representation. One-hot encoding can be a solution, but HTM needs something better… Say hello to Sparse Distributed Representations. An SDR is a binary array with the extra requirement that similar data must share more of the same on/off bits. For example, plain binary numbers are not SDRs: 0xffff and 0x7fff have 15 bits in common, yet the values they represent are far apart. Encoding numbers as a contiguous run of 1s, however, does work (like encoding the number 0 as “11000” and the number 1 as “01100”).
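To make the overlap idea concrete, here is a tiny sketch (mine, not from any library) of such a “run of 1s” encoder. Nearby values share most of their on bits, while distant values share none:

#include <iostream>
#include <vector>

// Toy encoder: the value picks where a fixed-width run of on bits starts,
// so numerically close values overlap heavily and distant values don't.
std::vector<int> encodeScalar(int value, int width = 3, int totalBits = 8)
{
	std::vector<int> sdr(totalBits, 0);
	for(int i = 0; i < width && value + i < totalBits; i++)
		sdr[value + i] = 1;
	return sdr;
}

int overlap(const std::vector<int>& a, const std::vector<int>& b)
{
	int n = 0;
	for(size_t i = 0; i < a.size(); i++)
		n += a[i] & b[i];
	return n;
}

int main()
{
	std::cout << overlap(encodeScalar(0), encodeScalar(1)) << "\n"; // 2 – similar values overlap
	std::cout << overlap(encodeScalar(0), encodeScalar(5)) << "\n"; // 0 – distant values don't
}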

In HTM, sequences are learned through a Temporal Memory layer. I won’t go into how it works; it’s complicated.

That’s the gist. More information can be found in this YouTube playlist by Numenta.

Building and installing NuPIC.core

NuPIC is the best HTM implementation currently available, but I’m not going to use that. I’ll be using NuPIC.core, a C++ implementation also by Numenta.

It’s not hard to build NuPIC.core, but I had trouble getting the headers installed. Well…

git clone https://github.com/numenta/nupic.core
cd nupic.core
export NUPIC_CORE=`pwd`
cd $NUPIC_CORE/build/scripts
cmake $NUPIC_CORE -DCMAKE_BUILD_TYPE=Release -DNUPIC_TOGGLE_INSTALL=ON -DPY_EXTENSIONS_DIR=$NUPIC_CORE/bindings/py/src/nupic/bindings
make -j4
sudo make install

Now the problem. The headers are not installed at all. Let’s do it manually.

cd $NUPIC_CORE
sudo cp -r src/nupic /usr/local/include
sudo cp -r build/scripts/src/nupic /usr/local/include

Generating Esperanto(ish) text

Now the fun part.

First of all, how am I going to encode Esperanto into SDRs? One-hot encoding, of course! Haven’t I just said that one-hot encoding isn’t an SDR? Yes… but there is no inherent similarity between letters, so one-hot encoding can work here. (Not exactly one-hot: several bits in the SDR have to be on at the same time for the Temporal Memory to learn properly, so I turn on 24 bits for each letter, with no two letters sharing any bits.)
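Conceptually the encoder boils down to something like this sketch (a simplified stand-in for the real encode(); the actual code in the repository may differ in details such as the container type):

std::vector<bool> encode(int tokenId)
{
	// Each letter owns its own block of LEN_PER_TOKEN (24) bits, so every
	// letter turns on many bits but no two letters share a single bit.
	std::vector<bool> sdr(TOKEN_TYPE_NUM * LEN_PER_TOKEN, false);
	for(int i = 0; i < LEN_PER_TOKEN; i++)
		sdr[tokenId * LEN_PER_TOKEN + i] = true;
	return sdr;
}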

Then I constructed a helper class that wraps around HTM layers and makes them easier to use, and used it to build a huge Temporal Memory layer – several times larger than a typical one.

Model(): tm({TOKEN_TYPE_NUM, LEN_PER_TOKEN}, TP_DEPTH, 2048, 8192)
{
	// Thresholds scale with the number of on bits per letter (LEN_PER_TOKEN)
	tm->setMinThreshold(LEN_PER_TOKEN*0.3f+1);
	tm->setActivationThreshold(LEN_PER_TOKEN*0.75f);
	tm->setMaxNewSynapseCount(1024);
	// Learn slightly faster than we forget
	tm->setPermanenceIncrement(0.06);
	tm->setPermanenceDecrement(0.055);
	tm->setConnectedPermanence(0.26);
	// Punish segments that keep predicting without ever becoming active
	tm->setPredictedSegmentDecrement((1.f/TOKEN_TYPE_NUM)*tm->getPermanenceIncrement()*2.f);
	tm->setCheckInputs(false);
}

Then just feed the model an article in Esperanto!

std::cout << "training temporal memory..." << std::endl;
for(int i=0;i<10;i++) {
	for(auto token : dataset)
		model.train(encode(token));
	model.reset();
	std::cout << i << "\r" << std::flush;
}
std::cout << "\n";

Then I ask the model to finish the sentence “estas multaj pomo_”. It predicts: “estas multaj pomon feliĉa kajn vikon feliĉa kajn vikon fel…”. Ugh… Well, at least it gets the n part right. Yet it seems to have problems generating or terminating sentences on its own…
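For reference, the completion loop is conceptually something like the sketch below. charToToken, tokenToChar, predict and decode are hypothetical names standing in for my wrapper’s actual interface, not the exact functions in the repository:

// Rough sketch of sentence completion: prime the model with the seed,
// then repeatedly decode the predicted next letter and feed it back in.
std::string complete(Model& model, const std::string& seed, size_t length)
{
	model.reset();
	std::string result = seed;

	for(char c : seed)                       // prime with the seed characters
		model.train(encode(charToToken(c)));

	while(result.size() < length) {
		int next = decode(model.predict());  // most likely next letter
		result += tokenToChar(next);
		model.train(encode(next));           // feed the prediction back in
	}
	return result;
}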

How about asking the model to generate a sentence from scratch? Judging by the previous test, it can’t go well… Here are the results:

(Starting with s)
sede la da kajn vikon feliĉa kajn vikon feliĉa kajn vikon feliĉa kajn vikon feli

(Starting with u)
udovikon feliĉa kajn vikon feliĉa kajn vikon feliĉa kajn vikon feliĉa kajn vikon

(Starting with l)
la da kajn vikon feliĉa kajn vikon feliĉa kajn vikon feliĉa kajn vikon feliĉa ka

It is interesting to see that HTM has learned the structure of an Esperanto word but not the structure of a sentence. It has learned that a word ends in e, a, o, or n, and that a word typically isn’t too long. But it hasn’t learned that a sentence should contain at least a verb and (usually) a noun.

Thoughts and conclusions

Hierarchical Temporal Memory isn’t the most successful model in the world. But it is amazing to see how a model based on the neocortex can learn the basic structures of a simple language.

I guess I could make the model learn better by putting a Spatial Pooler before the Temporal Memory, or by using a better encoding method. But for now, I don’t know a better way to do that.
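If I were to try the Spatial Pooler route, the rough idea would be to let NuPIC.core’s SpatialPooler re-code the raw letter SDR into a fixed-sparsity column activation before it reaches the Temporal Memory. This is only a sketch: the column count and other parameters are guesses, and rawSdr is assumed to be a std::vector<nupic::UInt> holding the encoded letter (compute() wants UInt arrays):

#include <nupic/algorithms/SpatialPooler.hpp>

using nupic::algorithms::spatial_pooler::SpatialPooler;

// Sketch: a Spatial Pooler sitting between the encoder and the Temporal
// Memory, so the TM always sees a representation with fixed sparsity.
SpatialPooler sp({(nupic::UInt)(TOKEN_TYPE_NUM * LEN_PER_TOKEN)},  // input bits
                 {2048});                                          // columns

std::vector<nupic::UInt> columns(2048, 0);
sp.compute(rawSdr.data(), /*learn=*/true, columns.data());
// 'columns' would then be fed into the Temporal Memory instead of rawSdr.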

Hopefully someone learned something from my tiny side project. 🙂

Source code available on my GitHub here.

[Image: The Esperanto flag]
