Tag Archives: language


October 4, 2013

Understanding the CBOW and Skip-gram models.

From the arXiv preprint: “Efficient Estimation of Word Representations in Vector Space” by Tomas Mikolov, Kai Chen, Greg Corrado & Jeffrey Dean

In Natural Language Processing (computational linguistics), one common task is to predict the next word given the previous few words.

A gram is a chunk of language – possibly a few characters, more probably a word; an n-gram is a sequence of n of these occurring together.
Google’s N-gram Viewer (word frequency by year)

The Mikolov model uses a hierarchical softmax in which the vocabulary is represented as a Huffman binary tree, which dramatically reduces the processing needed.
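To see why the Huffman coding helps, here is a toy Python sketch (my own illustration, not the word2vec code) that builds Huffman codes from word counts – frequent words get short codes, i.e. short paths from the root:

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build Huffman codes for a {word: count} vocabulary.

    Frequent words end up near the root (short codes), so the
    per-word output-layer work is proportional to the code length.
    """
    counter = itertools.count()  # tie-breaker so heap tuples always compare
    heap = [(count, next(counter), {word: ""}) for word, count in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, codes1 = heapq.heappop(heap)
        c2, _, codes2 = heapq.heappop(heap)
        # merge the two rarest subtrees, prefixing a branch bit
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, next(counter), merged))
    return heap[0][2]

codes = huffman_codes({"the": 100, "cat": 10, "sat": 9, "axolotl": 1})
print(codes)  # "the" gets the shortest code
```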

The softmax is an activation function common in neural-net output layers: it has a sigmoid-like curve and its outputs sum to 1, so each output represents an exclusive probability.

\sigma(\textbf{q}, i) = \frac{\exp(q_i)}{\sum_{j=1}^n\exp(q_j)} \text{,}
where the vector q is the net input to the softmax layer, and n is the number of nodes in that layer.
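As a sketch in plain Python (lists rather than the net’s actual tensors), the formula above is:

```python
import math

def softmax(q):
    """sigma(q, i) = exp(q_i) / sum_j exp(q_j), computed for every i.

    Subtracting max(q) first is the standard numerical-stability trick;
    it does not change the result.
    """
    m = max(q)
    exps = [math.exp(x - m) for x in q]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the three outputs sum to 1
```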

The hierarchical softmax reduces the output-layer component of the computational complexity from V, the vocabulary size, to roughly log2 of the unigram perplexity of the vocabulary.

For hierarchical softmax the paper cites :
“Strategies for Training Large Scale Neural Network Language Models” by Mikolov et al., 2011
“A Scalable Hierarchical Distributed Language Model” by Mnih & Hinton, 2009
“Hierarchical Probabilistic Neural Network Language Model” by Morin & Bengio, 2005
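The shape of the trick can be sketched without those papers: a word’s probability becomes a product of binary decisions along its Huffman path. This is my own toy sketch – the made-up per-node scores stand in for the real dot products of the hidden vector with per-node weight vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_probability(code, scores):
    """P(word) for a Huffman code like '101', given one raw score per
    inner node on its path. At each node the model goes right with
    probability sigmoid(score) and left with 1 - sigmoid(score), so only
    len(code) ~ log2(V) sigmoids are evaluated instead of a V-way softmax.
    """
    p = 1.0
    for bit, s in zip(code, scores):
        p *= sigmoid(s) if bit == "1" else 1.0 - sigmoid(s)
    return p

# sanity check: in a full depth-2 tree the four leaf probabilities sum to 1
s_root, s_left, s_right = 0.3, -1.2, 2.0
leaves = [hs_probability("00", [s_root, s_left]),
          hs_probability("01", [s_root, s_left]),
          hs_probability("10", [s_root, s_right]),
          hs_probability("11", [s_root, s_right])]
print(sum(leaves))  # 1.0
```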


[Figure: Mikolov, Sutskever, Dean, Corrado – Word Vectors for Machine Translation]

CBOW predicts a word given the immediately preceding and following words, 2 of each – order of occurrence is not conserved.

Skip-gram predicts the surrounding words from a single word – again preceding and following words, but sampled from a window around it rather than only the immediate neighbours, skipping over some positions.
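The two training objectives can be illustrated with my own sketch of the (input, prediction) pairs each one generates from a token stream (the real tool works on a streamed corpus, not an in-memory list):

```python
def cbow_pairs(tokens, window=2):
    """(context-words, target) pairs: the bag of words around each
    position predicts the word at that position. Order inside the
    context is not conserved - it is a bag."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(centre, context-word) pairs: each word predicts each of its
    neighbours inside the window, one at a time."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

toks = "the quick brown fox jumps".split()
print(cbow_pairs(toks)[2])  # (['the', 'quick', 'fox', 'jumps'], 'brown')
```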

In “Exploiting Similarities among Languages for Machine Translation” by Mikolov, Le & Sutskever, PCA dimensionality reduction is applied to these distributed representations (illustrated above).

The architectures of CBOW and Skip-gram are similar – CBOW likes a large corpus and Skip-gram prefers smaller corpora.

On Google Groups, Mikolov posts :

“Skip-gram: works well with small amount of the training data, represents well even rare words or phrases
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
This can get even a bit more complicated if you consider that there are two different ways how to train the models: the normalized hierarchical softmax, and the un-normalized negative sampling. Both work quite differently.”
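The negative sampling Mikolov mentions draws “noise” words from the unigram distribution raised to the 3/4 power – a heuristic from the follow-up paper (“Distributed Representations of Words and Phrases and their Compositionality”) that samples rare words a little more often than raw frequency would. A sketch with made-up counts:

```python
import random

def negative_sampler(freqs, power=0.75):
    """Draw 'negative' words with probability proportional to
    count ** 0.75, per the word2vec heuristic."""
    words = list(freqs)
    weights = [freqs[w] ** power for w in words]

    def draw(k, avoid):
        out = []
        while len(out) < k:
            w = random.choices(words, weights=weights)[0]
            if w != avoid:  # never use the true target as a negative
                out.append(w)
        return out

    return draw

draw = negative_sampler({"the": 1000, "cat": 50, "axolotl": 2})
print(draw(5, avoid="cat"))  # five negatives, never 'cat'
```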


Downloading and Running Word2Vec from svn on Ubuntu 12.10

Downloading word2vec is easy with svn :

svn checkout http://word2vec.googlecode.com/svn/trunk/ w2v

then :

make

which automagically elicited the following :

gcc word2vec.c -o word2vec -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
word2vec.c: In function ‘TrainModelThread’:
word2vec.c:363:36: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
word2vec.c:369:50: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
gcc word2phrase.c -o word2phrase -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
gcc distance.c -o distance -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
gcc word-analogy.c -o word-analogy -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
chmod +x *.sh

then I ran the 1st demo :

./demo-word.sh

which downloaded the file text8 from Matt Mahoney – the first 100 million characters of cleaned-up Wikipedia text.

Then a training phase ramped both my CPUs to 100% and got me training at 27.16k words per second. w00t !

Enter word or sentence (EXIT to break): scotland

Word: scotland  Position in vocabulary: 1105

Word       Cosine distance
england        0.797283
wales        0.674080
scots        0.622237
ireland        0.607295
somerset        0.576749
cornwall        0.564767
scottish        0.555280
britain        0.540651
lulach        0.523911
tudor        0.508784
queen        0.508427
brittany        0.489793
elizabeth        0.477710
edward        0.476515
wessex        0.473759
earls        0.472889
dunkeld        0.465925
peerage        0.457085
jannaeus        0.456586
henry        0.449374
viii        0.448380
navarre        0.446313
king        0.442532
crowned        0.441560
shropshire        0.440221
ulster        0.439687
thrones        0.439417
victoria        0.439166
vii        0.438213
moray        0.438193
essex        0.436330
isles        0.432125
yorkshire        0.430498
gruoch        0.429486
aberdeen        0.428796
hertfordshire        0.428162
royalists        0.427168
pinewood        0.427026
afonso        0.426304
conqueror        0.426080
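The ranking the distance tool prints is just cosine similarity of the query vector against every vocabulary vector. A sketch with made-up 3-d vectors standing in for the real, much higher-dimensional ones:

```python
import math

def cosine(u, v):
    """Cosine similarity - what the distance tool above is reporting."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# hypothetical 3-d vectors, just to show the ranking step
vecs = {"england": [0.9, 0.1, 0.3],
        "wales":   [0.8, 0.2, 0.4],
        "banana":  [-0.5, 0.9, 0.1]}
query = [0.85, 0.12, 0.35]  # stand-in for the 'scotland' vector
ranked = sorted(vecs, key=lambda w: cosine(query, vecs[w]), reverse=True)
print(ranked)  # most similar first
```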

and running :

time ./demo-phrases.sh

results in :

make: Nothing to be done for `all’.
Starting training using file text8
words processed: 14700K     Vocab size: 3944K
real    0m20.519s
user    0m17.849s
sys    0m2.164s

and running :

time ./demo-analogy.sh

results in :

make: Nothing to be done for `all’.
Note that for the word analogy to perform well, the models should be trained on much larger data sets
Example input: paris france berlin
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
Alpha: 0.023731  Progress: 5.14%  Words/thread/sec: 25.92k  ^C
real    0m26.163s
user    0m39.730s
sys    0m0.452s

I interrupted that first run, then ran it again and let training finish :


make: Nothing to be done for `all’.
Note that for the word analogy to perform well, the models should be trained on much larger data sets
Example input: paris france berlin
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
Alpha: 0.000121  Progress: 99.58%  Words/thread/sec: 24.97k
real    6m13.423s
user    11m13.314s
sys    0m1.568s
Enter three words (EXIT to break): king queen paris

Word: king  Position in vocabulary: 187

Word: queen  Position in vocabulary: 903

Word: paris  Position in vocabulary: 1055

Word              Distance
la        0.460125
commune        0.435542
lausanne        0.430589
vevey        0.429535
bologna        0.418084
poissy        0.416120
lyon        0.415239
sur        0.405876
madrid        0.403547
les        0.401034
nantes        0.400419
le        0.390821
du        0.390574
sorbonne        0.390027
boulogne        0.389495
conservatoire        0.388148
hospital        0.383842
complutense        0.382987
salle        0.382847
nationale        0.382411
distillery        0.381971
villa        0.379842
cegep        0.378001
louvre        0.377497
technologie        0.376842
baume        0.376681
marie        0.375048
tijuana        0.373825
chapelle        0.373256
universitaire        0.369375
dame        0.368908
universite        0.367017
grenoble        0.366289
henares        0.364779
lleida        0.359695
ghent        0.359106
puerta        0.358269
escuela        0.357838
fran        0.357790
apartment        0.357674

Enter three words (EXIT to break): lord cat crazy

Word: lord  Position in vocabulary: 757

Word: cat  Position in vocabulary: 2601

Word: crazy  Position in vocabulary: 9156

Word              Distance
rip        0.558223
dog        0.526483
ass        0.515598
haired        0.508471
bionic        0.506194
noodle        0.503654
bites        0.499798
prionailurus        0.493895
leopard        0.493136
blonde        0.492464
sloth        0.491286
coyote        0.486539
stump        0.486430
felis        0.486219
eyed        0.485090
bonzo        0.484414
iris        0.482825
slim        0.482549
candy        0.481998
rhino        0.481062
nails        0.478900
blossom        0.478873
hitch        0.476750
mom        0.475003
ugly        0.474717
shit        0.474513
mickey        0.473128
goat        0.471524
inu        0.469157
cheesy        0.467842
daddy        0.467046
kid        0.466047
watermelon        0.464281
naughty        0.463567
funny        0.463252
rabbit        0.462738
kiss        0.461589
reaper        0.460838
chupacabra        0.458305
girl        0.458293


Three Google engineers observe a consistent hyperspace between languages when dimensionality is reduced by a deep neural net

“How Google Converted Language Translation Into a Problem of Vector Space Mathematics”

“To translate one language into another, find the linear transformation that maps one to the other. Simple, say a team of Google engineers”
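That linear transformation can be sketched in a few lines. Here a 2×2 matrix W is fitted by plain gradient descent on made-up 2-d data; the paper learns it from a seed dictionary of real translation pairs in much higher-dimensional vector spaces:

```python
def learn_linear_map(xs, zs, lr=0.1, steps=2000):
    """Fit a 2x2 matrix W minimising sum ||W x - z||^2 by gradient
    descent - the 'find the linear transformation' step, in miniature."""
    W = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        for x, z in zip(xs, zs):
            y = [W[0][0] * x[0] + W[0][1] * x[1],
                 W[1][0] * x[0] + W[1][1] * x[1]]
            err = [y[0] - z[0], y[1] - z[1]]
            for i in range(2):
                for j in range(2):
                    W[i][j] -= lr * err[i] * x[j]
    return W

# made-up "English" vectors and their "Spanish" images under a known map
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
true = [[0.0, -1.0], [1.0, 0.0]]  # a 90-degree rotation
zs = [[true[0][0] * x[0] + true[0][1] * x[1],
       true[1][0] * x[0] + true[1][1] * x[1]] for x in xs]
W = learn_linear_map(xs, zs)
print(W)  # converges towards the rotation matrix
```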


Machine Translation – Zarjaz ! Spludig vir thrig !

Visualised using t-SNE, a dimensionality-reduction technique for visualising clusters.

t-SNE was developed by Laurens van der Maaten with Geoffrey Hinton of the Toronto deep neural net group – indeed one of the three authors of the paper, Ilya Sutskever, studied with Geoffrey Hinton at Toronto and is his partner in a deep neural network startup that was bought by Google almost upon its inception.

All Mimsy in the Borogoves.

The full paper at arxiv :

Exploiting Similarities among Languages for Machine Translation

from Arxiv

The code & vectors (word2vec in C) that access this hyperspace for the purposes of translation have been made available by the researchers, so you can explore it for yourself. I am sure that many more papers will stem from observations made from this hyper-plane of meaning.

Commented Python Version of word2vec – associated blog announcement
HNews Discussion of word2vec
Example Word2vec App – finds dissimilar word in list.
Python Wrapper & yhat R classifier web interface for word2vec

2nd paper : “Efficient Estimation of Word Representations in Vector Space” – another Google paper using word2vec: vector(“King”) – vector(“Man”) + vector(“Woman”) = vector(“Queen”)
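That famous equation is literally vector arithmetic followed by a nearest-neighbour lookup. With made-up 3-d vectors (dimension 0 roughly “royalty”, dimension 1 roughly “gender”):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# toy vectors, invented for illustration
vecs = {
    "king":    [0.9,  0.8, 0.1],
    "queen":   [0.9, -0.8, 0.1],
    "man":     [0.1,  0.9, 0.0],
    "woman":   [0.1, -0.9, 0.0],
    "cabbage": [0.0,  0.0, 1.0],
}
# vector("king") - vector("man") + vector("woman")
target = [k - m + w for k, m, w in
          zip(vecs["king"], vecs["man"], vecs["woman"])]
# nearest neighbour, excluding the three query words
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```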

Reduction of the exceedingly large space of all possible letters to a point of view where meaning is a movement common to every language excitingly points back before Babel to the Ur-language; the Platonic Forms themselves, at least semantically.

Babel Fish from The Hitchhiker’s Guide to the Galaxy

I believe this space is somehow derived from intuitions made about the space defined by the weights of a deep neural network for machine translation of the Google corpus of all the web pages and books of humankind.

[this proved correct – the vectors are the compact representation from the middle of a deep neural net]

So in a way this is a view of human language from a machine learning to translate human languages.

In other words this deep-neural-net AI is the guy who works in Searle’s Chinese Room, translating things he doesn’t understand – he has no soul, nor realises anything – yet his visual translation dictionary appears to reveal the universality of movement-as-meaning common to all human languages and discoverable within human cultural output.

Is this an argument for strong AI or weak ?

I think a more biologically inspired analogy is a Corporation: a business where many desk workers receive and process messages between themselves and the outside world. Few of the workers possess a complete overview of the decision process, and those that do are hopelessly out of touch with how anything is really achieved. Yet each presses on through self-interest, necessity, hubris and a little altruism generated by related-genetic self-interest, tropistically seeking the prime directives established in the founding document and, more loosely, in the unwritten, orally transmitted office culture. Is a Corporation intelligent, self-aware, or even conscious? Probably not, but it may think of itself as so and act as though it is. So I hold that a corporation is to a human mind what a mind is to an individual neuron, and thus ‘intelligence’ does not truly exist in any one individual, machine, procedure or rule but is apparent in the whole system, just as mind exists in the human brain. Yet I think that does not account for soul, and it is a not-unuseful model of human behaviour that we appear possessed of souls as well as minds.

Hyperbolically, reductio ad absurdum, this implies perhaps that humanity is itself a type of mind – a noosphere, now more highly connected and more visible as a peer-to-peer internet, web and cellphone network. Indeed our interaction with the biosphere has a type of mind, and so forth, until the universe itself is predictable as analogous to a thinking being. Is it hubris to call this Universal a ‘God’, as it is our hubris to say ‘I am’ while made of individual cells?

Paracelsus, Blake and Benoit Mandelbrot: as above, so below; to see Everything in a grain of sand; how long a coastline is depends on how closely one looks.

I would suggest this is the evolution of a machine intelligence.

Certainly looks like The hyperplane of meaning.

In other news Google Translate for Animals ;-)

“Every conscious state has this qualitative feel to it. There’s something it feels like to drink beer, and it’s different from listening to music or scratching your head or picking your favourite feeling … all of these states have a certain qualitative feel to them.

Now because of that they have a second feature, namely that they are subjective. They are ontologically subjective in the sense that they only exist insofar as they are experienced by a human or animal.” – John Searle

Kant’s transcendental unity of apperception is Searle’s unity of consciousness, and perhaps body, soul and mind are analogous.




#ILoveScience #Verstehen
#ILoveArt #heArt