# sinclair

## October 4, 2013

Understanding the CBOW and Skip-gram models.

From the arXiv preprint: “Efficient Estimation of Word Representations in Vector Space” by Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

Natural Language Processing or computational linguistics can attempt to predict the next word given the previous few words.

A gram is a chunk of language, possibly a few characters, probably a word; an n-gram is a run of n of these occurring together.
See Google’s N-gram viewer (word frequency by year).
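To make the idea concrete, here is a minimal sketch of pulling n-grams out of a sequence. The function name `ngrams` and the sample sentence are my own, purely for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Word-level bigrams: the grams are words.
words = "the quick brown fox jumps".split()
word_bigrams = ngrams(words, 2)

# Character-level trigrams: the grams are single characters.
char_trigrams = ngrams(list("word"), 3)
```

The same function covers both readings of "gram", since it only cares that the input is a sequence.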

The Mikolov model uses a hierarchical softmax as its output layer.

The vocabulary is represented as a Huffman binary tree, which gives frequent words short codes and dramatically cuts the processing needed per prediction.
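A rough sketch of how a Huffman tree over a vocabulary hands out codes, assuming we already have word counts. The toy counts and the name `huffman_codes` are mine; word2vec's own C code builds its tree differently in detail, so treat this as the textbook algorithm only.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Map each word to a binary code string; frequent words get short codes."""
    tiebreak = count()  # unique index so the heap never compares the dicts
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two rarest subtrees under a new parent node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical word counts, for illustration only.
freqs = {"the": 45, "of": 30, "cat": 12, "sat": 5, "zarjaz": 3}
codes = huffman_codes(freqs)
```

The length of a word's code is its depth in the tree, which is the number of binary decisions the hierarchical softmax has to evaluate for that word.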

The softmax is an activation function common in neural net outputs: it generalises the sigmoid to many outputs, and all the outputs sum to 1, so each output represents the probability of one mutually exclusive outcome.

$\sigma(\textbf{q}, i) = \frac{\exp(q_i)}{\sum_{j=1}^n\exp(q_j)} \text{,}$
where the vector $\textbf{q}$ is the net input to the softmax layer, $q_i$ is the input to node $i$, and $n$ is the number of nodes in the softmax layer.
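The formula translates almost directly into code. This is a small sketch; subtracting the maximum before exponentiating is a standard numerical safeguard that I have added, and it does not change the result.

```python
import math

def softmax(q):
    """Turn a list of net inputs q into probabilities that sum to 1."""
    m = max(q)  # shift by the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in q]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])

# With two outputs, one of them pinned at 0, softmax reduces to the sigmoid.
two_way = softmax([1.5, 0.0])[0]
sigmoid = 1.0 / (1.0 + math.exp(-1.5))
```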

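The hierarchical softmax reduces the vocabulary-size component of the computational complexity: instead of evaluating all $V$ outputs, a balanced binary tree needs about $\log_2(V)$ of them, and the Huffman tree brings this down to about $\log_2(\text{Unigram\_perplexity}(V))$.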
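To get a feel for the saving, here is a sketch comparing the three costs on a toy vocabulary. I am assuming the usual definition of unigram perplexity as 2 to the power of the entropy of the word distribution; the counts are invented.

```python
import math

def unigram_perplexity(freqs):
    """2 ** entropy (in bits) of the unigram word distribution."""
    total = sum(freqs.values())
    entropy = -sum((f / total) * math.log2(f / total) for f in freqs.values())
    return 2 ** entropy

# Hypothetical word counts, for illustration only.
freqs = {"the": 45, "of": 30, "cat": 12, "sat": 5, "zarjaz": 3}
V = len(freqs)

flat_cost = V                                          # plain softmax: every output
balanced_tree_cost = math.log2(V)                      # balanced binary tree
huffman_cost = math.log2(unigram_perplexity(freqs))    # Huffman tree, roughly
```

The more skewed the word frequencies, the further the Huffman figure drops below the balanced-tree figure; for a perfectly uniform vocabulary the two coincide.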

For hierarchical softmax the paper cites:
Strategies for Training Large Scale Neural Network Language Models by Mikolov et al, 2011
A Scalable Hierarchical Distributed Language Model by Mnih & Hinton, 2009
Hierarchical Probabilistic Neural Network Language Model by Morin & Bengio, 2005

CBOW (continuous bag-of-words) predicts a word given the immediately preceding and following words, 2 of each in the paper's diagram; the order of occurrence is not conserved.
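A sketch of how CBOW training examples could be cut from a sentence, assuming a window of 2 words on each side. The function name is mine, and sorting the context is just my way of showing that word order is thrown away.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) pairs; the context is an unordered bag."""
    examples = []
    for i, target in enumerate(tokens):
        before = tokens[max(0, i - window):i]
        after = tokens[i + 1:i + 1 + window]
        # Sorting discards position, mimicking the bag-of-words input.
        examples.append((sorted(before + after), target))
    return examples

tokens = "the cat sat on the mat".split()
examples = cbow_examples(tokens)
```

Each example says: given this bag of neighbours, predict the word in the middle.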

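Skip-gram predicts the surrounding words from a word. Again the diagram shows 2 preceding and 2 following, but the window is not fixed: for each word a range R is picked at random up to a maximum distance C, and the R words either side are used as targets, so distant words are sampled less often than near ones.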
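A sketch of Skip-gram pair generation as I read the paper: for each centre word a range is drawn at random between 1 and a maximum distance, and every word within that range becomes a prediction target. The names `skipgram_pairs` and `max_window` are mine.

```python
import random

def skipgram_pairs(tokens, max_window=2, rng=None):
    """Return (centre_word, context_word) pairs with a randomly sized window."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    pairs = []
    for i, centre in enumerate(tokens):
        r = rng.randint(1, max_window)  # range for this centre word
        for j in range(max(0, i - r), min(len(tokens), i + r + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), max_window=2)
```

Because the range is often smaller than the maximum, the nearest neighbours appear in more pairs than the far ones, which weights them more heavily in training.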

In “Exploiting Similarities among Languages for Machine Translation by Mikolov, Le & Sutskever” PCA dimensionality reduction is applied to these distributed-representations (illustrated above).

The architectures of CBOW and Skip-gram are similar – CBOW likes a large corpus and Skip-gram prefers smaller corpora.

From Google Groups, Mikolov posts:

“Skip-gram: works well with small amount of the training data, represents well even rare words or phrases
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
This can get even a bit more complicated if you consider that there are two different ways how to train the models: the normalized hierarchical softmax, and the un-normalized negative sampling. Both work quite differently.”
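For the un-normalized option, here is a sketch of the negative-sampling loss for a single (word, context) pair, as I understand it from the follow-up word2vec paper: push the true pair's score up and the scores of a few random "noise" words down, with no sum over the whole vocabulary. All names and vectors here are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(centre_vec, true_ctx_vec, noise_ctx_vecs):
    """Loss for one observed pair plus a handful of sampled noise words."""
    # Reward a high score for the word that really appeared in context.
    loss = -math.log(sigmoid(dot(centre_vec, true_ctx_vec)))
    # Penalise high scores for words drawn at random.
    for noise in noise_ctx_vecs:
        loss -= math.log(sigmoid(-dot(centre_vec, noise)))
    return loss

good = negative_sampling_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
bad = negative_sampling_loss([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
```

The cost per pair grows with the number of noise words rather than with the vocabulary size, which is where the speed comes from.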

# “How Google Converted Language Translation Into a Problem of Vector Space Mathematics”

“To translate one language into another, find the linear transformation that maps one to the other. Simple, say a team of Google engineers”
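A toy sketch of that idea: given a small dictionary of paired word vectors, learn a matrix W that minimises the squared distance between W times the source vector and the target vector. The paper trains with stochastic gradient descent on real embeddings; here I use plain full-batch gradient descent on invented 2-D vectors, and every name below is my own.

```python
def apply(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fit_translation_matrix(pairs, dim, steps=500, lr=0.1):
    """Minimise the mean of ||W x - z||^2 over (x, z) pairs by gradient descent."""
    W = [[0.0] * dim for _ in range(dim)]
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(dim)]
        for x, z in pairs:
            pred = apply(W, x)
            for r in range(dim):
                err = pred[r] - z[r]
                for c in range(dim):
                    grad[r][c] += 2.0 * err * x[c] / len(pairs)
        for r in range(dim):
            for c in range(dim):
                W[r][c] -= lr * grad[r][c]
    return W

# Invented "source language" vectors and their "target language" partners,
# related by a 90 degree rotation that the fit should recover.
pairs = [([1.0, 0.0], [0.0, 1.0]),
         ([0.0, 1.0], [-1.0, 0.0]),
         ([1.0, 1.0], [-1.0, 1.0])]
W = fit_translation_matrix(pairs, dim=2)

# Map an unseen source vector into the target space.
translated = apply(W, [2.0, 3.0])
```

In the real setting the mapped vector will not land exactly on a word, so the translation is taken to be the target-language word whose vector is nearest to it.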

Zarjaz ! Spludig vir thrig !

Visualised using t-SNE, a dimensionality reduction technique for visualisation that tends to reveal clusters.

t-SNE was developed by Laurens van der Maaten and Geoffrey Hinton. Indeed one of the three authors of the paper, Ilya Sutskever, studied with Geoffrey Hinton at Toronto and is his partner in a deep neural network startup that was bought up by Google almost upon its inception.

All Mimsy in the Borogoves.

The full paper at arxiv :

# Exploiting Similarities among Languages for Machine Translation

from Arxiv

The code & vectors (word2vec in C) that access this hyperspace for the purposes of translation have been made available by the researchers so you can explore it for yourself. I am sure that many more papers will stem from observations made from this hyper-plane of meaning.

Reduction of the exceedingly large space of all possible letters to a point of view where meaning is a movement common to every language excitingly points back before Babel to the Ur-language; the Platonic Forms themselves, at least semantically.

I believe this space is somehow derived from intuitions made about the space defined by the weights of a deep neural network for machine translation of the Google corpus of all the web pages and books of humankind.

[This proved roughly correct: the vectors are the compact representation from the middle layer of a neural net, although a shallow one rather than a deep one.]

So in a way this is a view of human language from a machine learning to translate human languages.

In other words this deep-neural-net AI is the guy who works in Searle’s Chinese Room translating things he doesn’t understand – has no soul nor realises anything yet his visual translation dictionary appears to reveal the universality of movement that is meaning common to all human languages and discoverable within human cultural output.

Is this an argument for strong AI or weak ?

I think a more biologically inspired analogy is a Corporation. A Business where many desk workers receive and process messages between themselves and the outside world. Few of the workers possess a complete overview of the decision process and those that do are hopelessly out of touch with how anything is really actually achieved. Yet each presses on through self-interest, necessity, hubris and a little altruism generated by related-genetic-self interest, tropically seeking the prime directives established in the founding document and more loosely in the unwritten orally transmitted office-culture. Is a Corporation intelligent / self-aware, or even conscious ? Probably not, but it may think of itself as so and act as though it is. So I hold a corporation is to a human mind what a mind is to an individual neuron and thus ‘intelligence’ does not truly exist in any one individual, machine, procedure or rule but is apparent in the whole system just as mind exists in the human brain. But yet I think that does not account for soul, and it is a not unuseful model of human behaviours that we appear possessed of souls as well as minds.

Hyperbolically, reductio ad absurdum, this implies perhaps that humanity is itself a type of mind, perhaps yet more visible now that this noosphere is more highly connected as a peer-to-peer internet, web and cellphone network. Indeed our interaction with the biosphere has a type of mind, and so forth, until the universe itself is predictable as analogous to a thinking being. Is the hubris of this Universal ‘God’ the same as our hubris in saying ‘I am’ while made of individual cells ?

Paracelsus, Blake and Benoit Mandelbrot : as above so below; And find Everything in a grain of sand; How long a coastline is depends on how closely one looks.

I would suggest this is the evolution of a machine intelligence.

Certainly looks like The hyperplane of meaning.

In other news Google Translate for Animals ;-)

“Every conscious state has this qualitative feel to it. There’s something it feels like to drink beer and it’s different from listening to music or scratching your head or pick your favourite feeling … all of these states have a certain qualitative feel to them.

Now because of that they have a second feature to them, namely that they are subjective. They are ontologically subjective in the sense that they only exist insofar as they are experienced by a human or animal” – John Searle

Kant’s transcendental unity of apperception is Searle’s unity of consciousness, and perhaps the body, soul and mind are analogous.

#ILoveScience #Verstehen
#ILoveArt #heArt