October 4, 2013

Understanding the CBOW and Skip-gram models.

From the arXiv preprint: “Efficient Estimation of Word Representations in Vector Space” by Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

One task in Natural Language Processing, or computational linguistics, is to predict the next word given the previous few words.

A gram is a chunk of language, possibly a few characters but usually a word; an n-gram is a sequence of n of these occurring together.
See Google’s N-gram viewer (word frequency by year).
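To make the definition concrete, here is a minimal sketch in Python that lists the n-grams of a token sequence. The function name `ngrams` and the sample sentence are my own, chosen only for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Word-level grams: each gram is a word.
words = "the quick brown fox".split()
bigrams = ngrams(words, 2)

# Character-level grams: each gram is a single character.
char_trigrams = ngrams(list("word"), 3)
```

The same function covers both readings of "gram": pass a list of words for word n-grams, or a list of characters for character n-grams.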

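The Mikolov model uses a hierarchical softmax for its output layer, with the vocabulary represented as a Huffman binary tree.

Because frequent words get short codes in a Huffman tree, far fewer output units have to be evaluated per word, which dramatically saves processing.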

The softmax is an activation function common in neural net output layers: it generalizes the sigmoid (logistic) function to many outputs, and all the outputs sum to 1, so each output can be read as the probability of one of a set of mutually exclusive classes.

\sigma(\textbf{q}, i) = \frac{\exp(q_i)}{\sum_{j=1}^n\exp(q_j)} \text{,}
where the vector q is the net input to a softmax node, and n is the number of nodes in the softmax layer.
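The formula can be sketched in a few lines of plain Python. This is only an illustration: the name `softmax` and the sample inputs are mine, and subtracting the maximum is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(q):
    """Map a list of net inputs q to probabilities that sum to 1."""
    m = max(q)  # shift by the max so exp() cannot overflow; the ratios are unchanged
    exps = [math.exp(x - m) for x in q]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])

# With two nodes and the second input fixed at 0, the first output
# is the logistic sigmoid of the first input.
two_node = softmax([0.5, 0.0])[0]
sigmoid = 1.0 / (1.0 + math.exp(-0.5))
```

The two-node case shows where the sigmoid shape comes from: a softmax over two outputs reduces to the logistic function of the difference between the inputs.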

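The hierarchical softmax reduces the vocabulary-size term of the computational complexity. With a balanced binary tree over a vocabulary of size V, only about log2(V) output units need to be evaluated instead of V; with a Huffman tree this drops further, to about log2(Unigram_perplexity(V)).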
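A small sketch may help show why a Huffman tree saves work. In a hierarchical softmax, the number of binary decisions needed for a word equals the length of its code, so the average cost is the frequency-weighted code length. The code below builds Huffman code lengths with Python's `heapq`; the function name and the word counts are made up for illustration and do not come from the paper.

```python
import heapq
import math

def huffman_code_lengths(counts):
    """Return {word: code length in bits} for a Huffman tree built from counts."""
    # Heap entries are (total count, unique tie-breaker, {word: depth so far}).
    heap = [(c, i, {w: 0}) for i, (w, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)  # the two least frequent subtrees
        c2, _, d2 = heapq.heappop(heap)
        # Merging puts every word in both subtrees one level deeper.
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

counts = {"the": 8, "of": 4, "cat": 2, "sat": 1, "mat": 1}  # toy counts, assumed
lengths = huffman_code_lengths(counts)

total = sum(counts.values())
avg_bits = sum(counts[w] * lengths[w] for w in counts) / total
# log2 of the unigram perplexity is the entropy of the word distribution.
entropy_bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
balanced_bits = math.log2(len(counts))
```

For these toy counts the average code length matches the entropy (1.875 bits) and is below log2(V) for V = 5, which is the saving the complexity argument describes. With real word counts the average is close to the entropy rather than exactly equal to it.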

For hierarchical softmax the paper cites:
Strategies for Training Large Scale Neural Network Language Models by Mikolov et al., 2011
A Scalable Hierarchical Distributed Language Model by Mnih & Hinton, 2009
Hierarchical Probabilistic Neural Network Language Model by Morin & Bengio, 2005


[Figure: word vectors and machine translation (Mikolov, Sutskever, Dean, Corrado)]

CBOW predicts a word given the immediately preceding and following words, 2 of each in the paper's diagram; the order of occurrence is not preserved, because the context word vectors are averaged.
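As a rough sketch of the CBOW input side, the code below collects the context words around each target and averages their vectors. The helper names, the sentence and the 2-dimensional vectors are all invented for illustration; real word vectors are learned and have far more dimensions.

```python
def cbow_pairs(tokens, window=2):
    """Return (context words, target word) pairs, up to `window` words per side."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def average(vectors):
    """Element-wise mean; the result does not depend on the order of the vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

tokens = "the cat sat on the mat".split()
pairs = cbow_pairs(tokens)

# Hypothetical toy vectors, not learned from data.
toy_vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "on": [1.0, 1.0]}

context, target = pairs[2]  # the words around "sat"
hidden = average([toy_vectors[w] for w in context])
shuffled = average([toy_vectors[w] for w in reversed(context)])
```

Reversing the context gives the same averaged vector, which is the sense in which word order is lost.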

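Skip-gram predicts the surrounding words from a word, again 2 preceding and 2 following in the paper's diagram. The “skip” does not mean a fixed gap: in the paper a window radius R is drawn at random between 1 and a maximum C for each word, and the R words on either side are used as targets, so more distant words are sampled less often.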
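The generation of Skip-gram training pairs can be sketched as follows. This follows my reading of the paper, where a window radius R is drawn at random from 1 to a maximum C for each center word; the function name, the parameter `max_window` (standing in for C) and the fixed seed are my own choices for illustration.

```python
import random

def skipgram_pairs(tokens, max_window=2, rng=None):
    """Return (center word, context word) training pairs."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only so the sketch is repeatable
    pairs = []
    for i, center in enumerate(tokens):
        r = rng.randint(1, max_window)  # per-word window radius in 1..max_window
        lo, hi = max(0, i - r), min(len(tokens), i + r + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, max_window=2)

# With max_window=1 the radius is always 1, so the output is deterministic.
narrow = skipgram_pairs(["a", "b", "c"], max_window=1)
```

Because the radius is sometimes smaller than the maximum, neighbours at distance 1 always produce a pair while neighbours at distance 2 only sometimes do, which gives nearer words more weight.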

In “Exploiting Similarities among Languages for Machine Translation” by Mikolov, Le & Sutskever, PCA dimensionality reduction is applied to these distributed representations (illustrated above).

The architectures of CBOW and Skip-gram are similar – CBOW likes a large corpus and Skip-gram prefers smaller corpora.

From a Google Groups post by Mikolov:

“Skip-gram: works well with small amount of the training data, represents well even rare words or phrases
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
This can get even a bit more complicated if you consider that there are two different ways how to train the models: the normalized hierarchical softmax, and the un-normalized negative sampling. Both work quite differently.”

