Monthly Archives: October 2013

What are Cosine Distance and Cosine Similarity?

Cosine Similarity is the cosine of the angle between two vectors, which is equal to their dot product divided by the product of their magnitudes. Cosine Distance is 1 minus the Cosine Similarity. ( wikipedia / wolfram )

\text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum\limits_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum\limits_{i=1}^{n}{(B_i)^2}} }

It is used in word2vec to find words that are close by.

It does not account for magnitude, only angular difference, but it can be calculated fast on sparse matrices, since only the non-zero entries need calculating, and so it has found a place in text classification.
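As a rough illustration of the sparse case, here is a minimal Python sketch. It assumes each vector is stored as a dict holding only its non-zero entries; the function names and the toy documents are made up for the example and are not from word2vec.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {key: value} dicts."""
    # Walk the smaller dict so only entries that are non-zero in both get multiplied.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance is 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

doc_a = {"cat": 1.0, "sat": 2.0}
doc_b = {"cat": 2.0, "sat": 4.0}  # same direction as doc_a, twice the magnitude
doc_c = {"dog": 3.0}              # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))  # magnitude is ignored, so this is (about) 1
print(cosine_similarity(doc_a, doc_c))  # no shared entries, so the dot product is 0
```

Because doc_b is just doc_a scaled up, the two come out as maximally similar, which is the "angle only" behaviour described above.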


CSV to JSON-P, a Javascript Array converter in awk

Instead of converting a CSV into JSON it is sometimes more convenient to convert a CSV to a Javascript Array with Awk.

To import word vectors from word2vec into Javascript I used a quick awk script to add the syntactic sugar to make an array of objects :

Now the array can be used in Javascript. This is called JSON-P, CSV to Javascript import.

JSON-P is good because the data is ready for use by scripts with no additional steps. The MIME type is text/javascript: just include the file with a script tag in the HTML and the data is ready. Give the file a .js extension for maximum compatibility.

<script type="text/javascript" src="vectors.js"></script>
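For comparison, the same kind of conversion can be sketched in Python. This is not the awk script referred to above, only an illustration under assumptions: each CSV row is taken to be a word followed by its vector components, and the names `csv_to_js_array`, `vectors`, `word` and `vector` are hypothetical.

```python
import csv
import io
import json

def csv_to_js_array(csv_text, var_name="vectors"):
    """Turn rows of `word,x1,x2,...` into Javascript source defining an array of objects."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not record:
            continue  # skip blank lines
        word, *values = record
        rows.append({"word": word, "vector": [float(v) for v in values]})
    # For plain data like this, the JSON text doubles as a Javascript object literal.
    body = ",\n".join("  " + json.dumps(row) for row in rows)
    return "var {} = [\n{}\n];\n".format(var_name, body)

sample = "king,0.5,-0.25\nqueen,0.75,0.125\n"  # made-up values, not real word2vec output
js_source = csv_to_js_array(sample)
print(js_source)
```

Saving the printed text as vectors.js gives a file that a script tag like the one above can load directly.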

Including JSON files & Javascript in HTML

JSON – Javascript Object Notation is a convenient data format – based on Javascript syntax.

How to read it into Javascript is another matter. From Wikipedia :

“The official MIME type for JSON text is “application/json“.[15] Although most modern implementations have adopted the official MIME type, many applications continue to provide legacy support for other MIME types. Many service providers, browsers, servers, web applications, libraries, frameworks, and APIs use, expect, or recognize the (unofficial) MIME type “text/json” or the content-type “text/javascript“.”

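var my_JSON_object;
var http_request = new XMLHttpRequest();
// url is assumed to already hold the address of the JSON file
http_request.open("GET", url, true);
http_request.onreadystatechange = function () {
    var done = 4, ok = 200;
    // readyState 4 means the response has fully arrived
    if (http_request.readyState == done && http_request.status == ok) {
        my_JSON_object = JSON.parse(http_request.responseText);
    }
};
http_request.send(null);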

After tearing your hair out, why not just use JSON-P? It works everywhere instantly.



Python Simple HTTP Server – Web Serve Files over HTTP to properly simulate file permissions

Problem : Everything worked till I put it online; the site stopped working once uploaded to the server.

When a web page includes files over HTTP (Javascript files, libraries, includes, jQuery, WebGL textures and the like), I find it best to open the page from a web server rather than straight from disk, as there can be cross-site permission restrictions on some file types. I have run into this problem when including images as WebGL textures using three.js.

Python Simple HTTP Server – runs in a directory and serves those files and folders over HTTP on localhost port 8000

navigate to the file directory in a BASH prompt ( or shell or terminal ) and run :

python -m SimpleHTTPServer

( on Python 3 the module is called http.server : python3 -m http.server )

and open http://localhost:8000 in a web-browser

which will load an index page with the files as links or index.html if that is present

And the HTTP transactions are logged to the console :D

Vary the port by adding a port number, say 8030, to the command thus :

python -m SimpleHTTPServer 8030

and you can serve multiple directories at once, each on its own port
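The same thing can be done from a script. The sketch below assumes Python 3.7 or later, where the http.server module replaces SimpleHTTPServer; `serve_directory` is a made-up helper name, not part of the standard library.

```python
import functools
import http.server
import threading

def serve_directory(directory, port=8000):
    """Serve `directory` over HTTP on localhost in a background thread."""
    # Bind the handler to one directory so several servers can run side by side.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server  # call server.shutdown() to stop it

# Example use (directory names are placeholders):
#   site_a = serve_directory("site_a", 8000)
#   site_b = serve_directory("site_b", 8030)
```

Each call gets its own port, so two directories can be open in the browser at the same time, and requests are still logged to the console as they are with the command-line version.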

As a bonus, other machines on your local network can connect by using the IP address of the server in place of localhost

or use a dynamic dns service or DMZ or port forwards on your router to serve the site globally ( prolly super insecure so do not ever do this ever :( )


[Solved] Ubuntu 12.04 – How to remove volume change sound effect noise

Ubuntu 12.04 makes a very loud and terribly annoying put putting sound effect when the volume is changed.

To remove it : Applications -> System Tools -> System Settings -> Sound (icon) -> Sound Effects (tab). Here you can change the alert sound volume and switch it off. [SOLVED]



October 4, 2013

Understanding the CBOW and Skip-gram models.

From the arXiv preprint : “Efficient Estimation of Word Representations in Vector Space” by Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

Natural Language Processing or computational linguistics can attempt to predict the next word given the previous few words.

A gram is a chunk of language, possibly 3 characters, probably a word; an n-gram is a run of n of these occurring together.
Google’s N-gram viewer ( word frequency by year )

The Mikolov model is a hierarchical softmax encoded as a Huffman Binary Tree.

The vocabulary is represented as a Huffman Binary Tree which dramatically saves processing.

The softmax is an activation function common in neural net outputs: like the sigmoid it squashes each output into the range 0 to 1, and all the outputs sum to 1, so each output represents an exclusive probability.

\sigma(\textbf{q}, i) = \frac{\exp(q_i)}{\sum_{j=1}^n\exp(q_j)} \text{,}
where the vector q is the net input to a softmax node, and n is the number of nodes in the softmax layer.
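A minimal Python sketch of the formula above, with q as a plain list of net inputs. Subtracting the largest input before exponentiating is an added numerical safeguard, not part of the formula; it cancels out in the ratio and so does not change the result.

```python
import math

def softmax(q):
    """Softmax of a list of net inputs q; the outputs are positive and sum to 1."""
    shift = max(q)  # keeps exp() from overflowing on large inputs
    exps = [math.exp(q_i - shift) for q_i in q]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # arbitrary example inputs
print(probs)
print(sum(probs))
```

The largest input gets the largest share of the probability, and the denominator is a sum over all n outputs, which is the cost that the hierarchical version sets out to avoid for a large vocabulary.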

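The hierarchical softmax reduces the vocabulary-size term of the computational complexity: for a vocabulary of V words a balanced binary tree needs about log2(V) output evaluations, and the Huffman tree brings this down to about log2(Unigram_perplexity(V)).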
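To see why a Huffman tree helps, here is a small Python sketch that builds one from word counts and reports the code length of each word. The counts are invented for the example and the function name is hypothetical; this shows only the tree shape, not the word2vec training code.

```python
import heapq
import itertools

def huffman_code_lengths(counts):
    """Return {word: code length in bits} for a Huffman tree built from word counts.

    Assumes `counts` is non-empty.
    """
    tie = itertools.count()  # tie-breaker so the heap never has to compare dicts
    heap = [(count, next(tie), {word: 0}) for word, count in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two rarest subtrees; every word inside them moves one level deeper.
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        merged = {word: depth + 1 for word, depth in depths_a.items()}
        merged.update({word: depth + 1 for word, depth in depths_b.items()})
        heapq.heappush(heap, (count_a + count_b, next(tie), merged))
    return heap[0][2]

counts = {"the": 50, "of": 30, "cat": 10, "sat": 6, "zygote": 4}
lengths = huffman_code_lengths(counts)
total = sum(counts.values())
average_bits = sum(counts[word] * lengths[word] for word in counts) / total
print(lengths)
print(average_bits)
```

Frequent words end up near the root with short codes and rare words sit deeper. Since each bit corresponds to one binary decision on the way down the tree, the average number of decisions per training word is lower than in a balanced tree over the same vocabulary.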

For hierarchical softmax the paper cites :
Strategies for Training Large Scale Neural Network Language Models by Mikolov et al, 2011
A Scalable Hierarchical Distributed Language Model by Mnih & Hinton, 2009
Hierarchical Probabilistic Neural Network Language Model by Morin & Bengio, 2005


Mikolov Sutskever Dean Corrado Word Vectors Machine Translation

CBOW predicts a word given the immediately preceding and following words, 2 of each – order of occurrence is not conserved.

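Skip-gram predicts the surrounding words from a word. Again 2 preceding and 2 following, but not necessarily the immediately adjacent ones: in the paper the context words are drawn from a window of up to C words either side of the current word, with more distant words sampled less often.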
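The difference between the two is easiest to see in the training examples each one generates. The Python sketch below assumes a fixed window of 2 words either side, which is a simplification, and the function names are invented for the illustration.

```python
def context_window(tokens, position, window=2):
    """Words up to `window` places before and after tokens[position]."""
    start = max(0, position - window)
    before = tokens[start:position]
    after = tokens[position + 1:position + 1 + window]
    return before + after

def cbow_examples(tokens, window=2):
    """(context words, target word): the whole context predicts the centre word."""
    # The model combines the context vectors, so their order carries no information.
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """(centre word, context word): the centre word predicts each neighbour in turn."""
    return [
        (tokens[i], neighbour)
        for i in range(len(tokens))
        for neighbour in context_window(tokens, i, window)
    ]

sentence = "the quick brown fox jumps".split()
print(cbow_examples(sentence)[2])
print(skipgram_examples(sentence))
```

CBOW produces one example per word while skip-gram produces one per (word, neighbour) pair, so skip-gram gets several examples out of every position in the text.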

In “Exploiting Similarities among Languages for Machine Translation” by Mikolov, Le & Sutskever, PCA dimensionality reduction is applied to these distributed representations (illustrated above).

The architectures of CBOW and Skip-gram are similar – CBOW likes a large corpus and Skip-gram prefers smaller corpora.

From Google Groups, Mikolov posts :

“Skip-gram: works well with small amount of the training data, represents well even rare words or phrases
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
This can get even a bit more complicated if you consider that there are two different ways how to train the models: the normalized hierarchical softmax, and the un-normalized negative sampling. Both work quite differently.”