What are Cosine Distance and Cosine Similarity?

Cosine Similarity is the cosine of the angle between two vectors, which is equal to their dot product divided by the product of their magnitudes. Cosine Distance is 1 minus the Cosine Similarity. (Wikipedia / Wolfram)

\text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum\limits_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum\limits_{i=1}^{n}{(B_i)^2}} }
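As a minimal sketch of the formula, here is one way it could be written in JavaScript. The function names `cosineSimilarity` and `cosineDistance` are illustrative, not from any library, and returning NaN for a zero vector is an assumption about how to handle the undefined case.

```javascript
// Cosine similarity of two equal-length numeric arrays:
// dot(a, b) / (|a| * |b|)
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let sumSqA = 0;
  let sumSqB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    sumSqA += a[i] * a[i];
    sumSqB += b[i] * b[i];
  }
  // Assumption: the similarity is undefined for a zero vector, so return NaN.
  if (sumSqA === 0 || sumSqB === 0) return NaN;
  return dot / (Math.sqrt(sumSqA) * Math.sqrt(sumSqB));
}

// Cosine distance is 1 minus the similarity.
function cosineDistance(a, b) {
  return 1 - cosineSimilarity(a, b);
}
```

Vectors pointing the same way give a similarity close to 1, perpendicular vectors give 0, and opposite vectors give -1, regardless of their lengths.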

It is used with word2vec to find words that are close by, meaning words whose vectors point in similar directions.

It does not account for magnitude, only angular difference. It can be calculated quickly on sparse matrices, because only the non-zero entries contribute to the dot product, and so it has found a place in text classification.
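To illustrate the sparse case, here is a rough sketch that stores each vector as a `Map` from index to non-zero value. The representation and the name `sparseCosineSimilarity` are assumptions made for this example. Only indices present in both vectors add to the dot product, so the loop runs over the smaller vector's entries.

```javascript
// Sparse vectors are assumed to be Map<index, value> holding only non-zero entries.
function sparseCosineSimilarity(a, b) {
  // Iterate over the smaller map and look up matches in the larger one.
  const [small, large] = a.size <= b.size ? [a, b] : [b, a];
  let dot = 0;
  for (const [index, value] of small) {
    const other = large.get(index);
    if (other !== undefined) dot += value * other;
  }
  // The magnitude only needs the stored (non-zero) entries.
  const magnitude = (m) => {
    let sumSq = 0;
    for (const value of m.values()) sumSq += value * value;
    return Math.sqrt(sumSq);
  };
  const magA = magnitude(a);
  const magB = magnitude(b);
  // Assumption: return NaN when either vector has no non-zero entries.
  if (magA === 0 || magB === 0) return NaN;
  return dot / (magA * magB);
}

// Example: two sparse vectors that share only index 5.
const docA = new Map([[0, 1], [5, 2]]);
const docB = new Map([[5, 3], [9, 4]]);
const similarity = sparseCosineSimilarity(docA, docB);
```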


CSV to JSON-P, a Javascript Array converter in awk

Instead of converting a CSV into JSON it is sometimes more convenient to convert it into a JavaScript array with awk.

To import word vectors from word2vec into JavaScript I used a quick awk script that adds the syntactic sugar needed to turn the CSV rows into an array of objects.

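Now the array can be used in JavaScript. This CSV to JavaScript import is called JSON-P here, although strictly speaking JSONP wraps the data in a callback function call, whereas this file simply assigns the array to a variable.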
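The transformation itself is small. As an illustration only, here is the same kind of wrapping sketched in JavaScript rather than awk. It is not the script used for this post. The helper name `csvToJsArraySource`, the variable name `vectors`, the `word` and `vector` field names, and the CSV layout (a word followed by its numbers, no header row, no quoted fields) are all assumptions.

```javascript
// Hypothetical helper: build the text of a .js file that assigns an
// array of objects to a global variable.
// Assumed input rows look like "king,0.5,-1": a word, then its numbers.
function csvToJsArraySource(csvText, varName = "vectors") {
  const rows = csvText
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "") // skip blank lines
    .map((line) => {
      const [word, ...values] = line.split(",");
      return { word: word.trim(), vector: values.map(Number) };
    });
  // JSON is valid JavaScript literal syntax, so it can be assigned directly.
  return "var " + varName + " = " + JSON.stringify(rows) + ";\n";
}

const source = csvToJsArraySource("king,0.5,-1\nqueen,0.25,2\n");
```

Saving the returned string to a file with a .js extension gives a script that defines the array as soon as it loads.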

JSON-P is good because the data is ready for use by scripts with no additional steps. The MIME type is text/javascript, so just include the file with a script tag in the HTML and the data is ready. Give the file a .js extension for maximum compatibility.

<script type="text/javascript" src="vectors.js"></script>