
This Flashing Rock Just Appeared Beside the Opportunity Rover on Mars

Mars is Dynamic.


This alien rock appeared on Mars beside Earth's Opportunity rover on sol 3540.

Photographs sent back from sol 3541 appear to show the rock flashing in the quickly changing light over two minutes; each frame is 20 seconds apart.

The mysterious Martian rock is not present in images from sol 3528, so it appears to have arrived suddenly. The official theories are said to be either a meteorite event or that the rover itself dislodged it.

What are Cosine Distance and Cosine Similarity?

Cosine similarity is the cosine of the angle between two vectors, which is equal to their dot product divided by the product of their magnitudes (Wikipedia / Wolfram). Cosine distance is simply 1 minus the cosine similarity.

\text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum\limits_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum\limits_{i=1}^{n}{(B_i)^2}} }

It is used in word2vec to find words whose vectors lie close together.

It does not account for magnitude, only angular difference, but it can be calculated quickly on sparse matrices, since only the non-zero entries need to be computed, and so it has found a place in text classification.
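
As a rough sketch of the formula above in plain JavaScript (the two example vectors are made up):

function cosineSimilarity(a, b) {
    // dot product divided by the product of the magnitudes
    var dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

var a = [1, 2, 0, 3];                      // made-up example vectors
var b = [2, 1, 0, 4];
console.log(cosineSimilarity(a, b));       // ≈ 0.93, nearly the same direction
console.log(1 - cosineSimilarity(a, b));   // the corresponding cosine distance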

 

Including JSON files & JavaScript in HTML

JSON – JavaScript Object Notation – is a convenient data format based on JavaScript syntax.

How to read it into JavaScript is another matter. From Wikipedia:

“The official MIME type for JSON text is “application/json“.[15] Although most modern implementations have adopted the official MIME type, many applications continue to provide legacy support for other MIME types. Many service providers, browsers, servers, web applications, libraries, frameworks, and APIs use, expect, or recognize the (unofficial) MIME type “text/json” or the content-type “text/javascript“.”

var my_JSON_object;
var http_request = new XMLHttpRequest();
http_request.open("GET", url, true);                 // asynchronous GET of the JSON resource
http_request.onreadystatechange = function () {
    var done = 4, ok = 200;                          // readyState DONE, HTTP status 200 OK
    if (http_request.readyState == done && http_request.status == ok) {
        my_JSON_object = JSON.parse(http_request.responseText);
    }
};
http_request.send(null);

After tearing your hair out, why not just use JSON-P? It works everywhere, instantly.
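
For reference, a minimal sketch of the JSON-P approach, assuming the server supports a callback query parameter (the URL below is only a placeholder): the page injects a script tag and the server wraps its JSON in a call to the function you name.

var my_JSON_object;
function handleData(data) {                          // the server's response calls this with the object
    my_JSON_object = data;
}
var script = document.createElement("script");
script.src = "https://example.com/data?callback=handleData";   // placeholder endpoint
document.head.appendChild(script);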

 

 

sinclair

October 4, 2013

Understanding the CBOW and Skip-gram models.

From the arXiv preprint “Efficient Estimation of Word Representations in Vector Space” by Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean.

Natural Language Processing, or computational linguistics, can attempt to predict the next word given the previous few words.

A gram is a chunk of language, possibly three characters, more probably a word; an n-gram is a number of these occurring together.
Google's N-gram Viewer (word frequency by year)

The Mikolov model uses a hierarchical softmax encoded as a Huffman binary tree.

Representing the vocabulary as a Huffman binary tree dramatically saves processing, since frequent words get short binary codes.

The softmax is an activation function common in neural-net outputs: each output follows a sigmoid-like curve and all the outputs sum to 1, so each output represents an exclusive probability.

\sigma(\textbf{q}, i) = \frac{\exp(q_i)}{\sum_{j=1}^n\exp(q_j)} \text{,}
where the vector q is the net input to a softmax node, and n is the number of nodes in the softmax layer.
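
As a quick sketch of that formula in plain JavaScript (nothing word2vec-specific, just the normalised exponentials):

function softmax(q) {
    // exponentiate each net input, then normalise so the outputs sum to 1
    var exps = q.map(function (x) { return Math.exp(x); });
    var sum = exps.reduce(function (s, x) { return s + x; }, 0);
    return exps.map(function (x) { return x / sum; });
}

console.log(softmax([1, 2, 3]));   // ≈ [0.090, 0.245, 0.665], which sums to 1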

The hierarchical softmax reduces the vocabulary component of the computational complexity from V, the vocabulary size, to about log2(V); for the 71,290-word vocabulary in the demos below that is roughly 17 binary decisions instead of 71,290 output evaluations, and the Huffman coding reduces it further, to about log2 of the unigram perplexity of V.

For hierarchical softmax the paper cites:
Strategies for Training Large Scale Neural Network Language Models by Mikolov et al., 2011
A Scalable Hierarchical Distributed Language Model by Mnih & Hinton, 2009
Hierarchical Probabilistic Neural Network Language Model by Morin & Bengio, 2005

 

[Figure: word vectors applied to machine translation, from Mikolov, Sutskever, Dean & Corrado]

CBOW predicts a word given the immediately preceding and following words, 2 of each; order of occurrence is not conserved.

Skip-gram predicts the surrounding words from a word: again 2 preceding and 2 following, but this time not only the immediately adjacent ones, since the context can skip ahead by a certain constant amount each time.
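
A toy sketch of how the two architectures see the same text; this only illustrates the window logic, not the word2vec code itself (the sentence and window size are made up):

// Build (context -> target) pairs for CBOW and (target -> context word)
// pairs for Skip-gram, window size 2, from a made-up sentence.
var words = "the quick brown fox jumps over the lazy dog".split(" ");
var windowSize = 2;
var cbowPairs = [];
var skipGramPairs = [];

for (var i = 0; i < words.length; i++) {
    var context = [];
    for (var j = Math.max(0, i - windowSize); j <= Math.min(words.length - 1, i + windowSize); j++) {
        if (j !== i) { context.push(words[j]); }
    }
    // CBOW: the (unordered) context predicts the centre word
    cbowPairs.push({ context: context, target: words[i] });
    // Skip-gram: the centre word predicts each context word in turn
    context.forEach(function (c) { skipGramPairs.push({ input: words[i], predict: c }); });
}

console.log(cbowPairs[2]);              // { context: ["the","quick","fox","jumps"], target: "brown" }
console.log(skipGramPairs.slice(0, 4));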

In “Exploiting Similarities among Languages for Machine Translation” by Mikolov, Le & Sutskever, PCA dimensionality reduction is applied to these distributed representations (illustrated above).

The architectures of CBOW and Skip-gram are similar; CBOW likes a large corpus, while Skip-gram copes better with smaller corpora.

On Google Groups, Mikolov posts:

“Skip-gram: works well with small amount of the training data, represents well even rare words or phrases
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
This can get even a bit more complicated if you consider that there are two different ways how to train the models: the normalized hierarchical softmax, and the un-normalized negative sampling. Both work quite differently.”

 

Downloading and Running word2vec from svn on Ubuntu 12.10

Downloading word2vec is easy with svn :

svn checkout http://word2vec.googlecode.com/svn/trunk/ w2v

then :

make

which automagically elicited the following

gcc word2vec.c -o word2vec -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
word2vec.c: In function ‘TrainModelThread’:
word2vec.c:363:36: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
word2vec.c:369:50: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
gcc word2phrase.c -o word2phrase -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
gcc distance.c -o distance -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
gcc word-analogy.c -o word-analogy -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
chmod +x *.sh

then I ran the 1st demo:

./demo-word.sh

which resulted in downloading a file, text8, from Matt Mahoney's site: the first 100 million characters of a cleaned-up Wikipedia dump.

Then a training phase ramped both my CPUs to 100% and got me training at 27.16k words per second. w00t !

Enter word or sentence (EXIT to break): scotland

Word: scotland  Position in vocabulary: 1105

Word       Cosine distance
————————————————————————
england        0.797283
wales        0.674080
scots        0.622237
ireland        0.607295
somerset        0.576749
cornwall        0.564767
scottish        0.555280
britain        0.540651
lulach        0.523911
tudor        0.508784
queen        0.508427
brittany        0.489793
elizabeth        0.477710
edward        0.476515
wessex        0.473759
earls        0.472889
dunkeld        0.465925
peerage        0.457085
jannaeus        0.456586
henry        0.449374
viii        0.448380
navarre        0.446313
king        0.442532
crowned        0.441560
shropshire        0.440221
ulster        0.439687
thrones        0.439417
victoria        0.439166
vii        0.438213
moray        0.438193
essex        0.436330
isles        0.432125
yorkshire        0.430498
gruoch        0.429486
aberdeen        0.428796
hertfordshire        0.428162
royalists        0.427168
pinewood        0.427026
afonso        0.426304
conqueror        0.426080

and running :

./demo-phrases.sh

results in :

make: Nothing to be done for `all’.
Starting training using file text8
words processed: 14700K     Vocab size: 3944K
real    0m20.519s
user    0m17.849s
sys    0m2.164s

and running :

./demo-analogy.sh

results in :

make: Nothing to be done for `all’.
—————————————————————————————————–
Note that for the word analogy to perform well, the models should be trained on much larger data sets
Example input: paris france berlin
—————————————————————————————————–
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
Alpha: 0.023731  Progress: 5.14%  Words/thread/sec: 25.92k  ^C
real    0m26.163s
user    0m39.730s
sys    0m0.452s

and running it again, this time letting the training run to completion:

./demo-analogy.sh
make: Nothing to be done for `all’.
—————————————————————————————————–
Note that for the word analogy to perform well, the models should be trained on much larger data sets
Example input: paris france berlin
—————————————————————————————————–
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
Alpha: 0.000121  Progress: 99.58%  Words/thread/sec: 24.97k
real    6m13.423s
user    11m13.314s
sys    0m1.568s
Enter three words (EXIT to break): king queen paris

Word: king  Position in vocabulary: 187

Word: queen  Position in vocabulary: 903

Word: paris  Position in vocabulary: 1055

Word              Distance
————————————————————————
la        0.460125
commune        0.435542
lausanne        0.430589
vevey        0.429535
bologna        0.418084
poissy        0.416120
lyon        0.415239
sur        0.405876
madrid        0.403547
les        0.401034
nantes        0.400419
le        0.390821
du        0.390574
sorbonne        0.390027
boulogne        0.389495
conservatoire        0.388148
hospital        0.383842
complutense        0.382987
salle        0.382847
nationale        0.382411
distillery        0.381971
villa        0.379842
cegep        0.378001
louvre        0.377497
technologie        0.376842
baume        0.376681
marie        0.375048
tijuana        0.373825
chapelle        0.373256
universitaire        0.369375
dame        0.368908
universite        0.367017
grenoble        0.366289
henares        0.364779
lleida        0.359695
ghent        0.359106
puerta        0.358269
escuela        0.357838
fran        0.357790
apartment        0.357674

Enter three words (EXIT to break): lord cat crazy

Word: lord  Position in vocabulary: 757

Word: cat  Position in vocabulary: 2601

Word: crazy  Position in vocabulary: 9156

Word              Distance
————————————————————————
rip        0.558223
dog        0.526483
ass        0.515598
haired        0.508471
bionic        0.506194
noodle        0.503654
bites        0.499798
prionailurus        0.493895
leopard        0.493136
blonde        0.492464
sloth        0.491286
coyote        0.486539
stump        0.486430
felis        0.486219
eyed        0.485090
bonzo        0.484414
iris        0.482825
slim        0.482549
candy        0.481998
rhino        0.481062
nails        0.478900
blossom        0.478873
hitch        0.476750
mom        0.475003
ugly        0.474717
shit        0.474513
mickey        0.473128
goat        0.471524
inu        0.469157
cheesy        0.467842
daddy        0.467046
kid        0.466047
watermelon        0.464281
naughty        0.463567
funny        0.463252
rabbit        0.462738
kiss        0.461589
reaper        0.460838
chupacabra        0.458305
girl        0.458293

 

Wasting My Young Years – London Grammar – Rich Voice

The rich voice of London Grammar's chanteuse shuts out all the noise and distraction and helps me focus.

The dreamy, intense video is cut from footage produced by 625 pinholes exposing a single roll of film in one go, using a hand-crafted multi-lens timeslice camera built by the duo Bison.

 

Alchemy on Ubuntu 12.04

Alchemy is a great drawing tool for inspiration.

It provides unique methods of drawing that embrace chaos, with finely balanced tools and effects: random, collage, scribble, displace, pull-shapes, whistle input. There is no undo, but there is regular autosave. Alchemy is the ultimate doodler.

Download Alchemy here (I got Alchemy-007.tar.gz, for Linux).

I needed Java to run Alchemy:

sudo apt-get install default-jdk

I tried openjdk-6-jdk first, but there was an error, so I installed default-jdk, which worked.

untar it:

tar -xzvf Alchemy-007.tar.gz

make the launcher executable and run it:

cd Alchemy

chmod +x Alchemy

./Alchemy

and it should work.

Configuration: an Acer 5920G with a Wacom Bamboo tablet, running Ubuntu 12.04 with the Gnome desktop; Alchemy-007.tar.gz; java version "1.6.0_27".

Neither the OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu0.12.04.2) nor the OpenJDK Server VM (build 20.0-b12, mixed mode) appeared to work with Alchemy.