Wednesday, April 16, 2014

Apache Mahout for Document Similarity.

Below are the steps to run Apache Mahout for document similarity on a collection of text files.


  • sh mahout seqdirectory -c UTF-8 -i /Users/xxxx/myfiles/ -o seqfiles
  • sh mahout seq2sparse -i seqfiles/ -o vectors/  -ow -chunk 100  -x 90  -seq  -ml 50  -n 2  -s 5 -md 5  -ng 3  -nv
  • sh mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
  • sh mahout rowsimilarity -i matrix/matrix -o similarity  --similarityClassname SIMILARITY_COSINE -m 10 -ess
  • sh mahout seqdumper -i similarity > similarity.txt
  • sh mahout seqdumper -i matrix/docIndex > docIndex.txt
Apache Mahout performs the following steps:

  • Tokenize and Transform
  • Generate word vectors and weights
  • Find document similarity based on TF-IDF (Term Frequency - Inverse Document Frequency) using cosine similarity (SIMILARITY_COSINE)

Tokenization

The first step is to convert each text document into a sequence of tokens. Content is tokenized on single words, and all tokens are then lower-cased so that the content is standardized.
Create n-grams: a term n-gram is a series of n consecutive tokens; the n-gram step emits every such series of consecutive tokens up to length n (here 3, per the -ng 3 option above).
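
As an illustration only (Mahout uses its own Lucene-based analyzer to do this), a minimal Python sketch of lower-casing, tokenizing on single words, and generating n-grams up to length 3 might look like this:

import re

def tokenize(text):
    # Lower-case the content and split it into single-word tokens
    return [t for t in re.split(r"\W+", text.lower()) if t]

def ngrams(tokens, n):
    # Emit every series of consecutive tokens of length 1..n
    grams = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[i:i + size]))
    return grams

print(ngrams(tokenize("Apache Mahout for Document Similarity"), 3))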

Generate Vectors and Weights

Each document is converted to a word vector of n-grams, with a weight assigned to each n-gram. The weight is assigned based on TF-IDF (Term Frequency - Inverse Document Frequency), a measure of the importance of a term in a document: it grows with the frequency of the term in the document but is offset by how frequently the term occurs across the whole document set. This balances the weighting of terms, since some terms occur more commonly than others and are less significant in arriving at similarity.
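
A rough Python sketch of the idea, using the common tf * log(N/df) weighting (Mahout's exact formula and normalization options may differ; the sample documents below are made up for illustration):

import math

def tfidf_vectors(docs):
    # docs: list of {term: raw count} dicts, one per document
    n_docs = len(docs)
    df = {}
    for counts in docs:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    # weight = term frequency * log(number of documents / document frequency)
    return [{term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}
            for counts in docs]

print(tfidf_vectors([{"apache mahout": 2, "document similarity": 1},
                     {"apache mahout": 1, "sequence file": 3}]))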

Document Similarity

Each document's word vector is compared with every other document's vector. Document similarity is based on the cosine similarity measure between the word vectors: two vectors with the same orientation have a cosine similarity of 1, and two vectors at 90 degrees have a similarity of 0. A matrix of the similarity between each document and every other document is generated.
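
For two sparse TF-IDF vectors stored as dicts, the cosine measure can be sketched in Python as (the vectors shown are made-up values):

import math

def cosine_similarity(a, b):
    # a, b: sparse TF-IDF vectors as {term: weight} dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity({"mahout": 0.5, "similarity": 0.3},
                        {"mahout": 0.4, "cluster": 0.2}))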

Sample docIndex.txt Below

Key: 0: Value: /File558
Key: 1: Value: /File4340
Key: 2: Value: /File4208
Key: 3: Value: /File471

Sample similarity.txt Below

Key: 0: Value: {0:1.0000000000000002,3865:0.15434639725421775,318:0.16924612516623572,558:0.24384373237418783,471:0.16826200999114294,4340:0.16651713654958472,7:0.164357811310303,1841:0.15904628827648598,4208:0.16296041411613846,14:0.14043468960009342}

Key: 1: Value: {1615:0.3312716794782159,2181:0.2840451034186393,2126:0.32415666313248037,3188:0.1628119850871482,1496:0.24775558568026784,1:1.0,1575:0.13525396149776772,1269:0.13286526354605824,28:0.45703740702783774,1866:0.3262754564949865}

Key: 2: Value: {2:1.0,4350:0.13571272853930183,348:0.12600225826696973,3210:0.13949921190207168,560:0.15234464042319912,3294:0.2889578044491356,802:0.17942407070282945,1633:0.1964965769704117,3355:0.1298340236494648,495:0.12627029072308343}

Key: 3: Value: {1990:0.17193706865160252,3700:0.1302978723523794,2082:0.16196813164388732,2227:0.12561545966019144,665:0.15584753719122243,1163:0.19345501767136697,6:0.22582692114456704,3:1.0,1555:0.1742692199362734,4:0.1818170186646791}

Key: 4: Value: {1990:0.13213250264185705,3:0.1818170186646791,2082:0.1349661297563062,1163:0.15679941310702006,1555:0.18261523201994426,6:0.1981917938129962,738:0.29646182670818444,1684:0.21050749439902763,4:0.9999999999999999,3150:0.1397676584929176}


You can see the scores in the similarity matrix above, e.g. [ 558:0.24384373237418783, 471:0.16826200999114294 ].

docIndex key 0 maps to the entry 0:1.0000000000000002 in the similarity.txt output (every document has a cosine similarity of 1.0 with itself).
Similarly, docIndex key 558 maps to 558:0.24384373237418783 in the similarity.txt output.
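
Assuming the seqdumper text format shown above (lines of the form "Key: <id>: Value: <value>"), a rough Python sketch that joins the two outputs and prints file names alongside their similarity scores could look like this; the parsing is specific to this dump format and may need adjusting for other Mahout versions:

import re

def load_doc_index(path):
    # docIndex.txt lines look like "Key: 0: Value: /File558"
    index = {}
    with open(path) as f:
        for line in f:
            m = re.match(r"Key: (\d+): Value: (\S+)", line.strip())
            if m:
                index[int(m.group(1))] = m.group(2)
    return index

def load_similarity(path):
    # similarity.txt lines look like "Key: 0: Value: {0:1.0,558:0.24,...}"
    rows = {}
    with open(path) as f:
        for line in f:
            m = re.match(r"Key: (\d+): Value: \{(.*)\}", line.strip())
            if m:
                scores = {}
                for pair in m.group(2).split(","):
                    key, value = pair.split(":")
                    scores[int(key)] = float(value)
                rows[int(m.group(1))] = scores
    return rows

doc_index = load_doc_index("docIndex.txt")
for row, scores in load_similarity("similarity.txt").items():
    for other, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(doc_index.get(row), "->", doc_index.get(other), score)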


--- Sample commands to dump the generated vectors to text

sh mahout seqdumper -i vectors/tfidf-vectors > tfidf-vectors.txt
sh mahout seqdumper -i vectors/tf-vectors > tf-vectors.txt
