Similarity: TF-IDF in Python not giving the desired results
I found a Python tutorial on the web for calculating TF-IDF and cosine similarity. I am trying to play with it and change it a bit.
The problem is that I get weird results that make no sense.
For example, I am using 3 documents, [doc1, doc2, doc3]. doc1 and doc2 are similar, and doc3 is totally different.
Here are the results:
[[  0.00000000e+00   2.20351188e-01   9.04357868e-01]
 [  2.20351188e-01  -2.22044605e-16   8.82546765e-01]
 [  9.04357868e-01   8.82546765e-01  -2.22044605e-16]]
First, I thought the numbers on the main diagonal should be 1, not 0. Second, the score for doc1 and doc2 is around 0.22, while the score for doc1 and doc3 is around 0.90. I expected the opposite. Could you please check my code and help me understand why I get these results?
doc1, doc2 and doc3 are tokenized texts (lists of words).
import math

import numpy
import nltk.cluster.util

# doc1, doc2, doc3 are lists of tokens
articles = [doc1, doc2, doc3]

# vocabulary: every distinct word across all articles
corpus = []
for article in articles:
    for word in article:
        if word not in corpus:
            corpus.append(word)

def freq(word, article):
    return article.count(word)

def wordcount(article):
    return len(article)

def numdocscontaining(word, articles):
    count = 0
    for article in articles:
        if word in article:
            count += 1
    return count

def tf(word, article):
    return freq(word, article) / float(wordcount(article))

def idf(word, articles):
    # float() forces true division
    return math.log(float(len(articles)) / (1 + numdocscontaining(word, articles)))

def tfidf(word, document, documentlist):
    return tf(word, document) * idf(word, documentlist)

# one TF-IDF vector per article, with one dimension per vocabulary word
feature_vectors = []
for article in articles:
    vec = []
    for word in corpus:
        if word in article:
            # idf is computed over the list of articles
            vec.append(tfidf(word, article, articles))
        else:
            vec.append(0)
    feature_vectors.append(vec)

# pairwise matrix between all articles
n = len(articles)
mat = numpy.empty((n, n))
for i in range(n):
    for j in range(n):
        mat[i][j] = nltk.cluster.util.cosine_distance(feature_vectors[i], feature_vectors[j])

print(mat)
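One thing worth checking when reading the matrix in the question: NLTK documents `nltk.cluster.util.cosine_distance` as returning 1 minus the cosine of the angle between the two vectors, so it is a distance, not a similarity. Read that way, 0 on the diagonal is what you would expect (tiny values such as -2.22044605e-16 are floating-point round-off around 0), and 0.22 between doc1 and doc2 corresponds to a similarity of about 0.78. The sketch below illustrates the relationship with plain Python; the two functions are written here for illustration and the toy vectors are hypothetical, not taken from the documents in the question.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """1 - cosine similarity: near 0 for the same direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical toy vectors, only for illustration.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a
c = [0.0, 0.0, 3.0]  # shares no non-zero dimension with a

for name, other in (("a", a), ("b", b), ("c", c)):
    print("a vs", name,
          "similarity:", cosine_similarity(a, other),
          "distance:", cosine_distance(a, other))
```

If a similarity matrix is what you want, subtracting each distance from 1 should give it.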
If you can use another package such as sklearn, give it a try.
This code might help:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def read_text(path):
    # read a file as UTF-8, skipping bytes that cannot be decoded
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()

doc1 = read_text("/root/myfolder/scoringdocuments/doc1")
doc2 = read_text("/root/myfolder/scoringdocuments/doc2")
doc3 = read_text("/root/myfolder/scoringdocuments/doc3")

train_set = [doc1, doc2, doc3]
test_set = ["age salman khan wife"]  # query

stop_words = stopwords.words('english')

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words)
# learn the vocabulary and idf weights from the documents; rows are normalized TF-IDF scores
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)
print(tfidf_vectorizer.vocabulary_)
# project the query onto the same vocabulary
tfidf_matrix_test = tfidf_vectorizer.transform(test_set)
print('fit vectorizer train set', tfidf_matrix_train.todense())
print('transform vectorizer test set', tfidf_matrix_test.todense())

# one row for the query, one column per document
print("\n\ncosine similarity scores ==> ", cosine_similarity(tfidf_matrix_test, tfidf_matrix_train))
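To see the kind of matrix the question expected (1 on the diagonal, higher values for more similar documents), here is a minimal dependency-free sketch of a pairwise TF-IDF similarity matrix. It is not the tutorial's code: the three token lists are made up for illustration, and the idf uses the smoothed form log((1 + N) / (1 + df)) + 1, chosen here as an assumption so that no word ends up with a zero or negative weight.

```python
import math

def tf(word, doc):
    # relative frequency of the word within one tokenized document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # smoothed idf (an assumption of this sketch): always >= 1
    containing = sum(1 for doc in docs if word in doc)
    return math.log((1 + len(docs)) / (1 + containing)) + 1

def tfidf_vectors(docs):
    # one dimension per distinct word, in a fixed (sorted) order
    vocabulary = sorted({word for doc in docs for word in doc})
    return [[tf(w, doc) * idf(w, docs) for w in vocabulary] for doc in docs]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical tokenized documents: the first two overlap, the third does not.
docs = [
    ["cat", "sat", "on", "the", "mat"],
    ["cat", "sat", "on", "the", "sofa"],
    ["stock", "prices", "fell", "sharply", "today"],
]

vectors = tfidf_vectors(docs)
similarity = [[cosine_similarity(u, v) for v in vectors] for u in vectors]

for row in similarity:
    print(["%.3f" % value for value in row])
```

With these toy documents the first two rows score high against each other and 0 against the third, since the third document shares no words with them.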