Similarity: TF-IDF in Python not giving the desired results
I found a Python tutorial on the web for calculating TF-IDF and cosine similarity, and I am trying to play with it and change it a bit.
The problem is that I get weird results that make no sense.
For example, I am using 3 documents: [doc1, doc2, doc3].
doc1 and doc2 are similar, while doc3 is totally different.
The results are here:
[[ 0.00000000e+00  2.20351188e-01  9.04357868e-01]
 [ 2.20351188e-01 -2.22044605e-16  8.82546765e-01]
 [ 9.04357868e-01  8.82546765e-01 -2.22044605e-16]]
First, I thought the numbers on the main diagonal should be 1, not 0. Beyond that, the score between doc1 and doc2 is around 0.22, while between doc1 and doc3 it is around 0.90. I expected the opposite results. Can you please check the code and maybe help me understand why I get these results?
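Just to show what I expected, here is a minimal cosine similarity check on toy vectors (the vectors and the helper name below are made up only for illustration, they are not my real data):

import numpy as np

def cos_sim(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = [1.0, 2.0, 0.0]
v2 = [1.0, 2.1, 0.0]   # almost the same direction as v1
v3 = [0.0, 0.0, 5.0]   # shares nothing with v1

print(cos_sim(v1, v1))  # 1.0  -> the value I expected on the main diagonal
print(cos_sim(v1, v2))  # close to 1 for similar vectors
print(cos_sim(v1, v3))  # 0.0 for vectors with no overlap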
doc1, doc2 and doc3 are tokenized texts.
import math
import numpy
import nltk.cluster.util

articles = [doc1, doc2, doc3]

# Build a flat list of every word that occurs in any article.
corpus = []
for article in articles:
    for word in article:
        corpus.append(word)

def freq(word, article):
    return article.count(word)

def wordcount(article):
    return len(article)

def numdocscontaining(word, articles):
    count = 0
    for article in articles:
        if word in article:
            count += 1
    return count

def tf(word, article):
    return (freq(word, article) / float(wordcount(article)))

def idf(word, articles):
    return math.log(len(articles) / (1 + numdocscontaining(word, articles)))

def tfidf(word, document, documentlist):
    return (tf(word, document) * idf(word, documentlist))

# One TF-IDF feature vector per article, with one position per word in corpus.
feature_vectors = []
for article in articles:
    vec = []
    for word in corpus:
        if word in article:
            vec.append(tfidf(word, article, corpus))
        else:
            vec.append(0)
    feature_vectors.append(vec)

# Pairwise matrix built with nltk's cosine_distance.
n = len(articles)
mat = numpy.empty((n, n))
for i in xrange(0, n):
    for j in xrange(0, n):
        mat[i][j] = nltk.cluster.util.cosine_distance(feature_vectors[i], feature_vectors[j])

print mat
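To make the tf and idf formulas concrete, here is the same arithmetic written out standalone on a made-up toy corpus (only for illustration, not my real documents):

import math

# Toy corpus of three tokenized "documents" (made up for this check only).
toy_articles = [["apple", "banana"], ["apple", "apple"], ["cherry"]]

def toy_tfidf(word, article, articles):
    tf_val = article.count(word) / float(len(article))
    df = sum(1 for a in articles if word in a)           # documents containing the word
    idf_val = math.log(len(articles) / float(1 + df))
    return tf_val * idf_val

print(toy_tfidf("apple", toy_articles[0], toy_articles))   # 0.5 * log(3/3)  = 0.0
print(toy_tfidf("cherry", toy_articles[2], toy_articles))  # 1.0 * log(3/2) ~= 0.405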
If you can use another package such as sklearn, try it.
This code might help:
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import numpy.linalg as la
from sklearn.feature_extraction.text import TfidfVectorizer

f = open("/root/myfolder/scoringdocuments/doc1")
doc1 = str.decode(f.read(), "utf-8", "ignore")
f = open("/root/myfolder/scoringdocuments/doc2")
doc2 = str.decode(f.read(), "utf-8", "ignore")
f = open("/root/myfolder/scoringdocuments/doc3")
doc3 = str.decode(f.read(), "utf-8", "ignore")

train_set = [doc1, doc2, doc3]
test_set = ["age salman khan wife"]  # query

stopWords = stopwords.words('english')

tfidf_vectorizer = TfidfVectorizer(stop_words=stopWords)
tfidf_matrix_test = tfidf_vectorizer.fit_transform(test_set)
print tfidf_vectorizer.vocabulary_
tfidf_matrix_train = tfidf_vectorizer.transform(train_set)  # finds the tfidf scores with normalization

print 'fit vectorizer to test set', tfidf_matrix_test.todense()
print 'transform vectorizer on train set', tfidf_matrix_train.todense()

print "\n\ncosine similarity between test set and train set ==> ", cosine_similarity(tfidf_matrix_test, tfidf_matrix_train)
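If what you want in the end is the document-to-document matrix from your question, the same idea works for comparing the three documents with each other. This is only a sketch that reuses the train_set list from above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# train_set is assumed to be the same three documents loaded above.
vectorizer = TfidfVectorizer(stop_words='english')
doc_term_matrix = vectorizer.fit_transform(train_set)

# 3x3 matrix of cosine *similarities*: the main diagonal is 1.0,
# similar documents score close to 1 and unrelated ones close to 0.
print(cosine_similarity(doc_term_matrix))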