DO U KNOW?
1. BERTScore
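BERTScore evaluates a candidate sentence against a reference using contextual embeddings from a pretrained BERT-family model, so paraphrases can score well even without exact token overlap. A minimal sketch using the `bert-score` package (`pip install bert-score`); the sentence pair below is made up for illustration:

```python
from bert_score import score

# Hypothetical candidate/reference pair, reusing sentences from the
# TF-IDF corpus below purely for illustration.
candidates = ['this is the first document']
references = ['is this the first document']

# Returns precision, recall, and F1 tensors, one entry per pair;
# lang='en' selects a default English model.
P, R, F1 = score(candidates, references, lang='en', verbose=False)
print(f"BERTScore F1: {F1.item():.4f}")
```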
2. TF-IDF: Principle and Implementation

```python
import math
from collections import Counter

corpus = ['this is the first document',
          'this is the second second document',
          'and the third one',
          'is this the first document']

# Tokenize: split each document on spaces
words_list = list()
for i in range(len(corpus)):
    words_list.append(corpus[i].split(' '))
words_list
# [['this', 'is', 'the', 'first', 'document'],
#  ['this', 'is', 'the', 'second', 'second', 'document'],
#  ['and', 'the', 'third', 'one'],
#  ['is', 'this', 'the', 'first', 'document']]

# Count how often each word appears within each document
count_list = list()
for i in range(len(words_list)):
    count = Counter(words_list[i])
    count_list.append(count)
count_list
# [Counter({'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1}),
#  Counter({'this': 1, 'is': 1, 'the': 1, 'second': 2, 'document': 1}),
#  Counter({'and': 1, 'the': 1, 'third': 1, 'one': 1}),
#  Counter({'is': 1, 'this': 1, 'the': 1, 'first': 1, 'document': 1})]

def tf(word, count):
    # Term frequency: occurrences of `word` / total words in the document
    return count[word] / sum(count.values())

def idf(word, count_list):
    # n_contain: number of documents containing `word`;
    # the +1 in the denominator is plus-one smoothing (see below)
    n_contain = sum([1 for count in count_list if word in count])
    return math.log(len(count_list) / (1 + n_contain))

def tf_idf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

for i, count in enumerate(count_list):
    print(f"Document {i+1}")
    tf_score = {word: tf(word, count) for word in count}
    print('tf :', tf_score)
    idf_score = {word: idf(word, count_list) for word in count}
    print('idf :', idf_score)
    scores = {word: tf_idf(word, count, count_list) for word in count}
    print('tf-idf :', scores)
    sorted_word = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_word:
        print(f"\tword: {word}, TF-IDF: {round(score, 5)}")
```
Output:

```text
Document 1
tf : {'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}
idf : {'this': 0.0, 'is': 0.0, 'the': -0.2231435513142097, 'first': 0.28768207245178085, 'document': 0.0}
tf-idf : {'this': 0.0, 'is': 0.0, 'the': -0.044628710262841945, 'first': 0.05753641449035617, 'document': 0.0}
    word: first, TF-IDF: 0.05754
    word: this, TF-IDF: 0.0
    word: is, TF-IDF: 0.0
    word: document, TF-IDF: 0.0
    word: the, TF-IDF: -0.04463
Document 2
tf : {'this': 0.16666666666666666, 'is': 0.16666666666666666, 'the': 0.16666666666666666, 'second': 0.3333333333333333, 'document': 0.16666666666666666}
idf : {'this': 0.0, 'is': 0.0, 'the': -0.2231435513142097, 'second': 0.6931471805599453, 'document': 0.0}
tf-idf : {'this': 0.0, 'is': 0.0, 'the': -0.03719059188570162, 'second': 0.23104906018664842, 'document': 0.0}
    word: second, TF-IDF: 0.23105
    word: this, TF-IDF: 0.0
    word: is, TF-IDF: 0.0
    word: document, TF-IDF: 0.0
    word: the, TF-IDF: -0.03719
Document 3
tf : {'and': 0.25, 'the': 0.25, 'third': 0.25, 'one': 0.25}
idf : {'and': 0.6931471805599453, 'the': -0.2231435513142097, 'third': 0.6931471805599453, 'one': 0.6931471805599453}
tf-idf : {'and': 0.17328679513998632, 'the': -0.05578588782855243, 'third': 0.17328679513998632, 'one': 0.17328679513998632}
    word: and, TF-IDF: 0.17329
    word: third, TF-IDF: 0.17329
    word: one, TF-IDF: 0.17329
    word: the, TF-IDF: -0.05579
Document 4
tf : {'is': 0.2, 'this': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}
idf : {'is': 0.0, 'this': 0.0, 'the': -0.2231435513142097, 'first': 0.28768207245178085, 'document': 0.0}
tf-idf : {'is': 0.0, 'this': 0.0, 'the': -0.044628710262841945, 'first': 0.05753641449035617, 'document': 0.0}
    word: first, TF-IDF: 0.05754
    word: is, TF-IDF: 0.0
    word: this, TF-IDF: 0.0
    word: document, TF-IDF: 0.0
    word: the, TF-IDF: -0.04463
```
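For comparison, scikit-learn's `TfidfVectorizer` covers the same pipeline in a few lines. A sketch; note that its defaults differ from the hand-rolled version above (IDF is $\ln\frac{1+N}{1+n_t}+1$ and each row is L2-normalized), so the numbers will not match exactly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['this is the first document',
          'this is the second second document',
          'and the third one',
          'is this the first document']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix, one row per document
print(vectorizer.get_feature_names_out())  # vocabulary in column order
print(X.toarray().round(3))                # dense TF-IDF matrix
```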
3. [TF explained]
Within one sentence, the ratio of the number of occurrences of word $t$ to the total number of words in that sentence:
$$
{\rm TF}(t)=\frac{n_t}{N_{word}}
$$
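As a quick check against the output above: in Document 1, 'first' occurs once among five words, matching the printed tf of 0.2:
$$
{\rm TF}(\text{first})=\frac{1}{5}=0.2
$$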
4. [IDF explained]
Across all sentences, the more documents word $t$ appears in, the lower its IDF score, meaning common words get low IDF; the fewer documents it appears in, the higher its IDF score, meaning rare words get high IDF. Here $N$ is the total number of documents (sentences) and $n_t$ is the number of documents containing $t$:
$$
{\rm IDF}(t)=\log(\frac{N}{n_t+1})
$$
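Checking against the corpus above ($N=4$): 'first' appears in two documents and 'the' in all four, which reproduces the printed values and shows why 'the' ends up with a negative TF-IDF under this smoothed formula:
$$
{\rm IDF}(\text{first})=\log\frac{4}{2+1}\approx 0.2877,\qquad
{\rm IDF}(\text{the})=\log\frac{4}{4+1}\approx -0.2231
$$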
Why add one? This is plus-one smoothing, used to handle unknown words: when a word outside the vocabulary is encountered, $n_t=0$, so the +1 keeps the denominator from being zero.
Without smoothing, IDF can be written as:
$$
{\rm IDF}(t)=-\log(\frac{n_t}{N})
$$
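Since $-\log(n_t/N)=\log(N/n_t)$, this is the same ratio without the +1; on the corpus above it gives 'the' (present in all four documents) an IDF of exactly zero rather than a negative value:
$$
{\rm IDF}(\text{the})=-\log\frac{4}{4}=0
$$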