Noonisy
Do U Know(2022-11-19)



1.BERTScore

2.TF-IDF: principle and implementation
corpus = ['this is the first document',
        'this is the second second document',
        'and the third one',
        'is this the first document']
words_list = list()
for i in range(len(corpus)):
    words_list.append(corpus[i].split(' '))
words_list
# [['this', 'is', 'the', 'first', 'document'],
#  ['this', 'is', 'the', 'second', 'second', 'document'],
#  ['and', 'the', 'third', 'one'],
#  ['is', 'this', 'the', 'first', 'document']]

import math
from collections import Counter
count_list = list()
for i in range(len(words_list)):
    count = Counter(words_list[i])
    count_list.append(count)
count_list
# [Counter({'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1}),
#  Counter({'this': 1, 'is': 1, 'the': 1, 'second': 2, 'document': 1}),
#  Counter({'and': 1, 'the': 1, 'third': 1, 'one': 1}),
#  Counter({'is': 1, 'this': 1, 'the': 1, 'first': 1, 'document': 1})]

def tf(word, count):
    return count[word] / sum(count.values())

def idf(word, count_list):
    n_contain = sum([1 for count in count_list if word in count])
    return math.log(len(count_list) / (1 + n_contain))

def tf_idf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

for i, count in enumerate(count_list):
    print(f"Document {i+1}")
    tf_score = {word: tf(word, count) for word in count}
    print('tf :', tf_score)
    idf_score = {word: idf(word, count_list) for word in count}
    print('idf :', idf_score)
    scores = {word: tf_idf(word, count, count_list) for word in count}
    print('tf-idf :', scores)
    sorted_word = sorted(scores.items(), key=lambda x:x[1], reverse=True)
    for word, score in sorted_word:
        print(f"\tword: {word}, TF-IDF: {round(score, 5)}")

# ================
Document 1
tf : {'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}
idf : {'this': 0.0, 'is': 0.0, 'the': -0.2231435513142097, 'first': 0.28768207245178085, 'document': 0.0}
tf-idf : {'this': 0.0, 'is': 0.0, 'the': -0.044628710262841945, 'first': 0.05753641449035617, 'document': 0.0}
    word: first, TF-IDF: 0.05754
    word: this, TF-IDF: 0.0
    word: is, TF-IDF: 0.0
    word: document, TF-IDF: 0.0
    word: the, TF-IDF: -0.04463
Document 2
tf : {'this': 0.16666666666666666, 'is': 0.16666666666666666, 'the': 0.16666666666666666, 'second': 0.3333333333333333, 'document': 0.16666666666666666}
idf : {'this': 0.0, 'is': 0.0, 'the': -0.2231435513142097, 'second': 0.6931471805599453, 'document': 0.0}
tf-idf : {'this': 0.0, 'is': 0.0, 'the': -0.03719059188570162, 'second': 0.23104906018664842, 'document': 0.0}
    word: second, TF-IDF: 0.23105
    word: this, TF-IDF: 0.0
    word: is, TF-IDF: 0.0
    word: document, TF-IDF: 0.0
    word: the, TF-IDF: -0.03719
Document 3
tf : {'and': 0.25, 'the': 0.25, 'third': 0.25, 'one': 0.25}
idf : {'and': 0.6931471805599453, 'the': -0.2231435513142097, 'third': 0.6931471805599453, 'one': 0.6931471805599453}
tf-idf : {'and': 0.17328679513998632, 'the': -0.05578588782855243, 'third': 0.17328679513998632, 'one': 0.17328679513998632}
    word: and, TF-IDF: 0.17329
    word: third, TF-IDF: 0.17329
    word: one, TF-IDF: 0.17329
    word: the, TF-IDF: -0.05579
Document 4
tf : {'is': 0.2, 'this': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}
idf : {'is': 0.0, 'this': 0.0, 'the': -0.2231435513142097, 'first': 0.28768207245178085, 'document': 0.0}
tf-idf : {'is': 0.0, 'this': 0.0, 'the': -0.044628710262841945, 'first': 0.05753641449035617, 'document': 0.0}
    word: first, TF-IDF: 0.05754
    word: is, TF-IDF: 0.0
    word: this, TF-IDF: 0.0
    word: document, TF-IDF: 0.0
    word: the, TF-IDF: -0.04463
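The per-word scores above can be assembled into fixed-length document vectors over a shared vocabulary, which is the usual next step before measuring document similarity. A minimal sketch reusing the same tf/idf/tf_idf definitions (the sorted vocabulary ordering and the cosine helper are my additions, not part of the code above):

```python
import math
from collections import Counter

corpus = ['this is the first document',
          'this is the second second document',
          'and the third one',
          'is this the first document']
count_list = [Counter(doc.split(' ')) for doc in corpus]
# shared vocabulary across all documents, in a fixed (sorted) order
vocab = sorted({word for count in count_list for word in count})

def tf(word, count):
    return count[word] / sum(count.values())

def idf(word, count_list):
    n_contain = sum(1 for count in count_list if word in count)
    return math.log(len(count_list) / (1 + n_contain))

def tf_idf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# one vector per document; 0.0 for vocabulary words absent from the document
vectors = [[tf_idf(w, count, count_list) if w in count else 0.0 for w in vocab]
           for count in count_list]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# documents 1 and 4 contain exactly the same words, so their vectors coincide
print(round(cosine(vectors[0], vectors[3]), 4))  # ≈ 1.0
print(round(cosine(vectors[0], vectors[2]), 4))  # much lower: only 'the' is shared
```

Note that scikit-learn's TfidfVectorizer uses a slightly different IDF formula and L2-normalizes each row, so its numbers will not match these exactly.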
3.[TF explained]
Within one sentence, TF of a word $t$ is the number of times $t$ occurs divided by the total number of words in that sentence:
$$ {\rm TF}(t)=\frac{n_t}{N_{word}} $$
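A quick numeric check of this formula on the second corpus document (a minimal sketch; `tf` is the same function defined in the implementation above):

```python
from collections import Counter

def tf(word, count):
    # term frequency: occurrences of `word` over total tokens in the document
    return count[word] / sum(count.values())

count = Counter('this is the second second document'.split(' '))
print(tf('second', count))  # 2 occurrences / 6 tokens ≈ 0.3333
```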
4.[IDF explained]
Across all the sentences, the more sentences a word $t$ appears in, the lower its IDF score, so common words get low IDF; the fewer sentences it appears in, the higher its IDF, so rare words get high IDF. $N$ is the total number of documents (sentences):
$$ {\rm IDF}(t)=\log(\frac{N}{n_t+1}) $$
Why the +1? This is plus-one smoothing, used to handle unknown words: for a word that appears in no document (i.e. is outside the vocabulary), the denominator would otherwise be zero, so one is added.

If smoothing is not wanted, it can be written as:
$$ {\rm IDF}(t)=-\log(\frac{n_t}{N}) $$
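The effect of the +1 is easy to see on the corpus above. A minimal sketch comparing the smoothed and unsmoothed variants (the names `idf_smooth` and `idf_raw` are mine, not from the implementation above); note that under smoothing, a word contained in every document, like 'the', gets a negative IDF, which is why 'the' ends up with negative TF-IDF in the output above:

```python
import math
from collections import Counter

corpus = ['this is the first document',
          'this is the second second document',
          'and the third one',
          'is this the first document']
count_list = [Counter(doc.split(' ')) for doc in corpus]
N = len(count_list)

def n_contain(word):
    # number of documents containing `word`
    return sum(1 for count in count_list if word in count)

def idf_smooth(word):
    # IDF(t) = log(N / (n_t + 1)): defined even for unseen words
    return math.log(N / (1 + n_contain(word)))

def idf_raw(word):
    # IDF(t) = -log(n_t / N): undefined (log of zero) for unseen words
    return -math.log(n_contain(word) / N)

print(idf_smooth('the'))     # log(4/5)  ≈ -0.2231 (negative: 'the' is in every doc)
print(idf_raw('the'))        # -log(4/4) =  0.0
print(idf_smooth('first'))   # log(4/3)  ≈  0.2877
print(idf_raw('first'))      # -log(2/4) ≈  0.6931
print(idf_smooth('unseen'))  # log(4/1)  ≈  1.3863; idf_raw would raise here
```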
Last edited: 2023-04-03 16:11