
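Natural language processing turns human language into data a computer can work with. Computers operate on vectors and matrices, so text eventually has to be expressed in that form.

NLTK is a natural language processing library. It ships with corpora and provides part-of-speech tagging, classification, tokenization, and related tools.

There are also simpler wrappers around it, such as TextBlob.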

import nltk
nltk.download()  # opens the downloader for corpora and other resources

# A built-in corpus
from nltk.corpus import brown
brown.categories()
len(brown.sents())  # number of sentences
len(brown.words())  # number of words

I. A simple text preprocessing pipeline

1. Tokenization: split a long sentence into small, meaningful pieces.

sentence = "hello world"
nltk.word_tokenize(sentence)

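NLTK's tokenizer does not work for Chinese. English words are separated by spaces, but Chinese has no such delimiter, and splitting it into single characters is meaningless. For example, 今天天气不错！ ("the weather is nice today!") should be segmented as 今天 / 天气 / 不错 / ！

There are two approaches to Chinese word segmentation:

1. Heuristic: for example longest match, with a dictionary as the word list. The dictionary contains 今天 but nothing as long as 今天天, so 今天 is taken as one word.

2. Machine learning / statistical methods: HMM, CRF (for example Stanford CoreNLP).

A widely used Chinese segmenter is jieba.

Once the text has been segmented, pass the resulting tokens to NLTK.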
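The longest-match heuristic described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how jieba or CoreNLP work internally; the function name `forward_max_match`, the `max_len` limit, and the three-word toy dictionary are all assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary has it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary (an assumption for illustration, not a real lexicon).
toy_dictionary = {"今天", "天气", "不错"}
print(forward_max_match("今天天气不错！", toy_dictionary))
```

Because 今天天 is not in the dictionary, the scan falls back to 今天 and continues from the next character. A real segmenter needs a much larger lexicon and a way to resolve ambiguous splits, which is where the statistical methods come in.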

Tokenizing social media text: it contains emoticons, URLs, #hashtags, and @mentions, so it needs regular-expression preprocessing before a standard tokenizer sees it.
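As a rough sketch of that kind of preprocessing, the pattern below keeps URLs, @mentions, #hashtags, and a handful of ASCII emoticons as single tokens. The pattern, the name `tokenize_social`, and the sample text are assumptions chosen for illustration; real social media text needs a broader emoticon and emoji list.

```python
import re

# Order matters: URLs, mentions, hashtags and emoticons are tried before plain words.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+            # URLs
    | @\w+                  # @mentions
    | \#\w+                 # hashtags
    | [:;=][-o]?[)(DPp]     # a few simple ASCII emoticons
    | \w+(?:'\w+)?          # words, optionally with an apostrophe
    | [^\w\s]               # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize_social(text):
    """Return the tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


print(tokenize_social("@alice loved it :) see https://example.com #nlp"))
```

A plain word tokenizer would typically break `:)` and the URL into several pieces, which is why these patterns are matched first.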

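2. Part-of-speech tagging: nltk.pos_tag(text), where text is the token list produced by the tokenizer. The part of speech is the role a word plays in the sentence, e.g. adj, adv, det (words like the, a).

3. Stemming reduces a word to its stem, e.g. walking to walk.

Lemmatization (with a POS tag) normalizes word forms. Given the part of speech, is, am, and are all become be, and went becomes go.

4. Stop words: words such as he and the carry little meaning and are simply removed.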

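from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once instead of on every lookup
[word for word in word_list if word not in stop_words]  # word_list: the tokenized words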

[Figure 1: the preprocessing pipeline]

[Figure 2: worked example on "life is like a box of chocolate"]

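II. Vectorization

Classic NLP applications of NLTK: 1. sentiment analysis, 2. text similarity, 3. text classification (the most common, e.g. news classification).

1. Sentiment analysis

The simplest approach is a sentiment dictionary.

Each word in the dictionary has a positive or negative score, e.g. like +1, good +2, bad -2, terrible -3. Score every word in a sentence, add the scores up, and look at the sign of the total.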

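sentiment_dictionary = {}
for line in open('*.txt'):  # the sentiment word list, one "word<TAB>score" pair per line
    word, score = line.split('\t')
    sentiment_dictionary[word] = int(score)
# words: the tokenized sentence. A word in the dictionary contributes its score, any other word 0.
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)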

# Some negative comments are sarcastic (a hater posing as a fan), so word scores alone are not enough; combine them with ML
from nltk.classify import NaiveBayesClassifier

# A tiny hand-made training set
s1 = 'this is a good book'
s2 = 'this is an awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'

def preprocess(s):
    # Split the sentence into words with split(); the example is simple enough that no tokenizer is needed.
    # Features are {fname: fval}. True is the simplest fval; a more advanced setup could derive fval from word2vec.
    return {word: True for word in s.lower().split()}

# Conceptually, training sees one long word list: this is terrible good awesome bad book,
# so s1 corresponds to the vector (1, 1, 0, 1, 0, 0, 1).
training_data = [(preprocess(s1), 'pos'),
                 (preprocess(s2), 'pos'),
                 (preprocess(s3), 'neg'),
                 (preprocess(s4), 'neg')]
model = NaiveBayesClassifier.train(training_data)
print(model.classify(preprocess('this is a good book')))

2. Text similarity

Turn each text into a vector of the same length, then measure similarity with cosine similarity.

NLTK's FreqDist counts how often each word occurs.
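A minimal sketch of that idea: count words over a shared vocabulary so both texts become vectors of the same length, then take the cosine of the angle between them. `collections.Counter` stands in for FreqDist here so the example runs without NLTK; the helper names `bag_of_words` and `cosine_similarity` and the two sample sentences are assumptions for illustration.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count each vocabulary word in the text, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


s1 = "this is a good book"
s2 = "this is a bad book"
# The shared vocabulary fixes the length and ordering of both vectors.
vocabulary = sorted(set(s1.split()) | set(s2.split()))
v1 = bag_of_words(s1, vocabulary)
v2 = bag_of_words(s2, vocabulary)
print(cosine_similarity(v1, v2))
```

The two sentences share four of their five words, so the similarity comes out at about 0.8. Raw counts treat good and bad as just two different words, which is one reason weighted schemes such as TF-IDF are often preferred.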

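3. Text classification

TF-IDF

TF, Term Frequency: how frequently a term occurs in a document.

TF(t) = (number of times t appears in the document) / (total number of terms in the document)

IDF, Inverse Document Frequency: a measure of how important a term is. Words like is and the are not important.

It raises the weight of rare terms.

IDF(t) = log_e(total number of documents / number of documents containing t)

TF-IDF = TF × IDF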
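The three formulas above translate almost directly into code. The sketch below is a plain-Python rendering under the assumption that each document is already a list of tokens; the function names and the three toy documents are chosen for illustration only.

```python
import math


def tf(term, document):
    """Occurrences of term divided by the total number of terms in the document."""
    return document.count(term) / len(document)


def idf(term, documents):
    """Natural log of (number of documents / number of documents containing term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term seen in no document gets zero weight
    return math.log(len(documents) / containing)


def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)


documents = [
    "this is sentence one".split(),
    "this is sentence two".split(),
    "this is sentence three".split(),
]
common = tf_idf("this", documents[0], documents)  # appears in every document
rare = tf_idf("one", documents[0], documents)     # appears in one document only
print(common, rare)
```

A word that occurs in every document gets IDF = ln(3/3) = 0, so its TF-IDF is 0 no matter how often it appears, while a word found in a single document gets 0.25 × ln(3).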

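from nltk.text import TextCollection

# First, put all the documents into a TextCollection,
# which does the counting and the TF-IDF arithmetic for you.
# Each document is passed as a list of tokens (split() is enough for this toy example).
docs = ['this is sentence one'.split(),
        'this is sentence two'.split(),
        'this is sentence three'.split()]
corpus = TextCollection(docs)

# TF-IDF is then a single call
# (term: a term in a sentence, text: that sentence as a token list)
print(corpus.tf_idf('this', 'this is sentence four'.split()))
# 'this' occurs in all three documents, so IDF = ln(3/3) = 0 and the score is 0.0

# In the same way, how do we get a fixed-size vector that can represent any sentence?
# standard_vocab was left undefined in the original; it is assumed here to be every distinct word in the corpus.
standard_vocab = sorted({word for doc in docs for word in doc})
# For each new sentence
new_sentence = 'this is sentence five'.split()
# loop over every word in the vocabulary:
for word in standard_vocab:
    print(corpus.tf_idf(word, new_sentence))
# The result is one very long vector (its length equals the vocabulary size)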


Python Text Processing: NLTK Basics
