In natural language processing, converting human expression into something a computer can work with is a hard problem. The same meaning can be phrased in many ways; grammar, sentence patterns, and synonyms all affect the computer's judgment. The nltk module provides functions for extracting the important information from a document so it can be analyzed.
installation
pip install nltk
tokenize (splitting text into words)
import nltk
from nltk import word_tokenize
nltk.download('punkt')  # tokenizer models, only needed once
text = 'Rock is a smart guy.'
words = word_tokenize(text)
['Rock', 'is', 'a', 'smart', 'guy', '.']
Filter out punctuation
[word for word in words if word.isalpha()]
['Rock', 'is', 'a', 'smart', 'guy']
Filter out common words that carry little meaning (stopwords)
from nltk.corpus import stopwords
nltk.download('stopwords')  # stopword lists, only needed once
stops = set(stopwords.words('english'))
# lists for other languages are also bundled, e.g. stopwords.words('chinese')
clean_words = [word.lower() for word in words if word.isalpha() and word.lower() not in stops]
['rock', 'smart', 'guy']
Convert words to their stems
text = 'Rock is a smart guy. he likes playing cards.'
words = word_tokenize(text)
clean_words = [word for word in words if word.isalpha() and word not in stops]
from nltk.stem import PorterStemmer
porter = PorterStemmer()
stem_words = [porter.stem(word) for word in clean_words]  # stem() also lowercases by default
['rock', 'smart', 'guy', 'like', 'play', 'card']