In natural language processing, converting human expression into something a computer can work with is a hard problem. The same meaning can be phrased in many ways; grammar, sentence patterns, and synonyms all affect the computer's judgment. The nltk module provides functions for extracting the important information from a document so it can be analyzed.
installation
pip install nltk
tokenize (splitting text into words)
import nltk
from nltk import word_tokenize
nltk.download('punkt')  # tokenizer models, only needed once
text = 'Rock is a smart guy.'
words = word_tokenize(text)
['Rock', 'is', 'a', 'smart', 'guy', '.']
Filter out punctuation
[word for word in words if word.isalpha()]
['Rock', 'is', 'a', 'smart', 'guy']
Filter out common words that carry little meaning (stopwords)
from nltk.corpus import stopwords
nltk.download('stopwords')  # stopword lists, only needed once
stops = set(stopwords.words('english'))
# lists for other languages are also bundled, e.g. stopwords.words('chinese')
clean_words = [word.lower() for word in words if word.isalpha() and word.lower() not in stops]
['rock', 'smart', 'guy']
Convert words to their stems
text = 'Rock is a smart guy. he likes playing cards.'
words = word_tokenize(text)
clean_words = [word for word in words if word.isalpha() and word not in stops]
from nltk.stem import PorterStemmer
porter = PorterStemmer()
stem_words = [porter.stem(word) for word in clean_words]  # stem() also lowercases by default
['rock', 'smart', 'guy', 'like', 'play', 'card']