[Natural Language Processing] NLTK Library Study Notes (1)

八怪不姓丑 2022-05-01

Sentence Splitting (Sentence Tokenize)

NLTK's word tokenizer operates at the sentence level, so for a document you first split the text into sentences, and then tokenize each sentence into words.

from nltk.tokenize import sent_tokenize

text = """Hello Mr. Smith, how are you doing today? The weather is great, and 
city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard"""

# "Mr." is recognized as an abbreviation, so it does not end a sentence
tokenized_text = sent_tokenize(text)
print(tokenized_text)
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and \ncity is awesome.The sky is pinkish-blue.', "You shouldn't eat cardboard"]
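
Note that "awesome.The sky is pinkish-blue." came back as a single sentence: there is no space after the period, so the tokenizer does not treat it as a sentence boundary. A minimal sketch (assuming the punkt model is installed) showing that restoring the space restores the split:

from nltk.tokenize import sent_tokenize

# hypothetical corrected input: a space added after "awesome."
fixed = "The weather is great, and city is awesome. The sky is pinkish-blue."
print(sent_tokenize(fixed))
# expected: ['The weather is great, and city is awesome.', 'The sky is pinkish-blue.']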

Word Splitting (Word Tokenize)

import nltk

sent = "Study hard and improve every day."
token = nltk.word_tokenize(sent)
print(token)
['Study', 'hard', 'and', 'improve', 'every', 'day', '.']
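
Putting the two steps together as described above: first split a document into sentences, then tokenize each sentence. A minimal sketch reusing text from the earlier example:

from nltk.tokenize import sent_tokenize, word_tokenize

doc = "Hello Mr. Smith, how are you doing today? The weather is great."
tokens_per_sentence = [word_tokenize(s) for s in sent_tokenize(doc)]
print(tokens_per_sentence)
# expected: [['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'],
#            ['The', 'weather', 'is', 'great', '.']]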

Removing Punctuation

Note:

# all the ASCII punctuation characters
import string
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Removing punctuation:

For English:

import string

s = 'a,sbch.:usx/'
# map every punctuation character to a space
S = s.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
print(S)
a sbch  usx 
import string

stri = 'today is friday, so happy..!!!'
punctuation_string = string.punctuation

print("All English punctuation:", punctuation_string)
# strip the punctuation characters one by one
for i in punctuation_string:
    stri = stri.replace(i, '')
print(stri)
All English punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
today is friday so happy
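
When working with NLTK tokens, it is often simpler to drop punctuation tokens from the token list than to scrub the raw string. A minimal sketch (this handles single-character punctuation tokens only; a run such as '...' would need an extra check):

import string
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Study hard and improve every day.")
no_punct = [t for t in tokens if t not in string.punctuation]
print(no_punct)
# expected: ['Study', 'hard', 'and', 'improve', 'every', 'day']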

And if the text is Chinese:

from zhon.hanzi import punctuation

s = '今天周五,下班了,好开心呀!!'   # use s rather than shadowing the built-in str
punctuation_str = punctuation
print("All Chinese punctuation:", punctuation_str)
for i in punctuation:
    s = s.replace(i, '')
print(s)

All Chinese punctuation: "#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·!?。。
今天周五下班了好开心呀
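
To strip English and Chinese punctuation in one pass, the two character sets can be combined in a single translation table. A minimal sketch (the three-argument form of str.maketrans maps the listed characters to None):

import string
from zhon.hanzi import punctuation as zh_punct

mixed = 'Hello, world!今天周五,好开心呀!'
table = str.maketrans('', '', string.punctuation + zh_punct)
print(mixed.translate(table))
# expected: Hello world今天周五好开心呀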

The string module

When a string-manipulation task starts to feel overly complicated, try the string module; it ships many useful constants.

>>> import string
>>> string.punctuation        # all punctuation characters
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> string.whitespace         # all whitespace characters
' \t\n\r\x0b\x0c'
>>> string.ascii_uppercase    # all uppercase letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.ascii_lowercase    # all lowercase letters
'abcdefghijklmnopqrstuvwxyz'
>>> string.hexdigits          # all hexadecimal digits
'0123456789abcdefABCDEF'

The zhon library

Study notes: an introduction to the zhon library, how to install it, and how to use it.

import re
import zhon.hanzi

# zhon.hanzi.sentence is a regular expression that matches a whole Chinese sentence
rst = re.findall(zhon.hanzi.sentence, '我买了一辆车。妈妈做的菜,很好吃!')
print(rst)
['我买了一辆车。', '妈妈做的菜,很好吃!']

Removing Stop Words

Installing the NLTK corpora

A detailed guide covers the installation step by step; in short, launch the interactive downloader:

import nltk
nltk.download()
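
The examples in this post only need two packages, so they can also be fetched directly instead of through the full interactive download. A minimal sketch ('punkt' backs sent_tokenize/word_tokenize, 'stopwords' backs the stop-word lists):

import nltk

nltk.download('punkt')      # models used by sent_tokenize / word_tokenize
nltk.download('stopwords')  # stop-word lists, including English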

[Screenshot: the NLTK downloader window]

Pitfalls when downloading NLTK_DATA

After much persistence it finally worked (others will likely run into the same problems):

OSError: No such file or directory: 'C:\Users\2019\AppData\Roaming\nltk_data\corpora\stopword

Solution: point NLTK at a directory that actually contains the data, as sketched below.
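
A minimal sketch of the usual fix, assuming the error means the stopwords corpus is missing or NLTK is searching the wrong directory: download into an explicit directory and register that directory on nltk.data.path.

import nltk

# hypothetical local directory; substitute your own path
data_dir = r'C:\nltk_data'

nltk.download('stopwords', download_dir=data_dir)  # fetch the corpus explicitly
nltk.data.path.append(data_dir)                    # make NLTK search this directory too

from nltk.corpus import stopwords
print(stopwords.words('english')[:5])              # sanity check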

Tokenizing with split() vs. word_tokenize
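
A plain str.split() is the simplest way to tokenize, but it leaves punctuation attached to the words. A minimal sketch:

text = "Hello Mr. Smith, how are you doing today?"
print(text.split())
# expected: ['Hello', 'Mr.', 'Smith,', 'how', 'are', 'you', 'doing', 'today?']

word_tokenize separates punctuation into its own tokens, which lines up better with the stop-word list: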

import nltk
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. """
word_tokens = nltk.tokenize.word_tokenize(text.strip())
# keep only tokens that are not English stop words
filtered_word = [w for w in word_tokens if w not in stop_words]
print(filtered_word)
['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.']
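
Note that 'The' survives the filter because the stop-word list is all lowercase. A minimal sketch of a case-insensitive variant, continuing from the block above:

stop_set = set(stopwords.words("english"))  # set membership tests are faster than a list
filtered_ci = [w for w in word_tokens if w.lower() not in stop_set]
print(filtered_ci)
# expected: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'weather', 'great', ',', 'city', 'awesome', '.']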