Splitting Every Word in a File into Individual Letters with Python

一只1994, 2022-10-03
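  1. Before a character-based language model can be built, the words in the corpus need to be split up: each letter is pulled out on its own and stored in a separate file for later use;
  2. The Python script below performs the splitting: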

# -*- coding: UTF-8 -*-
import string
import sys


def SplitIntoCharacters(sourceFilePath, outputFileName):
    # Non-ASCII (mostly Chinese) punctuation to drop; the ASCII characters in
    # this string are already covered by string.punctuation.
    chn_punctuations = r"""!?。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."""
    # Both files are assumed to be UTF-8. The output file is opened in append
    # mode, so repeated runs add to it instead of overwriting it.
    with open(sourceFilePath, encoding="utf-8") as sourceFile, \
            open(outputFileName, "a", encoding="utf-8") as newFile:
        # split() discards all whitespace, leaving only the words
        for word in sourceFile.read().split():
            for character in word:
                isPunct = character in string.punctuation or character in chn_punctuations
                if not isPunct:
                    # write one lowercased character per line
                    newFile.write(character.lower() + "\n")
    print("done!")


if __name__ == "__main__":
    # print('args list:', str(sys.argv))
    # Both arguments must be present and non-blank
    if len(sys.argv) != 3 or not sys.argv[1].strip() or not sys.argv[2].strip():
        print("Error: Source file path or the output file name is empty")
    else:
        SplitIntoCharacters(sys.argv[1], sys.argv[2])

# by Alexander Enharjan
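
To make the filtering rule easier to check by hand, here is a minimal sketch of the same logic applied to an in-memory string instead of a file. The function name `split_text_into_characters` and the shortened `EXTRA_PUNCTUATION` set are assumptions made for this illustration and are not part of the original script.

```python
import string

# Hypothetical, shortened punctuation set used only for this illustration;
# extend it with whatever non-ASCII punctuation your corpus contains.
EXTRA_PUNCTUATION = "。、「」『』【】–—‘’“”…"


def split_text_into_characters(text, extra_punctuation=EXTRA_PUNCTUATION):
    """Return the lowercased non-punctuation characters of text, in order."""
    characters = []
    for word in text.split():  # split() drops spaces, tabs and newlines
        for character in word:
            if character in string.punctuation or character in extra_punctuation:
                continue  # skip ASCII and extra punctuation
            characters.append(character.lower())
    return characters


if __name__ == "__main__":
    print(split_text_into_characters("Hello, World!"))
    # → ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
```

Note that digits pass through this filter, because `string.punctuation` does not contain them. If the model should see letters only, an additional `character.isalpha()` check would be one way to drop them.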

  3. Usage:

python3 wordSpliter (INPUT_FILE_PATH) (OUTPUT_FILE_PATH)
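
Once the output file exists, a character-based model would typically start from character counts. The following is a rough sketch of that next step, assuming the file holds one character per line in UTF-8 as the script writes it. The function names `count_characters` and `unigram_probabilities` are hypothetical, and the original post does not say how the file is actually consumed.

```python
from collections import Counter


def count_characters(character_file_path):
    """Count how often each character occurs in a one-character-per-line file."""
    counts = Counter()
    with open(character_file_path, encoding="utf-8") as character_file:
        for line in character_file:
            character = line.rstrip("\n")
            if character:  # skip blank lines, if any
                counts[character] += 1
    return counts


def unigram_probabilities(counts):
    """Turn raw counts into relative frequencies (a unigram character model)."""
    total = sum(counts.values())
    return {character: count / total for character, count in counts.items()}
```

For example, `count_characters("chars.txt").most_common(5)` would list the five most frequent characters, where `chars.txt` stands in for whatever output path was passed to the script.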


Author: 艾孜尔江·艾尔斯兰

Please be sure to credit the source when reposting!


