A detail that improves BERT fine-tuning accuracy by 0.2%-0.3%

夹胡碰


2022-07-27


Tokenize at the character level. Don't use the official tokenizer (https://github.com/google-research/bert/blob/master/tokenization.py);

write your own instead:

def tokenize_to_str_list(textString):
    # Split the input string into a list of single characters.
    split_tokens = []
    for ch in textString:
        split_tokens.append(ch)
    return split_tokens

def convert_to_int_list(split_tokens):
    # Map each character to its vocab id via the module-level char2id dict
    # (character -> id, e.g. built from BERT's vocab.txt).
    output = []
    for token in split_tokens:
        if token in char2id:
            output.append(char2id[token])
        # Characters missing from the vocab are silently dropped here;
        # mapping them to the [UNK] id instead is a common alternative.
    return output
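For completeness, here is a minimal sketch of how char2id might be built and how the two functions fit together. The vocab path is an assumption; substitute the vocab.txt that ships with your BERT checkpoint (in the Chinese checkpoints, each character has its own entry).

# Build char2id from a BERT vocab.txt (one token per line); the path below
# is a placeholder for whichever checkpoint you are fine-tuning.
char2id = {}
with open("chinese_L-12_H-768_A-12/vocab.txt", encoding="utf-8") as f:
    for idx, line in enumerate(f):
        char2id[line.rstrip("\n")] = idx

tokens = tokenize_to_str_list("今天天气不错")
ids = convert_to_int_list(tokens)
print(tokens)  # ['今', '天', '天', '气', '不', '错']
print(ids)     # the corresponding vocab ids

Note that the model input still needs the usual [CLS]/[SEP] tokens and input masks around these ids, exactly as with the official pipeline; only the tokenization step changes.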

