大数据面试题之Presto[Trino](4)-CFANZ编程社区

1. 文本解码原理--以MindNLP为例

1.1 自回归语言模型

根据前文预测下一个单词

一个文本序列的概率分布可以分解为每个词基于其上文的条件概率的乘积

$W_0$ :初始上下文单词序列
$t$ : 时间步
当生成EOS标签时，停止生成。
MindNLP/huggingface Transformers提供的文本生成方法

1.2 环境准备

!pip uninstall mindvision -y
!pip uninstall mindinsight -y
!pip install mindnlp

1.3 Greedy search（贪心搜索）

在每个时间步𝑡都简单地选择概率最高的词作为当前输出词:

$w_t$ = $argmax_w$ 𝑃( $w$ | $w$ _{$1 : t - 1$})

按照贪心搜索输出序列(“The”,“nice”,“woman”) 的条件概率为：0.5 x 0.4 = 0.2

缺点: 错过了隐藏在低概率词后面的高概率词，如：dog=0.5, has=0.9

# 导入GPT2模型的分词器和模型
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

# 初始化分词器，并加载预训练的权重，指定镜像源为modelscope
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# 初始化模型，并加载预训练的权重，将EOS token设置为PAD token以避免警告，指定镜像源为modelscope
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# 对生成文本的条件上下文进行编码，得到input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# 使用贪婪搜索策略生成文本，直到输出长度（包括上下文长度）达到50
greedy_output = model.generate(input_ids, max_length=50)

# 打印生成的文本
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

输出：

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.036 seconds.
Prefix dict has been built successfully.
100%
 26.0/26.0 [00:00<00:00, 1.86kB/s]
100%
 0.99M/0.99M [00:00<00:00, 14.9MB/s]
100%
 446k/446k [00:00<00:00, 8.31MB/s]
100%
 1.29M/1.29M [00:00<00:00, 18.8MB/s]
100%
 665/665 [00:00<00:00, 66.8kB/s]
100%
 523M/523M [00:38<00:00, 11.4MB/s]
Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll

1.4 Beam search（束搜索）

Beam search通过在每个时间步保留最可能的 num_beams 个词，并从中最终选择出概率最高的序列来降低丢失潜在的高概率序列的风险。如图以 num_beams=2 为例:

(“The”,“dog”,“has”) : 0.4 * 0.9 = 0.36

(“The”,“nice”,“woman”) : 0.5 * 0.4 = 0.20

优点：一定程度保留最优路径

缺点：1. 无法解决重复问题；2. 开放域生成效果差
在这里插入图片描述

# 导入GPT2模型的分词器和模型
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

# 初始化分词器，并加载预训练的权重，指定镜像源为modelscope
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# 初始化模型，并加载预训练的权重，将EOS token设置为PAD token以避免警告，指定镜像源为modelscope
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# 对生成文本的条件上下文进行编码，得到input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# 使用束搜索和早停策略生成文本，设置束宽为5，最大长度为50
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

# 打印生成的文本
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')

# 设置不重复n-gram的大小为2，以避免生成重复的文本
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

# 打印生成的文本
print("Beam search with ngram, Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')

# 设置返回多个序列的数量大于1
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

# 打印所有生成的序列
print("return_num_sequences, Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
print(100 * '-')

输出：

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."

"I don't think I'll ever be able to walk with her again."

"I don't think I
----------------------------------------------------------------------------------------------------
Beam search with ngram, Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."

"I'm not sure what to say to that," she said. "I mean, it's not like I'm
----------------------------------------------------------------------------------------------------
return_num_sequences, Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."

"I'm not sure what to say to that," she said. "I mean, it's not like I'm
1: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."

"I'm not sure what to say to that," she said. "I mean, it's not like she's
2: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."

"I'm not sure what to say to that," she said. "I mean, it's not like we're
3: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."

"I'm not sure what to say to that," she said. "I mean, it's not like I've
4: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."

"I'm not sure what to say to that," she said. "I mean, it's not like I can

1.5 Sample（样本搜索）

根据当前条件概率分布随机选择输出词 $w_t$
sample
优点：文本生成多样性高

缺点：生成文本不连续

# 导入MindSpore库，用于设置随机种子
import mindspore

# 导入GPT2模型的分词器和模型
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

# 初始化分词器，并加载预训练的权重，指定镜像源为modelscope
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# 初始化模型，并加载预训练的权重，将EOS token设置为PAD token以避免警告，指定镜像源为modelscope
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# 对生成文本的条件上下文进行编码，得到input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# 设置随机种子，以获得可重复的结果
mindspore.set_seed(0)

# 激活采样策略，并关闭top_k采样，设置最大长度为50
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

# 打印生成的文本
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

输出：

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog Neddy as much as I'd like. Keep up the good work Neddy!"

I realized what Neddy meant when he first launched the website. "Thank you so much for joining."

I

1.6 Temperature（温度）

降低softmax 的temperature使 P( $w$ ∣ $w$ _{$1 : t - 1$})分布更陡峭
在这里插入图片描述
增加高概率单词的似然并降低低概率单词的似然

# 导入MindSpore库，用于设置随机种子
import mindspore

# 导入GPT2模型的分词器和模型
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

# 初始化分词器，并加载预训练的权重，指定镜像源为modelscope
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# 初始化模型，并加载预训练的权重，将EOS token设置为PAD token以避免警告，指定镜像源为modelscope
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# 对生成文本的条件上下文进行编码，得到input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# 设置随机种子，以获得可重复的结果
mindspore.set_seed(1234)

# 激活采样策略，并关闭top_k采样，设置最大长度为50，设置温度系数为0.7
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0,
    temperature=0.7
)

# 打印生成的文本
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

输出：

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog and have never had a problem with her until now.

A large dog named Chucky managed to get a few long stretches of grass on her back and ran around with it for about 5 minutes, ran around

1.7 TopK sample（TopK采样）

选出概率最大的 K 个词，重新归一化，最后在归一化后的 K 个词中采样
在这里插入图片描述
将采样池限制为固定大小 K ：

在分布比较尖锐的时候产生胡言乱语
在分布比较平坦的时候限制模型的创造力

# 导入MindSpore库，用于设置随机种子
import mindspore

# 导入GPT2模型的分词器和模型
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

# 初始化分词器，并加载预训练的权重，指定镜像源为modelscope
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# 初始化模型，并加载预训练的权重，将EOS token设置为PAD token以避免警告，指定镜像源为modelscope
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# 对生成文本的条件上下文进行编码，得到input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# 设置随机种子，以获得可重复的结果
mindspore.set_seed(0)

# 激活采样策略，并关闭top_k采样，设置最大长度为50，设置top_k为50
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=50
)

# 打印生成的文本
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

输出：

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog.

She's always up for some action, so I have seen her do some stuff with it.

Then there's the two of us.

The two of us I'm talking about were

1.8 Top-P sample(Top-P采样）

在累积概率超过概率 p 的最小单词集中进行采样，重新归一化
在这里插入图片描述
采样池可以根据下一个词的概率分布动态增加和减少

# 导入MindSpore库，用于设置随机种子
import mindspore

# 导入GPT2模型的分词器和模型
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

# 初始化分词器，并加载预训练的权重，指定镜像源为modelscope
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# 初始化模型，并加载预训练的权重，将EOS token设置为PAD token以避免警告，指定镜像源为modelscope
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# 对生成文本的条件上下文进行编码，得到input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# 设置随机种子，以获得可重复的结果
mindspore.set_seed(0)

# 激活采样策略，并关闭top_k采样，设置最大长度为50，设置top_p为0.92，设置top_k为0
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)

# 打印生成的文本
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

输出：

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog Neddy as much as I'd like. Keep up the good work Neddy!"

I realized what Neddy meant when he first launched the website. "Thank you so much for joining."

I

1.9 top_k_top_p（混合采样）

# 导入MindSpore库，用于设置随机种子
import mindspore

# 导入GPT2模型的分词器和模型
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

# 初始化分词器，并加载预训练的权重，指定镜像源为modelscope
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# 初始化模型，并加载预训练的权重，将EOS token设置为PAD token以避免警告，指定镜像源为modelscope
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# 对生成文本的条件上下文进行编码，得到input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# 设置随机种子，以获得可重复的结果
mindspore.set_seed(0)

# 激活采样策略，并设置top_k为5，top_p为0.95，设置最大长度为50，设置num_return_sequences为3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=5,
    top_p=0.95,
    num_return_sequences=3
)

# 打印生成的文本
print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
print(100 * '-')