PGL Graph Learning: A GNN Model for the COVID-19 Vaccine Task [Series 9]

Project link: https://aistudio.baidu.com/aistudio/projectdetail/5123296?contributionType=1

# Load the required modules and set the random seeds
import json
import random
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import networkx as nx

from utils.config import prepare_config, make_dir
from utils.logger import prepare_logger, log_to_file
from data_parser import GraphParser

seed = 123
np.random.seed(seed)
random.seed(seed)

Data EDA

# https://www.kaggle.com/c/stanford-covid-vaccine/data
# Load the training data
df = pd.read_json('../data/data179441/train.json', lines=True)
# Inspect the first record of the dataset
sample = df.loc[0]
print(sample)

index                                                                400
id                                                          id_2a7a4496f
sequence               GGAAAGCCCGCGGCGCCGGGCGCCGCGGCCGCCCAGGCCGCCCGGC...
structure              .....(((...)))((((((((((((((((((((.((((....)))...
predicted_loop_type    EEEEESSSHHHSSSSSSSSSSSSSSSSSSSSSSSISSSSHHHHSSS...
signal_to_noise                                                        0
SN_filter                                                              0
seq_length                                                           107
seq_scored                                                            68
reactivity_error       [146151.225, 146151.225, 146151.225, 146151.22...
deg_error_Mg_pH10      [104235.1742, 104235.1742, 104235.1742, 104235...
deg_error_pH10         [222620.9531, 222620.9531, 222620.9531, 222620...
deg_error_Mg_50C       [171525.3217, 171525.3217, 171525.3217, 171525...
deg_error_50C          [191738.0886, 191738.0886, 191738.0886, 191738...
reactivity             [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
deg_Mg_pH10            [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
deg_pH10               [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
deg_Mg_50C             [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
deg_50C                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: 0, dtype: object

Fields such as deg_50C and deg_Mg_50C, whose values are all 0 in this record, are the kind of values we need to predict.

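In the structure field, the brackets are used to build edges: each matching pair of parentheses marks two bases that are paired with each other, and a dot marks an unpaired base.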
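To make the role of the brackets concrete, here is a minimal sketch that turns a dot-bracket string into base-pair edges with a stack. The helper names `pair_edges` and `backbone_edges` are hypothetical, and whether GraphParser builds its edges exactly this way is an assumption; the source only says the brackets are used to form edges.

```python
def pair_edges(structure):
    """Return (i, j) index pairs for matched brackets in a dot-bracket string."""
    stack = []  # positions of '(' still waiting for a partner
    pairs = []
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def backbone_edges(n):
    """Return edges between neighbouring bases along a sequence of length n."""
    return [(i, i + 1) for i in range(n - 1)]


# The first 14 characters of the structure string shown in the sample above.
structure = ".....(((...)))"
print(pair_edges(structure))  # [(7, 11), (6, 12), (5, 13)]
print(len(backbone_edges(len(structure))))  # 13
```

Each pair would typically be added in both directions, which is consistent with the mirrored rows such as [0, 1] and [1, 0] in the parsed edges.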

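This case study predicts the degradation rate at each position of an RNA sequence. The training data provides several ground-truth values; the labels are reactivity, deg_Mg_pH10, and deg_Mg_50C.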

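reactivity - (1x68 vector in the training set, 1x91 in the test set) An array of floats with the same length as seq_scored. These are the reactivity values for the first 68 bases, in sequence order, used to determine the likely secondary structure of the RNA sample.

deg_Mg_pH10 - (1x68 vector in the training set, 1x91 in the test set) An array of floats with the same length as seq_scored. These are the reactivity values for the first 68 bases, in sequence order, used to determine the likelihood of degradation after incubation with magnesium at high pH (pH 10).

deg_Mg_50C - (1x68 vector in the training set, 1x91 in the test set) An array of floats with the same length as seq_scored. These are the reactivity values for the first 68 bases, in sequence order, used to determine the likelihood of degradation after incubation with magnesium at high temperature (50 degrees Celsius).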

# Build graph-structured data with GraphParser
args = prepare_config("./config.yaml", isCreate=False, isSave=False)  # the path must be a string
parser = GraphParser(args)  # the GraphParser class comes from data_parser.py
gdata = parser.parse(sample)  # parse(self, sample) is the main function of GraphParser

{'nfeat': array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 1., 0., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]], dtype=float32),
 'edges': array([[  0,   1],
        [  1,   0],
        [  1,   2],
        ...,
        [142, 105],
        [106, 142],
        [142, 106]]),
 'efeat': array([[ 0.,  0.,  0.,  1.,  1.],
        [ 0.,  0.,  0., -1.,  1.],
        [ 0.,  0.,  0.,  1.,  1.],
        ...,
        [ 0.,  1.,  0.,  0.,  0.],
        [ 0.,  1.,  0.,  0.,  0.],
        [ 0.,  1.,  0.,  0.,  0.]], dtype=float32),
 'labels': array([[ 0.    ,  0.    ,  0.    ],
        [ 0.    ,  0.    ,  0.    ],
        ...,
        [ 0.    ,  0.9213,  0.    ],
        [ 6.8894,  3.5097,  5.7754],
        [ 0.    ,  1.8426,  6.0642],
          ...,        
        [ 0.    ,  0.    ,  0.    ],
        [ 0.    ,  0.    ,  0.    ]], dtype=float32),
 'mask': array([[ True],
        [ True],
     ......
       [False]])}

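nfeat - node features

edges - edges

efeat - edge features

labels - each node has three labels, and they are continuous values, so this can be viewed as a multi-target regression task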
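The parsed dictionary also contains a mask, which the list above does not describe. A plausible reading, and it is only an assumption here, is that it flags the nodes that carry valid labels (for example the first seq_scored positions). Under that assumption, the labelled rows could be selected as in this sketch, which uses small made-up arrays rather than real data:

```python
import numpy as np

# Toy stand-ins for gdata['labels'] and gdata['mask'] (5 nodes, 3 targets).
# The values are invented for illustration only.
labels = np.array([[0.5, 1.0, 0.2],
                   [0.1, 0.3, 0.4],
                   [0.0, 0.9, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]], dtype=np.float32)
mask = np.array([[True], [True], [True], [False], [False]])

# Flatten the (N, 1) mask to (N,) and keep only the flagged rows.
scored = labels[mask[:, 0]]
print(scored.shape)  # (3, 3)
```

A loss or metric would then be computed on `scored` only, so that unlabelled nodes do not contribute.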

Graph data visualization

# Visualize the graph data
fig = plt.figure(figsize=(24, 12))
nx_G = nx.Graph()
nx_G.add_nodes_from(range(len(gdata['nfeat'])))

nx_G.add_edges_from(gdata['edges'])
# Green for the first seq_length nodes, yellow for any nodes beyond them
node_color = ['g' for _ in range(sample['seq_length'])] + \
    ['y' for _ in range(len(gdata['nfeat']) - sample['seq_length'])]
options = {
    'node_color': node_color,  # the key must be a string, not the list itself
}
pos = nx.spring_layout(nx_G, iterations=400, k=0.2)
nx.draw(nx_G, pos, **options)

plt.show()

Model training & prediction

# In layer.py we define a new GNN model (my_gnn) whose message passing also uses the edge features (edge_feat)
# Then we modify GNNModel in model.py
# Run main.py with the modified model. To save time, set epochs = 100

!python main.py --config config.yaml
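As an illustration of what "message passing with edge features" can mean, the following is a minimal NumPy sketch of one layer that adds a projected edge feature to each message before summing at the destination node. It is not the my_gnn code from layer.py and does not use the PGL API; the function name `edge_aware_layer`, the weight shapes, and the choice of sum aggregation with ReLU are all assumptions made for the example.

```python
import numpy as np


def edge_aware_layer(nfeat, edges, efeat, w_node, w_edge):
    """One round of sum-aggregation message passing that also uses edge features.

    nfeat: (N, Dn) node features; edges: (E, 2) rows of (src, dst);
    efeat: (E, De) edge features; w_node: (Dn, H); w_edge: (De, H).
    """
    src, dst = edges[:, 0], edges[:, 1]
    # Message on each edge: projected source-node feature plus projected edge feature.
    messages = nfeat[src] @ w_node + efeat @ w_edge
    # Sum the incoming messages at every destination node.
    out = np.zeros((nfeat.shape[0], w_node.shape[1]), dtype=messages.dtype)
    np.add.at(out, dst, messages)
    return np.maximum(out, 0.0)  # ReLU non-linearity


rng = np.random.default_rng(0)
nfeat = rng.normal(size=(4, 3))
edges = np.array([[0, 1], [1, 0], [1, 2], [2, 1], [2, 3], [3, 2]])
efeat = rng.normal(size=(len(edges), 5))
h = edge_aware_layer(nfeat, edges, efeat,
                     rng.normal(size=(3, 8)), rng.normal(size=(5, 8)))
print(h.shape)  # (4, 8)
```

The edge array follows the same layout as the parsed 'edges' above, with one row per direction, so a node receives messages from every neighbour it is connected to.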

The run returns the MCRMSE and the loss:

{'MCRMSE': 0.5496759, 'loss': 0.3025484172316889}

[DEBUG] 2022-11-25 17:50:42,468 [  trainer.py:   66]:	{'MCRMSE': 0.5496759, 'loss': 0.3025484172316889}
[DEBUG] 2022-11-25 17:50:42,468 [  trainer.py:   73]:	write to tensorboard ../checkpoints/covid19/eval_history/eval
[DEBUG] 2022-11-25 17:50:42,469 [  trainer.py:   73]:	write to tensorboard ../checkpoints/covid19/eval_history/eval
[INFO] 2022-11-25 17:50:42,469 [  trainer.py:   76]:	[Eval:eval]:MCRMSE:0.5496758818626404	loss:0.3025484172316889
[INFO] 2022-11-25 17:50:42,602 [monitored_executor.py:  606]:	********** Stop Loop ************
[DEBUG] 2022-11-25 17:50:42,607 [monitored_executor.py:  199]:	saving step 12500 to ../checkpoints/covid19/model_12500
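For reference, MCRMSE (mean column-wise root mean squared error) is the metric used by the Kaggle competition: the RMSE is computed separately for each target column and the results are averaged. The sketch below follows that standard definition on made-up numbers; how trainer.py computes the value in the log above is not shown in the source, so treat this as an assumption about the metric rather than the project's code.

```python
import numpy as np


def mcrmse(y_true, y_pred):
    """Mean column-wise RMSE: RMSE per target column, then the mean over columns."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    rmse_per_column = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return float(np.mean(rmse_per_column))


# Two positions, three targets; the values are invented for illustration.
y_true = np.array([[0.0, 1.0, 2.0],
                   [1.0, 1.0, 0.0]])
y_pred = np.array([[0.0, 2.0, 2.0],
                   [1.0, 0.0, 2.0]])
# Column RMSEs are 0, 1 and sqrt(2), so the result is (1 + sqrt(2)) / 3.
print(mcrmse(y_true, y_pred))
```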

!python main.py --mode infer