作者用action, reward, state等当做lalbel,进行有监督训练。 黄世宇/Shiyu Huang's Personal Page:https://huangshiyu13.github.io/