An AI that continues stories at a human level: GPT-2, OpenAI's large-scale unsupervised language model

Author: Yang Xiaofan (杨晓凡)

2019-02-16 10:04

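Lead: 40GB of training text, 1.5 billion parameters

Leiphone AI Technology Review note: The contest over model size continues. Following Google Brain's 277-million-parameter language model Transformer-XL, OpenAI has completed its own language model, GPT-2, with 1.5 billion parameters. The model also has a striking ability: given a short prompt, it can continue the text into a full article. Leiphone AI Technology Review gives a brief introduction below.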

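About GPT-2

In June 2018, OpenAI published a paper introducing its language model GPT. GPT is built on the Transformer architecture. It is first pre-trained without supervision on a large corpus and then fine-tuned for each specific task on a much smaller supervised dataset. With this recipe, and without relying on model-design tricks tailored to individual tasks, it performed well on many tasks at once. This was also the research trend in natural language processing in 2018, much as ImageNet pre-trained models became standard in computer vision.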

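GPT-2 is a direct scale-up of GPT: it has more than 10 times as many parameters, 1.5 billion in total, and is trained on more than 10 times as much data. The training data is 40GB of high-quality text from the internet. Specifically, the text comes from pages linked from highly rated posts on Reddit, where a high rating is taken as a sign that the linked page is of reasonable quality. Filtering in this way yields a corpus of about 8 million pages.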
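The filtering idea can be illustrated with a minimal sketch: keep a linked page only if the post that linked to it reached some minimum score. The `Submission` record, the `select_urls` helper, and the threshold value used here are hypothetical names and values chosen for illustration; the article does not describe OpenAI's actual pipeline.

```python
from dataclasses import dataclass


# Hypothetical record type for illustration only.
@dataclass(frozen=True)
class Submission:
    url: str    # outbound link contained in the post
    score: int  # net rating of the post that shared the link


def select_urls(submissions, min_score):
    """Return de-duplicated URLs whose linking post scored at least min_score."""
    seen = set()
    kept = []
    for sub in submissions:
        if sub.score >= min_score and sub.url not in seen:
            seen.add(sub.url)
            kept.append(sub.url)
    return kept


posts = [
    Submission("https://example.com/a", 12),
    Submission("https://example.com/b", 1),   # rated too low, dropped
    Submission("https://example.com/a", 40),  # duplicate URL, dropped
    Submission("https://example.com/c", 3),
]

# min_score=3 is an assumed, illustrative threshold.
print(select_urls(posts, min_score=3))
```

The rating acts as a cheap human-curated quality signal, so no separate quality classifier is needed in this sketch.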

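During unsupervised training the model's objective is simple: given the preceding words of a text, predict the next word. Because the dataset is both high in quality and diverse, and the model has a large capacity, even this simple objective produces striking results. The model can not only continue a given text fluently but can also produce whole articles, much as a human writer would.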
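To make the "predict the next word" objective concrete, here is a toy sketch. It is a count-based bigram model, not a Transformer, and looks only one word back instead of at the whole preceding text. It shares only the training signal (which word follows which) and the generation loop (append one predicted word at a time). All function names are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if none was seen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def continue_text(counts, prompt, max_new_words):
    """Greedily extend the prompt one predicted word at a time."""
    words = list(prompt)
    for _ in range(max_new_words):
        nxt = predict_next(counts, words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return words


corpus = "the cat sat on the mat and the cat sat on the sofa".split()
model = train_bigram(corpus)
print(continue_text(model, ["the"], 3))  # ['the', 'cat', 'sat', 'on']
```

Run for more steps, this toy quickly falls into a loop ("the cat sat on the cat sat on ..."), a crude analogue of the repetition that can appear in generated text.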

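The generated text does sometimes fail: it may repeat itself, get basic facts about the world wrong (for example, the model sometimes writes about fires burning under water), or switch topics unnaturally. In the successful cases, however, the text is varied and thorough, describes events with convincing seriousness, approaches the quality of human writing, and stays consistent across paragraphs and even across a whole article. This is remarkable (see the samples below).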

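OpenAI's researchers found that the number of attempts needed to get a good sample depends on how familiar the model is with the subject, and that a few attempts are usually enough. For topics that appear often in the training text (that is, often on the internet), such as Brexit, The Lord of the Rings, and the singer Miley Cyrus, the model produces reasonable text in about half of its attempts. For content that appears less often in the training text, such as technical discussion and philosophy, the results are much worse.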

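As with the original GPT, the pre-trained model can be fine-tuned further on a domain dataset to gain better control over the style of the text. After fine-tuning on Amazon review data, for example, the generated text pays more attention to things like star ratings and product categories.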

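Beyond this, GPT-2 can also do "zero-shot learning" on a variety of tasks: it is tested without being trained on any task-specific dataset, and it still achieves strong results. OpenAI's researchers conjecture that these tasks are all subsets of general language modeling, so good performance follows naturally.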
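One way to picture zero-shot use is that the task is written into the text itself, so solving it becomes ordinary continuation. The sketch below only builds such prompts as strings. The `TL;DR:` cue for summarization and the `english sentence = french sentence` layout for translation follow the prompting formats described in the GPT-2 paper; the helper names are made up for this example, and no model is called.

```python
def summarization_prompt(article):
    """Append a summary cue so that the natural continuation is a summary."""
    return article.strip() + "\nTL;DR:"


def translation_prompt(example_pairs, source_sentence):
    """Lay out 'english = french' lines and end with an open one to complete."""
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)


print(summarization_prompt("A long news article would go here."))
print(translation_prompt([("hello", "bonjour")], "thank you"))
```

In both cases the model's weights are left unchanged; the prompt string would simply be handed to the language model as the text to continue.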

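Text generation samples

The most eye-catching part of GPT-2's performance is of course the quality of the text it generates, so let us look at three of the samples that OpenAI released. Working through them closely, we felt that although the model's descriptions and narrative logic are still somewhat muddled, it finds a suitable writing style, fills in many details on its own, keeps some connection to earlier context, builds its story step by step, and makes almost no grammatical errors. The results are very convincing.

Let us take in that earnest, matter-of-fact tone together.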

Example 1

Human-written prompt

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Model continuation (result of the 10th attempt)

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.
Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.
Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.
While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”
Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.
While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”
However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.

Example 2

Human-written prompt

A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.

Model continuation (result of the 1st attempt)

The incident occurred on the downtown train line, which runs from Covington and Ashland stations.
In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief.
“The theft of this nuclear material will have significant negative consequences on public and environmental health, our workforce and the economy of our nation,” said Tom Hicks, the U.S. Energy Secretary, in a statement. “Our top priority is to secure the theft and ensure it doesn’t happen again.”
The stolen material was taken from the University of Cincinnati’s Research Triangle Park nuclear research site, according to a news release from Department officials.
The Nuclear Regulatory Commission did not immediately release any information.
According to the release, the U.S. Department of Energy’s Office of Nuclear Material Safety and Security is leading that team’s investigation.
“The safety of people, the environment and the nation’s nuclear stockpile is our highest priority,” Hicks said. “We will get to the bottom of this and make no excuses.

Example 3

Human-written prompt (this is OpenAI's own introduction to the GPT-2 model)

We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.
A typical approach to language modeling is to learn the following task: predict the next word, given all of the previous words within some text. Last year, OpenAI’s Generative Pre-trained Transformer (GPT) showed that language models trained on large amounts of data can be fine-tuned to specific tasks to achieve high performance. GPT-2 shows that much larger language models trained on a more diverse dataset derived from the internet begin to learn these NLP tasks without needing task-specific training data, instead learning from examples the system derives from the raw text. These systems also display a substantial qualitative jump in the realism and coherence of generated text.

Model continuation (result of the 1st attempt)

Here you can see the most recent progress with Generative Pre-trained Transformer:
Figure 1: Generative Pre-trained Transformer training on several texts.
We are now preparing a collection of datasets for translation and machine translation in our language model. We will be using one of the large number of text samples provided by The New York Times.
We believe this project is the first step in the direction of developing large NLP systems without task-specific training data. That is, we are developing a machine language system in the generative style with no explicit rules for producing text.
We hope for future collaborations between computer scientists, linguists, and machine learning researchers.

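OpenAI's concerns

As we have seen, the model really can take an arbitrary prompt and continue it into a full piece of text of near-human quality. OpenAI says a model with this ability could serve as a writing assistant, a dialogue agent, an unsupervised translator, and more. At the same time, OpenAI worries that such a model could be abused. It could be used to write articles that maliciously spread false information, much like the face-swapping model DeepFake, which was hotly debated earlier and which internet users have already used to paste celebrities' faces into pornographic videos. It could also be used to imitate other people's writing or to mass-produce phishing content.

A model as capable as GPT-2 can be used to generate deceptive, biased, and violent language, and OpenAI places great weight on research being safe, collaborative, and beneficial. OpenAI would rather share its experience with other researchers in a collaborative way, to help the whole field advance, than hand every result directly to the public. For these reasons OpenAI does not plan to release the full pre-trained model. It is releasing only a much smaller model for researchers to experiment with.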

Paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Blog post: https://blog.openai.com/better-language-models/

Open-source code: https://github.com/openai/gpt-2

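From a technical standpoint, GPT-2 is not much of a breakthrough. It shows once more that a large enough network trained on enough data can memorize well, while logic and reasoning still do not emerge naturally from memorization. It also shows, once again, that with enough compute and data, topping the leaderboards is never hard. Shrug.

Reported by Leiphone AI Technology Review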

Copyright Leiphone (雷峰网). Reproduction without authorization is prohibited.