Foundation Models

Foundation models are models that are pretrained on raw data at scale and can be adapted to various downstream applications [1]. For example, language models like BERT [2] and the GPT family [3] are pretrained on large amounts of raw text such as Wikipedia and books, and the resulting models can be adapted (e.g. via finetuning or prompting) to an extremely wide range of applications, including question answering and text classification. These language models achieve remarkable performance on many natural language processing (NLP) tasks and have become the foundation of today’s NLP systems. They serve important roles in products and tools that we use every day, such as search engines like Google [4] and personal assistants like Alexa [5].

What can foundation models learn from?
While text is commonly used to train language models, knowledge graphs (KGs) provide complementary information to text. KGs offer structured background knowledge by representing entities as nodes and relations between them as edges, e.g. <Leonardo da Vinci — born in — Italy>. Examples of knowledge graphs include Freebase [6] and Wikidata [7] (general-purpose facts), ConceptNet [8] (commonsense), and UMLS [9] (biomedical facts).

Text and KGs have complementary strengths. Text has broad coverage of knowledge and captures rich context [10]. Meanwhile, KGs are structured and offer scaffolds for logical or multi-step reasoning by providing paths between entities [11]. Some KGs also include knowledge that may not be commonly stated in text; for instance, people do not often state obvious facts like “people breathe” or compositional sentences like “The birthplace of the painter of the Mona Lisa is Italy”. Hence, text and KGs have the potential to mutually inform each other: text can contextualize the KG with its rich prose, and the KG can ground the text with its structure for reasoning.

Training a foundation model from text and KG
The above observation motivates research in fusing the strengths of the two modalities, text and KG. In our recent work published at NeurIPS 2022, we develop DRAGON, a new method to train a foundation model jointly from text and KG.

Challenges
To pretrain a powerful foundation model, we need both (i) an expressive model that allows the two modalities to interact in a deep, bidirectional manner; and (ii) a self-supervised training objective that allows the model to learn joint reasoning over text and KG at scale, without manual labels. Existing models that combine text and KGs tend to perform indirect or unidirectional fusion of the two modalities [12], or are supervised with small labeled datasets [13] rather than being self-supervised at scale. We address these challenges below.


Approach: DRAGON

We introduce DRAGON (Deep Bidirectional Language-Knowledge Graph Pretraining), a self-supervised method to pretrain a deeply fused foundation model from text and KG. As an overview, DRAGON consists of three steps. We first sample pairs of text segments and relevant KG subgraphs to create inputs for the model (Step 1). We then use a deep bidirectional model to fuse the input text and KG (Step 2), and finally pretrain the model using a joint self-supervised task over the two modalities (Step 3).

Step 1: Text-KG Input Sampling
Given a text corpus (e.g. books) and a large KG (e.g. ConceptNet) as raw data, we want to sample informative pairs of (text segment, KG subgraph) as inputs for the model, so that the text and KG are semantically related and can inform each other. To achieve this, for each text segment sampled from the text corpus, we retrieve a relevant subgraph from the KG by simple entity linking: we string-match entities mentioned in the text segment to KG nodes, and extract these entity nodes as well as their neighbor nodes from the KG [14]. Consequently, we obtain a pair of (text segment, local KG). Henceforth, we use “KG” to refer to this local KG for convenience.
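To make Step 1 concrete, here is a minimal sketch of the retrieval procedure, assuming the KG is stored as a NetworkX directed multigraph and entities are matched by simple lowercase string matching. The function name `retrieve_local_kg` and the data structures are illustrative, not the actual DRAGON implementation.

```python
import networkx as nx

def retrieve_local_kg(text_segment, kg, entity_vocab):
    """Retrieve a local KG subgraph for a text segment via string matching.

    kg:           a networkx.MultiDiGraph whose nodes are KG entities.
    entity_vocab: dict mapping lowercased entity surface strings to KG node ids.
    """
    text = text_segment.lower()

    # 1) Entity linking: string-match entity mentions in the text to KG nodes.
    linked = {node for surface, node in entity_vocab.items() if surface in text}

    # 2) Expand to the neighbors of the linked entities.
    expanded = set(linked)
    for node in linked:
        expanded.update(kg.successors(node))
        expanded.update(kg.predecessors(node))

    # 3) The induced subgraph is the local KG paired with the text segment.
    return kg.subgraph(expanded).copy()
```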

Step 2: Deep Bidirectional Cross-Modal Model
Given the input text and KG, we want to design a model that can capture rich interactions between them. We draw inspiration from deep bidirectional contextualization of inputs, which has made BERT very successful [15], and from graph neural networks (GNNs), which have been shown to be effective for modeling graph algorithms, including knowledge graph reasoning [16]. With these motivations, we designed a model architecture called GreaseLM [17] that combines a Transformer and a GNN to fuse text and KG bidirectionally over multiple layers. Specifically, each layer of this model has a Transformer that encodes the input text and a GNN that encodes the input KG, which are then fused by a bidirectional modality interaction module.
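Below is a simplified sketch of what one such fusion layer might look like in PyTorch. It is not the actual GreaseLM/DRAGON code: the GNN side is stood in by a generic self-attention layer over KG node embeddings, and the interaction module simply mixes a special “interaction” token on the text side with a special “interaction” node on the KG side, which conveys the mechanism GreaseLM uses to exchange information between the two modalities at every layer.

```python
import torch
import torch.nn as nn

class TextKGFusionLayer(nn.Module):
    """One fusion layer of a GreaseLM-style model (simplified sketch)."""

    def __init__(self, hidden_dim, num_heads=8):
        super().__init__()
        # Transformer layer for the text tokens.
        self.text_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        # Stand-in for a GNN layer: self-attention over KG node embeddings.
        self.kg_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        # Bidirectional modality interaction module (MLP over the two special states).
        self.interaction = nn.Sequential(
            nn.Linear(2 * hidden_dim, 2 * hidden_dim),
            nn.GELU(),
            nn.Linear(2 * hidden_dim, 2 * hidden_dim),
        )

    def forward(self, text_states, kg_states):
        # text_states: (batch, num_tokens, dim); position 0 is the interaction token.
        # kg_states:   (batch, num_nodes,  dim); position 0 is the interaction node.
        text_states = self.text_layer(text_states)
        kg_states = self.kg_layer(kg_states)

        # Bidirectional modality interaction: mix the two special positions.
        dim = text_states.size(-1)
        joint = torch.cat([text_states[:, 0], kg_states[:, 0]], dim=-1)
        mixed = self.interaction(joint)

        # Write the mixed representations back, so information flows both ways.
        text_states = torch.cat([mixed[:, :dim].unsqueeze(1), text_states[:, 1:]], dim=1)
        kg_states = torch.cat([mixed[:, dim:].unsqueeze(1), kg_states[:, 1:]], dim=1)
        return text_states, kg_states
```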

Step 3: Bidirectional Self-supervision
The final step is to pretrain the model on the inputs we prepared. We train the model by unifying two self-supervised reasoning tasks over text and KG. The first task is language modeling, which predicts masked or next words in the input text. The second is link prediction [18], which predicts edges that were held out from the input KG. The intuition is that by combining the two tasks, the model is encouraged to use both the text and the KG to reason about the missing words in the text and the missing links in the KG. This joint training encourages the model to propagate information bidirectionally between the two modalities.
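As a rough illustration, the joint objective can be written as the sum of a masked language modeling loss and a link prediction loss. The sketch below assumes masked language modeling on the text side and a DistMult-style scoring function with negative sampling on the KG side (one of several common choices for link prediction [18]); it conveys the shape of the objective rather than the exact DRAGON training code.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_loss(mlm_logits, mlm_labels,
                           head_emb, rel_emb, tail_emb, neg_tail_emb):
    """Joint self-supervised objective (sketch): language modeling + link prediction.

    mlm_logits:   (batch, seq_len, vocab) predictions for the masked text.
    mlm_labels:   (batch, seq_len) token ids, with -100 at unmasked positions.
    head_emb, rel_emb, tail_emb: (num_edges, dim) embeddings of held-out KG triples.
    neg_tail_emb: (num_edges, num_negatives, dim) embeddings of corrupted tails.
    """
    # Task 1: masked language modeling on the text side.
    lm_loss = F.cross_entropy(
        mlm_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)

    # Task 2: link prediction on the KG side, DistMult score s(h, r, t) = <h * r, t>.
    pos_score = (head_emb * rel_emb * tail_emb).sum(-1)                     # (E,)
    neg_score = ((head_emb * rel_emb).unsqueeze(1) * neg_tail_emb).sum(-1)  # (E, K)
    scores = torch.cat([pos_score.unsqueeze(1), neg_score], dim=1)          # (E, 1 + K)
    targets = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    link_loss = F.cross_entropy(scores, targets)  # the true tail is at index 0

    return lm_loss + link_loss
```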


Let’s use DRAGON!

We pretrain DRAGON in two domains:

  • General domain: we use BookCorpus [19] as the text corpus and ConceptNet [20] as the knowledge graph. We initialize the Transformer component of DRAGON using the RoBERTa checkpoint.
  • Biomedical domain: we use PubMed as the text corpus and UMLS [21] as the knowledge graph. We initialize the Transformer component of DRAGON using the BioLinkBERT checkpoint.

DRAGON improves over vanilla language models (LMs) and previous LM+KG models
We finetune and evaluate the pretrained DRAGON on diverse downstream tasks in each domain.

We compare DRAGON with two types of baselines. The first is a vanilla language model (LM), i.e. RoBERTa or BioLinkBERT [22]. The second is GreaseLM [23], which takes a vanilla language model and *finetunes* it with a KG, but does not pretrain with the KG. The key difference is that DRAGON uses the KG during pretraining as well as during finetuning.

The figure below shows the evaluation results. DRAGON outperforms the baseline language models and GreaseLM across the commonsense and biomedical question answering tasks. In particular, we can decompose the gains into the effect of using the KG in pretraining, relative to the vanilla language model (purple arrow), and the effect of self-supervised pretraining, relative to GreaseLM (blue arrow). Both components of DRAGON contribute significant performance improvements.

Effective for complex reasoning
We also find that DRAGON exhibits several interesting strengths. The first is improved performance on complex reasoning tasks. We identified several types of complex questions: those containing negation terms (like “no”, “never”) or conjunction terms (like “and”, “but”), which indicate logical reasoning, and those containing many prepositional phrases or many entity mentions, which indicate more reasoning constraints or steps. We find that DRAGON attains large performance gains on these complex questions compared to the baseline models.
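For concreteness, this bucketing can be approximated with simple surface heuristics along the lines described above. The word lists and function below are illustrative assumptions, not the exact analysis code from the paper (which may, for instance, use a parser to count prepositional phrases):

```python
NEGATION_TERMS = {"no", "not", "never", "none", "nothing"}
CONJUNCTION_TERMS = {"and", "but", "or"}
PREPOSITIONS = {"in", "on", "at", "of", "for", "with", "from", "to", "by"}

def complexity_profile(question, linked_entities):
    """Rough heuristics for bucketing a question by reasoning complexity."""
    tokens = question.lower().split()
    return {
        "has_negation": any(t in NEGATION_TERMS for t in tokens),
        "has_conjunction": any(t in CONJUNCTION_TERMS for t in tokens),
        "num_prepositions": sum(t in PREPOSITIONS for t in tokens),
        "num_entity_mentions": len(linked_entities),
    }
```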

The intuition is that because DRAGON is pretrained with a KG, it learns to use the KG as a scaffold for performing structured reasoning about entities. For instance, when the question contains a conjunction (example on the left of the figure), the model exhibits stronger attention weights over the entities related to the conjunction after several layers of the GNN over the KG, which leads to the correct answer. Similarly, when the question further contains a negation (example on the right of the figure), the model exhibits stronger attention weights over the entities that are not negated. One interpretation of these findings is that DRAGON uses the structure of the KG and the GNN as a scaffold for performing complex reasoning; this insight is related to recent works that provide language models with scratch space for intermediate reasoning [24]. Another interpretation is that the GNN component of DRAGON learns to perform soft execution of natural language inputs (questions) on the KG; this insight is related to recent works showing that GNNs can learn to execute graph algorithms, including execution of complex logical queries on KGs [25].

DRAGON also exhibits an ability to extrapolate to more complex questions. For instance, it adjusts its entity attention weights and final predictions accordingly when extra context (i.e. an extra reasoning step) is added to the original question, as in the figure below (left → right). Meanwhile, the vanilla language model (RoBERTa) and KG-augmented finetuning (GreaseLM) struggle on these QA examples. This may suggest that KG-augmented pretraining (DRAGON) is important for acquiring broader reasoning abilities that generalize to harder test examples.

Effective for few-shot and data-efficient QA
Another strength of DRAGON is few-shot and data-efficient QA. For each QA dataset, we finetune DRAGON and the baseline models with only 10% or 1% of the available training data. We find that DRAGON provides large improvements in these low-resource settings. This suggests that DRAGON has internalized more knowledge thanks to the self-supervised pretraining with knowledge graphs.
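The low-resource setup itself is simple: subsample a fixed fraction of the training set with a fixed seed, then finetune as usual. A minimal sketch (names are illustrative):

```python
import random

def subsample_training_set(examples, fraction, seed=0):
    """Keep a fixed fraction (e.g. 0.10 or 0.01) of the training examples."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)
```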


Summary

We introduced DRAGON, a new method to pretrain a deeply fused foundation model from text and knowledge graphs (KG). Specifically, we design a bidirectional cross-modal model for text and KG, and train the model using two joint self-supervised tasks: language modeling and KG link prediction.

DRAGON can be used as a drop-in replacement for existing BERT models, and can be finetuned to solve various NLP tasks. In particular, DRAGON achieves significant performance improvements for knowledge- and reasoning-intensive applications, such as complex question answering that involves commonsense/biomedical knowledge and multi-step reasoning.
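For instance, if you only need the text encoder, the released Transformer weights can in principle be loaded and finetuned like any RoBERTa-style checkpoint with standard tooling. The snippet below is a hypothetical sketch (the checkpoint path is a placeholder, and `num_labels` depends on your task); the GitHub repository documents the actual loading and finetuning code, including the KG component.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path: point this at the released DRAGON checkpoint from the GitHub repo.
checkpoint = "path/to/dragon-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

inputs = tokenizer("Where could you find a bookstore that is underground?", return_tensors="pt")
outputs = model(**inputs)  # finetune as you would any BERT-style encoder
```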

The pretrained DRAGON models are publicly available. We hope they will be helpful for your projects and research. Finally, we think DRAGON opens up many exciting future directions, such as generalizing it to GPT-style or sequence-to-sequence [26] language models to perform knowledge-grounded text generation.

This blog post is based on the paper:

Deep Bidirectional Language-Knowledge Graph Pretraining. Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D. Manning, Percy Liang, Jure Leskovec. NeurIPS 2022.

The models, code and data are available on GitHub. If you have questions, please feel free to email us.

  • Michihiro Yasunaga: myasu@cs.stanford.edu

Acknowledgments

Many thanks to my collaborators and advisors, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy Liang and Jure Leskovec for their help, and to the members of the Stanford SNAP group, P-Lambda group, and AI lab for their valuable feedback.

  1. On the Opportunities and Risks of Foundation Models. Rishi Bommasani et al. 2021.

  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019.

  3. Language Models are Few-Shot Learners. Tom B. Brown, et al. 2020.

  4. Google uses BERT for its search engine: https://blog.google/products/search/search-language-understanding-bert/

  5. https://arxiv.org/pdf/2011.03023.pdf

  6. Freebase: a collaboratively created graph database for structuring human knowledge. Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008.

  7. Wikidata: a free collaborative knowledgebase. Denny Vrandečić and Markus Krötzsch. 2014.

  8. Conceptnet 5.5: An open multilingual graph of general knowledge. Robyn Speer, Joshua Chin, and Catherine Havasi. 2017.

  9. The unified medical language system (UMLS): integrating biomedical terminology. Olivier Bodenreider. 2004.

  10. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy. 2020.

  11. QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering. Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, Jure Leskovec. 2021.

  12. - ERNIE: Enhanced language representation with informative entities.
    - Kepler: A unified model for knowledge embedding and pre-trained language representation.
    - Jaket: Joint pre-training of knowledge graph and language understanding.

  13. - Kagnet: Knowledge-aware Graph Networks for Commonsense Reasoning.
    - Scalable multi-hop relational reasoning for knowledge-aware question answering.
    - QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering.
    - GreaseLM: Graph REASoning Enhanced Language Models for Question Answering.

  14. We find that a very simple entity linking method like string matching works sufficiently well; we only need the text segment and the retrieved KG subgraph to be roughly related.

  15. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019.

  16. What Can Neural Networks Reason About?. Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, Stefanie Jegelka. 2020.

  17. GreaseLM: Graph REASoning Enhanced Language Models for Question Answering. Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D. Manning, Jure Leskovec. 2022.

  18. - Translating embeddings for modeling multi-relational data.
    - Embedding entities and relations for learning and inference in knowledge bases.
    - RotatE: Knowledge graph embedding by relational rotation in complex space.

  19. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015.

  20. Conceptnet 5.5: An open multilingual graph of general knowledge. Robyn Speer, Joshua Chin, and Catherine Havasi. 2017.

  21. The unified medical language system (UMLS): integrating biomedical terminology. Olivier Bodenreider. 2004.

  22. Since we pretrain DRAGON after initializing its parameters with RoBERTa or LinkBERT checkpoints, to have a fair baseline, we continue pretraining RoBERTa or LinkBERT using the vanilla language modeling objective on the same text data for the same number of steps as DRAGON.

  23. GreaseLM: Graph REASoning Enhanced Language Models for Question Answering. Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D. Manning, Jure Leskovec. 2022.

  24. - Chain of thought prompting elicits reasoning in large language models. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022.
    - Show your work: Scratchpads for intermediate computation with language models. Nye et al. 2021.

  25. - What Can Neural Networks Reason About?. Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, Stefanie Jegelka. 2020.
    - QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering. Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, Jure Leskovec. 2021.

  26. For instance, BART and T5.