Scene Graph Representation and Learning

First workshop on graph-based learning in computer vision

Held in conjunction with ICCV 2019 on October 28th in Seoul, Korea

Location: Room 318 B-C at the COEX Convention Center


Images are more than a collection of objects or attributes; they represent a web of relationships among interconnected objects. In an effort to formalize a representation for images, Visual Genome defined scene graphs, a structured, formal graph representation of an image similar to the form widely used in knowledge bases. Each scene graph encodes objects (e.g., dog, frisbee) as nodes connected via pairwise relationships (e.g., playing with) as edges. Scene graphs have led to many state-of-the-art models in image captioning, image retrieval, visual question answering, relationship modeling and image generation. Numerous other graph-based representations have been introduced for 3D geometry, for part-based object recognition, for understanding instructional videos, and for situational role classification. This workshop aims to discuss the progress, the benefits and the shortcomings of all graph-based representations and learning algorithms.
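To make the node-and-edge encoding concrete, here is a minimal Python sketch of a scene graph. The class and method names (SceneGraph, Relationship, add_relationship, relationships_from) are hypothetical and chosen only for illustration; they are not the Visual Genome API. The second relationship (dog, on, grass) is an invented example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A directed, labeled edge: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Objects are nodes; pairwise relationships are edges (hypothetical sketch)."""
    objects: set = field(default_factory=set)
    relationships: list = field(default_factory=list)

    def add_relationship(self, subject, predicate, obj):
        # Adding an edge also registers both endpoints as nodes.
        self.objects.update((subject, obj))
        self.relationships.append(Relationship(subject, predicate, obj))

    def relationships_from(self, name):
        # All (predicate, object) pairs whose subject is the given node.
        return [(r.predicate, r.obj) for r in self.relationships if r.subject == name]


graph = SceneGraph()
graph.add_relationship("dog", "playing with", "frisbee")
graph.add_relationship("dog", "on", "grass")  # invented example edge
```

Storing relationships as (subject, predicate, object) triples mirrors the knowledge-base form mentioned above and keeps edge direction explicit.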

Graphs have also enabled the innovation, adoption and use of numerous new spectral-based models like graph convolutions and graph-based evaluation metrics like SPICE. Modeling graphical data has historically been challenging for the machine learning community, especially when dealing with large amounts of data. Traditionally, methods have relied on Laplacian regularization through label propagation, manifold regularization or learning embeddings. Soon, operators on local neighborhoods of nodes became popular thanks to their ability to scale to larger amounts of data and their parallelizable computation. Today, the graph convolution, itself inspired by these Laplacian-based local operations, has become the de facto architecture for graphical data. Graph convolutions and similar techniques are slowly making their way into computer vision tasks and have recently been combined with RCNN to perform scene graph detection.
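As a rough illustration of a local-neighborhood operator, the sketch below implements one graph convolution layer in plain Python. It assumes the widely used symmetric-normalized propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W); the text above does not name a specific variant, so this choice, the function name graph_convolution, and the toy inputs are all assumptions made for illustration.

```python
import math


def graph_convolution(adjacency, features, weights):
    """One layer: ReLU(D^-1/2 (A + I) D^-1/2 X W), using nested lists."""
    n = len(adjacency)
    # Add self-loops so each node keeps its own features when aggregating.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Node degrees in the self-looped graph.
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization of each edge weight.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    f_in, f_out = len(features[0]), len(weights[0])
    # Aggregate features over each node's local neighborhood.
    agg = [[sum(norm[i][k] * features[k][c] for k in range(n))
            for c in range(f_in)] for i in range(n)]
    # Shared linear transform followed by ReLU.
    return [[max(0.0, sum(agg[i][c] * weights[c][o] for c in range(f_in)))
             for o in range(f_out)] for i in range(n)]


# Toy path graph 0 - 1 - 2 with a single feature that is set only on node 0.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]
weights = [[1.0]]
output = graph_convolution(adjacency, features, weights)
```

After one layer the feature of node 0 has reached its direct neighbor (node 1) but not node 2, which shows why each layer only mixes information within a one-hop neighborhood.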

At this workshop, we hope to discuss the importance of structure in computer vision. How should we be representing scenes, videos, and 3D spaces? What connections to language and knowledge bases could aid vision tasks? How can we rethink the machine learning community’s traditional relation-based representation learning? How can we both use and build upon spectral methods like random walks over graphs, message passing protocols, set-invariant neural architectures, and equivariant structured outputs? What are the shortcomings of our current representations and learning-based methods, and how can we remedy them? What tasks and directions should we be urging the community to move towards?

To receive notifications about updates related to the workshop, sign up using this form.

Program schedule - Room 318 B-C

09:00 - 09:15
Opening remarks
Ranjay Krishna, Stanford University,
Graphs and Relationships in Computer Vision
09:15 - 10:00
Invited talk
Justin Johnson, University of Michigan,
Graphs for Scenes and Shapes
10:00 - 10:45
Coffee + Posters
Poster Session outside room along with coffee break
10:45 - 11:30
Invited talk
Stephan Günnemann, Technische Universität München,
Adversarial Robustness of Machine Learning Models for Graphs
11:30 - 12:15
Invited talk
Mohammad Rastegari, Allen Institute for Artificial Intelligence,
Scene Navigation by Knowledge Graph and Interaction
12:15 - 14:00
Lunch break
Lunch on your own
14:00 - 14:45
Invited talk
Devi Parikh and Jianwei Yang, Georgia Tech and Facebook,
Beyond a self-sufficient pixel tensor: Modeling external knowledge and internal image structure
14:45 - 15:30
Invited talk
Yejin Choi, University of Washington,
Commonsense Modeling with Commonsense Graphs
15:30 - 16:30
Coffee + Posters
Poster Session outside room along with coffee break
16:30 - 17:10
Invited talk
Sanja Fidler, University of Toronto and NVIDIA,
Graph-based Visual Reasoning and Simulation
17:10 - 17:20
Closing remarks
Ranjay Krishna, Stanford University,
Closing Remarks

Email Ranjay if you want to see the recorded videos.

Important Dates and Details

  • Sign up to receive updates: using this form
  • Apply to be part of Program Committee by: August 01, 2019 using this link
  • Paper submission deadline: August 15, 2019 at 11:59pm via this CMT portal, using this ICCV template
  • Notification of acceptance: September 07, 2019
  • Camera ready and posters due: September 30, 2019
  • Submit questions/topics to discuss: Submit topics or questions you want to discuss during the panel session
  • Workshop: October 28, at ICCV 2019, Seoul, South Korea

Invited speakers

Devi Parikh is an Assistant Professor in the School of Interactive Computing at Georgia Tech, and a Research Scientist at Facebook AI Research (FAIR). Her recent work involves exploring problems at the intersection of vision and language, and leveraging human-machine collaboration for building smarter machines. She has also worked on other topics such as ensemble of classifiers, data fusion, inference in probabilistic models, 3D reassembly, barcode segmentation, computational photography, interactive computer vision, contextual reasoning, hierarchical representations of images, and human-debugging.

Yejin Choi is an associate professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, an adjunct in the Linguistics department, and an affiliate of the Center for Statistics and Social Sciences. She is also a senior research manager at the Allen Institute for Artificial Intelligence. She is a co-recipient of the Marr Prize (best paper award) at ICCV 2013, a recipient of the Borg Early Career Award (BECA) in 2018, and was named among IEEE AI's 10 to Watch in 2016. She received her Ph.D. in Computer Science from Cornell University (advisor: Prof. Claire Cardie) and her BS in Computer Science and Engineering from Seoul National University in Korea.

Mohammad Rastegari is a Research Scientist at AI2. His main area of research lies at the intersection of Computer Vision and Machine Learning. Previously, he was a Facebook Fellow and a visiting scholar at UC Berkeley, and he received his Ph.D. from the University of Maryland. Mohammad is one of the creators of XNOR-Net, an efficient deep learning model for resource-constrained compute platforms. Mohammad received his Bachelor's degree from the Shomal University of Amol and his Master's degree from the University of Science and Research in Tehran, where he was also a member of the computer vision lab at the Institute for Research in Fundamental Science.

Stephan Günnemann conducts research in the area of data mining and machine learning. The focus of his work is on the design and analysis of robust and scalable machine learning techniques with the goal to enable a reliable analysis of the massive amounts of data collected by science and industry. Prof. Günnemann is particularly interested in studying the principles for analyzing complex data such as networks, graphs and temporal data. He acquired his doctoral degree in 2012 at RWTH Aachen University in the field of computer science. In 2015, Prof. Günnemann set up an Emmy Noether research group at TUM Department of Informatics. He has been a professor of data mining & analytics at TUM since 2016.

Sanja Fidler is an Assistant Professor at the University of Toronto and a Director of AI at NVIDIA, leading a research lab in Toronto. Prior to coming to Toronto, in 2012/2013, she was a Research Assistant Professor at Toyota Technological Institute at Chicago, an academic institute located on the campus of the University of Chicago. She did her postdoc with Prof. Sven Dickinson at the University of Toronto in 2011/2012. She finished her PhD in 2010 at the University of Ljubljana in Slovenia in the group of Prof. Ales Leonardis. In 2010, she visited Prof. Trevor Darrell's group at UC Berkeley and ICSI. She received her BSc degree in Applied Math from the University of Ljubljana. More information is in her CV.

Justin Johnson is an Assistant Professor of Computer Science and Engineering at the University of Michigan. Prior to that he was a Research Scientist at Facebook AI Research, and he completed his PhD at Stanford University, advised by Fei-Fei Li. His research interests lie primarily in computer vision and include visual reasoning, image synthesis, and 3D perception.

Jianwei Yang is a Ph.D. student in Computer Science at Georgia Tech, advised by Prof. Devi Parikh, and works closely with Prof. Dhruv Batra. Prior to his Ph.D., he completed his master's studies at CBSR & NLPR, CASIA and BUPT, under the supervision of Prof. Stan Z. Li and mentored by Prof. Zhen Lei. His research interests mainly span computer vision, machine learning, and vision and language. His primary research is about structural visual understanding and how to leverage it for intelligent interactions with humans (language) and the environment (embodiment).

Call for papers

This workshop aims to bring together researchers from both academia and industry interested in addressing various aspects of graphical representation learning. Topics include, but are not limited to:

  • Algorithmic approaches: How should we improve graph embeddings, probabilistic approaches based on latent variable modeling, message-passing neural networks, dimensionality reduction techniques etc.?
  • Evaluation methods: How to evaluate scene graph representations in terms of trade-offs with different learning objectives?
  • Theoretical aspects: When and why do learned representations aid visual relationship reasoning? How does the non-i.i.d. nature of graph-based data conflict with our current understanding of representation learning?
  • Optimization and scalability challenges: How should we handle the inherent discreteness and curse of dimensionality of graph-based data and challenges in negative sampling?
  • Domain-specific applications: How should we improve semantic visual understanding, scene graph mining, natural language processing, reinforcement learning, programming language analysis etc.?
  • Any other topic of interest for scene graph representation and learning.

If you work on one of the following research areas, this workshop is the right place for you:

  • Scene graphs
  • Relational learning or knowledge bases
  • Models that expect or generate structured inputs or outputs
  • Models that utilize structured intermediate representation
  • High-level semantic vision tasks like commonsense reasoning
  • Spectral or graph neural networks
  • Beyond objects and contextual learning
  • Any other related research area

We invite researchers and practitioners to submit their work (between 4 and 8 pages in the ICCV template) to this CMT portal.

Please contact Ranjay Krishna with any questions: ranjaykrishna [at] cs [dot] stanford [dot] edu.

Accepted papers

Related previous workshops

To the best of our knowledge, this is the first workshop focused primarily on graph-structured representation learning for computer vision. The two related previous workshops are:

  • Vision and Language: Unlike Vision and Language, we are primarily focused on graphical representations while language is usually treated as a special case where the representation is a sequence. While scene graphs and similar representations are often canonicalized to language and meaning, their connections to language are utilized as an additional signal and not a requirement.
  • Relational Representation Learning: Relational Representation Learning is more closely related to our workshop but was organized for a non-vision community and primarily focused on graph-based data found in social networks and knowledge bases. In knowledge bases, relationships are intentionally constructed, so that pattern-based methods are successful. Relationships in text are usually document-agnostic (e.g. Tokyo-is capital of-Japan). In vision, images and visual relationships are incidental, not intentional. They are always image-specific because they only depend on the contents of the particular image they appear in. Therefore, methods that rely on knowledge from external sources or on patterns over concepts (e.g. most instances of dog next to frisbee are playing with it) do not generalize well for visual relationships or scene graphs. The inability to utilize the progress in methods for text raises the need for methods that are specialized for visual knowledge.

Program Committee

  • Justin Johnson - Facebook and University of Michigan
  • Danfei Xu - Stanford University
  • Alejandro Newell - Google and University of Michigan
  • Ronghang Hu - UC Berkeley
  • Paroma Varma - Stanford University
  • Kenji Hata - Google
  • Khaled Jedoui - Stanford University
  • Jingwei Ji - Stanford University
  • Sima Yazdani - Cisco
  • Si Liu - Beihang University
  • Xin Wang - UC Santa Barbara
  • Biplab Banerjee - IIT Bombay
  • Subarna Tripathi - Intel
  • Jianwei Yang - Georgia Tech
  • Peng Dai - Ryerson University
  • Liping Yang - Los Alamos National Laboratory
  • More to come...

If you are interested in taking a more active part in the workshop, apply to join the program committee using this link.