Scene Graph Representation and Learning

First workshop on graph-based learning in computer vision

Held in conjunction with ICCV 2019 on October 28th in Seoul, Korea


Images are more than a collection of objects or attributes: they represent a web of relationships among interconnected objects. In an effort to formalize a representation for images, Visual Genome defined scene graphs, a structured formal graphical representation of an image that is similar to the form widely used in knowledge base representations. Each scene graph encodes objects (e.g., dog, frisbee) as nodes connected via pairwise relationships (e.g., playing with) as edges. Scene graphs have led to many state-of-the-art models in image captioning, image retrieval, visual question answering, relationship modeling and image generation. Numerous other graph-based representations have been introduced for 3D geometry, for part-based object recognition, for understanding instructional videos, and for situational role classification. This workshop aims to discuss the progress, the benefits and the shortcomings of all graph-based representations and learning algorithms.
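
As a minimal sketch of the representation described above, a scene graph can be stored as a list of object nodes plus directed (subject, predicate, object) edges. The class and method names below are illustrative assumptions and do not reproduce the Visual Genome schema.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph: objects are nodes, relationships are directed edges."""

    objects: list = field(default_factory=list)        # node labels, e.g. "dog"
    relationships: list = field(default_factory=list)  # (subject_idx, predicate, object_idx)

    def add_object(self, label):
        """Add a node and return its index."""
        self.objects.append(label)
        return len(self.objects) - 1

    def add_relationship(self, subject, predicate, obj):
        """Add a directed edge between two existing nodes."""
        for idx in (subject, obj):
            if not 0 <= idx < len(self.objects):
                raise IndexError(f"unknown object index: {idx}")
        self.relationships.append((subject, predicate, obj))

    def triples(self):
        """Return the edges as human-readable (subject, predicate, object) labels."""
        return [(self.objects[s], p, self.objects[o]) for s, p, o in self.relationships]


# Build the example from the text: a dog playing with a frisbee.
graph = SceneGraph()
dog = graph.add_object("dog")
frisbee = graph.add_object("frisbee")
graph.add_relationship(dog, "playing with", frisbee)
print(graph.triples())
```

Storing edges by node index rather than by label lets an image contain several objects with the same label (for example, two dogs) without ambiguity.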

Graphs have also enabled the innovation, adoption and use of numerous new spectral-based models like graph convolutions and graph-based evaluation metrics like SPICE. Modeling graphical data has historically been challenging for the machine learning community, especially when dealing with large amounts of data. Traditionally, methods have relied on Laplacian regularization through label propagation, manifold regularization or learning embeddings. Operators on local neighborhoods of nodes then became popular because they scale to larger amounts of data and their computation is parallelizable. The graph convolution, itself inspired by these Laplacian-based local operations, has become the de facto architecture for graphical data. Graph convolutions and similar techniques are slowly making their way into computer vision tasks and have recently been combined with RCNN to perform scene graph detection.
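
To make the idea of a local neighborhood operator concrete, here is a small sketch of one graph convolution layer in plain Python. It assumes one common formulation, ReLU(D^{-1/2} (A + I) D^{-1/2} X W), where A is the adjacency matrix, D the degree matrix of A + I, X the node features and W a weight matrix. The function names and the toy graph are illustrative assumptions, not something the workshop prescribes.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix (list of lists)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features and every degree is positive.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def graph_conv(adj, features, weights):
    """One layer: aggregate neighbor features, apply a linear map, then ReLU."""
    propagated = matmul(normalized_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]


# Toy path graph 0 - 1 - 2 with a single feature that is non-zero only on node 0.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]
weights = [[1.0]]
out = graph_conv(adj, features, weights)
print(out)
```

After one layer the feature on node 0 has spread to its neighbor, node 1, but not to node 2, which is two hops away. Each layer only touches a node's immediate neighborhood, which is why the computation parallelizes across nodes.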

At this workshop, we hope to discuss the importance of structure in computer vision. How should we be representing scenes, videos, and 3D spaces? What connections to language and knowledge bases could aid vision tasks? How can we rethink the machine learning community’s traditional relation-based representation learning? How can we both use and build upon spectral methods like random walks over graphs, message passing protocols, set-invariant neural architectures, and equivariant structured outputs? What are the shortcomings of our current representations and learning-based methods, and how can we remedy these problems? What tasks and directions should we be urging the community to move towards?

To receive notifications about updates related to the workshop, sign up using this form.

Important Dates and Details

  • Sign up to receive updates: using this form
  • Apply to be part of Program Committee by: August 01, 2019 using this link
  • Paper submission deadline: August 15, 2019 at 11:59pm, via this CMT portal, using this ICCV template
  • Notification of acceptance: September 07, 2019
  • Camera ready and posters due: October 15, 2019
  • Submit questions/topics to discuss: Submit topics or questions you want to discuss during the panel session
  • Workshop: October 28, at ICCV 2019, Seoul, South Korea

Invited speakers

Fei-Fei Li is currently the Co-Director of the Stanford Human-Centered AI (HAI) Institute, a Stanford University Institute to advance AI research, education, policy and practice to benefit humanity, by bringing together interdisciplinary scholarship across the university. She is also a Co-Director and Co-PI of the Stanford Vision and Learning Lab, where she works with the most brilliant students and colleagues worldwide to build smart algorithms that enable computers and robots to see and think. Fei-Fei is a Co-Founder and Chairperson of the national nonprofit organization AI4ALL, dedicated to increasing diversity and inclusion in AI education.

Devi Parikh is an Assistant Professor in the School of Interactive Computing at Georgia Tech, and a Research Scientist at Facebook AI Research (FAIR). Her recent work involves exploring problems at the intersection of vision and language, and leveraging human-machine collaboration for building smarter machines. She has also worked on other topics such as ensemble of classifiers, data fusion, inference in probabilistic models, 3D reassembly, barcode segmentation, computational photography, interactive computer vision, contextual reasoning, hierarchical representations of images, and human-debugging.

Ali Farhadi is an Associate Professor in the Department of Computer Science and Engineering at the University of Washington. He also leads the PRIOR team at the Allen Institute for Artificial Intelligence. He is mainly interested in computer vision, machine learning, the intersection of natural language and vision, analysis of the role of semantics in visual understanding, and visual reasoning.

Yejin Choi is an associate professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, an adjunct of the Linguistics department, and an affiliate of the Center for Statistics and Social Sciences. She is also a senior research manager at the Allen Institute for Artificial Intelligence. She is a co-recipient of the Marr Prize (best paper award) at ICCV 2013, a recipient of the Borg Early Career Award (BECA) in 2018, and was named among IEEE AI's 10 to Watch in 2016. She received her Ph.D. in Computer Science from Cornell University (advisor: Prof. Claire Cardie) and her BS in Computer Science and Engineering from Seoul National University in Korea.

Vittorio Ferrari leads a research group on visual learning at Google. He received his PhD from ETH Zurich in 2004, then was a post-doc at INRIA Grenoble (2006-2007) and at the University of Oxford (2007-2008). Between 2008 and 2012 he was an Assistant Professor at ETH Zurich. In 2012-2018 he was faculty at the University of Edinburgh, where he became a Full Professor in 2016. In 2012 he received the prestigious ERC Starting Grant, and the best paper award from the European Conference on Computer Vision. He is the author of over 110 technical publications. He regularly serves as an Area Chair for the major computer vision conferences; he was a Program Chair for ECCV 2018 and will be a General Chair for ECCV 2020. He is an Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence. His current research interests are in learning visual models with minimal human supervision, human-machine collaboration, and semantic segmentation.

Stephan Günnemann conducts research in the area of data mining and machine learning. The focus of his work is on the design and analysis of robust and scalable machine learning techniques with the goal to enable a reliable analysis of the massive amounts of data collected by science and industry. Prof. Günnemann is particularly interested in studying the principles for analyzing complex data such as networks, graphs and temporal data. He acquired his doctoral degree in 2012 at RWTH Aachen University in the field of computer science. In 2015, Prof. Günnemann set up an Emmy Noether research group at TUM Department of Informatics. He has been a professor of data mining & analytics at TUM since 2016.

Laurens van der Maaten is a Research Scientist at Facebook AI Research in New York, working on machine learning and computer vision. Before, he worked as an Assistant Professor at Delft University of Technology, as a post-doctoral researcher at UC San Diego, and as a Ph.D. student at Tilburg University. He is interested in a variety of topics in machine learning and computer vision. Currently, he is working on embedding models, large-scale weakly supervised learning, visual reasoning, and cost-sensitive learning.

Kate Saenko is an Associate Professor at the Department of Computer Science at Boston University, and the director of the Computer Vision and Learning Group and member of the IVC Group. She received her PhD from MIT. Previously, she was an Assistant Professor at the Department of Computer Science at UMass Lowell, a Postdoctoral Researcher at the International Computer Science Institute, a Visiting Scholar at UC Berkeley EECS and a Visiting Postdoctoral Fellow in the School of Engineering and Applied Science at Harvard University. Her research interests are in the broad area of Artificial Intelligence with a focus on Adaptive Machine Learning, Learning for Vision and Language Understanding, and Deep Learning.

Call for papers

This workshop aims to bring together researchers from both academia and industry interested in addressing various aspects of graphical representation learning. Topics include, but are not limited to:

  • Algorithmic approaches: How should we improve graph embeddings, probabilistic approaches based on latent variable modeling, message-passing neural networks, dimensionality reduction techniques etc.?
  • Evaluation methods: How should we evaluate scene graph representations in terms of trade-offs with different learning objectives?
  • Theoretical aspects: When and why do learned representations aid visual relationship reasoning? How does the non-i.i.d. nature of graph-based data conflict with our current understanding of representation learning?
  • Optimization and scalability challenges: How should we handle the inherent discreteness and curse of dimensionality of graph-based data and challenges in negative sampling?
  • Domain-specific applications: How should we improve semantic visual understanding, scene graph mining, natural language processing, reinforcement learning, programming language analysis etc.?
  • Any other topic of interest for scene graph representation and learning.

If you work on one of the following research areas, this workshop is the right place for you:

  • Scene graphs
  • Relational learning or knowledge bases
  • Models that expect or generate structured inputs or outputs
  • Models that utilize structured intermediate representation
  • High-level semantic vision tasks like commonsense reasoning
  • Spectral or graph neural networks
  • Beyond objects and contextual learning
  • Any other related research area

We invite researchers and practitioners to submit their work (between 4 and 8 pages in the ICCV template) to this CMT portal.

Please contact Ranjay Krishna with any questions: ranjaykrishna [at] cs [dot] stanford [dot] edu.

Related previous workshops

To the best of our knowledge, this is the first workshop focused primarily on graph structured representation learning for computer vision. The two related previous workshops are:

  • Vision and Language: Unlike Vision and Language, we are primarily focused on graphical representations while language is usually treated as a special case where the representation is a sequence. While scene graphs and similar representations are often canonicalized to language and meaning, their connections to language are utilized as an additional signal and not a requirement.
  • Relational Representation Learning: Relational Representation Learning is more closely related to our workshop but was organized for a non-vision community and primarily focused on graph-based data found in social networks and knowledge bases. In knowledge bases, relationships are intentionally constructed, so that pattern-based methods are successful. Relationships in text are usually document-agnostic (e.g. Tokyo-is capital of-Japan). In vision, images and visual relationships are incidental, not intentional. They are always image-specific because they only depend on the contents of the particular image they appear in. Therefore, methods that rely on knowledge from external sources or on patterns over concepts (e.g. most instances of dog next to frisbee are playing with it) do not generalize well for visual relationships or scene graphs. The inability to utilize the progress in methods for text raises the need for methods that are specialized for visual knowledge.

Program Committee

  • Justin Johnson - Facebook and University of Michigan
  • Danfei Xu - Stanford University
  • Alejandro Newell - Google and University of Michigan
  • Ronghang Hu - UC Berkeley
  • Paroma Varma - Stanford University
  • Kenji Hata - Google
  • Khaled Jedoui - Stanford University
  • Jingwei Ji - Stanford University
  • Sima Yazdani - Cisco
  • Si Liu - Beihang University
  • Xin Wang - UC Santa Barbara
  • Biplab Banerjee - IIT Bombay
  • Subarna Tripathi - Intel
  • Jianwei Yang - Georgia Tech
  • Peng Dai - Ryerson University
  • Liping Yang - Los Alamos National Laboratory
  • More to come...

If you are interested in taking a more active part in the workshop, apply to join the program committee using this link.