Scene Graph Representation and Learning
First workshop on graph-based learning in computer vision
Held in conjunction with ICCV 2019 on October 28th in Seoul, Korea
Images are more than a collection of objects or attributes: they represent a web of relationships among interconnected objects. In an effort to formalize a representation for images, Visual Genome defined scene graphs, a structured formal graphical representation of an image similar to the form widely used in knowledge base representations. Each scene graph encodes objects (e.g., dog, frisbee) as nodes connected via pairwise relationships (e.g., playing with) as edges. Scene graphs have led to many state-of-the-art models in image captioning, image retrieval, visual question answering, relationship modeling and image generation. Numerous other graph-based representations have been introduced for 3D geometry, for part-based object recognition, for understanding instructional videos, and for situational role classification. This workshop aims to discuss the progress, the benefits and the shortcomings of all graph-based representations and learning algorithms.
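The structure described above can be sketched as a small data structure. This is a hypothetical minimal sketch for illustration only, not the Visual Genome schema or API; the class and method names are invented here:

```python
# Minimal sketch of a scene graph: objects as nodes, pairwise
# relationships as directed, labeled edges (illustrative only,
# not the Visual Genome schema).
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)        # node labels, e.g. "dog"
    relationships: list = field(default_factory=list)  # (subject_idx, predicate, object_idx)

    def add_object(self, label):
        self.objects.append(label)
        return len(self.objects) - 1

    def add_relationship(self, subj, predicate, obj):
        self.relationships.append((subj, predicate, obj))

g = SceneGraph()
dog = g.add_object("dog")
frisbee = g.add_object("frisbee")
g.add_relationship(dog, "playing with", frisbee)

# Render each relationship as a (subject, predicate, object) triple,
# the same form used in knowledge base representations.
triples = [(g.objects[s], p, g.objects[o]) for s, p, o in g.relationships]
print(triples)  # [('dog', 'playing with', 'frisbee')]
```

Storing relationships as indexed triples keeps every edge image-specific: the same predicate can connect different object instances in different images.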
Graphs have also enabled the innovation, adoption and use of numerous new spectral-based models like graph convolutions and graph-based evaluation metrics like SPICE. Modeling graphical data has historically been challenging for the machine learning community, especially when dealing with large amounts of data. Traditionally, methods relied on Laplacian regularization through label propagation, manifold regularization or learned embeddings. Later, operators on local neighborhoods of nodes became popular for their ability to scale to larger amounts of data and their parallelizable computation. Today, the graph convolution, itself inspired by these Laplacian-based local operations, has become the de facto architecture for graphical data. Graph convolutions and similar techniques are slowly making their way into computer vision tasks and have recently been combined with R-CNN to perform scene graph detection.
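The local, Laplacian-inspired operation mentioned above can be sketched in a few lines. This is an illustrative NumPy sketch of one graph-convolution layer in the style of Kipf and Welling (symmetrically normalized adjacency with self-loops), assuming a dense adjacency matrix; it is not a production implementation:

```python
# One graph-convolution layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
# Illustrative NumPy sketch assuming a small dense adjacency matrix.
import numpy as np

def gcn_layer(A, H, W):
    """Normalize the adjacency, mix each node's features with its
    neighbors', then apply a linear map and a ReLU nonlinearity."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)                       # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)      # local mixing + ReLU

# Toy graph: 3 nodes with edges 0-1 and 1-2; 2-d input features,
# 2-unit output layer.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3, 2)                                # simple node features
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))                 # random layer weights
H_out = gcn_layer(A, H, W)
print(H_out.shape)  # (3, 2)
```

Because the layer only mixes features over one-hop neighborhoods, it scales to large graphs and parallelizes naturally, which is what displaced the earlier whole-graph Laplacian-regularization methods.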
At this workshop, we hope to discuss the importance of structure in computer vision. How should we represent scenes, videos, and 3D spaces? What connections to language and knowledge bases could aid vision tasks? How can we rethink the machine learning community’s traditional relation-based representation learning? How can we both use and build upon spectral methods like random walks over graphs, message passing protocols, set-invariant neural architectures, and equivariant structured outputs? What are the shortcomings of our current representations and learning-based methods, and how can we remedy them? What tasks and directions should we urge the community to move toward?
To receive notifications about updates related to the workshop, sign up using this form.
Important Dates and Details
- Sign up to receive updates: using this form
- Apply to be part of Program Committee by: August 01, 2019 using this link
- Paper submission deadline: August 15, 2019 at 11:59pm via this CMT portal, using this ICCV template
- Notification of acceptance: September 07, 2019
- Camera ready and posters due: October 15, 2019
- Panel discussion: Submit topics or questions you want to discuss during the panel session
- Workshop: October 28, at ICCV 2019, Seoul, South Korea
Call for papers
This workshop aims to bring together researchers from both academia and industry interested in addressing various aspects of graphical representation learning. Topics include, but are not limited to:
- Algorithmic approaches: How should we improve graph embeddings, probabilistic approaches based on latent variable modeling, message-passing neural networks, dimensionality reduction techniques etc.?
- Evaluation methods: How should we evaluate scene graph representations, and what trade-offs arise under different learning objectives?
- Theoretical aspects: When and why do learned representations aid visual relationship reasoning? How does the non-i.i.d. nature of graph-based data conflict with our current understanding of representation learning?
- Optimization and scalability challenges: How should we handle the inherent discreteness and curse of dimensionality of graph-based data and challenges in negative sampling?
- Domain-specific applications: How should we improve semantic visual understanding, scene graph mining, natural language processing, reinforcement learning, programming language analysis etc.?
- Any other topic of interest for scene graph representation and learning.
If you work on one of the following research areas, this workshop is the right place for you:
- Scene graphs
- Relational learning or knowledge bases
- Models that expect or generate structured inputs or outputs
- Models that utilize structured intermediate representation
- High-level semantic vision tasks like commonsense reasoning
- Spectral or graph neural networks
- Beyond objects and contextual learning
- Any other related research area
Please contact Ranjay Krishna with any questions: ranjaykrishna [at] cs [dot] stanford [dot] edu.
Related previous workshops
To the best of our knowledge, this is the first workshop focused primarily on graph structured representation learning for computer vision. The two related previous workshops are:
- Vision and Language: Unlike Vision and Language, we are primarily focused on graphical representations, while language is usually treated as a special case where the representation is a sequence. Although scene graphs and similar representations are often canonicalized to language and meaning, their connections to language are utilized as an additional signal, not a requirement.
- Relational Representation Learning: Relational Representation Learning is more closely related to our workshop but was organized for a non-vision community and primarily focused on graph-based data found in social networks and knowledge bases. In knowledge bases, relationships are intentionally constructed, so that pattern-based methods are successful. Relationships in text are usually document-agnostic (e.g. Tokyo-is capital of-Japan). In vision, images and visual relationships are incidental, not intentional. They are always image-specific because they only depend on the contents of the particular image they appear in. Therefore, methods that rely on knowledge from external sources or on patterns over concepts (e.g. most instances of dog next to frisbee are playing with it) do not generalize well for visual relationships or scene graphs. The inability to utilize the progress in methods for text raises the need for methods that are specialized for visual knowledge.
Organizers
- Justin Johnson - Facebook and University of Michigan
- Danfei Xu - Stanford University
- Alejandro Newell - Google and University of Michigan
- Ronghang Hu - UC Berkeley
- Paroma Varma - Stanford University
- Kenji Hata - Google
- Khaled Jedoui - Stanford University
- Jingwei Ji - Stanford University
- Sima Yazdani - Cisco
- Si Liu - Beihang University
- Xin Wang - UC Santa Barbara
- Biplab Banerjee - IIT Bombay
- Subarna Tripathi - Intel
- Jianwei Yang - Georgia Tech
- Peng Dai - Ryerson University
- Liping Yang - Los Alamos National Laboratory
- More to come...
If you are interested in taking a more active part in the workshop, apply to join the program committee using this link.