Scene Graph Representation and Learning
First workshop on graph-based learning in computer vision
Held in conjunction with ICCV 2019 on October 28th in Seoul, Korea
Location: Room 318 B-C at the COEX Convention Center
Overview
Images are more than a collection of objects or attributes: they represent a web of relationships among interconnected objects. In an effort to formalize a representation for images, Visual Genome defined scene graphs, a structured, formal graphical representation of an image similar to the form widely used in knowledge base representations. Each scene graph encodes objects (e.g., dog, frisbee) as nodes connected by pairwise relationships (e.g., playing with) as edges. Scene graphs have led to many state-of-the-art models in image captioning, image retrieval, visual question answering, relationship modeling, and image generation. Numerous other graph-based representations have been introduced for 3D geometry, part-based object recognition, understanding instructional videos, and situational role classification. This workshop aims to discuss the progress, the benefits, and the shortcomings of graph-based representations and learning algorithms.
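To make the representation concrete, below is a minimal sketch of a scene graph as a plain data structure; the SceneGraph class and its field names are our own illustration, not Visual Genome's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """A toy scene graph: objects are nodes; pairwise relationships are edges."""
    objects: list = field(default_factory=list)        # node labels, e.g. ["dog", "frisbee"]
    attributes: dict = field(default_factory=dict)     # node index -> list of attributes
    relationships: list = field(default_factory=list)  # (subject idx, predicate, object idx)

# Scene graph for "a brown dog playing with a frisbee":
g = SceneGraph(
    objects=["dog", "frisbee"],
    attributes={0: ["brown"]},
    relationships=[(0, "playing with", 1)],
)
```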
Graphs have also enabled the innovation, adoption, and use of numerous new spectral-based models like graph convolutions and graph-based evaluation metrics like SPICE. Modeling graphical data has historically been challenging for the machine learning community, especially when dealing with large amounts of data. Traditionally, methods relied on Laplacian regularization through label propagation, manifold regularization, or learned embeddings. Later, operators on local neighborhoods of nodes became popular because they scale to larger datasets and parallelize well. Today's architecture of choice, the graph convolution, has become the de facto standard for graphical data; it too was inspired by these Laplacian-based local operations. Graph convolutions and similar techniques are slowly making their way into computer vision tasks and have recently been combined with RCNN to perform scene graph detection.
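As a concrete illustration of these Laplacian-inspired local operations, here is a minimal NumPy sketch of a single graph convolution layer following Kipf and Welling's propagation rule; the function and variable names are ours, and a real model would add learned parameters, multiple layers, and a task-specific loss.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    A: (n, n) adjacency matrix; H: (n, d_in) node features; W: (d_in, d_out) weights.
    Adding self-loops (A + I) lets each node keep its own features, and the
    symmetric degree normalization is the Laplacian-style local averaging
    discussed above.
    """
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1)) # diagonal of D^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)        # ReLU nonlinearity

# Tiny example: a 3-node path graph, 2 input features, 4 output features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.random.randn(3, 2)
W = np.random.randn(2, 4)
print(gcn_layer(A, H, W).shape)  # (3, 4)
```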
At this workshop, we hope to discuss the importance of structure in computer vision. How should we be representing scenes, videos, and 3D spaces? What connections to language and knowledge bases could aid vision tasks? How can we rethink the machine learning community's traditional relation-based representation learning? How can we both use and build upon spectral methods like random walks over graphs, message passing protocols, set-invariant neural architectures, and equivariant structured outputs? What are the shortcomings of our current representations and learning-based methods, and how can we remedy them? What tasks and directions should we be urging the community to move toward?
To receive notifications about updates related to the workshop, sign up using this form.
Program schedule - Room 318 B-C
Adversarial Robustness of Machine Learning Models for Graphs
Scene Navigation by Knowledge Graph and Interaction
Beyond a self-sufficient pixel tensor: Modeling external knowledge and internal image structure
Email Ranjay if you want to see the recorded videos.
Images from the workshop
Important Dates and Details
- Sign up to receive updates using this form
- Apply to join the Program Committee by August 01, 2019 using this link
- Paper submission deadline: August 15, 2019 at 11:59pm via this CMT portal, using this ICCV template
- Notification of acceptance: September 07, 2019
- Camera ready and posters due: September 30, 2019
- Panel discussion: submit topics or questions you want to discuss during the panel session
- Workshop: October 28 at ICCV 2019, Seoul, South Korea
Invited speakers
Call for papers
This workshop aims to bring together researchers from both academia and industry interested in addressing various aspects of graphical representation learning. Topics include, but are not limited to:
- Algorithmic approaches: How should we improve graph embeddings, probabilistic approaches based on latent variable modeling, message-passing neural networks, dimensionality reduction techniques, etc.?
- Evaluation methods: How to evaluate scene graph representations in terms of trade-offs with different learning objectives?
- Theoretical aspects: When and why do learned representations aid visual relationship reasoning? How does the non-i.i.d. nature of graph-based data conflict with our current understanding of representation learning?
- Optimization and scalability challenges: How should we handle the inherent discreteness and curse of dimensionality of graph-based data and challenges in negative sampling?
- Domain-specific applications: How should we improve semantic visual understanding, scene graph mining, natural language processing, reinforcement learning, programming language analysis, etc.?
- Any other topic of interest for scene graph representation and learning.
If you work on one of the following research areas, this workshop is the right place for you:
- Scene graphs
- Relational learning or knowledge bases
- Models that expect or generate structured inputs or outputs
- Models that utilize structured intermediate representations
- High-level semantic vision tasks like commonsense reasoning
- Spectral or graph neural networks
- Beyond objects and contextual learning
- Any other related research area
We invite researchers and practitioners to submit their work (4 to 8 pages in the ICCV template) to this CMT portal.
Organizers
Please contact Ranjay Krishna with any questions: ranjaykrishna [at] cs [dot] stanford [dot] edu.
Accepted papers
- A Topological Graph-Based Representation for Denoising Low Quality Binary Images Catherine Potts, Liping Yang, Diane Oyen, Brent Wohlberg
- Differentiable Scene Graphs Moshiko Raboh, Roei Herzig, Jonathan Berant, Gal Chechik, Amir Globerson
- Visual Relationships as Functions: Enabling Few-Shot Scene Graph Prediction Apoorva Dornadula, Austin O Narcomey, Ranjay Krishna, Michael S Bernstein, Li Fei-Fei
- Graph Attention Model Embedded with Multi-Modal Knowledge for Depression Detection Wenbo Zheng
- KM^4: Visual Reasoning via Knowledge Embedding Memory Model with Mutual Modulation Wenbo Zheng
- Spatial Residual Layer and Dense Connection Block Enhanced Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition Cong Wu, Xiao-Jun Wu, Josef Kittler
- Detecting Visual Relationships Using Box Attention Alexander Kolesnikov, Alina Kuznetsova, Christoph H Lampert, Vittorio Ferrari
- Attention-Translation-Relation Network for Scalable Scene Graph Generation Nikolaos Gkanatsios, Vassilis Pitsikalis, Petros Koutras, Petros Maragos
- SynthRel 0: Towards Synthetic Relationship Datasets for Prototyping and Diagnostics Daniel K Dorda
- Scene Graph Prediction with Limited Labels Vincent S Chen, Paroma Varma, Ranjay Krishna, Michael S Bernstein, Christopher Re, Li Fei-Fei
- VideoGraph: Recognizing Minutes-Long Human Activities in Videos Noureldien Hussein, Stratis Gavves, Arnold Smeulders
- Triplet-Aware Scene Graph Embeddings Brigit Schroeder, Subarna Tripathi, Hanlin Tang
Related previous workshops
To the best of our knowledge, this is the first workshop focused primarily on graph-structured representation learning for computer vision. The two most closely related previous workshops are:
- Vision and Language: Unlike Vision and Language, we focus primarily on graphical representations, while language is usually treated as a special case in which the representation is a sequence. While scene graphs and similar representations are often canonicalized into language, their connections to language are used as an additional signal, not a requirement.
- Relational Representation Learning: Relational Representation Learning is more closely related to our workshop but was organized for a non-vision community and focused primarily on graph-based data from social networks and knowledge bases. In knowledge bases, relationships are intentionally constructed, so pattern-based methods succeed. Relationships in text are usually document-agnostic (e.g., Tokyo - is capital of - Japan). In vision, by contrast, images and visual relationships are incidental, not intentional: they are always image-specific because they depend only on the contents of the particular image in which they appear. Therefore, methods that rely on knowledge from external sources or on patterns over concepts (e.g., most instances of dog next to frisbee are playing with it) do not generalize well to visual relationships or scene graphs. The inability to reuse progress on text-based methods raises the need for methods specialized for visual knowledge.
Program Committee
- Justin Johnson - Facebook and University of Michigan
- Danfei Xu - Stanford University
- Alejandro Newell - Google and University of Michigan
- Ronghang Hu - UC Berkeley
- Paroma Varma - Stanford University
- Kenji Hata - Google
- Khaled Jedoui - Stanford University
- Jingwei Ji - Stanford University
- Sima Yazdani - Cisco
- Si Liu - Beihang University
- Xin Wang - UC Santa Barbara
- Biplab Banerjee - IIT Bombay
- Subarna Tripathi - Intel
- Jianwei Yang - Georgia Tech
- Peng Dai - Ryerson University
- Liping Yang - Los Alamos National Laboratory
- More to come...
If you are interested in taking a more active part in the workshop, apply to join the program committee using this link.