Language is the most important channel for humans to communicate about what they see. To allow an intelligent system to communicate effectively with humans, it is thus important to enable it to relate information in words and sentences to the visual world. One component of successful communication is the ability to answer natural language questions about the visual world. A second component is the ability of the system to explain, in natural language, why it gave a certain answer, allowing a human to trust and understand it. In my talk, I will show how we can build models which answer questions while remaining modular and exposing their semantic reasoning structure. To explain the answer in natural language, I will discuss how we can learn to generate explanations given only image captions as training data, by introducing a discriminative loss and using reinforcement learning.
In his research, Marcus Rohrbach focuses on relating visual recognition and natural language understanding with machine learning. He is currently a post-doc with Trevor Darrell at UC Berkeley. He and his collaborators received the NAACL 2016 best paper award for their work on Neural Module Networks and won the Visual Question Answering Challenge 2016. During his PhD, he worked at the Max Planck Institute for Informatics, Germany, with Bernt Schiele and Manfred Pinkal. He completed his PhD in 2014 with summa cum laude at Saarland University and received the DAGM MVTec Dissertation Award 2015 from the German Pattern Recognition Society for his thesis. His BSc and MSc degrees in Computer Science are from the University of Technology Darmstadt, Germany (2006 and 2009). After his BSc, he spent one year at the University of British Columbia, Canada, as a visiting graduate student.