The breakthrough in visual understanding by deep learning is fueled by fast hardware and huge manually annotated datasets. We conduct research on how to reduce the manual annotation effort while retaining the accuracy of deep learning models. Our research includes how to add knowledge to a deep network in a "batteries included" manner, how to incorporate human interaction, and how to exploit weak, noisy, or structured forms of annotation.
An image offers a wealth of information. Automatic analysis includes describing which objects are present (classification), localizing exactly where each object is (detection), and assigning a label to each pixel in the image (semantic segmentation). We do research on all of these tasks, including investigating the role of context, noise, outliers, and image matching, and visualizing and understanding automatic models.
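The three tasks can be contrasted with a toy sketch. This is purely illustrative (not any actual model from our research): we assume a hypothetical network has already produced per-pixel class scores, and show how the three outputs differ only in how those scores are aggregated.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, H, W = 3, 8, 8
# Hypothetical per-pixel class scores from some upstream network.
scores = rng.random((num_classes, H, W))

# Semantic segmentation: a class label for every single pixel.
segmentation = scores.argmax(axis=0)                      # shape (H, W)

# Classification: a single label describing what is present in the image
# (here: the most frequent per-pixel class, as a crude image-level summary).
image_label = np.bincount(segmentation.ravel()).argmax()

# Detection: localize that class with a bounding box around its pixels.
ys, xs = np.where(segmentation == image_label)
box = (xs.min(), ys.min(), xs.max(), ys.max())            # (x0, y0, x1, y1)

print(segmentation.shape, image_label, box)
```

The aggregation choices here (majority vote, tight box around a mask) are stand-ins; real systems learn each mapping directly from annotated data, which is precisely why the annotation cost differs so much between the three tasks.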
A video is not merely a collection of static, independent images. Video includes motion, action, interaction, dynamics, causal effects, long-term behavior, and emotion. Our research includes investigating how to learn (deep) motion representations, exploit dynamics, classify actions, localize actions in space and time, recognize emotions automatically, estimate pose, and track objects.
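A minimal sketch of why video is more than a stack of frames: a moving object is equally present in each individual frame, but only the temporal difference between frames reveals the motion. The frame generator and the differencing step below are illustrative assumptions, not our method; real systems use optical flow or learned motion representations.

```python
import numpy as np

def make_frame(t, size=16):
    """Synthetic frame at time t: a bright square sliding to the right."""
    frame = np.zeros((size, size))
    frame[4:8, t:t + 4] = 1.0
    return frame

frames = [make_frame(t) for t in range(3)]

# Appearance cue: identical in every frame (16 bright pixels each time).
static_energy = frames[0].sum()

# Crude motion cue: absolute temporal difference between adjacent frames.
# Only the leading and trailing edges of the moving square light up.
motion = np.abs(frames[1] - frames[0])
motion_energy = motion.sum()

print(static_energy, motion_energy)
```

Any single frame looks the same up to a shift; the differencing exposes exactly the pixels that changed, which is the simplest instance of the dynamics that action classification and tracking build on.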