Autonomous Driving (AD) has the potential to revolutionize mobility and bring lasting benefits to society. It is thus at the forefront of AI research and has attracted the attention of both academia and industry. From a Computer Vision perspective, the most relevant task in AD is Perception, i.e., understanding the world around the car. After discussions with both academic researchers and industrial practitioners, we feel that the temporal and multi-modal aspects of perception have been overlooked. Robust tracking and, more importantly, prediction of movement, for both vehicles and pedestrians, are critical for AD. This need is particularly acute in dense urban environments, which are heterogeneous multi-agent systems consisting of diverse traffic participants with a great variety of shapes, dynamics, behaviors, and intents.