Summary
What are the properties of a model that translates image-based goals to a functional latent representation? In this lecture, I cover recent topics in representation learning for image-based goals, the limitations of many existing methods, and how to understand goals and tasks as distributions instead of fixed points. I explain how goal-conditioned RL allows us to instruct agents by specifying desired outcomes, particularly in image space. Achieving this requires learning effective representations, because we aim to minimize the distance between the current image and the goal image. I highlight that not all image features are relevant to the task: in a table setting, for example, the precise angle of a plate does not matter. This leads to a discussion of the challenges of learning good representations, especially from raw pixel data, which can be noisy and uninformative. To address this, I explore using Variational Autoencoders (VAEs) to learn a latent representation that captures essential pose and task-relevant information. We discuss the practical aspects of using VAEs in RL, including how to compute rewards from distances in latent space and the importance of training robust representations. These concepts are then connected to recent foundation models for robotics, and we discuss the ingredients for improving representation learning in large models.
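To make the idea of a latent-distance reward concrete, here is a minimal sketch in Python. It is not the implementation used in the lecture. The encoder is a hypothetical stand-in: a fixed random linear map plays the role of the mean head of a trained VAE encoder. The names `encode_mean` and `latent_reward`, the image shape, and the latent size are all assumptions chosen for illustration. The reward is the negative Euclidean distance between the encoded current image and the encoded goal image, which is one common choice among several.

```python
import numpy as np

LATENT_DIM = 8            # assumed latent size, for illustration only
IMAGE_SHAPE = (16, 16, 3) # assumed image shape, for illustration only

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained VAE encoder: a fixed random linear map.
# In practice this would be the mean output of the learned encoder network.
W = rng.normal(scale=0.1, size=(LATENT_DIM, int(np.prod(IMAGE_SHAPE))))


def encode_mean(image: np.ndarray) -> np.ndarray:
    """Map an image to a latent vector (placeholder for the VAE encoder mean)."""
    return W @ image.reshape(-1)


def latent_reward(current_image: np.ndarray, goal_image: np.ndarray) -> float:
    """Reward is the negative Euclidean distance between latent codes."""
    z = encode_mean(current_image)
    z_goal = encode_mean(goal_image)
    return -float(np.linalg.norm(z - z_goal))


# Synthetic images: the goal, a slightly perturbed copy, and an unrelated image.
goal = rng.random(IMAGE_SHAPE)
near = np.clip(goal + 0.01 * rng.normal(size=IMAGE_SHAPE), 0.0, 1.0)
far = rng.random(IMAGE_SHAPE)

r_goal = latent_reward(goal, goal)
r_near = latent_reward(near, goal)
r_far = latent_reward(far, goal)
```

With a learned encoder, the hope is that this distance ignores task-irrelevant pixel variation (such as the plate angle mentioned above) while still separating states that differ in task-relevant ways. The random projection here does not have that property; it only shows where the encoder plugs into the reward computation.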
