All organisms make temporal predictions, and their evolutionary fitness generally scales with the accuracy of these predictions. In the context of visual perception, observer motion and continuous deformations of objects and textures structure the dynamics of visual signals, allowing future inputs to be partially predicted from past ones. Here, we propose a self-supervised representation-learning framework that reveals and exploits these regularities of natural videos to compute accurate predictions. The architecture is motivated by the Fourier shift theorem and its group-theoretic generalization, and is optimized for next-frame prediction. Through controlled experiments, we demonstrate that this approach can discover representations of simple transformation groups acting on the data. When trained on natural video datasets, our framework achieves better prediction performance than traditional motion compensation and conventional deep networks, while remaining interpretable and fast. Furthermore, we implement this framework using normalized simple and direction-selective complex cell-like units, the elements commonly used to describe the computations of primate V1 neurons. These results highlight the potential of a principled video processing framework for elucidating how the visual system transforms sensory inputs into representations suitable for temporal prediction.
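To build intuition for the Fourier shift motivation, the following is a minimal, hypothetical numpy sketch (not the learned architecture described above): for a globally translating image, the shift theorem says each Fourier coefficient is multiplied by a frequency-dependent phase factor, so the phase advance observed between two frames can be reapplied to extrapolate the next one. The function name `predict_next_frame` and the toy data are illustrative assumptions.

```python
import numpy as np

def predict_next_frame(frame_prev, frame_curr):
    """Predict the next frame of a globally translating signal by
    advancing Fourier phases, per the Fourier shift theorem: a shift
    by d multiplies coefficient k by exp(-i 2*pi*k*d/N), so the phase
    change between two frames can be applied once more to extrapolate."""
    F_prev = np.fft.fft2(frame_prev)
    F_curr = np.fft.fft2(frame_curr)
    cross = F_curr * np.conj(F_prev)
    # Unit-magnitude per-frequency phase advance observed from prev -> curr.
    phase_step = cross / (np.abs(cross) + 1e-8)
    # Keep current amplitudes, extrapolate phases one step further.
    F_pred = F_curr * phase_step
    return np.real(np.fft.ifft2(F_pred))

# Toy usage: a random texture translating by one pixel per frame.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))
x1 = np.roll(x0, shift=1, axis=1)   # frame at t = 1
x2 = np.roll(x0, shift=2, axis=1)   # ground-truth frame at t = 2
x2_hat = predict_next_frame(x0, x1)
print(np.max(np.abs(x2_hat - x2)))  # near zero for a pure global translation
```

This sketch only handles a single global translation; the framework summarized above instead learns local, group-structured representations whose phase-like components are advanced to predict natural video frames.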