Abstract: As one of the most fundamental techniques in multi-modal learning, cross-modal matching aims to project various sensory modalities into a shared feature space. To achieve this, massive and ...