vision-and-language-pre-training
A Chinese version of CLIP that supports Chinese cross-modal retrieval and representation generation.
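
As a rough illustration of what cross-modal retrieval means here, the sketch below ranks image embeddings against a text embedding by cosine similarity. It does not use this project's API: the `retrieve` and `l2_normalize` helpers are hypothetical, and the vectors are hand-written placeholders standing in for the representations a CLIP-style text and image encoder would produce.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve(text_emb, image_embs, top_k=3):
    """Return (indices, scores) of the top_k images most similar to the text.

    text_emb:   shape (d,)   one text embedding
    image_embs: shape (n, d) n image embeddings
    """
    text_emb = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    image_embs = l2_normalize(np.asarray(image_embs, dtype=np.float64))
    # After normalization, the dot product equals cosine similarity.
    scores = image_embs @ text_emb
    order = np.argsort(-scores)[:top_k]
    return order.tolist(), scores[order].tolist()


# Placeholder embeddings (assumption: real ones would come from the encoders).
image_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
text_emb = np.array([0.9, 0.1, 0.0])

indices, scores = retrieve(text_emb, image_embs, top_k=2)
print(indices, scores)
```

The same function covers image-to-text retrieval if the roles are swapped, with one image embedding as the query and a matrix of text embeddings as the candidates.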