visual-language-learning
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V-level capabilities and beyond.
Otter: A Multi-Modal Model with In-Context Instruction Tuning