BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language

Limitations of VLP models before BLIP

Model perspective
- Most existing methods use either an encoder-based model or an encoder-decoder model
- Encoder-based models (CLIP, ALBEF): poorly suited to generation-based tasks such as image captioning
- Encoder-decoder models (VL-T5, SimVLM): poorly suited to understanding-based tasks such as image-text retrieval
Data perspective
- Most models are trained on image-text pairs crawled from the web, which contain a lot of noise

Contribution

1. MED(Multimodal Mixture of Encoder-Decoder)

A new model architecture for multi-task pre-training and transfer learning
Within BLIP, it can operate as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder
It is pre-trained jointly on three vision-language objectives:
- image-text contrastive learning
- image-text matching
- image-conditioned language modeling
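
To make the first objective concrete, here is a minimal sketch of a symmetric image-text contrastive loss over a batch of paired embeddings. It is a simplified illustration, not BLIP's actual implementation: the function name `itc_loss`, the default temperature of 0.07, and the use of plain Python lists are all assumptions made for this example, and any refinements the paper applies on top of the basic loss are left out.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i]).

    The temperature value is an illustrative assumption.
    """
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    n = len(images)

    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: each image should pick out its own text among all texts.
    i2t = -sum(_log_softmax(sim[k])[k] for k in range(n)) / n

    # Text-to-image: same idea over the columns of the similarity matrix.
    columns = [[sim[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(_log_softmax(columns[k])[k] for k in range(n)) / n

    return (i2t + t2i) / 2
```

Matched pairs are pulled together and mismatched pairs pushed apart, so a batch whose embeddings line up with their partners gets a lower loss than the same batch with the texts shuffled.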

2. CapFilt(Captioning and Filtering)

A new bootstrapping method for learning from noisy image-text pairs
The pre-trained MED is fine-tuned into two modules:
- Captioner: generates a synthetic caption for a given web image
- Filter: removes noisy captions from both the original web text and the synthetic captions
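
The data flow of the two modules can be sketched as below. This is only an outline under stated assumptions: `capfilt`, `captioner`, and `is_match` are hypothetical names, the two callables stand in for the fine-tuned captioner and filter models, and images are represented by plain strings to keep the example self-contained.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image identifier, caption text)


def capfilt(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],       # hypothetical: image -> synthetic caption
    is_match: Callable[[str, str], bool],  # hypothetical: does the text match the image?
) -> List[Pair]:
    """Return the image-text pairs that survive filtering."""
    kept: List[Pair] = []
    for image, web_text in web_pairs:
        # Each image has two candidate captions: the original web text
        # and a synthetic caption produced by the captioner.
        candidates = [web_text, captioner(image)]
        for text in candidates:
            # The filter drops any candidate it judges to be noisy.
            if is_match(image, text):
                kept.append((image, text))
    return kept
```

With a toy captioner that returns `"a photo of <image>"` and a toy filter that checks whether the image name appears in the text, an irrelevant web caption such as `"buy cheap shoes now"` is dropped while the synthetic caption for the same image is kept.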

Use of knowledge distillation


Use of data augmentation
