shipped

Multilingual Text-to-Image Retrieval

Search a 5,000-image gallery by text, voice, or image — in six languages.

PyTorch · M-CLIP · FastAPI · FAISS · Whisper · React

Three input modalities, one shared embedding space. You can type a query in any of the six languages the model was trained on (English, Arabic, Chinese, French, German, Russian), drop in an example image, or speak a query, which Whisper-small transcribes and feeds into the same retrieval pipeline.

The core is M-CLIP (XLM-RoBERTa-Large + ViT-B/32) with the bottom 16 of 24 XLM-R layers frozen, leaving ~110M trainable parameters. Retrieval runs through a FAISS flat inner-product index over the 5,000 image vectors, followed by an exact-cosine re-ranking pass. Training used InfoNCE with hard-negative mining at temperature 0.07.
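As a rough illustration of the training objective, here is a minimal text-to-image InfoNCE sketch in plain Python. It assumes in-batch negatives only (the hard-negative mining step is not shown), a one-directional loss, and L2-normalized embeddings; the function names and toy vectors are hypothetical and are not taken from the project's code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(text_embs, image_embs, temperature=0.07):
    """Mean text-to-image InfoNCE loss over one batch.

    Row i of text_embs and row i of image_embs form the positive pair;
    every other image in the batch acts as a negative (assumption:
    in-batch negatives only).
    """
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    losses = []
    for i, t in enumerate(texts):
        # Cosine similarity to every image, sharpened by the temperature.
        logits = [sum(a * b for a, b in zip(t, v)) / temperature for v in images]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of the matching image.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch: three captions and three images in a 2-D embedding space.
texts = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
images_aligned = [[0.9, 0.0], [0.0, 0.8], [-0.7, 0.1]]
images_shuffled = [images_aligned[1], images_aligned[2], images_aligned[0]]

aligned_loss = info_nce(texts, images_aligned)
shuffled_loss = info_nce(texts, images_shuffled)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

With a temperature as low as 0.07, small differences in cosine similarity turn into large differences in logits, so correctly paired batches score a loss near zero while mismatched ones are penalized heavily.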

It was evaluated on 15,012 cross-lingual queries using R@1, R@5, and mAP@10, reaching ~0.51 R@5 and ~0.37 mAP@10 overall. The interesting finding: scaling training data 5× (1K→5K images) did not lift overall R@1. Contrastive fine-tuning plateaus at epoch 1 at this scale, so the gains come from architectural choices, not more data.
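For readers unfamiliar with the metrics, the sketch below shows one common way to compute R@K and mAP@10 from ranked result lists. The exact conventions the project uses are not stated here, so this assumes R@K counts a query as a hit when any relevant image appears in the top K, and that AP@10 is normalized by min(number of relevant images, 10). All names and the toy data are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant image is in the top k results, else 0.0."""
    return 1.0 if any(i in relevant_ids for i in ranked_ids[:k]) else 0.0


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the top k results for a single query.

    Assumption: normalize by min(len(relevant_ids), k); other write-ups
    normalize by the number of hits instead.
    """
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids[:k], start=1):
        if image_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    denom = min(len(relevant_ids), k)
    return precision_sum / denom if denom else 0.0


def evaluate(runs, k_values=(1, 5), map_k=10):
    """Average the per-query metrics over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    report = {
        f"R@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n
        for k in k_values
    }
    report[f"mAP@{map_k}"] = (
        sum(average_precision_at_k(r, rel, map_k) for r, rel in runs) / n
    )
    return report


# Two toy queries: the first finds its image at rank 1, the second at rank 3.
runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"z"}),
]
report = evaluate(runs)
print(report)
```

In this toy run the second query misses at rank 1 but hits within the top 5, which is the same pattern that separates an R@1 score from a higher R@5 score.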

The write-up documents a real failure mode that most benchmarks hide: cross-lingual homograph collisions (short Azerbaijani queries colliding with English tokens). The React frontend ships a query-time heuristic warning for it. The backend is FastAPI; the frontend is React + Vite + TypeScript with a live voice waveform and a result lightbox.

GitHub ↗