This page is an interactive demo accompanying the Google AI blog post "LiT: adding language understanding to image models". Please refer to that post for a detailed explanation of how a LiT model works.
Below, you can choose an image from a selection and then write free-form text prompts that are matched to the image. Once you hit return on your keyboard or press the "compute" button, a text encoder implemented in TensorFlow.js computes embeddings for the provided text on your local device, and the similarity of these text embeddings to the image embedding is displayed.
The prompts can be used to classify an image into multiple categories, by listing each category individually with a prompt such as "an image of a X". But you can also probe the model interactively with more detailed prompts, comparing how the results change when small details in the text change.
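The matching described above can be sketched in plain JavaScript. This is an illustrative, self-contained example, not the demo's actual code: it assumes the image and text embeddings have already been computed (in the real demo the text embeddings come from the TF.js text encoder), and the function and variable names, as well as the temperature value, are made up for illustration. Similarities are cosine similarities between L2-normalized embeddings, turned into per-prompt probabilities with a softmax, as in contrastive models like LiT.

```javascript
// L2-normalize a vector so dot products become cosine similarities.
function l2Normalize(v) {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / norm);
}

// Numerically stable softmax over an array of scores.
function softmax(xs) {
  const m = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - m));
  const sum = exps.reduce((s, x) => s + x, 0);
  return exps.map((x) => x / sum);
}

// Compare one image embedding against each text-prompt embedding:
// cosine similarity, scaled by a temperature, then softmax.
function matchPrompts(imageEmbedding, textEmbeddings, temperature = 100) {
  const img = l2Normalize(imageEmbedding);
  const sims = textEmbeddings.map((t) => {
    const txt = l2Normalize(t);
    return img.reduce((s, x, i) => s + x * txt[i], 0);
  });
  return softmax(sims.map((s) => s * temperature));
}

// Toy example with 2-D embeddings: the first prompt aligns with the image,
// so it receives almost all of the probability mass.
const probs = matchPrompts([1, 0], [[0.9, 0.1], [0, 1]]);
console.log(probs[0] > probs[1]); // true
```

Note that, because of the softmax, the displayed scores are always relative to the set of prompts you typed; this is also why the model "picks" an answer even when every prompt is a poor match.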
Please use this demo responsibly. The models will always compare the image to the prompts you provide, so it is trivial to construct situations where the model must pick from a set of bad options.
Note: The models available in this interactive demo are not those from the paper. We had to train much smaller text towers and tokenizers to avoid overloading your browser. Please see our GitHub repository for the models from the paper pre-trained on public datasets. Multilingual models coming soon.