Giphy and Tenor are the two most popular places we go to for our GIFs. Users interact with them mostly through search, which works like any other search engine: the creator of the GIF adds keywords and tags to it, and those are what the search engine picks up.
Like GIFs, memes are a succinct form of communication, and by their very nature they are multimodal: they weave together visual and textual elements to convey humor, satire, or commentary. This inherent complexity is a challenge for conventional search engines, which are optimized for either text or image queries but not for the seamless integration of both. Hence, in this project I tried to build a meme search engine using the multimodal capabilities of OpenAI's CLIP and the Pinecone vector database. The goal was to take a text query as input and find the most semantically related meme images in our dataset.
The dataset used in this project is a subset of the ImgFlip575K Memes Dataset.
Here is a demo of the app in action:
Enter CLIP!
CLIP (Contrastive Language–Image Pre-training) is an OpenAI model. Trained on a diverse internet dataset, CLIP generates embeddings — a form of high-dimensional vector representation — for both images and text in a shared space. This allows for performing tasks like semantic search where the inputs can be in different modalities. For more detailed information on CLIP, visit the official GitHub repository.
To leverage GPU acceleration with PyTorch and CLIP, I made use of CUDA. CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). Because CUDA is proprietary to NVIDIA, it runs only on NVIDIA hardware, and virtually all current NVIDIA graphics cards support it.
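Before loading the model, you can ask PyTorch whether a CUDA-capable GPU is actually visible; clip.load() and the image tensors can then be placed on that device. A minimal sketch:

import torch

# Pick the GPU if CUDA is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")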
Note: All the installation and setup steps are detailed in the README of my project. The code snippets here may be incomplete and representative; refer to my GitHub repo for the complete project.
Integrating CLIP into our project, we began by generating embeddings for our meme dataset:
import clip
import torch
from PIL import Image

# Load the CLIP model and its image preprocessing pipeline
model, preprocess = clip.load("ViT-B/32")

def generate_embedding(image_path):
    # Preprocess the image and add a batch dimension
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    # Encode the image into a 512-dimensional embedding; no gradients needed
    with torch.no_grad():
        embedding = model.encode_image(image)
    return embedding
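The returned embedding is a 1×512 tensor; for Pinecone we will flatten it into a plain list of 512 floats. A quick illustrative usage (the file path here is hypothetical):

# Illustrative usage: embed one meme and flatten it for Pinecone
embedding = generate_embedding("memes/drake-hotline-bling.jpg")  # hypothetical path
vector = embedding.squeeze(0).tolist()
print(len(vector))  # 512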
Vector Search: What is Pinecone?
With our memes translated into CLIP embeddings, the next challenge was to search among them efficiently. Pinecone comes to the rescue.
Pinecone’s tagline is “Long-term memory for AI”. Simply put, it is a vector database. What’s that? This is how Roie from Pinecone explains it: “Vector databases are purpose-built databases that are specialized to tackle the problems that arise when managing vector embeddings in production scenarios. For that reason, they offer significant advantages over traditional scalar-based databases and standalone vector indexes.” Read this excellent blog by Roie for a deep dive.
Vector databases like Pinecone serve many use cases, such as semantic search, generative question answering, generative chatbot agents, and more.
An index is the highest-level organizational unit of vector data in Pinecone. It accepts and stores vectors, serves queries over the vectors it contains, and does other vector operations over its contents. Two types of indices are currently offered — serverless and pod-based.
The “Starter plan” offers one free pod-based starter index, capable of supporting up to 100,000 vectors. A starter-index can be created as follows:
from pinecone import Pinecone, PodSpec

pc = Pinecone(api_key="YOUR_API_KEY")

pc.create_index(
    name="starter-index",
    dimension=512,
    metric="cosine",
    spec=PodSpec(environment="gcp-starter")
)
We set the dimension to 512 because that is the dimensionality of the embeddings generated by CLIP's ViT-B/32 model. After indexing our meme embeddings in Pinecone, we can perform fast and accurate similarity searches using vector queries generated from text prompts.
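The upload step itself is not shown above; conceptually, it amounts to upserting each meme's embedding into the index under a unique ID. Here is a rough sketch, assuming the meme images sit in a local folder (the folder name and ID scheme are illustrative):

import os

# Connect to the index created above
index = pc.Index("starter-index")

meme_dir = "memes"  # hypothetical local folder containing the meme images
vectors = []
for filename in os.listdir(meme_dir):
    embedding = generate_embedding(os.path.join(meme_dir, filename))
    # Flatten the 1x512 tensor into a plain list of floats for Pinecone
    values = embedding.squeeze(0).tolist()
    # Use the filename as the vector ID so results map back to images
    vectors.append({"id": filename, "values": values})

# Upsert in batches to stay well within request size limits
for i in range(0, len(vectors), 100):
    index.upsert(vectors=vectors[i:i + 100])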
Crafting the Search Experience
With the infrastructure in place, we can now search! Given a textual query, we generate its embedding using CLIP, query our Pinecone index, and retrieve the most semantically related memes:
def search_meme(query):
    # Embed the text query with CLIP (helper sketched below)
    query_embedding = generate_embedding_from_text(query)
    # Query Pinecone with the flat list of floats and return the closest match
    results = index.query(vector=query_embedding, top_k=1)
    return results
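The snippet above relies on a generate_embedding_from_text helper that isn't shown here; one possible sketch, reusing the CLIP model loaded earlier and returning a flat list of floats (the shape Pinecone's query expects), looks like this:

def generate_embedding_from_text(query):
    # Tokenize the query and encode it with CLIP's text encoder
    tokens = clip.tokenize([query])
    with torch.no_grad():
        text_embedding = model.encode_text(tokens)
    # Flatten the 1x512 tensor to a plain list of 512 floats
    return text_embedding.squeeze(0).tolist()

# Illustrative usage: the ID of the top match points back to a meme image
results = search_meme("sad cat in the rain")
print(results.matches[0].id)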
Conclusion
This was a fun project, and I got to learn how to harness the power of Pinecone. I stored only 650 images in the index, which I agree is very small, but the search was nonetheless very smooth, with virtually no perceptible latency.
The code snippets shared here represent just a fraction of the project’s scope. Once again, I invite you to explore the full repository and consider the possibilities that these technologies hold for your own projects.