DEV Community

Opensolr

I made search engines understand emojis (and it's weirdly useful)

Been working on hybrid search (lexical + vector) for a while and accidentally discovered something fun: when you use good embeddings, you can literally search with emojis.

Not as a gimmick - it actually works because the embedding model (BGE-M3, 1024 dimensions) learned semantic relationships between concepts and their emoji representations.

Try it yourself

These are live search engines running on real e-commerce data:

Type 🔑 (key emoji) → get actual keys: https://search.opensolr.com/dedeman?q=🔑

Type 🚲 (bike) → get bicycles and accessories: https://search.opensolr.com/dedeman?q=🚲

Type 🖨️📄 (printer + paper) → get printer supplies: https://search.opensolr.com/b2b?q=🖨️📄

This one's my favorite - type "cute domestic pet earrings" on a jewelry store: https://search.opensolr.com/rueb?q=cute+domestic+pet+earrings

(it finds cat and dog earrings even though the product titles are in a completely different language)

How it actually works

The pipeline is:

  1. Crawl website → extract text with Trafilatura
  2. Generate 1024D embeddings via BGE-M3
  3. Store in Solr with both text + vectors
  4. At query time: run lexical search + KNN vector search
  5. Combine scores (hybrid approach)
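For illustration, step 3 could look roughly like this in Python. This is a sketch, not the actual indexing code: the field names (uri, title, description, text, embeddings) are borrowed from the Solr query shown later in this post, the function names are made up, and the real schema may well differ.

```python
import json

EMBEDDING_DIM = 1024  # BGE-M3 dense vector size mentioned in this post


def make_solr_doc(uri, title, description, text, embedding):
    """Bundle extracted text and its embedding into one document (assumed schema)."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}"
        )
    return {
        "uri": uri,
        "title": title,
        "description": description,
        "text": text,
        # Dense vector field: in Solr's JSON update format this is a plain list of floats.
        "embeddings": [float(x) for x in embedding],
    }


def to_update_payload(docs):
    """Serialize documents as a JSON array for Solr's JSON update handler."""
    return json.dumps(docs, ensure_ascii=False)


# Placeholder vector of zeros, NOT real model output.
example_doc = make_solr_doc(
    uri="https://example.com/product/1",
    title="Example product",
    description="Example description",
    text="Example page text",
    embedding=[0.0] * EMBEDDING_DIM,
)
example_payload = to_update_payload([example_doc])
```

Checking the vector length before indexing is cheap and catches the case where a different embedding model sneaks into the pipeline.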

The emoji thing works because BGE-M3 was trained on multilingual text, where emojis presumably show up often enough next to the words they stand for. The model learned that 🔑 and "key" and "Schlüssel" (German) and "cheie" (Romanian) are all semantically close.

So when someone searches 🚲, the embedding is close to "bicycle", "bike", "Fahrrad", "bicicletă", etc.
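"Close" here means high cosine similarity between vectors. A toy sketch of the idea follows. The 3-dimensional vectors are made up purely for illustration (real BGE-M3 vectors have 1024 dimensions and look nothing like this), and the names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, NOT real embeddings.
TOY_DOCS = {
    "bicycle": [0.90, 0.10, 0.00],
    "Fahrrad": [0.85, 0.15, 0.05],
    "printer paper": [0.05, 0.20, 0.95],
}

# Stand-in for the embedding of the bike emoji query (also made up).
TOY_QUERY = [0.88, 0.12, 0.02]


def rank(query_vec, docs):
    """Return document names ordered from most to least similar to the query."""
    return sorted(
        docs,
        key=lambda name: cosine_similarity(query_vec, docs[name]),
        reverse=True,
    )


ranking = rank(TOY_QUERY, TOY_DOCS)
```

With these toy numbers, both bike terms land far ahead of "printer paper", no matter which language the word is in. That is the whole trick: the query and the documents only have to end up near each other in vector space.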

The weird part

Cross-language search just... works. The Romanian e-commerce site has products in Romanian, but you can search in English or with emojis and it finds relevant stuff. No translation layer, no language detection preprocessing - the embeddings handle it.

Same with conceptual queries. "things to wear around neck" finds necklaces, pendants, chains - even though no product has "things to wear around neck" in the title.

Stack details for the curious

  • Embeddings: BGE-M3 (BAAI), 1024 dimensions
  • Inference: Running on RTX 4000 Ada, ~2-5ms per query
  • Search: Solr 9.6 with dense vector support
  • Crawling: Custom PHP + Python (Playwright for JS-heavy sites, Trafilatura for extraction)
  • Extra features: VADER for sentiment, langid for language detection, custom price extraction

Query latency is ~40-50ms total including embedding generation.

Hybrid vs pure vector

Pure vector search is cool but has issues:

  • Exact matches sometimes rank lower than "similar" results
  • Product codes/SKUs get weird results
  • Users expect "nike shoes" to prioritize exact Nike matches

Hybrid fixes this. Lexical handles exact matches, vectors handle the "I don't know the exact word but I know what I want" queries.
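A toy calculation shows the effect. It uses the same combination formula as the Solr function query in this post (vector score plus lexical / (lexical + 6), both weighted 1). The scores themselves are made up for illustration, and the function name is hypothetical.

```python
def hybrid_score(vector_score, lexical_score, damping=6.0,
                 vector_weight=1.0, lexical_weight=1.0):
    """Combine a vector score with a lexical score squashed into [0, 1)."""
    # lexical / (lexical + damping) is 0 for no match and approaches 1
    # as the raw lexical score grows, so it cannot swamp the vector score.
    squashed_lexical = lexical_score / (lexical_score + damping)
    return vector_weight * vector_score + lexical_weight * squashed_lexical


# Made-up scores for two hypothetical products and the query "nike shoes".
exact_match = {"vector": 0.80, "lexical": 120.0}   # title contains the query terms
similar_only = {"vector": 0.90, "lexical": 0.0}    # semantically close, no term match

exact_total = hybrid_score(exact_match["vector"], exact_match["lexical"])
similar_total = hybrid_score(similar_only["vector"], similar_only["lexical"])
```

On vector score alone the "similar" product wins (0.90 vs 0.80). With the lexical part added, the exact match gets almost a full extra point and moves ahead, while the product with no term match keeps just its vector score.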

You can see the Solr query in the debug view (bottom-right button), including the actual vector query functions:

vectorQuery = {!knn f=embeddings topK=250}[-0.032, 0.009, -0.049, ...]

lexicalQuery = {!edismax qf="title^550 description^450 uri^1 text^0.1" 
                         pf="title^1100 description^900" ...}

q = {!func}sum(
      product(1, query($vectorQuery)), 
      product(1, div(query($lexicalQuery), sum(query($lexicalQuery), 6)))
    )
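If you want to build a request of this shape from a client, a minimal sketch could look like the following. The function name is made up, and passing the user's text through a separate userQuery parameter (v=$userQuery) is a choice made for this sketch, not something the post confirms. The edismax options elided above ("...") are left out rather than guessed.

```python
def build_hybrid_params(query_vector, user_query, top_k=250, damping=6):
    """Assemble Solr request parameters for a knn + edismax hybrid query (sketch)."""
    vector_literal = "[" + ", ".join(repr(float(x)) for x in query_vector) + "]"
    return {
        # Raw user text lives in its own parameter and is dereferenced below,
        # so it never has to be spliced into the local-params string.
        "userQuery": user_query,
        "vectorQuery": "{!knn f=embeddings topK=" + str(top_k) + "}" + vector_literal,
        "lexicalQuery": (
            '{!edismax qf="title^550 description^450 uri^1 text^0.1" '
            'pf="title^1100 description^900" v=$userQuery}'
        ),
        # Same combination as in the post: vector + lexical / (lexical + damping).
        "q": (
            "{!func}sum("
            "product(1,query($vectorQuery)),"
            "product(1,div(query($lexicalQuery),"
            "sum(query($lexicalQuery)," + str(damping) + ")))"
            ")"
        ),
    }


# Three-element vector only to keep the example short; a real one has 1024 values.
params = build_hybrid_params([-0.032, 0.009, -0.049], "nike shoes")
```

The resulting dict can be sent as form parameters to the collection's select handler with any HTTP client. POST is the safer option here, because a 1024-value vector makes for a very long URL.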

Bonus: AI-generated hints

Added an experimental feature where the search can explain results. Search "measure 🔥" on a technical documentation site and it tells you which specific device to use for measuring temperature/fire:

https://search.opensolr.com/fluke?q=measure+🔥

It pulls context from indexed PDFs and generates a recommendation. It uses a local LLM (running on the same GPU).

Anyway, thought some of you might find the emoji thing interesting. The cross-language aspect was unexpected - I didn't build it for that, it just emerged from using multilingual embeddings.

Happy to answer questions about the setup or hybrid search in general.
