Published on

JSON Embeddings & Reranking

Authors

Introduction

In my recent talk about reranking, I encountered an intriguing question from an attendee working at a large e-commerce company. They were facing challenges in extracting useful metadata from product descriptions for querying and filtering with search engines. Here's a detailed look at the problem and the innovative solution we explored.

The Problem

The company has a vast number of products and needs to extract metadata from text descriptions to facilitate efficient querying and filtering. The initial approach involved using an expensive extraction function, which looked something like this:

s1 = "Title:[SIM-free] Samsung Galaxy S24 Ultra (SM-S9280) 12+ 256GB | Global version | Multi-language | Japanese support | Smartphone body | Gray Titanium Gray [Parallel import product]\nDescription: Model number: SM-S9280 - No eSIM, supported frequencies are:\n■4G FDD LTE B1(2100), B2(1900), B3(1800), B4(AWS), B5(850), B7(2600), B8(900), B12(700), B13(700), B18 (800), B19(800), B20(800), B25(1900), B26(850), B28(700), B66(AWS-3) ■4G TDD LTE B34(2010), B38(2600), B39 (1900), B40(2300), B41(2500) ■5G FDD Sub6 N1(2100), N2(1900), N3(1800), N5(850), N7(2600), N8(900), N12(700) ), N20(800), N25(1900), N28(700), N66(AWS-3) ■5G TDD Sub6 N38(2600), N40(2300), N41(2500), N77(3700), N78(3500) ), N79(4500)\nWallet payment cannot be used for parallel import items. [SIM-free] Samsung Galaxy S24 Ultra (SM-S9280) 12+ 256GB | Global version | Multi-language | Japanese support | Smartphone body | Titanium Gray [Parallel import product]"

s1_attributes = expensive_extraction(s1)
print(s1_attributes)
# {
#     "item_name": "Samsung Galaxy S24 Ultra",
#     "color": "Titanium Gray",
#     "model_no": "SM-S9280",
#     "storage_gb": 256,
# }

The extracted attributes were then used in a vector search engine to combine user queries with filtering using boolean operations (e.g., storage_gb>128 AND color="Titanium Gray"). However, the extraction function was costly, and many products were duplicates in terms of features.

Initial Solution and Its Limitations

To address the cost issue, they tried creating embeddings for new items and comparing them with existing embeddings using vector search. A score threshold (e.g., >.90 cosine similarity) was set to decide if a product needed re-indexing. However, this approach failed as most scores were in the .98-.99 range, making differentiation difficult.

Vector search compares two points (a query and a document) using cosine distance between embeddings. While fast, it doesn't capture deeper token interactions because most embedding models average out individual token embeddings in the last layer (pooling).

Introducing Cross Encoder Models

Cross encoder models, using a transformer architecture, compare two sentences by internally comparing tokens, capturing deeper interactions. The revised approach involved:

  1. Fetching the top 10 matching items from the Vector DB.
  2. Comparing the new item description with these items using a cross encoder model, which provided a score for each item.

However, this still didn't solve the problem as the scores remained close:

0.9697281305754024 # color changed
0.9879462204837635 # sale 20% OFF
0.9957187454592983 # I reordered the words while keeping the meaning same.

The Breakthrough: JSON Embeddings

Realizing that the embeddings might not know what to compare due to noise, I decided to compare the embeddings of json.dumps(s1_attributes) to the new product description. This approach yielded significant improvements:

0.8413048148640399 # Color changed. So huge drop
0.9568766292846294 # 20% sale. Not much diff
0.9764007056485348 # Reorder words. Very little diff

The score dropped heavily (<.85) when any attribute was affected, indicating a successful differentiation.

Experimenting with Different Formats

I also experimented with different separators to concatenate key-value pairs:

" ".join("{k}: '{v}'" for k, v in s1_attributes.items()) # approach 1
" | ".join("{k}: {v}" for k, v in s1_attributes.items()) # approach 2
", ".join("{k}: {v}" for k, v in s1_attributes.items()) # approach 3

None of these worked as well as the JSON format. The underlying Roberta language model seemed to understand the intent better with the familiar JSON syntax.

Future aspects:

I've found some datasets to evaluate this approach properly and I'm working on the same. I'll be finetuning the cross encoder so it shows larger drops when the attributes changes.

Conclusion

Using JSON embeddings for product metadata extraction and indexing proved to be a game-changer. It significantly improved the accuracy of identifying duplicate products and reduced the computational cost. This approach highlights the importance of understanding the limitations of vector search and leveraging advanced models like cross encoders to capture deeper interactions.

By applying these principles, you can enhance the efficiency and accuracy of your product metadata extraction processes, ultimately improving the user experience in e-commerce platforms.