How to build an end-to-end scalable visual search system with AI Computer Vision and AWS

by Nam Hoai

With the rise of e-commerce and online retail, visual search is a fast-growing trend, because these businesses are largely driven by visual content. In this article, I will share the visual search project that was built in 2018 for my Japanese partner.

What are the problems?

My partner is one of the biggest retailers of toys, clothing, and baby products in Japan, with many stores all over the world. On holidays, long checkout queues form in many of those stores. To solve this problem, they decided to open a new kind of store like Amazon Go: no queues, no checkout, just walk out of the store. In short, this not only improves their customers’ buying experience but also supports revenue growth.

Why do we need Visual Search?


My partner has many providers, each distributing different types of products. The product range is updated frequently: weekly, monthly, or yearly. So the visual search system needs to be updated at the same pace and scale as the products.

What is the core technology here?


When I started this project in 2018, Triplet Loss had already proved effective in Face Recognition. With that starting point, we decided to use Triplet Loss as our core visual search technology, because our problem is quite similar to Face Recognition. To get the highest accuracy and keep the system scalable, we classify products into categories and subcategories, because Triplet Loss performs best when all the samples come from the same domain. In Face Recognition, for example, every image is a face. In our case, toys from a similar domain are trained together in one AI model: the Lego toys are trained together, the figures are trained together, and so on.

One of the biggest advantages of this approach is that when new products are added, we don’t need to retrain the whole model or rerun the whole process, which reduces cost. Training an AI model is normally very costly and time-consuming.
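To make the idea concrete, here is a minimal sketch of the Triplet Loss for a single triplet. It is not the project's training code: the margin of 0.2, the plain (non-squared) Euclidean distance, and the toy 2-D embeddings are all assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. The margin value is an assumption.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings, made up for illustration only.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]        # same product, different photo
easy_negative = [1.0, 0.0]   # different product, already far away
hard_negative = [0.2, 0.0]   # different product, still too close

print(triplet_loss(anchor, positive, easy_negative))  # 0.0
print(triplet_loss(anchor, positive, hard_negative))  # positive loss, roughly 0.1
```

The easy negative contributes nothing to training, while the hard negative still produces a loss. This is why the choice of triplets matters so much for this kind of model.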

The architecture overview

The architecture has two main components: Serving and Training. The Training component is built to serve Admins, Operators, Providers, and Developers, each with different purposes. For any Machine Learning system in production, one of the most challenging parts is making the system run automatically with minimal human intervention. The Serving component hosts all the APIs needed by our web app and mobile app.
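As a rough illustration of how the two components could divide the work, the sketch below keeps one index per product category: the embedding model comes from the Training side, and the Serving side only embeds and compares images. The class `CategoryIndex`, its methods, and `toy_embed` are hypothetical names invented for this example, and `toy_embed` is a stand-in for a real trained model.

```python
import math


def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class CategoryIndex:
    """Hypothetical per-category index: one frozen embedding model plus a gallery."""

    def __init__(self, embed):
        self.embed = embed   # model trained once for this category
        self.gallery = {}    # product_id -> list of stored embeddings

    def add_product(self, product_id, images):
        # Enrolling a product only runs inference; no model weights change.
        embeddings = [self.embed(img) for img in images]
        self.gallery.setdefault(product_id, []).extend(embeddings)

    def search(self, image):
        # Return (product_id, distance) of the nearest stored embedding,
        # or (None, None) when the gallery is empty.
        query = self.embed(image)
        candidates = (
            (l2(query, emb), pid)
            for pid, embs in self.gallery.items()
            for emb in embs
        )
        distance, product_id = min(candidates, default=(None, None))
        return product_id, distance


def toy_embed(image):
    """Stand-in for a trained model: scales fake 2-pixel 'images' to [0, 1]."""
    return [value / 255 for value in image]


lego = CategoryIndex(toy_embed)
lego.add_product("lego-001", [[255, 0], [250, 5]])
lego.add_product("lego-002", [[0, 255]])  # a new product, added without retraining

product_id, distance = lego.search([252, 3])
print(product_id)  # lego-001
```

A real system would replace the linear scan with an approximate nearest-neighbor index, but the split stays the same: training produces the model, and new products only touch the gallery.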

Triplet Loss training strategy

Our strategy to get the best model is as follows. First, we cluster the categories so that categories with similar types of samples are grouped together. For each anchor image, we pick the one hardest positive and the one hardest negative in the image batch. We keep multiple anchors for each category and compute their centroid. To reduce computation cost, we first match a query against the anchor centroids using two thresholds, a lower threshold TL and a higher threshold TH: if the distance is lower than TL, we decide immediately that the product exists; if it is higher than TH, we decide immediately that it does not exist.

In the end, our model reaches very good accuracy: over 99 percent on trained data and 97 percent on untrained data.
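The mining and matching steps can be sketched roughly as below. This is an illustration under assumptions, not the production code: the threshold values, the toy embeddings, and the "undecided" outcome for distances between TL and TH are all made up here, since the article does not say what happens in that range.

```python
import math


def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hardest_triplet(anchor_idx, embeddings, labels):
    """Pick the hardest positive and hardest negative for one anchor in a batch.

    Hardest positive: the farthest sample with the same label.
    Hardest negative: the closest sample with a different label.
    The batch must contain at least one of each for this anchor.
    """
    anchor = embeddings[anchor_idx]
    same = labels[anchor_idx]
    positives = [i for i, y in enumerate(labels) if y == same and i != anchor_idx]
    negatives = [i for i, y in enumerate(labels) if y != same]
    hardest_pos = max(positives, key=lambda i: l2(anchor, embeddings[i]))
    hardest_neg = min(negatives, key=lambda i: l2(anchor, embeddings[i]))
    return hardest_pos, hardest_neg


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def decide(query, anchors, t_low, t_high):
    """Two-threshold decision against the centroid of a category's anchors."""
    distance = l2(query, centroid(anchors))
    if distance < t_low:
        return "exists"
    if distance > t_high:
        return "not_exists"
    # Assumption: the in-between case needs a finer check, for example
    # comparing against each individual anchor. The article does not specify it.
    return "undecided"


# Toy batch: three samples of product "a", two of product "b".
batch = [[0.0, 0.0], [0.1, 0.0], [0.4, 0.0], [0.5, 0.0], [2.0, 0.0]]
labels = ["a", "a", "a", "b", "b"]
print(hardest_triplet(0, batch, labels))  # (2, 3)

anchors = [[0.0, 0.0], [0.2, 0.0]]  # centroid is [0.1, 0.0]
T_LOW, T_HIGH = 0.3, 1.0            # hypothetical threshold values
print(decide([0.2, 0.0], anchors, T_LOW, T_HIGH))  # exists
print(decide([3.0, 0.0], anchors, T_LOW, T_HIGH))  # not_exists
```

Matching against one centroid per category first means most queries are resolved with a single distance computation, and only the ambiguous ones need more work.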


A video is worth a million words, so please see the demo below of how we increase customer engagement with AR and AI.

[Demo video]


This article focuses on sharing the overall flow, the architecture, and an example use case of AI applied in the real world. If you are curious about the technology, the development, or the business side, please follow my next articles.
