Optimizing online product catalog data to improve search & discovery - Alternatives and their pros and cons

Most retailers understand that an optimized, complete and trustworthy product catalog has a direct impact on the performance of their search and recommendation engines and on the overall customer experience. Several alternatives exist that attempt to help retailers address this challenge. This blog presents a high-level comparison of their pros and cons, based on our engagements with leading retailers across the US, Europe and India, and makes a case for using crowdsourced data analytics to automate product catalog onboarding.

First, definitions of the alternatives to “crowdsourced data analytics”:

  • Approach 1 - Data entry crowdsourcing platforms: Crowdsourcing platforms like Mechanical Turk provide access to an army of workers across the world who can manually enter missing product attribute values.
  • Approach 2 - Product APIs: Web-based data platforms such as Wiser, Indix and Semantics3 claim to have pre-built data repositories that can be integrated with the existing online catalogs of e-commerce vendors.
  • Approach 3 - Automated attribute prediction algorithms: Some sophisticated retailers build a set of NLP and ML models to extract attribute values from the image, title and description of each product. Since the image, title and description provided by suppliers are generally trustworthy, this method yields accurate attribute values as long as the models achieve a precision of 90% or more.
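
As a rough illustration of Approach 3, the sketch below extracts a single attribute (color) from a product title and measures its precision against a small hand-labeled sample. It is a rule-based stand-in, not a trained NLP or ML model. The color vocabulary, the function names, the sample titles and the use of 0.90 as a deployment bar are all assumptions made for the example.

```python
import re

# Hypothetical vocabulary for one attribute; a real system would learn this.
COLORS = ("black", "blue", "green", "red", "white")
COLOR_RE = re.compile(r"\b(" + "|".join(COLORS) + r")\b", re.IGNORECASE)


def extract_color(title):
    """Return the first known color mentioned in the title, or None."""
    match = COLOR_RE.search(title)
    return match.group(1).lower() if match else None


def precision(extractor, labeled_titles):
    """Share of non-empty predictions that match the hand-labeled value."""
    predicted = correct = 0
    for title, expected in labeled_titles:
        value = extractor(title)
        if value is None:
            continue  # abstentions do not count against precision
        predicted += 1
        correct += value == expected
    return correct / predicted if predicted else 0.0


# Made-up labeled sample, for illustration only.
SAMPLE = [
    ("Men's Classic Cotton T-Shirt Blue Large", "blue"),
    ("Red Ceramic Coffee Mug 12 oz", "red"),
    ("Stainless Steel Water Bottle 750 ml", None),
    ("Black Leather Wallet with Blue Stitching", "black"),
    ("White Noise Sound Machine, Black", "black"),  # "White" is not the color
]

score = precision(extract_color, SAMPLE)
print(f"precision = {score:.2f}, meets 0.90 bar: {score >= 0.90}")
```

The last sample title shows why the precision check matters: a naive extractor picks up "White" from the product name, so the attribute would need a better model before it could be trusted for automatic updates.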

Comparison Chart:

Quality of data

  • Approach 1 - HIGH: Because humans enter the values, data quality is reasonably good. Data entry errors do creep in, however, and need to be watched for.
  • Approach 2 - MEDIUM: The data is only as good as the source it was scraped from. It may also not be exhaustive, and in most cases several attribute values that a retailer needs are not available.
  • Approach 3 - HIGH: Because the models are tuned to each retailer’s specific attributes and taxonomy, the data is usually of consistently high quality.

Speed of updates

  • Approach 1 - LOW: Distributed workforces can handle at most 40,000 SKUs a month. This is rarely ideal, since products need to be onboarded every day and any delay in optimizing the catalog can mean lost sales due to poor customer experience.
  • Approach 2 - HIGH: Assuming the data provided is a good fit, updates can be made in near real time.
  • Approach 3 - HIGH: Once the models are deployed on a server, millions of products can be updated within hours.

Customizability

  • Approach 1 - HIGH: The crowd of data entry workers can be instructed to provide specific product values, allowing precise customization for each product in each category of the catalog.
  • Approach 2 - LOW: What you see is what you get, literally. Either what they have fits your needs or you can’t use it. Most mid-sized to large retailers we speak with have proprietary taxonomies and product attributes, for which this approach is not workable.
  • Approach 3 - HIGH: Because the models are built per attribute, each one can be customized to the taxonomies and attributes a retailer cares about.

Ease of integration

  • Approach 1 - LOW: Some crowdsourcing platforms do deliver responses through APIs and JSON streams, but in most cases the output arrives as CSV files that then have to be loaded into the catalog manually.
  • Approach 2 - LOW: Although the results are provided through APIs, integration can be a nightmare if the JSON stream doesn’t map directly to the data schema the retailer uses in its product catalog.
  • Approach 3 - HIGH: If the models can be deployed on an API server and integrated with the product catalog database, integration can be seamless.

Objective measure for data trustworthiness

  • Approach 1 - LOW: No consistent measure apart from visual inspection and qualitative judgments.
  • Approach 2 - LOW: No consistent measure apart from visual inspection and qualitative judgments.
  • Approach 3 - HIGH: Every ML model has a “trustworthiness” or precision score that can be measured at any time, giving a precise measure of how trustworthy the updated data is.

Cost of making updates

  • Approach 1 - HIGH: Depending on the number of SKUs, hundreds of data entry workers may need to be recruited.
  • Approach 2 - MEDIUM: The ready availability of the data makes this option relatively cost efficient.
  • Approach 3 - MEDIUM: There is an initial cost for building the models and some ongoing cost for maintaining and retraining them, but overall it is on par with the lowest-cost options.

Suitability for small retailers

  • Approach 1 - HIGH: A quick way to optimize the catalog if refreshes are few and the total number of SKUs managed is under 50,000.
  • Approach 2 - MEDIUM: Smaller retailers that lack a well-defined taxonomy or catalog can use this approach to get a base catalog quickly.
  • Approach 3 - MEDIUM: There is a one-time effort in building the models, but after that the cost is very manageable.

Suitability for large retailers

  • Approach 1 - LOW: This approach is very difficult to scale for retailers that manage millions of SKUs in their online stores.
  • Approach 2 - LOW: Larger retailers need solutions that integrate with minimal changes to their existing systems. They also have custom attributes designed around customer behavior observed on their sites. Given both drawbacks, this approach just doesn’t work for them.
  • Approach 3 - HIGH: Being custom designed and seamlessly integrated with existing platforms makes this the choice for most large retailers.
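
To put the Approach 1 speed figure in perspective, the short calculation below works out how long a crowd would need to process a catalog at the 40,000 SKUs per month ceiling quoted in the comparison. The function name and the one-million-SKU catalog size are assumptions chosen for illustration.

```python
import math

CROWD_SKUS_PER_MONTH = 40_000  # upper bound quoted in the comparison above


def months_to_process(total_skus, skus_per_month=CROWD_SKUS_PER_MONTH):
    """Whole months needed to work through a catalog at a fixed monthly rate."""
    return math.ceil(total_skus / skus_per_month)


# Catalog sizes are illustrative: the small-retailer ceiling from the chart
# and a hypothetical one-million-SKU catalog.
for catalog_size in (50_000, 1_000_000):
    print(catalog_size, "SKUs:", months_to_process(catalog_size), "months")
```

At that rate a one-million-SKU catalog takes about 25 months for a single pass, which is why manual entry struggles once daily onboarding is added on top.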

Approach 3 delivers the best results, but building it in-house is very expensive and hard to execute because data science talent is costly and scarce. Building and maintaining thousands of ML models, one for each product attribute, is close to impossible in-house, and as a result the effort gets neglected. The cost of a bad customer experience is also high, however, so retailers end up falling back on Approach 1 for lack of a better alternative.

What you really need is something that combines the quality and customizability of Approach 3, the speed of Approach 2 and the cost point of Approach 1.

This is exactly what CrowdANALYTIX’s dataX™ platform achieves.

Approach 4: dataX™ uses ML algorithms to update product catalog data automatically. Hundreds of models are built by a crowd of data scientists, one model for each product attribute. The models are then made accessible through an API for near real-time updates. This eliminates the need for people to update product attributes manually, which minimizes errors and reduces costs. The details can be found in our blog here.
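
The one-model-per-attribute idea can be pictured as a registry that maps each attribute name to its own predictor and fills in only the values a product is missing. The sketch below is purely illustrative and does not represent the platform's actual API, which this post does not describe. The `MODELS` registry, the `enrich` function, the confidence threshold and the toy keyword predictors are all hypothetical.

```python
# Hypothetical per-attribute predictors. Each returns (value, confidence);
# in practice these would be trained models rather than keyword checks.
def predict_material(product):
    title = product.get("title", "").lower()
    for material in ("cotton", "leather", "steel"):
        if material in title:
            return material, 0.95
    return None, 0.0


def predict_gender(product):
    title = product.get("title", "").lower()
    if "women's" in title:  # checked first because it contains "men's"
        return "women", 0.97
    if "men's" in title:
        return "men", 0.97
    return None, 0.0


# One entry per product attribute (hypothetical registry).
MODELS = {"material": predict_material, "gender": predict_gender}


def enrich(product, models=MODELS, min_confidence=0.90):
    """Return a copy of product with missing attributes filled by the models."""
    enriched = dict(product)
    for attribute, model in models.items():
        if enriched.get(attribute):
            continue  # keep values the supplier already provided
        value, confidence = model(product)
        if value is not None and confidence >= min_confidence:
            enriched[attribute] = value
    return enriched


item = {"title": "Men's Classic Cotton T-Shirt", "gender": "men"}
print(enrich(item))
```

Keeping each attribute behind its own entry means a single model can be retrained or replaced without touching the others, and the confidence threshold leaves uncertain values empty rather than writing a guess into the catalog.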
