Deploying Large Scale Classification Algorithms for Attribute Prediction

In our last post, we discussed automated product attribute classification: using text-based machine learning techniques to predict a product's attribute values, from a defined set of values, based on given product features such as title and description. As discussed there, as the catalogue size and the number of suppliers keep growing, the problem of maintaining the catalogue accurately grows rapidly, with thousands of attribute values and millions of products per day to classify.

In this post, we highlight some of the key steps we took to deploy machine learning algorithms that classify thousands of attributes on dataX™, CrowdANALYTIX’s proprietary big data curation and veracity optimization platform. As shown in the figure below, the client product catalog is extracted and curated, and a list of products (new products that need classification, or refreshes of old products) is sent to dataX™. The dataX™ ecosystem is designed to onboard millions of products each day and make high-precision predictions.

High-level workflow overview:

One of the key challenges was building the modeling pipeline for text / NLP while keeping precision levels up to 90% and spinning up multiple attribute models on demand. dataX™ pre-processes and cleans the raw data, prepares it for modeling, and then triggers the relevant attribute models for prediction. dataX™ can currently handle up to a million products an hour. The final attribute predictions are then merged into an output file and returned to the client processing pipeline for integration back into the product catalog. In addition, the models are continuously monitored for accuracy and retrained whenever accuracy dips below a given threshold.
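The flow described above (clean the raw data, trigger one model per attribute, merge the predictions, and flag a model for retraining when accuracy drops) can be sketched roughly as follows. This is a minimal illustration, not the dataX™ implementation: the keyword-based "models", the field names (`product_id`, `title`, `description`), and the 0.9 accuracy threshold are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for trained attribute models: each maps cleaned
# product text to one value from a defined set of values.
def predict_color(text: str) -> str:
    for color in ("red", "blue", "black"):
        if color in text:
            return color
    return "unknown"

def predict_material(text: str) -> str:
    for material in ("cotton", "leather", "steel"):
        if material in text:
            return material
    return "unknown"

# One model per attribute; the attribute names are illustrative only.
ATTRIBUTE_MODELS: Dict[str, Callable[[str], str]] = {
    "color": predict_color,
    "material": predict_material,
}

def clean(product: dict) -> str:
    """Combine title and description, lower-case, and collapse whitespace."""
    raw = f"{product.get('title', '')} {product.get('description', '')}"
    return " ".join(raw.lower().split())

def predict_attributes(products: List[dict]) -> List[dict]:
    """Run every attribute model on each product and merge the results per product."""
    merged = []
    for product in products:
        text = clean(product)
        row = {"product_id": product["product_id"]}
        for attribute, model in ATTRIBUTE_MODELS.items():
            row[attribute] = model(text)
        merged.append(row)
    return merged

def needs_retraining(predicted: List[str], actual: List[str],
                     threshold: float = 0.9) -> bool:
    """Return True when accuracy on a labeled sample falls below the threshold.

    The 0.9 default is an assumed value, not one stated in the post.
    """
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual) < threshold
```

In a real deployment each entry in `ATTRIBUTE_MODELS` would be a trained classifier, and the retraining check would run on a periodically labeled sample of predictions.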

Key points for simplifying client deployment:

  • Secure REST API for input / output
  • Pre-processing, Cleaning, Normalization of the data based on attribute or model 
  • Model Pipeline / Controller 
    • Involves a sequence of data pre-processing, feature extraction, model fitting, and validation stages
    • Classifying text documents involves text segmentation, cleaning, normalization, feature extraction, etc.
  • Auto-scaling
    • Building the necessary manifests to spin up and down cloud machines
    • Automatic job scheduling
    • Configuring High Availability and Disaster Recovery
  • Fast retrieval: output in the desired format (JSON, CSV, etc.)
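
As a rough illustration of the text-handling stages and output formats listed above, the sketch below chains normalization, segmentation, and a simple bag-of-words feature extraction, then serializes prediction rows as JSON or CSV. The function names and the bag-of-words features are assumptions chosen to keep the example self-contained; the post does not state which features or libraries the production pipeline uses.

```python
import csv
import io
import json
import re
from collections import Counter
from typing import Dict, List

def normalize(text: str) -> str:
    """Lower-case the text and replace runs of non-alphanumeric characters with a space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def segment(text: str) -> List[str]:
    """Split normalized text into whitespace-delimited tokens."""
    return text.split()

def extract_features(tokens: List[str]) -> Dict[str, int]:
    """Bag-of-words counts (an assumed, deliberately simple feature set)."""
    return dict(Counter(tokens))

def run_pipeline(raw_text: str) -> Dict[str, int]:
    """Apply the stages in sequence: normalize, segment, extract features."""
    return extract_features(segment(normalize(raw_text)))

def to_json(rows: List[dict]) -> str:
    """Serialize prediction rows as a JSON array."""
    return json.dumps(rows)

def to_csv(rows: List[dict]) -> str:
    """Serialize prediction rows as CSV, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The model fitting and validation stages would follow feature extraction; they are left out here because they depend on the chosen classifier.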

3 Key API Steps:

With the process outlined above and the dataX™ platform, we were able to successfully automate the prediction of attribute values for up to 1 million products an hour, for multiple attributes in parallel. The ML/NLP models for each attribute are built offline using the CrowdANALYTIX data science community and platform. In the next post, we will show how we used the CrowdANALYTIX data science platform to build multiple attribute models in parallel with the community of data scientists.
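
To make the idea of predicting multiple attributes in parallel concrete, here is a small sketch that submits one job per attribute model to a thread pool and collects the results by attribute name. The two toy models and the attribute names (`brand`, `pack_type`) are hypothetical; a production system would more likely spread CPU-heavy models across processes or machines, and a thread pool is used here only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical per-attribute models: each takes a batch of titles and
# returns one predicted value per title.
def predict_brand(titles: List[str]) -> List[str]:
    return [t.split()[0].lower() if t.split() else "unknown" for t in titles]

def predict_pack_type(titles: List[str]) -> List[str]:
    return ["multi-pack" if "pack" in t.lower() else "single" for t in titles]

MODELS: Dict[str, Callable[[List[str]], List[str]]] = {
    "brand": predict_brand,
    "pack_type": predict_pack_type,
}

def predict_in_parallel(titles: List[str],
                        models: Dict[str, Callable[[List[str]], List[str]]] = MODELS,
                        max_workers: int = 4) -> Dict[str, List[str]]:
    """Run every attribute model on the same batch concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one job per attribute, then gather results keyed by attribute.
        futures = {name: pool.submit(model, titles) for name, model in models.items()}
        return {name: future.result() for name, future in futures.items()}
```

Because each attribute model is independent of the others, adding an attribute only means registering another entry in `MODELS`.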

