picture credit: https://images.app.goo.gl/boAPRc6Cm8XGwt1J7

Utilizing the power of NLP on a Non-English Online Craftsman Service Marketplace

Guiding customers to choose the correct serviceId category

Sourish Dey
Published in CodeX · Mar 17, 2021 · 4 min read


Mobile App Screenshot

In developed economies like Germany, getting an affordable labor-intensive household service (plumbing, electrical work, painting, repairs, etc.) at short notice used to be difficult, especially for English-speaking expats like us. However, those days are gone with the recent explosive growth of the e-commerce industry, and this niche online service marketplace is no exception, with the emergence of platforms like co-tasker, Handwerker finden, etc. These apps connect customers with professional helpers and let them choose the best and most suitable offer. A few of them even have full English support and an English interface; however, most are still in German.

Business Case

Being new and not yet fully exposed to AI, these platforms share a common pain point: finding or selecting the correct serviceId based on the task description or task title. For example, without an intelligent dropdown that suggests a suitable serviceId category (e.g. plumbing, balcony repair, chimney work, etc.) based on the work description, consumers often submit the service request in the wrong serviceId category; mostly they are tempted to choose an easy option such as “Sonstiges” (Others).

For example, an actual painting job gets submitted to the “Others” serviceId category. This wrong serviceId selection causes the following issues:

• Jobs land in the wrong category.

• Tradesmen cannot find the right jobs because of the above scenario.

This has a direct impact on revenue because tradesmen cannot find relevant jobs while searching. The goal is to help consumers pick the right category by suggesting a correct serviceId category based on the service request details. So our task here is to build an efficient classifier that can predict the correct serviceId based on the description and other job details.

Mobile App Screenshot

The dataset and the detailed solution are kept in this GitHub repository. The dataset mainly has the following key fields, and the language is German.

Raw features

a) title: job title (string)

b) Description: detailed job description (string)

c) tradeClassificationType: (categorical)

d) target_date: target date of job completion relative to the job creation date

Target: serviceId
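
A minimal sketch, assuming a pandas DataFrame loaded from a CSV, of how these fields can be inspected. The file name below is hypothetical; the actual dataset lives in the GitHub repository.

```python
import pandas as pd

# Hypothetical file name; the actual dataset is in the GitHub repository
df = pd.read_csv("service_requests.csv")

# Fields as listed above (adjust names to match the actual columns)
text_cols = ["title", "Description"]            # free-text fields in German
categorical_cols = ["tradeClassificationType"]
target_col = "serviceId"

print(df[text_cols + categorical_cols + ["target_date", target_col]].head())
print(df[target_col].value_counts())            # class distribution of the target
```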

Solution Methodology

While the detailed methodology is explained in the Jupyter Notebook in this GitHub repository, the solution mainly has the following two components:

i) NLP: pre-processing and embedding creation using the text fields (title and Description). Here I tried two popular text embedding methodologies, a) Word2Vec and b) FastText, each with 100 dimensions, and then chose the better one (see the sketch after this list).

ii) Multi-class classifier: a final hyperparameter-tuned classifier to predict the serviceId using the embeddings and other feature engineering pipelines.
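
As a rough illustration of component i), here is a minimal sketch using gensim, assuming the title and Description fields are concatenated and tokenized, and each document vector is formed by averaging its word vectors. The actual pre-processing in the notebook may differ.

```python
import numpy as np
from gensim.models import Word2Vec, FastText
from gensim.utils import simple_preprocess

# Concatenate title and description, then tokenize the German text
texts = (df["title"].fillna("") + " " + df["Description"].fillna("")).tolist()
tokens = [simple_preprocess(t) for t in texts]

# Two candidate embedding models, 100 dimensions each (gensim >= 4.0 API)
w2v = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=2, workers=4)
ft = FastText(sentences=tokens, vector_size=100, window=5, min_count=2, workers=4)

def doc_vector(words, model, dim=100):
    """Average the word vectors of a document; zero vector if nothing is known."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Document-level embedding matrices for the two candidate models
X_w2v = np.vstack([doc_vector(doc, w2v) for doc in tokens])
X_ft = np.vstack([doc_vector(doc, ft) for doc in tokens])
```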

Solution Overview

The total dataset (646 data points) was split into train (85%) and hold-out validation (15%) sets.

Evaluation Criteria

As this is a multi-class problem, the main evaluation metric was accuracy on the validation data (96 data points). We also looked at other classification report metrics: class-wise precision, recall, f1-score, etc.
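
To make the split and the metrics concrete, here is a hedged scikit-learn sketch. The LogisticRegression is only a placeholder baseline, not the article's tuned classifier, and the random seed and stratification are assumptions of mine.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

y = df["serviceId"].values

# 85% train / 15% hold-out validation (~96 of the 646 rows), stratified by class
X_train, X_val, y_train, y_val = train_test_split(
    X_w2v, y, test_size=0.15, random_state=42, stratify=y
)

# Placeholder baseline model, not the article's tuned classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_val)

print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))  # class-wise precision, recall, f1
```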

L2 Normalization — Embedding Matrix

It’s a good practice to normalize the embedding matrix. It serves the following two purposes:
- a) The L2 norm is invariant under rotation and hence works better with gradient-based learning methods.
- b) It reduces the correlation among vectors.
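
A minimal sketch of this step with scikit-learn, assuming the embedding matrices from the earlier sketches:

```python
from sklearn.preprocessing import normalize

# Scale each document embedding (row) to unit L2 norm
X_train_l2 = normalize(X_train, norm="l2")
X_val_l2 = normalize(X_val, norm="l2")
```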

Final classifier with best hyperparameters

With the further L2 normalization and hyperparameter tuning, the classifier accuracy improved from 69.8% to 77.1% (a 10.46% relative improvement). Also, as an imbalanced classification problem, other metrics such as f1-score, recall, and precision improved both class-wise and overall.
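
This excerpt does not name the final estimator, so the sketch below uses scikit-learn's GradientBoostingClassifier with GridSearchCV purely as a stand-in to show the tuning pattern; the actual classifier and search grid are in the notebook.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative search space only; the notebook's actual grid may differ
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train_l2, y_train)

print("Best parameters:", search.best_params_)
print("Validation accuracy:", search.best_estimator_.score(X_val_l2, y_val))
```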

Further improvement strategy

  • PCA (Principal Component Analysis) based decorrelation of the embedding vector space (a sketch follows this list)
  • Try other NLP embedding feature-extraction approaches, e.g. a TensorFlow/Keras embedding layer, GloVe (Global Vectors for Word Representation), etc., and build the final classifier using those embedding vectors
  • Try different classifier algorithms: Random Forest, LightGBM, CatBoost, etc.
  • As an NLP-based classifier, given a larger dataset, a Keras LSTM (Long Short-Term Memory) based recurrent neural network (RNN) architecture could be explored.
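
For the first point above, a minimal PCA sketch; n_components=50 and whiten=True are arbitrary choices of mine, not values from the notebook.

```python
from sklearn.decomposition import PCA

# Project embeddings onto principal components; whiten=True additionally
# scales each component to unit variance, decorrelating the feature space
pca = PCA(n_components=50, whiten=True, random_state=42)
X_train_pca = pca.fit_transform(X_train_l2)
X_val_pca = pca.transform(X_val_l2)

print("Explained variance retained:", pca.explained_variance_ratio_.sum())
```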


Sourish Dey
Writer for CodeX

Sourish has 9+ years of experience in Data Science, Machine Learning, Business Analytics, and Consulting across domains, and is currently based in Berlin with dunnhumby GmbH.