Info

Testing and Issues

You can test this app and submit issues during the testing period of the Data Clustering Contest.

Entries with serious issues will not be able to win the contest, but even minor issues might be important for overall results.

Voting

28

Comments

Overview

Made using Metric, Boost, Lapack and C++17.

The core idea of my approach is:

1. Take pre-trained Word2Vec vocabularies for English and Russian

2. Cluster them into a large number of classes using the Metric framework

3. For each text, calculate an embedding:
-> create a zero vector whose size equals the number of clusters in the Word2Vec vocabulary
-> for each word in the text, look up the index of its cluster
-> increment the vector element at that index
-> at the end we have a single vector for each text

4. Once we have embedding vectors for the texts, we can cluster them, compare them with categories (also converted to embeddings, as if they were texts), etc.
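
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code from the submission: the `WordClusters` type and the function names are hypothetical, whitespace tokenization and skipping of out-of-vocabulary words are assumptions, and cosine similarity is only one possible way to "compare" a text embedding with a category embedding.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type: maps each vocabulary word to the index of its cluster
// (the result of clustering the Word2Vec vocabulary in step 2).
using WordClusters = std::unordered_map<std::string, std::size_t>;

// Step 3: build a text embedding with one counter per word cluster.
std::vector<double> embed_text(const std::string& text,
                               const WordClusters& word_to_cluster,
                               std::size_t num_clusters) {
    std::vector<double> embedding(num_clusters, 0.0);  // zero vector
    std::istringstream tokens(text);  // assumption: whitespace tokenization
    std::string word;
    while (tokens >> word) {
        auto it = word_to_cluster.find(word);
        if (it == word_to_cluster.end()) {
            continue;  // assumption: words outside the vocabulary are skipped
        }
        assert(it->second < num_clusters);
        embedding[it->second] += 1.0;  // increment the cluster's position
    }
    return embedding;
}

// Step 4 (one option): compare two embeddings by cosine similarity.
// Returns 0 when either vector is all zeros.
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

With this sketch, a category would be embedded by passing its description through `embed_text`, and a text would be assigned to the category whose embedding has the highest similarity.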

Further improvements:
- Tune hyperparameters
- Use multithreading and batch sampling
- Parse html to omit tags etc.

Details: https://github.com/Stepka/telegram_clustering_contest
Broken "З" letter in Russian headers (and quotes)

Issues

Fair Leopard Feb 28, 2020 at 15:11
Final score for this submission (out of 100):

Languages: 13.27
News EN: 12.82
News RU: 21.67
Categories EN: 47.53
Categories RU: 37.77
Threads EN: 11.36
Threads RU: 11.47
Top news EN: 34.04
Top news RU: 9.33

These data reflect the relative accuracy, precision and speed of the algorithm as compared to the other submissions.
30
Fair Leopard Feb 6, 2020 at 16:03
In our preliminary tests, this submission received the following scores (out of 100):

Languages: 90
News EN: 86
News RU: 82
Categories EN: 68
Categories RU: 58
Threads EN: 43
Threads RU: 29
Top EN: 66
Top RU: 34

This is not the final result, please stay tuned for updates. We apologize for the delay.
20
Fair Mammoth Feb 7, 2020 at 16:25
Preliminary testing of the algorithm revealed the following ranking shortcomings:

– Many major stories are missing from the ‘Main’ section and within categories. The ‘Main’ section contains irrelevant stories.
There are many overly broad stories, as well as stories consisting of a single article.

– Problems with headline formatting: punctuation marks and the capital letter З in the Russian top.

– The ordering of stories within categories is broken: important stories appear below less relevant ones.

– The ordering of articles within many large stories is broken: relevant articles are mixed with irrelevant ones.
20
I don't like classification in your top categories, sorry
In Russian, many news items are wrongly filtered out as non-news