Data Clustering Contest – Developer Challenges

Info

Author

Testing and Issues

You can test this app and submit issues during the testing period of the Data Clustering Contest.

Entries with serious issues will not be able to win the contest, but even minor issues might be important for overall results.

Voting

Issues

Fair Leopard Feb 28, 2020 at 15:11

Final score for this submission (out of 100):

Languages: 93.46
News EN: 69.76
News RU: 63.66
Categories EN: 61.45
Categories RU: 65.87
Threads EN: 36.81
Threads RU: 39.08
Top news EN: 20.66
Top news RU: 20.76

These data reflect the relative accuracy, precision and speed of the algorithm as compared to the other submissions.

Fair Leopard Feb 6, 2020 at 16:03

In our preliminary tests, this submission received the following scores (out of 100):

Languages: 100
News EN: 88
News RU: 93
Categories EN: 77
Categories RU: 78
Threads EN: 71
Threads RU: 52
Top EN: 63
Top RU: 64

This is not the final result, please stay tuned for updates. We apologize for the delay.

Fair Quokka Feb 7, 2020 at 20:42

During preliminary testing of the algorithm, the following ranking shortcomings were identified:

– Many major stories are missing from the ‘Main’ section and from within the categories. Irrelevant stories appear at the top. Stories are sorted by the number of articles they contain.

– The titles of some stories are too vague (the information is not presented in a brief, neutral form).

Fair Leopard Dec 12, 2019 at 16:46

We had to fix the following issues before running the algorithm and will apply relevant penalties during the final scoring:
- invalid news output format, fixed extra comma;
- invalid threads and top output format, fixed unescaped quote.

Bossy Gnu Dec 13, 2019 at 13:51

> invalid news output format, fixed extra comma;
Fixed. Can be reproduced with only one language in a dataset (missed in my tests).
> invalid threads and top output format, fixed unescaped quote
Fixed. OMG, shame on me! :)
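
Both format problems above (an extra comma and an unescaped quote) are typical of JSON assembled by string concatenation. A minimal sketch of the usual way to avoid them, assuming the output is built as plain Python data first; the `render_threads` name and the record layout are hypothetical, since the contest schema is not shown in this thread:

```python
import json


def render_threads(threads):
    """Serialize thread records with the standard library serializer."""
    # json.dumps escapes embedded quotes and never emits a trailing comma,
    # including for a list with a single element.
    # ensure_ascii=False keeps non-ASCII titles (e.g. Cyrillic) readable.
    return json.dumps(threads, ensure_ascii=False, indent=2)


# Hypothetical record layout, used for illustration only.
threads = [
    {"title": 'Court rules on "fair use" case', "articles": ["a1.html", "a2.html"]},
]

output = render_threads(threads)

# Round-trip check: the output parses back to the same data.
assert json.loads(output) == threads
```

A round-trip check like the last line, run on a dataset with a single language or a single thread, would likely have caught both issues before submission.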

Ace Cock Dec 13, 2019 at 10:57

Top threads of "Main" (both ru and en) consist of very loosely related articles.

Bossy Gnu Dec 13, 2019 at 12:45

Thank you for the comment. Yes, due to the extremely limited time I did not manage to tune the clustering algorithm perfectly. I used the unmodified Chinese Whispers algorithm, which has a well-known problem: an object that is similar to a second object, which is in turn similar to a third one, can end up in the same cluster as the third even when the two are not similar to each other. This issue is fixed now. Anyway, IMO there are no significant errors or problems in my implementation of the contest tasks.
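
For readers unfamiliar with it, here is a minimal sketch of the unmodified Chinese Whispers label propagation on a weighted similarity graph. It is not the submission's code, and it does not include the fix mentioned above, which is not described in this thread. The edge format, the `iterations` and `seed` parameters, and the tie-breaking rule are assumptions made for the example.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, iterations=20, seed=0):
    """Cluster nodes of an undirected weighted graph.

    edges: iterable of (node_a, node_b, weight) tuples.
    Returns a dict mapping each node to a cluster label.
    """
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w

    nodes = sorted(graph)
    # Every node starts in its own cluster.
    labels = {node: i for i, node in enumerate(nodes)}
    rng = random.Random(seed)

    for _ in range(iterations):
        rng.shuffle(nodes)  # nodes are visited in random order on each pass
        changed = False
        for node in nodes:
            # Sum the edge weights per neighbouring label.
            scores = defaultdict(float)
            for neighbor, weight in graph[node].items():
                scores[labels[neighbor]] += weight
            best = max(scores.values())
            candidates = [label for label, s in scores.items() if s == best]
            # Assumed tie-break: keep the current label if it is among the
            # best, otherwise take the smallest one (keeps runs reproducible).
            new_label = labels[node] if labels[node] in candidates else min(candidates)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Two tight groups joined by one weak edge.
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 0.1),
]
print(chinese_whispers(edges))
```

Because labels spread along edges, a chain of pairwise-similar articles can pull loosely related ones into the same thread, which matches the symptom reported for the "Main" threads.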

Big Rat Dec 17, 2019 at 14:46

Quite impressive news categorization and news/no-news filtering! The thread grouping is not so great, but to me the approach to improving it is rather clear.
Processing speed is really great!

Bossy Gnu Dec 17, 2019 at 23:16

Thanks!
"threads" & "top" are tuned and look nice now. I am going to upload new output JSON files to show how they work now.
I used quantized embedding models to keep my submission below 200MB. On the other hand, it takes about 8-10 seconds per language to restore the model, while an uncompressed model can be loaded in a few tens of microseconds.
And sure, you can load the language models only once and reuse them in further tasks. But I have to load and restore my model for each contest step (except the first step, language detection).
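
The size versus load-time trade-off can be illustrated with a minimal sketch of 8-bit scalar quantization. The quantization scheme actually used in the submission is not stated in this thread, so the scheme and the `quantize` and `restore` names are assumptions for illustration only.

```python
from array import array


def quantize(vector):
    """Map floats to 8-bit codes plus an offset and a scale."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant vectors
    codes = array("B", (min(255, round((x - lo) / scale)) for x in vector))
    return codes, lo, scale


def restore(codes, lo, scale):
    """Rebuild approximate 32-bit floats; this is the step that costs time at load."""
    return array("f", (lo + c * scale for c in codes))


vector = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize(vector)
restored = restore(codes, lo, scale)

# One byte per value instead of the four bytes array("f") typically uses.
print(codes.itemsize, restored.itemsize)
```

Each stored value shrinks to one byte, but every value has to be recomputed on restore, which is why restoring a large quantized model on every step adds up.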
