Data Clustering Contest – Developer Challenges

Info

Author

Testing and Issues

You can test this app and submit issues during the testing period of the Data Clustering Contest.

Entries with serious issues will not be able to win the contest, but even minor issues might be important for overall results.

Voting

213

Comments

Daring Frog Dec 16, 2019 at 17:17

At https://contest.com/docs/data_clustering it says that "We will not evaluate apps that require more than 60 seconds for each batch of 1000 files passed in source_dir.".

From this statement (specifically the “batch of 1000” wording) I inferred that processing would be done in batches of 1,000 files, so I hard-wired a “break” at 1,000. As a result, the submitted binary does not process anything beyond that file count.

Daring Frog Dec 16, 2019 at 17:18

I removed the break, and it now processes the “raw” directory you provided (35075 files) with the following timings (on an archaic MacBook Pro 2012):

top: 241 sec
threads: 235 sec
categories: 139 sec
news: 125 sec
languages: 130 sec
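
For comparison with the documented limit of 60 seconds per batch of 1000 files, the timings above can be normalized to seconds per 1000 files. This is only a rough sketch: it assumes run time scales linearly with file count, and the variable and function names are hypothetical, not taken from the submission. The figures come from my laptop, so they say nothing definite about the contest's test machine.

```python
# Timings reported above for the full "raw" directory (35075 files).
FILE_COUNT = 35075
LIMIT_SECONDS_PER_1000 = 60.0  # limit quoted from the contest docs

timings_sec = {
    "top": 241,
    "threads": 235,
    "categories": 139,
    "news": 125,
    "languages": 130,
}


def seconds_per_1000(total_seconds: float, file_count: int) -> float:
    """Scale a total run time to a per-1000-files figure (assumes linear scaling)."""
    return total_seconds / file_count * 1000


per_batch = {mode: seconds_per_1000(t, FILE_COUNT) for mode, t in timings_sec.items()}

for mode, secs in per_batch.items():
    within = secs <= LIMIT_SECONDS_PER_1000
    print(f"{mode}: {secs:.2f} s per 1000 files (within limit: {within})")
```

Under that linear-scaling assumption, every mode comes out well below 60 seconds per 1000 files on this machine.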

Daring Frog Dec 16, 2019 at 17:18

JSON responses for each request are here: https://www.dropbox.com/sh/mjvb79l9hzxyf8x/AAB4KxNd_Qv9AF_AgPlRou4Aa?dl=0

I can send you the binary without this 1K limit per directory if you want to rerun it on your end. Alternatively, you can remove the break (lines 89 and 92) and recompile; it’s straightforward and should take 5-10 minutes at most.


Issues

Fair Leopard Feb 28, 2020 at 15:11

Final score for this submission (out of 100):

Languages: 12.56
News EN: 38.34
News RU: 46.86
Categories EN: 12.43
Categories RU: 12.31
Threads EN: 11.58
Threads RU: 21.41
Top news EN: 18.91
Top news RU: 32.84

These data reflect the relative accuracy, precision and speed of the algorithm as compared to the other submissions.

Fair Leopard Feb 6, 2020 at 16:03

In our preliminary tests, this submission received the following scores (out of 100):

Languages: 98
News EN: 67
News RU: 75
Categories EN: 35
Categories RU: 34
Threads EN: 54
Threads RU: 38
Top EN: 57
Top RU: 64

This is not the final result; please stay tuned for updates. We apologize for the delay.

Fair Mammoth Feb 7, 2020 at 20:40

During preliminary testing of the algorithm, the following shortcomings in ranking were identified:

– Some main stories are missing from the ‘Main’ section and within categories. Stories are sorted by the number of articles they contain.

– The titles of some stories are too vague (the information is not presented in a concise, neutral form). Punctuation is missing from many story titles: for example, commas and hyphens are omitted.

– The ordering of articles within stories is broken: relevant articles are mixed with irrelevant ones.

Fair Leopard Dec 17, 2019 at 12:58

This entry had to be reuploaded after the deadline due to an issue and will not receive any prizes in the current stage.

Its author, however, may get the chance to participate in the next round of the Data Clustering Competition.

Large Crab Dec 12, 2019 at 21:43

Poor news detection
