Data Clustering Contest – Developer Challenges

Info

Author

Testing and Issues

You can test this app and submit issues during the testing period of the Data Clustering Contest.

Entries with serious issues will not be able to win the contest, but even minor issues might be important for overall results.

Voting

Comments

Large Dodo Dec 13, 2019 at 07:27

[PART 1]
Hi Telegram Contest

This website says I'm the author of this submission: https://entry1220-dcround1.usercontent.dev/ru/

But the output from https://entry1220-dcround1.usercontent.dev/ru/ is DRAMATICALLY DIFFERENT from the output I get when I run my submission locally against the dataset at https://data-static.usercontent.dev/DataClusteringDataset.tar.gz

DETAILS of my local machine are as follows:

OS:
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster

CPU: 2
MEMORY: 8G
APT DEPENDENCIES: none

DATASET PATH:
~/telegram_data_clustering/contest_data/

Large Dodo Dec 13, 2019 at 07:28

[PART 2]

Command:
./tgnews languages /home/tg/telegram_data_clustering/contest_data/

Time consumed:
real 0m9.571s
user 0m10.620s
sys 0m12.054s

---

Command:
./tgnews news ~/telegram_data_clustering/contest_data/

real 0m13.483s
user 0m13.624s
sys 0m12.634s

---

Command:
./tgnews categories ~/telegram_data_clustering/contest_data/

real 0m15.876s
user 0m16.558s
sys 0m13.379s

---

Command:
./tgnews threads ~/telegram_data_clustering/contest_data/

real 0m27.066s
user 0m33.836s
sys 0m21.709s

---

Command:
./tgnews top ~/telegram_data_clustering/contest_data/

real 0m35.285s
user 0m42.908s
sys 0m23.710s
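
Timings like the ones above could be collected with a small wrapper instead of running each command by hand. This is only a sketch: it assumes the `tgnews` binary takes `<mode> <directory>` as in the commands above, `time_command` and `time_all` are hypothetical helper names, and only wall-clock ("real") time is measured, not user/sys time.

```python
import subprocess
import time

# The five modes used in the commands above.
MODES = ("languages", "news", "categories", "threads", "top")


def time_command(argv):
    """Run argv, discard its stdout, and return (exit_code, wall_seconds)."""
    start = time.perf_counter()
    proc = subprocess.run(argv, stdout=subprocess.DEVNULL)
    return proc.returncode, time.perf_counter() - start


def time_all(binary, data_dir):
    """Time every mode once; returns {mode: (exit_code, wall_seconds)}."""
    return {mode: time_command([binary, mode, data_dir]) for mode in MODES}


# Example call (paths are assumptions; "~" is not expanded without a shell):
# results = time_all("./tgnews", "/home/tg/telegram_data_clustering/contest_data/")
```

A single run per mode is noisy, so repeating each command a few times would give a fairer comparison.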

Comment attachments don't allow JSON, text, or zip files, so I uploaded the output to a public Telegram channel: https://t.me/data_cluster_output/3


Issues

Fair Leopard Feb 28, 2020 at 15:11

Final score for this submission (out of 100):

Languages: 13.33
News EN: 12.92
News RU: 13.04
Categories EN: 13.1
Categories RU: 13.19
Threads EN: 12.97
Threads RU: 0
Unfortunately, this submission didn't get a high enough score to be evaluated for Top news (task 5).

These data reflect the relative accuracy, precision and speed of the algorithm as compared to the other submissions.

Fair Leopard Feb 6, 2020 at 16:03

In our preliminary tests, this submission received the following scores (out of 100):

Languages: 89
News EN: 87
News RU: 93
Categories EN: 10
Categories RU: 6
Threads EN: 46
Threads RU: 0

Unfortunately, this submission didn't get a high enough score for the final task (top news) to be evaluated.

This is not the final result, please stay tuned for updates. We apologize for the delay.

Fair Leopard Dec 12, 2019 at 17:02

The following issues have been discovered during preliminary testing:
- on the first run, the script did not exit after writing JSON to stdout

Large Dodo Dec 13, 2019 at 07:33

This is highly unlikely, because process.exit is hard-coded directly after the print command. So once stdout has received the message, the process exits.

Possible reasons:

1. That is NOT my submission; see my comments above.

2. The user account on the Debian test machine has incorrect permissions for accessing the OS temp folder.
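
A symptom like "printed JSON but did not exit" can be checked locally by running the command under a timeout and verifying both that the process exits and that stdout parses as JSON. This is a minimal sketch and not the contest's actual test harness; `check_exits` and the 60-second default limit are assumptions for illustration.

```python
import json
import subprocess


def _is_json(text):
    """True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def check_exits(argv, timeout_s=60.0):
    """Return (exited, valid_json) for a command expected to print JSON and exit."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        # The child was killed after timeout_s; inspect whatever it printed.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode("utf-8", "replace")
        return False, _is_json(out)
    return True, _is_json(proc.stdout)


# Example call (path is an assumption):
# check_exits(["./tgnews", "languages", "/path/to/contest_data/"])
```

A result of `(False, True)` would correspond to the reported issue: valid JSON arrived on stdout, but the process was still running when the limit expired.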

Fair Leopard Dec 15, 2019 at 20:04

There was an issue on our side. The problem has now been fixed.

Large Dodo Dec 16, 2019 at 16:07

Thank you, Fair Leopard.

Swift Skunk Dec 17, 2019 at 17:45

Poor categorization accuracy: almost all articles are classified as Other

Large Dodo Dec 18, 2019 at 14:33

That's true. Categorization is computed from a pre-processed datasheet. I was in a rush and only completed less than 1/10 of it.
The idea is to maintain that datasheet on a regular basis.
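
To illustrate why an incomplete datasheet would push almost everything into Other, here is a sketch of a lookup-based categorizer. It is not the submission's code: the keyword-to-category layout of `DATASHEET`, its sample entries, and the function name `categorize` are all assumptions.

```python
import re

# Hypothetical miniature "datasheet": keyword -> category.
DATASHEET = {
    "election": "society",
    "inflation": "economy",
    "smartphone": "technology",
    "football": "sports",
}


def categorize(text, datasheet=DATASHEET):
    """Return the category with the most keyword hits, or "other" if none match."""
    counts = {}
    for token in re.findall(r"\w+", text.lower()):
        category = datasheet.get(token)
        if category is not None:
            counts[category] = counts.get(category, 0) + 1
    if not counts:
        # No keyword from the datasheet appears in the text.
        return "other"
    return max(counts, key=counts.get)
```

With a sparse table, most articles contain no known keyword and fall through to "other", which is consistent with the behaviour reported in this issue.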

Swift Skunk Dec 17, 2019 at 17:48

Poor news detection accuracy: very few non-news articles are detected

Large Dodo Dec 18, 2019 at 14:34

True, I should have thought more carefully about non-news articles

Swift Skunk Dec 17, 2019 at 17:51

Clustering is too conservative: very few threads contain more than one article.

Large Dodo Dec 18, 2019 at 14:35

True. The clustering in my submission is not of good quality compared to others. More criteria should be taken into consideration.
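
As an illustration of how a strict similarity criterion leads to mostly single-article threads, here is a sketch of greedy clustering on title tokens with a Jaccard threshold. It is not the submission's algorithm: the choice of Jaccard similarity, the function names, and the threshold values are assumptions.

```python
import re


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_titles(titles, threshold):
    """Greedy single pass: a title joins the first thread whose seed title
    is at least `threshold` similar, otherwise it starts a new thread."""
    threads = []  # list of (seed_tokens, member_titles)
    for title in titles:
        tokens = set(re.findall(r"\w+", title.lower()))
        for seed, members in threads:
            if jaccard(tokens, seed) >= threshold:
                members.append(title)
                break
        else:
            threads.append((tokens, [title]))
    return [members for _, members in threads]
```

For example, "Storm hits the coast" and "Storm hits coast overnight" share 3 of 5 distinct tokens (similarity 0.6), so they stay in separate threads at a threshold of 0.9 but merge at 0.5. Adding more criteria, as suggested above, could mean combining such a score with other signals rather than only lowering the threshold.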

