Info

Open Website

Testing and Issues

You can test this app and submit issues during the testing period of the Data Clustering Contest.

Entries with serious issues will not be able to win the contest, but even minor issues might be important for overall results.

Voting

10

Comments

Features:
1) Almost language-independent: to learn categories, you need to specify 5-10 main words for each category and then
feed in about 1000 articles for each language (see "./config/categories.json" and "tgnews learn").
2) Python multi-threading is used; all CPUs should be loaded.
3) Can be easily transferred to scalable micro-service architecture.
4) Rather fast: linear speed for the languages, news and categories tasks.
5) The "pattern" package is used, so many additional features can be implemented later.
6) No dependencies (MariaDB is required, but I hope it is installed by default)
7) Tested with Debian 10.1 in VirtualBox
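
The parallel-processing feature above can be illustrated with a minimal sketch. It is not the code of this entry: it assumes a process pool (in CPython, CPU-bound work usually needs processes rather than threads to load several cores), and the names usable_cpus, count_words and process_all are hypothetical. The first helper reports how many CPUs the process is actually allowed to use, which is worth checking when a run seems to be stuck on a single core.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def usable_cpus():
    """Number of CPUs this process may actually run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects CPU pinning of the process (e.g. via taskset).
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def count_words(text):
    """Hypothetical stand-in for the real per-article work."""
    return len(text.split())


def process_all(texts, workers=None):
    """Spread the per-article work over one worker process per usable CPU."""
    with ProcessPoolExecutor(max_workers=workers or usable_cpus()) as pool:
        return list(pool.map(count_words, texts, chunksize=64))
```

process_all should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on every platform.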


Algorithms:
1) Language : langdetect package
2) News : simple topic parser
3) Categories : KNN based on lemmatized plain text
4) Threads : merging is KNN based on plain text, sorting is based on KNN joined topics for each thread
5) Top : KNN based on joined topics
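
As a rough illustration of the KNN idea named in the list above, here is a minimal sketch. It is an assumption-laden toy, not the entry's implementation: it uses plain bag-of-words counts with cosine similarity, skips lemmatization, and the names vectorize, cosine, knn_category and EXAMPLES are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words counts for plain text (no lemmatization in this sketch)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_category(text, labeled, k=3):
    """Return the majority category among the k most similar labeled texts."""
    query = vectorize(text)
    ranked = sorted(
        labeled,
        key=lambda item: cosine(query, vectorize(item[1])),
        reverse=True,
    )
    votes = Counter(category for category, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training examples: (category, article text).
EXAMPLES = [
    ("sports", "the team won the football match"),
    ("sports", "the player scored a goal in the match"),
    ("economy", "the bank raised interest rates"),
    ("economy", "stock market prices fell as rates rose"),
]

print(knn_category("football match goal", EXAMPLES))
```

With about 1000 articles per language, a real implementation would precompute the vectors once instead of re-vectorizing every labeled text per query.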
The solution requires the initially stated 8 cores with 8 threads, which allow 8 parallel processes to run at the required speed. However, according to the tests it looks as if it was started on a single thread, so the speed tests failed.

Issues

Fair Leopard Feb 28, 2020 at 15:11
Final score for this submission (out of 100):

Languages: 11.81
News EN: 32.39
News RU: 55.1
Categories EN: 0
Categories RU: 0
Threads EN: 0
Threads RU: 0
Unfortunately, this submission didn't get a high enough score to be evaluated for Top news (task 5).

These data reflect the relative accuracy, precision and speed of the algorithm as compared to the other submissions.
30
Fair Leopard Feb 6, 2020 at 16:03
In our preliminary tests, this submission received the following scores (out of 100):

Languages: 98
News EN: 84
News RU: 91
Categories EN: 0
Categories RU: 0
Threads EN: 0
Threads RU: 0

Unfortunately, this submission didn't get a high enough score for the final task (top news) to be evaluated.

This is not the final result, please stay tuned for updates. We apologize for the delay.
20
Timeout everywhere except language detection
3
Mad Crow Dec 12, 2019 at 22:12
I believe something went terribly wrong on your side. I just checked on my almost 10-year-old i7-3770 (3.5 GHz, 4 cores, 8 threads): the largest RU dataset was processed in "threads" mode with debug info in 314 seconds, and "top" mode took 231 seconds. I believe you didn't try the raw dataset in news/categories/threads/top; however, it took 2120 s for "top" to process 35k files, merging both RU and EN articles into top topics for each category (near the edge). Languages, News and Categories have equal, linear speed, and "categories" on the RU dataset took 78 s. Are you sure all of your cores were free for my tgnews? I understand that Python is not the fastest language and that the edge is somewhere near the size of your dataset in threads and top mode (for production needs there is great potential for scalability with databases and load balancers), but I believe that in this case something went wrong from the beginning. I would highly appreciate your attention to this issue!
You can try it yourself, just click the "open website" button in the top right corner of this page :)
1
Mad Crow Dec 12, 2019 at 22:37
I see this but do not understand how it could happen =( I also see the problem with categories in the RU dataset, and it really exists, but I do not understand how a timeout could happen with the smaller EN dataset. Processing speed is linear and very far from the edge, at least for news/categories/languages, and is enough to process threads and top for your dataset. That is why I wonder how it could happen.