Data Clustering Contest – Developer Challenges

Fair Leopard Feb 28, 2020 at 15:11

Final score for this submission (out of 100):

Languages: 70.08
News EN: 54.55
News RU: 76.28
Categories EN: 57.52
Categories RU: 60.87
Threads EN: 51.79
Threads RU: 44.99
Top news EN: 32.46
Top news RU: 51.14

These data reflect the relative accuracy, precision and speed of the algorithm as compared to the other submissions.

30

Fair Leopard Feb 6, 2020 at 16:03

In our preliminary tests, this submission received the following scores (out of 100):

Languages: 99
News EN: 87
News RU: 92
Categories EN: 78
Categories RU: 76
Threads EN: 75
Threads RU: 56
Top EN: 65
Top RU: 76

This is not the final result, please stay tuned for updates. We apologize for the delay.

20

Fair Quokka Feb 7, 2020 at 16:03

Preliminary testing of the algorithm revealed the following shortcomings in ranking:

– Not all important stories are represented in the 'Main' section. Many stories consist of a single article.

– The titles of some stories are too vague (the information is not presented in a brief, neutral form). Article sorting is broken in some stories: relevant articles are mixed with irrelevant ones.

20

Hairy Lemur Dec 12, 2019 at 19:57

A lot of one-article threads.

3

Hip Hyena Dec 12, 2019 at 20:42

I don't really see a problem with it.

There was no restriction on a minimum thread size, so there's no reason to throw away one-article threads.

Alex К Dec 21, 2019 at 07:44

I liked the results! Was it Python? And is it possible to look at the implementation? :) I'm quite excited about the code. @aleexk

3

Hip Hyena Dec 21, 2019 at 10:30

Thank you!

My solution is written in Go. I'm planning to publish its source code on Github after the end of this contest.

Suave Duck Dec 13, 2019 at 08:30

A lot of one-article threads in the top. A lot of one-article threads overall. Threads and top are not implemented correctly: "Group similar news into threads. Your algorithm must identify news articles about the same event and group them together into threads"

1

Hip Hyena Dec 13, 2019 at 13:48

First of all, that's a duplicate: Hairy Lemur already posted a comment about one-article threads. Maybe we need to group issues here too, so I don't have to repeat myself.

Secondly, I still don't see how this implementation is incorrect. If the dataset contains a single article about some event, it should be turned into a thread with only that article. It would still constitute a group of news articles about the same event (just with a size of 1).

Imagine you're a user visiting a news aggregator. You probably want to see all topics, not just the ones represented by two or more articles.
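To illustrate the point about thread size, here is a minimal sketch of grouping without a minimum-size filter. It is not the submission's actual code: the `Article` and `Thread` types and the `EventKey` field are hypothetical, and `EventKey` simply stands in for whatever similarity signal a real clustering step would produce.

```go
package main

import (
	"fmt"
	"sort"
)

// Article is a hypothetical record. EventKey is an assumed, precomputed
// label for the event the article covers.
type Article struct {
	ID       string
	EventKey string
}

// Thread is a group of articles about the same event.
type Thread struct {
	EventKey string
	Articles []Article
}

// GroupIntoThreads groups articles by EventKey. No minimum size is applied,
// so an event covered by a single article yields a thread of size 1.
func GroupIntoThreads(articles []Article) []Thread {
	byKey := make(map[string][]Article)
	for _, a := range articles {
		byKey[a.EventKey] = append(byKey[a.EventKey], a)
	}
	threads := make([]Thread, 0, len(byKey))
	for key, group := range byKey {
		threads = append(threads, Thread{EventKey: key, Articles: group})
	}
	// Sort for deterministic output: larger threads first, then by key.
	sort.Slice(threads, func(i, j int) bool {
		if len(threads[i].Articles) != len(threads[j].Articles) {
			return len(threads[i].Articles) > len(threads[j].Articles)
		}
		return threads[i].EventKey < threads[j].EventKey
	})
	return threads
}

func main() {
	threads := GroupIntoThreads([]Article{
		{"a1", "election"}, {"a2", "election"}, {"a3", "local-flood"},
	})
	for _, t := range threads {
		fmt.Printf("%s: %d article(s)\n", t.EventKey, len(t.Articles))
	}
}
```

Dropping single-article threads would take one extra `if len(group) < 2` check in the loop, but then the "local-flood" story would disappear from the output entirely.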

Little Swan Dec 23, 2019 at 12:11

I’m probably less interested in money. I wanted to share a story with you about Go :)

1

Hip Hyena Dec 23, 2019 at 12:31

I'm not talking about money either. All I'm saying is that you didn't prove that your code is faster.

Any language can be used to write fast or slow programs. And any program can perform very differently on different hardware. That's why any testing should be performed under the exact same conditions.

Ace Cock Dec 13, 2019 at 08:29

I wonder why the biggest thread about football didn't make it to the top of all categories. I kinda agree that it's less important than world politics, I'm just curious if your importance metric is based on actual content rather than cluster weight.

Hip Hyena Dec 13, 2019 at 14:03

You're right, my ranking is based on multiple different factors, and the size of a cluster is just one of them. I was trying to match my own feeling of what's important and what's not.
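As a rough illustration of how cluster size can be only one factor among several, here is a sketch of a combined score. The formula, the `ThreadStats` type, and the `CategoryWeight` values are all assumptions made for the example; the thread does not reveal which factors or weights the actual ranking uses.

```go
package main

import (
	"fmt"
	"math"
)

// ThreadStats is a hypothetical summary of a thread.
type ThreadStats struct {
	Title          string
	Size           int     // number of articles in the thread
	CategoryWeight float64 // assumed editorial weight of the category
}

// Importance combines a dampened size term with a category weight.
// The logarithm keeps very large threads from dominating on size alone.
func Importance(t ThreadStats) float64 {
	return math.Log1p(float64(t.Size)) * t.CategoryWeight
}

func main() {
	football := ThreadStats{Title: "Football final", Size: 40, CategoryWeight: 0.5}
	politics := ThreadStats{Title: "World summit", Size: 12, CategoryWeight: 1.0}
	fmt.Printf("%s: %.2f\n", football.Title, Importance(football))
	fmt.Printf("%s: %.2f\n", politics.Title, Importance(politics))
}
```

With these made-up weights the smaller politics thread scores higher than the larger football thread, which matches the behaviour described in the question above.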

Little Swan Dec 23, 2019 at 07:52

Something went wrong with languages.
In Ruby, under the interpreter, it runs ~1.5-2 times faster.
More precise clustering thanks to FFI plus a C++ neural network.

Hip Hyena Dec 23, 2019 at 10:23

I don't think the claim "my entry is better" should be considered an issue with my work.

Even so, your submission runs language detection in ~2.5s per thousand files (mine in ~0.2s, so it's 12x faster, not 2x slower). I cannot say anything about clustering because your submitted version does not produce top/threads at all.

Little Swan Dec 23, 2019 at 10:30

Try version 1.1.5-1.1.6 from my application.
It can still be accelerated, but the fps is better there.
And note that mine runs under an interpreter, whereas in your case it's a compiled language :)

Have fun!

Hip Hyena Dec 23, 2019 at 10:41

To correctly compare timings you'd need to run code on machines with the exact same specifications (same CPU and disk drive).

Saying that your code works faster (on your computer) than mine (on the test server) does not mean anything.

Little Swan Dec 23, 2019 at 10:50

Why the same machines? This is an excuse. The fact that your Go algorithm showed no advantage over the algorithm in the interpreted Ruby language suggests that the Go algorithm still needs to be parallelized. :)

Hip Hyena Dec 23, 2019 at 12:08

The advantage is clearly visible to everyone: 2.5s vs 0.2s, and these are the officially confirmed timings.

If the judges decide to re-run your solution and it is confirmed to run faster, I will stand corrected.

Little Swan Dec 24, 2019 at 06:34

Iterative: 6.880308484000125s first test dataset on language algorithm (Ruby) :)
But you and I can still accelerate this. There are still few algorithms.

On the contest test benches, the Ruby algorithm can run even faster.

Have fun!

Little Swan Dec 24, 2019 at 06:46

Iterative: 11.177299876000689s second test dataset on language algorithm (Ruby) :)

On big data it will be very noticeable.

Hip Hyena Dec 24, 2019 at 06:57

I really think that those comments would be more relevant under your own entry. They have nothing to do with mine.

Little Swan Dec 24, 2019 at 07:58

This is just a flaw in the Go algorithm. The correct delivery of news also suffers, but that is the case in many entries :)))

The current test bench for the Ruby solution: i5 (6 cores), 16 GB of RAM.

Little Swan Dec 25, 2019 at 11:53

https://data-static.usercontent.dev/sampledata/20191129/13/589988349773351763.html

This is similar to the news, but in the first issue ru is absent on Go.


Deleted Account Jan 22, 2020 at 23:27

đ4a0f719
