
Testing and Issues

You can test this app and submit issues during the testing period of the Data Clustering Contest, Stage 2.

Entries with serious issues will not be able to win the contest, but even minor issues might be important for overall results.

Voting

23

Issues

Fair Quokka Jul 31, 2020 at 22:14
Testing of the algorithm revealed the following shortcomings in ranking:

1. RU
– Some top threads are missing from the ‘Main’ section and within categories.
– The titles of many threads do not reflect their content.
– Article sorting is broken in many threads: irrelevant articles are shown above relevant ones.
– Some top threads are not relevant to a broad audience in Russia.

2. EN
– Many top threads are missing from the ‘Main’ section and within categories.
– The titles of many threads do not reflect their content.
– Article sorting is broken in some threads: irrelevant articles are shown above relevant ones.
– Many top threads are not relevant to a broad English-speaking audience.
20
Fair Leopard Jul 7, 2020 at 16:10
In our preliminary tests, this submission received the following scores (out of 100):

Languages: 100
News EN: 75
News RU: 89
Categories EN: 71
Categories RU: 63
Threads EN: 77
Threads RU: 62
10
Gifted Lemur Jul 7, 2020 at 16:24
It looks like the May 27 results have the same locale settings issue that was fixed in #issue11153. Could you re-run it with the correct locale?

Results for the other three days look good.
Fair Leopard Jun 24, 2020 at 00:12
We re-ran your algorithm with the correct locale settings.
4
Gifted Lemur Jun 24, 2020 at 06:40
Thanks! Looks much better now!
Fair Leopard Jul 7, 2020 at 17:48
1
Gifted Lemur Jul 7, 2020 at 20:35
Thanks!

Another question. Some of the articles are shown as just "<some-id>.html" instead of title + summary + link to the full article. Does this happen because those articles should already have been removed by their expiration time, but my program still returns them?

Also, based on my app server log, some of the articles have a strange publication time (like "-001-11-30T00:00:00+00:00"), which is not a valid ISO 8601 time. Is this expected?
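
A minimal sketch of how a server could guard against such timestamps, assuming the publication time arrives as a string and that rejecting unparseable values is acceptable. The function name `parse_published_time` is hypothetical and not part of the contest API; the sketch relies only on the standard `datetime.fromisoformat`.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_published_time(value: str) -> Optional[datetime]:
    """Return an aware UTC datetime, or None if the value is unusable."""
    try:
        parsed = datetime.fromisoformat(value)
        if parsed.tzinfo is None:
            # Assumption: timestamps without a UTC offset are treated as invalid.
            return None
        return parsed.astimezone(timezone.utc)
    except (ValueError, OverflowError):
        # Covers malformed strings such as "-001-11-30T00:00:00+00:00".
        return None


for raw in ("2020-07-04T12:30:00+00:00", "-001-11-30T00:00:00+00:00"):
    print(raw, "->", parse_published_time(raw))
```

Returning `None` rather than raising lets the caller decide whether to skip the article or log its id for a later report.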
Looks like something went horribly wrong in the top section (server mode):
- all articles with any relation to Covid-19 were grouped together into a huge cluster of 1751 articles.
- the English section contains articles in all languages mixed together, and the Russian section has no articles at all.
- many thread titles consist of question marks (probably due to non-Latin characters).
- per-hour filters do not affect results.
Gifted Lemur Jun 23, 2020 at 23:16
Agree that something really strange happened in the server mode. I think all the problems are caused by a charset/encoding problem. It looks like all non-Latin characters were replaced with question marks during POST query parsing because of an incorrect charset.

Could @admins help to debug this issue and rerun the submission?

What locale is used on the testing boxes? Is a charset specified in POST queries?
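
For illustration, a small sketch of how this kind of damage can arise, assuming the request body is UTF-8 but is pushed through a narrower charset with replacement on errors. The helper names are hypothetical, and nothing here is taken from the actual testing setup.

```python
def lossy_roundtrip(text: str, charset: str = "ascii") -> str:
    """Simulate a pipeline that cannot represent non-Latin characters."""
    # errors="replace" substitutes "?" for every character the charset lacks.
    return text.encode(charset, errors="replace").decode(charset)


def decode_post_body(raw: bytes, declared_charset: str = "") -> str:
    """Decode a request body, falling back to UTF-8 when no charset is declared."""
    return raw.decode(declared_charset or "utf-8")


title = "Новости дня"
print(lossy_roundtrip(title))                   # ??????? ???
print(decode_post_body(title.encode("utf-8")))  # Новости дня
```

Decoding the raw bytes as UTF-8 explicitly, instead of relying on the process locale, keeps the result independent of how the testing box is configured.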
Fair Leopard Jul 8, 2020 at 20:04
#issue11314 Can you give an example of these articles with the strange publication time?
Gifted Lemur Jul 8, 2020 at 22:55
Unfortunately, I didn't print out the ids of the bad articles when receiving them.

But I did on server restart. The server was only restarted on 2020-07-04, so the examples I found were added several days ago: "4125648317471239591.html" and "4125648317464283268.html".

Also, is it possible to download all the raw articles used for the "Today" section? I still don't understand why some of the articles are shown incorrectly (e.g. "4665023522514959947.html" or "1417404810608504063.html", see attached screenshot).