Data Clustering Contest: Round 2

The task in the second round is to create a module that could be used to power a news aggregator. The round will last for two weeks, and everyone is welcome to participate, including those who did not take part in the previous round.

General info about the contest is available on @contest. The Russian version of this document should be used as reference in case of discrepancies.

Source Data

A test data set is available in HTML format:

We will publish additional data sets over the course of the contest.

We will use a different data set for evaluating submissions. The evaluation data set may include articles from domains not present in the test data set.

The Task

  1. Create or improve clustering algorithms. Use our recommendations to improve algorithms from the previous round or build new ones. The algorithms must perform the following tasks:

    • Isolate articles in English and Russian
    • Isolate news articles
    • Group articles into categories
    • Group articles into threads
  2. Analyze and index articles. Your algorithm must analyze, store and index articles, as well as optimize the index for future requests.

  3. Return threads for a particular period, sorted by relative importance. Threads in English must be relevant to a global audience, threads in Russian must be relevant for readers from Russia.

For more details about the tasks, please see the Evaluation section below. For your convenience, we've also included recommendations on clustering and ranking.

Submitting Your Work

You should submit a standalone app with the name tgnews. The app should support two modes:

  • CLI Mode. To evaluate clustering quality, the app will be launched several times with different parameters, as described below. In CLI mode, the app should output the results in JSON format to STDOUT and terminate. The app must not cache or reuse any data across launches. Launches with different parameters may occur in an arbitrary order and this should not affect the results.
  • HTTP Server Mode. The app should work as an HTTP/1.1 server with keep-alive support. For indexing and returning sorted lists of news, the server will receive HTTP queries as described below. The app must store indexed data in the app's working directory and should be prepared for the HTTP server to be restarted at any time between queries.

General Requirements

  • In CLI mode, your app must work locally (no network usage).
  • In HTTP server mode, the app must not create outgoing connections. It may only process incoming connections, and only on the port specified by the launch parameters.
  • The tgnews process must remain active while the HTTP server is working. When tgnews is terminated, the HTTP server and all its child processes should also be terminated. If you're using an external script to launch the web server, we recommend using exec.
  • App speed is of utmost importance (this may give an edge to apps written in C++).
  • External dependencies should be kept to a minimum. If you can't avoid external dependencies, please list them in a text file named deb-packages.txt. These dependencies will be installed using sudo apt-get install ... before your app is tested.
  • Applications will be tested under Debian GNU/Linux 10.1 (buster), x86-64 with 8 cores and 16 GB RAM. Before submitting, please make sure that your app works correctly on a clean install.
  • Apps that require more than 60 seconds to process each 1000 files passed in source_dir in CLI Mode will not be evaluated. This restriction applies to each individual script launch, regardless of the parameters.
  • In HTTP Server Mode, apps must be able to process up to 1000 requests per 60 seconds. Initialization time must not exceed 60 seconds per 1000 files stored in the index.
  • You must submit a ZIP file up to 1500 MB with the following structure:

    submission.zip
      -> tgnews - executable file with an interface as described below
      -> src - folder with the app's source code
      -> deb-packages.txt - a text file with newline separated debian package names of all external dependencies
      -> * - any additional resources your app requires to work (please use relative paths to access them)

Before submitting your archive, we strongly suggest using this testing script to confirm that your app is using the correct input and output formats.


Evaluation

We will evaluate submissions in two stages.

Stage 1. Clustering

To check clustering quality, the algorithm will be launched in CLI mode with the following parameters:

1.1. Isolating articles in English and Russian
tgnews languages <source_dir>

where:

  • <source_dir> – path to the folder with HTML files containing article texts. The folder may include subfolders which also need to be processed.

The result must be sent to STDOUT in JSON format, using the following template:

[
  {
    "lang_code": "en",
    "articles": [
      "981787246124324.html",
      "239748235923753.html",
      ...
    ]
  },
  {
    "lang_code": "ru",
    "articles": [
      "273612748127432.html",
      ...
    ]
  },
  ...
]

where:

  • lang_code – two-letter ISO 639-1 language code (en and ru are required, other languages are optional)
  • articles – list of file names (without relative or absolute paths, just the file names) containing texts in lang_code language
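
As an illustration of the expected output shape, here is a minimal Python sketch of the languages mode. It assumes a toy detector that only separates mostly-Cyrillic text from everything else, so it cannot tell Russian from other Cyrillic-script languages and is not a recommended approach. The names detect_lang and group_by_language are hypothetical.

```python
import json
import os
import re
import sys
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper; script/style bodies are not removed


def detect_lang(html):
    """Toy detector (an assumption): majority-Cyrillic text -> "ru", else "en"."""
    letters = [ch for ch in TAG_RE.sub(" ", html) if ch.isalpha()]
    if not letters:
        return None
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04ff")
    return "ru" if cyrillic / len(letters) > 0.5 else "en"


def group_by_language(source_dir):
    """Walk source_dir recursively and group bare file names by language."""
    groups = defaultdict(list)
    for root, _dirs, files in os.walk(source_dir):
        for name in sorted(files):
            if not name.endswith(".html"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                lang = detect_lang(f.read())
            if lang:
                groups[lang].append(name)  # file name only, no path
    return [
        {"lang_code": code, "articles": names}
        for code, names in sorted(groups.items())
    ]


if __name__ == "__main__" and len(sys.argv) == 3 and sys.argv[1] == "languages":
    # Print the result to STDOUT and terminate; nothing is cached between runs.
    json.dump(group_by_language(sys.argv[2]), sys.stdout, ensure_ascii=False)
```

The same directory walk and JSON printing could be reused for the news, categories and threads modes, with only the grouping logic changing.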

1.2. Isolating news articles
tgnews news <source_dir>

where:

  • <source_dir> – path to the folder with HTML files containing article texts. The folder may include subfolders which also need to be processed. Article text may be in English and/or Russian.

The result must be sent to STDOUT in JSON format, using the following template:

{
  "articles": [
    "981787246124324.html",
    ...
  ]
}

where:

  • articles – list of file names (without relative or absolute paths, just the file names) containing news articles

See also: Recommendations for Isolating News

1.3. Grouping by category
tgnews categories <source_dir>

where:

  • <source_dir> – path to the folder with HTML files containing article texts. The folder may include subfolders which also need to be processed. Article text may be in English and/or Russian.

The result must be sent to STDOUT in JSON format, using the following template:

[
  {
    "category": "society",
    "articles": [
      "981787246124324.html",
      ...
    ]
  },
  {
    "category": "sports",
    "articles": [
      "2348972396239813.html",
      ...
    ]
  },
  ...
]

where:

  • category – "society", "economy", "technology", "sports", "entertainment", "science" or "other"
  • articles – list of file names (without relative or absolute paths, just the file names) containing articles corresponding to category

Note that in this stage of the contest you are allowed to tag articles with multiple categories (up to three) but we only recommend doing this in rare borderline cases. The goal of this grouping is not to list all the possible topics the article might belong to, but rather identify one category for which it is most relevant.
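
The constraints on this output can be checked mechanically before submitting. The Python sketch below is one possible self-check, not part of the official testing script. It verifies the category names, the limit of three categories per article, and the use of bare file names. It also flags 'other' combined with any other category, which is one reading of the recommendation that 'Other' should never be used as a second category. The name check_categories is hypothetical.

```python
ALLOWED = {
    "society", "economy", "technology", "sports",
    "entertainment", "science", "other",
}
MAX_PER_ARTICLE = 3  # an article may be tagged with up to three categories


def check_categories(result):
    """Return a list of human-readable problems found in the parsed JSON output."""
    problems = []
    seen = {}  # file name -> categories it was listed under, in output order
    for group in result:
        category = group["category"]
        if category not in ALLOWED:
            problems.append(f"unknown category: {category}")
        for name in group["articles"]:
            if "/" in name or "\\" in name:
                problems.append(f"path instead of file name: {name}")
            seen.setdefault(name, []).append(category)
    for name, categories in seen.items():
        if len(categories) > MAX_PER_ARTICLE:
            problems.append(f"{name}: more than {MAX_PER_ARTICLE} categories")
        if "other" in categories and len(categories) > 1:
            problems.append(f"{name}: 'other' combined with another category")
    return problems
```

An empty list means the output passed these particular checks; it says nothing about classification quality.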

See also: Recommendations on Grouping by Category

1.4. Grouping similar news into threads
tgnews threads <source_dir>

where:

  • <source_dir> – path to the folder with HTML files containing article texts. The folder may include subfolders which also need to be processed. Article text may be in English and/or Russian.

The result must be sent to STDOUT in JSON format, using the following template:

[
  {
    "title": "Telegram announces Data Clustering Contest",
    "articles": [
      "6354183719539252.html",
      ...
    ]
  },
  {
    "title": "Apple reveals new AirPods Pro",
    "articles": [
      "9436743547232134.html",
      ...
    ]
  },
  ...
]

where:

  • title – thread title, relevant for all articles in the thread
  • articles – list of file names (without relative or absolute paths, just the file names) containing articles in the thread, sorted by their relevance (most relevant at the top)

Please note that the app must not cache or reuse any data when performing clustering tasks. Clustering commands may be run in an arbitrary order and this should not affect the results.

See also: Recommendations on Grouping into Threads

Stage 2. Indexing and Ranking

To evaluate indexing and ranking, the app will be run with the following parameters:

tgnews server <port>

where:

  • <port> – port number.

The app must run an HTTP server on the port <port> and prepare to receive HTTP queries (e.g. by loading the index from disk). The server must respond with error 503 until it's ready. Sample response:

HTTP/1.1 503 Service Unavailable

HTTP queries that will be sent to the app during evaluation are described below. It is guaranteed that the app will receive no more than 100 parallel requests.
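
The "503 until ready" behaviour can be sketched with Python's standard http.server module. This is an illustration under assumptions, not a performance-oriented implementation: binding to 127.0.0.1 is a guess, the 200 reply is an empty placeholder, only GET is handled, and index_ready, start_server and get_status are hypothetical names.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

index_ready = threading.Event()  # set once the on-disk index has been loaded


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets http.server keep connections alive

    def _reply(self, status):
        self.send_response(status)
        self.send_header("Content-Length", "0")  # needed so keep-alive clients know the body ended
        self.end_headers()

    def do_GET(self):
        # Real request handling would go here; this sketch only gates on readiness.
        self._reply(200 if index_ready.is_set() else 503)

    def log_message(self, fmt, *args):
        pass  # keep STDERR quiet


def start_server(port):
    """Start serving in a background thread and return the server object."""
    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def get_status(port, path="/threads?period=300&lang_code=en&category=any"):
    """Small client helper used to probe the server."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        conn.close()
```

In this sketch the server starts accepting connections immediately, and whatever code loads the index calls index_ready.set() when it finishes.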

2.1. Indexing

HTTP request:

PUT /article.html HTTP/1.1
Content-Type: text/html
Cache-Control: max-age=<seconds>
Content-Length: 9

<content>

where:

  • article.html – name of the HTML file,
  • <seconds> – article TTL in seconds (from 5 minutes to 30 days),
  • <content> - contents of the HTML file.

The app should index the article file from the body of the request. It is guaranteed that all articles will contain the meta tag <meta property="article:published_time"> with their publishing date. The article TTL for indexing purposes, in seconds, will be passed in the header Cache-Control: max-age. If more than TTL seconds have passed between the publishing date of the current article and the publishing date of the latest article in the index, the current article should be removed from the index.

The HTTP server should respond with the code 201 if the article hasn't been indexed before or with the code 204 if the article was updated in the index. Sample response:

HTTP/1.1 201 Created
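
The indexing rules above can be sketched in Python as follows. The sketch rests on several assumptions: the publishing date is taken to be ISO 8601, the index lives in memory only (a real submission must persist it in the working directory), and every stored article is expired against its own TTL whenever a new one arrives, which is one reading of the TTL rule. The names PublishedTimeParser, published_ts and ArticleIndex are hypothetical.

```python
import re
from datetime import datetime
from html.parser import HTMLParser


class PublishedTimeParser(HTMLParser):
    """Picks up <meta property="article:published_time" content="...">."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "article:published_time":
            self.value = attrs.get("content")


def published_ts(html):
    """Return the publishing date as a UNIX timestamp."""
    parser = PublishedTimeParser()
    parser.feed(html)
    # Assumption: the date is ISO 8601; "Z" is normalised for older Pythons.
    return datetime.fromisoformat(parser.value.replace("Z", "+00:00")).timestamp()


class ArticleIndex:
    def __init__(self):
        self.articles = {}  # file name -> (published timestamp, TTL in seconds)

    def put(self, name, html, cache_control):
        """Index or update an article; returns the HTTP status to send."""
        ttl = int(re.search(r"max-age=(\d+)", cache_control).group(1))
        status = 204 if name in self.articles else 201
        self.articles[name] = (published_ts(html), ttl)
        self.expire()
        return status

    def expire(self):
        """Drop articles older than their TTL relative to the newest article."""
        if not self.articles:
            return
        newest = max(ts for ts, _ in self.articles.values())
        self.articles = {
            name: (ts, ttl)
            for name, (ts, ttl) in self.articles.items()
            if newest - ts <= ttl
        }
```

Note that time is measured against the newest publishing date in the index, not against the wall clock.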

Please note that the text of the article in the query can be in any language and the article can be either a news or a non-news article. Only news articles in English and Russian languages are relevant for ranking.

2.2. Removing from the index

HTTP request:

DELETE /article.html HTTP/1.1

where:

  • article.html – name of the HTML file.

The app must remove the article file from the index.

The HTTP server should respond with the code 204 if the article has been removed successfully or with the code 404 if the article is not present in the index. Sample response:

HTTP/1.1 204 No Content

2.3. Thread ranking

HTTP query:

GET /threads?period=<period>&lang_code=<lang_code>&category=<category> HTTP/1.1

where:

  • <period> – time range in seconds (from 5 minutes to 30 days),
  • <lang_code> – article language, en or ru,
  • <category> – category (society, economy, technology, sports, entertainment, science, other) or any.

The app should return a list of threads from the index that are related to the category specified in <category>, relevant for the language <lang_code> and for the period of <period> seconds prior to the publication date of the newest article in the index. Threads in the list should be ranked by relative importance.

See also: Recommendations for determining importance and relevance

If category=any, the list of threads should be created for all relevant news during the specified period, regardless of their category.

In all cases, the list should contain news articles only, and only in the specified language. If the resulting list contains more than 1000 threads, the algorithm may return the top 1000 threads (threads, not articles). It is guaranteed that the request will contain all the specified parameters and they will contain valid values.

The HTTP server should respond with the code 200. Data should be returned in JSON format in the body of the response. Sample response:

HTTP/1.1 200 OK
Content-type: application/json
Content-length: 373

{
  "threads": [
    {
      "title": "Telegram announces Data Clustering Contest",
      "category": "technology",
      "articles": [
        "6354183719539252.html",
        ...
      ]
    },
    {
      "title": "Apple reveals new AirPods Pro",
      "category": "technology",
      "articles": [
        "9436743547232134.html",
        ...
      ]
    },
    ...
  ]
}

where:

  • threads – list of threads ranked by importance (most important at the top). Each thread contains a title and a list of articles. If category=any was requested, each thread also contains the field category.
  • title – common thread title, relevant for all articles in the thread.
  • category – "society", "economy", "technology", "sports", "entertainment", "science" or "other". Should only be returned if category=any.
  • articles – list of file names (names only, no absolute or relative paths) corresponding to the thread, sorted by relevance (most relevant at the top).
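
A possible way to assemble this response body is sketched below in Python. The in-memory thread structure, the importance field and the function name threads_response are assumptions made for illustration. The time filter is the simplest one mentioned in the ranking recommendations: a thread qualifies if at least one of its articles was published within the period.

```python
import json

MAX_THREADS = 1000  # longer lists may be truncated to the top 1000 threads


def threads_response(threads, period, lang_code, category):
    """Build the JSON body for GET /threads.

    `threads` is a hypothetical structure: a list of dicts with keys
    "title", "category", "lang_code", "importance" and "articles", where
    "articles" is a list of (file name, published timestamp) pairs that is
    already sorted by relevance.
    """
    all_ts = [ts for t in threads for _, ts in t["articles"]]
    if not all_ts:
        return json.dumps({"threads": []})
    window_start = max(all_ts) - period  # counted back from the newest article

    selected = [
        t for t in threads
        if t["lang_code"] == lang_code
        and category in ("any", t["category"])
        and any(ts >= window_start for _, ts in t["articles"])
    ]
    selected.sort(key=lambda t: t["importance"], reverse=True)

    result = []
    for t in selected[:MAX_THREADS]:
        item = {"title": t["title"]}
        if category == "any":
            item["category"] = t["category"]  # only returned for category=any
        item["articles"] = [name for name, _ in t["articles"]]
        result.append(item)
    return json.dumps({"threads": result}, ensure_ascii=False)
```

The Content-Length header would then be the byte length of the UTF-8 encoded string, which can differ from its character count for Russian titles.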

Testing Script

To ensure that your app is using the correct input and output formats, you can use this PHP script (last edited: 25.05.2020 15:15 UTC), which attempts to launch the app in each of the required modes:

php dc-check.php <binary> all <port> <source_dir>

where:

  • <binary> – path to the executable binary file tgnews.
  • <port> – number of any unused port.
  • <source_dir> – path to a directory with HTML files containing article texts, which will be used as testing targets.

You can also use the following commands to test a particular evaluation stage:

php dc-check.php <binary> languages <source_dir>
php dc-check.php <binary> news <source_dir>
php dc-check.php <binary> categories <source_dir>
php dc-check.php <binary> threads <source_dir>
php dc-check.php <binary> server <port> <source_dir>

Note that the script only checks that the app launches correctly and uses the correct output format. It does NOT evaluate the quality of the app. This script was tested on servers running Debian GNU/Linux 10.1 (buster), x86-64. If you're using an external script to launch your app, we recommend using exec.

Clustering Recommendations

When preparing evaluation datasets, our moderators will use criteria similar to what is listed below:

Isolating News

News articles describe changes and events in a broad sense that are either ongoing or happened in the recent past (relative to the moment of publication). After reading a news article, one can usually answer the question “what happened?” In most cases, the article's title will be an answer to that question (but there may be exceptions, so one shouldn't rely on the title alone).

Scale doesn't matter. A drunk raccoon spotted in Germany, or a power outage in a small village – both are news.

Non-News Examples:

  • Timeless opinion articles (“Why democracy isn't suitable for the Middle East”).
  • Historical articles (“How the Portuguese first went around Africa”).
  • Encyclopedia-style articles, reference, tips, how-to's (“Kenyan wildlife”, “How to choose a router”).
  • Lists without anything happening (“Six reasons to stay home for Christmas”, “7 wonders of the modern world”, “Best laptops of 2019”). But if something happened, it may be news. E.g., “Here's a list of Golden Globes nominees.”

Grouping by Category

Note that in this stage of the contest you are allowed to tag articles with multiple categories (up to three) but we only recommend doing this in rare borderline cases. The goal of this grouping is not to list all the possible topics the article might belong to, but rather identify one category for which it is most relevant.

Articles that belong in three categories are very rare. We especially don't recommend adding 'Society' as the third category. For example, “Tesla stock plummets after factory fire” could be tagged Economy and Technology but not Society.

  • Society: Politics, elections, legislation, incidents, crime and so on. The biggest category. Includes laws passed and planned, debates in government and other political occurrences, crimes, disasters of any scope, etc.

  • Economy: Markets, finance, business, companies, shares, crises. “Unemployment up 15%” is Economy.

  • Technology: Gadgets, auto, apps, internet services and so on. Games are Entertainment.

  • Sports: Sports of any kind, including E-Sports.

  • Entertainment: Movies, music, games, books, art, restaurants, celebrities, etc. Also animals (unless they were just made in a lab => 'Science', trampled 100 people in a stampede => 'Society' and so on).

  • Science: Health, biology, physics, genetics.

  • Other: Everything else. Among other things, weather forecasts (unless they are catastrophic enough to make it into 'Society'), as well as esotericism, horoscopes, non-tech ads. 'Other' should never be used as a second category.

Below is an example of how various coronavirus-related news could be sorted by category:

  • Science: News about symptoms, tips on staying healthy, search for vaccines, etc.
  • Society: Quarantines and lockdowns in various countries, politicians tested positive, an address from the Queen, etc.
  • Economy: Small business issues, compensations for entrepreneurs, etc.
  • Entertainment: Celebrities tested positive (not politicians), new Banksy graffiti about the pandemic, Jennifer Lopez cancels wedding due to coronavirus, etc.
  • Sports: Tokyo Olympics moved due to the pandemic, etc.
  • Technology: Apple and Google working on an app that warns you about contacting infected people, WHO launched a bot, etc.

Grouping into Threads

Threads are sets of articles about the same event or topic, revealing very similar information. Generally, it should be enough for the reader to read one article from a thread to learn everything (or nearly everything) from that thread. But to learn everything about the topic – they’ll need to read several threads.

For example, various articles about the Coronavirus have the same general topic, but there should be multiple different threads about it:

  1. The World Health Organization declared a global emergency over the new coronavirus.
  2. Nations should avoid overreacting on coronavirus, says China as WHO declares global emergency.
  3. The UK has confirmed four cases of the virus.
  4. Coronavirus: number of confirmed UK cases rises to eight.
  5. Japan may call off Tokyo Olympics over coronavirus fears.
  6. Tokyo Olympic organisers emphasise the Games will go ahead despite coronavirus.
  7. 40,000 workers on virus lockdown at China-backed plant in Indonesia.
  8. …and so on.

Articles that cover multiple threads (“Top news this weekend”) should not be joined into one thread with any of the threads they cover.

Thread titles

The thread title must reflect the topic of all articles that go into it. A good thread title is short and neutral and mentions the most important people and facts in the thread.

Good thread title: “Elon Musk wants to name his newborn son X Æ A-12”

Poor thread title: "Tesla Billionaire authors the craziest celebrity baby name to date [photo]"

Ranking Recommendations

You are welcome to choose your own approach for determining the relative importance of threads and their relevance for the specified language and time period.

Relevance for Time Period

The algorithm should return not the articles that were merely published in the specified period, but articles that are relevant for this period. For example, a simple algorithm could at least include all threads in which at least one article was published in the specified period. A more substantial algorithm could also assign weights to news articles and threads based on the total number of mentions, the weight of the respective media publications, etc. – and make them lose relevance over time at weight-dependent speeds.
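
One way to make threads "lose relevance over time at weight-dependent speeds" is exponential decay with a half-life that grows with the thread's weight. The Python sketch below is only an illustration of that idea. The base half-life of six hours and the logarithmic scaling are arbitrary assumptions, and half_life and relevance are hypothetical names.

```python
import math

BASE_HALF_LIFE = 6 * 3600  # seconds; an arbitrary starting point, to be tuned on real data


def half_life(weight):
    """Heavier threads get a longer half-life, so they fade more slowly.

    `weight` is assumed to be non-negative (e.g. a mention count).
    """
    return BASE_HALF_LIFE * (1.0 + math.log1p(weight))


def relevance(weight, age_seconds):
    """Exponential decay: the score halves every half_life(weight) seconds."""
    return weight * 0.5 ** (age_seconds / half_life(weight))
```

Here age_seconds would be measured back from the publication date of the newest article in the index, in line with how <period> is defined for thread requests.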

Your algorithm may NOT return news not relevant for the time period.

Relevance for the English Section

Threads in English must be relevant for a wide audience of international readers. Local news that is only notable for citizens of a particular country and its immediate neighbors should be excluded. Global news is usually covered by many media organizations from many different regions (instead of, say, one or two regions).

Relevant:
“China Reports First New COVID-19 Case In Wuhan”, “Hundreds gather in Hong Kong malls as anti-gov't rallies reemerge”, “World’s biggest lockdown: 1.3 billion Indians ordered to stay home”, “Xbox exec says it ‘set some wrong expectations’ for Xbox Series X game reveals”.

Not Relevant:
“Amash's candidacy injects uncertainty for Trump in key swing state of Michigan”, “Lagos governor pledges 'maximum' health funding”, “Yogi Adityanath asks high-level teams to camp in Agra, Meerut, Kanpur”, “'Made in Qatar 2020' expo to open in Kuwait”.

Your algorithm may return news not relevant for the section, but any such results must be penalized in the algorithm’s ranking.
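
Since global news tends to be covered by outlets from many regions, one possible penalty is to scale down the importance of threads whose sources come from too few distinct regions. The Python sketch below assumes that a region label is already known for each covering outlet (how to obtain it is left open). The threshold of three regions, the 0.25 multiplier, and the names locality_factor and rank_threads are all hypothetical.

```python
def locality_factor(regions, min_regions=3, penalty=0.25):
    """Return 1.0 for widely covered threads and `penalty` for local ones."""
    return 1.0 if len(set(regions)) >= min_regions else penalty


def rank_threads(threads):
    """Rank thread titles by penalised importance, highest first.

    `threads` is a list of (title, importance, regions) tuples, where
    `regions` lists the region of every outlet that covered the thread.
    """
    scored = [
        (importance * locality_factor(regions), title)
        for title, importance, regions in threads
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _score, title in scored]
```

With these values, a thread covered only by outlets from one country needs four times the raw importance of a widely covered thread to outrank it.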

Relevance for the Russian Section

For this round of the contest, threads in Russian must be relevant to readers from Russia – covering news from Russia and neighboring countries, as well as world news (see above).

Relevant:
“Отрицатели коронавируса сожгли вышку сотовой связи”, “Little Big представит Россию на Евровидении”, “Извержение вулкана в Новой Зеландии”, “Власти Уханя рассказали о новой вспышке коронавируса”, “Скандальный разговор Трампа и Зеленского”, “Трамп наложил вето на запрет применения силы против Ирана”, “Умер Эдуард Лимонов”.

Not Relevant:
“Белорусская авиакомпания забрала граждан из Судана и Анголы”, “Украинская киноакадемия назвала победителей кинопремии «Золотая Дзига»”, “Масштабный пожар в Одесской области: горел торговый центр, погибших нет”, “Жээнбеков поручил Нацбанку ускорить переход финансовой системы на новые технологии”, “С начала года в Минское агентство по госрегистрации обратились более 100 тыс. посетителей”.

Your algorithm may return news not relevant for the section, but any such results must be penalized in the algorithm’s ranking.