Data Clustering Contest 2021

The task in this contest is to create a C/C++ library that can determine the language and topic of a Telegram channel. The deadline for the first round is February 14 at 23:50 Dubai time. Everyone is welcome to participate.

General info about this contest is available on @contest. Feel free to check out the Russian version of this document.

The Task

  1. Determine channel language. The algorithm must use the channel's name, description and the text of several posts to determine its language and return the language's 2-letter ISO code (or “other” if the language doesn’t have a two-letter code).

  2. Determine channel topic. For channels in English and Russian, the algorithm must determine the relative weight for each of the topics identified in the channel. List of possible topics:

  • Art & Design
  • Bets & Gambling
  • Books
  • Business & Entrepreneurship
  • Cars & Other Vehicles
  • Celebrities & Lifestyle
  • Cryptocurrencies
  • Culture & Events
  • Curious Facts
  • Directories of Channels & Bots
  • Economy & Finance
  • Education
  • Erotic Content
  • Fashion & Beauty
  • Fitness
  • Food & Cooking
  • Foreign Languages
  • Health & Medicine
  • History
  • Hobbies & Activities
  • Home & Architecture
  • Humor & Memes
  • Investments
  • Job Listings
  • Kids & Parenting
  • Marketing & PR
  • Motivation & Self-Development
  • Movies
  • Music
  • Offers & Promotions
  • Pets
  • Politics & Incidents
  • Psychology & Relationships
  • Real Estate
  • Recreation & Entertainment
  • Religion & Spirituality
  • Science
  • Sports
  • Technology & Internet
  • Travel & Tourism
  • Video Games
  • Other
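To illustrate what "relative weight" can mean in task 2, here is a minimal C++ sketch that turns raw per-topic scores into weights. It rests on assumptions that the task text does not state: that weights should sum to 1 (as the 0.9 / 0.1 pair in the sample output suggests), and that "Other" occupies the last slot of the score vector. The function name normalize_weights is hypothetical and is not part of tgcat.h.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of tgcat.h): converts non-negative raw
// per-topic scores into relative weights.
// Assumption: weights should sum to 1, and the last slot stands for "Other".
std::vector<double> normalize_weights(const std::vector<double>& raw_scores) {
    double total = 0.0;
    for (double score : raw_scores) {
        assert(score >= 0.0);  // negative scores have no meaning as weights
        total += score;
    }

    std::vector<double> weights(raw_scores.size(), 0.0);
    if (raw_scores.empty()) {
        return weights;
    }
    if (total <= 0.0) {
        // No evidence for any topic: assign everything to the last slot.
        weights.back() = 1.0;
        return weights;
    }
    for (std::size_t i = 0; i < raw_scores.size(); ++i) {
        weights[i] = raw_scores[i] / total;
    }
    return weights;
}
```

With raw scores {9.0, 0.0, 1.0}, the sketch yields weights of roughly 0.9, 0 and 0.1. How a real classifier produces the raw scores is outside the scope of this example.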

Source Data

Contestants are welcome to use this test data set:

The data set is a text file, where each line contains information about a channel in JSON format:

{
  "title":        "Channel title",
  "description":  "Channel description",
  "recent_posts": [
    "text #1 of message or caption of media or content of poll etc.",
    "text #2 of message or caption of media or content of poll etc.",
    ...
  ]
}

We will publish additional data sets over the course of the contest.

A different data set will be used for evaluating submissions.

Development and Testing

You can download a sample library here: libtgcat.tar.gz. tgcat.h describes the interface you are required to implement in this contest. Use the following commands to build the library:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

You can test the resulting library file libtgcat.so on the test data using the test script libtgcat-tester.tar.gz. To do this, copy libtgcat.so into the directory containing the test script, then build with cmake in the standard way:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

To test the library output, launch the resulting binary file tgcat-tester with the following parameters:

tgcat-tester <mode> <input_file> <output_file>

where:

  • <mode> – language or category,
  • <input_file> – path to file containing input data,
  • <output_file> – path to file containing output data.

Output data is presented as a text file where each line represents processed channel data in JSON format:

mode=language

{
  "lang_code": "en"
}

mode=category

{
  "lang_code": "en",
  "category": {
    "Art & Design": 0.9,
    "Other": 0.1
  }
}

where:

  • lang_code – ISO 639-1 language code or “other”,
  • category – object whose keys are topic names (see the list above) and whose values are the relative weights of those topics.
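The tester writes these output lines itself, but it can help to reproduce the format when debugging a classifier outside the tester. The following C++ sketch builds one category-mode line. The function name format_category_line is hypothetical, the compact single-line layout is an assumption based on "each line represents processed channel data", and skipping zero-weight topics is also an assumption (the sample output lists only two topics). No escaping is performed, which is sufficient only because none of the topic names listed above contain quotes or backslashes.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical debugging helper (not part of tgcat.h or the tester):
// formats one category-mode result as a single-line JSON object.
// Assumption: topics with zero weight are simply left out.
std::string format_category_line(
    const std::string& lang_code,
    const std::vector<std::pair<std::string, double>>& weights) {
    assert(!lang_code.empty());  // e.g. "en", "ru" or "other"

    std::ostringstream out;
    out << "{\"lang_code\":\"" << lang_code << "\",\"category\":{";
    bool first = true;
    for (const auto& entry : weights) {
        if (entry.second <= 0.0) {
            continue;  // skip topics that were not detected
        }
        if (!first) {
            out << ",";
        }
        // No JSON escaping: safe only for the fixed topic names above.
        out << "\"" << entry.first << "\":" << entry.second;
        first = false;
    }
    out << "}}";
    return out.str();
}
```

Calling it with "en" and the pairs ("Art & Design", 0.9) and ("Other", 0.1) produces the same object as the mode=category sample, written on one line.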

Script launch example:

$ ./tgcat-tester language dc0130-input.txt dc0130-language-output.txt
Processed 50297 queries in 0.026331 seconds
$ ./tgcat-tester category dc0130-input.txt dc0130-category-output.txt
Processed 50297 queries in 0.033010 seconds

General Requirements

  • Your library must work locally (no network usage).
  • Relative speed is of critical importance.
  • External dependencies should be kept to a minimum. If you can't avoid external dependencies, please list them in a text file named deb-packages.txt. These dependencies will be installed using sudo apt-get install ... before your library is tested.
  • The library will be tested on servers running Debian GNU/Linux 10 (buster), x86-64 with 8 cores and 16 GB RAM. Before submitting, please make sure that your library works correctly on a clean system.
  • You must submit a ZIP file (the maximum size for a file sent to the bot is 2 GB) with the following structure:

    submission.zip
      -> src - folder with the library's source code (obligatory)
      -> libtgcat.so - library (obligatory)
      -> resources - folder with additional files which your library requires to work (please use relative paths to access them) (optional)
      -> deb-packages.txt - a text file listing the Debian package names of all external dependencies, one per line (optional)

Evaluation

When evaluating submissions, we will prioritize the speed and accuracy of the algorithms. Accuracy will have the highest priority.

Note that we will not evaluate libraries that take more than 60 seconds to process any batch of 1000 channels.
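A local timing check can give early warning about the 60-second limit. The C++ sketch below measures how long a batch takes and compares it with the limit. The names time_batch_seconds and within_batch_limit are hypothetical, and process_channel stands in for whatever call into your library handles one channel. Timings on your own machine are only indicative, since the evaluation servers may be faster or slower.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs process_channel(i) for i in [0, channel_count)
// and returns the elapsed wall-clock time in seconds.
template <typename ProcessFn>
double time_batch_seconds(std::size_t channel_count, ProcessFn process_channel) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < channel_count; ++i) {
        process_channel(i);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// True if a batch time stays within the limit; 60 seconds per batch of
// 1000 channels is the figure stated in the contest rules.
bool within_batch_limit(double batch_seconds, double limit_seconds = 60.0) {
    assert(limit_seconds > 0.0);
    return batch_seconds <= limit_seconds;
}
```

A typical use is to time a batch of 1000 channels from the test data set and fail a local check whenever within_batch_limit returns false.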