Data Clustering Contest 2021. Round 2

The task in the second round is to improve the C/C++ library you created in the previous round that can determine the topic of a Telegram channel.

The deadline for the second round is May 2, 2021 at 23:50 Dubai time. Everyone who participated in the first round is welcome to take part.

General info about this contest is available on @contest. Feel free to check out the Russian version of this document.

The Task

Improve topic detection. In this round, you will need to improve topic detection for channels in English and Russian and determine the topic of channels in three additional languages. Required languages:

  • English
  • Russian
  • Arabic
  • Persian
  • Uzbek

The algorithm must determine the relative weight for each of the topics identified in the channel.
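One way to turn raw per-topic scores into relative weights is to divide each score by the sum of all scores. The sketch below assumes that weights are expected to sum to 1, as they do in the sample output later in this document; the task text itself only asks for relative weights. The function name normalize_weights is hypothetical and is not part of tgcat.h.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scales non-negative raw scores so that they sum to 1.
// Assumption: weights should sum to 1, as in the sample output;
// the task description only requires "relative" weights.
std::vector<double> normalize_weights(const std::vector<double>& scores) {
    double total = 0.0;
    for (double s : scores) {
        assert(s >= 0.0 && "scores are expected to be non-negative");
        total += s;
    }
    std::vector<double> weights(scores.size(), 0.0);
    if (total <= 0.0) {
        return weights;  // no topic scored: leave every weight at zero
    }
    for (std::size_t i = 0; i < scores.size(); ++i) {
        weights[i] = scores[i] / total;
    }
    return weights;
}
```

For example, raw scores of 9 and 1 for two topics become weights of 0.9 and 0.1.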

New categories and types of content

The data sets in this round are likely to contain more samples from banned channels. The list of possible topics has been expanded with the following categories: Drug Sale, Forgery, Hacked Accounts & Software, Personal Data, Pirated Content, Prostitution, Spam & Fake Followers, and Weapon Sale.

List of Categories

Below is the full list of possible topics for this round. The list also includes some of the sorting recommendations used by human moderators to sort the evaluation datasets.

  • Art & Design
  • Bets & Gambling – includes sports bets
  • Books
  • Business & Entrepreneurship
  • Cars & Other Vehicles
  • Celebrities & Lifestyle
  • Cryptocurrencies
  • Culture & Events
  • Curious Facts
  • Directories of Channels & Bots
  • Drug Sale
  • Economy & Finance
  • Education
  • Erotic Content
  • Fashion & Beauty
  • Fitness
  • Forgery – includes fake documents, fake money, etc.
  • Food & Cooking
  • Foreign Language Learning
  • Hacked Accounts & Software – includes carding, passwords for subscription services, etc.
  • Health & Medicine
  • History
  • Hobbies & Activities
  • Home & Architecture
  • Humor & Memes
  • Investments
  • Job Listings
  • Kids & Parenting
  • Marketing & PR
  • Motivation & Self-development – includes inspirational quotes and poetry
  • Movies
  • Music
  • Offers & Promotions – includes products or services for sale, unless they fall under the newly added categories
  • Personal Data – includes doxxing, databases
  • Pets
  • Pirated Content – films, music, books, but not software
  • Politics & Incidents
  • Prostitution
  • Psychology & Relationships
  • Real Estate
  • Recreation & Entertainment
  • Religion & Spirituality
  • Science
  • Spam & Fake Followers – includes spam tools and services, boosting followers, likes, etc.
  • Sports – includes e-sports
  • Technology & Internet
  • Travel & Tourism
  • Video Games
  • Weapon Sale
  • Other

Sorting Recommendations

Below are some of the recommendations used by human moderators to sort evaluation datasets.

1. The most specific category should carry the most weight.

For example, if a channel is about investing in cryptocurrencies, “Cryptocurrencies” should carry a higher weight than “Investments”. If a channel is about the evolution of cars in the 20th century, its primary category is “Cars & Other Vehicles”, not “History”.

2. Channels run by a person or dedicated to a person may be categorized according to what that person is known for.

For example, a politician => “Politics & Incidents”, a movie star => “Celebrities & Lifestyle”, an athlete => “Sports”. Naturally, if it's a channel with investment tips from a football star, Recommendation 1 applies and it should go to “Investments”.

Source Data

Contestants are welcome to use this test data set:

The archive includes a text file with channel data in all languages supported in this round, along with separate files with channel data for each specific language. The language-specific files are included for your convenience only – evaluation data will not be pre-sorted by language and can potentially contain channels in any language.

Machine Translations

For each file, we've also included a version with machine translations of all texts into English. Note that your algorithm must work with texts in the original language. The English version may only be used for ease of understanding the original text.

New data

Additional data has been added in this round to help improve the quality of topic detection, such as the number of subscribers and the total number of text posts and posts containing media in a channel. We're also including metadata for media and links.

Data Format

The data set is a text file, where each line contains information about a channel in JSON format. All fields are optional.

{
  "title": "Channel title",
  "description": "Channel description",
  "subscribers": 123400,
  "counters": {
    "posts": 100,
    "photos": 20,
    "videos": 10,
    "audios": 3,
    "files": 0
  },
  "recent_posts": [
    {
      "type": "text",
      "text": "text of message",
      "link_preview": {
        "url": "https://example.com/",
        "title": "Title of link preview",
        "description": "Description of link preview"
      }
    },
    {
      "type": "photo",
      "text": "Photo caption"
    },
    {
      "type": "video",
      "text": "Video caption",
      "duration": 65,
      "file_name": "video.mp4",
      "file_size": 23982347
    },
    {
      "type": "audio",
      "text": "Audio caption",
      "performer": "Performer",
      "title": "Title",
      "duration": 183,
      "file_size": 1236123
    },
    {
      "type": "file",
      "text": "File caption",
      "file_name": "test.pdf",
      "file_size": 1236
    },
    ...
  ]
}

where:

  • title – title of the channel,
  • description – description of the channel,
  • subscribers – rounded number of channel subscribers,
  • counters – channel counters, where:

    • posts – total number of posts in the channel,
    • photos – number of photos in the channel,
    • videos – number of videos in the channel,
    • audios – number of audio files in the channel,
    • files – number of files in the channel,
  • recent_posts – several posts from the channel, where:

    • type – type of the post: text, photo, video, audio or file,
    • text – text of the post or caption of the media,
    • link_preview – data from link preview (if exists),
    • duration – duration of the video or audio track in seconds,
    • file_name – name of the file,
    • file_size – size of the file in bytes,
    • performer – performer of the audio track,
    • title – title of the audio track.
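The fields listed above could be mirrored in memory roughly as follows. This is only a sketch: the struct names, the helper append_piece and the function collect_text are hypothetical and do not come from tgcat.h, and parsing the JSON line into these structs is left out. Because every field is optional, each member defaults to an empty value.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory mirror of one input line (not the tgcat.h types).
struct LinkPreview {
    std::string url, title, description;
};

struct Post {
    std::string type;            // "text", "photo", "video", "audio" or "file"
    std::string text;            // post text or media caption
    LinkPreview link_preview;    // left empty when the post has no preview
    std::string file_name;       // name of the attached file, if any
    std::string performer;       // audio posts
    std::string title;           // audio posts
    std::int64_t duration = 0;   // seconds, video and audio posts
    std::int64_t file_size = 0;  // bytes
};

struct Counters {
    std::int64_t posts = 0, photos = 0, videos = 0, audios = 0, files = 0;
};

struct Channel {
    std::string title, description;
    std::int64_t subscribers = 0;
    Counters counters;
    std::vector<Post> recent_posts;
};

// Appends `piece` to `out` on its own line, skipping empty pieces.
void append_piece(std::string& out, const std::string& piece) {
    if (piece.empty()) return;
    if (!out.empty()) out += '\n';
    out += piece;
}

// Gathers the human-readable strings of a channel into one text blob
// that a text classifier could consume.
std::string collect_text(const Channel& ch) {
    std::string out;
    append_piece(out, ch.title);
    append_piece(out, ch.description);
    for (const Post& p : ch.recent_posts) {
        append_piece(out, p.text);
        append_piece(out, p.link_preview.title);
        append_piece(out, p.link_preview.description);
        append_piece(out, p.performer);
        append_piece(out, p.title);
    }
    return out;
}
```

Keeping the numeric fields (subscribers, counters, durations, file sizes) separate from the text makes it possible to feed them to a classifier as additional features, should they prove useful.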

We will publish additional data sets over the course of the contest.

A different data set will be used for evaluating submissions.

Development and Testing

You can download a sample library here: libtgcat-r2.tar.gz. tgcat.h describes the interface you are required to implement in this contest. Use the following commands to build the library:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

You can test the resulting library file libtgcat.so on the test data using the test script libtgcat-tester-r2.tar.gz. To do this, copy libtgcat.so into the directory containing the test script, then build with cmake in the standard way:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

To test the library output, launch the resulting binary file tgcat-tester with the following parameters:

tgcat-tester <mode> <input_file> <output_file>

where:

  • <mode> – category,
  • <input_file> – path to file containing input data,
  • <output_file> – path to file containing output data.

Output data is presented as a text file where each line represents processed channel data in JSON format:

{
  "lang_code": "en",
  "category": {
    "Art & Design": 0.9,
    "Other": 0.1
  }
}

where:

  • lang_code – ISO 639-1 language code or “other”; your library should return the language it detected, even if it's not one of the target languages in this round.
  • category – object in which each key is one of the topics (see above) and each value is the relative weight of that topic. Categories should be returned for each item, even if the language detected by your library is not one of the target languages in this round.
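The tester writes the output file itself, so a submission does not need to produce these lines. Purely to illustrate the shape of one output line, here is a sketch that formats a language code and a list of topic weights. The function names json_escape and format_result are hypothetical, and omitting zero-weight topics is an assumption based on the sample above, which lists only two topics.

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Escapes the characters that must be escaped inside a JSON string.
std::string json_escape(const std::string& s) {
    std::string out;
    for (unsigned char ch : s) {
        switch (ch) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (ch < 0x20) {  // remaining control characters
                    char buf[8];
                    std::snprintf(buf, sizeof buf, "\\u%04x", ch);
                    out += buf;
                } else {
                    out += static_cast<char>(ch);
                }
        }
    }
    return out;
}

// Builds one compact output line from a language code and topic weights.
// Assumption: topics with a weight of zero or less are left out.
std::string format_result(
        const std::string& lang_code,
        const std::vector<std::pair<std::string, double>>& weights) {
    std::string line =
        "{\"lang_code\":\"" + json_escape(lang_code) + "\",\"category\":{";
    bool first = true;
    for (const auto& kv : weights) {
        if (kv.second <= 0.0) continue;
        if (!first) line += ',';
        first = false;
        char buf[32];
        std::snprintf(buf, sizeof buf, "%.6g", kv.second);
        line += '"' + json_escape(kv.first) + "\":" + buf;
    }
    line += "}}";
    return line;
}
```

Called with "en" and the weights 0.9 for "Art & Design" and 0.1 for "Other", this yields the sample object above on a single line.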

Please note that the input data can contain channels in different languages. The library must determine the language of each channel by itself before determining the relative weight for each of the topics.

Note: Your output data should include the detected language, even if it's not one of the 5 target languages. In this round, we will not evaluate the quality of language detection.
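As a rough first step towards language detection, a library could count which writing system dominates a channel's text. The sketch below is a simplified heuristic and not a complete detector: it maps Latin text to "en", Cyrillic text to "ru", and Arabic-script text to "fa" when it contains the letters پ, چ, ژ or گ (which are not used in standard Arabic) and to "ar" otherwise. It cannot recognize Uzbek, which is written in both Latin and Cyrillic, and it would mislabel any other language that shares these scripts. The function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Decodes one UTF-8 code point starting at s[i] and advances i past it.
// Malformed or truncated input yields U+FFFD and advances by one byte.
// Simplification: overlong encodings are not rejected.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;
    char32_t cp = 0;
    if (b0 < 0x80)               { cp = b0;        extra = 0; }
    else if ((b0 >> 5) == 0x06)  { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 >> 4) == 0x0E)  { cp = b0 & 0x0F; extra = 2; }
    else if ((b0 >> 3) == 0x1E)  { cp = b0 & 0x07; extra = 3; }
    else { ++i; return 0xFFFD; }
    if (i + extra >= s.size()) { ++i; return 0xFFFD; }
    for (std::size_t k = 1; k <= extra; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) { ++i; return 0xFFFD; }
        cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// Guesses a language code from the dominant script of `text`.
std::string guess_lang_by_script(const std::string& text) {
    std::size_t latin = 0, cyrillic = 0, arabic = 0, persian_only = 0;
    for (std::size_t i = 0; i < text.size();) {
        const char32_t cp = next_code_point(text, i);
        if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) {
            ++latin;
        } else if (cp >= 0x0400 && cp <= 0x04FF) {   // Cyrillic block
            ++cyrillic;
        } else if (cp >= 0x0600 && cp <= 0x06FF) {   // Arabic block
            ++arabic;
            if (cp == 0x067E || cp == 0x0686 || cp == 0x0698 || cp == 0x06AF) {
                ++persian_only;                      // pe, che, zhe, gaf
            }
        }
    }
    if (latin == 0 && cyrillic == 0 && arabic == 0) return "other";
    if (arabic >= latin && arabic >= cyrillic) {
        return persian_only > 0 ? "fa" : "ar";
    }
    if (cyrillic >= latin) return "ru";
    return "en";
}
```

A practical detector would need to go further, for example by comparing character n-gram or word frequencies, in order to tell Uzbek apart from English and Russian.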

Script launch example:

$ ./tgcat-tester category dc0415-input.txt dc0415-category-output.txt
Processed 50297 queries in 0.033010 seconds

General Requirements

  • Your library must work locally (no network usage including localhost).
  • Relative speed is of critical importance.
  • External dependencies should be kept to a minimum. If you can't avoid external dependencies, please list them in a text file named deb-packages.txt. These dependencies will be installed using sudo apt-get install ... before your library is tested.
  • The library will be tested on servers running Debian GNU/Linux 10 (buster), x86-64 with 8 cores and 16 GB RAM. Before submitting, please make sure that your library works correctly on a clean system.
  • Make sure the library was built on Debian GNU/Linux 10 (buster).
  • You must submit a ZIP-file (the maximum limit for a file sent to the bot is 2 GB) with the following structure:

    submission.zip
      -> src - folder with the library's source code (obligatory)
      -> libtgcat.so - library (obligatory)
      -> resources - folder with additional files which your library requires to work (please use relative paths to access them) (optional)
      -> deb-packages.txt - a text file with line-break separated debian package names of all external dependencies (optional)

Evaluation

When evaluating submissions we will prioritize the speed and accuracy of the algorithms. Accuracy will have the highest priority.

Note that we will not evaluate libraries that take more than 60 seconds to process any batch of 1000 channels.
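The limit above works out to an average of 60 ms per channel (60 s × 1000 ms/s ÷ 1000 channels). A small timing helper such as the one below can be used to check a batch locally. It is a sketch only: the function names are hypothetical, and measuring wall-clock time with std::chrono::steady_clock is an assumption, since the document does not say how the evaluation measures time.

```cpp
#include <chrono>
#include <cstddef>

// Average time available per channel, in milliseconds, for a batch limit.
double per_channel_budget_ms(double batch_limit_seconds,
                             std::size_t batch_size) {
    return batch_limit_seconds * 1000.0 / static_cast<double>(batch_size);
}

// Calls process(i) for every i in [0, batch_size) and returns the
// elapsed wall-clock time in seconds.
template <typename Fn>
double time_batch(std::size_t batch_size, Fn process) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < batch_size; ++i) {
        process(i);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

In a local test, process would wrap the call into the library for the i-th channel of the batch, and the returned value would be compared against the 60-second limit, ideally on hardware comparable to the test servers.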