ML Competition 2023, Round 2

The task in this competition is to create a library that detects the programming or markup language of a code snippet. The deadline is November 20 at 23:59 Dubai time. Only contestants from the first stage of the competition are allowed to participate.

General info about this competition is available on @contest. Further submission instructions will be announced there closer to the deadline.

The Task

Implement a shared library that detects the programming or markup language of a code snippet. You can use any publicly available data and the provided dataset to train your solution.

For the second round, the list of programming and markup languages to detect was reduced to the following:

  TGLANG_LANGUAGE_C
  TGLANG_LANGUAGE_CPLUSPLUS
  TGLANG_LANGUAGE_CSHARP
  TGLANG_LANGUAGE_CSS
  TGLANG_LANGUAGE_DART
  TGLANG_LANGUAGE_DOCKER
  TGLANG_LANGUAGE_FUNC
  TGLANG_LANGUAGE_GO
  TGLANG_LANGUAGE_HTML
  TGLANG_LANGUAGE_JAVA
  TGLANG_LANGUAGE_JAVASCRIPT
  TGLANG_LANGUAGE_JSON
  TGLANG_LANGUAGE_KOTLIN
  TGLANG_LANGUAGE_LUA
  TGLANG_LANGUAGE_NGINX
  TGLANG_LANGUAGE_OBJECTIVE_C
  TGLANG_LANGUAGE_PHP
  TGLANG_LANGUAGE_POWERSHELL
  TGLANG_LANGUAGE_PYTHON
  TGLANG_LANGUAGE_RUBY
  TGLANG_LANGUAGE_RUST
  TGLANG_LANGUAGE_SHELL
  TGLANG_LANGUAGE_SOLIDITY
  TGLANG_LANGUAGE_SQL
  TGLANG_LANGUAGE_SWIFT
  TGLANG_LANGUAGE_TL
  TGLANG_LANGUAGE_TYPESCRIPT
  TGLANG_LANGUAGE_XML

Development and Testing

You can download a sample library here: libtglang-r2.tar.gz. tglang.h describes the interface you are required to implement in this contest. Use the following commands to build the library:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
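
The header itself is not reproduced on this page. As a rough illustration of what an implementation of the interface might look like, the sketch below assumes that tglang.h declares a single entry point named tglang_detect_programming_language, taking a null-terminated string and returning a TglangLanguage value. Check the header in the archive for the actual declaration. The local enum is only a stand-in for the one in tglang.h, the value 9 for HTML is an assumption (list order, with TGLANG_LANGUAGE_OTHER at 0), and the prefix check is a toy heuristic rather than a real detector.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the enum from tglang.h. Only two members are shown, and the
 * value 9 for HTML is an assumption (position in the list above, OTHER = 0). */
enum TglangLanguage {
  TGLANG_LANGUAGE_OTHER = 0,
  TGLANG_LANGUAGE_HTML = 9
};

/* Returns 1 if s begins with prefix, 0 otherwise. */
static int starts_with(const char *s, const char *prefix) {
  return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Assumed entry point; the real name and signature are defined by tglang.h. */
enum TglangLanguage tglang_detect_programming_language(const char *text) {
  if (text == NULL) {
    return TGLANG_LANGUAGE_OTHER;
  }
  /* Skip leading whitespace so indented snippets are handled too. */
  while (*text != '\0' && isspace((unsigned char)*text)) {
    ++text;
  }
  /* Toy heuristic: treat an HTML doctype or an <html tag as HTML. */
  if (starts_with(text, "<!DOCTYPE html") || starts_with(text, "<html")) {
    return TGLANG_LANGUAGE_HTML;
  }
  /* Anything unrecognized falls back to OTHER. */
  return TGLANG_LANGUAGE_OTHER;
}
```

A real solution would replace the prefix check with a trained model or a richer set of rules, while keeping the exported symbol and return type that tglang.h specifies.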

You can test the resulting library file libtglang.so on the test data using the test script libtglang-tester-r2.tar.gz. To do this, copy libtglang.so into the directory containing the test script, then build with cmake in the standard way:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

To test the library output, launch the resulting binary file tglang-tester with the following parameter:

tglang-tester <input_file>

where:

  • <input_file> – path to the file containing input data.

The tester will output the numeric value of the detected language as specified by the TglangLanguage enum.

Script launch example:

$ ./tglang-tester code.txt
9
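
Because the tester prints a bare number, a small lookup table can make local results easier to read. The sketch below is a hypothetical helper, not part of the provided archives. It assumes the enum in tglang.h numbers the languages in the order of the list in The Task section, with TGLANG_LANGUAGE_OTHER at 0; verify this against the header before relying on it.

```c
#include <stddef.h>
#include <string.h>

/* Names in the order of the language list, with OTHER first. That the enum
 * in tglang.h uses exactly this numbering is an assumption. */
static const char *const kTglangNames[] = {
  "TGLANG_LANGUAGE_OTHER",      "TGLANG_LANGUAGE_C",
  "TGLANG_LANGUAGE_CPLUSPLUS",  "TGLANG_LANGUAGE_CSHARP",
  "TGLANG_LANGUAGE_CSS",        "TGLANG_LANGUAGE_DART",
  "TGLANG_LANGUAGE_DOCKER",     "TGLANG_LANGUAGE_FUNC",
  "TGLANG_LANGUAGE_GO",         "TGLANG_LANGUAGE_HTML",
  "TGLANG_LANGUAGE_JAVA",       "TGLANG_LANGUAGE_JAVASCRIPT",
  "TGLANG_LANGUAGE_JSON",       "TGLANG_LANGUAGE_KOTLIN",
  "TGLANG_LANGUAGE_LUA",        "TGLANG_LANGUAGE_NGINX",
  "TGLANG_LANGUAGE_OBJECTIVE_C", "TGLANG_LANGUAGE_PHP",
  "TGLANG_LANGUAGE_POWERSHELL", "TGLANG_LANGUAGE_PYTHON",
  "TGLANG_LANGUAGE_RUBY",       "TGLANG_LANGUAGE_RUST",
  "TGLANG_LANGUAGE_SHELL",      "TGLANG_LANGUAGE_SOLIDITY",
  "TGLANG_LANGUAGE_SQL",        "TGLANG_LANGUAGE_SWIFT",
  "TGLANG_LANGUAGE_TL",         "TGLANG_LANGUAGE_TYPESCRIPT",
  "TGLANG_LANGUAGE_XML"
};

/* Hypothetical helper: name for a numeric value, or NULL if out of range. */
const char *tglang_language_name(int value) {
  size_t count = sizeof(kTglangNames) / sizeof(kTglangNames[0]);
  if (value < 0 || (size_t)value >= count) {
    return NULL;
  }
  return kTglangNames[value];
}

/* Hypothetical helper: numeric value for a name, or -1 if the name is unknown. */
int tglang_language_value(const char *name) {
  size_t count = sizeof(kTglangNames) / sizeof(kTglangNames[0]);
  for (size_t i = 0; i < count; ++i) {
    if (strcmp(kTglangNames[i], name) == 0) {
      return (int)i;
    }
  }
  return -1;
}
```

Under that numbering assumption, the 9 printed in the example would correspond to TGLANG_LANGUAGE_HTML.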

General Requirements

  • The shared library can be built in any programming language of your choice, as long as it provides the interface described in tglang.h and is compatible with tglang-tester.
  • Your library must work locally (no network usage, including localhost).
  • Your library must not use fork or launch other executables, but it can use multiple threads.
  • Speed is of critical importance. The response time of your solution should not exceed 50 milliseconds for a text of 4096 characters. Library loading time is included in the measurement.
  • The library will be tested on servers running Debian GNU/Linux 10 (buster), x86-64 with 8 cores and 16 GB RAM. Before submitting, please make sure that your app works correctly on a clean system and is built on Debian GNU/Linux 10 (buster) for Debian GNU/Linux 10 (buster).
  • External dependencies should be kept to a minimum. If you can't avoid external dependencies, please list them in a text file named deb-packages.txt. These dependencies will be installed using sudo apt-get install ... before your app is tested. If your app requires a dependency that isn't available as a Debian 10 package, its source code must be included and built along with your library.
  • You must submit a ZIP-file (the maximum limit for a file sent to the bot is 2 GB) with the following structure:
submission.zip
  -> libtglang.so - shared library with interface described in tglang.h (obligatory)
  -> src - folder with the app's source code (obligatory)
  -> train - folder with the source code and datasets used to train the model if any (obligatory)
  -> README.md - short description of the solution (obligatory)
  -> resources - folder with additional files which your library requires to work (please use relative paths to access them) (optional)
  -> deb-packages.txt - a text file with line-break separated debian package names of all external dependencies (optional)
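
To get a rough local sense of the 50 millisecond budget from the speed requirement, a single call can be timed on a 4096-character input. The sketch below is one possible way to do this, not the organizers' measurement method. The names time_call_ms, fill_snippet and dummy_detect are hypothetical, and dummy_detect stands in for the function exported by your library. It times only the call itself, whereas the requirement also counts library loading time, so treat the result as a lower bound.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for clock_gettime */
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical function-pointer type for the detector under test. */
typedef int (*detect_fn)(const char *text);

/* Fills buf with len filler characters and terminates it.
 * buf must have room for len + 1 bytes. */
void fill_snippet(char *buf, size_t len) {
  memset(buf, 'x', len);
  buf[len] = '\0';
}

/* Runs detect(text) once, stores its return value in *result, and returns
 * the elapsed wall-clock time in milliseconds (monotonic clock). */
double time_call_ms(detect_fn detect, const char *text, int *result) {
  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);
  *result = detect(text);
  clock_gettime(CLOCK_MONOTONIC, &end);
  return (double)(end.tv_sec - start.tv_sec) * 1000.0 +
         (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Hypothetical stand-in detector: scans the text once and returns 0 (OTHER).
 * Replace it with the function exported by your library. */
int dummy_detect(const char *text) {
  size_t n = strlen(text);
  (void)n;
  return 0;
}
```

Timing a cold first call separately from later calls is worthwhile, since one-time model loading is likely to dominate the first measurement.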

Evaluation

The solutions will be tested on code snippets from public Telegram chats, which may contain content other than valid code. For such input, the library is expected to return TGLANG_LANGUAGE_OTHER == 0. This is expected to be the most common correct answer. For the same reason, some programming languages may appear much more often in the testing dataset than others.

When evaluating submissions, we will prioritize the accuracy and speed of the algorithms. Accuracy will have the highest priority.
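
The exact scoring formula is not stated here. Assuming accuracy means the plain fraction of snippets labeled correctly, a local check on your own labeled set could look like the sketch below. Both function names are hypothetical. Since TGLANG_LANGUAGE_OTHER is expected to be the most common answer, the score of always predicting 0 is a useful baseline to compare against.

```c
#include <stddef.h>

/* Fraction of positions where predicted[i] equals expected[i].
 * Returns 0.0 for an empty set. */
double accuracy(const int *expected, const int *predicted, size_t n) {
  size_t correct = 0;
  if (n == 0) {
    return 0.0;
  }
  for (size_t i = 0; i < n; ++i) {
    if (expected[i] == predicted[i]) {
      ++correct;
    }
  }
  return (double)correct / (double)n;
}

/* Accuracy of a trivial detector that always answers with one label,
 * for example 0 for TGLANG_LANGUAGE_OTHER. */
double constant_baseline(const int *expected, size_t n, int label) {
  size_t hits = 0;
  if (n == 0) {
    return 0.0;
  }
  for (size_t i = 0; i < n; ++i) {
    if (expected[i] == label) {
      ++hits;
    }
  }
  return (double)hits / (double)n;
}
```

A detector that does not clearly beat the constant baseline on a realistically imbalanced sample is probably misclassifying too many non-code messages as code.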