CodaLab – Data Science competitions
Dec 17, 2018
CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Let’s see how it works and how to install CodaLab On-Premise.
Competition is anchored in our personal and professional lives. Its goal is not necessarily to be better than others; the main goal is to keep excelling while having fun. In the world of Big Data, and in computing more generally, participating in competitions has several advantages. Competing with others helps build skills on new technologies and, by measuring ourselves against others, lets us evaluate our real abilities. Organizing competitions internally can revitalize a group and motivate team members. It encourages a healthy competitive spirit and pushes, for instance, Data Scientists to write ever more effective code.
In this context, a client asked us to survey the tools available for organizing data science competitions internally. We selected CodaLab and CodaLab Competition. CodaLab allows code execution and sharing within a team. CodaLab Competition allows organizing competitions on top of a CodaLab infrastructure.
CodaLab
CodaLab was created in 2013 as a joint venture between Microsoft and Stanford University. Originally, the vision was to create an ecosystem for conducting computational research in a more efficient, reproducible, and collaborative manner, combining worksheets and competitions. Worksheets capture complex research pipelines in a reproducible way and create “executable papers”. With this Open Source web platform, researchers and developers can collaborate to advance research, mainly in areas where machine learning and advanced computing are used. CodaLab makes it easy to share one’s work with any community, which makes collaboration more effective. CodaLab essentially addresses many common problems in data-driven research. It can also handle more complex problems when the solution can be provided in the form of a zip archive.
CodaLab Competition
Since 2016, CodaLab has offered the possibility of organizing online competitions directly on its servers. CodaLab Competition mainly hosts data science competitions, but it is not limited to this area. To participate in a competition, simply register and propose a solution. The solution can be a submission of results or of code. The simplest competitions only require the submission of results, which are compared to a solution (or key) by a scoring program. Results-submission challenges are less expensive to compute than code-submission challenges, because the platform only has to compare the results with the key. Code submission allows performance testing by running the submitted code in the same environment for all participants.
In 2014, ChaLearn, which organizes challenges in the Machine Learning area to stimulate research, partnered with CodaLab for the joint development of CodaLab Competition. A particularly interesting feature of CodaLab Competition is that organizers can connect their own computing agents to CodaLab’s backend and redirect code submissions to them. This allows a company to run competitions internally, on its own infrastructure, and to overcome certain limitations of the hosted service, for example regarding data security.
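To make the results-submission flow more concrete, here is a minimal sketch of what a scoring program does when it compares a submission to the key. It assumes one label per line in both files and uses accuracy as the metric; the function names and the data are illustrative and do not reflect CodaLab's actual scoring interface.

```python
def read_labels(text):
    """Parse one label per line, ignoring blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def accuracy(predictions, key):
    """Return the fraction of predictions that match the key."""
    if len(predictions) != len(key):
        raise ValueError("submission and key must have the same number of rows")
    if not key:
        raise ValueError("the key is empty")
    correct = sum(1 for p, k in zip(predictions, key) if p == k)
    return correct / len(key)


if __name__ == "__main__":
    # Hypothetical data: a four-row key and a submission with one mistake.
    key = read_labels("cat\ndog\ndog\ncat\n")
    submission = read_labels("cat\ndog\ncat\ncat\n")
    print("accuracy: %.2f" % accuracy(submission, key))
```

Because the work is a single pass over two short lists, this kind of evaluation stays cheap compared to running submitted code in a container.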
The architecture is as follows:
- The CodaLab server, which mainly provides sharing through a web interface
- The CodaLab Competition service, which runs on top of the CodaLab server and makes it possible to set up competitions.
A functional CodaLab server is therefore required first. Let’s now focus on the architecture and installation of the latter.
CodaLab architecture
Docker
CodaLab uses Docker to manage the local development and deployment of environments because it offers an increased level of reproducibility. Previously, it took hours to install each piece of CodaLab.
Django
Django is the most important part of CodaLab Competition. Django is used to interact with the MySQL database, migrate the state of the database, and perform asynchronous tasks.
MySQL
MySQL is the database used by CodaLab.
RabbitMQ
RabbitMQ is used as a job message broker.
Celery
Celery is the task queue that runs long-running jobs, such as:
- Creating competitions
- Evaluating submissions
- Sending emails
- Re-executing all submissions
- Running scheduled tasks
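The broker and worker pattern behind these long tasks can be sketched with the Python standard library alone. In this illustration a `queue.Queue` stands in for RabbitMQ and a thread stands in for a Celery worker; it is not Celery's API, and `evaluate_submission` is a hypothetical placeholder for a real job.

```python
import queue
import threading


def evaluate_submission(submission_id):
    """Placeholder for a long-running job such as scoring a submission."""
    return "submission %d evaluated" % submission_id


def worker(jobs, results):
    """Consume jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        func, arg = job
        results.append(func(arg))


def run_jobs(submission_ids):
    """Queue one job per submission and wait for the worker to finish them."""
    jobs = queue.Queue()  # stands in for the message broker
    results = []
    thread = threading.Thread(target=worker, args=(jobs, results))
    thread.start()
    for submission_id in submission_ids:
        # The producer only enqueues; it does not wait for the job to run.
        jobs.put((evaluate_submission, submission_id))
    jobs.put(None)  # sentinel: no more work
    thread.join()
    return results
```

The point of the design is that the web process only enqueues a job and returns immediately, while the worker processes jobs at its own pace.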
Nginx
Nginx is an HTTP server that can manage web requests. We can use it to cache static pages and manage a large influx of traffic if needed.
How does CodaLab use Docker?
The code submitted on the CodaLab platform is run in a Docker container. This environment can be reproduced identically on a local computer by downloading the corresponding image. The default CodaLab environment contains a large number of pre-loaded programs, such as Python. It is possible to download or customize the default docker-codalab-legacy-worker image from the Docker Hub by searching for codalab/codalab-legacy.
CodaLab Installation
The CodaLab wiki documents the installation step by step on an Ubuntu machine. However, after several failures during the installation, we put together the following installation guide for CentOS 7, which summarizes the main actions to perform. First, download the source code hosted on GitHub:
git clone https://github.com/codalab/codalab-worksheets
git clone https://github.com/codalab/codalab-cli
In the following, the environment variable $HOME will refer to the directory in which the Git repositories of codalab-worksheets and codalab-cli are downloaded. The configuration files will be stored in $CODALAB_HOME, which is by default ~/.codalab. Specific packages must be installed beforehand.
Packages installation
Python and virtualenv dependencies
yum install -y python-virtualenv
Nodejs
yum install -y epel-release
yum install -y npm
yum install -y gcc make
MySQL
wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm
sudo rpm -ivh mysql-community-release-el7-5.noarch.rpm
yum update -y
yum -y install mysql-server
yum install -y python-devel mysql-devel
Docker
wget https://download.docker.com/linux/centos/7/x86_64/stable/Packages/docker-ce-selinux-17.03.0.ce-1.el7.centos.noarch.rpm
yum install -y docker-ce-selinux-17.03.0.ce-1.el7.centos.noarch.rpm
wget https://download.docker.com/linux/centos/7/x86_64/stable/Packages/docker-ce-17.03.0.ce-1.el7.centos.x86_64.rpm
yum install -y docker-ce-17.03.0.ce-1.el7.centos.x86_64.rpm
It is important to have a codalab user because some commands must be executed as codalab and not as root.
useradd codalab
usermod -aG wheel codalab
Execute installation scripts
Once all the necessary prerequisites are installed, we can start the installation. Be careful: the following commands must be run as the codalab user. The chown itself needs root privileges, hence the sudo.
sudo chown -R codalab: "codalab-cli/" "codalab-worksheets/"
cd "$HOME/codalab-worksheets" && ./setup.sh
cd "$HOME/codalab-cli" && ./setup.sh server
Database configuration
Once the installation is complete, the database must be configured and secured. We create a codalab MySQL user and a codalab_bundles database, then link them to CodaLab.
sudo mysql -u root
CREATE USER 'codalab'@'localhost' IDENTIFIED BY '<passwd>';
CREATE DATABASE codalab_bundles;
GRANT ALL ON codalab_bundles.* TO 'codalab'@'localhost';
CodaLab must then be connected to the database.
cd "$HOME/codalab-cli" && codalab/bin/cl config server/engine_url mysql://codalab:<passwd>@localhost:3306/codalab_bundles
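A password containing characters such as @ or / would break the URL passed to server/engine_url. The helper below is a small sketch that builds the URL with percent-encoded credentials. It assumes the URL is read by a parser that percent-decodes them, which is not something the CodaLab documentation confirms; the function name and defaults are illustrative.

```python
from urllib.parse import quote


def engine_url(user, password, host="localhost", port=3306,
               database="codalab_bundles"):
    """Build a mysql:// URL, percent-encoding the user name and password."""
    return "mysql://%s:%s@%s:%d/%s" % (
        quote(user, safe=""),      # safe="" also encodes "/" and "@"
        quote(password, safe=""),
        host,
        port,
        database,
    )


if __name__ == "__main__":
    # Hypothetical password, used only to show the encoding.
    print(engine_url("codalab", "p@ss/word"))
```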
Email service configuration
To offer a registration service, you must configure the email service. It is used to validate the registration of new users by sending them an email, and it lets them receive emails from the CodaLab server. The configuration consists of registering an email account (mail server host, email address, password). It is not possible to configure the sending of mail through an SMTP server specific to the company. Several workarounds are available. For example, we can parse the logs and automate the sending of mails through a company SMTP server whenever a new registration appears. We can also set up a Watchdog that sends an email for each registration event. Either workaround, however, adds components to build and maintain. The standard configuration of the email address via CodaLab is as follows.
$HOME/codalab-cli/codalab/bin/cl config email/host <host>
$HOME/codalab-cli/codalab/bin/cl config email/user <username>
$HOME/codalab-cli/codalab/bin/cl config email/password <password>
$HOME/codalab-cli/codalab/bin/cl config admin-email <email>
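The log-parsing workaround mentioned above could be sketched as follows. The log line format is entirely hypothetical, so the regular expression has to be adapted to what the server actually writes. The `send` callback is also an assumption: in practice it would wrap a call to the company SMTP server.

```python
import re

# Hypothetical log format; adjust the pattern to the real server logs.
REGISTRATION = re.compile(r"new user registered: (?P<email>\S+@\S+)")


def new_registrations(log_lines):
    """Yield the email address of each registration event found in the log."""
    for line in log_lines:
        match = REGISTRATION.search(line)
        if match:
            yield match.group("email")


def notify(addresses, send):
    """Call send(address) once per distinct address, in first-seen order."""
    seen = set()
    for address in addresses:
        if address not in seen:
            seen.add(address)
            send(address)
```

Deduplicating addresses keeps a user from receiving several emails if the same log file is parsed more than once.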
Installation and execution of Nginx
Nginx is an HTTP server that will manage all our web requests. First, install it:
yum install -y nginx
Once installed, it must be configured to work with CodaLab:
cd "$HOME/codalab-worksheets/codalab" && ./manage config_gen
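This generates an Nginx configuration file at $HOME/codalab-worksheets/codalab/config/generated/nginx.
- Insert the line
include $HOME/codalab-worksheets/codalab/config/generated/nginx;
in the http block of /etc/nginx/nginx.conf, replacing $HOME with the actual absolute path, since Nginx does not expand shell variables in its configuration. Note the trailing semicolon, which Nginx requires after every simple directive.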
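For repeatable installations, the manual edit of nginx.conf can be scripted. The sketch below inserts the include directive at the top of the http block if it is not already there. It assumes a conventional nginx.conf with a single top-level http block; it is a text manipulation, not a full Nginx configuration parser, and the function name is illustrative.

```python
import re


def add_include(nginx_conf, include_path):
    """Return nginx_conf with an include directive at the top of the http block."""
    directive = "include %s;" % include_path
    if directive in nginx_conf:
        return nginx_conf  # already present, nothing to do
    # Match "http {" at the start of a line, so commented-out blocks are skipped.
    match = re.search(r"^\s*http\s*\{", nginx_conf, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no http block found")
    return (nginx_conf[:match.end()]
            + "\n    " + directive
            + nginx_conf[match.end():])
```

The function is idempotent, so running it again on an already patched file leaves the file unchanged.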
Execution of the different services
Once all these steps are done, we can launch the services CodaLab needs to run. The paths below assume the repositories were cloned into /opt; substitute your own $HOME if it differs.
- Start the website server
cd "/opt/codalab-worksheets/codalab"
./manage runserver 127.0.0.1:2700
- Start the API service
cd "/opt/codalab-cli"
codalab/bin/cl server
- Start the bundle manager
cd "/opt/codalab-cli"
codalab/bin/cl bundle-manager
- Start the worker
cd "/opt/codalab-cli/worker/codalabworker"
./worker.sh --server http://localhost:2900 --password /home/codalab/root.password
Our CodaLab service is now configured and usable. It is available at http://localhost:8080 (or any other port on which Nginx is configured to listen).
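A quick way to confirm that the services started is to check that their TCP ports accept connections. The sketch below uses the ports that appear in the commands above (2700 for the website, 2900 for the API service, 8080 for Nginx). The helper names are illustrative, and the port numbers must be adjusted if your configuration differs.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the commands above; adjust them to your own setup.
SERVICES = {"website": 2700, "api": 2900, "nginx": 8080}


def check_services(host="127.0.0.1"):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print("%-8s %s" % (name, "up" if up else "DOWN"))
```

An open port only shows that a process is listening there; it does not prove that the service behind it is healthy.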
Advantages
- When organizing competitions internally, the evaluation scripts are run and the results are collected fully automatically.
- Participants can easily test their output formats (for example, on test data) without assistance.
- It is relatively easy to define the start and end dates of the different competitions.
- CodaLab rankings may include several different scores and may be anonymous if desired.
Disadvantages
- CodaLab, with the integration of our own agents, is not yet very stable, and we have little control over the installation: it is driven by setup scripts that take care of the whole process.
- The documentation is not detailed enough and not very explicit.
- It is not possible to use a company SMTP server for sending emails. One workaround would be to use a Watchdog or to parse the logs and send emails via our own SMTP server.
- The Git repository is not really up to date.
Summary
CodaLab Competition is a great solution for organizing competitions internally. However, it requires a functional CodaLab server, and the installation of the latter is not yet smooth. It does not always work well and the project’s Git repository is not really up to date: we had to navigate all the branches to find the right information and the right scripts. In conclusion, after consultation with the customer’s teams, the decision was made to wait until the technology matures. A compatibility test with a container orchestration solution such as Kubernetes is on the roadmap, and it may give interesting results.