Conductor is an application based on .NET Core designed for distributed machine learning. It features a client-server architecture for horizontal scaling in a containerized environment. The server application defines work packages and assigns them to the clients. The clients fetch work from the server, do the heavy lifting and deliver the newly generated files (i.e. machine learning models) back to the server. The main purpose is to make horizontal scaling of the machine learning process as easy as possible: simply add more containers.
branch | pipeline | docker:server | docker:client (CPU) | docker:client (GPU)
---|---|---|---|---
master | | | |
nightly | | | |
The server is a .NET Core 2.1 application with a SignalR hub on port 8080 (via Kestrel/IIS) with the endpoint /signalr for the clients.
It creates a folder at %AppData% (Windows) or /home (Linux) as its working directory. The working directory contains two folders:
- assets: contains all work packages and the results. A work package is a directory that contains files and subdirectories. The entire directory is sent to the client as payload (except the "models" and "evaluation" directories).
- config: contains the config.xml configuration file
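The payload rule for the assets folder can be illustrated with a short Python sketch. This is not the server's actual code (the server is a .NET application); the function name `payload_files` is hypothetical, and it assumes that only top-level "models" and "evaluation" directories of a work package are excluded.

```python
from pathlib import Path

# Directory names that are not sent to the client.
EXCLUDED_DIRS = {"models", "evaluation"}


def payload_files(package_dir):
    """Return the relative paths of all files that would be part of the payload."""
    root = Path(package_dir)
    selected = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        # Assumption: only top-level "models"/"evaluation" directories are skipped.
        if len(relative.parts) > 1 and relative.parts[0] in EXCLUDED_DIRS:
            continue
        selected.append(relative.as_posix())
    return selected
```

Calling `payload_files` on a work package directory returns everything except the contents of those two directories, which hold results rather than input.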
The server is designed to connect to a SignalR frontend (see CONDUCTOR_HOST in the docker-compose files), which can be used to remotely control the server and request predictions. However, the frontend is not required: the server can be managed entirely via the XML files.
The configuration file contains a list of versions (aka work packages) and the following parameters:
- ReserveNodes: the fraction of nodes to reserve for predictions, given as a value from 0 to 1. Reserved nodes do not accept training jobs.
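As a rough illustration of what ReserveNodes means, the following Python sketch splits the connected nodes into reserved and training nodes. The function name `split_nodes` is hypothetical, and rounding down is an assumption; the README does not state how the server rounds.

```python
import math


def split_nodes(connected_nodes, reserve_fraction):
    """Return (reserved, training) node counts for a given ReserveNodes value."""
    if not 0.0 <= reserve_fraction <= 1.0:
        raise ValueError("ReserveNodes must be between 0 and 1")
    # Assumption: the reserved share is rounded down to a whole node.
    reserved = math.floor(connected_nodes * reserve_fraction)
    return reserved, connected_nodes - reserved
```

With 8 connected nodes and ReserveNodes set to 0.25, this sketch keeps 2 nodes for predictions and leaves 6 for training jobs.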
The client is a .NET Core 2.1 application that connects to the server via SignalR. It periodically fetches work packages from the server and executes them. It is designed to scale horizontally in a containerized environment.
A version (Conductor_Shared.Version) is a work package that contains the definition of the payload that the client executes. The following parameters need to be defined:
- TargetModels: the number of target models to train. The server creates work packages equal to the number of target models minus the number of already trained models.
- TrainingCommands: a list of training commands (Conductor_Shared.Command) that are executed to train the model. A command contains the executable and its arguments.
- example: /bin/bash -c "python3 my_python_script.py my python script arguments"
- (optional) PredictionCommands: a list of prediction commands that are executed to create ensemble predictions. Used for production use cases.
- (optional) DatasetType: Used to define evaluation metrics to use.
- (optional) Name: the name of the folder. Will be overridden if the local type is used.
A version can be defined either globally (via %AppDataPath%/config/config.xml) or via a version.xml in the work packages folder. The application reads these XML files.
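Two of the rules above lend themselves to a small Python sketch: how many work packages remain for a version, and how a training command breaks down into an executable and its arguments. The function names are hypothetical, the clamp to zero is an assumption, and the shell-style splitting only illustrates the example command; it is not how Conductor_Shared.Command is actually stored.

```python
import shlex


def remaining_work_packages(target_models, trained_models):
    """Work packages to create: target models minus already trained models."""
    # Assumption: never negative if more models exist than requested.
    return max(0, target_models - trained_models)


def split_command(command_line):
    """Split a command line into (executable, list of arguments)."""
    parts = shlex.split(command_line)
    return parts[0], parts[1:]
```

For the example training command, `split_command` yields `/bin/bash` as the executable and two arguments: `-c` and the quoted Python invocation as a single string.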
Prebuilt docker images are available at DockerHub. See .gitlab-ci.yml for instructions on how to build them yourself. The client containers come in two flavors: CPU and GPU. The GPU containers require a working setup for nvidia-docker2. It's recommended to use a container orchestration tool (e.g. Rancher) to rapidly deploy many containers.
Example usage:
Windows: "docker-compose -f .\Conductor_Server\docker-compose.master.yml up -d"
Linux: "docker run -it -d -e CONDUCTOR_HOST=example.org:8080 chemsorly/conductor:latest-client-gpu-master"
Run Conductor_Server.dll with the dotnet command (requires the .NET Core runtime 2.1).
Run Conductor_Client.dll with the dotnet command, passing the server URL either as the first argument or via the environment variable "CONDUCTOR_HOST" (requires the .NET Core runtime 2.1).
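The two ways of passing the server address to the client can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions not stated in this README: that the command-line argument takes precedence over CONDUCTOR_HOST, and that the hub is reached over plain http.

```python
def resolve_server(argv, environ):
    """Pick the server address from the first argument or CONDUCTOR_HOST."""
    # Assumption: an explicit argument wins over the environment variable.
    if len(argv) > 1 and argv[1]:
        return argv[1]
    host = environ.get("CONDUCTOR_HOST")
    if host:
        return host
    raise ValueError("no server given: pass it as an argument or set CONDUCTOR_HOST")


def hub_url(host):
    """Build the SignalR hub URL from a host such as 'example.org:8080'."""
    # Assumption: plain http; the endpoint /signalr is taken from the README.
    return "http://" + host + "/signalr"
```

In a real script this would be called as `resolve_server(sys.argv, os.environ)`; passing both in as parameters keeps the sketch easy to test.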
The applications generate several logs.
The server logs everything in the following format: [timestamp] [connected clients, queued work, active work]. The log is also saved to config/log.txt.
The client forwards all console messages generated by the called application. On successful training, the log is written to the output directory.
(1) Install the .NET Core SDK 2.1.302
(2) Open the solution in Visual Studio
This application was initially developed during my master's thesis and extended during the TransformingTransports research project, which received funding from the EU’s Horizon 2020 R&I programme under grant 731932.