MLflow
MLflow is a tool for managing the machine learning lifecycle, including recording and searching experiments, reproducing runs, deploying trained models in diverse environments, managing trained models, and more.
Usage
Preparation: Install MLflow
In order to record experiments with MLflow, the MLflow package and related packages must be installed beforehand. The following commands can be used to install them on GENKAI.
|
Set up a tracking server
MLflow records experiments through a server program called a tracking server, which must be started beforehand; the machine learning program then connects to it. When running a machine learning program on compute nodes in GENKAI, you can either use a tracking server running on an external computer or start a tracking server on a login node in GENKAI.
- Case 1: Use an external tracking server
To use an external tracking server, make sure you have the necessary information (host name, port number, user name, and password of the tracking server) for the connection beforehand.
- Case 2: Start a tracking server on a login node in GENKAI
The procedure for starting the tracking server on the login node of GENKAI is as follows.
- Confirm the IP address of the login node
The IP address must be specified to start the tracking server. However, GENKAI uses two servers as login nodes for load balancing, each with a different IP address. Because the server you land on can change each time you log in, look up the IP address from the server host name each time.
You can get the host name of the server you are logged in to with the following command:
$ hostname
The IP address of each host is as follows:
Host name | IP address |
---|---|
genkai0001 | 172.16.0.1 |
genkai0002 | 172.16.0.2 |
- Load modules
Load the modules required to start the tracking server with the following command:
$ module load gcc pytorch/2.3.1
- Start the tracking server
Start the tracking server with the following command. Note that the IP address specified in --host 172.16.0.1 must be changed to match the host name of the login node.
$ mlflow server --host 172.16.0.1 --port 60000
If you get the message Address already in use, the port 60000 specified by --port 60000 is currently in use, so increase the port number one at a time (60001, 60002, and so on) and run the command again.
If the tracking server has started correctly, you will get the following message:
[2024-09-17 15:22:09 +0900] [463931] [INFO] Starting gunicorn 23.0.0
[2024-09-17 15:22:09 +0900] [463931] [INFO] Listening at: http://172.16.0.2:60000 (463931)
[2024-09-17 15:22:09 +0900] [463931] [INFO] Using worker: sync
[2024-09-17 15:22:09 +0900] [463938] [INFO] Booting worker with pid: 463938
[2024-09-17 15:22:09 +0900] [463939] [INFO] Booting worker with pid: 463939
[2024-09-17 15:22:09 +0900] [463940] [INFO] Booting worker with pid: 463940
[2024-09-17 15:22:09 +0900] [463941] [INFO] Booting worker with pid: 463941
The tracking server occupies the terminal and does not accept commands while it is running. To run a machine learning program on a compute node in GENKAI and record the experiment on this tracking server, leave this terminal as it is and log in to GENKAI from another terminal window.
To stop the tracking server, press Ctrl-c (hold down the control key and press c).
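The start-up steps above (look up the IP address from the host name, then try ports from 60000 upward) can be sketched in Python as follows. This is only an illustrative helper, not part of MLflow or GENKAI: the function names login_node_ip, first_free_port, and server_command are hypothetical, the limit of 20 attempts is an assumption, and the host table is copied from the table in this section.

```python
import socket

# Host name -> IP address, copied from the table in this section.
LOGIN_NODE_IPS = {
    "genkai0001": "172.16.0.1",
    "genkai0002": "172.16.0.2",
}


def login_node_ip(hostname):
    """Return the IP address for a login node host name.

    Only the part before the first dot is used, in case the host name
    is reported in fully qualified form (an assumption).
    """
    return LOGIN_NODE_IPS[hostname.split(".")[0]]


def first_free_port(ip, start=60000, attempts=20):
    """Return the first port >= start that can be bound on ip.

    Mirrors the manual procedure: try 60000, then 60001, and so on.
    The number of attempts is an arbitrary limit chosen for this sketch.
    """
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((ip, port))
            except OSError:
                continue  # port is in use (or not bindable); try the next one
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")


def server_command(ip, port):
    """Build the 'mlflow server' command line shown in this section."""
    return ["mlflow", "server", "--host", ip, "--port", str(port)]
```

On a login node, one might call login_node_ip(socket.gethostname()), pass the result to first_free_port, and then run the list returned by server_command with subprocess.run. Another user could still take the port between the check and the server start, so the Address already in use message remains possible.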
Record experiments
MLflow records the contents of the experiment on the specified tracking server. From any compute node in GENKAI, you can record to a tracking server on an external computer or on a login node in GENKAI.
- Case 1: Use an external tracking server
Specify the host name and port number of the tracking server via MLflow's API or an environment variable, and enter a user name and password if necessary. The experiment is then recorded on that tracking server.
- Case 2: Use a tracking server on a login node in GENKAI
Set the IP address and port number that you specified when starting the tracking server via MLflow's API or an environment variable. The experiment is then recorded on the tracking server on the login node.
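As a minimal sketch of both cases, the following Python code sets the connection either through environment variables or through MLflow's API. The helper names tracking_uri, configure_tracking_env, and record_demo_run are hypothetical, and the experiment name, parameter, and metric values are placeholders. MLFLOW_TRACKING_URI, MLFLOW_TRACKING_USERNAME, and MLFLOW_TRACKING_PASSWORD are the environment variables the MLflow client reads; whether a user name and password are required depends on how the tracking server is configured.

```python
import os


def tracking_uri(host, port):
    """Build the tracking server URI from a host (or IP address) and a port."""
    return f"http://{host}:{port}"


def configure_tracking_env(host, port, username=None, password=None):
    """Set the connection through environment variables read by MLflow."""
    os.environ["MLFLOW_TRACKING_URI"] = tracking_uri(host, port)
    if username is not None:
        os.environ["MLFLOW_TRACKING_USERNAME"] = username
    if password is not None:
        os.environ["MLFLOW_TRACKING_PASSWORD"] = password


def record_demo_run(host, port):
    """Record one run with placeholder values through MLflow's API."""
    # Imported here so that the helpers above can be used without MLflow.
    import mlflow

    mlflow.set_tracking_uri(tracking_uri(host, port))
    mlflow.set_experiment("genkai-demo")  # placeholder experiment name
    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.01)  # placeholder parameter
        mlflow.log_metric("loss", 0.42, step=1)  # placeholder metric
```

For a tracking server started on genkai0001 with port 60000, a job script could call record_demo_run("172.16.0.1", 60000). Setting the environment variables instead keeps the address out of the training code.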
If you need to enter tracking server connection information interactively, please use an interactive job. If you want to start Jupyter Notebook on a compute node and use it remotely, please refer to the following.
Viewing the experiments
You can view the recorded experiments by connecting to the tracking server from your browser. To view an experiment recorded on an external tracking server, open its URI directly in your browser.
To view experiments recorded on a tracking server on a GENKAI login node, however, you must connect through SSH port forwarding. In that case, the settings are as follows:
Item | Value |
---|---|
Port number of your local computer | Arbitrary number (For example, 8888) |
SSH server | genkai.hpc.kyushu-u.ac.jp |
User name of SSH server | Supercomputer account of GENKAI |
Port number of SSH server | 22 |
Remote server | Host name on which the tracking server started (genkai0001 or genkai0002) |
Port number of remote server | Port number specified when starting the tracking server |
This allows you to open http://localhost:PORT_NUMBER (replace PORT_NUMBER with the port number of your local computer) in a browser on your local computer and connect to the tracking server on the GENKAI login node to view your experiments.
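With an OpenSSH client, the settings in the table above correspond to the -L option. The sketch below only assembles that command line in Python; the function name port_forward_command and the account name in the usage example are hypothetical, and the -N flag (forward ports without running a remote command) is an optional addition not mentioned in the table.

```python
import shlex


def port_forward_command(local_port, remote_host, remote_port, user,
                         ssh_server="genkai.hpc.kyushu-u.ac.jp", ssh_port=22):
    """Build an OpenSSH command that forwards local_port to the tracking server."""
    return [
        "ssh",
        "-N",  # optional: only forward ports, do not open a remote shell
        "-L", f"{local_port}:{remote_host}:{remote_port}",
        "-p", str(ssh_port),
        f"{user}@{ssh_server}",
    ]


def browser_url(local_port):
    """URL to open on the local computer once the tunnel is up."""
    return f"http://localhost:{local_port}"


if __name__ == "__main__":
    # "your_account" is a placeholder for your GENKAI supercomputer account.
    cmd = port_forward_command(8888, "genkai0001", 60000, "your_account")
    print(shlex.join(cmd))
    print(browser_url(8888))
```

Run the printed command in a terminal on your local computer and keep it open while you browse the experiments.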