Debian: Installing the Slurm Workload Manager

Introduction

SLURM (Simple Linux Utility for Resource Management) is a piece of resource-management software. I installed it to run multiple computations on the local host.

Method

Install with aptitude

% lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.0 (jessie)
Release:        8.0
Codename:       jessie
% su
# aptitude install slurm-wlm

The dependency munge (an authentication service) is installed along with it.

Configuring munge

Key generation

# /usr/sbin/create-munge-key
Generating a pseudo-random key using /dev/urandom completed.

See the munge InstallationGuide for other ways to generate the key.
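One alternative described there is writing 1 KiB of random data with dd. The sketch below uses a temporary path for illustration; on the actual host the target would be /etc/munge/munge.key.

```shell
# Sketch: key generation via dd, an alternative to create-munge-key.
# A temporary path is used here for illustration only; on the host
# itself the target is /etc/munge/munge.key.
keyfile=$(mktemp)
dd if=/dev/urandom of="$keyfile" bs=1024 count=1 2>/dev/null
chmod 400 "$keyfile"
ls -l "$keyfile"
```

The guide also allows /dev/random for higher-quality entropy, at the cost of blocking until enough is available.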

Changing permissions

# chmod 700 /etc/munge/
# chmod 711 /var/lib/munge/
# chmod 700 /var/log/munge/
# chmod 755 /var/run/munge/
# chmod 400 /etc/munge/munge.key
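The modes can be double-checked afterwards with a small audit script; this is only a sketch against the paths used above, to be run as root on the actual host.

```shell
#!/bin/sh
# Sketch: report any path whose mode differs from the targets set above.
want() {
  mode=$(stat -c %a "$1" 2>/dev/null) || { echo "$1: missing"; return; }
  [ "$mode" = "$2" ] || echo "$1: want $2, got $mode"
}
want /etc/munge           700
want /var/lib/munge       711
want /var/log/munge       700
want /var/run/munge       755
want /etc/munge/munge.key 400
```

Silence means every path matched.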

Start

# /etc/init.d/munge start
[ ok ] Starting munge (via systemctl): munge.service.

Test

# munge -n | unmunge
STATUS:           Success (0)
ENCODE_HOST:      LOCALHOST (127.0.1.1)
ENCODE_TIME:      2015-05-30 13:12:57 +0900 (1432959177)
DECODE_TIME:      2015-05-30 13:12:57 +0900 (1432959177)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0

# remunge; echo $?
2015-05-30 13:13:40 Spawning 1 thread for encoding
2015-05-30 13:13:40 Processing credentials for 1 second
2015-05-30 13:13:41 Processed 14268 credentials in 1.000s (14264 creds/sec)
0

SLURM

Configuration

Following /usr/share/doc/slurm-wlm/README.Debian, open /usr/share/doc/slurmctld/slurm-wlm-configurator.html, fill in the settings, and create /etc/slurm-llnl/slurm.conf from the result. Alternatively, copy /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz and edit it.

The items in /usr/share/doc/slurmctld/slurm-wlm-configurator.html and /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz differ somewhat; this time /etc/slurm-llnl/slurm.conf was created from the latter.

Control Machines

ControlMachine
Set to the hostname ($HOST)

Compute Machines

NodeName
Set to localhost
CPUs
Set to $(nproc) - 2 = 6 so that other work can continue alongside jobs
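The value 6 comes straight from the core count; a one-liner with nproc (GNU coreutils) reproduces the arithmetic. The result naturally differs on machines with a different number of cores.

```shell
# Sketch: CPUs = total cores minus two, leaving headroom for other work.
# On the machine above nproc reports 8, giving CPUs=6.
cpus=$(( $(nproc) - 2 ))
echo "CPUs=$cpus"
```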

Resource Selection

SelectType
Set to Cons_res (written as select/cons_res in slurm.conf) so that multiple jobs can run on a single node
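Putting the choices above together, the relevant lines of /etc/slurm-llnl/slurm.conf look roughly like the fragment below. The hostname is a placeholder, and SelectTypeParameters is an assumption: cons_res requires a CR_* consumable-resource value, taken here to be CR_CPU.

```
ControlMachine=myhost                  # placeholder for the real $HOST
NodeName=localhost CPUs=6 State=UNKNOWN
SelectType=select/cons_res
SelectTypeParameters=CR_CPU            # assumption; cons_res needs a CR_* value
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
```

The debug partition name matches the sinfo output shown below.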

Start

# /etc/init.d/slurmd start
[ ok ] Starting slurmd (via systemctl): slurmd.service.
# /etc/init.d/slurmctld start
[ ok ] Starting slurmctld (via systemctl): slurmctld.service.
# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle localhost

Test

Create test.sh and submit it:

#!/bin/sh
#SBATCH -J test         # job name
#SBATCH -o test-%j.dat  # standard output
#SBATCH -e test-%j.err  # standard error
#SBATCH -n 1            # number of tasks

echo ${SLURM_JOB_NAME}
echo ${SLURM_JOB_ID} >&2

% sbatch test.sh
% cat test-2.dat
test
% cat test-2.err
2
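The script body can also be exercised without SLURM by supplying the job variables by hand. A sketch, with the name (test) and id (2) taken from the run above:

```shell
# Sketch: run test.sh's body outside SLURM by setting the variables
# the scheduler would provide (values taken from the job above).
SLURM_JOB_NAME=test SLURM_JOB_ID=2 sh -c '
  echo ${SLURM_JOB_NAME}
  echo ${SLURM_JOB_ID} >&2
' > test-2.dat 2> test-2.err
cat test-2.dat   # test
cat test-2.err   # 2
```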

References