identity service
VPN
Console gateway
Compute cluster · 3 nodes
Cluster storage · OK
SLURM · GPU queue near limit
Backup service
Metrics · monitoring
Profile E · Big Data stack
bulk-import · profile E · directory + VM + VPN + console — all ready
8 min ago
⟳
Built image profile-f-nn v1 — moved to testing channel
Image builder · yoel-admin
41 min ago
⚐
AI agent: proposal to free 14 idle machines
awaiting approval · estimated saving — 224 GB RAM
1 h ago
✓
Backup service · nightly backup of all VMs completed
02:13 · 251 machines · zstd · retention 7d/4w/6m
7 h ago
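The retention string in the backup entry above can be read as a grandfather-father-son schedule. The sketch below is an illustration only: it assumes "7d/4w/6m" means keeping the newest 7 daily, 4 weekly and 6 monthly backups, and the helper name `backups_to_keep` is hypothetical, not part of the backup service.

```python
from datetime import date, timedelta


def backups_to_keep(backups, daily=7, weekly=4, monthly=6):
    """Return the set of backup dates kept under an assumed 7d/4w/6m policy."""
    ordered = sorted(set(backups), reverse=True)  # newest first
    keep = set(ordered[:daily])                   # the newest N daily backups

    def newest_per_bucket(bucket_of, limit):
        # Keep the newest backup in each of the `limit` most recent buckets.
        seen = []
        for day in ordered:
            bucket = bucket_of(day)
            if bucket not in seen:
                if len(seen) == limit:
                    break
                seen.append(bucket)
                keep.add(day)

    newest_per_bucket(lambda d: tuple(d.isocalendar())[:2], weekly)  # (ISO year, week)
    newest_per_bucket(lambda d: (d.year, d.month), monthly)
    return keep


# Hypothetical history: one backup per night for 400 nights.
last_night = date(2026, 3, 1)
history = [last_night - timedelta(days=n) for n in range(400)]
kept = backups_to_keep(history)
print(len(kept), "of", len(history), "backups kept")
```

Because the three tiers overlap (the newest backup is also the newest of its week and month), the kept set is usually smaller than 7 + 4 + 6.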
Identity source. Creating a user runs the full pipeline: account → machine → network → access.
+ Create user
⬆ Import CSV
251 users · 7 profiles · directory
User | Login | Role | Profile | Machine | Status | Last login
Ron Azulai | ron.azulai | student | C · System Programming | vm-ron.azulai-c | active | today, 09:14
Rivka Gad | rivka.gad | student | B · Networks | vm-rivka.gad-b | active | today, 08:52
Abraham Lincoln | abraham.lincoln | professor | C · System Programming | vm-abraham.lincoln-prof-c | active | yesterday, 17:30
Noa Rigel | noa.rigel | student | E · Big Data | vm-noa.rigel-e | machine off | 2 days ago
Raul Cohen | raul.cohen | student | A · Operating Systems | vm-raul.cohen-a | active | today, 10:01
Yoel Mangubi | ymangubi | student | B · Networks | vm-ymangubi-b | first login — password change | —
Last 100 user operations (create / reset / bulk). Generated passwords are kept here even if the browser lost the create-page render. VPN configs come from the VPN portal, which requires an admin login.
★ Key new capability: images used to be built manually via the console. Click “Create image” to walk through the wizard.
+ Create image
7 images · build engine: Packer
🧱 Golden image builder
✕ Close
1 Base
2 Name
3 Packages
4 Configuration
5 Security
6 Manifest
Step 1 · OS and image base
The builder is multi-platform — an image can be Linux or Windows. Build from scratch or inherit from an existing one (the child gets all the parent’s packages and settings).
🐧 Linux · profile-a-os — Operating Systems (inherit)
Ubuntu 24.04 · desktop + dev stack · v3 · production channel
🐧 Linux · golden-base — clean Ubuntu 24.04 LTS
Bare minimum, no desktop
🪟 Windows · win-11-base — Windows 11
Windows ISO + autounattend.xml · RDP access via gateway
🪟 Windows · win-server-base — Windows Server 2025
For server and engineering courses
Step 2 · Image name and code
The code is used as the profile identifier when provisioning machines.
Step 3 · Packages
The catalog below is a set of suggestions, not a restriction. Any program a lecturer asks for can be added in the field below. Parent packages are already included.
Course directories live under /lab-data/courses/<faculty>/<code>/. Each course gets materials, assignments and submissions directories (plus datasets for E/F/G), with group ACLs (course-<faculty>-staff / course-<faculty>-students). Students of profile-<x> are auto-enrolled in the faculty's student group on VM creation.
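The directory convention above can be sketched as a small path builder. This is an illustration under stated assumptions: "E/F/G" is taken to mean profile letters, the function name `course_layout` and the example faculty and course code are hypothetical, and nothing is created on disk or in the ACLs.

```python
from pathlib import PurePosixPath

BASE = PurePosixPath("/lab-data/courses")
# Assumption: E, F and G are the profiles that also get a datasets directory.
DATASET_PROFILES = {"E", "F", "G"}


def course_layout(faculty: str, code: str, profile: str) -> dict:
    """Return the directories and ACL group names for one course."""
    root = BASE / faculty / code
    subdirs = ["materials", "assignments", "submissions"]
    if profile.upper() in DATASET_PROFILES:
        subdirs.append("datasets")
    return {
        "root": str(root),
        "dirs": [str(root / name) for name in subdirs],
        "staff_group": f"course-{faculty}-staff",
        "students_group": f"course-{faculty}-students",
    }


# Hypothetical faculty "cs" and course code "bigdata101" on profile E.
layout = course_layout("cs", "bigdata101", "E")
for path in layout["dirs"]:
    print(path)
print(layout["staff_group"], layout["students_group"])
```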
+ Create course
New course ✕
Create
Faculty
Code
Path
Datasets
Actions
Access is managed in two layers: access policy (who can log into which machine) + virtualization RBAC (who manages which VM). The dashboard merges them into one matrix.
+ Create role
One change applies to all identities, networks and machines
Who can do what
visual RBAC editor
Capability | Student | Professor | Administrator
Log into own machine (HBAC) | ✓ | ✓ | ✓
Log into another student’s machine | — | ✓ | ✓
Power-manage own machine | ✓ | ✓ | ✓
Create / delete users | — | — | ✓
Build golden image | — | — | ✓
Access infrastructure VMs | — | — | ✓
View audit log | — | — | ✓
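The capability matrix above maps naturally onto a lookup table. The sketch below transcribes the rows as shown; the `can` helper and the deny-by-default behaviour for unknown capabilities are assumptions for illustration, not the dashboard's actual RBAC engine.

```python
# Capability -> roles that hold it, transcribed from the matrix above.
MATRIX = {
    "Log into own machine (HBAC)": {"student", "professor", "administrator"},
    "Log into another student's machine": {"professor", "administrator"},
    "Power-manage own machine": {"student", "professor", "administrator"},
    "Create / delete users": {"administrator"},
    "Build golden image": {"administrator"},
    "Access infrastructure VMs": {"administrator"},
    "View audit log": {"administrator"},
}


def can(role: str, capability: str) -> bool:
    """Return True if the role holds the capability; unknown capabilities are denied."""
    return role.lower() in MATRIX.get(capability, set())


print(can("Professor", "Log into another student's machine"))
print(can("Student", "View audit log"))
```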
“Student” role: 218 users
“Professor” role: 12 users
“Administrator” role: 3 users
★ Unified overview of the whole lab system. Each service is a real component of the lab, visible and managed from one dashboard.
Single-node cluster: only one course can be in Teaching or Research at a time. Switch the active one to Off before activating another.
Active reservations
Name
Accounts
Nodes
Start
End
State
no reservations
—
+ New reservation (admin)
Refresh
Drain stops new jobs from scheduling on the node; running jobs continue.
Resume brings it back to service. Admin-only.
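In SLURM, draining and resuming a node are normally done with `scontrol update`. The sketch below only builds the command line and does not run it. Whether the dashboard calls `scontrol` this way is an assumption, and the helper name `node_state_command` and the example reason text are hypothetical.

```python
import shlex


def node_state_command(node: str, action: str, reason: str = "maintenance") -> list[str]:
    """Build the scontrol argument list for a drain or resume action."""
    if action == "drain":
        # SLURM expects a Reason when a node is set to DRAIN.
        return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]
    if action == "resume":
        return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]
    raise ValueError(f"unsupported action: {action!r}")


# Print the commands instead of executing them.
print(shlex.join(node_state_command("slurm-gpu", "drain", "GPU driver update")))
print(shlex.join(node_state_command("slurm-gpu", "resume")))
```

Returning an argument list rather than a shell string keeps the node name and reason from being interpreted by a shell if the list is later passed to `subprocess.run`.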
Node
State
Capacity
GPU
gres
—
Refresh
Active JupyterHub sessions on slurm-login. Each session is a SLURM job
spawned via batchspawner — killing it releases its GPU/MPS share immediately.
Spawn notebook for user:
Start
Calls JupyterHub batchspawner → SLURM job → student's GPU notebook, ready in <30 s
User
Role
Faculty
Profile
State
SLURM job
Started
Last activity
—
Refresh
Load
Export CSV
Total jobs
—
in period
Completed
—
COMPLETED state
Failed / Cancelled / Timeout
—
non-success
GPU-hours · MPS-hours
—
total resource use
By user
User
Jobs
GPU-h
MPS-h
Failed/Timeout
choose period and press Load
Jobs (top 50)
JobID
Name
User
QOS
State
Elapsed
TRES
no jobs loaded yet
QOS: cluster-wide quotas and limits for SLURM jobs. Priority affects scheduling order; MaxWall caps the run time per job; MaxTRES caps per-job resources. The Modify action is admin-only and is recorded in the audit log.
QOS
Name
Priority
MaxWall
Max GPU/user
GPU-hrs/wk
Flags
Per-user override (give one student/professor more or fewer GPUs / GPU-hours)
Apply override
Accounts (read-only — membership managed via Users pipeline)
F/G accounts are populated automatically by the Zenix create_user pipeline.
—
Refresh
Apptainer containers and kernel specs on the GPU host. Containers: the ML stack (PyTorch/TensorFlow with CUDA). Kernel specs: registered Jupyter kernels (selected by the student in the JupyterLab Launcher).
Apptainer containers (/opt/containers/)
Container
Path
Size
Modified
Registered Jupyter kernels
Display name
Slug
Container
Language
Path
—
Host: slurm-gpu
Refresh
What this tab controls
Defaults for interactive JupyterHub notebooks — applied to every student session
(how much GPU/CPU/RAM one notebook gets, idle timeout). This is not per-user.
• Per-course limits (GPU/time quota for bsc/msc/research) → QOS & Accounts tab.
• More GPUs for one specific student or professor → QOS & Accounts → Per-user override.
• Reserving GPU cards for a course or thesis on a schedule → Reservations tab.
Source: /opt/jupyterhub/etc/jupyterhub_config.py on slurm-login.
Save backs up the file, edits it with sed, and restarts jupyterhub. Expect about 10 s of downtime; active sessions lose kernel state. Admin only.
Per-session GPU allocation
Fraction of one L4 GPU compute via NVIDIA MPS. 25% = 4 students share one GPU; 100% = one student gets full GPU. Lower = more parallel users.
vCPU per session (out of 64 total on slurm-gpu). Affects data loading speed, not GPU compute. 64 / cores = max parallel users by CPU bottleneck.
RAM allocated per session (out of 480 GB total). Larger = bigger models / batches in memory. Rarely the bottleneck.
Max lifetime of an interactive JupyterHub session. After this, SLURM kills the kernel; files in /home and /models persist. Longer training should use sbatch (up to 48h via QOS).
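The capacity arithmetic in the descriptions above can be checked with a short calculation. The totals of 64 vCPU and 480 GB come from the text; the number of GPUs is not stated, so `gpu_count` is an assumed parameter, and the per-session values in the example are hypothetical.

```python
TOTAL_VCPU = 64      # vCPU on slurm-gpu, per the description above
TOTAL_RAM_GB = 480   # RAM on slurm-gpu, per the description above


def max_parallel_sessions(gpu_percent: int, vcpu: int, ram_gb: int, gpu_count: int = 1):
    """Return per-resource session limits and the name of the tightest one."""
    limits = {
        "gpu": (100 // gpu_percent) * gpu_count,  # e.g. 25% -> 4 sessions per GPU
        "cpu": TOTAL_VCPU // vcpu,
        "ram": TOTAL_RAM_GB // ram_gb,
    }
    bottleneck = min(limits, key=limits.get)
    return limits, bottleneck


# Hypothetical defaults: 25% of a GPU, 8 vCPU and 32 GB RAM per notebook, one GPU.
limits, bottleneck = max_parallel_sessions(gpu_percent=25, vcpu=8, ram_gb=32)
print(limits, "bottleneck:", bottleneck)
```

With these example values the GPU share is the limiting resource (4 sessions), while CPU would allow 8 and RAM 15, which matches the note that RAM is rarely the bottleneck.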
—
Timeouts
Auto-stop a session if no cell has been executed for this long. 1800 = 30 min. Releases GPU when a student forgets to close the browser. Lower = more aggressive GPU recycling.
Max time JupyterHub waits for the SLURM job + apptainer container + kernel to start. 300 = 5 min, enough for cold container start + busy queue.
How long the proxy waits for the running notebook to respond to an HTTP ping. If a cell runs longer than this without yielding, browser sees 504 Gateway Timeout.
—
Reload
Save & restart JupyterHub
Idle machines: 14 (CPU < 5% over 3 days)
Orphaned resources: 6 (VPN peers without a machine)
Can be reclaimed: 224 GB RAM + 56 CPU cores
Resource reclamation report
generated by AI agent · weekly
Object | Observation | Proposal | Action
vm-amit.bar-c | idle 3 days · CPU 1% | power off | Approve
11 student machines (course A — finished) | no session opened for 9 days | power off group | Approve
vm-dana.test-e | 16 GB allocated · 3 GB used | shrink to 8 GB | Approve
6 VPN peers | machine deleted, peer remains | delete peers | Approve
⬇ Export
Time | Who | Action | Object | Result
10:38:12 | yoel-admin | bulk-import | class “Big Data 2026” | success · 45
10:06:44 | yoel-admin | image-build | profile-f-nn v1 | success
09:21:03 | AI agent | reset-password (auto) | ron.azulai | success
08:55:17 | admin | rbac-modify | “Professor” role | success
02:13:40 | system | backup-daily | 251 VMs | success
yesterday 14:02 | AI agent | idle-scan | whole cluster | 14 findings
The AI agent is the platform’s last layer (Phase 6). It works on a “proposes — human approves” model. This is a preview.
🟢 Auto
Low risk, allowed by policy: acts on its own and writes to the audit log
🟡 Proposal
Medium risk — prepares a proposal, waits for approval
🔴 Escalation
Complex case — hands to a human with a summary
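The three tiers above amount to a routing rule. The sketch below is one possible reading, not the agent's actual logic: the risk labels, the `route` function, and the choice to send a low-risk action that policy does not allow to the proposal tier are all assumptions.

```python
from enum import Enum


class Tier(Enum):
    AUTO = "auto"              # acts itself, writes to audit
    PROPOSAL = "proposal"      # prepares a proposal, waits for approval
    ESCALATION = "escalation"  # hands to a human with a summary


def route(risk: str, allowed_by_policy: bool) -> Tier:
    """Pick a tier from a risk label ('low', 'medium', anything else = complex)."""
    if risk == "low" and allowed_by_policy:
        return Tier.AUTO
    if risk in ("low", "medium"):
        # Assumption: low risk without policy clearance still needs approval.
        return Tier.PROPOSAL
    return Tier.ESCALATION


print(route("low", True).value)      # e.g. a routine password reset
print(route("medium", True).value)   # e.g. freeing idle machines
print(route("complex", False).value)
```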
Incoming proposals PREVIEW
🟡 proposal · Free 14 idle machines · cluster scan
14 machines with CPU load < 5% for over 3 days; no session in 9 days. Frees ≈ 224 GB RAM and 56 cores. Machines tagged keep-alive are excluded.
✓ Approve
✕ Reject
🟡 proposal · Request: “need a profile B machine for a new student” · email · 11:04
Classified as “onboarding”. Requester — professor of course B (identity verified). Pipeline ready: identity + profile-b machine + VPN + access. Run it?
✓ Approve
✕ Reject
🟢 done auto · Password reset · ron.azulai · 09:21
Repetitive low-risk request, identity verified — executed automatically within policy. New password sent to the user, action recorded in audit.
Your personal workspace VM. Click “Open console” to launch the console gateway HTML5 client, and close the browser tab when done. The VM keeps running, so your work and files persist. Files are mounted via NFS and survive even if the VM is re-provisioned.
—
Refresh
Change password
★ Instructor view. An instructor sees only their own course: they can open a course, upload a class roster CSV, and see their students. Infrastructure tabs are not available to them.