kblaunch

kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.

Commands

  • launch: Launch GPU jobs with various configurations
  • monitor: Monitor GPU usage and job statistics (gpus, users, jobs, queue)
  • setup: Configure user preferences, persistent volumes, and Git authentication

Features

  • Interactive and batch job support
  • GPU resource management and constraints
  • Environment variable handling from multiple sources
  • Persistent Volume Claims (PVC) for storage
  • Git SSH authentication
  • VS Code integration with remote tunneling
  • Slack notifications for job status
  • Real-time cluster monitoring

Resource Types

  • A100 GPUs (40GB and 80GB variants)
  • H100 GPUs (80GB variant)
  • MIG GPU instances
  • CPU and RAM allocation
  • Persistent storage volumes

Job Priority Classes

  • default: Standard priority for most workloads
  • batch: Lower priority for long-running jobs
  • short: High priority for quick jobs (with GPU constraints)

Environment Integration

  • Kubernetes secrets
  • Local environment variables
  • .env file support
  • SSH key management
  • NFS workspace mounting
"""kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.

## Commands
* `launch`: Launch GPU jobs with various configurations
* `monitor`: Monitor GPU usage and job statistics (gpus, users, jobs, queue)
* `setup`: Configure user preferences, persistent volumes, and Git authentication

## Features
* Interactive and batch job support
* GPU resource management and constraints
* Environment variable handling from multiple sources
* Persistent Volume Claims (PVC) for storage
* Git SSH authentication
* VS Code integration with remote tunneling
* Slack notifications for job status
* Real-time cluster monitoring

## Resource Types
* A100 GPUs (40GB and 80GB variants)
* H100 GPUs (80GB variant)
* MIG GPU instances
* CPU and RAM allocation
* Persistent storage volumes

## Job Priority Classes
* default: Standard priority for most workloads
* batch: Lower priority for long-running jobs
* short: High priority for quick jobs (with GPU constraints)

## Environment Integration
* Kubernetes secrets
* Local environment variables
* .env file support
* SSH key management
* NFS workspace mounting
"""

import importlib.metadata

__version__ = importlib.metadata.version("kblaunch")

__all__ = [
    "setup",
    "launch",
    "monitor_gpus",
    "monitor_users",
    "monitor_jobs",
    "monitor_queue",
]

from .cli import setup, launch, monitor_gpus, monitor_users, monitor_jobs, monitor_queue
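Note that the bare `importlib.metadata.version("kblaunch")` call above raises `PackageNotFoundError` when the distribution metadata is absent (e.g. when running from an uninstalled source checkout). A common guard for that case, shown here as a general sketch rather than kblaunch's actual behavior:

```python
import importlib.metadata


def get_version(dist_name: str, fallback: str = "0.0.0+unknown") -> str:
    """Resolve an installed distribution's version, falling back gracefully
    when the package metadata is not installed."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return fallback


print(get_version("kblaunch-not-installed-anywhere"))  # prints "0.0.0+unknown"
```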
@app.command()
def setup():
@app.command()
def setup():
    """
    # `kblaunch setup`

    Interactive setup wizard for kblaunch configuration.
    No arguments - all configuration is done through interactive prompts.

    This command walks users through the initial setup process, configuring:
    - User identity and email
    - Slack notifications webhook
    - Persistent Volume Claims (PVC) for storage
    - Git SSH authentication

    The configuration is stored in ~/.cache/.kblaunch/config.json.

    Configuration includes:
    - User: Kubernetes username for job ownership
    - Email: User email for notifications and Git configuration
    - Slack webhook: URL for job status notifications
    - PVC: Persistent storage configuration
    - Git SSH: Authentication for private repositories
    """
    config = load_config()

    # validate user
    default_user = os.getenv("USER")
    if "user" in config:
        default_user = config["user"]
    else:
        config["user"] = default_user

    if typer.confirm(
        f"Would you like to set the user? (default: {default_user})", default=False
    ):
        user = typer.prompt("Please enter your user", default=default_user)
        config["user"] = user

    # Get email
    existing_email = config.get("email", None)
    email = typer.prompt(
        f"Please enter your email (existing: {existing_email})", default=existing_email
    )
    config["email"] = email

    # Get Slack webhook
    if typer.confirm("Would you like to set up Slack notifications?", default=False):
        existing_webhook = config.get("slack_webhook", None)
        webhook = typer.prompt(
            f"Enter your Slack webhook URL (existing: {existing_webhook})",
            default=existing_webhook,
        )
        config["slack_webhook"] = webhook

    if typer.confirm("Would you like to use a PVC?", default=False):
        user = config["user"]
        current_default = config.get("default_pvc", f"{user}-pvc")

        pvc_name = typer.prompt(
            f"Enter the PVC name to use (default: {current_default}). We will help you create it if it does not exist.",
            default=current_default,
        )

        if check_if_pvc_exists(pvc_name):
            if typer.confirm(
                f"Would you like to set {pvc_name} as the default PVC?",
                default=True,
            ):
                config["default_pvc"] = pvc_name
        else:
            if typer.confirm(
                f"PVC '{pvc_name}' does not exist. Would you like to create it?",
                default=True,
            ):
                pvc_size = typer.prompt(
                    "Enter the desired PVC size (e.g. 10Gi)", default="10Gi"
                )
                try:
                    if create_pvc(user, pvc_name, pvc_size):
                        config["default_pvc"] = pvc_name
                except (ValueError, ApiException) as e:
                    logger.error(f"Failed to create PVC: {e}")

    # Git authentication setup
    if typer.confirm("Would you like to set up Git SSH authentication?", default=False):
        default_key_path = str(Path.home() / ".ssh" / "id_rsa")
        key_path = typer.prompt(
            "Enter the path to your SSH private key",
            default=default_key_path,
        )
        secret_name = f"{config['user']}-git-ssh"
        if create_git_secret(secret_name, key_path):
            config["git_secret"] = secret_name

    # validate slack webhook
    if "slack_webhook" in config:
        # test post to slack
        try:
            logger.info("Sending test message to Slack")
            message = "Hello :wave: from ```kblaunch```"
            response = requests.post(
                config["slack_webhook"],
                json={"text": message},
            )
            response.raise_for_status()
        except Exception as e:
            logger.error(f"Error sending test message to Slack: {e}")

    # Save config
    save_config(config)
    logger.info(f"Configuration saved to {CONFIG_FILE}")

kblaunch setup

Interactive setup wizard for kblaunch configuration. No arguments - all configuration is done through interactive prompts.

This command walks users through the initial setup process, configuring:

  • User identity and email
  • Slack notifications webhook
  • Persistent Volume Claims (PVC) for storage
  • Git SSH authentication

The configuration is stored in ~/.cache/.kblaunch/config.json.

Configuration includes:

  • User: Kubernetes username for job ownership
  • Email: User email for notifications and Git configuration
  • Slack webhook: URL for job status notifications
  • PVC: Persistent storage configuration
  • Git SSH: Authentication for private repositories
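The wizard persists its answers through `load_config` and `save_config`, whose bodies are not shown on this page. A minimal sketch of what a JSON-file-backed implementation at `~/.cache/.kblaunch/config.json` could look like (the function bodies here are an assumption; only the path and names come from the documentation above):

```python
import json
from pathlib import Path

# Path stated in the setup docs; treated as a constant here.
CONFIG_FILE = Path.home() / ".cache" / ".kblaunch" / "config.json"


def load_config(path: Path = CONFIG_FILE) -> dict:
    """Return the stored configuration, or an empty dict on first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def save_config(config: dict, path: Path = CONFIG_FILE) -> None:
    """Write the configuration, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
```

With this shape, every `setup` run round-trips the same dict: keys like `user`, `email`, `slack_webhook`, `default_pvc`, and `git_secret` accumulate across runs.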
@app.command()
def launch(email: str = None, job_name: str = ..., docker_image: str = "nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04", namespace: str = "informatics", queue_name: str = "informatics-user-queue", interactive: bool = False, command: str = "", cpu_request: str = "1", ram_request: str = "8Gi", gpu_limit: int = 1, gpu_product: kblaunch.cli.GPU_PRODUCTS = "NVIDIA-A100-SXM4-40GB", secrets_env_vars: list[str] = [], local_env_vars: list[str] = [], load_dotenv: bool = True, nfs_server: str = NFS_SERVER, pvc_name: str = None, dry_run: bool = False, priority: kblaunch.cli.PRIORITY = "default", vscode: bool = False, tunnel: bool = False, startup_script: str = None):
@app.command()
def launch(
    email: str = typer.Option(None, help="User email (overrides config)"),
    job_name: str = typer.Option(..., help="Name of the Kubernetes job"),
    docker_image: str = typer.Option(
        "nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04", help="Docker image"
    ),
    namespace: str = typer.Option("informatics", help="Kubernetes namespace"),
    queue_name: str = typer.Option("informatics-user-queue", help="Kueue queue name"),
    interactive: bool = typer.Option(False, help="Run in interactive mode"),
    command: str = typer.Option(
        "", help="Command to run in the container"
    ),  # Made optional
    cpu_request: str = typer.Option("1", help="CPU request"),
    ram_request: str = typer.Option("8Gi", help="RAM request"),
    gpu_limit: int = typer.Option(1, help="GPU limit (0 for non-GPU jobs)"),
    gpu_product: GPU_PRODUCTS = typer.Option(
        "NVIDIA-A100-SXM4-40GB",
        help="GPU product type to use (ignored for non-GPU jobs)",
        show_choices=True,
        show_default=True,
    ),
    secrets_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of secret environment variables to export to the container",
    ),
    local_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of local environment variables to export to the container",
    ),
    load_dotenv: bool = typer.Option(
        True, help="Load environment variables from .env file"
    ),
    nfs_server: str = typer.Option(NFS_SERVER, help="NFS server"),
    pvc_name: str = typer.Option(None, help="Persistent Volume Claim name"),
    dry_run: bool = typer.Option(False, help="Dry run"),
    priority: PRIORITY = typer.Option(
        "default", help="Priority class name", show_default=True, show_choices=True
    ),
    vscode: bool = typer.Option(False, help="Install VS Code CLI in the container"),
    tunnel: bool = typer.Option(
        False,
        help="Start a VS Code SSH tunnel on startup. Requires SLACK_WEBHOOK and --vscode",
    ),
    startup_script: str = typer.Option(
        None, help="Path to startup script to run in container"
    ),
):
    """
    # `kblaunch launch`
    Launch a Kubernetes job with specified configuration.

    This command creates and deploys a Kubernetes job with the given specifications,
    handling GPU allocation, resource requests, and environment setup.

    Args:
    * email (str, optional): User email for notifications
    * job_name (str, required): Name of the Kubernetes job
    * docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
    * namespace (str, default="informatics"): Kubernetes namespace
    * queue_name (str, default="informatics-user-queue"): Kueue queue name
    * interactive (bool, default=False): Run in interactive mode
    * command (str, default=""): Command to run in container
    * cpu_request (str, default="1"): CPU cores request
    * ram_request (str, default="8Gi"): RAM request
    * gpu_limit (int, default=1): Number of GPUs
    * gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
    * secrets_env_vars (List[str], default=[]): Secret environment variables
    * local_env_vars (List[str], default=[]): Local environment variables
    * load_dotenv (bool, default=True): Load .env file
    * nfs_server (str): NFS server IP
    * pvc_name (str, optional): PVC name
    * dry_run (bool, default=False): Print YAML only
    * priority (PRIORITY, default="default"): Job priority
    * vscode (bool, default=False): Install VS Code
    * tunnel (bool, default=False): Start VS Code tunnel
    * startup_script (str, optional): Path to startup script

    Examples:
        ```bash
        # Launch an interactive GPU job
        kblaunch launch --job-name test-job --interactive

        # Launch a batch GPU job with custom command
        kblaunch launch --job-name batch-job --command "python train.py"

        # Launch a CPU-only job
        kblaunch launch --job-name cpu-job --gpu-limit 0

        # Launch with VS Code support
        kblaunch launch --job-name dev-job --interactive --vscode --tunnel
        ```

    Notes:
    - Interactive jobs keep running until manually terminated
    - GPU jobs require appropriate queue and priority settings
    - VS Code tunnel requires Slack webhook configuration
    """
    # Load config
    config = load_config()

    # Use email from config if not provided
    if email is None:
        email = config.get("email")
        if email is None:
            raise typer.BadParameter(
                "Email not provided and not found in config. "
                "Please provide --email or run 'kblaunch setup'"
            )

    # Add SLACK_WEBHOOK to local_env_vars if configured
    if "slack_webhook" in config:
        os.environ["SLACK_WEBHOOK"] = config["slack_webhook"]
        if "SLACK_WEBHOOK" not in local_env_vars:
            local_env_vars.append("SLACK_WEBHOOK")

    if "user" in config and os.getenv("USER") is None:
        os.environ["USER"] = config["user"]

    if pvc_name is None:
        pvc_name = config.get("default_pvc")

    if pvc_name is not None:
        if not check_if_pvc_exists(pvc_name):
            logger.error(f"Provided PVC '{pvc_name}' does not exist")
            return

    # Add validation for command parameter
    if not interactive and command == "":
        raise typer.BadParameter("--command is required when not in interactive mode")

    # Validate GPU constraints only if requesting GPUs
    if gpu_limit > 0:
        try:
            validate_gpu_constraints(gpu_product.value, gpu_limit, priority.value)
        except ValueError as e:
            raise typer.BadParameter(str(e))

    is_completed = check_if_completed(job_name, namespace=namespace)
    if not is_completed:
        if typer.confirm(
            f"Job '{job_name}' already exists. Do you want to delete it and create a new one?",
            default=False,
        ):
            if not delete_namespaced_job_safely(
                job_name,
                namespace=namespace,
                user=config.get("user"),
            ):
                logger.error("Failed to delete existing job")
                return 1
        else:
            logger.info("Operation cancelled by user")
            return 1

    logger.info(f"Job '{job_name}' is completed. Launching a new job.")

    # Get local environment variables
    env_vars_dict = get_env_vars(
        local_env_vars=local_env_vars,
        load_dotenv=load_dotenv,
    )

    # Add USER and GIT_EMAIL to env_vars if git_secret is configured
    if config.get("git_secret"):
        env_vars_dict["USER"] = config.get("user", os.getenv("USER", "unknown"))
        env_vars_dict["GIT_EMAIL"] = email

    secrets_env_vars_dict = get_secret_env_vars(
        secrets_names=secrets_env_vars,
        namespace=namespace,
    )

    # Check for overlapping keys in local and secret environment variables
    intersection = set(secrets_env_vars_dict.keys()).intersection(env_vars_dict.keys())
    if intersection:
        logger.warning(
            f"Overlapping keys in local and secret environment variables: {intersection}"
        )
    # Combine the environment variables
    union = set(secrets_env_vars_dict.keys()).union(env_vars_dict.keys())

    # Handle startup script
    script_content = None
    if startup_script:
        script_content = read_startup_script(startup_script)
        # Create ConfigMap for startup script
        try:
            api = client.CoreV1Api()
            config_map = client.V1ConfigMap(
                metadata=client.V1ObjectMeta(
                    name=f"{job_name}-startup", namespace=namespace
                ),
                data={"startup.sh": script_content},
            )
            try:
                api.create_namespaced_config_map(namespace=namespace, body=config_map)
            except ApiException as e:
                if e.status == 409:  # Already exists
                    api.patch_namespaced_config_map(
                        name=f"{job_name}-startup", namespace=namespace, body=config_map
                    )
                else:
                    raise
        except Exception as e:
            raise typer.BadParameter(f"Failed to create startup script ConfigMap: {e}")

    if interactive:
        cmd = "while true; do sleep 60; done;"
    else:
        cmd = command
        logger.info(f"Command: {cmd}")

    logger.info(f"Creating job for: {cmd}")

    # Modify command to include startup script
    if script_content:
        cmd = f"bash /startup.sh && {cmd}"

    # Build the start command with optional VS Code installation
    start_command = send_message_command(union)
    if config.get("git_secret"):
        start_command += setup_git_command()
    if vscode:
        start_command += install_vscode_command()
        if tunnel:
            start_command += start_vscode_tunnel_command(union)
    elif tunnel:
        logger.error("Cannot start tunnel without VS Code installation")

    full_cmd = start_command + cmd

    job = KubernetesJob(
        name=job_name,
        cpu_request=cpu_request,
        ram_request=ram_request,
        image=docker_image,
        gpu_type="nvidia.com/gpu" if gpu_limit > 0 else None,
        gpu_limit=gpu_limit,
        gpu_product=gpu_product.value if gpu_limit > 0 else None,
        command=["/bin/bash", "-c", "--"],
        args=[full_cmd],
        env_vars=env_vars_dict,
        secret_env_vars=secrets_env_vars_dict,
        user_email=email,
        namespace=namespace,
        kueue_queue_name=queue_name,
        nfs_server=nfs_server,
        pvc_name=pvc_name,
        priority=priority.value,
        startup_script=script_content,
        git_secret=config.get("git_secret"),
    )
    job_yaml = job.generate_yaml()
    logger.info(job_yaml)
    # Run the Job on the Kubernetes cluster
    if not dry_run:
        job.run()
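The startup-script handling above uses the standard create-then-patch-on-409 idiom against the Kubernetes API. The same create-or-update shape, reduced to a dict-backed store so it runs without a cluster (purely illustrative; the comments map each branch to the API call it stands in for):

```python
def upsert(store: dict, name: str, data: dict) -> str:
    """Create-or-update idiom: create if the object is absent,
    patch it when creation would conflict."""
    if name not in store:
        store[name] = dict(data)   # stands in for create_namespaced_config_map
        return "created"
    store[name].update(data)       # stands in for patch_namespaced_config_map on HTTP 409
    return "patched"
```

This keeps `launch` idempotent for startup scripts: rerunning the same job name refreshes the existing `{job_name}-startup` ConfigMap instead of failing on the conflict.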

kblaunch launch

Launch a Kubernetes job with specified configuration.

This command creates and deploys a Kubernetes job with the given specifications, handling GPU allocation, resource requests, and environment setup.

Args:

  • email (str, optional): User email for notifications
  • job_name (str, required): Name of the Kubernetes job
  • docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
  • namespace (str, default="informatics"): Kubernetes namespace
  • queue_name (str, default="informatics-user-queue"): Kueue queue name
  • interactive (bool, default=False): Run in interactive mode
  • command (str, default=""): Command to run in container
  • cpu_request (str, default="1"): CPU cores request
  • ram_request (str, default="8Gi"): RAM request
  • gpu_limit (int, default=1): Number of GPUs
  • gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
  • secrets_env_vars (List[str], default=[]): Secret environment variables
  • local_env_vars (List[str], default=[]): Local environment variables
  • load_dotenv (bool, default=True): Load .env file
  • nfs_server (str): NFS server IP
  • pvc_name (str, optional): PVC name
  • dry_run (bool, default=False): Print YAML only
  • priority (PRIORITY, default="default"): Job priority
  • vscode (bool, default=False): Install VS Code
  • tunnel (bool, default=False): Start VS Code tunnel
  • startup_script (str, optional): Path to startup script

Examples:

# Launch an interactive GPU job
kblaunch launch --job-name test-job --interactive

# Launch a batch GPU job with custom command
kblaunch launch --job-name batch-job --command "python train.py"

# Launch a CPU-only job
kblaunch launch --job-name cpu-job --gpu-limit 0

# Launch with VS Code support
kblaunch launch --job-name dev-job --interactive --vscode --tunnel

Notes:

  • Interactive jobs keep running until manually terminated
  • GPU jobs require appropriate queue and priority settings
  • VS Code tunnel requires Slack webhook configuration
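Local and secret environment variables are merged as shown in the source above: overlapping names are warned about, and the union of key names is what gets passed to `send_message_command`. That merge step in isolation, simplified from the launch logic (the sample variable names are illustrative):

```python
def combine_env_keys(local_vars: dict, secret_vars: dict) -> set:
    """Warn on keys defined both locally and as secrets, then return the
    union of all variable names."""
    overlap = set(secret_vars) & set(local_vars)
    if overlap:
        print(f"Overlapping keys in local and secret environment variables: {sorted(overlap)}")
    return set(secret_vars) | set(local_vars)


keys = combine_env_keys(
    {"SLACK_WEBHOOK": "https://hooks.example", "USER": "alice"},
    {"USER": "from-secret", "WANDB_API_KEY": "xxxx"},
)
# "USER" triggers the overlap warning; keys is the three-name union
```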
@monitor_app.command('gpus')
def monitor_gpus(namespace: str = "informatics"):
@monitor_app.command("gpus")
def monitor_gpus(
    namespace: str = typer.Option("informatics", help="Kubernetes namespace"),
):
    """
    # `kblaunch monitor gpus`
    Display overall GPU statistics and utilization by type.

    Shows a comprehensive view of GPU allocation and usage across the cluster,
    including both running and pending GPU requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: informatics)

    Output includes:
    - Total GPU count by type
    - Running vs. pending GPUs
    - Details of pending GPU requests
    - Wait times for pending requests

    Examples:
        ```bash
        kblaunch monitor gpus
        kblaunch monitor gpus --namespace custom-namespace
        ```
    """
    try:
        print_gpu_total(namespace=namespace)
    except Exception as e:
        print(f"Error displaying GPU stats: {e}")

kblaunch monitor gpus

Display overall GPU statistics and utilization by type.

Shows a comprehensive view of GPU allocation and usage across the cluster, including both running and pending GPU requests.

Args:

  • namespace: Kubernetes namespace to monitor (default: informatics)

Output includes:

  • Total GPU count by type
  • Running vs. pending GPUs
  • Details of pending GPU requests
  • Wait times for pending requests

Examples:

kblaunch monitor gpus
kblaunch monitor gpus --namespace custom-namespace
@monitor_app.command('users')
def monitor_users(namespace: str = "informatics"):
@monitor_app.command("users")
def monitor_users(
    namespace: str = typer.Option("informatics", help="Kubernetes namespace"),
):
    """
    # `kblaunch monitor users`
    Display GPU usage statistics grouped by user.

    Provides a user-centric view of GPU allocation and utilization,
    helping identify resource usage patterns across users.

    Args:
    - namespace: Kubernetes namespace to monitor (default: informatics)

    Output includes:
    - GPUs allocated per user
    - Average memory usage per user
    - Inactive GPU count per user
    - Overall usage totals

    Examples:
        ```bash
        kblaunch monitor users
        kblaunch monitor users --namespace custom-namespace
        ```
    """
    try:
        print_user_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying user stats: {e}")

kblaunch monitor users

Display GPU usage statistics grouped by user.

Provides a user-centric view of GPU allocation and utilization, helping identify resource usage patterns across users.

Args:

  • namespace: Kubernetes namespace to monitor (default: informatics)

Output includes:

  • GPUs allocated per user
  • Average memory usage per user
  • Inactive GPU count per user
  • Overall usage totals

Examples:

kblaunch monitor users
kblaunch monitor users --namespace custom-namespace
@monitor_app.command('jobs')
def monitor_jobs(namespace: str = "informatics"):
@monitor_app.command("jobs")
def monitor_jobs(
    namespace: str = typer.Option("informatics", help="Kubernetes namespace"),
):
    """
    # `kblaunch monitor jobs`
    Display detailed job-level GPU statistics.

    Shows comprehensive information about all running GPU jobs,
    including resource usage and job characteristics.

    Args:
    - namespace: Kubernetes namespace to monitor (default: informatics)

    Output includes:
    - Job identification and ownership
    - Resource allocation (CPU, RAM, GPU)
    - GPU memory usage
    - Job status (active/inactive)
    - Job mode (interactive/batch)
    - Resource totals and averages

    Examples:
        ```bash
        kblaunch monitor jobs
        kblaunch monitor jobs --namespace custom-namespace
        ```
    """
    try:
        print_job_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying job stats: {e}")

kblaunch monitor jobs

Display detailed job-level GPU statistics.

Shows comprehensive information about all running GPU jobs, including resource usage and job characteristics.

Args:

  • namespace: Kubernetes namespace to monitor (default: informatics)

Output includes:

  • Job identification and ownership
  • Resource allocation (CPU, RAM, GPU)
  • GPU memory usage
  • Job status (active/inactive)
  • Job mode (interactive/batch)
  • Resource totals and averages

Examples:

kblaunch monitor jobs
kblaunch monitor jobs --namespace custom-namespace
@monitor_app.command('queue')
def monitor_queue(namespace: str = "informatics", reasons: bool = False):
@monitor_app.command("queue")
def monitor_queue(
    namespace: str = typer.Option("informatics", help="Kubernetes namespace"),
    reasons: bool = typer.Option(False, help="Display queued job event messages"),
):
    """
    # `kblaunch monitor queue`
    Display statistics about queued workloads.

    Shows information about jobs waiting in the Kueue scheduler,
    including wait times and resource requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: informatics)
    - reasons: Show detailed reason messages for queued jobs

    Output includes:
    - Queue position and wait time
    - Resource requests (CPU, RAM, GPU)
    - Job priority
    - Queueing reasons (if --reasons flag is used)

    Examples:
        ```bash
        kblaunch monitor queue
        kblaunch monitor queue --reasons
        kblaunch monitor queue --namespace custom-namespace
        ```
    """
    try:
        print_queue_stats(namespace=namespace, reasons=reasons)
    except Exception as e:
        print(f"Error displaying queue stats: {e}")

kblaunch monitor queue

Display statistics about queued workloads.

Shows information about jobs waiting in the Kueue scheduler, including wait times and resource requests.

Args:

  • namespace: Kubernetes namespace to monitor (default: informatics)
  • reasons: Show detailed reason messages for queued jobs

Output includes:

  • Queue position and wait time
  • Resource requests (CPU, RAM, GPU)
  • Job priority
  • Queueing reasons (if --reasons flag is used)

Examples:

kblaunch monitor queue
kblaunch monitor queue --reasons
kblaunch monitor queue --namespace custom-namespace