kblaunch

kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.

Commands

  • Launching GPU jobs with various configurations
  • Monitoring GPU usage and job statistics
  • Setting up user configurations and preferences
  • Managing persistent volumes and Git authentication

Features

  • Interactive and batch job support
  • GPU resource management and constraints
  • Environment variable handling from multiple sources
  • Persistent Volume Claims (PVC) for storage
  • Git SSH authentication
  • VS Code integration with remote tunneling
  • Slack notifications for job status
  • Real-time cluster monitoring

Resource Types

  • A100 GPUs (40GB and 80GB variants)
  • H100 GPUs (80GB variant)
  • MIG GPU instances
  • CPU and RAM allocation
  • Persistent storage volumes

Job Priority Classes

  • default: Standard priority for most workloads
  • batch: Lower priority for long-running jobs
  • short: High priority for quick jobs (with GPU constraints)

Environment Integration

  • Kubernetes secrets
  • Local environment variables
  • .env file support
  • SSH key management
  • NFS workspace mounting
 1"""kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.
 2
 3## Commands
 4* Launching GPU jobs with various configurations
 5* Monitoring GPU usage and job statistics
 6* Setting up user configurations and preferences
 7* Managing persistent volumes and Git authentication
 8
 9## Features
10* Interactive and batch job support
11* GPU resource management and constraints
12* Environment variable handling from multiple sources
13* Persistent Volume Claims (PVC) for storage
14* Git SSH authentication
15* VS Code integration with remote tunneling
16* Slack notifications for job status
17* Real-time cluster monitoring
18
19## Resource Types
20* A100 GPUs (40GB and 80GB variants)
21* H100 GPUs (80GB variant)
22* MIG GPU instances
23* CPU and RAM allocation
24* Persistent storage volumes
25
26## Job Priority Classes
27* default: Standard priority for most workloads
28* batch: Lower priority for long-running jobs
29* short: High priority for quick jobs (with GPU constraints)
30
31## Environment Integration
32* Kubernetes secrets
33* Local environment variables
34* .env file support
35* SSH key management
36* NFS workspace mounting
37"""
38
39import importlib.metadata
40
41__version__ = importlib.metadata.version("kblaunch")
42
43__all__ = [
44    "setup",
45    "launch",
46    "monitor_gpus",
47    "monitor_users",
48    "monitor_jobs",
49    "monitor_queue",
50]
51
52from .cli import setup, launch, monitor_gpus, monitor_users, monitor_jobs, monitor_queue
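`importlib.metadata.version` raises `PackageNotFoundError` when the distribution is not installed (for example, when running from a source checkout without `pip install -e .`). A hedged sketch of a defensive lookup; the helper name and fallback value are illustrative assumptions, not part of kblaunch:

```python
import importlib.metadata


def get_version(dist_name: str, fallback: str = "0.0.0+unknown") -> str:
    """Return the installed version of a distribution, or a fallback string."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        # Distribution metadata is missing, e.g. running from an uninstalled source tree
        return fallback


print(get_version("surely-not-a-real-distribution"))  # → 0.0.0+unknown
```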
@app.command()
def setup():
    """
    `kblaunch setup`

    Interactive setup wizard for kblaunch configuration.
    No arguments - all configuration is done through interactive prompts.

    This command walks users through the initial setup process, configuring:
    - User identity and email
    - Namespace and queue settings
    - Slack notifications webhook
    - Persistent Volume Claims (PVC) for storage
    - Git SSH authentication
    - NFS server configuration

    The configuration is stored in ~/.cache/.kblaunch/config.json.

    Configuration includes:
    - User: Kubernetes username for job ownership
    - Email: User email for notifications and Git configuration
    - Namespace: Kubernetes namespace for job deployment
    - Queue: Kueue queue name for job scheduling
    - Slack webhook: URL for job status notifications
    - PVC: Persistent storage configuration
    - Git SSH: Authentication for private repositories
    - NFS: Server address for mounting storage
    """
    config = load_config()

    # Validate user
    default_user = os.getenv("USER")
    if "user" in config:
        default_user = config["user"]
    else:
        config["user"] = default_user

    if typer.confirm(
        f"Would you like to set the user? (default: {default_user})", default=False
    ):
        user = typer.prompt("Please enter your user", default=default_user)
        config["user"] = user

    # Get email
    existing_email = config.get("email", None)
    email = typer.prompt(
        f"Please enter your email (existing: {existing_email})", default=existing_email
    )
    config["email"] = email

    # Configure namespace
    existing_namespace = config.get("namespace", os.getenv("KUBE_NAMESPACE"))
    if typer.confirm("Would you like to configure your namespace?", default=True):
        namespace = typer.prompt(
            f"Please enter your namespace (existing: {existing_namespace})",
            default=existing_namespace,
        )
        config["namespace"] = namespace
        # Now that we have the namespace, ask about the queue
        existing_queue = config.get("queue", get_user_queue(namespace))
        if typer.confirm("Would you like to configure your queue?", default=True):
            queue = typer.prompt(
                f"Please enter your queue name (existing: {existing_queue})",
                default=existing_queue or f"{namespace}-user-queue",
            )
            config["queue"] = queue

    # Get the current NFS server from config or default
    current_nfs = config.get("nfs_server", NFS_SERVER)
    if typer.confirm("Would you like to configure the NFS server?", default=False):
        nfs_server = typer.prompt(
            f"Enter your NFS server address (existing: {current_nfs})",
            default=current_nfs,
        )
        config["nfs_server"] = nfs_server

    # Get Slack webhook
    if typer.confirm("Would you like to set up Slack notifications?", default=False):
        existing_webhook = config.get("slack_webhook", None)
        webhook = typer.prompt(
            f"Enter your Slack webhook URL (existing: {existing_webhook})",
            default=existing_webhook,
        )
        config["slack_webhook"] = webhook

    if typer.confirm("Would you like to use a PVC?", default=False):
        user = config["user"]
        current_default = config.get("default_pvc", f"{user}-pvc")

        pvc_name = typer.prompt(
            f"Enter the PVC name to use (default: {current_default}). We will help you create it if it does not exist.",
            default=current_default,
        )

        namespace = config.get("namespace", get_current_namespace(config))
        if check_if_pvc_exists(pvc_name, namespace):
            if typer.confirm(
                f"Would you like to set {pvc_name} as the default PVC?",
                default=True,
            ):
                config["default_pvc"] = pvc_name
        else:
            if typer.confirm(
                f"PVC '{pvc_name}' does not exist. Would you like to create it?",
                default=True,
            ):
                pvc_size = typer.prompt(
                    "Enter the desired PVC size (e.g. 10Gi)", default="10Gi"
                )
                try:
                    if create_pvc(user, pvc_name, pvc_size, namespace):
                        config["default_pvc"] = pvc_name
                except (ValueError, ApiException) as e:
                    logger.error(f"Failed to create PVC: {e}")

    # Git authentication setup
    if typer.confirm("Would you like to set up Git SSH authentication?", default=False):
        default_key_path = str(Path.home() / ".ssh" / "id_rsa")
        key_path = typer.prompt(
            "Enter the path to your SSH private key",
            default=default_key_path,
        )
        secret_name = f"{config['user']}-git-ssh"
        namespace = config.get("namespace", get_current_namespace(config))
        if create_git_secret(secret_name, key_path, namespace):
            config["git_secret"] = secret_name

    # Validate the Slack webhook with a test post
    if "slack_webhook" in config:
        try:
            logger.info("Sending test message to Slack")
            message = "Hello :wave: from ```kblaunch```"
            response = requests.post(
                config["slack_webhook"],
                json={"text": message},
            )
            response.raise_for_status()
        except Exception as e:
            logger.error(f"Error sending test message to Slack: {e}")

    # Save config
    save_config(config)
    logger.info(f"Configuration saved to {CONFIG_FILE}")
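`load_config` and `save_config` are used throughout `setup` but not shown here. A minimal sketch of what a JSON-backed config store like this could look like; the path matches the docstring, but the helper bodies and the example contents are assumptions:

```python
import json
from pathlib import Path

# Path stated in the setup docstring
CONFIG_FILE = Path.home() / ".cache" / ".kblaunch" / "config.json"


def load_config(path: Path = CONFIG_FILE) -> dict:
    """Return the stored configuration, or an empty dict if none exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def save_config(config: dict, path: Path = CONFIG_FILE) -> None:
    """Write the configuration, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))


# A config produced by `kblaunch setup` might look like (values are illustrative):
# {"user": "jdoe", "email": "jdoe@example.com", "namespace": "ml-team",
#  "queue": "ml-team-user-queue", "default_pvc": "jdoe-pvc"}
```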

@app.command()
def launch(
    email: str = typer.Option(None, help="User email (overrides config)"),
    job_name: str = typer.Option(..., help="Name of the Kubernetes job"),
    docker_image: str = typer.Option(
        "nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04", help="Docker image"
    ),
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    queue_name: str = typer.Option(
        None, help="Kueue queue name (defaults to KUBE_USER_QUEUE)"
    ),
    interactive: bool = typer.Option(False, help="Run in interactive mode"),
    command: str = typer.Option(
        "", help="Command to run in the container (required unless --interactive)"
    ),
    cpu_request: str = typer.Option("6", help="CPU request"),
    ram_request: str = typer.Option("40Gi", help="RAM request"),
    gpu_limit: int = typer.Option(1, help="GPU limit (0 for non-GPU jobs)"),
    gpu_product: GPU_PRODUCTS = typer.Option(
        "NVIDIA-A100-SXM4-40GB",
        help="GPU product type to use (ignored for non-GPU jobs)",
        show_choices=True,
        show_default=True,
    ),
    secrets_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of secret environment variables to export to the container",
    ),
    local_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of local environment variables to export to the container",
    ),
    load_dotenv: bool = typer.Option(
        True, help="Load environment variables from .env file"
    ),
    nfs_server: Optional[str] = typer.Option(
        None, help="NFS server (overrides config and environment)"
    ),
    pvc_name: str = typer.Option(None, help="Persistent Volume Claim name"),
    pvcs: str = typer.Option(
        None,
        help='Multiple PVCs with mount paths in JSON format (e.g., \'[{"name":"my-pvc","mount_path":"/data"}]\')',
    ),
    dry_run: bool = typer.Option(False, help="Dry run"),
    priority: PRIORITY = typer.Option(
        "default", help="Priority class name", show_default=True, show_choices=True
    ),
    vscode: bool = typer.Option(False, help="Install VS Code CLI in the container"),
    tunnel: bool = typer.Option(
        False,
        help="Start a VS Code SSH tunnel on startup. Requires SLACK_WEBHOOK and --vscode",
    ),
    startup_script: str = typer.Option(
        None, help="Path to startup script to run in container"
    ),
):
    """
    `kblaunch launch`

    Launch a Kubernetes job with the specified configuration.

    This command creates and deploys a Kubernetes job with the given specifications,
    handling GPU allocation, resource requests, and environment setup.

    Args:
    * email (str, optional): User email for notifications
    * job_name (str, required): Name of the Kubernetes job
    * docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
    * namespace (str, defaults to the KUBE_NAMESPACE environment variable): Kubernetes namespace
    * queue_name (str, defaults to the KUBE_USER_QUEUE environment variable): Kueue queue name
    * interactive (bool, default=False): Run in interactive mode
    * command (str, default=""): Command to run in container (required unless interactive)
    * cpu_request (str, default="6"): CPU cores request
    * ram_request (str, default="40Gi"): RAM request
    * gpu_limit (int, default=1): Number of GPUs
    * gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
    * secrets_env_vars (List[str], default=[]): Secret environment variables
    * local_env_vars (List[str], default=[]): Local environment variables
    * load_dotenv (bool, default=True): Load .env file
    * nfs_server (str, optional): NFS server IP (overrides config)
    * pvc_name (str, optional): PVC name for single PVC mounting at /pvc
    * pvcs (str, optional): Multiple PVCs with mount paths in JSON format
    * dry_run (bool, default=False): Print YAML only
    * priority (PRIORITY, default="default"): Job priority
    * vscode (bool, default=False): Install VS Code
    * tunnel (bool, default=False): Start VS Code tunnel
    * startup_script (str, optional): Path to startup script

    Examples:
        ```bash
        # Launch an interactive GPU job
        kblaunch launch --job-name test-job --interactive

        # Launch a batch GPU job with a custom command
        kblaunch launch --job-name batch-job --command "python train.py"

        # Launch a CPU-only job
        kblaunch launch --job-name cpu-job --gpu-limit 0

        # Launch with VS Code support
        kblaunch launch --job-name dev-job --interactive --vscode --tunnel

        # Launch with multiple PVCs
        kblaunch launch --job-name multi-pvc-job --pvcs '[{"name":"data-pvc","mount_path":"/data"},{"name":"models-pvc","mount_path":"/models"}]'
        ```

    Notes:
    - Interactive jobs keep running until manually terminated
    - GPU jobs require appropriate queue and priority settings
    - VS Code tunnel requires Slack webhook configuration
    - Multiple PVCs can be mounted with custom paths using the --pvcs option
    """

    # Load config
    config = load_config()

    # Determine namespace if not provided
    if namespace is None:
        namespace = get_current_namespace(config)
        if namespace is None:
            raise typer.BadParameter(
                "Namespace not provided. "
                "Please provide --namespace or run 'kblaunch setup' to configure."
            )

    # Determine queue name if not provided
    if queue_name is None:
        queue_name = get_user_queue(namespace)
        if queue_name is None:
            raise typer.BadParameter(
                "Queue name not provided. "
                "Please provide --queue-name or run 'kblaunch setup' to configure."
            )

    # Use email from config if not provided
    if email is None:
        email = config.get("email")
        if email is None:
            raise typer.BadParameter(
                "Email not provided and not found in config. "
                "Please provide --email or run 'kblaunch setup' to configure."
            )

    # Determine which NFS server to use (priority: command-line > config > env var > default)
    if nfs_server is None:
        nfs_server = config.get("nfs_server", NFS_SERVER)
        if nfs_server is None:
            logger.warning(
                "NFS server not set. Please provide --nfs-server or run "
                "'kblaunch setup' to mount the NFS partition."
            )

    # Add SLACK_WEBHOOK to local_env_vars if configured
    if "slack_webhook" in config:
        os.environ["SLACK_WEBHOOK"] = config["slack_webhook"]
        if "SLACK_WEBHOOK" not in local_env_vars:
            local_env_vars.append("SLACK_WEBHOOK")

    if "user" in config and os.getenv("USER") is None:
        os.environ["USER"] = config["user"]

    if pvc_name is None:
        pvc_name = config.get("default_pvc")

    if pvc_name is not None:
        if not check_if_pvc_exists(pvc_name, namespace):
            logger.error(f"Provided PVC '{pvc_name}' does not exist")
            return

    # Parse multiple PVCs if provided
    parsed_pvcs = []
    if pvcs:
        try:
            parsed_pvcs = json.loads(pvcs)
            # Validate the format
            for pvc in parsed_pvcs:
                if (
                    not isinstance(pvc, dict)
                    or "name" not in pvc
                    or "mount_path" not in pvc
                ):
                    raise typer.BadParameter(
                        "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
                    )
                # Validate that the PVC exists
                if not check_if_pvc_exists(pvc["name"], namespace):
                    logger.warning(
                        f"PVC '{pvc['name']}' does not exist in namespace '{namespace}'"
                    )
                    if not typer.confirm(
                        f"Continue with PVC '{pvc['name']}' that doesn't exist?",
                        default=False,
                    ):
                        return 1
        except json.JSONDecodeError:
            raise typer.BadParameter("Invalid JSON format for pvcs parameter")

    # Validate the command parameter
    if not interactive and command == "":
        raise typer.BadParameter("--command is required when not in interactive mode")

    # Validate GPU constraints only if requesting GPUs
    if gpu_limit > 0:
        try:
            validate_gpu_constraints(gpu_product.value, gpu_limit, priority.value)
        except ValueError as e:
            raise typer.BadParameter(str(e))

    is_completed = check_if_completed(job_name, namespace=namespace)
    if not is_completed:
        if typer.confirm(
            f"Job '{job_name}' already exists. Do you want to delete it and create a new one?",
            default=False,
        ):
            if not delete_namespaced_job_safely(
                job_name,
                namespace=namespace,
                user=config.get("user"),
            ):
                logger.error("Failed to delete existing job")
                return 1
        else:
            logger.info("Operation cancelled by user")
            return 1

    logger.info(f"Job '{job_name}' is completed. Launching a new job.")

    # Get local environment variables
    env_vars_dict = get_env_vars(
        local_env_vars=local_env_vars,
        load_dotenv=load_dotenv,
    )

    # Add USER and GIT_EMAIL to env_vars if git_secret is configured
    if config.get("git_secret"):
        env_vars_dict["USER"] = config.get("user", os.getenv("USER", "unknown"))
        env_vars_dict["GIT_EMAIL"] = email

    secrets_env_vars_dict = get_secret_env_vars(
        secrets_names=secrets_env_vars,
        namespace=namespace,
    )

    # Check for overlapping keys in local and secret environment variables
    intersection = set(secrets_env_vars_dict.keys()).intersection(env_vars_dict.keys())
    if intersection:
        logger.warning(
            f"Overlapping keys in local and secret environment variables: {intersection}"
        )
    # Combine the environment variable names
    union = set(secrets_env_vars_dict.keys()).union(env_vars_dict.keys())

    # Handle startup script
    script_content = None
    if startup_script:
        script_content = read_startup_script(startup_script)
        # Create ConfigMap for startup script
        try:
            api = client.CoreV1Api()
            config_map = client.V1ConfigMap(
                metadata=client.V1ObjectMeta(
                    name=f"{job_name}-startup", namespace=namespace
                ),
                data={"startup.sh": script_content},
            )
            try:
                api.create_namespaced_config_map(namespace=namespace, body=config_map)
            except ApiException as e:
                if e.status == 409:  # Already exists
                    api.patch_namespaced_config_map(
                        name=f"{job_name}-startup", namespace=namespace, body=config_map
                    )
                else:
                    raise
        except Exception as e:
            raise typer.BadParameter(f"Failed to create startup script ConfigMap: {e}")

    if interactive:
        cmd = "while true; do sleep 60; done;"
    else:
        cmd = command
        logger.info(f"Command: {cmd}")

    logger.info(f"Creating job for: {cmd}")

    # Modify command to include startup script
    if script_content:
        cmd = f"bash /startup.sh && {cmd}"

    # Build the start command with optional VS Code installation
    start_command = send_message_command(union)
    if config.get("git_secret"):
        start_command += setup_git_command()
    if vscode:
        start_command += install_vscode_command()
        if tunnel:
            start_command += start_vscode_tunnel_command(union)
    elif tunnel:
        logger.error("Cannot start tunnel without VS Code installation")

    full_cmd = start_command + cmd

    job = KubernetesJob(
        name=job_name,
        cpu_request=cpu_request,
        ram_request=ram_request,
        image=docker_image,
        gpu_type="nvidia.com/gpu" if gpu_limit > 0 else None,
        gpu_limit=gpu_limit,
        gpu_product=gpu_product.value if gpu_limit > 0 else None,
        command=["/bin/bash", "-c", "--"],
        args=[full_cmd],
        env_vars=env_vars_dict,
        secret_env_vars=secrets_env_vars_dict,
        user_email=email,
        namespace=namespace,
        kueue_queue_name=queue_name,
        nfs_server=nfs_server,
        pvc_name=pvc_name,
        pvcs=parsed_pvcs,  # Pass the parsed PVCs list
        priority=priority.value,
        startup_script=script_content,
        git_secret=config.get("git_secret"),
    )
    job_yaml = job.generate_yaml()
    logger.info(job_yaml)
    # Run the job on the Kubernetes cluster
    if not dry_run:
        job.run()
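The `--pvcs` value must be a JSON list of objects, each with `name` and `mount_path` keys. A standalone sketch of the parsing and validation step above; the cluster existence check is omitted here, and the helper name is an assumption:

```python
import json


def parse_pvcs(pvcs_json: str) -> list:
    """Parse and validate a --pvcs argument such as
    '[{"name":"data-pvc","mount_path":"/data"}]'."""
    try:
        parsed = json.loads(pvcs_json)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON format for pvcs parameter: {e}")
    if not isinstance(parsed, list):
        raise ValueError("--pvcs must be a JSON list of objects")
    for pvc in parsed:
        # Each entry needs both a PVC name and a mount path
        if not isinstance(pvc, dict) or "name" not in pvc or "mount_path" not in pvc:
            raise ValueError(
                "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
            )
    return parsed


print(parse_pvcs('[{"name":"data-pvc","mount_path":"/data"}]'))
# → [{'name': 'data-pvc', 'mount_path': '/data'}]
```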

@monitor_app.command("gpus")
def monitor_gpus(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor gpus`

    Display overall GPU statistics and utilization by type.

    Shows a comprehensive view of GPU allocation and usage across the cluster,
    including both running and pending GPU requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - Total GPU count by type
    - Running vs. pending GPUs
    - Details of pending GPU requests
    - Wait times for pending requests

    Examples:
        ```bash
        kblaunch monitor gpus
        kblaunch monitor gpus --namespace custom-namespace
        ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_gpu_total(namespace=namespace)
    except Exception as e:
        print(f"Error displaying GPU stats: {e}")

@monitor_app.command("users")
def monitor_users(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor users`

    Display GPU usage statistics grouped by user.

    Provides a user-centric view of GPU allocation and utilization,
    helping identify resource usage patterns across users.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - GPUs allocated per user
    - Average memory usage per user
    - Inactive GPU count per user
    - Overall usage totals

    Examples:
        ```bash
        kblaunch monitor users
        kblaunch monitor users --namespace custom-namespace
        ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_user_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying user stats: {e}")

@monitor_app.command("jobs")
def monitor_jobs(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor jobs`

    Display detailed job-level GPU statistics.

    Shows comprehensive information about all running GPU jobs,
    including resource usage and job characteristics.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - Job identification and ownership
    - Resource allocation (CPU, RAM, GPU)
    - GPU memory usage
    - Job status (active/inactive)
    - Job mode (interactive/batch)
    - Resource totals and averages

    Examples:
        ```bash
        kblaunch monitor jobs
        kblaunch monitor jobs --namespace custom-namespace
        ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_job_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying job stats: {e}")

@monitor_app.command("queue")
def monitor_queue(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    reasons: bool = typer.Option(False, help="Display queued job event messages"),
    include_cpu: bool = typer.Option(False, help="Show CPU jobs in the queue"),
):
    """
    `kblaunch monitor queue`

    Display statistics about queued workloads.

    Shows information about jobs waiting in the Kueue scheduler,
    including wait times and resource requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
    - reasons: Show detailed reason messages for queued jobs
    - include_cpu: Include CPU jobs in the queue

    Output includes:
    - Queue position and wait time
    - Resource requests (CPU, RAM, GPU)
    - Job priority
    - Queueing reasons (if --reasons flag is used)

    Examples:
        ```bash
        kblaunch monitor queue
        kblaunch monitor queue --reasons
        kblaunch monitor queue --namespace custom-namespace
        ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_queue_stats(namespace=namespace, reasons=reasons, include_cpu=include_cpu)
    except Exception as e:
        print(f"Error displaying queue stats: {e}")
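Queue wait times such as those reported here can be derived from each workload's creation timestamp (Kubernetes exposes this as `metadata.creationTimestamp`). A hedged sketch of such a computation; the helper name and output format are assumptions, not kblaunch's actual implementation:

```python
from datetime import datetime, timezone
from typing import Optional


def wait_time(created: datetime, now: Optional[datetime] = None) -> str:
    """Format the elapsed time since a workload was created, e.g. '2h 05m'."""
    now = now or datetime.now(timezone.utc)
    seconds = int((now - created).total_seconds())
    hours, remainder = divmod(seconds, 3600)  # whole hours, leftover seconds
    minutes = remainder // 60
    return f"{hours}h {minutes:02d}m"


created = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
print(wait_time(created, now))  # → 2h 05m
```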

kblaunch monitor queue Display statistics about queued workloads.

Shows information about jobs waiting in the Kueue scheduler, including wait times and resource requests.

Args:

  • namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
  • reasons: Show detailed reason messages for queued jobs
  • include_cpu: Include CPU jobs in the queue

Output includes:

  • Queue position and wait time
  • Resource requests (CPU, RAM, GPU)
  • Job priority
  • Queueing reasons (if --reasons flag is used)

Examples:

kblaunch monitor queue
kblaunch monitor queue --reasons
kblaunch monitor queue --namespace custom-namespace