# kblaunch

A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.

## Commands

- Launching GPU jobs with various configurations
- Monitoring GPU usage and job statistics
- Setting up user configurations and preferences
- Managing persistent volumes and Git authentication

## Features

- Interactive and batch job support
- GPU resource management and constraints
- Environment variable handling from multiple sources
- Persistent Volume Claims (PVC) for storage
- Git SSH authentication
- VS Code integration with remote tunneling
- Slack notifications for job status
- Real-time cluster monitoring

## Resource Types

- A100 GPUs (40GB and 80GB variants)
- H100 GPUs (80GB variant)
- MIG GPU instances
- CPU and RAM allocation
- Persistent storage volumes

## Job Priority Classes

- default: Standard priority for most workloads
- batch: Lower priority for long-running jobs
- short: High priority for quick jobs (with GPU constraints)

## Environment Integration

- Kubernetes secrets
- Local environment variables
- .env file support
- SSH key management
- NFS workspace mounting
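As a quick orientation, a typical first session runs the setup wizard once and then launches a job. The sketch below is illustrative, with a placeholder job name; `kblaunch setup` drives everything through interactive prompts.

```bash
# One-time configuration: prompts for user, email, Slack webhook, PVC, and Git SSH key.
kblaunch setup

# Launch an interactive single-GPU job (placeholder job name).
kblaunch launch --job-name my-dev-job --interactive

# Inspect cluster GPU usage and the scheduling queue.
kblaunch monitor gpus
kblaunch monitor queue
```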
1"""kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters. 2 3## Commands 4* Launching GPU jobs with various configurations 5* Monitoring GPU usage and job statistics 6* Setting up user configurations and preferences 7* Managing persistent volumes and Git authentication 8 9## Features 10* Interactive and batch job support 11* GPU resource management and constraints 12* Environment variable handling from multiple sources 13* Persistent Volume Claims (PVC) for storage 14* Git SSH authentication 15* VS Code integration with remote tunneling 16* Slack notifications for job status 17* Real-time cluster monitoring 18 19## Resource Types 20* A100 GPUs (40GB and 80GB variants) 21* H100 GPUs (80GB variant) 22* MIG GPU instances 23* CPU and RAM allocation 24* Persistent storage volumes 25 26## Job Priority Classes 27* default: Standard priority for most workloads 28* batch: Lower priority for long-running jobs 29* short: High priority for quick jobs (with GPU constraints) 30 31## Environment Integration 32* Kubernetes secrets 33* Local environment variables 34* .env file support 35* SSH key management 36* NFS workspace mounting 37""" 38 39import importlib.metadata 40 41__version__ = importlib.metadata.version("kblaunch") 42 43__all__ = [ 44 "setup", 45 "launch", 46 "monitor_gpus", 47 "monitor_users", 48 "monitor_jobs", 49 "monitor_queue", 50] 51 52from .cli import setup, launch, monitor_gpus, monitor_users, monitor_jobs, monitor_queue
````python
@app.command()
def setup():
    """
    # `kblaunch setup`

    Interactive setup wizard for kblaunch configuration.
    No arguments - all configuration is done through interactive prompts.

    This command walks users through the initial setup process, configuring:
    - User identity and email
    - Slack notifications webhook
    - Persistent Volume Claims (PVC) for storage
    - Git SSH authentication

    The configuration is stored in ~/.cache/.kblaunch/config.json.

    Configuration includes:
    - User: Kubernetes username for job ownership
    - Email: User email for notifications and Git configuration
    - Slack webhook: URL for job status notifications
    - PVC: Persistent storage configuration
    - Git SSH: Authentication for private repositories
    """
    config = load_config()

    # validate user
    default_user = os.getenv("USER")
    if "user" in config:
        default_user = config["user"]
    else:
        config["user"] = default_user

    if typer.confirm(
        f"Would you like to set the user? (default: {default_user})", default=False
    ):
        user = typer.prompt("Please enter your user", default=default_user)
        config["user"] = user

    # Get email
    existing_email = config.get("email", None)
    email = typer.prompt(
        f"Please enter your email (existing: {existing_email})", default=existing_email
    )
    config["email"] = email

    # Get Slack webhook
    if typer.confirm("Would you like to set up Slack notifications?", default=False):
        existing_webhook = config.get("slack_webhook", None)
        webhook = typer.prompt(
            f"Enter your Slack webhook URL (existing: {existing_webhook})",
            default=existing_webhook,
        )
        config["slack_webhook"] = webhook

    if typer.confirm("Would you like to use a PVC?", default=False):
        user = config["user"]
        current_default = config.get("default_pvc", f"{user}-pvc")

        pvc_name = typer.prompt(
            f"Enter the PVC name to use (default: {current_default}). We will help you create it if it does not exist.",
            default=current_default,
        )

        if check_if_pvc_exists(pvc_name):
            if typer.confirm(
                f"Would you like to set {pvc_name} as the default PVC?",
                default=True,
            ):
                config["default_pvc"] = pvc_name
        else:
            if typer.confirm(
                f"PVC '{pvc_name}' does not exist. Would you like to create it?",
                default=True,
            ):
                pvc_size = typer.prompt(
                    "Enter the desired PVC size (e.g. 10Gi)", default="10Gi"
                )
                try:
                    if create_pvc(user, pvc_name, pvc_size):
                        config["default_pvc"] = pvc_name
                except (ValueError, ApiException) as e:
                    logger.error(f"Failed to create PVC: {e}")

    # Git authentication setup
    if typer.confirm("Would you like to set up Git SSH authentication?", default=False):
        default_key_path = str(Path.home() / ".ssh" / "id_rsa")
        key_path = typer.prompt(
            "Enter the path to your SSH private key",
            default=default_key_path,
        )
        secret_name = f"{config['user']}-git-ssh"
        if create_git_secret(secret_name, key_path):
            config["git_secret"] = secret_name

    # validate slack webhook
    if "slack_webhook" in config:
        # test post to slack
        try:
            logger.info("Sending test message to Slack")
            message = "Hello :wave: from ```kblaunch```"
            response = requests.post(
                config["slack_webhook"],
                json={"text": message},
            )
            response.raise_for_status()
        except Exception as e:
            logger.error(f"Error sending test message to Slack: {e}")

    # Save config
    save_config(config)
    logger.info(f"Configuration saved to {CONFIG_FILE}")
````
# `kblaunch setup`

Interactive setup wizard for kblaunch configuration. The command takes no arguments; all configuration is done through interactive prompts.

This command walks users through the initial setup process, configuring:

- User identity and email
- Slack notifications webhook
- Persistent Volume Claims (PVC) for storage
- Git SSH authentication

The configuration is stored in `~/.cache/.kblaunch/config.json`.

Configuration includes:

- User: Kubernetes username for job ownership
- Email: User email for notifications and Git configuration
- Slack webhook: URL for job status notifications
- PVC: Persistent storage configuration
- Git SSH: Authentication for private repositories
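For reference, the saved configuration is a small JSON file at the path above. The sketch below shows what it might contain after all setup steps have been completed; the keys come from the setup prompts, but every value is illustrative.

```bash
cat ~/.cache/.kblaunch/config.json
# Illustrative contents (your values will differ):
# {
#   "user": "jdoe",
#   "email": "jdoe@example.com",
#   "slack_webhook": "https://hooks.slack.com/services/...",
#   "default_pvc": "jdoe-pvc",
#   "git_secret": "jdoe-git-ssh"
# }
```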
````python
@app.command()
def launch(
    email: str = typer.Option(None, help="User email (overrides config)"),
    job_name: str = typer.Option(..., help="Name of the Kubernetes job"),
    docker_image: str = typer.Option(
        "nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04", help="Docker image"
    ),
    namespace: str = typer.Option("informatics", help="Kubernetes namespace"),
    queue_name: str = typer.Option("informatics-user-queue", help="Kueue queue name"),
    interactive: bool = typer.Option(False, help="Run in interactive mode"),
    command: str = typer.Option(
        "", help="Command to run in the container"
    ),  # Made optional
    cpu_request: str = typer.Option("1", help="CPU request"),
    ram_request: str = typer.Option("8Gi", help="RAM request"),
    gpu_limit: int = typer.Option(1, help="GPU limit (0 for non-GPU jobs)"),
    gpu_product: GPU_PRODUCTS = typer.Option(
        "NVIDIA-A100-SXM4-40GB",
        help="GPU product type to use (ignored for non-GPU jobs)",
        show_choices=True,
        show_default=True,
    ),
    secrets_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of secret environment variables to export to the container",
    ),
    local_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of local environment variables to export to the container",
    ),
    load_dotenv: bool = typer.Option(
        True, help="Load environment variables from .env file"
    ),
    nfs_server: str = typer.Option(NFS_SERVER, help="NFS server"),
    pvc_name: str = typer.Option(None, help="Persistent Volume Claim name"),
    dry_run: bool = typer.Option(False, help="Dry run"),
    priority: PRIORITY = typer.Option(
        "default", help="Priority class name", show_default=True, show_choices=True
    ),
    vscode: bool = typer.Option(False, help="Install VS Code CLI in the container"),
    tunnel: bool = typer.Option(
        False,
        help="Start a VS Code SSH tunnel on startup. Requires SLACK_WEBHOOK and --vscode",
    ),
    startup_script: str = typer.Option(
        None, help="Path to startup script to run in container"
    ),
):
    """
    # `kblaunch launch`
    Launch a Kubernetes job with specified configuration.

    This command creates and deploys a Kubernetes job with the given specifications,
    handling GPU allocation, resource requests, and environment setup.

    Args:
    * email (str, optional): User email for notifications
    * job_name (str, required): Name of the Kubernetes job
    * docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
    * namespace (str, default="informatics"): Kubernetes namespace
    * queue_name (str, default="informatics-user-queue"): Kueue queue name
    * interactive (bool, default=False): Run in interactive mode
    * command (str, default=""): Command to run in container
    * cpu_request (str, default="1"): CPU cores request
    * ram_request (str, default="8Gi"): RAM request
    * gpu_limit (int, default=1): Number of GPUs
    * gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
    * secrets_env_vars (List[str], default=[]): Secret environment variables
    * local_env_vars (List[str], default=[]): Local environment variables
    * load_dotenv (bool, default=True): Load .env file
    * nfs_server (str): NFS server IP
    * pvc_name (str, optional): PVC name
    * dry_run (bool, default=False): Print YAML only
    * priority (PRIORITY, default="default"): Job priority
    * vscode (bool, default=False): Install VS Code
    * tunnel (bool, default=False): Start VS Code tunnel
    * startup_script (str, optional): Path to startup script

    Examples:
    ```bash
    # Launch an interactive GPU job
    kblaunch launch --job-name test-job --interactive

    # Launch a batch GPU job with custom command
    kblaunch launch --job-name batch-job --command "python train.py"

    # Launch a CPU-only job
    kblaunch launch --job-name cpu-job --gpu-limit 0

    # Launch with VS Code support
    kblaunch launch --job-name dev-job --interactive --vscode --tunnel
    ```

    Notes:
    - Interactive jobs keep running until manually terminated
    - GPU jobs require appropriate queue and priority settings
    - VS Code tunnel requires Slack webhook configuration
    """
    # Load config
    config = load_config()

    # Use email from config if not provided
    if email is None:
        email = config.get("email")
        if email is None:
            raise typer.BadParameter(
                "Email not provided and not found in config. "
                "Please provide --email or run 'kblaunch setup'"
            )

    # Add SLACK_WEBHOOK to local_env_vars if configured
    if "slack_webhook" in config:
        os.environ["SLACK_WEBHOOK"] = config["slack_webhook"]
        if "SLACK_WEBHOOK" not in local_env_vars:
            local_env_vars.append("SLACK_WEBHOOK")

    if "user" in config and os.getenv("USER") is None:
        os.environ["USER"] = config["user"]

    if pvc_name is None:
        pvc_name = config.get("default_pvc")

    if pvc_name is not None:
        if not check_if_pvc_exists(pvc_name):
            logger.error(f"Provided PVC '{pvc_name}' does not exist")
            return

    # Add validation for command parameter
    if not interactive and command == "":
        raise typer.BadParameter("--command is required when not in interactive mode")

    # Validate GPU constraints only if requesting GPUs
    if gpu_limit > 0:
        try:
            validate_gpu_constraints(gpu_product.value, gpu_limit, priority.value)
        except ValueError as e:
            raise typer.BadParameter(str(e))

    is_completed = check_if_completed(job_name, namespace=namespace)
    if not is_completed:
        if typer.confirm(
            f"Job '{job_name}' already exists. Do you want to delete it and create a new one?",
            default=False,
        ):
            if not delete_namespaced_job_safely(
                job_name,
                namespace=namespace,
                user=config.get("user"),
            ):
                logger.error("Failed to delete existing job")
                return 1
        else:
            logger.info("Operation cancelled by user")
            return 1

    logger.info(f"Job '{job_name}' is completed. Launching a new job.")

    # Get local environment variables
    env_vars_dict = get_env_vars(
        local_env_vars=local_env_vars,
        load_dotenv=load_dotenv,
    )

    # Add USER and GIT_EMAIL to env_vars if git_secret is configured
    if config.get("git_secret"):
        env_vars_dict["USER"] = config.get("user", os.getenv("USER", "unknown"))
        env_vars_dict["GIT_EMAIL"] = email

    secrets_env_vars_dict = get_secret_env_vars(
        secrets_names=secrets_env_vars,
        namespace=namespace,
    )

    # Check for overlapping keys in local and secret environment variables
    intersection = set(secrets_env_vars_dict.keys()).intersection(env_vars_dict.keys())
    if intersection:
        logger.warning(
            f"Overlapping keys in local and secret environment variables: {intersection}"
        )
    # Combine the environment variables
    union = set(secrets_env_vars_dict.keys()).union(env_vars_dict.keys())

    # Handle startup script
    script_content = None
    if startup_script:
        script_content = read_startup_script(startup_script)
        # Create ConfigMap for startup script
        try:
            api = client.CoreV1Api()
            config_map = client.V1ConfigMap(
                metadata=client.V1ObjectMeta(
                    name=f"{job_name}-startup", namespace=namespace
                ),
                data={"startup.sh": script_content},
            )
            try:
                api.create_namespaced_config_map(namespace=namespace, body=config_map)
            except ApiException as e:
                if e.status == 409:  # Already exists
                    api.patch_namespaced_config_map(
                        name=f"{job_name}-startup", namespace=namespace, body=config_map
                    )
                else:
                    raise
        except Exception as e:
            raise typer.BadParameter(f"Failed to create startup script ConfigMap: {e}")

    if interactive:
        cmd = "while true; do sleep 60; done;"
    else:
        cmd = command
        logger.info(f"Command: {cmd}")

    logger.info(f"Creating job for: {cmd}")

    # Modify command to include startup script
    if script_content:
        cmd = f"bash /startup.sh && {cmd}"

    # Build the start command with optional VS Code installation
    start_command = send_message_command(union)
    if config.get("git_secret"):
        start_command += setup_git_command()
    if vscode:
        start_command += install_vscode_command()
        if tunnel:
            start_command += start_vscode_tunnel_command(union)
    elif tunnel:
        logger.error("Cannot start tunnel without VS Code installation")

    full_cmd = start_command + cmd

    job = KubernetesJob(
        name=job_name,
        cpu_request=cpu_request,
        ram_request=ram_request,
        image=docker_image,
        gpu_type="nvidia.com/gpu" if gpu_limit > 0 else None,
        gpu_limit=gpu_limit,
        gpu_product=gpu_product.value if gpu_limit > 0 else None,
        command=["/bin/bash", "-c", "--"],
        args=[full_cmd],
        env_vars=env_vars_dict,
        secret_env_vars=secrets_env_vars_dict,
        user_email=email,
        namespace=namespace,
        kueue_queue_name=queue_name,
        nfs_server=nfs_server,
        pvc_name=pvc_name,
        priority=priority.value,
        startup_script=script_content,
        git_secret=config.get("git_secret"),
    )
    job_yaml = job.generate_yaml()
    logger.info(job_yaml)
    # Run the Job on the Kubernetes cluster
    if not dry_run:
        job.run()
````
# `kblaunch launch`

Launch a Kubernetes job with specified configuration.

This command creates and deploys a Kubernetes job with the given specifications, handling GPU allocation, resource requests, and environment setup.

Args:

- email (str, optional): User email for notifications
- job_name (str, required): Name of the Kubernetes job
- docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
- namespace (str, default="informatics"): Kubernetes namespace
- queue_name (str, default="informatics-user-queue"): Kueue queue name
- interactive (bool, default=False): Run in interactive mode
- command (str, default=""): Command to run in container
- cpu_request (str, default="1"): CPU cores request
- ram_request (str, default="8Gi"): RAM request
- gpu_limit (int, default=1): Number of GPUs
- gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
- secrets_env_vars (List[str], default=[]): Secret environment variables
- local_env_vars (List[str], default=[]): Local environment variables
- load_dotenv (bool, default=True): Load .env file
- nfs_server (str): NFS server IP
- pvc_name (str, optional): PVC name
- dry_run (bool, default=False): Print YAML only
- priority (PRIORITY, default="default"): Job priority
- vscode (bool, default=False): Install VS Code
- tunnel (bool, default=False): Start VS Code tunnel
- startup_script (str, optional): Path to startup script

Examples:

```bash
# Launch an interactive GPU job
kblaunch launch --job-name test-job --interactive

# Launch a batch GPU job with custom command
kblaunch launch --job-name batch-job --command "python train.py"

# Launch a CPU-only job
kblaunch launch --job-name cpu-job --gpu-limit 0

# Launch with VS Code support
kblaunch launch --job-name dev-job --interactive --vscode --tunnel
```

Notes:

- Interactive jobs keep running until manually terminated
- GPU jobs require appropriate queue and priority settings
- VS Code tunnel requires Slack webhook configuration
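Putting the options together: when a job is not interactive, --command is required, and --dry-run prints the generated Job YAML without submitting it. The sketch below launches a hypothetical batch training job that mounts a PVC, injects a Kubernetes secret into the environment, and runs a startup script first; the job, secret, script, and PVC names are placeholders, and the long flag spellings are assumed to follow Typer's conversion of the parameter names listed above.

```bash
# Preview the generated Job manifest without submitting it.
kblaunch launch \
    --job-name train-job \
    --command "python train.py" \
    --gpu-limit 1 \
    --gpu-product NVIDIA-A100-SXM4-40GB \
    --priority batch \
    --pvc-name jdoe-pvc \
    --secrets-env-vars my-api-secret \
    --startup-script ./startup.sh \
    --dry-run

# Re-run the same command without --dry-run to submit the job.
```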
1047@monitor_app.command("gpus") 1048def monitor_gpus( 1049 namespace: str = typer.Option("informatics", help="Kubernetes namespace"), 1050): 1051 """ 1052 # `kblaunch monitor gpus` 1053 Display overall GPU statistics and utilization by type. 1054 1055 Shows a comprehensive view of GPU allocation and usage across the cluster, 1056 including both running and pending GPU requests. 1057 1058 Args: 1059 - namespace: Kubernetes namespace to monitor (default: informatics) 1060 1061 Output includes: 1062 - Total GPU count by type 1063 - Running vs. pending GPUs 1064 - Details of pending GPU requests 1065 - Wait times for pending requests 1066 1067 Examples: 1068 ```bash 1069 kblaunch monitor gpus 1070 kblaunch monitor gpus --namespace custom-namespace 1071 ``` 1072 """ 1073 try: 1074 print_gpu_total(namespace=namespace) 1075 except Exception as e: 1076 print(f"Error displaying GPU stats: {e}")
# `kblaunch monitor gpus`

Display overall GPU statistics and utilization by type.

Shows a comprehensive view of GPU allocation and usage across the cluster, including both running and pending GPU requests.

Args:

- namespace: Kubernetes namespace to monitor (default: informatics)

Output includes:

- Total GPU count by type
- Running vs. pending GPUs
- Details of pending GPU requests
- Wait times for pending requests

Examples:

```bash
kblaunch monitor gpus
kblaunch monitor gpus --namespace custom-namespace
```
1079@monitor_app.command("users") 1080def monitor_users( 1081 namespace: str = typer.Option("informatics", help="Kubernetes namespace"), 1082): 1083 """ 1084 # `kblaunch monitor users` 1085 Display GPU usage statistics grouped by user. 1086 1087 Provides a user-centric view of GPU allocation and utilization, 1088 helping identify resource usage patterns across users. 1089 1090 Args: 1091 - namespace: Kubernetes namespace to monitor (default: informatics) 1092 1093 Output includes: 1094 - GPUs allocated per user 1095 - Average memory usage per user 1096 - Inactive GPU count per user 1097 - Overall usage totals 1098 1099 Examples: 1100 ```bash 1101 kblaunch monitor users 1102 kblaunch monitor users --namespace custom-namespace 1103 ``` 1104 """ 1105 try: 1106 print_user_stats(namespace=namespace) 1107 except Exception as e: 1108 print(f"Error displaying user stats: {e}")
# `kblaunch monitor users`

Display GPU usage statistics grouped by user.

Provides a user-centric view of GPU allocation and utilization, helping identify resource usage patterns across users.

Args:

- namespace: Kubernetes namespace to monitor (default: informatics)

Output includes:

- GPUs allocated per user
- Average memory usage per user
- Inactive GPU count per user
- Overall usage totals

Examples:

```bash
kblaunch monitor users
kblaunch monitor users --namespace custom-namespace
```
1111@monitor_app.command("jobs") 1112def monitor_jobs( 1113 namespace: str = typer.Option("informatics", help="Kubernetes namespace"), 1114): 1115 """ 1116 # `kblaunch monitor jobs` 1117 Display detailed job-level GPU statistics. 1118 1119 Shows comprehensive information about all running GPU jobs, 1120 including resource usage and job characteristics. 1121 1122 Args: 1123 - namespace: Kubernetes namespace to monitor (default: informatics) 1124 1125 Output includes: 1126 - Job identification and ownership 1127 - Resource allocation (CPU, RAM, GPU) 1128 - GPU memory usage 1129 - Job status (active/inactive) 1130 - Job mode (interactive/batch) 1131 - Resource totals and averages 1132 1133 Examples: 1134 ```bash 1135 kblaunch monitor jobs 1136 kblaunch monitor jobs --namespace custom-namespace 1137 ``` 1138 """ 1139 try: 1140 print_job_stats(namespace=namespace) 1141 except Exception as e: 1142 print(f"Error displaying job stats: {e}")
# `kblaunch monitor jobs`

Display detailed job-level GPU statistics.

Shows comprehensive information about all running GPU jobs, including resource usage and job characteristics.

Args:

- namespace: Kubernetes namespace to monitor (default: informatics)

Output includes:

- Job identification and ownership
- Resource allocation (CPU, RAM, GPU)
- GPU memory usage
- Job status (active/inactive)
- Job mode (interactive/batch)
- Resource totals and averages

Examples:

```bash
kblaunch monitor jobs
kblaunch monitor jobs --namespace custom-namespace
```
1145@monitor_app.command("queue") 1146def monitor_queue( 1147 namespace: str = typer.Option("informatics", help="Kubernetes namespace"), 1148 reasons: bool = typer.Option(False, help="Display queued job event messages"), 1149): 1150 """ 1151 # `kblaunch monitor queue` 1152 Display statistics about queued workloads. 1153 1154 Shows information about jobs waiting in the Kueue scheduler, 1155 including wait times and resource requests. 1156 1157 Args: 1158 - namespace: Kubernetes namespace to monitor (default: informatics) 1159 - reasons: Show detailed reason messages for queued jobs 1160 1161 Output includes: 1162 - Queue position and wait time 1163 - Resource requests (CPU, RAM, GPU) 1164 - Job priority 1165 - Queueing reasons (if --reasons flag is used) 1166 1167 Examples: 1168 ```bash 1169 kblaunch monitor queue 1170 kblaunch monitor queue --reasons 1171 kblaunch monitor queue --namespace custom-namespace 1172 ``` 1173 """ 1174 try: 1175 print_queue_stats(namespace=namespace, reasons=reasons) 1176 except Exception as e: 1177 print(f"Error displaying queue stats: {e}")
# `kblaunch monitor queue`

Display statistics about queued workloads.

Shows information about jobs waiting in the Kueue scheduler, including wait times and resource requests.

Args:

- namespace: Kubernetes namespace to monitor (default: informatics)
- reasons: Show detailed reason messages for queued jobs

Output includes:

- Queue position and wait time
- Resource requests (CPU, RAM, GPU)
- Job priority
- Queueing reasons (if --reasons flag is used)

Examples:

```bash
kblaunch monitor queue
kblaunch monitor queue --reasons
kblaunch monitor queue --namespace custom-namespace
```
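The monitor commands print a single snapshot and exit, so for a continuously refreshing view one option is to wrap them in a standard `watch` loop, as in the sketch below (this assumes `watch` is available on your machine; the 60-second interval is arbitrary).

```bash
# Refresh the cluster-wide GPU summary every 60 seconds.
watch -n 60 kblaunch monitor gpus

# Keep an eye on queued workloads, including queueing reasons.
watch -n 60 kblaunch monitor queue --reasons
```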