kblaunch
kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.
Commands
- Launching GPU jobs with various configurations
- Monitoring GPU usage and job statistics
- Setting up user configurations and preferences
- Managing persistent volumes and Git authentication
Features
- Interactive and batch job support
- GPU resource management and constraints
- Environment variable handling from multiple sources
- Persistent Volume Claims (PVC) for storage
- Git SSH authentication
- VS Code integration with remote tunneling
- Slack notifications for job status
- Real-time cluster monitoring
Resource Types
- A100 GPUs (40GB and 80GB variants)
- H100 GPUs (80GB variant)
- MIG GPU instances
- CPU and RAM allocation
- Persistent storage volumes
Job Priority Classes
- default: Standard priority for most workloads
- batch: Lower priority for long-running jobs
- short: High priority for quick jobs (with GPU constraints)
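These priority classes interact with GPU requests: before submitting, `kblaunch launch` runs a `validate_gpu_constraints` check and rejects combinations the cluster does not allow. The sketch below only illustrates the shape of such a check; the per-priority caps and the MIG rule are invented placeholders, not the actual cluster policy.

```python
# Hypothetical sketch of a priority/GPU constraint check.
# The real validate_gpu_constraints() in kblaunch encodes cluster policy;
# the limits below are invented for illustration only.
MAX_GPUS_BY_PRIORITY = {"default": 8, "batch": 8, "short": 1}  # assumption


def validate_gpu_constraints(gpu_product: str, gpu_limit: int, priority: str) -> None:
    """Raise ValueError if the request violates the (assumed) priority policy."""
    max_gpus = MAX_GPUS_BY_PRIORITY.get(priority)
    if max_gpus is None:
        raise ValueError(f"Unknown priority class: {priority}")
    if gpu_limit > max_gpus:
        raise ValueError(
            f"Priority '{priority}' allows at most {max_gpus} GPU(s), requested {gpu_limit}"
        )
    # A MIG slice is a single-GPU resource, so multi-GPU MIG requests are rejected here.
    if "MIG" in gpu_product and gpu_limit > 1:
        raise ValueError("MIG instances cannot be requested with gpu_limit > 1")
```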
Environment Integration
- Kubernetes secrets
- Local environment variables
- .env file support
- SSH key management
- NFS workspace mounting
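At launch time, container environment variables are assembled from the local shell, an optional `.env` file, and named Kubernetes secrets. The sketch below shows roughly how the local/.env side of that merge can be done with `python-dotenv`; the helper name and the precedence order are illustrative assumptions, not kblaunch's internal `get_env_vars`.

```python
# Illustrative sketch of merging env vars from the shell and a .env file.
# kblaunch's get_env_vars() serves this role; the helper below is an assumption.
import os

from dotenv import dotenv_values  # python-dotenv


def collect_env_vars(
    local_env_vars: list[str], load_dotenv: bool = True
) -> dict[str, str]:
    env: dict[str, str] = {}
    if load_dotenv:
        # Values from .env in the current directory, skipping unset entries
        env.update({k: v for k, v in dotenv_values(".env").items() if v is not None})
    for name in local_env_vars:
        value = os.getenv(name)
        if value is not None:
            env[name] = value  # shell values take precedence over .env here
    return env


print(collect_env_vars(["SLACK_WEBHOOK", "USER"]))
```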
1"""kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters. 2 3## Commands 4* Launching GPU jobs with various configurations 5* Monitoring GPU usage and job statistics 6* Setting up user configurations and preferences 7* Managing persistent volumes and Git authentication 8 9## Features 10* Interactive and batch job support 11* GPU resource management and constraints 12* Environment variable handling from multiple sources 13* Persistent Volume Claims (PVC) for storage 14* Git SSH authentication 15* VS Code integration with remote tunneling 16* Slack notifications for job status 17* Real-time cluster monitoring 18 19## Resource Types 20* A100 GPUs (40GB and 80GB variants) 21* H100 GPUs (80GB variant) 22* MIG GPU instances 23* CPU and RAM allocation 24* Persistent storage volumes 25 26## Job Priority Classes 27* default: Standard priority for most workloads 28* batch: Lower priority for long-running jobs 29* short: High priority for quick jobs (with GPU constraints) 30 31## Environment Integration 32* Kubernetes secrets 33* Local environment variables 34* .env file support 35* SSH key management 36* NFS workspace mounting 37""" 38 39import importlib.metadata 40 41__version__ = importlib.metadata.version("kblaunch") 42 43__all__ = [ 44 "setup", 45 "launch", 46 "monitor_gpus", 47 "monitor_users", 48 "monitor_jobs", 49 "monitor_queue", 50] 51 52from .cli import setup, launch, monitor_gpus, monitor_users, monitor_jobs, monitor_queue
```python
@app.command()
def setup():
    """
    `kblaunch setup`

    Interactive setup wizard for kblaunch configuration.
    No arguments - all configuration is done through interactive prompts.

    This command walks users through the initial setup process, configuring:
    - User identity and email
    - Namespace and queue settings
    - Slack notifications webhook
    - Persistent Volume Claims (PVC) for storage
    - Git SSH authentication
    - NFS server configuration

    The configuration is stored in ~/.cache/.kblaunch/config.json.

    Configuration includes:
    - User: Kubernetes username for job ownership
    - Email: User email for notifications and Git configuration
    - Namespace: Kubernetes namespace for job deployment
    - Queue: Kueue queue name for job scheduling
    - Slack webhook: URL for job status notifications
    - PVC: Persistent storage configuration
    - Git SSH: Authentication for private repositories
    - NFS: Server address for mounting storage
    """
    config = load_config()

    # validate user
    default_user = os.getenv("USER")
    if "user" in config:
        default_user = config["user"]
    else:
        config["user"] = default_user

    if typer.confirm(
        f"Would you like to set the user? (default: {default_user})", default=False
    ):
        user = typer.prompt("Please enter your user", default=default_user)
        config["user"] = user

    # Get email
    existing_email = config.get("email", None)
    email = typer.prompt(
        f"Please enter your email (existing: {existing_email})", default=existing_email
    )
    config["email"] = email

    # Configure namespace
    existing_namespace = config.get("namespace", os.getenv("KUBE_NAMESPACE"))
    if typer.confirm("Would you like to configure your namespace?", default=True):
        namespace = typer.prompt(
            f"Please enter your namespace (existing: {existing_namespace})",
            default=existing_namespace,
        )
        config["namespace"] = namespace
        # Now that we have namespace, ask about queue
        existing_queue = config.get("queue", get_user_queue(namespace))
        if typer.confirm("Would you like to configure your queue?", default=True):
            queue = typer.prompt(
                f"Please enter your queue name (existing: {existing_queue})",
                default=existing_queue or f"{namespace}-user-queue",
            )
            config["queue"] = queue

    # Get NFS Server
    # Get the current NFS server from config or default
    current_nfs = config.get("nfs_server", NFS_SERVER)
    if typer.confirm("Would you like to configure the NFS server?", default=False):
        nfs_server = typer.prompt(
            f"Enter your NFS server address (existing: {current_nfs})",
            default=current_nfs,
        )
        config["nfs_server"] = nfs_server

    # Get Slack webhook
    if typer.confirm("Would you like to set up Slack notifications?", default=False):
        existing_webhook = config.get("slack_webhook", None)
        webhook = typer.prompt(
            f"Enter your Slack webhook URL (existing: {existing_webhook})",
            default=existing_webhook,
        )
        config["slack_webhook"] = webhook

    if typer.confirm("Would you like to use a PVC?", default=False):
        user = config["user"]
        current_default = config.get("default_pvc", f"{user}-pvc")

        pvc_name = typer.prompt(
            f"Enter the PVC name to use (default: {current_default}). "
            "We will help you create it if it does not exist.",
            default=current_default,
        )

        namespace = config.get("namespace", get_current_namespace(config))
        if check_if_pvc_exists(pvc_name, namespace):
            if typer.confirm(
                f"Would you like to set {pvc_name} as the default PVC?",
                default=True,
            ):
                config["default_pvc"] = pvc_name
        else:
            if typer.confirm(
                f"PVC '{pvc_name}' does not exist. Would you like to create it?",
                default=True,
            ):
                pvc_size = typer.prompt(
                    "Enter the desired PVC size (e.g. 10Gi)", default="10Gi"
                )
                try:
                    if create_pvc(user, pvc_name, pvc_size, namespace):
                        config["default_pvc"] = pvc_name
                except (ValueError, ApiException) as e:
                    logger.error(f"Failed to create PVC: {e}")

    # Git authentication setup
    if typer.confirm("Would you like to set up Git SSH authentication?", default=False):
        default_key_path = str(Path.home() / ".ssh" / "id_rsa")
        key_path = typer.prompt(
            "Enter the path to your SSH private key",
            default=default_key_path,
        )
        secret_name = f"{config['user']}-git-ssh"
        namespace = config.get("namespace", get_current_namespace(config))
        if create_git_secret(secret_name, key_path, namespace):
            config["git_secret"] = secret_name

    # validate slack webhook
    if "slack_webhook" in config:
        # test post to slack
        try:
            logger.info("Sending test message to Slack")
            message = "Hello :wave: from ```kblaunch```"
            response = requests.post(
                config["slack_webhook"],
                json={"text": message},
            )
            response.raise_for_status()
        except Exception as e:
            logger.error(f"Error sending test message to Slack: {e}")

    # Save config
    save_config(config)
    logger.info(f"Configuration saved to {CONFIG_FILE}")
```
kblaunch setup
Interactive setup wizard for kblaunch configuration. No arguments - all configuration is done through interactive prompts.
This command walks users through the initial setup process, configuring:
- User identity and email
- Namespace and queue settings
- Slack notifications webhook
- Persistent Volume Claims (PVC) for storage
- Git SSH authentication
- NFS server configuration
The configuration is stored in ~/.cache/.kblaunch/config.json.
Configuration includes:
- User: Kubernetes username for job ownership
- Email: User email for notifications and Git configuration
- Namespace: Kubernetes namespace for job deployment
- Queue: Kueue queue name for job scheduling
- Slack webhook: URL for job status notifications
- PVC: Persistent storage configuration
- Git SSH: Authentication for private repositories
- NFS: Server address for mounting storage
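A quick way to inspect the result is to read the JSON file back. A minimal sketch, assuming setup has been run at least once; only the keys you actually configured will be present, and the printed values are whatever you entered:

```python
# Read the kblaunch config written by `kblaunch setup`.
# Keys mirror what the setup command stores; values are whatever you configured.
import json
from pathlib import Path

CONFIG_FILE = Path.home() / ".cache" / ".kblaunch" / "config.json"

with open(CONFIG_FILE) as f:
    config = json.load(f)

# Possible keys (only those you configured will exist):
#   user, email, namespace, queue, nfs_server,
#   slack_webhook, default_pvc, git_secret
print(config.get("user"), config.get("namespace"), config.get("default_pvc"))
```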
````python
@app.command()
def launch(
    email: str = typer.Option(None, help="User email (overrides config)"),
    job_name: str = typer.Option(..., help="Name of the Kubernetes job"),
    docker_image: str = typer.Option(
        "nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04", help="Docker image"
    ),
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    queue_name: str = typer.Option(
        None, help="Kueue queue name (defaults to KUBE_USER_QUEUE)"
    ),
    interactive: bool = typer.Option(False, help="Run in interactive mode"),
    command: str = typer.Option(
        "", help="Command to run in the container"
    ),  # Made optional
    cpu_request: str = typer.Option("6", help="CPU request"),
    ram_request: str = typer.Option("40Gi", help="RAM request"),
    gpu_limit: int = typer.Option(1, help="GPU limit (0 for non-GPU jobs)"),
    gpu_product: GPU_PRODUCTS = typer.Option(
        "NVIDIA-A100-SXM4-40GB",
        help="GPU product type to use (ignored for non-GPU jobs)",
        show_choices=True,
        show_default=True,
    ),
    secrets_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of secret environment variables to export to the container",
    ),
    local_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of local environment variables to export to the container",
    ),
    load_dotenv: bool = typer.Option(
        True, help="Load environment variables from .env file"
    ),
    nfs_server: Optional[str] = typer.Option(
        None, help="NFS server (overrides config and environment)"
    ),
    pvc_name: str = typer.Option(None, help="Persistent Volume Claim name"),
    pvcs: str = typer.Option(
        None,
        help='Multiple PVCs with mount paths in JSON format (e.g., \'[{"name":"my-pvc","mount_path":"/data"}]\')',
    ),
    dry_run: bool = typer.Option(False, help="Dry run"),
    priority: PRIORITY = typer.Option(
        "default", help="Priority class name", show_default=True, show_choices=True
    ),
    vscode: bool = typer.Option(False, help="Install VS Code CLI in the container"),
    tunnel: bool = typer.Option(
        False,
        help="Start a VS Code SSH tunnel on startup. Requires SLACK_WEBHOOK and --vscode",
    ),
    startup_script: str = typer.Option(
        None, help="Path to startup script to run in container"
    ),
):
    """
    `kblaunch launch`
    Launch a Kubernetes job with specified configuration.

    This command creates and deploys a Kubernetes job with the given specifications,
    handling GPU allocation, resource requests, and environment setup.

    Args:
    * email (str, optional): User email for notifications
    * job_name (str, required): Name of the Kubernetes job
    * docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
    * namespace (str, default="KUBE_NAMESPACE"): Kubernetes namespace
    * queue_name (str, default="KUBE_USER_QUEUE"): Kueue queue name
    * interactive (bool, default=False): Run in interactive mode
    * command (str, default=""): Command to run in container
    * cpu_request (str, default="6"): CPU cores request
    * ram_request (str, default="40Gi"): RAM request
    * gpu_limit (int, default=1): Number of GPUs
    * gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
    * secrets_env_vars (List[str], default=[]): Secret environment variables
    * local_env_vars (List[str], default=[]): Local environment variables
    * load_dotenv (bool, default=True): Load .env file
    * nfs_server (str, optional): NFS server IP (overrides config)
    * pvc_name (str, optional): PVC name for single PVC mounting at /pvc
    * pvcs (str, optional): Multiple PVCs with mount paths in JSON format (used for mounting multiple PVCs)
    * dry_run (bool, default=False): Print YAML only
    * priority (PRIORITY, default="default"): Job priority
    * vscode (bool, default=False): Install VS Code
    * tunnel (bool, default=False): Start VS Code tunnel
    * startup_script (str, optional): Path to startup script

    Examples:
    ```bash
    # Launch an interactive GPU job
    kblaunch launch --job-name test-job --interactive

    # Launch a batch GPU job with custom command
    kblaunch launch --job-name batch-job --command "python train.py"

    # Launch a CPU-only job
    kblaunch launch --job-name cpu-job --gpu-limit 0

    # Launch with VS Code support
    kblaunch launch --job-name dev-job --interactive --vscode --tunnel

    # Launch with multiple PVCs
    kblaunch launch --job-name multi-pvc-job --pvcs '[{"name":"data-pvc","mount_path":"/data"},{"name":"models-pvc","mount_path":"/models"}]'
    ```

    Notes:
    - Interactive jobs keep running until manually terminated
    - GPU jobs require appropriate queue and priority settings
    - VS Code tunnel requires Slack webhook configuration
    - Multiple PVCs can be mounted with custom paths using the --pvcs option
    """

    # Load config
    config = load_config()

    # Determine namespace if not provided
    if namespace is None:
        namespace = get_current_namespace(config)
        if namespace is None:
            raise typer.BadParameter(
                "Namespace not provided. "
                "Please provide --namespace or run 'kblaunch setup' to configure."
            )

    # Determine queue name if not provided
    if queue_name is None:
        queue_name = get_user_queue(namespace)
        if queue_name is None:
            raise typer.BadParameter(
                "Queue name not provided. "
                "Please provide --queue-name or run 'kblaunch setup' to configure."
            )

    # Use email from config if not provided
    if email is None:
        email = config.get("email")
        if email is None:
            raise typer.BadParameter(
                "Email not provided and not found in config. "
                "Please provide --email or run 'kblaunch setup' to configure."
            )

    # Determine which NFS server to use (priority: command-line > config > env var > default)
    if nfs_server is None:
        nfs_server = config.get("nfs_server", NFS_SERVER)
        if nfs_server is None:
            # warn if NFS server is not set
            logger.warning(
                "NFS server not set/found. Please provide --nfs-server or run "
                "'kblaunch setup' to mount the NFS partition."
            )

    # Add SLACK_WEBHOOK to local_env_vars if configured
    if "slack_webhook" in config:
        os.environ["SLACK_WEBHOOK"] = config["slack_webhook"]
        if "SLACK_WEBHOOK" not in local_env_vars:
            local_env_vars.append("SLACK_WEBHOOK")

    if "user" in config and os.getenv("USER") is None:
        os.environ["USER"] = config["user"]

    if pvc_name is None:
        pvc_name = config.get("default_pvc")

    if pvc_name is not None:
        if not check_if_pvc_exists(pvc_name, namespace):
            logger.error(f"Provided PVC '{pvc_name}' does not exist")
            return

    # Parse multiple PVCs if provided
    parsed_pvcs = []
    if pvcs:
        try:
            parsed_pvcs = json.loads(pvcs)
            # Validate the format
            for pvc in parsed_pvcs:
                if (
                    not isinstance(pvc, dict)
                    or "name" not in pvc
                    or "mount_path" not in pvc
                ):
                    raise typer.BadParameter(
                        "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
                    )
                # Validate that the PVC exists
                if not check_if_pvc_exists(pvc["name"], namespace):
                    logger.warning(
                        f"PVC '{pvc['name']}' does not exist in namespace '{namespace}'"
                    )
                    if not typer.confirm(
                        f"Continue with PVC '{pvc['name']}' that doesn't exist?",
                        default=False,
                    ):
                        return 1
        except json.JSONDecodeError:
            raise typer.BadParameter("Invalid JSON format for pvcs parameter")

    # Add validation for command parameter
    if not interactive and command == "":
        raise typer.BadParameter("--command is required when not in interactive mode")

    # Validate GPU constraints only if requesting GPUs
    if gpu_limit > 0:
        try:
            validate_gpu_constraints(gpu_product.value, gpu_limit, priority.value)
        except ValueError as e:
            raise typer.BadParameter(str(e))

    is_completed = check_if_completed(job_name, namespace=namespace)
    if not is_completed:
        if typer.confirm(
            f"Job '{job_name}' already exists. Do you want to delete it and create a new one?",
            default=False,
        ):
            if not delete_namespaced_job_safely(
                job_name,
                namespace=namespace,
                user=config.get("user"),
            ):
                logger.error("Failed to delete existing job")
                return 1
        else:
            logger.info("Operation cancelled by user")
            return 1

    logger.info(f"Job '{job_name}' is completed. Launching a new job.")

    # Get local environment variables
    env_vars_dict = get_env_vars(
        local_env_vars=local_env_vars,
        load_dotenv=load_dotenv,
    )

    # Add USER and GIT_EMAIL to env_vars if git_secret is configured
    if config.get("git_secret"):
        env_vars_dict["USER"] = config.get("user", os.getenv("USER", "unknown"))
        env_vars_dict["GIT_EMAIL"] = email

    secrets_env_vars_dict = get_secret_env_vars(
        secrets_names=secrets_env_vars,
        namespace=namespace,
    )

    # Check for overlapping keys in local and secret environment variables
    intersection = set(secrets_env_vars_dict.keys()).intersection(env_vars_dict.keys())
    if intersection:
        logger.warning(
            f"Overlapping keys in local and secret environment variables: {intersection}"
        )
    # Combine the environment variables
    union = set(secrets_env_vars_dict.keys()).union(env_vars_dict.keys())

    # Handle startup script
    script_content = None
    if startup_script:
        script_content = read_startup_script(startup_script)
        # Create ConfigMap for startup script
        try:
            api = client.CoreV1Api()
            config_map = client.V1ConfigMap(
                metadata=client.V1ObjectMeta(
                    name=f"{job_name}-startup", namespace=namespace
                ),
                data={"startup.sh": script_content},
            )
            try:
                api.create_namespaced_config_map(namespace=namespace, body=config_map)
            except ApiException as e:
                if e.status == 409:  # Already exists
                    api.patch_namespaced_config_map(
                        name=f"{job_name}-startup", namespace=namespace, body=config_map
                    )
                else:
                    raise
        except Exception as e:
            raise typer.BadParameter(f"Failed to create startup script ConfigMap: {e}")

    if interactive:
        cmd = "while true; do sleep 60; done;"
    else:
        cmd = command
        logger.info(f"Command: {cmd}")

    logger.info(f"Creating job for: {cmd}")

    # Modify command to include startup script
    if script_content:
        cmd = f"bash /startup.sh && {cmd}"

    # Build the start command with optional VS Code installation
    start_command = send_message_command(union)
    if config.get("git_secret"):
        start_command += setup_git_command()
    if vscode:
        start_command += install_vscode_command()
        if tunnel:
            start_command += start_vscode_tunnel_command(union)
    elif tunnel:
        logger.error("Cannot start tunnel without VS Code installation")

    full_cmd = start_command + cmd

    job = KubernetesJob(
        name=job_name,
        cpu_request=cpu_request,
        ram_request=ram_request,
        image=docker_image,
        gpu_type="nvidia.com/gpu" if gpu_limit > 0 else None,
        gpu_limit=gpu_limit,
        gpu_product=gpu_product.value if gpu_limit > 0 else None,
        command=["/bin/bash", "-c", "--"],
        args=[full_cmd],
        env_vars=env_vars_dict,
        secret_env_vars=secrets_env_vars_dict,
        user_email=email,
        namespace=namespace,
        kueue_queue_name=queue_name,
        nfs_server=nfs_server,
        pvc_name=pvc_name,
        pvcs=parsed_pvcs,  # Pass the parsed PVCs list
        priority=priority.value,
        startup_script=script_content,
        git_secret=config.get("git_secret"),
    )
    job_yaml = job.generate_yaml()
    logger.info(job_yaml)
    # Run the Job on the Kubernetes cluster
    if not dry_run:
        job.run()
````
kblaunch launch
Launch a Kubernetes job with specified configuration.
This command creates and deploys a Kubernetes job with the given specifications, handling GPU allocation, resource requests, and environment setup.
Args:
- email (str, optional): User email for notifications
- job_name (str, required): Name of the Kubernetes job
- docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
- namespace (str, default="KUBE_NAMESPACE"): Kubernetes namespace
- queue_name (str, default="KUBE_USER_QUEUE"): Kueue queue name
- interactive (bool, default=False): Run in interactive mode
- command (str, default=""): Command to run in container
- cpu_request (str, default="6"): CPU cores request
- ram_request (str, default="40Gi"): RAM request
- gpu_limit (int, default=1): Number of GPUs
- gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
- secrets_env_vars (List[str], default=[]): Secret environment variables
- local_env_vars (List[str], default=[]): Local environment variables
- load_dotenv (bool, default=True): Load .env file
- nfs_server (str, optional): NFS server IP (overrides config)
- pvc_name (str, optional): PVC name for single PVC mounting at /pvc
- pvcs (str, optional): Multiple PVCs with mount paths in JSON format (used for mounting multiple PVCs)
- dry_run (bool, default=False): Print YAML only
- priority (PRIORITY, default="default"): Job priority
- vscode (bool, default=False): Install VS Code
- tunnel (bool, default=False): Start VS Code tunnel
- startup_script (str, optional): Path to startup script
Examples:
```bash
# Launch an interactive GPU job
kblaunch launch --job-name test-job --interactive

# Launch a batch GPU job with custom command
kblaunch launch --job-name batch-job --command "python train.py"

# Launch a CPU-only job
kblaunch launch --job-name cpu-job --gpu-limit 0

# Launch with VS Code support
kblaunch launch --job-name dev-job --interactive --vscode --tunnel

# Launch with multiple PVCs
kblaunch launch --job-name multi-pvc-job --pvcs '[{"name":"data-pvc","mount_path":"/data"},{"name":"models-pvc","mount_path":"/models"}]'
```
Notes:
- Interactive jobs keep running until manually terminated
- GPU jobs require appropriate queue and priority settings
- VS Code tunnel requires Slack webhook configuration
- Multiple PVCs can be mounted with custom paths using the --pvcs option
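When launches are scripted, it is easiest to build the `--pvcs` value with `json.dumps` rather than hand-quoting the JSON. A minimal sketch, assuming `kblaunch` is on the PATH; the job name, command, and PVC names are placeholders, and `--dry-run` keeps it to printing the generated YAML:

```python
# Build the --pvcs JSON argument programmatically and do a dry-run launch.
# PVC names, the job name, and the command are placeholders.
import json
import subprocess

pvcs = [
    {"name": "data-pvc", "mount_path": "/data"},
    {"name": "models-pvc", "mount_path": "/models"},
]

subprocess.run(
    [
        "kblaunch", "launch",
        "--job-name", "multi-pvc-job",
        "--command", "python train.py",
        "--pvcs", json.dumps(pvcs),
        "--dry-run",
    ],
    check=True,
)
```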
````python
@monitor_app.command("gpus")
def monitor_gpus(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor gpus`
    Display overall GPU statistics and utilization by type.

    Shows a comprehensive view of GPU allocation and usage across the cluster,
    including both running and pending GPU requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - Total GPU count by type
    - Running vs. pending GPUs
    - Details of pending GPU requests
    - Wait times for pending requests

    Examples:
    ```bash
    kblaunch monitor gpus
    kblaunch monitor gpus --namespace custom-namespace
    ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_gpu_total(namespace=namespace)
    except Exception as e:
        print(f"Error displaying GPU stats: {e}")
````
kblaunch monitor gpus
Display overall GPU statistics and utilization by type.
Shows a comprehensive view of GPU allocation and usage across the cluster, including both running and pending GPU requests.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
Output includes:
- Total GPU count by type
- Running vs. pending GPUs
- Details of pending GPU requests
- Wait times for pending requests
Examples:
```bash
kblaunch monitor gpus
kblaunch monitor gpus --namespace custom-namespace
```
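Conceptually, this view boils down to summing `nvidia.com/gpu` requests across pods and splitting them by phase. The snippet below is not kblaunch's implementation, only an illustrative sketch of the same idea using the official Kubernetes Python client (the namespace is a placeholder):

```python
# Illustrative: count requested NVIDIA GPUs per pod phase in a namespace.
# This approximates the kind of data `kblaunch monitor gpus` reports; it is not its code.
from collections import Counter

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

counts = Counter()
for pod in v1.list_namespaced_pod("my-namespace").items:  # namespace is a placeholder
    for container in pod.spec.containers:
        requests = (container.resources.requests or {}) if container.resources else {}
        gpus = int(requests.get("nvidia.com/gpu", 0))
        if gpus:
            counts[pod.status.phase] += gpus  # e.g. Running vs. Pending

print(dict(counts))
```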
````python
@monitor_app.command("users")
def monitor_users(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor users`
    Display GPU usage statistics grouped by user.

    Provides a user-centric view of GPU allocation and utilization,
    helping identify resource usage patterns across users.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - GPUs allocated per user
    - Average memory usage per user
    - Inactive GPU count per user
    - Overall usage totals

    Examples:
    ```bash
    kblaunch monitor users
    kblaunch monitor users --namespace custom-namespace
    ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_user_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying user stats: {e}")
````
kblaunch monitor users
Display GPU usage statistics grouped by user.
Provides a user-centric view of GPU allocation and utilization, helping identify resource usage patterns across users.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
Output includes:
- GPUs allocated per user
- Average memory usage per user
- Inactive GPU count per user
- Overall usage totals
Examples:
```bash
kblaunch monitor users
kblaunch monitor users --namespace custom-namespace
```
````python
@monitor_app.command("jobs")
def monitor_jobs(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor jobs`
    Display detailed job-level GPU statistics.

    Shows comprehensive information about all running GPU jobs,
    including resource usage and job characteristics.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - Job identification and ownership
    - Resource allocation (CPU, RAM, GPU)
    - GPU memory usage
    - Job status (active/inactive)
    - Job mode (interactive/batch)
    - Resource totals and averages

    Examples:
    ```bash
    kblaunch monitor jobs
    kblaunch monitor jobs --namespace custom-namespace
    ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_job_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying job stats: {e}")
````
kblaunch monitor jobs
Display detailed job-level GPU statistics.
Shows comprehensive information about all running GPU jobs, including resource usage and job characteristics.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
Output includes:
- Job identification and ownership
- Resource allocation (CPU, RAM, GPU)
- GPU memory usage
- Job status (active/inactive)
- Job mode (interactive/batch)
- Resource totals and averages
Examples:
```bash
kblaunch monitor jobs
kblaunch monitor jobs --namespace custom-namespace
```
````python
@monitor_app.command("queue")
def monitor_queue(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    reasons: bool = typer.Option(False, help="Display queued job event messages"),
    include_cpu: bool = typer.Option(False, help="Show CPU jobs in the queue"),
):
    """
    `kblaunch monitor queue`
    Display statistics about queued workloads.

    Shows information about jobs waiting in the Kueue scheduler,
    including wait times and resource requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
    - reasons: Show detailed reason messages for queued jobs
    - include_cpu: Include CPU jobs in the queue

    Output includes:
    - Queue position and wait time
    - Resource requests (CPU, RAM, GPU)
    - Job priority
    - Queueing reasons (if --reasons flag is used)

    Examples:
    ```bash
    kblaunch monitor queue
    kblaunch monitor queue --reasons
    kblaunch monitor queue --namespace custom-namespace
    ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_queue_stats(namespace=namespace, reasons=reasons, include_cpu=include_cpu)
    except Exception as e:
        print(f"Error displaying queue stats: {e}")
````
kblaunch monitor queue
Display statistics about queued workloads.
Shows information about jobs waiting in the Kueue scheduler, including wait times and resource requests.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
- reasons: Show detailed reason messages for queued jobs
- include_cpu: Include CPU jobs in the queue
Output includes:
- Queue position and wait time
- Resource requests (CPU, RAM, GPU)
- Job priority
- Queueing reasons (if --reasons flag is used)
Examples:
```bash
kblaunch monitor queue
kblaunch monitor queue --reasons
kblaunch monitor queue --namespace custom-namespace
```
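Because the monitor commands are exported at the package level (see `__all__` in the module source above), they can also be called from Python. A small sketch; since these are Typer command functions, every option is passed explicitly rather than relying on the CLI defaults, and `my-namespace` is a placeholder:

```python
# Call the monitoring commands from Python instead of the CLI.
# Options are passed explicitly; "my-namespace" is a placeholder.
from kblaunch import monitor_gpus, monitor_queue

monitor_gpus(namespace="my-namespace")
monitor_queue(namespace="my-namespace", reasons=True, include_cpu=False)
```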