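10 - Development and Debugging Guide & CPFS/OSS Storage Practices

1. Development and Debugging Approaches on K8s

Comparison of Debugging Approaches

Approach	Use case	Pros	Cons
Local development + Docker Compose	API/service development	Fast iteration	Differs from the K8s environment
kubectl exec into a Pod	Quick troubleshooting	Simple and direct	The container may lack debugging tools
kubectl debug	Debugging crashing Pods	No need to modify the Pod	Requires K8s 1.25+
Port forwarding	Accessing cluster services locally	No Ingress needed	Single connection
Telepresence	Local code + cluster environment	Closest to the real environment	Slightly more complex to set up
Remote IDE (JupyterHub)	AI/training debugging	Direct GPU access	Network latency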

Approach 1: kubectl exec + debug

bash

# Enter a running Pod
kubectl exec -it <pod-name> -n serving -- /bin/sh

# If the container has no shell (distroless image),
# attach a debug container instead
kubectl debug <pod-name> -n serving -it --image=nicolaka/netshoot --target=app
# netshoot ships curl, ping, dig, tcpdump, strace, and other debugging tools

# Debug a crashing Pod (CrashLoopBackOff)
kubectl debug <pod-name> -n serving -it \
  --copy-to=debug-pod \
  --container=app \
  --image=python:3.11-slim \
  -- /bin/sh
# This copies the Pod spec but starts the app container with a new image and a shell,
# so you can inspect environment variables, mounted volumes, and so on

Approach 2: Port Forwarding

bash

# Forward a Pod port to your machine
kubectl port-forward pod/<pod-name> 8000:8000 -n serving

# Forward a Service port
kubectl port-forward svc/api-gateway 8000:8000 -n serving

# Forward to a database (connect to the remote PG from your machine)
kubectl port-forward svc/postgresql 5432:5432 -n infra

# Run the port-forward in the background
kubectl port-forward svc/api-gateway 8000:8000 -n serving &
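
A port-forward started in the background takes a moment before it accepts connections, so scripts that use it right away can fail intermittently. The sketch below is one way to wait for the local port; the function name `wait_for_port` and its defaults are illustrative, not part of kubectl.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.2) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Example (assumes the api-gateway port-forward above was just started):
#   if wait_for_port("127.0.0.1", 8000):
#       print("port-forward ready")
```

This only confirms that the local listener is up; it does not check that the service behind it is healthy.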

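Approach 3: Telepresence (local code + cluster environment)

Best suited to day-to-day development: your code runs locally but can reach every service inside the cluster.

bash

# Install
brew install datawire/blackbird/telepresence

# Connect to the cluster
telepresence connect

# Cluster services are now directly reachable from your machine
curl http://api-gateway.serving:8000/healthz
nc -zv redis.infra 6379   # Redis does not speak HTTP, so check TCP reachability instead of using curl

# Intercept traffic and route it to your machine (replacing the Pod in the cluster)
telepresence intercept api-gateway --namespace serving --port 8000

# Traffic sent to api-gateway in the cluster is now forwarded to local port 8000
# Start your service locally
python src/main.py --port 8000

# Stop intercepting and disconnect
telepresence leave api-gateway
telepresence quit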

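Approach 4: JupyterHub on K8s (AI development)

The approach algorithm engineers use most: start a Jupyter environment inside the cluster and use the GPU directly. The example below runs a standalone Jupyter Pod rather than a full JupyterHub deployment.

yaml

# Quickly start a Jupyter Pod with a GPU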
apiVersion: v1
kind: Pod
metadata:
  name: dev-notebook-darren
  namespace: training
  labels:
    user: darren
    type: notebook
spec:
  nodeSelector:
    pool: gpu-training
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: jupyter
    image: registry-vpc.cn-hangzhou.aliyuncs.com/robotics-base/pytorch-base:2.2.0-cuda12.1
    command:
    - jupyter
    - notebook
    - --ip=0.0.0.0
    - --port=8888
    - --no-browser
    - --allow-root
    - "--NotebookApp.token="   # empty token disables auth; reach the Pod only through port-forward
    ports:
    - containerPort: 8888
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: workspace
      mountPath: /workspace
    - name: datasets
      mountPath: /data
      readOnly: true
    - name: outputs
      mountPath: /output
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: dev-workspace-darren
  - name: datasets
    persistentVolumeClaim:
      claimName: datasets
  - name: outputs
    persistentVolumeClaim:
      claimName: model-outputs

bash

# Access Jupyter through a port-forward
kubectl port-forward pod/dev-notebook-darren 8888:8888 -n training
# Then open http://localhost:8888 in a browser

2. Training Job Debugging Workflow

┌────────────────────────────────────────────────────────────────┐
│  Four-step training debugging workflow                         │
│                                                                │
│  Step 1: Validate locally on a small dataset                   │
│  └── docker run --gpus 1 -v "$(pwd)/data:/data" <image> \      │
│      python train.py --epochs=1 --batch-size=4 --debug         │
│                                                                │
│  Step 2: Validate on a single GPU in K8s                       │
│  └── Jupyter Notebook / single-Worker PyTorchJob               │
│      Check: data loading OK? model converging? enough memory?  │
│                                                                │
│  Step 3: Small-scale distributed validation in K8s             │
│  └── 2 Workers × 1 GPU, run 5 epochs                           │
│      Check: NCCL comms OK? gradient sync OK? checkpoints OK?   │
│                                                                │
│  Step 4: Full-scale training                                   │
│  └── N Workers × M GPUs, full training run                     │
│      Monitor: GPU utilization, loss curve, TensorBoard         │
└────────────────────────────────────────────────────────────────┘
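
Step 1 relies on `train.py` accepting `--epochs`, `--batch-size`, and `--debug`. The source does not show how the script handles them, so the following is only a sketch of one plausible entry point: `parse_args`, `effective_sample_count`, and the 64-sample debug cap are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the Step 1 smoke test (hypothetical train.py CLI)."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--debug", action="store_true",
                        help="Smoke-test mode: cap the number of samples")
    return parser.parse_args(argv)


def effective_sample_count(total: int, debug: bool, debug_cap: int = 64) -> int:
    """In debug mode use at most debug_cap samples; otherwise use the full dataset."""
    return min(total, debug_cap) if debug else total


args = parse_args(["--epochs=1", "--batch-size=4", "--debug"])
n = effective_sample_count(total=50_000, debug=args.debug)
steps_per_epoch = -(-n // args.batch_size)  # ceiling division
print(args.epochs, args.batch_size, n, steps_per_epoch)  # prints: 1 4 64 16
```

Capping the sample count keeps the local run to a handful of steps, which is enough to catch shape errors and broken data paths before any cluster time is spent.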

Training Pod Debugging Tips

bash

# Tail the training Pod logs
kubectl logs -f <pytorch-job-worker-0> -n training

# If a Pod is stuck in Pending, find out why
kubectl describe pod <pod-name> -n training
# Common causes: not enough GPUs, unbound PVC, image pull failure

# Inspect NCCL communication logs (requires NCCL_DEBUG=INFO in the Pod environment)
kubectl logs <worker-0> -n training | grep NCCL

# Check GPU usage
kubectl exec <worker-0> -n training -- nvidia-smi

# Open a shell in the training Pod for manual debugging
kubectl exec -it <worker-0> -n training -- /bin/bash
# The commands below run inside the Pod
python -c "import torch; print(torch.cuda.device_count())"
python -c "import torch.distributed as dist; dist.init_process_group('nccl')"
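
With `NCCL_DEBUG=INFO` the logs get noisy, and a plain `grep NCCL` returns every line. A small filter can separate warnings from informational output. The sketch below assumes NCCL's usual `NCCL INFO` / `NCCL WARN` markers; the sample lines are made up for illustration and are not real cluster output.

```python
import re
from collections import Counter

# Matches the "NCCL <LEVEL>" marker that NCCL prints when NCCL_DEBUG is set.
NCCL_LINE = re.compile(r"NCCL (INFO|WARN)\s+(.*)")


def summarize_nccl(lines):
    """Count NCCL lines per level and collect the WARN messages."""
    counts = Counter()
    warnings = []
    for line in lines:
        match = NCCL_LINE.search(line)
        if not match:
            continue  # not an NCCL line (e.g. regular training output)
        level, message = match.groups()
        counts[level] += 1
        if level == "WARN":
            warnings.append(message.strip())
    return counts, warnings


# Illustrative sample only; feed it the output of `kubectl logs <worker-0> -n training`.
sample = [
    "worker-0:57:57 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.12<0>",
    "worker-0:57:91 [0] NCCL WARN Call to connect returned Connection refused",
    "epoch 1 step 10 loss 0.93",
]
counts, warnings = summarize_nccl(sample)
print(dict(counts), warnings)
```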

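3. CPFS Intelligent Computing Edition Storage Practices

What Is CPFS Intelligent Computing Edition?

CPFS (Cloud Parallel File Storage) Intelligent Computing Edition (智算版) is Alibaba Cloud's high-performance parallel file system designed for AI/HPC workloads:

Feature	NAS General-purpose	NAS Extreme	CPFS Intelligent Computing Edition
Throughput	600 MB/s	1.2 GB/s	Tens of GB/s
IOPS	15K	100K	Millions
Latency	Milliseconds	Sub-millisecond	Microseconds
Concurrency	Medium	High	Very high
Use case	General file sharing	Moderate I/O	Large-scale AI training
Price	Low	Medium	High

Scenarios that suit CPFS:

Data reads for multi-node distributed training (data parallelism)
Highly concurrent reads of large (TB-scale) datasets
Frequent checkpoint writes
Simulation environment data I/O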
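
To make the throughput gap concrete, the arithmetic below estimates how long one full pass over a 10 TiB dataset takes at each tier. The NAS figures come from the comparison table; 20 GB/s is an assumed stand-in for "tens of GB/s", and sustained real-world throughput will differ.

```python
TIB = 1024 ** 4   # bytes in one TiB
GB = 1000 ** 3    # bytes in one GB (decimal, as in the throughput figures)


def full_read_minutes(dataset_tib: float, throughput_gb_per_s: float) -> float:
    """Minutes needed to read the whole dataset once at a sustained throughput."""
    return dataset_tib * TIB / (throughput_gb_per_s * GB) / 60


# 20.0 GB/s is an assumption standing in for "tens of GB/s".
tiers = [("NAS General-purpose", 0.6), ("NAS Extreme", 1.2), ("CPFS (assumed)", 20.0)]
for name, gb_per_s in tiers:
    print(f"{name}: {full_read_minutes(10, gb_per_s):.0f} min per pass over 10 TiB")
```

Under these assumptions one epoch's worth of reads drops from roughly five hours to under ten minutes, which is why storage throughput matters once GPUs outpace the data pipeline.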

Using CPFS in ACK

yaml

# 1. Create the CPFS StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-cpfs
provisioner: cpfsplugin.csi.alibabacloud.com
parameters:
  volumeAs: subpath
  server: "<cpfs-id>.cn-hangzhou.cpfs.nas.aliyuncs.com"
  fileSystemId: "<cpfs-id>"
  protocolType: lustre       # CPFS Intelligent Computing Edition uses the Lustre protocol
reclaimPolicy: Retain
mountOptions:
  - flock                    # file-lock support

---
# 2. Create the PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cpfs-training-data
  namespace: training
spec:
  accessModes: [ReadWriteMany]
  storageClassName: alicloud-cpfs
  resources:
    requests:
      storage: 10Ti          # CPFS is billed by capacity

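Using CPFS in a Training Pod

yaml

spec:
  containers:
  - name: train
    volumeMounts:
    - name: training-data
      mountPath: /data
      subPath: datasets        # illustrative subPath so /data and /output do not alias the same directory
    - name: checkpoints
      mountPath: /output
      subPath: checkpoints     # illustrative subPath
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: cpfs-training-data    # CPFS high-speed reads
  - name: checkpoints
    persistentVolumeClaim:
      claimName: cpfs-training-data    # CPFS high-speed writes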
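
Because `/output` sits on shared storage that other Pods may read while training is still writing, a half-written checkpoint can be picked up by mistake. One common mitigation, sketched here, is to write to a temporary file in the same directory and rename it into place. `atomic_write` is a hypothetical helper, and the sketch assumes the file system provides atomic renames within a directory.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # push the bytes to storage before renaming
        os.replace(tmp_path, path)  # readers see either the old file or the new one
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Example: atomic_write("/output/ckpt_epoch_3.pt", serialized_checkpoint_bytes)
```

In a multi-Worker job it is usually enough for a single rank to write the checkpoint this way.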

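CPFS vs. NAS Selection Matrix

Use case	Recommended storage	Reason
Distributed training data reads	CPFS Intelligent Computing Edition	Needs very high concurrent read throughput
Checkpoint writes	CPFS Intelligent Computing Edition	Many Workers write at the same time
Model loading for inference	NAS General-purpose	Read infrequently; extreme performance is unnecessary
Labeling platform files	NAS General-purpose	Cost-sensitive; performance is sufficient
Databases	Cloud disk (ESSD)	Requires block storage
Archive / cold data	OSS	Lowest cost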

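CPFS Performance Tuning

yaml

# DataLoader-related tuning in the training Pod
env:
- name: OMP_NUM_THREADS
  value: "4"
# DataLoader settings
# num_workers=4-8  (depending on CPU core count)
# prefetch_factor=2
# pin_memory=True  (recommended for GPU training)
# persistent_workers=True

python

# Optimizing data loading in train.py
from torch.utils.data import DataLoader

# dataset, sampler, and args are defined elsewhere in train.py
dataloader = DataLoader(
    dataset,
    batch_size=args.batch_size,
    sampler=sampler,
    num_workers=8,            # CPFS handles high concurrency, so this can be set fairly high
    pin_memory=True,
    prefetch_factor=2,
    persistent_workers=True,  # avoid re-creating workers every epoch
)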
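
The "4-8 depending on CPU core count" guidance can be turned into a small helper rather than a hard-coded value. The heuristic below (split the visible CPUs across per-GPU processes, keep one core for the main loop, cap at 8) is an assumption for illustration, not a PyTorch recommendation, and `suggest_num_workers` is a hypothetical name.

```python
import os


def available_cpus() -> int:
    """CPUs this process may run on (honours affinity masks on Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def suggest_num_workers(cpus: int, gpus_per_pod: int = 1, cap: int = 8) -> int:
    """Split CPUs across per-GPU training processes, keeping one core for the main loop."""
    per_process = max(cpus // max(gpus_per_pod, 1) - 1, 0)
    return min(per_process, cap)


print(suggest_num_workers(available_cpus()))
```

Note that a container's CPU limit enforced through a CFS quota is not reflected in the affinity mask, so in a Pod it may be safer to pass the CPU limit in explicitly.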

4. OSS Integration

Mounting OSS via CSI

yaml

# Mount OSS as a PVC (read-only scenarios); credential settings are omitted here
apiVersion: v1
kind: PersistentVolume
metadata:
  name: oss-pv
spec:
  capacity:
    storage: 100Gi
  accessModes: [ReadOnlyMany]
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: oss-pv
    volumeAttributes:
      bucket: "robotics-raw-data"
      url: "oss-cn-hangzhou-internal.aliyuncs.com"
      path: "/datasets/v1"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: oss-data
  namespace: training
spec:
  accessModes: [ReadOnlyMany]
  storageClassName: ""        # bind to the static PV instead of the default StorageClass
  resources:
    requests:
      storage: 100Gi
  volumeName: oss-pv

OSS SDK Approach (More Flexible)

For uploading or downloading large files, using the SDK directly is more efficient than a CSI mount:

python

import os

import oss2

# Temporary STS credentials are read from the environment
auth = oss2.StsAuth(
    os.environ['OSS_ACCESS_KEY_ID'],
    os.environ['OSS_ACCESS_KEY_SECRET'],
    os.environ['OSS_SECURITY_TOKEN']
)
bucket = oss2.Bucket(auth, 'oss-cn-hangzhou-internal.aliyuncs.com', 'robotics-models')

# Upload a model
bucket.put_object_from_file('models/vision/v1.2.0/model.pt', '/output/model_final.pt')

# Download a model
bucket.get_object_to_file('models/vision/v1.2.0/model.pt', '/models/model.pt')
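
The object keys above follow a `models/<task>/<version>/model.pt` pattern. Building keys in one place avoids typos between the upload and download sides. The helpers below are a sketch: `model_object_key` and `upload_model` are hypothetical names, and `bucket` is assumed to be an `oss2.Bucket` (or anything exposing `put_object_from_file`).

```python
import posixpath


def model_object_key(task: str, version: str, filename: str = "model.pt") -> str:
    """Build a key such as models/vision/v1.2.0/model.pt."""
    if not version.startswith("v"):
        version = f"v{version}"
    return posixpath.join("models", task, version, filename)


def upload_model(bucket, local_path: str, task: str, version: str) -> str:
    """Upload local_path via bucket.put_object_from_file and return the object key."""
    key = model_object_key(task, version)
    bucket.put_object_from_file(key, local_path)
    return key


# Example: upload_model(bucket, "/output/model_final.pt", "vision", "1.2.0")
```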

In K8s, OSS credentials are injected through a ServiceAccount plus RRSA (RAM Roles for Service Accounts), so no AK/SK needs to be hardcoded:

yaml

# ServiceAccount bound to an Alibaba Cloud RAM role
apiVersion: v1
kind: ServiceAccount
metadata:
  name: oss-access
  namespace: training
  annotations:
    pod-identity.alibabacloud.com/role-name: oss-training-role   # RRSA annotation read by ack-pod-identity-webhook; verify against your cluster's RRSA setup

5. Data Flow and Storage Integration

┌──────────────────────────────────────────────────────────────────────┐
│                         End-to-end data flow                         │
│                                                                      │
│  Robot side                                                          │
│  ├── Collect data ─────► OSS (raw data bucket)                       │
│  │                           │                                       │
│  │                           ▼                                       │
│  │                 Data processing Job (K8s)                         │
│  │                 ├── Cleaning / formatting                         │
│  │                 ├── Labeling platform (NAS)                       │
│  │                 └── Processed output data                         │
│  │                           │                                       │
│  │                           ▼                                       │
│  │                 CPFS (training datasets)  ← fast training reads   │
│  │                           │                                       │
│  │                           ▼                                       │
│  │                 Distributed training (PyTorchJob)                 │
│  │                 ├── Read data: CPFS                               │
│  │                 ├── Write checkpoints: CPFS                       │
│  │                 └── Final model                                   │
│  │                           │                                       │
│  │                           ▼                                       │
│  │                 Model evaluation Job → pass → model registry      │
│  │                           │                                       │
│  │                     ┌─────┴─────────────┐                         │
│  │                     ▼                   ▼                         │
│  │         NAS (inference loading)   OSS (model archive)             │
│  │                     │                                             │
│  │                     ▼                                             │
│  │         Inference service (Deployment)                            │
│  │                     │                                             │
│  ◄─── API response ◄───┘                                             │
│  │                                                                   │
│  ◄─── OTA download ◄── OSS (OTA packages)                            │
└──────────────────────────────────────────────────────────────────────┘

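Storage Selection Summary

High-performance read/write (training)  → CPFS Intelligent Computing Edition
Shared files (models/labeling)          → NAS General-purpose
Massive low-cost storage (archive)      → OSS
Databases                               → Cloud disk (ESSD)
Temporary cache                         → emptyDir (local SSD)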
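
If the selection rules need to be enforced in tooling (for example, a script that generates PVC manifests), the summary can be encoded as a lookup table. The workload keys below are hypothetical labels chosen for this sketch; the storage choices mirror the summary above.

```python
# Workload keys are illustrative labels; values follow the selection summary.
STORAGE_BY_WORKLOAD = {
    "training": "CPFS",
    "shared-files": "NAS General-purpose",
    "archive": "OSS",
    "database": "Cloud disk (ESSD)",
    "temp-cache": "emptyDir (local SSD)",
}


def pick_storage(workload: str) -> str:
    """Return the recommended storage type for a workload label."""
    try:
        return STORAGE_BY_WORKLOAD[workload]
    except KeyError:
        known = ", ".join(sorted(STORAGE_BY_WORKLOAD))
        raise ValueError(f"unknown workload {workload!r}; expected one of: {known}") from None


print(pick_storage("training"))
```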

6. Development Environment Best Practices

Create an Isolated Workspace for Each Developer

bash

# Create the developer's PVC (a NAS subpath)
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dev-workspace-darren
  namespace: training
spec:
  accessModes: [ReadWriteMany]
  storageClassName: alicloud-nas
  resources:
    requests:
      storage: 50Gi
EOF

Quick-Start Script for a Dev Pod

bash

#!/bin/bash
# scripts/dev-pod.sh
# Usage: ./dev-pod.sh <username> [gpu_count]

USER=${1:?Usage: dev-pod.sh <username> [gpu_count]}
GPU=${2:-1}

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dev-${USER}
  namespace: training
  labels:
    user: ${USER}
    type: dev-pod
spec:
  nodeSelector:
    pool: gpu-training
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: dev
    image: registry-vpc.cn-hangzhou.aliyuncs.com/robotics-base/pytorch-base:2.2.0-cuda12.1
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: ${GPU}
    volumeMounts:
    - name: workspace
      mountPath: /workspace
    - name: datasets
      mountPath: /data
      readOnly: true
    - name: outputs
      mountPath: /output
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: dev-workspace-${USER}
  - name: datasets
    persistentVolumeClaim:
      claimName: cpfs-training-data
  - name: outputs
    persistentVolumeClaim:
      claimName: model-outputs
EOF

echo "Waiting for pod to be ready..."
kubectl wait --for=condition=ready pod/dev-${USER} -n training --timeout=300s
echo "Pod ready! Connect with:"
echo "  kubectl exec -it dev-${USER} -n training -- /bin/bash"
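
The script interpolates the username into a Pod name, a label value, and a PVC name, so an invalid username only fails once `kubectl apply` rejects the manifest. A pre-check can catch this earlier. The sketch below applies the DNS-label rule (lowercase alphanumerics and `-`, at most 63 characters), a conservative subset of what K8s accepts for these fields; `dev_pod_name` is a hypothetical helper.

```python
import re

# DNS-label rule: lowercase alphanumerics or '-', starting and ending with an alphanumeric.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")
MAX_NAME_LENGTH = 63


def dev_pod_name(username: str) -> str:
    """Return the dev Pod name for a username, or raise ValueError if it is unsafe."""
    if not DNS_LABEL.match(username):
        raise ValueError(f"invalid username {username!r}: use lowercase letters, digits, and '-'")
    # The longest derived name in the script is the PVC name dev-workspace-<username>.
    if len(f"dev-workspace-{username}") > MAX_NAME_LENGTH:
        raise ValueError(f"username {username!r} is too long")
    return f"dev-{username}"


print(dev_pod_name("darren"))  # prints: dev-darren
```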

Dev Pod Cleanup Policy

yaml

# Use a CronJob to clean up stale dev Pods automatically
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-idle-dev-pods
  namespace: training
spec:
  schedule: "0 22 * * *"      # run daily at 22:00 (controller time zone unless spec.timeZone is set)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-cleaner
          containers:
          - name: cleaner
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - |
              # Delete dev Pods older than 24 hours (age-based, not activity-based)
              kubectl get pods -n training -l type=dev-pod \
                --no-headers -o custom-columns=NAME:.metadata.name,AGE:.metadata.creationTimestamp \
                | while read -r name ts; do
                  age=$(( ($(date +%s) - $(date -d "$ts" +%s)) / 3600 ))
                  if [ "$age" -gt 24 ]; then
                    echo "Deleting dev pod: $name (age: ${age}h)"
                    kubectl delete pod "$name" -n training
                  fi
                done
          restartPolicy: OnFailure
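
The age check in the cleanup script can be hard to test in shell. The same rule is sketched in Python below so the threshold logic can be unit-tested; `age_hours` and `should_delete` are hypothetical helpers, and the timestamps are made-up examples in the `creationTimestamp` format.

```python
from datetime import datetime, timezone


def age_hours(creation_timestamp: str, now: datetime) -> float:
    """Hours elapsed since a creationTimestamp such as 2024-05-01T08:00:00Z."""
    created = datetime.strptime(creation_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    created = created.replace(tzinfo=timezone.utc)  # the trailing Z means UTC
    return (now - created).total_seconds() / 3600


def should_delete(creation_timestamp: str, now: datetime, max_age_hours: float = 24) -> bool:
    """True when the Pod is older than max_age_hours."""
    return age_hours(creation_timestamp, now) > max_age_hours


now = datetime(2024, 5, 2, 22, 0, 0, tzinfo=timezone.utc)
print(should_delete("2024-05-01T08:00:00Z", now))  # 38 h old, prints: True
print(should_delete("2024-05-02T09:30:00Z", now))  # 12.5 h old, prints: False
```

Unlike the shell version, which truncates the age to whole hours before comparing, this sketch compares fractional hours, so a Pod that is 24.5 hours old is already eligible for deletion.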

Next Steps

→ 11 - End-to-End Release and Rollout Process
