09 - CI/CD Pipeline: Complete Plan

1. Overall Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         Complete CI/CD Flow                         │
│                                                                     │
│  ┌───────────┐    ┌──────────────────────────────────────┐          │
│  │ Code repo │    │     CI (Continuous Integration)      │          │
│  │ GitHub    │    │                                      │          │
│  │           │    │  ┌─────┐ ┌──────┐ ┌──────┐ ┌─────┐   │          │
│  │ PR/Push   │───►│  │Lint │→│ Test │→│Build │→│Scan │   │          │
│  │           │    │  └─────┘ └──────┘ └──────┘ └──┬──┘   │          │
│  └───────────┘    │                               │      │          │
│                   │                          Push to ACR │          │
│                   └───────────────────────────────┼──────┘          │
│                                                   │                 │
│                   ┌───────────────────────────────┼──────┐          │
│                   │  CD (Continuous Deployment)   │      │          │
│                   │                               ▼      │          │
│                   │  ┌────────────────────────────────┐  │          │
│                   │  │ Manifests Git Repo             │  │          │
│                   │  │ (values.yaml / kustomize)      │  │          │
│                   │  └───────────────┬────────────────┘  │          │
│                   │                  │                   │          │
│                   │                  ▼                   │          │
│                   │  ┌────────────────────────────────┐  │          │
│                   │  │ ArgoCD                         │  │          │
│                   │  │ - Detects Git changes          │  │          │
│                   │  │ - Sync → Staging               │  │          │
│                   │  │ - Verified → Sync → Production │  │          │
│                   │  └────────────────────────────────┘  │          │
│                   └──────────────────────────────────────┘          │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                       K8s Cluster (ACK)                       │  │
│  │                                                               │  │
│  │  Staging NS ──verify──► Production NS                         │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

2. Code Repository Structure

Organization repository layout
├── robotics-api/          # API service code
│   ├── src/
│   ├── tests/
│   ├── Dockerfile
│   ├── .github/workflows/ci.yml
│   └── ...

├── robotics-training/     # Training code
│   ├── src/
│   ├── configs/
│   ├── Dockerfile
│   └── .github/workflows/ci.yml

├── robotics-llm-server/   # LLM inference service
│   ├── src/
│   ├── Dockerfile
│   └── .github/workflows/ci.yml

├── robotics-admin-web/    # Admin console frontend
│   ├── src/
│   ├── Dockerfile
│   └── .github/workflows/ci.yml

└── robotics-k8s-manifests/  # K8s deployment config (separate repository!)
    ├── base/                # Shared base configuration
    │   ├── namespace.yaml
    │   └── network-policies.yaml
    ├── serving/             # Inference/API services
    │   ├── api-gateway/
    │   │   ├── deployment.yaml
    │   │   ├── service.yaml
    │   │   ├── hpa.yaml
    │   │   └── kustomization.yaml
    │   └── llm-server/
    │       ├── deployment.yaml
    │       ├── service.yaml
    │       └── kustomization.yaml
    ├── training/            # Training-related resources
    │   ├── pytorch-job-templates/
    │   └── storage.yaml
    ├── infra/               # Infrastructure
    │   ├── monitoring/
    │   ├── argocd/
    │   └── databases/
    ├── overlays/            # Per-environment differences
    │   ├── staging/
    │   │   └── kustomization.yaml
    │   └── production/
    │       └── kustomization.yaml
    └── argocd-apps/         # ArgoCD Application definitions
        ├── serving.yaml
        ├── infra.yaml
        └── data.yaml

Core principle: keep the application code repositories separate from the deployment configuration repository.


3. CI Pipeline Detailed Design

API Service CI (robotics-api)

yaml
# .github/workflows/ci.yml
name: CI Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

env:
  ACR_REGISTRY: registry.cn-hangzhou.aliyuncs.com
  ACR_NAMESPACE: robotics-serving
  IMAGE_NAME: api-gateway

jobs:
  # ===== Stage 1: Code quality =====
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - uses: actions/setup-python@v5
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: pip install -r requirements.txt -r requirements-dev.txt

    - name: Lint
      run: |
        ruff check src/
        mypy src/

    - name: Unit Tests
      run: pytest tests/unit/ -v --cov=src --cov-report=xml

    - name: Integration Tests
      run: |
        docker compose -f docker-compose.test.yml up -d
        pytest tests/integration/ -v
        docker compose -f docker-compose.test.yml down

  # ===== Stage 2: Build and scan (main branch only) =====
  build-and-push:
    needs: lint-and-test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.tag.outputs.tag }}

    steps:
    - uses: actions/checkout@v4

    - name: Generate tag
      id: tag
      run: echo "tag=$(date +%Y%m%d)-${GITHUB_SHA::7}" >> $GITHUB_OUTPUT

    - name: Login to ACR
      # Pass the password on stdin so it does not appear in the process list
      run: |
        echo "${{ secrets.ACR_PASSWORD }}" | docker login \
          -u "${{ secrets.ACR_USERNAME }}" \
          --password-stdin \
          ${{ env.ACR_REGISTRY }}

    - name: Build image
      run: |
        docker build \
          --build-arg BUILD_DATE=$(date -u +'%Y-%m-%dT%H:%M:%SZ') \
          --build-arg GIT_SHA=${GITHUB_SHA::7} \
          -t ${{ env.ACR_REGISTRY }}/${{ env.ACR_NAMESPACE }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }} \
          .

    - name: Security scan
      # @master follows a moving branch; pin a release tag or commit SHA for reproducible runs
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: ${{ env.ACR_REGISTRY }}/${{ env.ACR_NAMESPACE }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }}
        exit-code: 1
        severity: CRITICAL,HIGH
        format: table

    - name: Push image
      run: |
        docker push ${{ env.ACR_REGISTRY }}/${{ env.ACR_NAMESPACE }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }}

  # ===== Stage 3: Update K8s manifests =====
  update-manifests:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
    - name: Checkout manifests repo
      uses: actions/checkout@v4
      with:
        repository: your-org/robotics-k8s-manifests
        token: ${{ secrets.MANIFESTS_PAT }}

    - name: Update image tag (staging)
      run: |
        cd overlays/staging
        kustomize edit set image \
          ${{ env.ACR_REGISTRY }}/${{ env.ACR_NAMESPACE }}/${{ env.IMAGE_NAME }}=${{ env.ACR_REGISTRY }}/${{ env.ACR_NAMESPACE }}/${{ env.IMAGE_NAME }}:${{ needs.build-and-push.outputs.image_tag }}

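    - name: Commit and push
      run: |
        git config user.name "CI Bot"
        git config user.email "ci-bot@example.com"  # placeholder address
        git add .
        # Skip the commit when nothing changed; a bare `git commit` would fail the job
        git diff --cached --quiet || git commit -m "deploy(staging): ${{ env.IMAGE_NAME }}:${{ needs.build-and-push.outputs.image_tag }}"
        git push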
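
The tag step and the `kustomize edit set image` argument in the workflow above follow a fixed naming scheme. As a minimal sketch of that scheme, the same strings can be computed and unit-tested outside the shell. The helper names `make_tag`, `image_ref`, and `set_image_arg` are hypothetical and not part of the workflow, and the use of the UTC date is an assumption (the workflow uses the runner's `date`).

```python
from datetime import date, datetime, timezone
from typing import Optional


def make_tag(git_sha: str, day: Optional[date] = None) -> str:
    """Return '<YYYYMMDD>-<7-char SHA>', the format of the 'Generate tag' step."""
    if len(git_sha) < 7:
        raise ValueError("expected a commit SHA of at least 7 characters")
    # Assumption: fall back to today's UTC date when no date is given.
    day = day or datetime.now(timezone.utc).date()
    return f"{day:%Y%m%d}-{git_sha[:7]}"


def image_ref(registry: str, namespace: str, name: str, tag: str) -> str:
    """Return the full image reference used for docker build and push."""
    return f"{registry}/{namespace}/{name}:{tag}"


def set_image_arg(image: str, tag: str) -> str:
    """Return the '<image>=<image>:<tag>' argument for `kustomize edit set image`."""
    return f"{image}={image}:{tag}"


if __name__ == "__main__":
    tag = make_tag("abc1234def5678", date(2026, 3, 15))
    print(image_ref("registry.cn-hangzhou.aliyuncs.com",
                    "robotics-serving", "api-gateway", tag))
```

The name on the left of `=` in the `kustomize edit set image` argument has to match the image name used in the base manifests, otherwise the overlay gains an entry that rewrites nothing.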

Training Code CI (robotics-training)

CI for training code differs from CI for services: nothing is deployed, so it only needs to build a usable training image:

yaml
# .github/workflows/ci.yml
name: Training CI

on:
  push:
    branches: [main]
    paths: ['src/**', 'configs/**', 'Dockerfile', 'requirements.txt']

jobs:
  build-training-image:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Generate tag
      id: tag
      run: echo "tag=$(date +%Y%m%d)-${GITHUB_SHA::7}" >> $GITHUB_OUTPUT

    - name: Login, build, and push
      run: |
        echo "${{ secrets.ACR_PASSWORD }}" | docker login -u "${{ secrets.ACR_USERNAME }}" --password-stdin registry.cn-hangzhou.aliyuncs.com
        docker build -t registry.cn-hangzhou.aliyuncs.com/robotics-training/vision-train:${{ steps.tag.outputs.tag }} .
        docker push registry.cn-hangzhou.aliyuncs.com/robotics-training/vision-train:${{ steps.tag.outputs.tag }}

    - name: Notify
      run: |
        # Notify Feishu/DingTalk: a new training image is ready
        echo "New training image: vision-train:${{ steps.tag.outputs.tag }}"
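
The Notify step above only echoes a message. One possible way to send it to a Feishu custom bot is sketched below, assuming the bot accepts a JSON body with `msg_type` and `content.text`. The environment variable names `FEISHU_WEBHOOK_URL` and `IMAGE_TAG` and both function names are hypothetical. DingTalk bots expect a different body, so this sketch does not cover them.

```python
import json
import os
import urllib.request


def build_feishu_text(image: str, tag: str) -> dict:
    """Build a plain-text message body for a Feishu custom-bot webhook."""
    return {
        "msg_type": "text",
        "content": {"text": f"New training image: {image}:{tag}"},
    }


def send(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical variable names; the workflow would have to export them.
    url = os.environ.get("FEISHU_WEBHOOK_URL")
    payload = build_feishu_text("vision-train", os.environ.get("IMAGE_TAG", "unknown"))
    if url:
        print(send(url, payload))
    else:
        # No webhook configured: print the body instead of sending it.
        print(json.dumps(payload))
```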

4. CD Pipeline (ArgoCD)

Installing ArgoCD

bash
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Get the initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Port-forward to reach the UI
kubectl port-forward svc/argocd-server -n argocd 8080:443

ArgoCD Application Definitions

yaml
# argocd-apps/serving.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: serving-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/robotics-k8s-manifests.git
    targetRevision: main
    path: overlays/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:             # Sync automatically
      prune: true          # Delete resources that are no longer in Git
      selfHeal: true       # Revert drift automatically
    syncOptions:
    - CreateNamespace=true
---
# Production: requires manual approval
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: serving-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/robotics-k8s-manifests.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated: null        # No auto-sync; requires a manual Sync click or PR approval
    syncOptions:
    - CreateNamespace=true

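Release Flow

Code merged to main
    │
    ▼
CI: Lint → Test → Build → Scan → Push to ACR
    │
    ▼
CI: automatically updates the manifests repo (staging overlay)
    │
    ▼
ArgoCD: auto-syncs to the Staging namespace
    │
    ▼
Staging verification (automated tests / manual checks)
    │ passed
    ▼
PR: change the image tag in the production overlay
    │ approved + merged
    ▼
ArgoCD: manual Sync (or auto-sync) to Production
    │
    ▼
Production canary release (Ingress canary)
    │ observe metrics
    ▼
Full rollout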
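
The promotion PR in this flow amounts to copying the verified staging tag into the production overlay. A minimal sketch of that edit follows, assuming each `kustomization.yaml` holds a single `newTag:` entry. The function names are hypothetical, and the regular expression is a stand-in for a real YAML parser or for `kustomize edit set image`.

```python
import re
from pathlib import Path

# Matches one `newTag: "<tag>"` line and leaves any trailing comment untouched.
_NEW_TAG = re.compile(
    r'^(?P<head>[ \t]*newTag:[ \t]*)"?(?P<tag>[^"\s#]+)"?', re.MULTILINE
)


def read_new_tag(kustomization_text: str) -> str:
    """Return the first newTag value found in a kustomization file."""
    match = _NEW_TAG.search(kustomization_text)
    if match is None:
        raise ValueError("no newTag entry found")
    return match.group("tag")


def set_new_tag(kustomization_text: str, tag: str) -> str:
    """Return the text with the first newTag value replaced by `tag`."""
    if _NEW_TAG.search(kustomization_text) is None:
        raise ValueError("no newTag entry found")
    return _NEW_TAG.sub(
        lambda m: f'{m.group("head")}"{tag}"', kustomization_text, count=1
    )


def promote(staging_file: Path, production_file: Path) -> str:
    """Copy the staging tag into the production overlay and return it."""
    tag = read_new_tag(staging_file.read_text(encoding="utf-8"))
    updated = set_new_tag(production_file.read_text(encoding="utf-8"), tag)
    production_file.write_text(updated, encoding="utf-8")
    return tag
```

Running `promote` on a branch and opening a PR from it keeps the human approval step in place; the script only prepares the change.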

5. Environment Management (Kustomize)

Directory Structure

serving/api-gateway/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   └── kustomization.yaml
├── overlays/
│   ├── staging/
│   │   ├── kustomization.yaml
│   │   └── patch-replicas.yaml
│   └── production/
│       ├── kustomization.yaml
│       ├── patch-replicas.yaml
│       └── patch-resources.yaml

base/kustomization.yaml

yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
- hpa.yaml

overlays/staging/kustomization.yaml

yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
namespace: staging
patches:
- path: patch-replicas.yaml
images:
- name: api-gateway
  newName: registry-vpc.cn-hangzhou.aliyuncs.com/robotics-serving/api-gateway
  newTag: "20260315-abc1234"

overlays/staging/patch-replicas.yaml

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 1     # Staging needs only 1 replica

overlays/production/kustomization.yaml

yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
namespace: production
patches:
- path: patch-replicas.yaml
- path: patch-resources.yaml
images:
- name: api-gateway
  newName: registry-vpc.cn-hangzhou.aliyuncs.com/robotics-serving/api-gateway
  newTag: "v3.1.2"      # Production uses a stable version

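6. Pipeline Matrix

| Repository             | CI trigger          | Build artifact     | CD method                 | Deploy target  |
|------------------------|---------------------|--------------------|---------------------------|----------------|
| robotics-api           | PR + push to main   | Docker image → ACR | ArgoCD + Kustomize        | staging → prod |
| robotics-llm-server    | PR + push to main   | Docker image → ACR | ArgoCD + Kustomize        | staging → prod |
| robotics-admin-web     | PR + push to main   | Docker image → ACR | ArgoCD + Kustomize        | staging → prod |
| robotics-training      | Push to main        | Docker image → ACR | Manual / Arena submission | training ns    |
| robotics-data-pipeline | Push to main        | Docker image → ACR | ArgoCD                    | data ns        |
| robotics-k8s-manifests | Automated CI commit | K8s YAML           | ArgoCD auto-sync          | all ns         |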

7. Monitoring Release Quality

Post-release Automated Checks

yaml
# ArgoCD PostSync hook: run smoke tests after a release
apiVersion: batch/v1
kind: Job
metadata:
  name: smoke-test
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
      - name: smoke-test
        image: curlimages/curl
        command:
        - sh
        - -c
        - |
          sleep 10
          curl -sf http://api-gateway:8000/healthz || exit 1
          curl -sf http://api-gateway:8000/v1/status || exit 1
          echo "Smoke test passed!"
      restartPolicy: Never
  backoffLimit: 3
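
The Job above waits a fixed 10 seconds before probing. An alternative is to poll until the endpoint answers or the attempts run out, sketched here with the standard library only. The function names, attempt count, and delay are assumptions. The URLs are the ones used in the Job.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True when the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and HTTP 4xx/5xx responses.
        return False


def wait_until_ok(
    url: str,
    attempts: int = 10,
    delay: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe the URL up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


def smoke_test(urls, **kwargs) -> bool:
    """Return True only if every URL becomes healthy."""
    return all(wait_until_ok(url, **kwargs) for url in urls)


# Usage inside the hook container (not executed here):
#   ok = smoke_test(["http://api-gateway:8000/healthz",
#                    "http://api-gateway:8000/v1/status"])
#   raise SystemExit(0 if ok else 1)
```

`probe` and `sleep` are parameters so the loop can be tested without a network or real waiting.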

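Key Release Metrics

Configure a release dashboard in Grafana:

Watch closely after a release (first 30 minutes):
├── HTTP error rate (5xx)  ← should not rise
├── Response latency P99   ← should not degrade
├── Pod restart count      ← should be 0
├── CPU/memory usage       ← should not spike abnormally
└── Business metrics       ← should not drop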
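
The checklist above can be turned into an explicit pass/fail gate by comparing a snapshot taken before the release with one taken afterwards. The sketch below is illustrative only: the `Snapshot` fields, the function name, and every threshold are assumptions, and business metrics are left out because they are specific to each service.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    """Metrics sampled over one observation window."""
    error_rate_5xx: float    # fraction of requests answered with 5xx
    p99_latency_ms: float    # P99 response latency in milliseconds
    pod_restarts: int        # restarts counted during the window
    cpu_utilization: float   # fraction of requested CPU in use


def regressions(
    before: Snapshot,
    after: Snapshot,
    max_error_increase: float = 0.005,  # assumed tolerance, absolute
    max_latency_ratio: float = 1.2,     # assumed tolerance, relative
    max_cpu_ratio: float = 1.5,         # assumed tolerance, relative
) -> List[str]:
    """Return the list of checks that failed; an empty list means the release looks healthy."""
    problems = []
    if after.error_rate_5xx > before.error_rate_5xx + max_error_increase:
        problems.append("5xx error rate rose")
    if after.p99_latency_ms > before.p99_latency_ms * max_latency_ratio:
        problems.append("P99 latency degraded")
    if after.pod_restarts > 0:
        problems.append("pods restarted")
    if after.cpu_utilization > before.cpu_utilization * max_cpu_ratio:
        problems.append("CPU usage spiked")
    return problems
```

In practice the two snapshots would be filled from the monitoring system's query API, which this sketch leaves out.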

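Automatic Rollback

yaml
# ArgoCD automated sync with retry (retries a failed sync; it does not revert to an earlier revision)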
spec:
  syncPolicy:
    automated:
      selfHeal: true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

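If a sync fails (for example, Pods cannot start), ArgoCD retries it automatically with the backoff configured above. A retry re-applies the same revision; it does not restore an earlier one. To roll back manually: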
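
To see what the `retry` settings above mean in wall-clock terms, the following sketch computes the wait before each retry, assuming each delay is `duration × factor^n` capped at `maxDuration`. The function name is hypothetical and the formula is a reading of the field descriptions, not a copy of ArgoCD's implementation.

```python
from typing import List


def retry_delays(
    limit: int, duration_s: float, factor: float, max_duration_s: float
) -> List[float]:
    """Return the wait in seconds before each of `limit` retries."""
    return [
        min(duration_s * factor ** attempt, max_duration_s)
        for attempt in range(limit)
    ]


if __name__ == "__main__":
    # limit: 3, duration: 5s, factor: 2, maxDuration: 3m (180 s)
    delays = retry_delays(3, 5, 2, 180)
    print(delays, "total wait:", sum(delays), "s")
```

Under this assumption the configured values give waits of 5 s, 10 s, and 20 s, so a persistently failing sync is abandoned after roughly 35 seconds of backoff plus the time the sync attempts themselves take.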

bash
# Roll back to the previous version with ArgoCD (requires automated sync to be disabled on the app)
argocd app rollback serving-production

# Or revert the commit in Git; ArgoCD picks up the change on its next sync
git revert HEAD
git push

Next

10 - Development and Debugging Guide & CPFS/OSS Storage Practices