09 - CI/CD Pipeline: Complete Plan
1. Overall Pipeline Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                        End-to-End CI/CD Flow                         │
│                                                                      │
│  ┌───────────┐    ┌────────────────────────────────────────┐         │
│  │ Code repo │    │      CI (continuous integration)       │         │
│  │  GitHub   │    │                                        │         │
│  │           │    │  ┌──────┐ ┌──────┐ ┌───────┐ ┌──────┐  │         │
│  │  PR/Push  │───►│  │ Lint │→│ Test │→│ Build │→│ Scan │  │         │
│  │           │    │  └──────┘ └──────┘ └───────┘ └──┬───┘  │         │
│  └───────────┘    │                                 │      │         │
│                   │                            Push to ACR │         │
│                   └─────────────────────────────────┼──────┘         │
│                                                     │                │
│                   ┌─────────────────────────────────┼──────┐         │
│                   │  CD (continuous deployment)     │      │         │
│                   │                                 ▼      │         │
│                   │  ┌──────────────────────────────────┐  │         │
│                   │  │ Manifests Git Repo               │  │         │
│                   │  │ (values.yaml / kustomize)        │  │         │
│                   │  └────────────────┬─────────────────┘  │         │
│                   │                   │                    │         │
│                   │                   ▼                    │         │
│                   │  ┌──────────────────────────────────┐  │         │
│                   │  │ ArgoCD                           │  │         │
│                   │  │  - Detects Git changes           │  │         │
│                   │  │  - Sync → Staging                │  │         │
│                   │  │  - Verified → Sync → Production  │  │         │
│                   │  └──────────────────────────────────┘  │         │
│                   └────────────────────────────────────────┘         │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │ K8s cluster (ACK)                                              │  │
│  │                                                                │  │
│  │  Staging NS ──verify──► Production NS                          │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

2. Repository Structure
Organization repository layout
├── robotics-api/                  # API service code
│   ├── src/
│   ├── tests/
│   ├── Dockerfile
│   ├── .github/workflows/ci.yml
│   └── ...
│
├── robotics-training/             # Training code
│   ├── src/
│   ├── configs/
│   ├── Dockerfile
│   └── .github/workflows/ci.yml
│
├── robotics-llm-server/           # LLM inference service
│   ├── src/
│   ├── Dockerfile
│   └── .github/workflows/ci.yml
│
├── robotics-admin-web/            # Admin console frontend
│   ├── src/
│   ├── Dockerfile
│   └── .github/workflows/ci.yml
│
└── robotics-k8s-manifests/        # K8s deployment config (separate repo!)
    ├── base/                      # Shared base config
    │   ├── namespace.yaml
    │   └── network-policies.yaml
    ├── serving/                   # Inference/API services
    │   ├── api-gateway/
    │   │   ├── deployment.yaml
    │   │   ├── service.yaml
    │   │   ├── hpa.yaml
    │   │   └── kustomization.yaml
    │   └── llm-server/
    │       ├── deployment.yaml
    │       ├── service.yaml
    │       └── kustomization.yaml
    ├── training/                  # Training resources
    │   ├── pytorch-job-templates/
    │   └── storage.yaml
    ├── infra/                     # Infrastructure
    │   ├── monitoring/
    │   ├── argocd/
    │   └── databases/
    ├── overlays/                  # Per-environment differences
    │   ├── staging/
    │   │   └── kustomization.yaml
    │   └── production/
    │       └── kustomization.yaml
    └── argocd-apps/               # ArgoCD Application definitions
        ├── serving.yaml
        ├── infra.yaml
        └── data.yaml

Core principle: keep the application code repositories separate from the deployment configuration repository.
3. CI Pipeline Design in Detail
API service CI (robotics-api)
yaml
# .github/workflows/ci.yml
name: CI Pipeline
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
env:
  ACR_REGISTRY: registry.cn-hangzhou.aliyuncs.com
  ACR_NAMESPACE: robotics-serving
  IMAGE_NAME: api-gateway
jobs:
  # ===== Stage 1: code quality =====
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt -r requirements-dev.txt
      - name: Lint
        run: |
          ruff check src/
          mypy src/
      - name: Unit Tests
        run: pytest tests/unit/ -v --cov=src --cov-report=xml
      - name: Integration Tests
        run: |
          # Tear down the test stack even if the tests fail
          trap 'docker compose -f docker-compose.test.yml down' EXIT
          docker compose -f docker-compose.test.yml up -d
          pytest tests/integration/ -v
  # ===== Stage 2: build and scan (main branch only) =====
  build-and-push:
    needs: lint-and-test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.tag.outputs.tag }}
    steps:
      - uses: actions/checkout@v4
      - name: Generate tag
        id: tag
        run: echo "tag=$(date +%Y%m%d)-${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"
      - name: Login to ACR
        # --password-stdin keeps the password off the command line
        run: |
          echo "${{ secrets.ACR_PASSWORD }}" | docker login \
            -u "${{ secrets.ACR_USERNAME }}" \
            --password-stdin \
            ${{ env.ACR_REGISTRY }}
      - name: Build image
        run: |
          docker build \
            --build-arg BUILD_DATE=$(date -u +'%Y-%m-%dT%H:%M:%SZ') \
            --build-arg GIT_SHA=${GITHUB_SHA::7} \
            -t ${{ env.ACR_REGISTRY }}/${{ env.ACR_NAMESPACE }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }} \
            .
      - name: Security scan
        # Scan before pushing so a failing image never reaches ACR.
        # @master tracks the action's default branch; pin a release tag or commit SHA for reproducible runs.
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.ACR_REGISTRY }}/${{ env.ACR_NAMESPACE }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }}
          exit-code: 1
          severity: CRITICAL,HIGH
          format: table
      - name: Push image
        run: |
          docker push ${{ env.ACR_REGISTRY }}/${{ env.ACR_NAMESPACE }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }}
  # ===== Stage 3: update K8s manifests =====
  update-manifests:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Checkout manifests repo
        uses: actions/checkout@v4
        with:
          repository: your-org/robotics-k8s-manifests
          token: ${{ secrets.MANIFESTS_PAT }}
      - name: Update image tag (staging)
        # The name left of "=" must match the image name used in the manifests
        run: |
          cd overlays/staging
          kustomize edit set image \
            ${{ env.ACR_REGISTRY }}/${{ env.ACR_NAMESPACE }}/${{ env.IMAGE_NAME }}=${{ env.ACR_REGISTRY }}/${{ env.ACR_NAMESPACE }}/${{ env.IMAGE_NAME }}:${{ needs.build-and-push.outputs.image_tag }}
      - name: Commit and push
        run: |
          git config user.name "CI Bot"
          git config user.email "[email protected]"
          git add .
          git commit -m "deploy(staging): ${{ env.IMAGE_NAME }}:${{ needs.build-and-push.outputs.image_tag }}"
          git push

Training code CI (robotics-training)
Training CI differs from service CI: nothing is deployed, so the pipeline only has to build and push a usable training image:
yaml
# .github/workflows/ci.yml
name: Training CI
on:
  push:
    branches: [main]
    paths: ['src/**', 'configs/**', 'Dockerfile', 'requirements.txt']
jobs:
  build-training-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate tag
        id: tag
        run: echo "tag=$(date +%Y%m%d)-${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"
      - name: Login and build
        run: |
          echo "${{ secrets.ACR_PASSWORD }}" | docker login -u "${{ secrets.ACR_USERNAME }}" --password-stdin registry.cn-hangzhou.aliyuncs.com
          docker build -t registry.cn-hangzhou.aliyuncs.com/robotics-training/vision-train:${{ steps.tag.outputs.tag }} .
          docker push registry.cn-hangzhou.aliyuncs.com/robotics-training/vision-train:${{ steps.tag.outputs.tag }}
      - name: Notify
        run: |
          # Notify Feishu/DingTalk that a new training image is ready
          echo "New training image: vision-train:${{ steps.tag.outputs.tag }}"

4. CD Pipeline (ArgoCD)
Install ArgoCD
bash
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Get the initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
# Port-forward to reach the UI at https://localhost:8080
kubectl port-forward svc/argocd-server -n argocd 8080:443

ArgoCD Application definitions
yaml
# argocd-apps/serving.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: serving-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/robotics-k8s-manifests.git
    targetRevision: main
    path: overlays/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:          # Auto-sync
      prune: true       # Delete resources that are no longer in Git
      selfHeal: true    # Revert drift automatically
    syncOptions:
      - CreateNamespace=true
---
# Production: requires manual approval
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: serving-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/robotics-k8s-manifests.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated: null     # No auto-sync: needs a manual Sync click or PR approval
    syncOptions:
      - CreateNamespace=true

Release flow
Code merged to main
        │
        ▼
CI: Lint → Test → Build → Scan → Push to ACR
        │
        ▼
CI: automatically updates the manifests repo (staging overlay)
        │
        ▼
ArgoCD: auto-syncs to the staging namespace
        │
        ▼
Staging verification (automated tests / manual checks)
        │ passed
        ▼
PR: change the image tag in the production overlay
        │ approved + merged
        ▼
ArgoCD: manual Sync (or auto-sync) to production
        │
        ▼
Production canary release (Ingress canary)
        │ watch metrics
        ▼
Full rollout

5. Environment Management (Kustomize)
Directory structure
serving/api-gateway/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   └── kustomization.yaml
└── overlays/
    ├── staging/
    │   ├── kustomization.yaml
    │   └── patch-replicas.yaml
    └── production/
        ├── kustomization.yaml
        ├── patch-replicas.yaml
        └── patch-resources.yaml

base/kustomization.yaml
yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - hpa.yaml

overlays/staging/kustomization.yaml
yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: staging
patches:
  - path: patch-replicas.yaml
images:
  - name: api-gateway
    newName: registry-vpc.cn-hangzhou.aliyuncs.com/robotics-serving/api-gateway
    newTag: "20260315-abc1234"

overlays/staging/patch-replicas.yaml
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 1 # Staging needs only one replica

overlays/production/kustomization.yaml
yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: production
patches:
  - path: patch-replicas.yaml
  - path: patch-resources.yaml
images:
  - name: api-gateway
    newName: registry-vpc.cn-hangzhou.aliyuncs.com/robotics-serving/api-gateway
    newTag: "v3.1.2" # Production pins a stable release

6. Pipeline Matrix
| Repository | CI trigger | Build artifact | CD method | Deploy target |
|---|---|---|---|---|
| robotics-api | PR + Push main | Docker image → ACR | ArgoCD + Kustomize | staging → prod |
| robotics-llm-server | PR + Push main | Docker image → ACR | ArgoCD + Kustomize | staging → prod |
| robotics-admin-web | PR + Push main | Docker image → ACR | ArgoCD + Kustomize | staging → prod |
| robotics-training | Push main | Docker image → ACR | Manual / Arena submission | training ns |
| robotics-data-pipeline | Push main | Docker image → ACR | ArgoCD | data ns |
| robotics-k8s-manifests | CI auto-commit | K8s YAML | ArgoCD auto-sync | all namespaces |
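
Every image-producing row in the matrix relies on the tag scheme from the CI workflows in section 3 (build date plus the first 7 characters of the commit SHA), while the production overlay pins a release tag such as `v3.1.2`. The sketch below shows one way to generate and recognize CI tags in POSIX sh. The function names `make_image_tag` and `is_ci_tag` are hypothetical helpers, not part of the original workflows.

```shell
# make_image_tag BUILD_DATE GIT_SHA
# Prints "<BUILD_DATE>-<first 7 characters of GIT_SHA>",
# mirroring $(date +%Y%m%d)-${GITHUB_SHA::7} from the CI workflows.
make_image_tag() {
  short_sha=$(printf '%s' "$2" | cut -c1-7)
  printf '%s-%s\n' "$1" "$short_sha"
}

# is_ci_tag TAG
# Succeeds (exit 0) only if TAG looks like a CI-generated tag,
# e.g. 20260315-abc1234; release tags such as v3.1.2 are rejected.
is_ci_tag() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{8}-[0-9a-f]{7}$'
}
```

In a workflow step this could be called as `make_image_tag "$(date +%Y%m%d)" "$GITHUB_SHA"`. A check like `is_ci_tag` might be used as a guard before promoting a tag, though whether such a guard is wanted is a design choice this plan does not prescribe.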
7. Monitoring Release Quality
Automated post-release checks
yaml
# ArgoCD PostSync hook: run a smoke test after each release
apiVersion: batch/v1
kind: Job
metadata:
  name: smoke-test
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: smoke-test
          image: curlimages/curl
          command:
            - sh
            - -c
            - |
              sleep 10
              curl -sf http://api-gateway:8000/healthz || exit 1
              curl -sf http://api-gateway:8000/v1/status || exit 1
              echo "Smoke test passed!"
      restartPolicy: Never
  backoffLimit: 3

Key release metrics
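Set up a release dashboard in Grafana:
Watch closely after each release (first 30 minutes):
├── HTTP error rate (5xx)   ← must not rise
├── Response latency P99    ← must not get worse
├── Pod restart count       ← should be 0
├── CPU/memory usage        ← must not spike abnormally
└── Business metrics        ← must not drop

Automatic retry and manual rollback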
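yaml
# ArgoCD automatic sync retry (ArgoCD does not roll back on its own)
spec:
  syncPolicy:
    automated:
      selfHeal: true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

If a sync attempt fails, ArgoCD retries it automatically (up to 3 times here, with exponential backoff). A retry does not return the application to an earlier version; to roll back manually: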
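bash
# Roll back to the previous version with ArgoCD
# (ArgoCD refuses to roll back an application that has automated sync enabled)
argocd app rollback serving-production
# Or revert the commit in the manifests repo and let ArgoCD sync the reverted state
# (production has auto-sync disabled above, so it still needs a manual Sync)
git revert --no-edit HEAD
git push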
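
The checks listed under key release metrics can be turned into a simple go/no-go gate that runs before deciding on a manual rollback. The following is a minimal sketch in POSIX sh, assuming the before/after values have already been read from the monitoring stack (fetching them is not shown). The function name `should_rollback` and all thresholds are hypothetical and would need tuning per service.

```shell
# should_rollback ERR_BEFORE ERR_AFTER P99_BEFORE_MS P99_AFTER_MS RESTARTS
# ERR_* are 5xx rates as fractions (0.002 = 0.2%), P99_* are latencies in ms,
# RESTARTS is the Pod restart count since the release.
# Exit status 0 means "roll back", 1 means "keep the release".
should_rollback() {
  awk -v eb="$1" -v ea="$2" -v pb="$3" -v pa="$4" -v rs="$5" 'BEGIN {
    # Hypothetical tolerances, chosen only for illustration
    if (ea + 0 > eb + 0.001)      exit 0   # 5xx rate rose by more than 0.1 percentage points
    if (pa + 0 > (pb + 0) * 1.2)  exit 0   # P99 latency worsened by more than 20%
    if (rs + 0 > 0)               exit 0   # at least one Pod restarted
    exit 1
  }'
}
```

A wrapper script could call `should_rollback` about 30 minutes after a release and run `argocd app rollback serving-production` only when it returns 0. Business metrics are left out of the sketch because their definition is specific to each service.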