PAI DSW Instance Management
Manage the full lifecycle of Alibaba Cloud PAI DSW (Data Science Workshop) instances — from provisioning through configuration changes, status monitoring, and start/stop operations. Also supports querying available ECS compute specs.
Architecture: PAI Workspace + DSW Instance + ECS Spec + Image + VPC + Dataset
API Version: pai-dsw/2022-01-01
Installation
Pre-check: Aliyun CLI >= 3.3.1 required
Run aliyun version to verify the version is 3.3.1 or higher. If not installed or the version is too low, see references/cli-installation-guide.md for installation instructions.
[MUST] Then run aliyun configure set --auto-plugin-install true to enable automatic plugin installation.
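> # macOS (recommended)
> brew install aliyun-cli
> # Verify the version
> aliyun version
> # Enable automatic plugin installation
> aliyun configure set --auto-plugin-install true
> # Install the pai-dsw plugin
> aliyun plugin install --names pai-dsw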
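The version pre-check can be automated. The following is a minimal sketch, assuming the text printed by aliyun version contains a plain dotted version such as 3.3.1 (the exact output format is an assumption). The helper names parse_version and cli_version_ok are hypothetical and not part of the CLI.

```python
import re

MIN_CLI_VERSION = (3, 3, 1)  # minimum Aliyun CLI version required by this skill


def parse_version(text: str) -> tuple:
    """Extract the first dotted x.y.z version found in `text` as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def cli_version_ok(version_output: str) -> bool:
    """Return True when the reported version is at least MIN_CLI_VERSION."""
    return parse_version(version_output) >= MIN_CLI_VERSION
```

Comparing integer tuples avoids the string-comparison trap where "3.10.0" would sort before "3.3.1".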
Authentication
Pre-check: Alibaba Cloud Credentials Required
Security Rules:
- NEVER read, echo, or print AK/SK values (e.g., echo $ALIBABA_CLOUD_ACCESS_KEY_ID is FORBIDDEN)
- NEVER ask the user to input AK/SK directly in the conversation or command line
- NEVER use aliyun configure set with literal credential values
- ONLY use aliyun configure list to check credential status
> aliyun configure list
>
Check the output for a valid profile (AK, STS, or OAuth identity).
If no valid profile exists, STOP here.
1. Obtain credentials from the Alibaba Cloud Console
2. Configure credentials outside of this session (via aliyun configure in a terminal or environment variables in a shell profile)
3. Return and retry after aliyun configure list shows a valid profile
RAM Permissions
See references/ram-policies.md for the complete permission list and minimum-privilege policy.
[MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:
1. Read references/ram-policies.md to get the full list of permissions required by this skill
2. Use the ram-permission-diagnose skill to guide the user through requesting the necessary permissions
3. Pause and wait until the user confirms that the required permissions have been granted
Parameter Confirmation
IMPORTANT: Parameter Confirmation — Before executing any command or API call, ALL user-customizable parameters (e.g., RegionId, instance names, CIDR blocks, passwords, domain names, resource specifications, etc.) MUST be confirmed with the user. Do NOT assume or use default values without explicit user approval.
| Parameter | Required | Description | Default |
|---|---|---|---|
| WorkspaceId | Required | PAI workspace ID | None (user must provide) |
| InstanceName | Required | Instance name (letters, digits, underscores only; max 27 chars) | None (user must provide) |
| EcsSpec | Required (post-paid) | ECS compute spec, e.g., ecs.c6.large. Query via list-ecs-specs | None |
| ImageId | Mutually exclusive with ImageUrl | Image ID from PAI console | None |
| ImageUrl | Mutually exclusive with ImageId | Container image URL. See references/common-images.md for common official images | None |
| RegionId | Required | Region, e.g., cn-hangzhou, cn-shanghai | None (user must confirm) |
| Accessibility | Optional | Visibility scope: PUBLIC (all workspace users) or PRIVATE | PRIVATE |
| InstanceId | Required (update/get/start/stop) | Instance ID (dsw-xxxxx format) | None |
| VpcId | Optional | VPC ID for private network access | None |
| VSwitchId | Optional | VSwitch ID within the VPC | None |
| SecurityGroupId | Optional | Security group ID | None |
| AcceleratorType | Required (spec query) | Accelerator type: CPU or GPU | None (user must confirm) |
| Datasets | Optional | Dataset mounts in CLI list format: DatasetId=<> MountPath=<> MountAccess=RO\|RW | None (user must confirm, no default) |
| --read-timeout | Optional | CLI read timeout in seconds (for long-running operations) | 10 |
| --connect-timeout | Optional | CLI connection timeout in seconds | 10 |
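The InstanceName rule in the table (letters, digits, underscores only; max 27 chars) can be checked before any API call. This is a minimal sketch: the function name is hypothetical, and treating "letters" as ASCII A-Z/a-z is an assumption that the table does not confirm.

```python
import re

# Rule from the parameter table: letters, digits, underscores only; max 27 chars.
# Restricting "letters" to ASCII is an assumption of this sketch.
_INSTANCE_NAME_RE = re.compile(r"[A-Za-z0-9_]{1,27}")


def is_valid_instance_name(name: str) -> bool:
    """Return True when `name` satisfies the documented InstanceName rule."""
    return _INSTANCE_NAME_RE.fullmatch(name) is not None
```

A failed check is a reason to ask the user for a different name, not to silently rewrite the one they gave.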
How to get WorkspaceId: If the user doesn't know their workspace ID, run:
> aliyun aiworkspace list-workspaces --region <region> --user-agent AlibabaCloud-Agent-Skills
>
This returns all workspaces the user has access to. Select the appropriate one based on WorkspaceName or ask the user to confirm.
Reference: Create and Manage Workspaces
Core Workflow
Full command syntax and parameter details: references/related-commands.md.
Every aliyun CLI command must include --user-agent AlibabaCloud-Agent-Skills.
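One way to make the --user-agent rule hard to forget is to route every argument list through a small helper. The sketch below is illustrative only; with_user_agent is a hypothetical name, and it assumes commands are built as argument lists rather than shell strings.

```python
USER_AGENT_FLAG = "--user-agent"
USER_AGENT_VALUE = "AlibabaCloud-Agent-Skills"


def with_user_agent(args: list) -> list:
    """Return a copy of `args`, appending the --user-agent flag when it is absent."""
    if USER_AGENT_FLAG in args:
        return list(args)
    return list(args) + [USER_AGENT_FLAG, USER_AGENT_VALUE]
```

The returned list can be handed to subprocess.run unchanged, and calling the helper twice does not duplicate the flag.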
1. Query Available ECS Specs
Run aliyun pai-dsw list-ecs-specs --accelerator-type <CPU|GPU> --region <region> to list available compute specs.
[MUST] Region confirmation: The --region parameter is required. Spec availability varies by region — always confirm the region with the user before querying.
[MUST] Determine accelerator type correctly:
- User mentions a spec name (e.g., ecs.hfc6.10xlarge): Query BOTH CPU and GPU types, then match InstanceType in results. Use the returned AcceleratorType field to confirm the classification.
- User specifies image type: GPU image URL (contains -gpu- or cu) → query GPU specs; CPU image URL → query CPU specs.
- User describes use case only: GPU for large-model training/deep learning, CPU for data analysis/lightweight tasks. Always confirm with user if ambiguous.
- [IMPORTANT] Do NOT guess from spec name prefix; the naming convention is unreliable. Always verify via API response.
[MUST] Choose accelerator type based on user requirements:
- Default recommendation: GPU for large-model training/deep learning, CPU for data analysis/lightweight tasks
- Match image type (strong indicator): If user specifies a GPU image URL (contains -gpu- or cu), query GPU specs. If CPU image, query CPU specs.
- Spec name requires verification: If user mentions a spec name, query both types and find the match in results
- Always confirm with user before querying if the use case is ambiguous and no spec name is provided
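The "query both types, then match InstanceType" rule can be sketched as follows. The function name classify_spec is hypothetical, and the inputs are assumed to be the parsed spec lists from the two list-ecs-specs queries, each entry carrying the InstanceType and AcceleratorType fields.

```python
def classify_spec(spec_name, cpu_specs, gpu_specs):
    """Return the AcceleratorType reported by the API for `spec_name`, or None if absent.

    `cpu_specs` and `gpu_specs` are the spec lists returned by the CPU and GPU queries.
    """
    for spec in list(cpu_specs) + list(gpu_specs):
        if spec.get("InstanceType") == spec_name:
            return spec.get("AcceleratorType")
    return None
```

A None result means the spec was not returned for the region in either query, which is worth reporting to the user rather than guessing from the name.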
Key response fields:
- InstanceType: Spec name (e.g., ecs.hfc6.10xlarge)
- AcceleratorType: CPU or GPU, the actual classification from the API
- IsAvailable: PRIMARY indicator. true means the spec is available for pay-as-you-go/subscription
- SpotStockStatus: SECONDARY indicator, only for spot instances: WithStock (available) or NoStock (unavailable)
- CPU / Memory / GPU / GPUType: Hardware details
- Price: Hourly price in CNY
[MUST] Availability check logic:
- For pay-as-you-go/subscription: Check IsAvailable == true
- For spot instances: Check IsAvailable == true AND SpotStockStatus == WithStock
- DO NOT use SpotStockStatus alone to judge availability; many specs have IsAvailable: true but SpotStockStatus: NoStock
- Example: ecs.hfc6.10xlarge with IsAvailable: true, SpotStockStatus: "NoStock" → Available for pay-as-you-go
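The availability rules above can be written as a single predicate. This is a minimal sketch, assuming each spec is a parsed JSON object with a boolean IsAvailable and a string SpotStockStatus; the function name is hypothetical.

```python
def is_spec_available(spec: dict, spot: bool = False) -> bool:
    """Apply the availability rules: IsAvailable first, SpotStockStatus only for spot."""
    if spec.get("IsAvailable") is not True:
        return False
    if spot:
        return spec.get("SpotStockStatus") == "WithStock"
    return True
```

With the example spec above (IsAvailable true, SpotStockStatus NoStock), the predicate is true for pay-as-you-go and false for spot.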
2. Create Instance (check-then-act)
[MUST] Idempotency guarantee: The CreateInstance API does not support ClientToken, so idempotency is ensured via a check-then-act pattern. Before creating, you must call list-instances --instance-name <name> to check if the name already exists.
Step 2.1 — Check existence
> aliyun pai-dsw list-instances --instance-name <name> --user-agent AlibabaCloud-Agent-Skills
Decision logic:
- TotalCount == 0 → Name is available, proceed to Step 2.2 to create
- TotalCount >= 1 → [MUST] Verify exact name match:
  1. Iterate through the returned Instances array
  2. For each instance, compare its InstanceName field with the target name character by character (case-sensitive, exact string match)
  3. Exact match found (instance.InstanceName === targetName) → Name already exists:
     - Extract the InstanceId from the matching instance
     - Call get-instance --instance-id <id> to get full details
     - Compare key parameters (EcsSpec, ImageUrl, Accessibility, etc.)
     - Match → Return the existing InstanceId, do not recreate
     - Mismatch → Ask user to choose a different name
  4. No exact match found (no instance has InstanceName === targetName) → Name is available, proceed to Step 2.2 to create
[WARNING] Critical: Exact name match required
The --instance-name filter may return partial matches. For example:
- Query: --instance-name llm_train_001
- Response may include: llm_train_001, llm_train_001_v2, and other names that contain the query string
You MUST verify exact match by checking:
> name_exists = False
> for instance in response["Instances"]:
>     if instance["InstanceName"] == target_name:  # EXACT string equality
>         name_exists = True  # Name already exists: DO NOT create
>         break
Do NOT assume the name is available when TotalCount > 0 just because you "think" there is no exact match. If TotalCount >= 1, carefully check each instance's InstanceName field.
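The whole Step 2.1 decision can be sketched as two small functions. The names find_exact_match and decide, and the action labels create, reuse, and rename, are hypothetical. The inputs are assumed to be the parsed list-instances and get-instance responses.

```python
COMPARED_KEYS = ("EcsSpec", "ImageUrl", "Accessibility")  # key parameters named in Step 2.1


def find_exact_match(list_response: dict, target_name: str):
    """Return the instance whose InstanceName equals target_name exactly, else None."""
    for instance in list_response.get("Instances", []):
        if instance.get("InstanceName") == target_name:  # case-sensitive equality
            return instance
    return None


def decide(existing, wanted: dict) -> str:
    """Map the Step 2.1 outcome to an action: 'create', 'reuse', or 'rename'.

    `existing` is the get-instance result for the exact match, or None when no match exists.
    """
    if existing is None:
        return "create"
    same = all(existing.get(key) == wanted[key] for key in COMPARED_KEYS if key in wanted)
    return "reuse" if same else "rename"
```

Here reuse corresponds to returning the existing InstanceId, and rename corresponds to asking the user for a different name.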
Step 2.2 — Provision
Run aliyun pai-dsw create-instance with required args: --workspace-id, --instance-name, --ecs-spec, --region, and either --image-url or --image-id.
[MUST] Region confirmation: The --region parameter is required and must be confirmed with the user. Do NOT use CLI default region without explicit user approval. Spec availability and pricing vary by region.
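A minimal sketch of assembling the create-instance arguments, using only the flags named in this step. The function name build_create_args is hypothetical, and every value is assumed to have been confirmed with the user beforehand.

```python
def build_create_args(workspace_id, instance_name, ecs_spec, region,
                      image_url=None, image_id=None):
    """Assemble argv for create-instance; exactly one of image_url/image_id is required."""
    if (image_url is None) == (image_id is None):
        raise ValueError("provide exactly one of image_url or image_id")
    args = [
        "aliyun", "pai-dsw", "create-instance",
        "--workspace-id", workspace_id,
        "--instance-name", instance_name,
        "--ecs-spec", ecs_spec,
        "--region", region,
    ]
    # ImageUrl and ImageId are mutually exclusive, so only one flag is emitted.
    args += ["--image-url", image_url] if image_url is not None else ["--image-id", image_id]
    return args + ["--user-agent", "AlibabaCloud-Agent-Skills"]
```

Building a list instead of a shell string keeps values with spaces or special characters from being reinterpreted by the shell.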
[MUST] Match EcsSpec with image type:
- GPU image URL (contains -gpu- or cu) → Must select a GPU spec (e.g., ecs.gn6v-c4g1.xlarge)
- CPU image URL (contains -cpu-) → Must select a CPU spec (e.g., ecs.c6.large)
- The spec type MUST match the image type, otherwise the instance will fail to start
- Use case (large-model training/data analysis) is only a recommendation; image type is the definitive indicator
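The image-to-spec matching rule could be checked roughly as below. This is a sketch under stated assumptions. The rule above only says the URL "contains cu", so reading that as "cu followed by a digit" (as in a CUDA tag like cu121) is an assumption made here to reduce false positives. Both function names are hypothetical.

```python
import re


def image_accelerator_hint(image_url: str):
    """Guess 'GPU' or 'CPU' from an image URL, or None when neither marker is present."""
    # Assumption: the "cu" marker is a CUDA tag, i.e. "cu" followed by a digit.
    if "-gpu-" in image_url or re.search(r"cu\d", image_url):
        return "GPU"
    if "-cpu-" in image_url:
        return "CPU"
    return None


def spec_conflicts_with_image(accelerator_type: str, image_url: str) -> bool:
    """Return True only when the URL gives a hint and it differs from the spec's type."""
    hint = image_accelerator_hint(image_url)
    return hint is not None and hint != accelerator_type
```

When the hint is None the URL says nothing either way, so the type should be confirmed with the user.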
Dataset mounting (optional): If the user specifies a dataset to mount, use the --datasets parameter in CLI list format:
> --datasets DatasetId=<dataset-id> MountPath=<mount-path> MountAccess=RO
>
[MUST] Dataset parameters require explicit user confirmation — do NOT assume or auto-generate dataset configurations.
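As an illustration of the CLI list format, the helper below turns one user-confirmed dataset mount into argument tokens. The name dataset_args is hypothetical. It deliberately has no default for any field, in line with the rule that dataset settings must come from the user.

```python
def dataset_args(dataset_id: str, mount_path: str, mount_access: str) -> list:
    """Build the --datasets argument tokens for one dataset mount."""
    if mount_access not in ("RO", "RW"):
        raise ValueError("MountAccess must be RO or RW")
    if not dataset_id or not mount_path:
        raise ValueError("DatasetId and MountPath must be confirmed by the user")
    return [
        "--datasets",
        f"DatasetId={dataset_id}",
        f"MountPath={mount_path}",
        f"MountAccess={mount_access}",
    ]
```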
Official images: references/common-images.md.
Advanced usage (VPC, datasets): references/related-commands.md.
Response: includes the InstanceId of the new instance.
[IMPORTANT] Return immediately after creation: After create-instance returns InstanceId, do NOT block waiting for Running status. Instead:
1. Return the InstanceId and current status (Creating) to the user immediately
2. Provide the user with a command to check status later:
> aliyun pai-dsw get-instance --instance-id <instance-id> --user-agent AlibabaCloud-Agent-Skills
3. Inform the user that instance startup typically takes 2–5 minutes
Why this matters: Blocking polling prevents the agent from responding to other user requests. DSW instance creation is a long-running operation; the agent should return control to the user promptly.
3. List Instances
Run aliyun pai-dsw list-instances. Filter by --workspace-id or --status; paginate with --page-number / --page-size.
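Pagination can be sketched as a loop that stops once TotalCount items have been collected. The function and the fetch_page callable are hypothetical. fetch_page(page_number, page_size) is assumed to run list-instances with --page-number and --page-size and return the parsed response, and page numbering is assumed to start at 1.

```python
def list_all_instances(fetch_page, page_size: int) -> list:
    """Collect every instance by walking pages until TotalCount items are seen."""
    instances = []
    page_number = 1  # assumption: the first page is numbered 1
    while True:
        response = fetch_page(page_number, page_size)
        batch = response.get("Instances", [])
        instances.extend(batch)
        # Stop on an empty page as well, so a wrong TotalCount cannot loop forever.
        if not batch or len(instances) >= response.get("TotalCount", 0):
            return instances
        page_number += 1
```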
4. Get Instance Details
Run aliyun pai-dsw get-instance --instance-id <id> to check instance status and details.
When to poll: Only poll when the user explicitly asks to wait for a status change (e.g., "wait until it's running"). Otherwise, return the current status immediately.
Timeout limits: Maximum 60 polls (30 minutes total). If exceeded, stop and prompt user to check manually.
Polling interval: 10–30 seconds between calls.
CLI timeout: For long-running operations, increase read timeout:
> aliyun pai-dsw get-instance --instance-id <id> --read-timeout 30 --user-agent AlibabaCloud-Agent-Skills
>
Once Status == "Running", access the instance via InstanceUrl.
For complete status transitions, see Instance Status Values in references/related-commands.md.
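The polling limits in this step (at most 60 polls, 10–30 seconds apart) can be captured in a bounded loop. This is a minimal sketch: wait_for_status is a hypothetical name, and get_status is assumed to be a callable that runs get-instance and returns the Status string.

```python
import time


def wait_for_status(get_status, target="Running", max_polls=60, interval_s=30,
                    sleep=time.sleep):
    """Poll get_status() until it returns `target` or max_polls is exhausted.

    Returns the last observed status; when it differs from `target`, the caller
    should stop and prompt the user to check manually.
    """
    status = None
    for attempt in range(max_polls):
        status = get_status()
        if status == target:
            break
        if attempt < max_polls - 1:  # no need to sleep after the final poll
            sleep(interval_s)
    return status
```

Passing sleep as a parameter keeps the loop testable without real waiting. This helper should only be called when the user has explicitly asked to wait.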
5. Stop Instance
Run aliyun pai-dsw stop-instance --instance-id <id>.
Status transition: Running → Stopping → Stopped
Save environment image: To save the environment as a custom image before stopping, use the PAI Console. See Create a DSW Instance Image for instructions.
6. Update Instance
Run aliyun pai-dsw update-instance --instance-id <id> to modify --instance-name, --ecs-spec, --image-id, --accessibility, --datasets, etc.
[MUST] Before updating:
1. Call get-instance to check current status and configuration
2. Check if update is needed:
   - For --ecs-spec: Compare current EcsSpec with target spec. If already equal, skip update and inform user
   - For --image-id/--image-url: Compare current ImageId/ImageUrl with target
   - For --instance-name: Compare current InstanceName with target
3. If already at target configuration, return current instance info; do not call update-instance
4. If update is needed, use --start-instance true to auto-start after update
[IMPORTANT] Always update the specified instance by its InstanceId. Do NOT substitute with another instance that already has the target spec — the user's request is to upgrade the specific instance, not to find an alternative.
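The "check if update is needed" comparison amounts to a field-by-field diff between the get-instance result and the requested values. A minimal sketch follows, with the hypothetical name plan_update. Both arguments are assumed to be dictionaries keyed by response field names such as EcsSpec or InstanceName.

```python
def plan_update(current: dict, target: dict) -> dict:
    """Return only the fields in `target` whose values differ from `current`."""
    return {key: value for key, value in target.items() if current.get(key) != value}
```

An empty result means the instance is already at the target configuration, so update-instance should not be called.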
7. Start Instance
Run aliyun pai-dsw start-instance --instance-id <id>, then poll (Step 4) until Running.
Prerequisite: Instance must be in Stopped or Failed state. Call get-instance to confirm before starting.
Success Verification
Full verification steps: references/verification-method.md.
Quick check: get-instance should return Status == "Running" with a non-empty InstanceUrl.
Cleanup
This skill does not expose instance deletion (irreversible operation — use the console).
To stop incurring charges, stop the instance via Step 5 (stop-instance).
Best Practices
1. Always run check-then-act before creation: use list-instances --instance-name <name> to avoid duplicate-instance errors.
2. Prefer PRIVATE visibility: prevents accidental operations by other workspace users.
3. Check instance status before update: call get-instance first; some parameters require Stopped state, others can be updated while Running.
4. Use --resource-id ALL with list-instances: the default only returns post-paid instances.
5. Observe polling timeout limits: see Step 4 for timeout and interval guidance.
6. Verify spec availability before provisioning: run list-ecs-specs to confirm the spec is available in the target region.
7. Tag instances with Labels: simplifies batch queries and lifecycle management.
References
| Reference | Location |
|---|---|
| RAM Policies | references/ram-policies.md |
| CLI Commands | references/related-commands.md |
| Verification | references/verification-method.md |
| Acceptance Criteria | references/acceptance-criteria.md |
| Common Images | references/common-images.md |
| PAI DSW API Overview | help.aliyun.com |