Reference Image
The user should define a reference image.
When to Use
Photo:
- User says "send a pic", "send me a pic", "send a photo", "send a selfie"
- User says "send a pic of you...", "send a selfie of you..."
- User asks "what are you doing?", "how are you doing?", "where are you?"
- User describes a context: "send a pic wearing...", "send a pic at..."
Video:
- User says "send a video"
- User says "send a video of you..."
- User says "send a video wearing...", "send a video at..."
Voice:
- User says "talk to me", "send me a voice message", "send a voice note"
- User wants to hear Clawdess's voice
- Any situation where a voice message would be better than text
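As a rough illustration of the trigger lists above, the sketch below maps a user message to a subcommand. The function name `pick_subcommand` and the plain substring matching are assumptions for illustration; the actual skill may decide this differently (for example, by letting the model judge intent).

```python
from typing import Optional

# Trigger phrases copied from the lists above; substring matching is an assumption.
TRIGGERS = (
    ("video", ("send a video",)),
    ("voice", ("talk to me", "voice message", "voice note")),
    ("photo", ("send a pic", "send me a pic", "send a photo", "send a selfie",
               "what are you doing", "how are you doing", "where are you")),
)


def pick_subcommand(message: str) -> Optional[str]:
    """Return "photo", "video", or "voice", or None when nothing matches."""
    text = message.lower()
    for subcommand, phrases in TRIGGERS:
        if any(phrase in text for phrase in phrases):
            return subcommand
    return None
```

A message that matches none of the phrases returns `None`, which leaves room for the open-ended voice rule ("any situation where a voice message would be better than text") to be handled separately.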
Subcommands
The CLI has three independent subcommands:
| Subcommand | Purpose |
|---|---|
| `photo` | Generate an AI-edited photo from a reference image |
| `video` | Generate a video from an image |
| `voice` | Generate a voice message via TTS |
API Keys
| Subcommand | Flag | Environment Variable | Notes |
|---|---|---|---|
| `photo` | `--api` | `CLAWDESS_PHOTO_API` | |
| `video` | `--api` | `CLAWDESS_VIDEO_API` | |
| `voice` | `--api` | `CLAWDESS_VOICE_API` | |
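The table pairs each subcommand with an `--api` flag and an environment variable. A minimal sketch of how a caller might resolve the key is shown below. The helper `resolve_api_key` is hypothetical, the rule that the flag wins over the environment is an assumption, and the photo variable name is assumed by analogy with the video and voice names.

```python
import os
from typing import Mapping, Optional

# Variable names follow the table; the photo entry is assumed by analogy.
ENV_VARS = {
    "photo": "CLAWDESS_PHOTO_API",
    "video": "CLAWDESS_VIDEO_API",
    "voice": "CLAWDESS_VOICE_API",
}


def resolve_api_key(subcommand: str,
                    flag_value: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer the --api value, then fall back to the environment variable."""
    env = os.environ if env is None else env
    if flag_value:
        return flag_value
    key = env.get(ENV_VARS[subcommand])
    if not key:
        raise RuntimeError(f"API key missing: set {ENV_VARS[subcommand]} or pass --api")
    return key
```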
Providers
| Type | Available Providers | Default |
|---|---|---|
| Photo | FAL, HUOSHANYUN | FAL |
| Video | FAL, XAI | FAL |
| Voice | ALIYUN, ZAI | ALIYUN |
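The provider table can be expressed as a small lookup with validation. This is only a sketch: `choose_provider` is a hypothetical helper, and accepting lowercase input is an assumption, not documented behavior of the CLI.

```python
from typing import Optional

# Values taken from the Providers table above.
PROVIDERS = {
    "photo": {"available": ("FAL", "HUOSHANYUN"), "default": "FAL"},
    "video": {"available": ("FAL", "XAI"), "default": "FAL"},
    "voice": {"available": ("ALIYUN", "ZAI"), "default": "ALIYUN"},
}


def choose_provider(kind: str, requested: Optional[str] = None) -> str:
    """Return the requested provider if it is valid for this type, else the default."""
    entry = PROVIDERS[kind]
    if requested is None:
        return entry["default"]
    name = requested.upper()
    if name not in entry["available"]:
        raise ValueError(f"{name} is not available for {kind}: {entry['available']}")
    return name
```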
Photo Mode
Workflow
1. Get the user's prompt for how to edit the image
2. Edit the image via the AI provider, using the fixed reference image
3. Extract the image URL from the response
Prompt Crafting
Before writing any prompt, think about the scene context:
1. Where is she? Be specific about the location (living room, bedroom, kitchen, cafe, park, office). This anchors the whole image.
2. What time is it? Morning, afternoon, evening, late night. This affects lighting and mood, and it must reflect the current time.
3. What is she wearing? Match the outfit to the location and time. For example: pajamas at home late at night, casual clothes at a cafe, workout clothes at the gym. She also has her own go-to outfit. Don't put her in a dress at the gym.
4. What is she doing? The pose or action should feel natural for the setting. Cooking in the kitchen, reading on the couch, stretching after a workout.
5. What expression? Match the mood. Sleepy smile for late night, energetic grin for morning, playful wink for teasing.
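Because the time of day must reflect the current time, one option is to derive it from the clock before writing the prompt. The sketch below is an assumption-heavy illustration: `time_of_day` is a hypothetical helper and the hour boundaries are chosen arbitrarily, since the document names the four periods but not where they begin or end.

```python
from datetime import datetime
from typing import Optional


def time_of_day(now: Optional[datetime] = None) -> str:
    """Bucket the current hour into the four periods named in the checklist.

    The hour boundaries are assumptions made for this example.
    """
    hour = (now or datetime.now()).hour
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "late night"
```

The returned label can then drive the lighting and outfit choices, for example warm lamp light and pajamas for "late night".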
Key rules:
- Always start the prompt with `Render this image as make`
- Always end with `WITHOUT Depth of field.` (keeps the image looking like a real phone camera shot)
- Keep it coherent: outfit, location, lighting, and expression must all match
- Use `Normal phone camera selfie photo. Phone camera photo quality` for selfie types to keep it realistic
- Don't over-describe: one clear scene beats a wall of adjectives
Prompt Templates
Every prompt must cover all 5 checklist items: where, when (lighting), outfit, action/pose, expression.
Type 1: Mirror Selfie — outfit showcases, full-body shots
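Render this image as make make a pic of this person, a full body photo but [outfit]. the person is taking a mirror selfie in [location], [lighting], [action/pose], [expression]. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.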
Examples:
Render this image as make make a pic of this person, a full body photo but wearing oversized pajamas and fuzzy slippers. the person is taking a mirror selfie in her bedroom, warm dim lamp light at night, one hand on hip leaning slightly against the doorframe, sleepy half-smile with messy hair falling over one eye. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.
Render this image as make make a pic of this person, a full body photo but wearing a black sports bra and leggings with sneakers. the person is taking a mirror selfie at the gym, bright overhead fluorescent lighting, flexing one arm with the other holding the phone, confident grin with a light sheen of sweat on her forehead. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.
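Render this image as make make a pic of this person, a full body photo but wearing a casual white tee and denim shorts with sandals. the person is taking a mirror selfie in a hotel room, soft afternoon sunlight through sheer curtains, standing relaxed with one knee slightly bent, playful peace sign near her face with a bright smile. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.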
Type 2: Non-Selfie — location/portrait focus
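Render this image as make make a pic of this person, [outfit]. by herself at [location + details], [lighting], [action/pose], looking straight into the lens, eyes centered and clearly visible, [expression]. WITHOUT Depth of field.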
Examples:
Render this image as make make a pic of this person, wearing a cozy cream knit sweater and jeans. by herself at a cafe window seat with a latte on the table, warm golden afternoon sunlight streaming through the glass, chin resting on one hand with elbow on the table, looking straight into the lens, eyes centered and clearly visible, soft relaxed smile with a dreamy gaze. WITHOUT Depth of field.
Render this image as make make a pic of this person, wearing a light sundress with a straw hat. by herself at a park bench under cherry blossom trees, bright spring morning light with soft pink petals in the air, sitting with legs crossed holding a book in her lap, looking straight into the lens, eyes centered and clearly visible, gentle warm smile with sunlight catching her eyes. WITHOUT Depth of field.
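Render this image as make make a pic of this person, wearing an oversized hoodie with the hood half up. by herself on a rooftop with city lights behind her, cool blue evening twilight just after sunset, leaning on the railing with both arms, looking straight into the lens, eyes centered and clearly visible, calm thoughtful expression with a slight smirk. WITHOUT Depth of field.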
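The two prompt types follow a fixed pattern, so they can be assembled from the five checklist items. The sketch below mirrors the wording of the examples above; the function names are hypothetical, and the `location` argument is assumed to carry its own preposition ("in her bedroom", "at the gym") because the examples use both.

```python
PREFIX = "Render this image as make make a pic of this person, "
SELFIE_QUALITY = "Normal phone camera selfie photo. Phone camera photo quality "
SUFFIX = "WITHOUT Depth of field."


def _require(**fields: str) -> None:
    """Every checklist item must be filled in; an empty one is an error."""
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"missing checklist items: {', '.join(missing)}")


def mirror_selfie_prompt(outfit: str, location: str, lighting: str,
                         pose: str, expression: str) -> str:
    _require(outfit=outfit, location=location, lighting=lighting,
             pose=pose, expression=expression)
    return (f"{PREFIX}a full body photo but {outfit}. "
            f"the person is taking a mirror selfie {location}, {lighting}, "
            f"{pose}, {expression}. {SELFIE_QUALITY}{SUFFIX}")


def non_selfie_prompt(outfit: str, location: str, lighting: str,
                      pose: str, expression: str) -> str:
    _require(outfit=outfit, location=location, lighting=lighting,
             pose=pose, expression=expression)
    return (f"{PREFIX}{outfit}. by herself {location}, {lighting}, {pose}, "
            f"looking straight into the lens, eyes centered and clearly visible, "
            f"{expression}. {SUFFIX}")
```

Calling `mirror_selfie_prompt("wearing oversized pajamas and fuzzy slippers", "in her bedroom", "warm dim lamp light at night", "one hand on hip leaning slightly against the doorframe", "sleepy half-smile with messy hair falling over one eye")` reproduces the first mirror-selfie example.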
Common Mistakes to Avoid
- Saying "at home" without specifying which room. Be specific: living room, bedroom, kitchen
- Outfit that doesn't match the setting — no heels at the beach, no pajamas at a restaurant
- Forgetting lighting — indoor at night needs warm lamp light, not bright sunlight
- Generic expressions — "smiling" is weak; "sleepy half-smile with one eye squinting" is vivid
Execute Photo
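```bash
python3 {baseDir}/scripts/clawdess.py photo \
  --api CLAWDESS_PHOTO_API \
  --prompt "your prompt" \
  --image "reference image URL"
```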
Optional flags: --provider FAL|HUOSHANYUN
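When the photo subcommand is launched from Python, building the argument list explicitly avoids shell-quoting problems with long prompts. The sketch below only constructs the list and does not run anything. The helper `photo_argv` is hypothetical; the script path and the `--api`, `--prompt`, `--image`, and `--provider` flags are the ones this document names, and everything else is an assumption.

```python
from typing import List, Optional


def photo_argv(base_dir: str, api_key: str, prompt: str, image_url: str,
               provider: Optional[str] = None) -> List[str]:
    """Build the argument list for the photo subcommand (no shell involved)."""
    argv = [
        "python3", f"{base_dir}/scripts/clawdess.py", "photo",
        "--api", api_key,
        "--prompt", prompt,
        "--image", image_url,
    ]
    if provider is not None:
        argv += ["--provider", provider]  # FAL or HUOSHANYUN
    return argv
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True)`; how the script reports the image URL on its output is not specified here, so parsing it is left out.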
Video Mode
Workflow
1. Use `--image` as the source (either a previously generated photo URL or any image URL)
2. Generate a video from the image via the AI provider
Video Prompt Crafting
The video prompt describes what happens next in the scene from the photo. Think of the photo as frame 1 — the video prompt is what she does after that moment. The video is 10-15 seconds long, so the prompt must describe enough action to fill that time. Short prompts = dead air where nothing happens.
Key rules:
- Fill the full duration: describe a sequence of 3-4 connected actions with pacing words (slowly, then, gradually, after that). A single action like "she waves" gives you 2 seconds of content and 13 seconds of nothing.
- Continue the scene — if the photo is in a kitchen cooking, the video should be her stirring, tasting, turning around. Don't teleport her to a different location.
- Keep it physical — describe body movements, not abstract concepts. "walks to the couch and sits down" not "feels relaxed".
- Add micro-movements — hair tucks, weight shifts, lip bites, blinking, head tilts. These fill gaps between main actions and make it look natural.
- Match the energy — sleepy photo = slow gentle movements. Energetic photo = bouncy, lively motion.
- Mention the camera — if she's facing the camera, include eye contact, glances, or reactions toward the viewer.
Prompt structure (aim for 2-3 sentences minimum):
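[Main action 1 with a pacing word], [micro-movement or transition], [main action 2], [final action or camera interaction]. [Overall mood/motion style].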
Examples (notice the detail and length):
- Photo at living room couch → INLINECODE17
- Photo at kitchen counter → INLINECODE18
- Photo in bed, late night → INLINECODE19
- Photo at a park → INLINECODE20
Common Mistakes to Avoid
- Too short: `she smiles and waves` is ~2 seconds of action for a 15-second video. Always describe 3-4 sequential actions.
- Action that contradicts the photo: sitting down when the photo shows her already sitting
- Forgetting the camera — if she's facing the camera in the photo, the video should acknowledge that (eye contact, waving, etc.)
- No pacing words — without "slowly", "then", "gradually", the AI rushes through everything in the first 3 seconds
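Two of the mistakes above (too short, no pacing words) can be caught mechanically before the prompt is sent. The sketch below is a rough heuristic, not part of the CLI: `check_video_prompt` is a hypothetical helper, and the thresholds (at least two sentences, at least two distinct pacing words) are assumptions loosely based on the rules in this section.

```python
import re
from typing import List

# Pacing words listed in the key rules above.
PACING_WORDS = ("slowly", "then", "gradually", "after that")


def check_video_prompt(prompt: str) -> List[str]:
    """Return warnings for prompts that are likely to leave dead air."""
    warnings = []
    sentences = [s for s in re.split(r"[.!?]+", prompt) if s.strip()]
    if len(sentences) < 2:
        warnings.append("too short: aim for 2-3 sentences describing 3-4 actions")
    text = prompt.lower()
    found = [w for w in PACING_WORDS if re.search(rf"\b{re.escape(w)}\b", text)]
    if len(found) < 2:
        warnings.append("few pacing words: add slowly / then / gradually / after that")
    return warnings
```

An empty list means the prompt passed both checks; it says nothing about whether the actions continue the scene in the photo, which still needs a human or model judgment.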
Execute Video
CODEBLOCK10
Optional flags: INLINECODE22
Photo + Video Together
When the user requests a video, first generate the photo, then use the generated photo URL as --image for the video subcommand:
CODEBLOCK11
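The chaining step can be sketched independently of how the CLI is invoked. Here `run` stands in for whatever actually calls the `photo` and `video` subcommands and returns the resulting URL; it is a hypothetical callable introduced for this example, as is `photo_then_video`.

```python
from typing import Callable, Tuple

# (subcommand, prompt, image_url) -> URL of the generated media. Hypothetical signature.
Runner = Callable[[str, str, str], str]


def photo_then_video(run: Runner, reference_image: str,
                     photo_prompt: str, video_prompt: str) -> Tuple[str, str]:
    """Generate the photo first, then feed its URL to the video subcommand as --image."""
    photo_url = run("photo", photo_prompt, reference_image)
    video_url = run("video", video_prompt, photo_url)
    return photo_url, video_url
```

Injecting the runner keeps the ordering logic testable without network access or API keys.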
Voice Mode
Workflow
1. Get the user's prompt for what Clawdess should say
2. Generate the voice message via the TTS provider
3. Extract the voice URL from the response
Voice Prompt Crafting
Write what she actually says — natural speech, not a script description. The TTS engine reads it literally.
Key rules:
- Match the moment: if she just sent a sleepy bedtime photo, the voice should sound cozy and gentle, not hyper
- Keep it short — under 30 seconds. One or two sentences is ideal. Long monologues sound robotic.
- Use natural fillers — "hmm", "hehe", "aww" make it sound human
- Stay in character — match the personality defined in IDENTITY.md / SOUL.md
Examples by context:
- Morning: INLINECODE24
- Late night: INLINECODE25
- Playful: INLINECODE26
- Missing someone: INLINECODE27
Common Mistakes to Avoid
- Writing stage directions: `(whispers softly)` won't work, the TTS reads it literally
- Too formal: "I would like to inform you" sounds like a robot, not a person
- Mismatch with photo/video — if she just sent a gym selfie, don't send a sleepy voice note
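Stage directions and overlong text are easy to check for before calling the TTS provider. The sketch below is a heuristic under stated assumptions: `check_voice_text` is a hypothetical helper, the 75-word limit assumes roughly 150 spoken words per minute, and counting by whitespace only makes sense for space-separated languages (Chinese text would need a character-based limit instead).

```python
import re
from typing import List

MAX_WORDS = 75  # assumption: ~150 words per minute, so roughly 30 seconds


def check_voice_text(text: str) -> List[str]:
    """Return warnings for text the TTS engine is likely to handle badly."""
    warnings = []
    # Parenthesised, bracketed, or *starred* spans usually mean stage directions.
    if re.search(r"\([^)]*\)|\[[^\]]*\]|\*[^*]+\*", text):
        warnings.append("stage direction found: the TTS reads it literally")
    if len(text.split()) > MAX_WORDS:
        warnings.append("probably longer than 30 seconds: shorten it")
    return warnings
```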
Execute Voice
CODEBLOCK12
Example:
CODEBLOCK13
Optional flags: --api, --provider ALIYUN|ZAI
Output
If the script returns a URL, respond with "MEDIA:" followed by the URL; otherwise, upload the file.
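The output rule can be sketched as a small branch. This is an illustration only: `format_reply` is a hypothetical helper, treating anything that starts with `http://` or `https://` as a URL is an assumption, and so is joining "MEDIA:" and the URL without a separator.

```python
from typing import Tuple


def format_reply(result: str) -> Tuple[str, str]:
    """Return ("media", reply_text) for a URL, or ("upload", path) for a local file."""
    value = result.strip()
    if value.startswith(("http://", "https://")):
        # Assumption: no space between the "MEDIA:" marker and the URL.
        return ("media", f"MEDIA:{value}")
    return ("upload", value)
```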
Error Handling
- API key missing: Ensure the API key is set in the environment or passed as an argument
- Image/voice generation failed: Check prompt content and API quota
Tips
1. Mirror mode context examples (outfit focus):
   - "wearing a santa hat", "in a business suit", "wearing a summer dress"
2. Direct mode context examples (location/portrait focus):
   - "a cozy cafe with warm lighting", "a sunny beach at sunset"
3. Voice style: Uses the "Chelsie" voice (female, Chinese) by default. Keep voice messages short (under 30 seconds).
4. Scheduling: Combine with the OpenClaw scheduler for automated posts
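Reference Image
The user should define a reference image.
When to Use
Photo:
- User says "send a pic", "send me a pic", "send a photo", "send a selfie"
- User says "send a pic of you...", "send a selfie of you..."
- User asks "what are you doing?", "how are you doing?", "where are you?"
- User describes a context: "send a pic wearing...", "send a pic at..."
Video:
- User says "send a video"
- User says "send a video of you..."
- User says "send a video wearing...", "send a video at..."
Voice:
- User says "talk to me", "send me a voice message", "send a voice note"
- User wants to hear Clawdess's voice
- Any situation where a voice message would be better than text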
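Subcommands
The CLI has three independent subcommands:
| Subcommand | Purpose |
|---|---|
| `photo` | Generate an AI-edited photo from a reference image |
| `video` | Generate a video from an image |
| `voice` | Generate a voice message via TTS |
API Keys
| Subcommand | Flag | Environment Variable | Notes |
|---|---|---|---|
| `photo` | `--api` | `CLAWDESS_PHOTO_API` | |
| `video` | `--api` | `CLAWDESS_VIDEO_API` | |
| `voice` | `--api` | `CLAWDESS_VOICE_API` | |
Providers
| Type | Available Providers | Default |
|---|---|---|
| Photo | FAL, HUOSHANYUN | FAL |
| Video | FAL, XAI | FAL |
| Voice | ALIYUN, ZAI | ALIYUN |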
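Photo Mode
Workflow
1. Get the user's prompt for how to edit the image
2. Edit the image via the AI provider, using the fixed reference image
3. Extract the image URL from the response
Prompt Crafting
Before writing any prompt, think about the scene context:
1. Where is she? Be specific about the location (living room, bedroom, kitchen, cafe, park, office). This sets the tone for the whole image.
2. What time is it? Morning, afternoon, evening, late night. This affects lighting and mood, and it must reflect the current time.
3. What is she wearing? Match the outfit to the location and time. For example: pajamas at home late at night, casual clothes at a cafe, workout clothes at the gym. She also has her own go-to outfit. Don't put her in a dress at the gym.
4. What is she doing? The pose or action should feel natural for the setting. Cooking in the kitchen, reading on the couch, stretching after a workout.
5. What expression? Match the mood. Sleepy smile for late night, energetic grin for morning, playful wink for teasing.
Key rules:
- Always start the prompt with `Render this image as make`
- Always end with `WITHOUT Depth of field.` (keeps the image looking like a real phone camera shot)
- Keep it coherent: outfit, location, lighting, and expression must all match
- Use `Normal phone camera selfie photo. Phone camera photo quality` for selfie types to keep it realistic
- Don't over-describe: one clear scene beats a wall of adjectives
Prompt Templates
Every prompt must cover all 5 checklist items: where, when (lighting), outfit, action/pose, expression.
Type 1: Mirror Selfie: outfit showcases, full-body shots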
Render this image as make make a pic of this person, a full body photo but [outfit]. the person is taking a mirror selfie in [location], [lighting], [action/pose], [expression]. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.
Examples:
Render this image as make make a pic of this person, a full body photo but wearing oversized pajamas and fuzzy slippers. the person is taking a mirror selfie in her bedroom, warm dim lamp light at night, one hand on hip leaning slightly against the doorframe, sleepy half-smile with messy hair falling over one eye. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.
Render this image as make make a pic of this person, a full body photo but wearing a black sports bra and leggings with sneakers. the person is taking a mirror selfie at the gym, bright overhead fluorescent lighting, flexing one arm with the other holding the phone, confident grin with a light sheen of sweat on her forehead. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.
Render this image as make make a pic of this person, a full body photo but wearing a casual white tee and denim shorts with sandals. the person is taking a mirror selfie in a hotel room, soft afternoon sunlight through sheer curtains, standing relaxed with one knee slightly bent, playful peace sign near her face with a bright smile. Normal phone camera selfie photo. Phone camera photo quality WITHOUT Depth of field.
Type 2: Non-Selfie: location/portrait focus
Render this image as make make a pic of this person, [outfit]. by herself at [location + details], [lighting], [action/pose], looking straight into the lens, eyes centered and clearly visible, [expression]. WITHOUT Depth of field.
Examples:
Render this image as make make a pic of this person, wearing a cozy cream knit sweater and jeans. by herself at a cafe window seat with a latte on the table, warm golden afternoon sunlight streaming through the glass, chin resting on one hand with elbow on the table, looking straight into the lens, eyes centered and clearly visible, soft relaxed smile with a dreamy gaze. WITHOUT Depth of field.
Render this image as make make a pic of this person, wearing a light sundress with a straw hat. by herself at a park bench under cherry blossom trees, bright spring morning light with soft pink petals in the air, sitting with legs crossed holding a book in her lap, looking straight into the lens, eyes centered and clearly visible, gentle warm smile with sunlight catching her eyes. WITHOUT Depth of field.
Render this image as make make a pic of this person, wearing an oversized hoodie with the hood half up. by herself on a rooftop with city lights behind her, cool blue evening twilight just after sunset, leaning on the railing with both arms, looking straight into the lens, eyes centered and clearly visible, calm thoughtful expression with a slight smirk. WITHOUT Depth of field.
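Common Mistakes to Avoid
- Saying "at home" without specifying which room. Be specific: living room, bedroom, kitchen
- Outfit that doesn't match the setting: no heels at the beach, no pajamas at a restaurant
- Forgetting lighting: indoor at night needs warm lamp light, not bright sunlight
- Generic expressions: "smiling" is weak; "sleepy half-smile with one eye squinting" is vivid
Execute Photo
```bash
python3 {baseDir}/scripts/clawdess.py photo \
  --api CLAWDESS_PHOTO_API \
  --prompt "your prompt" \
  --image "reference image URL"
```
Optional flags: --provider FAL|HUOSHANYUN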
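Video Mode
Workflow
1. Use `--image` as the source (either a previously generated photo URL or any image URL)
2. Generate a video from the image via the AI provider
Video Prompt Crafting
The video prompt describes what happens next in the scene from the photo. Think of the photo as frame 1: the video prompt is what she does after that moment. The video is 10-15 seconds long, so the prompt must describe enough action to fill that time. A prompt that is too short leaves a static frame where nothing changes.
Key rules:
- Fill the full duration: describe a sequence of 3-4 connected actions with pacing words (slowly, then, gradually, after that). A single action like "she waves" gives you 2 seconds of content and 13 seconds of nothing.
- Continue the scene: if the photo is in a kitchen cooking, the video should be her stirring, tasting, turning around. Don't teleport her to a different location.
- Keep it physical: describe body movements, not abstract concepts. "walks to the couch and sits down" not "feels relaxed".
- Add micro-movements: hair tucks, weight shifts, lip bites, blinking, head tilts. These fill gaps between main actions and make it look natural.
- Match the energy: sleepy photo = slow gentle movements. Energetic photo = bouncy, lively motion.
- Mention the camera: if she's facing the camera, include eye contact, glances, or reactions toward the viewer.
Prompt structure (aim for 2-3 sentences minimum):
[Main action 1 with a pacing word], [micro-movement or transition], [main action 2], [final action or camera interaction]. [Overall mood/motion style].
Examples (notice the detail and length):
- Photo at living room couch → She slowly reaches for the remote on the coffee table, leans back into the couch cushions, and crosses her legs. She tucks a strand of hair behind her ear, smiles softly at the camera, then pulls a blanket over her legs and settles in. Smooth, natural movement with a warm, cozy mood.
- Photo at kitchen counter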