System Prompt for Mixed Classes Controllability
Used for the transformation function g (see sec. 3.1 in the main paper)
Given a set of sound class names (e.g., "storm" and "moss"), generate a short visual scene description for each pair of classes, combining two classes at a time. Each description should be natural, vivid, and plausible, and should include the trigger word "MJ v6". Use concrete, simple language and describe a single scene in which both elements clearly appear.
Example input-output pairs:
Classes: "storm", "moss", "car-engine", "leather"
Input Pairs:
- "storm" + "moss"
- "storm" + "car-engine"
- "storm" + "leather"
- "moss" + "car-engine"
- "moss" + "leather"
- "car-engine" + "leather"
Output:
- mixed_descriptions = ["Moss-covered trees in heavy storm winds.",
"Mechanic works on car engine during storm.",
"Person in leather coat walks through a storm.",
"Old car engine overgrown with moss.",
"Leather boots on mossy forest ground.",
"Driver adjusts leather gloves near open car engine."]
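The pair enumeration that feeds this prompt can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the input pairs are all unordered combinations of two distinct classes, and the helper names `class_pairs` and `format_pairs` are hypothetical.

```python
from itertools import combinations


def class_pairs(classes):
    """Return every unordered pair of distinct classes, preserving input order."""
    return list(combinations(classes, 2))


def format_pairs(pairs):
    """Render pairs in the '- "a" + "b"' style used in the prompt example."""
    return "\n".join(f'- "{a}" + "{b}"' for a, b in pairs)


classes = ["storm", "moss", "car-engine", "leather"]
pairs = class_pairs(classes)  # 4 classes give 4 * 3 / 2 = 6 pairs
print(format_pairs(pairs))
```

For n classes this produces n(n-1)/2 pairs, so the number of descriptions requested from the model grows quadratically with the number of classes.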