System Prompt for Volume Controllability

Used for the transformation function f (see Sec. 3.1 of the main paper).

Given a list of sound class names (e.g., "storm" and "car-engine"), generate three short visual scene descriptions for each sound class: one for high volume, one for medium volume, and one for low volume. Each description should be natural, vivid, and plausible, and should include the trigger word "MJ v6". Use concrete, simple language and describe a single scene in which the objects or materials named by the class appear closer or farther away according to the volume of the sound: close for high volume, distant for low volume.

Example of input-output:

Classes: "storm", "car-engine"

Input Pairs: 
- "storm" + "high volume"
- "storm" + "medium volume"
- "storm" + "low volume"
- "car-engine" + "high volume"
- "car-engine" + "medium volume"
- "car-engine" + "low volume"

Output:

volume_descriptions = {
  "storm": {
    "high volume": "Close-up powerful storm, lightning and heavy rain.", 
    "medium volume": "Storm nearby with visible rain and thunderclouds.", 
    "low volume": "Distant dark clouds, light rain falling far away." 
  },
  "car-engine": {
    "high volume": "Close-up view of engine with loud exhaust.",
    "medium volume": "Car engine running at medium speed nearby.", 
    "low volume": "Parked car visible in the distance."
  } 
}
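
The input-output format above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the helper names build_input_pairs and validate_descriptions, and the constant VOLUME_LEVELS, are hypothetical and introduced only for this example. The sketch enumerates the class/volume pairs listed under "Input Pairs" and checks that a returned dictionary has one non-empty description per class and volume level.

```python
from itertools import product

# The three volume levels named in the prompt (hypothetical constant name).
VOLUME_LEVELS = ("high volume", "medium volume", "low volume")


def build_input_pairs(classes):
    """Return one '"class" + "level"' string per class and volume level."""
    return [f'"{c}" + "{v}"' for c, v in product(classes, VOLUME_LEVELS)]


def validate_descriptions(classes, volume_descriptions):
    """Return a list of structural problems; an empty list means the
    dictionary has a non-empty string for every class and volume level."""
    problems = []
    for c in classes:
        entry = volume_descriptions.get(c)
        if not isinstance(entry, dict):
            problems.append(f"missing class: {c}")
            continue
        for v in VOLUME_LEVELS:
            text = entry.get(v)
            if not isinstance(text, str) or not text.strip():
                problems.append(f"missing description: {c} / {v}")
    return problems


if __name__ == "__main__":
    classes = ["storm", "car-engine"]
    for pair in build_input_pairs(classes):
        print("-", pair)
```

Calling validate_descriptions(classes, volume_descriptions) on the example output above would be expected to return an empty list, since both classes carry all three volume levels. The check is structural only; it does not verify the trigger word or the quality of the descriptions.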