We generate 60 videos using our base video model. For each video, we then sample two sets of three layers: three chosen at random from the Motion Layers, and three chosen at random from the 11 lowest-scoring Non-Motion Layers. We report DAVIS benchmark metrics to measure mask alignment quality, and we additionally compute a VQA score to assess text–video semantic accuracy.
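
The per-video sampling step can be sketched as follows. This is only an illustration under stated assumptions, not our actual implementation: the function name `sample_layer_sets`, the layer indices, and the placeholder scores are all hypothetical, and we assume sampling is without replacement and that a lower score means a less motion-relevant layer.

```python
import random


def sample_layer_sets(motion_layers, non_motion_scores, k=3, pool_size=11, rng=None):
    """Draw k Motion Layers and k of the lowest-scoring Non-Motion Layers.

    motion_layers: iterable of layer indices (hypothetical values).
    non_motion_scores: dict mapping layer index -> score (hypothetical values).
    """
    if rng is None:
        rng = random.Random()
    # Sample k distinct layers from the Motion Layers.
    motion_set = rng.sample(sorted(motion_layers), k)
    # Keep the pool_size lowest-scoring Non-Motion Layers, then sample k of them.
    lowest = sorted(non_motion_scores, key=non_motion_scores.get)[:pool_size]
    non_motion_set = rng.sample(lowest, k)
    return motion_set, non_motion_set


# Hypothetical layer indices and scores, for illustration only.
motion_layers = [3, 7, 12, 15, 18, 21]
non_motion_scores = {i: 0.01 * i for i in range(40) if i not in motion_layers}

rng = random.Random(0)  # fixed seed so the draw is reproducible
samples = [
    sample_layer_sets(motion_layers, non_motion_scores, rng=rng)
    for _ in range(60)  # one pair of layer sets per generated video
]
```

Each entry of `samples` holds the two layer sets for one video. The sets are drawn independently for every video, so different videos generally use different layers.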