Necessary pre-processing: Separate voice and accompaniment with UVR (skip this step if there is no accompaniment). Cut the audio input into shorter clips with a slicer, since Whisper takes input shorter than 30 seconds. Manually check ...
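The cutting step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the slicer tool itself: it splits at fixed, evenly spaced positions, whereas a real slicer would normally cut at silences. The names `plan_clips` and `slice_wav` are hypothetical, and the input is assumed to be an uncompressed WAV file.

```python
import wave
from pathlib import Path


def plan_clips(n_frames, sample_rate, max_seconds=30.0):
    """Return (start, end) frame ranges, each strictly shorter than max_seconds.

    Assumption: evenly sized clips are acceptable. A silence-aware slicer
    would choose the cut points differently.
    """
    if n_frames <= 0:
        return []
    # Longest allowed clip, in frames, kept strictly under the limit.
    max_len = int(max_seconds * sample_rate) - 1
    if max_len < 1:
        raise ValueError("max_seconds is too small for this sample rate")
    n_clips = -(-n_frames // max_len)  # ceiling division
    bounds = [i * n_frames // n_clips for i in range(n_clips + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def slice_wav(src, out_dir, max_seconds=30.0):
    """Write consecutive clips of a WAV file into out_dir and return their paths."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        clips = plan_clips(reader.getnframes(), reader.getframerate(), max_seconds)
        for index, (start, end) in enumerate(clips):
            reader.setpos(start)
            frames = reader.readframes(end - start)
            out_path = out_dir / f"{src.stem}_{index:03d}.wav"
            with wave.open(str(out_path), "wb") as writer:
                # Frame count in the header is corrected when the file is closed.
                writer.setparams(params)
                writer.writeframes(frames)
            written.append(out_path)
    return written
```

For example, `slice_wav("vocals.wav", "clips")` would write `clips/vocals_000.wav`, `clips/vocals_001.wav`, and so on, where `vocals.wav` is a placeholder file name. Fixed cut points can land in the middle of a word, so the resulting clips still need a manual check.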