https://github.com/descriptinc/descript-audio-codec/blob/mai...
https://github.com/NVIDIA/BigVGAN/blob/main/loss.py#L23
https://arxiv.org/pdf/2210.13438 (the github repo doesn't include training, just inference)
It is INCREDIBLY common to use multi-scale spectral loss as the audio distance / objective measure in audio generation. They have some issues (i.e. they aren't always well correlated with human perception) but they are the known-current-best.
https://github.com/descriptinc/descript-audio-codec/blob/mai...
https://github.com/NVIDIA/BigVGAN/blob/main/loss.py#L23
https://arxiv.org/pdf/2210.13438 (the github repo doesn't include training, just inference)
It is INCREDIBLY common to use multi-scale spectral loss as the audio distance / objective measure in audio generation. They have some issues (i.e. they aren't always well correlated with human perception) but they are the known-current-best.