For instance:

https://github.com/descriptinc/descript-audio-codec/blob/mai...

https://github.com/NVIDIA/BigVGAN/blob/main/loss.py#L23

https://arxiv.org/pdf/2210.13438 (the GitHub repo doesn't include training code, just inference)

It is INCREDIBLY common to use a multi-scale spectral loss as the audio distance / objective measure in audio generation. These losses have some issues (e.g. they aren't always well correlated with human perception), but they are the best option currently known.
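To make the idea concrete, here is a minimal NumPy sketch of a multi-scale spectral loss. It is not taken from any of the linked repositories (those are written against deep-learning frameworks and, as far as I know, often work on mel spectrograms). The function names, the FFT sizes, the 75% overlap, and the equal weighting of the linear and log terms are all assumptions chosen for illustration.

```python
import numpy as np


def stft_magnitude(signal, n_fft, hop):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))


def multi_scale_spectral_loss(pred, target, fft_sizes=(2048, 1024, 512, 256), eps=1e-7):
    """Average of linear + log magnitude L1 distances over several STFT resolutions.

    fft_sizes and eps are illustrative defaults, not values from a specific paper.
    """
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4  # assumed 75% overlap between frames
        s_pred = stft_magnitude(pred, n_fft, hop)
        s_tgt = stft_magnitude(target, n_fft, hop)
        linear_term = np.mean(np.abs(s_pred - s_tgt))
        log_term = np.mean(np.abs(np.log(s_pred + eps) - np.log(s_tgt + eps)))
        total += linear_term + log_term
    return total / len(fft_sizes)


# Usage: compare a clean 440 Hz tone against a noisy copy of itself.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440.0 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.05 * rng.standard_normal(sr)
print(multi_scale_spectral_loss(noisy, clean))
```

Large windows resolve frequency well and small windows resolve time well, which is why the distance is summed over several resolutions rather than computed at one.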