The neural encoders involved in the experiments are reported below; the respective GitHub links point to the specific commit used to generate the evaluation data.
This paper reviews the current state and emerging trends in synthetic speech detection. It outlines the main data-driven approaches, discusses the advantages and drawbacks of focusing future research solely on neural encoding detection, and offers recommendations for promising research directions.
The observations in this paper aim to guide future state-of-the-art research in the field and to highlight the risk of overcommitting to approaches that may not stand the test of time.
This page complements the paper by providing a full evaluation of several off-the-shelf models for synthetic speech detection. The models were tested first on the pristine ASVspoof 2019 LA eval dataset, and then on variants of it created by neurally encoding the bona fide trials.
Example bona fide trials used for the performance evaluation. The encoders were configured to compress the input utterances, and the output was then decoded back to WAV.
Summary of the balanced accuracy (BAC) and equal error rate (EER) achieved by the latest self-supervised-learning-based methods on the ASVspoof 2019 LA eval dataset and on neurally encoded variants thereof.
The results were obtained by using the off-the-shelf weights provided by the authors of the respective detection models.
Detection performance, which is nearly perfect on the original ASVspoof 2019 dataset, degrades dramatically in the presence of neural encoding of bona fide trials, with the sole exception of the Descript Audio Codec.
Results breakdown in terms of area under the curve (AUC), equal error rate (EER), and balanced accuracy (BAC).
The sampling frequency at which each neural encoder operates is noted whenever different from 16 kHz.
| Baseline | XLSR-AASIST | | | XLSR-SLS | | | XLSR-MAMBA | | |
|---|---|---|---|---|---|---|---|---|---|
| | AUC | EER | BAC | AUC | EER | BAC | AUC | EER | BAC |
| ASVspoof19 LA | 0.9999 | 0.0015 | 0.9983 | 0.9999 | 0.0023 | 0.9977 | 0.9999 | 0.0020 | 0.9964 |
| ASVspoof19 LA (24 kHz) | 0.9999 | 0.0026 | 0.9975 | 0.9999 | 0.0026 | 0.9970 | 0.9999 | 0.0020 | 0.9969 |
| ASVspoof19 LA (44.1 kHz) | 0.9999 | 0.0025 | 0.9974 | 0.9999 | 0.0026 | 0.9970 | 0.9999 | 0.0020 | 0.9970 |
| Neural Encoders | XLSR-AASIST | | | XLSR-SLS | | | XLSR-MAMBA | | |
|---|---|---|---|---|---|---|---|---|---|
| | AUC | EER | BAC | AUC | EER | BAC | AUC | EER | BAC |
| EnCodec (24 kHz) | 0.6276 | 0.3955 | 0.5043 | 0.8737 | 0.1992 | 0.5070 | 0.6973 | 0.3290 | 0.5079 |
| Lyra-V2 | 0.4695 | 0.5050 | 0.5145 | 0.7897 | 0.2755 | 0.5153 | 0.4629 | 0.5101 | 0.5009 |
| Descript Audio Codec (44.1 kHz) | 0.9989 | 0.0151 | 0.9737 | 0.9984 | 0.0189 | 0.9484 | 0.9988 | 0.0166 | 0.9802 |
| L3AC | 0.8823 | 0.1948 | 0.5584 | 0.9158 | 0.1589 | 0.5301 | 0.8792 | 0.1859 | 0.5863 |
| SpeechTokenizer | 0.9803 | 0.0677 | 0.7537 | 0.9721 | 0.0838 | 0.6675 | 0.9719 | 0.0877 | 0.7737 |
Since detection performance is not affected by the sampling rate and the spoofed content was not modified, the performance drop depends entirely on the occurrence of false alarms on bona fide trials.
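For reference, the two operating-point metrics used above can be computed directly from detector scores. The sketch below is a minimal, self-contained illustration using hypothetical toy scores, not the scoring code or data used in the paper; it assumes the convention that higher scores indicate synthetic content.

```python
def eer(bona_scores, spoof_scores):
    """Equal error rate: sweep thresholds over all observed scores and
    return the mean of FAR and FRR where they are closest."""
    best_gap, best_eer = float("inf"), None
    for t in sorted(bona_scores + spoof_scores):
        far = sum(s >= t for s in bona_scores) / len(bona_scores)   # false alarms on bona fide
        frr = sum(s < t for s in spoof_scores) / len(spoof_scores)  # missed spoofs
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

def balanced_accuracy(bona_scores, spoof_scores, threshold=0.5):
    """BAC at a fixed operating point (p > 0.5 implies synthetic)."""
    tnr = sum(s < threshold for s in bona_scores) / len(bona_scores)
    tpr = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
    return (tnr + tpr) / 2

# Well-separated scores: both metrics look ideal.
print(eer([0.1, 0.2, 0.3], [0.7, 0.8, 0.9]))                # 0.0
print(balanced_accuracy([0.1, 0.2, 0.3], [0.7, 0.8, 0.9]))  # 1.0
```

Note that the EER is threshold-free, while the BAC depends on the fixed p = 0.5 decision point; this distinction is what makes the BAC collapse when bona fide scores shift.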
ROC curves on the ASVspoof 2019 LA eval dataset and its neurally encoded variants. The operating point for an output probability of 0.5 (where, by convention, p > 0.5 implies that the content is synthetic) is marked by a circle.
Neural encoding drastically shifts the p = 0.5 operating point, resulting in insufficient balanced accuracy.
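The effect can be reproduced with a toy example: if neural encoding shifts the bona fide score distribution upward while the ranking of scores is preserved, the AUC stays perfect but the BAC at the fixed p = 0.5 threshold collapses to chance. The scores below are hypothetical and purely illustrative.

```python
def auc(bona, spoof):
    """Probability that a random spoof score exceeds a random bona fide score."""
    pairs = [(b, s) for b in bona for s in spoof]
    return sum((s > b) + 0.5 * (s == b) for b, s in pairs) / len(pairs)

def bac(bona, spoof, thr=0.5):
    """Balanced accuracy at a fixed threshold (p > 0.5 implies synthetic)."""
    tnr = sum(b < thr for b in bona) / len(bona)
    tpr = sum(s >= thr for s in spoof) / len(spoof)
    return (tnr + tpr) / 2

clean_bona = [0.05, 0.10, 0.15]
coded_bona = [0.60, 0.65, 0.70]   # bona fide trials after neural encoding
spoof      = [0.80, 0.90, 0.95]

print(auc(clean_bona, spoof), bac(clean_bona, spoof))  # 1.0 1.0
print(auc(coded_bona, spoof), bac(coded_bona, spoof))  # 1.0 0.5
```

Every encoded bona fide trial scores above 0.5 and is therefore flagged as synthetic, even though a recalibrated threshold would still separate the two classes perfectly.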
@inproceedings{cuccovillo2026iwbf,
  author    = {Cuccovillo, Luca and Wang, Xin and Gerhardt, Milica and Aichroth, Patrick},
  title     = {Neural Encoding Detection is Not All You Need for Synthetic Speech Detection},
  booktitle = {IEEE International Workshop on Biometrics and Forensics (IWBF)},
  location  = {Sophia Antipolis, France},
  year      = {2026},
  note      = {in press},
}