
Table 5 shows the performance of CQCC independently for each of the different spoofing attacks, grouped into known and unknown attacks. Results are presented here for the 19th-order feature set with A coefficients only. The average EER of this system is 0.36%. Also illustrated for comparison is the performance of the four systems described in Section 2.2.

Focusing first on known attacks, all five systems deliver excellent error rates below 0.41%. The proposed CQCC features rank third according to the average error rate, with an EER of 0.067%. Voice conversion attacks S2 and S5 seem to be the most difficult to detect. Speech synthesis attacks S3 and S4, however, are perfectly detected by all systems.

It is for unknown attacks where the difference between systems is greatest. Whereas attacks S6, S7 and S9 are detected reliably by all systems, there is considerable variation for attacks S8 and S10. In particular, the performance for attack S10, the only unit-selection-based speech synthesis algorithm, varies considerably; past results range from 8.2% to 26.1%. Being so much higher than the error rates for the other attacks, the average performance for unknown attacks is dominated by that for S10. Average error rates reported in past work for unknown attacks range from 1.7% to 5.2%.

CQCC features compare favourably. While the performance for S6, S7 and S9 is not as good as that of other systems, error rates are still low, below 0.2%. While the error rate of 1.8% for S8 is considerably higher than for other systems, the result for attack S10 is significantly better than that of all other systems: here the error is reduced to 1.1%. This corresponds to a relative improvement of 86% with regard to the next best performing system for S10. The average performance of CQCC features for unknown attacks is 0.7%, which corresponds to a relative improvement of 61% over the next best system.
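As a quick sanity check (a minimal sketch, not part of the original evaluation protocol), the relative-improvement figures quoted above can be reproduced from the Table 5 EERs, taking LFCC-DA as the next best performing system:

```python
# Relative improvement of CQCC-A over the next best system (LFCC-DA),
# using the S10 and unknown-average EERs (%) from Table 5.
def relative_improvement(baseline_eer, new_eer):
    """Fractional reduction in EER relative to the baseline."""
    return (baseline_eer - new_eer) / baseline_eer

s10_gain = relative_improvement(8.185, 1.136)      # attack S10
unknown_gain = relative_improvement(1.670, 0.653)  # unknown-attack average

print(f"S10: {s10_gain:.0%}, unknown avg: {unknown_gain:.0%}")
# S10: 86%, unknown avg: 61%
```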

The average performance across all 10 spoofing attacks is illustrated in the final column of Table 5. The average error rate of 0.360% is significantly better than those reported in previous work. The picture of generalisation is thus not straightforward. While performance for unknown attacks is worse than for known attacks, CQCC features nonetheless deliver the most consistent performance across the 10 different spoofing attacks in the ASVspoof database. Even if it must be acknowledged that the work reported in this paper was conducted post-evaluation, to the authors' best knowledge, CQCC features give the best spoofing detection performance reported to date.
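The averages reported for the CQCC-A system can likewise be reproduced from the per-attack EERs in Table 5 (a minimal check, assuming the tabulated values; the computed means agree with the table's 0.067, 0.653 and 0.360 up to last-digit rounding):

```python
# Per-attack EERs (%) for the CQCC-A system, copied from Table 5.
known = [0.005, 0.149, 0.000, 0.000, 0.179]    # S1-S5
unknown = [0.152, 0.071, 1.829, 0.074, 1.136]  # S6-S10

def avg(eers):
    """Arithmetic mean of a list of EERs."""
    return sum(eers) / len(eers)

# Averages: known ~0.067, unknown ~0.653, overall ~0.360.
print(avg(known), avg(unknown), avg(known + unknown))
```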

7 Conclusions

This paper introduces a new feature for the automatic detection of spoofing attacks which can threaten the reliability of automatic speaker verification. The new feature is based upon the constant Q transform and is combined with traditional cepstral analysis. Termed constant Q cepstral coefficients (CQCCs), the new features provide a variable-resolution, time-frequency representation of the spectrum which captures detailed characteristics which are missed by more classical approaches to feature extraction.

² Thanks to Md. Sahidullah and Tomi Kinnunen from the University of Eastern Finland for kindly providing individual results on all spoofing attacks.

Table 5: Performance (EER, %) of the best performing system, including individual results for each spoofing attack, averages over known and unknown attacks, and the total average. Results for the systems reviewed in Section 2 are included for comparison.

            ------------- Known attacks -------------   ------------ Unknown attacks ------------
System      S1     S2     S3     S4     S5     Avg.     S6     S7     S8     S9     S10    Avg.     All Avg.
CFCC-IF     0.101  0.863  0.000  0.000  1.075  0.408    0.846  0.242  0.142  0.346  8.490  2.013    1.211
i-vector    0.004  0.022  0.000  0.000  0.013  0.008    0.019  0.000  0.015  0.004  19.57  3.922    1.965
M&P feat.   0.000  0.000  0.000  0.000  0.010  0.002    0.010  0.000  0.000  0.000  26.10  5.222    2.612
LFCC-DA     0.027  0.408  0.000  0.000  0.114  0.110    0.149  0.011  0.074  0.027  8.185  1.670    0.890
CQCC-A      0.005  0.149  0.000  0.000  0.179  0.067    0.152  0.071  1.829  0.074  1.136  0.653    0.360

These characteristics are shown to be informative for spoofing detection. When coupled with a simple Gaussian mixture model-based classifier and assessed on a standard database, CQCC features outperform all existing approaches to spoofing detection. In addition, while there is still a marked discrepancy between performance for known and unknown spoofing attacks, CQCC results correspond to a relative improvement of 61% over the previously best performing system. Future work should consider the application of CQCCs for more generalised countermeasures such as a one-class classifier. The application of CQCCs in other speaker recognition and related problems is another obvious direction.

References

[1] N. Evans, T. Kinnunen, and J. Yamagishi, “Spoofing and countermeasures for automatic speaker verification,” in INTERSPEECH, 2013, pp. 925–929.

[2] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015.

[3] J. Lindberg and M. Blomberg, “Vulnerability in speaker verification – a study of technical impostor techniques,” in Proc. European Conference on Speech Communication and Technology (Eurospeech), 1999.

[4] J. Villalba and E. Lleida, “Detecting replay attacks from far-field recordings on speaker verification systems,” in Biometrics and ID Management: COST 2101 European Workshop, BioID 2011, Brandenburg (Havel), Germany, March 8–10, 2011, Proceedings, 2011, pp. 274–285.

[5] B. L. Pellom and J. H. L. Hansen, “An experimental study of speaker verification sensitivity to computer voice-altered imposters,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, Mar 1999, pp. 837–840.

[6] P. Z. Patrick, G. Aversano, R. Blouet, M. Charbit, and G. Chollet, “Voice forgery using ALISP: Indexation in a client memory,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, March 2005, pp. 17–20.

[7] T. Masuko, T. Hitotsumatsu, K. Tokuda, and T. Kobayashi, “On the security of HMM-based speaker verification systems against imposture using synthetic speech,” in EUROSPEECH, ISCA, 1999.

[8] P. De Leon, M. Pucher, and J. Yamagishi, “Evaluation of the vulnerability of speaker verification to synthetic speech,” in Odyssey IEEE Workshop, 2010.

[9] Y. W. Lau, M. Wagner, and D. Tran, “Vulnerability of speaker verification to voice mimicking,” in Proc. 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, Oct 2004, pp. 145–148.

[10] Y. W. Lau, D. Tran, and M. Wagner, “Testing voice mimicry with the YOHO speaker verification corpus,” in Knowledge-Based Intelligent Information and Engineering Systems: 9th International Conference, KES 2005, Melbourne, Australia, September 14–16, 2005, Proceedings, Part IV. Berlin, Heidelberg: Springer, 2005, pp. 15–21.

[11] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilci, M. Sahidullah, and A. Sizov, “ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in INTERSPEECH, Dresden, Germany, 2015.

[12] T. B. Patel and H. A. Patil, “Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech,” in INTERSPEECH, 2015, pp. 2062–2066.

[13] M. Sahidullah, T. Kinnunen, and C. Hanilçi, “A comparison of features for synthetic speech detection,” in INTERSPEECH, 2015, pp. 2087–2091.

[14] C. Hanilçi, T. Kinnunen, M. Sahidullah, and A. Sizov, “Classifiers for synthetic speech detection: a comparison,” in INTERSPEECH, 2015, pp. 2087–2091.

[15] Z. Wu, T. Kinnunen, N. Evans, and J. Yamagishi, “ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge evaluation plan,” 2014.

[16] S. Novoselov, A. Kozlov, G. Lavrentyeva, K. Simonchik, and V. Shchemelinin, “STC anti-spoofing systems for the ASVspoof 2015 challenge,” in INTERSPEECH, 2015.

[17] X. Xiao, X. Tian, S. Du, H. Xu, E. Chng, and H. Li, “Spoofing speech detection using high dimensional magnitude and phase features: the NTU approach for ASVspoof 2015 challenge,” in INTERSPEECH, 2015.

[18] D. Gabor, “Theory of communication,” J. Inst. Elect. Eng., vol. 93, pp. 429–457, 1946.

[19] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing (2nd ed.). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1999.

[20] B. C. J. Moore, An Introduction to the Psychology of Hearing. BRILL, 2003.

[21] J. Youngberg and S. Boll, “Constant-Q signal analysis and synthesis,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, Apr 1978, pp. 375–378.

[22] B. Mont-Reynaud, “The bounded-Q approach to time-varying spectral analysis,” 1986.

[23] J. Brown, “Calculation of a constant q spectral transform,” Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, January 1991.

[24] R. E. Radocy and J. D. Boyle, Psychological Foundations of Musical Behavior. C. C. Thomas, 1979.

[25] G. Costantini, R. Perfetti, and M. Todisco, “Event based transcription system for polyphonic piano music,” Signal Process., vol. 89, no. 9, pp. 1798–1811, Sep. 2009.

[26] R. Jaiswal, D. Fitzgerald, E. Coyle, and S. Rickard, “Towards shifted NMF for improved monaural separation,” in 24th IET Irish Signals and Systems Conference (ISSC 2013), June 2013, pp. 1–7.

[27] C. Schörkhuber, A. Klapuri, and A. Sontacchi, “Audio pitch shifting using the constant-Q transform,” Journal of the Audio Engineering Society, vol. 61, no. 7/8, pp. 425–434, July/August 2013.

[28] S. Mallat, A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way, 3rd ed. Academic Press, 2008.

[29] M. Vetterli and C. Herley, “Wavelets and filter banks: theory and design,” IEEE Transactions on Signal Processing, vol. 40, no. 9, pp. 2207–2232, Sep 1992.

[30] C. Schörkhuber, A. Klapuri, N. Holighaus, and M. Dörfler, “A Matlab toolbox for efficient perfect reconstruction time-frequency transforms with log-frequency resolution,” in Audio Engineering Society 53rd Conference on Semantic Audio, G. Fazekas, Ed., June 2014.

[31] B. R. Glasberg and B. C. J. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hearing Research, vol. 47, no. 1, pp. 103–138, 1990.

[32] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug 1980.

[33] G. Wolberg, Cubic Spline Interpolation: a Review. Columbia University, 1988.

[34] S. Maymon and A. V. Oppenheim, “Sinc interpolation of nonuniform samples,” IEEE Transactions on Signal Processing, vol. 59, no. 10, pp. 4745–4758, Oct 2011.

[35] P. Jacob, “Design and implementation of polyphase decimation filter,” International Journal of Computer Networks and Wireless Communications (IJCNWC), ISSN 2250-3501.
