Pitch perception of noise-vocoded harmonic complex tones mimicking musical instruments


HAL Id: hal-03234198

https://hal.archives-ouvertes.fr/hal-03234198

Submitted on 26 May 2021



Masashi Unoki, Yukina Hosaka, Shunsuke Kidani


School of Information Science, Japan Advanced Institute of Science and Technology, Japan

{unoki, y-hosaka, kidani}@jaist.ac.jp

ABSTRACT

Cues derived from temporal fine structure (TFS) play an important role in pitch perception. However, noise-vocoded sound (NVS) has only the temporal amplitude envelope (TAE) and no TFS, which makes pitch perception of NVS difficult. We aimed to clarify whether cues derived from the TAE can play a role in pitch perception. We first used Thurstone’s method of paired comparison to investigate whether the pitch perceptual scales of harmonic complex tones (HCTs) mimicking musical instruments and their NVSs could be perceived as musical scales, and found that the pitch perceptual scales of the NVSs and HCTs were almost identical. We then applied the same method to HCT and NVS stimuli with three spectral tilts (moving down, flat, and moving up) and found that the pitch perceptual scales of the NVSs were generally affected by the spectral tilts, whereas those of the HCTs were not. These results suggest that cues derived from the TAE play an important role in pitch perception, even though they are affected by spectral tilts.

1. INTRODUCTION

The temporal amplitude envelope (TAE) of speech has been shown to be an important cue for speech perception by studies using noise-vocoded sound (NVS) [1–4]. NVS is generated by replacing the carriers with band-limited noise, so spectral cues are greatly reduced while temporal cues are preserved. Shannon et al. showed that presenting a dynamic temporal pattern in only a few broad spectral regions is sufficient for listeners to recognize linguistic information [1]. Modulation frequencies from 4 to 16 Hz have been shown to be important for speech recognition [5]. Therefore, people can correctly perceive linguistic information using the TAE of speech signals as a primary cue.

In our previous studies [6–9], the relative contributions of spectral and temporal cues to non-linguistic information in NVS, such as vocal emotion and speaker individuality, were clarified by systematically varying the number of channels and the upper limit of the envelope frequency. The results indicated that the TAE contributes to recognizing both speaker and vocal emotion. Temporal modulation cues above 4 Hz and below 8 Hz were thus found to be important for perceiving not only linguistic information but also non-linguistic information such as speaker individuality and vocal emotion.

In another previous study [10], we investigated the role of temporal cues in para-linguistic information for NVS by studying whether the TAE of speech affects the perception of urgency. Urgency scales were derived from the results of a paired comparison and used to investigate the relationship between temporal modulation components and urgency perception. The results indicated that temporal modulation cues in the TAE play an important role in urgency perception; in particular, temporal modulation components of NVS above 6 Hz and below 8 Hz were significant cues for urgency perception.

Pitch perception is a critical component of auditory and speech perception. Pitch is also intrinsically related to music perception, as it conveys crucial information about the melody, harmony, and tonality of sounds [11]. Pitch can therefore play an important role in non-linguistic and para-linguistic perception. For harmonic complex tones (HCTs) produced by musical instruments and the human voice, pitch is determined primarily by the low-numbered harmonics.

NVS can be used in cochlear implant (CI) simulators. CI users can recognize speech because a CI simulator can provide sufficient temporal cues of speech. However, CI users cannot accurately perceive pitch because a CI simulator provides poor spectral cues [12, 13]. Cues derived from temporal fine structure (TFS) play an important role in pitch perception, and CI users do not have access to the pitch produced by spectrally resolved harmonics because of the lack of spectral resolution. NVS has only the TAE and no TFS, as well as a limited number of channels. Therefore, NVS makes pitch perception difficult for CI users.

Nevertheless, weak pitch information is available via the periodic fluctuations of a complex tone’s TAE. The aim of this study was to clarify whether cues derived from the TAE of NVS can play a role in pitch perception.

2. SYNTHESIS OF NOISE-VOCODED SOUND

In this study, we synthesized NVS with the same method as in our previous studies to investigate pitch perception.

equivalent rectangular bandwidth (ERB_N) and ERB_N-number scale [9]. The ERB_N-number scale is comparable to a scale of distance along the basilar membrane, so the frequency resolution of the auditory system can be faithfully replicated by dividing frequency bands in accordance with the ERB_N-number. The relationship between ERB_N-number and acoustic frequency is defined as follows:

$$\mathrm{ERB_N\text{-}number} = 21.4\,\log_{10}\!\left(\frac{4.37\,f}{1000} + 1\right), \qquad (1)$$

where f is the acoustic frequency in Hz. The boundary frequencies of the band-pass filters (BPFs) were defined from 3 to 35 ERB_N-number with a bandwidth of 2 ERB_N. Therefore, the band-pass filterbank had 16 channels.
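As a quick check of the channel layout, the following sketch inverts Eq. (1) to convert the band edges at 3, 5, ..., 35 ERB_N-number into frequencies in Hz. The function and variable names are illustrative and do not come from the original study.

```python
import math


def hz_to_erb_number(f_hz: float) -> float:
    """Eq. (1): ERB_N-number for an acoustic frequency given in Hz."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_number_to_hz(erb_number: float) -> float:
    """Inverse of Eq. (1): acoustic frequency in Hz for an ERB_N-number."""
    return (10.0 ** (erb_number / 21.4) - 1.0) * 1000.0 / 4.37


# Band edges from 3 to 35 ERB_N-number in steps of 2 ERB_N:
# 17 edges, hence 16 adjacent channels.
edges_erb = list(range(3, 36, 2))
edges_hz = [erb_number_to_hz(e) for e in edges_erb]
bands_hz = list(zip(edges_hz[:-1], edges_hz[1:]))

for channel, (lo, hi) in enumerate(bands_hz, start=1):
    print(f"channel {channel:2d}: {lo:8.1f} Hz to {hi:8.1f} Hz")
```

Under this reading, the filterbank spans roughly 87 Hz to about 9.7 kHz, so any sampling rate used with it has to exceed about 19.4 kHz.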

The TAE of the output signal from each BPF was then extracted using the Hilbert transform and a low-pass filter (LPF; 2nd-order Butterworth IIR filter). The cut-off frequency of the LPF set the upper limit of the modulation frequency to 64 Hz. This limit relates to temporal resolution: a higher upper limit of the modulation frequency gives a higher temporal resolution.

Finally, the TAE in each channel was used to amplitude-modulate narrow band-limited noise (NBN) generated by band-pass filtering white noise with the same boundary frequencies. All amplitude-modulated NBNs were summed to generate the NVS stimulus.
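The processing chain described in this section can be sketched as follows, using NumPy and SciPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the band-pass filter design (4th-order Butterworth), the causal filtering, the random seed, and all function names are assumptions, since only the 16-channel ERB_N layout, the Hilbert envelope, and the 2nd-order Butterworth LPF at 64 Hz are given in the text.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfilt


def erb_number_to_hz(erb_number):
    """Inverse of Eq. (1): ERB_N-number to acoustic frequency in Hz."""
    return (10.0 ** (np.asarray(erb_number) / 21.4) - 1.0) * 1000.0 / 4.37


def noise_vocode(x, fs, env_cutoff_hz=64.0, bpf_order=4, seed=0):
    """Return a noise-vocoded version of the 1-D signal x sampled at fs Hz."""
    x = np.asarray(x, dtype=float)
    # 17 band edges from 3 to 35 ERB_N-number in 2-ERB_N steps: 16 channels.
    edges_hz = erb_number_to_hz(np.arange(3, 36, 2))
    # 2nd-order Butterworth low-pass limiting the envelope to env_cutoff_hz.
    lpf = butter(2, env_cutoff_hz, btype="low", fs=fs, output="sos")
    noise = np.random.default_rng(seed).standard_normal(len(x))
    y = np.zeros(len(x))
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        # Assumed band-pass design: Butterworth of order bpf_order.
        bpf = butter(bpf_order, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfilt(bpf, x)
        # TAE: magnitude of the analytic signal, then low-pass filtering.
        tae = sosfilt(lpf, np.abs(hilbert(band)))
        tae = np.maximum(tae, 0.0)  # clip small negative filter overshoot
        # Carrier: white noise restricted to the same band.
        carrier = sosfilt(bpf, noise)
        y += tae * carrier
    return y


if __name__ == "__main__":
    fs = 44100  # assumed sampling rate; must exceed about 19.4 kHz
    t = np.arange(int(0.5 * fs)) / fs
    tone = np.sin(2 * np.pi * 261.6 * t)
    nvs = noise_vocode(tone, fs)
    print(nvs.shape)
```

The same noise realization is filtered into every band here for brevity; drawing independent noise per channel would be an equally plausible reading of the description.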

To clarify whether cues derived from the TAE can play a role in pitch perception, we conducted two psychoacoustic experiments, described in the next two sections.

Figure 1. Signal processing method for noise-vocoded sound.

3. EXPERIMENT I

We conducted psychoacoustic Experiment I to investigate whether the pitch perceptual scales of HCTs mimicking musical instruments and their NVSs could be perceived as musical scales.

3.1 Stimuli, Participants, and Procedure

Fifteen HCTs with fundamental frequencies from C3 (130.8 Hz) to C5 (523.3 Hz) on the musical scale were used as the original stimuli. These stimuli had a spectral tilt of −6 dB/Oct., and their harmonics extended up to 10 kHz. The corresponding NVSs, created using the method in Fig. 1, were also used as stimuli. Figure 2 shows an example spectral shape of an HCT (C4).
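A stimulus of this kind can be sketched as a sum of harmonics whose levels fall by a fixed number of decibels per octave relative to the fundamental. The sketch below rests on several assumptions that the text does not confirm: the 15 notes are taken to be the C-major notes from C3 to C5 in equal temperament (A4 = 440 Hz), and the sampling rate, duration, and sine starting phase are arbitrary illustrative choices.

```python
import numpy as np

# Assumed note set: C-major notes from C3 (MIDI 48) to C5 (MIDI 72).
WHITE_KEY_OFFSETS = (0, 2, 4, 5, 7, 9, 11)
midi_notes = [48 + 12 * octave + offset
              for octave in range(2)
              for offset in WHITE_KEY_OFFSETS] + [72]
# Equal-temperament fundamental frequencies with A4 (MIDI 69) = 440 Hz.
f0s_hz = [440.0 * 2.0 ** ((m - 69) / 12.0) for m in midi_notes]


def harmonic_complex_tone(f0_hz, tilt_db_per_oct=-6.0, fs=44100,
                          dur_s=0.5, f_max_hz=10000.0):
    """Sum of harmonics of f0_hz up to f_max_hz with a spectral tilt.

    The level of harmonic k is tilt_db_per_oct * log2(k) dB relative to
    the fundamental, so the fundamental itself sits at 0 dB.
    """
    n_harm = int(f_max_hz // f0_hz)
    k = np.arange(1, n_harm + 1)
    amps = 10.0 ** (tilt_db_per_oct * np.log2(k) / 20.0)
    t = np.arange(int(round(fs * dur_s))) / fs
    # Rows are harmonics, columns are time samples; sum over harmonics.
    tone = (amps[:, None] * np.sin(2 * np.pi * f0_hz * k[:, None] * t)).sum(axis=0)
    return tone / np.max(np.abs(tone))  # peak-normalize to +/-1


if __name__ == "__main__":
    c4 = harmonic_complex_tone(f0s_hz[7])  # index 7 is C4 in this note set
    print(len(f0s_hz), round(f0s_hz[0], 1), round(f0s_hz[-1], 1), c4.shape)
```

Passing `tilt_db_per_oct=0.0` or `6.0` gives flat or rising harmonic levels with the same function.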

Fifteen native Japanese speakers (4 females and 11 males, aged 22 to 28) participated in this experiment. All participants had normal hearing (hearing levels below 12 dB in the frequency range from 125 to 8000 Hz).

The experiment was conducted with the participants in a soundproof room. The stimuli were presented simultaneously to both ears through a PC (Windows 10, MATLAB), an audio interface (RME Fireface UCX), a headphone amplifier (Audio Technica AT-HA21), and a set of headphones (Sennheiser HDA 200). The A-weighted sound pressure level was calibrated to 70 dB for all participants by using an artificial ear (B&K type 4153) and a sound-level meter (B&K type 2231).


3.2 Results

The average percent correct for each pair was derived from the results of Thurstone’s method of paired comparison. The chance level was 50% in this experiment. We considered pitch to be perceived when the average percent correct exceeded 75%. All participants correctly perceived pitch for all 105 pairs of the original stimuli. In contrast, participants correctly perceived pitch in 94 of the 105 NVS pairs.

The pitch perceptual scales of both HCTs and NVSs were derived from the results of Thurstone’s method of paired comparisons. Figure 3 shows the resulting pitch perceptual scales. We found that the pitch perceptual scales of the NVSs were almost identical to those of the HCTs, except for C4 and D4, as shown in Fig. 3(a).
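To illustrate how a perceptual scale can be obtained from paired-comparison data, the sketch below applies Thurstone's Case V solution: each choice proportion is converted to a standard-normal z-score, and the z-scores are averaged per stimulus. The text does not say which Thurstone case was used, so Case V, the clipping of unanimous proportions, and all names here are assumptions. The counts in the usage example are made up for illustration only.

```python
from statistics import NormalDist


def thurstone_case_v(wins, clip=0.01):
    """Scale values from a matrix of paired-comparison counts.

    wins[i][j] is the number of trials in which stimulus i was judged
    higher in pitch than stimulus j. Proportions are clipped to
    [clip, 1 - clip] so that unanimous judgments keep finite z-scores
    (an assumed, commonly used workaround).
    """
    n = len(wins)
    inv_cdf = NormalDist().inv_cdf
    scale = []
    for i in range(n):
        z_sum = 0.0
        for j in range(n):
            if i == j:
                continue  # a stimulus is not compared with itself
            total = wins[i][j] + wins[j][i]
            p = wins[i][j] / total if total else 0.5
            p = min(max(p, clip), 1.0 - clip)
            z_sum += inv_cdf(p)
        scale.append(z_sum / n)
    lowest = min(scale)
    return [s - lowest for s in scale]  # anchor the lowest stimulus at 0


if __name__ == "__main__":
    # Hypothetical counts for three stimuli, 10 trials per pair.
    example_wins = [
        [0, 2, 1],
        [8, 0, 3],
        [9, 7, 0],
    ]
    print(thurstone_case_v(example_wins))
```

With such a scale, stimuli that listeners reliably order by pitch are spread far apart, whereas stimuli that are confused with each other end up close together.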

Figure 2. Example of spectral shape of stimulus (HCT C4). Spectrum level was normalized to be 0 dB at fundamental frequency.

Figure 3. Pitch perceptual scales derived from results of Thurstone’s method of paired comparisons: (a) original stimuli and (b) NVSs.

4. EXPERIMENT II

Psychoacoustic Experiment II was conducted in the same manner as Experiment I to investigate whether pitch perceptual scales derived from Thurstone’s method of paired comparison are affected by spectral tilts (moving down, flat, and moving up) for both HCT and NVS stimuli.

4.1 Stimuli, Participants, and Procedure

The same 15 HCT stimuli were used with three spectral tilts: −6 dB/Oct. (moving down), 0 dB/Oct. (flat), and 6 dB/Oct. (moving up). Thus, the total number of stimuli was 45. The corresponding NVSs were also created using the method in Fig. 1. Figure 4 shows example spectral shapes of an HCT (C4) with the three spectral tilts.

Fourteen native Japanese speakers (3 females and 11 males, aged 22 to 28) participated in this experiment. All participants had normal hearing (hearing levels below 12 dB in the frequency range from 125 to 8000 Hz).

Figure 4. Example spectral shapes of stimulus (HCT C4) used in Experiment II: (a) moving down, (b) flat, and (c) moving up. Figure format is same as in Fig. 2.


These pairs of stimuli were created separately for the HCTs and the NVSs. The total number of pairs, corresponding to the non-diagonal elements of one half of the comparison matrix, was 945 (= (3 × 15) × (45 − 3) ÷ 2). The total number of paired stimuli was therefore 1890 (945 pairs of the original stimuli and 945 pairs of their NVSs), which were split into 14 sessions. Participants were asked to judge, in a forced-choice task, whether the perceived pitch of the first stimulus was higher than that of the second. The silent interval between the first and second stimuli was 0.5 s. The total execution time was about 240 min, and a 10-min rest was given after every two consecutive sessions.
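The pair counts can be reproduced by enumeration. The sketch below assumes that the factor (45 − 3) means each stimulus was paired with every other stimulus except the three that share its note; this is an interpretation of the formula, not something stated explicitly in the text.

```python
from itertools import combinations

notes = range(15)  # the 15 fundamental frequencies, C3 to C5
tilts = ("moving down", "flat", "moving up")

# Experiment I: unordered pairs of 15 different notes.
pairs_exp1 = list(combinations(notes, 2))

# Experiment II: 45 stimuli (note, tilt); keep pairs with different notes.
stimuli = [(note, tilt) for tilt in tilts for note in notes]
pairs_exp2 = [(a, b) for a, b in combinations(stimuli, 2) if a[0] != b[0]]

# Pairs in which both stimuli share the same spectral tilt.
same_tilt_pairs = [(a, b) for a, b in pairs_exp2 if a[1] == b[1]]

print(len(pairs_exp1))            # pairs per stimulus type in Experiment I
print(len(pairs_exp2))            # pairs per stimulus type in Experiment II
print(len(same_tilt_pairs) // 3)  # within-tilt pairs for each tilt
print(2 * len(pairs_exp2) // 14)  # paired stimuli per session (HCT + NVS)
```

Under this interpretation, each tilt contributes 105 within-tilt pairs, which matches the per-tilt totals reported in the results, and each of the 14 sessions contains 135 paired stimuli.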

Figure 5. Pitch perceptual scales derived from results of Thurstone’s method of paired comparisons for original stimuli: (a) moving down, (b) flat, and (c) moving up.

4.2 Results

The average percent correct for each pair was derived from the results of Thurstone’s method of paired comparison. The chance level was 50% in this experiment. We considered pitch to be perceived when the average percent correct exceeded 75%. All participants correctly perceived pitch in 898 of the 945 pairs of the original stimuli. In contrast, participants correctly perceived pitch in 550 of the 945 NVS pairs. Overall, the average percent correct of the NVSs was above 50%, while that of the HCTs was 90%.

We then analyzed the average percent correct for each pair, derived from the results of Thurstone’s method of paired comparison, for each spectral tilt. All participants correctly perceived pitch in 103 of the 105 HCT pairs for moving-down and flat, and in 102 of the 105 HCT pairs for moving-up. In contrast, participants correctly perceived pitch in 92 of the 105 NVS pairs for moving-down, 72 of the 105 pairs for flat, and 22 of the 105 pairs for moving-up.

The pitch perceptual scales of both HCTs and NVSs were derived from the results of Thurstone’s method of paired comparisons. Figures 5 and 6 show the resulting pitch perceptual scales of the HCTs and NVSs with the three spectral tilts. We found that the pitch perceptual scales of the HCTs with these tilts were almost identical. However, the NVSs were not correctly perceived as musical scales, and the range of their pitch perceptual scales for moving-up was narrower than that for moving-down.

Figure 6. Pitch perceptual scales derived from results of Thurstone’s method of paired comparisons of NVSs: (a) moving down, (b) flat, and (c) moving up.

5. CONSIDERATIONS


6. CONCLUSION

We investigated whether cues derived from the TAE can play a role in pitch perception. We used Thurstone’s method of paired comparison to investigate whether the pitch perceptual scales of HCTs mimicking musical instruments and their NVSs could be perceived as musical scales, and then applied the same method to HCT and NVS stimuli with three spectral tilts. From the results of the first experiment, we found that the NVSs could be perceived as musical scales and that the pitch perceptual scales of the NVSs were almost identical to those of the HCTs. From the results of the second experiment, we found that while the pitch perceptual scales of the NVSs were generally affected by the spectral tilts, those of the HCTs were not. These results suggest that cues derived from the TAE play an important role in pitch perception, even though they are affected by spectral tilts. However, we have not yet revealed which features in the TAE are significant for pitch perception in NVSs. This is left for future work.

ACKNOWLEDGEMENTS

This work was supported by a Grant-in-Aid for Innovative Areas (No. 16H01669, No. 18H05004) from MEXT, Japan. This work was also supported by the Mitani Foundation for Research and Development, the JST-Mirai Program (Grant Number: JPMJMI18D1), and JSPS-NSFC Bilateral Joint Research Projects/Seminars (JSJSBP120197416).

7. REFERENCES

[1] R. V. Shannon, F. G. Zeng, V. Kamath, J. Wygonski, and M. Ekelid, “Speech recognition with primarily temporal cues,” Science, vol. 270, pp. 303–304, 1995.

[2] R. O. Tachibana, Y. Sasaki, and H. Riquimaroux, “Relative contributions of spectral and temporal resolutions to the perception of syllables, words, and sentences in noise-vocoded speech,” Acoustical Science and Technology, vol. 34, pp. 263–270, 2013.

[3] P. C. Loizou, M. Dorman, and Z. Tu, “On the number of channels needed to understand speech,” J. Acoust. Soc. Am., vol. 106, pp. 2097–2103, 1999.

[4] L. Xu and B. E. Pfingst, “Spectral and temporal cues for speech recognition: Implications for auditory prostheses,” Hear. Res., vol. 242, pp. 132–140, 2008.

[5] R. Drullman, J. Festen, and R. Plomp, “Effect of reducing slow temporal modulations on speech reception,” J. Acoust. Soc. Am., vol. 95, no. 5, pp. 2670–2680, 1994.

[6] Z. Zhu, Y. Nishino, R. Miyauchi, and M. Unoki, “Study on linguistic information and speaker individuality contained in temporal envelope of speech,” Acoustical Science and Technology, vol. 37, no. 5, pp. 258–261, 2016.

[7] Z. Zhu, R. Miyauchi, Y. Araki, and M. Unoki, “Contribution of modulation spectral features on the perception of vocal-emotion using noise-vocoded speech,” Acoustical Science and Technology, vol. 39, no. 6, pp. 379–386, Nov. 2018.

[8] Z. Zhu, R. Miyauchi, Y. Araki, and M. Unoki, “Contributions of Temporal Cue on the Perception of Speaker Individuality and Vocal Emotion for Noise-Vocoded Speech,” Acoustical Science and Technology, vol. 39, no. 3, pp. 234–242, 2018.

[9] M. Unoki and Z. Zhu, “Relationship between contributions of temporal amplitude envelope of speech and modulation transfer function in room acoustics to perception of noise-vocoded speech,” Acoustical Science and Technology, vol. 41, no. 1, pp. 233–244, Jan. 2020.

[10] M. Unoki, M. Kawamura, M. Kobayashi, S. Kidani, and M. Akagi, “How the temporal amplitude envelope of speech contributes to urgency perception,” Proceedings of 23rd International Congress on Acoustics, Aachen, Germany, ICA 2019, pp. 1739–1744, Sept. 2019.

[11] A. J. Oxenham, “Pitch perception,” J. Neurosci., vol. 32, no. 39, pp. 13335–13338, 2012.

[12] W. R. Drennan and J. T. Rubinstein, “Music perception in cochlear implant users and its relationship with psychophysical capabilities,” J. Rehabil. Res. & Dev., vol. 45, no. 5, pp. 779–790, 2008.

[13] F.-G. Zeng, Q. Tang, and T. Lu, “Abnormal Pitch Perception Produced by Cochlear Implant Stimulation,” PLOS ONE, vol. 9, issue 2, e88662, 2014.

[14] A. H. Mehta and A. J. Oxenham, “Vocoder Simulations Explain Complex Pitch Perception Limitations Experienced by Cochlear Implant Users,” J. Assoc. Res. Otolaryngol., vol. 18, no. 6, pp. 789–802, 2017.