Vowel identity correlates well with the shape of the transfer function of the
vocal tract, in particular the positions of the first two or three formant peaks. However, in voiced speech the transfer function is sampled at multiples of the fundamental
frequency (F0), and the short-term spectrum contains peaks at those
frequencies, rather than at formants. It is not clear how the
auditory system estimates the original
spectral envelope from the vowel waveform. Cochlear excitation patterns, for example, resolve
harmonics in the low-frequency region and their shape varies strongly with F0. The problem cannot be cured by
smoothing: lag-domain components of the
spectral envelope are aliased and cause F0-dependent distortion. The problem is most acute at high F0s, where the
spectral envelope is severely undersampled. This paper treats vowel identification as a process of
pattern recognition with
missing data. Matching is restricted to available data, and
missing data are ignored using an F0-dependent
weighting function that emphasizes regions near
harmonics. The model is presented in two versions: a frequency-domain version based on short-term spectra or tonotopic excitation patterns, and a time-domain version based on
autocorrelation functions. It accounts for the relative F0 independence observed in vowel identification.
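
As an illustration of the frequency-domain idea, the following is a minimal sketch and not the model itself. It assumes a toy envelope made of Gaussian formant peaks, three hypothetical vowel templates with rough textbook-style formant values, and a narrow Gaussian comb as the F0-dependent weighting function; the names `envelope`, `harmonic_weight`, `weighted_distance` and `identify` are invented for this example. The weighted distance compares the observed spectrum with each template only near harmonics, so the inter-harmonic regions, where the envelope is not observed, do not contribute.

```python
import math

# Hypothetical templates: formant frequencies in Hz (illustrative values only).
TEMPLATES = {
    "a": (730.0, 1090.0, 2440.0),
    "i": (270.0, 2290.0, 3010.0),
    "u": (300.0, 870.0, 2240.0),
}

# Frequency grid in Hz; the F0s used below are multiples of the 10 Hz step,
# so every harmonic falls exactly on a grid point (a simplifying assumption).
GRID = list(range(10, 4001, 10))


def envelope(freq_hz, formants, bandwidth_hz=100.0):
    """Toy spectral envelope: a sum of Gaussian peaks, one per formant."""
    return sum(math.exp(-0.5 * ((freq_hz - f) / bandwidth_hz) ** 2) for f in formants)


def harmonic_weight(freq_hz, f0_hz, width_hz=2.0):
    """Assumed F0-dependent weight: about 1 at a harmonic, about 0 between harmonics."""
    nearest = max(1, round(freq_hz / f0_hz)) * f0_hz
    return math.exp(-0.5 * ((freq_hz - nearest) / width_hz) ** 2)


def voiced_spectrum(formants, f0_hz, grid=GRID):
    """Toy short-term spectrum: the envelope is visible only near harmonics of F0."""
    return [envelope(f, formants) * harmonic_weight(f, f0_hz) for f in grid]


def weighted_distance(observed, formants, f0_hz, grid=GRID):
    """Weighted mean squared difference; data far from harmonics are ignored."""
    num = 0.0
    den = 0.0
    for f, obs in zip(grid, observed):
        w = harmonic_weight(f, f0_hz)
        num += w * (obs - envelope(f, formants)) ** 2
        den += w
    return num / den


def identify(observed, f0_hz, grid=GRID):
    """Return the template label with the smallest weighted distance."""
    return min(
        TEMPLATES,
        key=lambda v: weighted_distance(observed, TEMPLATES[v], f0_hz, grid),
    )


if __name__ == "__main__":
    for f0 in (100.0, 200.0, 300.0):
        for vowel, formants in TEMPLATES.items():
            print(f0, vowel, identify(voiced_spectrum(formants, f0), f0))
```

In this toy setting the F0 is assumed known and the templates are noise-free, so the sketch shows only the mechanics of restricting the match to available data; it says nothing about how well the approach predicts human identification.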