The Hub 3 evaluation focussed on large vocabulary transcription of clean and noisy speech. For the Hub 1 evaluation a number of other features were used, including maximum likelihood linear regression (MLLR) adaptation and the use of quinphone models in a lattice-rescoring pass with a 65k-word 4-gram language model. For a full description and results, see the Rich Transcription workshop presentation. A faster version of the full system that ran in less than 10 times real time was developed. The conversational speech evaluation (Hub 5) required the transcription of telephone conversations.
Groups of clustered segments were then used for MLLR adaptation, and word lattices were generated (using a 4-gram language model interpolated with a class trigram) with triphone HMMs trained on 70 hours of broadcast news data. A CTS system running in less than 10 times real time (10xRT) was developed, which employed 2-way system combination and lattice-based adaptation. The core training used an extended version of 1. The broadcast news evaluation (Hub 4) system was an evolution of the earlier system.
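The "10xRT" figure quoted above is a real-time factor: processing time divided by the duration of the audio being transcribed. The following is a minimal sketch of that arithmetic only. The function name and the timing values are hypothetical and are not taken from the CUED systems.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time as a multiple of the audio duration (xRT)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical figures: one hour of audio decoded in 7.5 hours of compute.
rtf = real_time_factor(processing_seconds=27000.0, audio_seconds=3600.0)

# A "less than 10xRT" system must satisfy this condition.
meets_budget = rtf < 10.0
```

A system that spends several passes on each utterance (lattice generation, adaptation, rescoring) has to fit the sum of all passes inside this budget, which is why the faster systems trade some stages away.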
Both triphone and quinphone models were trained on hours of data and used in a multi-stage recognition process: lattices were first generated with MLLR-adapted triphones and then rescored with adapted quinphones.
The full system included cluster-based variance normalisation, vocal-tract length normalisation (VTLN) and full-variance transforms; none of these are included in released versions of HTK up to 3. Models for noisy environments were trained using single-pass retraining present in V2. The unlimited-compute conversational telephone speech system (CTS, also known as Switchboard or Hub5) was similar in structure to the earlier system, but utilised improved acoustic and language models and performed automatic segmentation of the audio data.
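To illustrate the idea behind cluster-based variance normalisation mentioned above, here is a minimal sketch that scales each feature dimension to unit variance using statistics pooled over all segments in the same cluster. It is not the CUED implementation: the function name, the data layout, and the choice to also subtract the cluster mean are assumptions made for the example.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def normalise_by_cluster(frames, cluster_of):
    """Normalise feature vectors using per-cluster mean and standard deviation.

    frames: list of (segment_id, feature_vector) pairs.
    cluster_of: mapping from segment_id to a cluster label.
    """
    # Pool the feature vectors of all segments that share a cluster.
    grouped = defaultdict(list)
    for seg, vec in frames:
        grouped[cluster_of[seg]].append(vec)

    # Per-cluster, per-dimension statistics (assumption: mean is removed too).
    stats = {}
    for cluster, vecs in grouped.items():
        dims = list(zip(*vecs))
        means = [fmean(d) for d in dims]
        # Guard against zero variance by falling back to a scale of 1.0.
        stds = [pstdev(d) or 1.0 for d in dims]
        stats[cluster] = (means, stds)

    normalised = []
    for seg, vec in frames:
        means, stds = stats[cluster_of[seg]]
        normalised.append((seg, [(x - m) / s for x, m, s in zip(vec, means, stds)]))
    return normalised


# Hypothetical two-dimensional features for four segments in two clusters.
frames = [("a", [1.0, 10.0]), ("b", [3.0, 10.0]), ("c", [5.0, 2.0]), ("d", [7.0, 4.0])]
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
result = normalise_by_cluster(frames, cluster_of)
```

Pooling over a cluster rather than a single segment gives more reliable variance estimates when individual segments are short.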
For the broadcast news evaluation (Hub 4), speech was first segmented using Gaussian mixture models and a phone recogniser. The system used MLLR before lattice generation and then rescored the lattices with adapted quinphone models. The NIST March evaluation data included data recorded over conventional telephone lines as well as data from calls over cellular channels.
The front-end used was PLP with a mel-spectra based filterbank.
Each of the systems described below represented the state of the art when it was produced: either it had the lowest error rate in the evaluation, or its error rate was not statistically significantly different from that of the lowest error rate system.
For a full description and results, see the Rich Transcription workshop presentation. This section gives a brief overview of the features of these systems and how they relate to the features present in released versions of HTK.
The major tool currently lacking from the distributed HTK releases to reproduce these systems is a capable large vocabulary decoder supporting trigrams and 4-grams and cross-word triphones and quinphones. In future we hope to make many of these available in released versions of HTK.
Separate LMs were built for different sources and interpolated to form a single model. However, from the sections below it can be seen that there are many other features that have been incorporated into the CUED HTK systems.
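As an illustration of how source-specific language models can be combined, the sketch below applies linear interpolation, in which the combined probability of a word is a weighted sum of the probabilities assigned by each source model. The source names, probabilities and weights are hypothetical, and the text does not state that the CUED systems used exactly this weighting scheme.

```python
def interpolate(prob_by_source, weights):
    """Linearly interpolate P(word | history) from several source LMs.

    prob_by_source: mapping from source name to that model's probability.
    weights: mapping from source name to its interpolation weight.
    """
    if set(prob_by_source) != set(weights):
        raise ValueError("every source needs exactly one weight")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(weights[name] * p for name, p in prob_by_source.items())


# Hypothetical probabilities of the same word given the same history.
probs = {"news_text": 0.020, "transcripts": 0.005}
# Hypothetical weights, e.g. tuned on held-out data.
weights = {"news_text": 0.7, "transcripts": 0.3}

p_mix = interpolate(probs, weights)
```

Because the weights sum to one, the interpolated values still form a valid probability distribution over the vocabulary.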
The broadcast news evaluation Hub 4 transcribed pre-segmented and labelled portions of broadcast news audio.
The system developed for the Switchboard part of the April Rich Transcription evaluation used acoustic models trained using Minimum Phone Error training.
Again combined triphone and quinphone rescoring passes were used.