Out‐Of‐Vocabulary (OOV) word detection

For demonstration purposes and internal testing of OOV word detection, BUT collected a set of recordings containing OOV words and non-speech events. The data collection includes:

  • 16 recordings from 7 speakers (2 female, 5 male)
  • Ogg-compressed 8 kHz audio, comparable in quality to conversational telephone speech (CTS) data
  • ASR transcripts obtained with the BUT CTS recognizer
  • strong and weak phone posteriors
  • scores from the NN-based OOV detection
  • ground truth OOV labels of words in the recognition output
  • the pronunciation dictionary used in recognition
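
As an illustration of how the detection scores and the ground-truth OOV labels listed above can be combined, the following minimal sketch computes the miss rate and false-alarm rate at a fixed threshold. It assumes one score per recognized word and that a higher score means "more likely OOV"; neither the file format nor the score orientation is specified here, so the function name `detection_rates` and the toy values are purely hypothetical and must be adapted to the actual files.

```python
from typing import List, Tuple


def detection_rates(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (miss_rate, false_alarm_rate) at the given threshold.

    Assumption: a word is flagged as OOV when its score >= threshold.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    hits = misses = false_alarms = correct_rejects = 0
    for score, is_oov in zip(scores, labels):
        flagged = score >= threshold
        if is_oov and flagged:
            hits += 1
        elif is_oov:
            misses += 1
        elif flagged:
            false_alarms += 1
        else:
            correct_rejects += 1

    n_oov = hits + misses
    n_in_vocab = false_alarms + correct_rejects
    # Guard against empty classes to avoid division by zero.
    miss_rate = misses / n_oov if n_oov else 0.0
    fa_rate = false_alarms / n_in_vocab if n_in_vocab else 0.0
    return miss_rate, fa_rate


if __name__ == "__main__":
    # Made-up toy values, NOT taken from the data set.
    toy_scores = [0.9, 0.2, 0.7, 0.1, 0.4]
    toy_labels = [True, False, False, False, True]
    print(detection_rates(toy_scores, toy_labels, threshold=0.5))
```

Sweeping the threshold over a range of values and collecting the resulting pairs gives a simple detection trade-off curve for one recording.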

The data sets are prepared to be used by the:

and can be downloaded as separate files from the following links:

female, Irish English (1)
male, US English (2)
male, US English (3)
male, foreign accented English (4)
male, US English (5)
male, US English (6)
male, US English (7)
male, German accented English, non-speech sounds (8)
male, German accented English, non-speech sounds (9)
male, German accented English, non-speech sounds (10)
male, German accented English, non-speech sounds (11)
male, German accented English, non-speech sounds (12)
male, German accented English, non-speech sounds (13)
male, Czech accented English, non-speech sounds (14)
male, German accented English, non-speech sounds (15)
male, German accented English, non-speech sounds (16)