Virtually all the digital linguistic resources used in speech and language technology are static in the sense that
- One-time: they are generated once and never updated.
- Read-only: they provide no mechanisms for corrections, feature requests, etc.
- Closed-source: code and raw data used to generate the data are not released.
However, there are some benefits to designing linguistic resources dynamically, allowing them to be repeatedly regenerated and iteratively improved with the help of the research community. I’ll illustrate this with WikiPron (Lee et al. 2020), our database-cum-library for multilingual pronunciation data.
The data
Pronunctionary dictionaries are an important resource for speech technologies like automatic speech recognition and text-to-speech synthesis. Several teams have considered the possibility of mining pronunciation data from the internet, particularly from the free online dictionary Wiktionary, which by now contains millions of crowd-sourced pronunciations transcribed using the International Phonetic Alphabet. However, none of these prior efforts released any code, nor were their scrapes run repeatedly, so at best they represent of a single (2016, or 2011) slice of the data.
The tool
WikiPron is, first and foremost, a Python command-line tool for scraping pronunciation data from Wiktionary. Stable versions can be installed from PyPI using tools like pip. Once the tool is installed, users specify a language, optionally, a dialect, and various optional flags, and pronunciation data is printed to STDIN as a two-column TSV file. Since this requires an internet connection and may take a while, the system is even able to retry where it left off in case of connection hiccups. The code is carefully documented, tested, type-checked, reflowed, and linted using the CircleCI continuous integration system.
The infrastructure
We also release, at least annually, a multilingual pronunciation dictionary created using WikiPron. This increases replicability, permits users to see the format and scale of the data WikiPron makes available, and finally allows casual users to bypass the command-line tool altogether. To do this, we provide the data/ directory, which contains data and code which automates “the big scrape”, the process by which we regenerate the multilingual pronunciation dictionary. It includes
- the data for 335 (at time of writing) languages, dialects, scripts, etc.,
- code for discovering languages supported by Wiktionary,
- code for (re)scraping all languages,
- code for (re)generating data summaries (both computer-readable TSV files and human-readable READMEs rendered by GitHub), and
- integration tests that confirm the data summaries match the checked-in data,
as well as code and data used for various quality assurance processes.
Dynamic language resources
In what sense is WikiPron a dynamic language resource?
- It is many-time: it can be run as many times as one wants. Even “the big scrape” static data sets are updated more-than-annually.
- It is read-write: one can improve WikiPron data by correcting Wiktionary, and we provide instructions for contributors wishing to send pull requests to the tool.
- It is open-source: all code is licensed under the Apache 2.0 license; the data bears a Creative Commons Attribution-ShareAlike 3.0 Unported License inherited from Wiktionary.
Acknowledgements
Most of the “dynamic” features in WikiPron were implemented by CUNY Graduate Center PhD student Lucas Ashby and my colleague Jackson Lee; I have at best served as an advisor and reviewer.
References
Lee, J. L, Ashby, L. F.E., Garza, M. E., Lee-Sikka, Y., Miller, S., Wong, A.,
McCarthy, A. D., and Gorman, K. 2020. Massively multilingual pronunciation
mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228.