[The following is geared towards our incoming students. I’m just using the blog as a easy publishing mechanism.]
The following are some major projects ongoing in the GC Computational Linguistics Lab.
Many phonologists believe that phonotactic knowledge is independent of knowledge of phonological alternations. In my dissertation I evaluated computational models of autonomous phonotactic knowledge as predictions of speakers’ judgments of wordlikeness, and I found that these fail to consistently outperform simple baselines. In part, these models fail because they predict gradience that is poorly correlated with human judgments. However, these conclusions were tentative because of the poor quality of the available data, collected little attention paid to experimental design or choice of stimuli. With funding from the National Science Foundation, and in collaboration with professors Karthik Durvasula at Michigan State University and Jimin Kahng at the University of Mississippi, we are building a open-source “megastudy” of human wordlikeness judgments and performing computational modeling of the resulting data.
Speech recognizers and synthesizers are, essentially, engines for synthesizing or recognizing sequences of phonemes. Therefore, it is necessary to transform text into phoneme sequences. Such transformations are challenging insofar as they require linguistic expertise—and language-specific knowledge—and are not always amenable to generic machine learning techniques. We are engaged in several projects involving these mappings. The lab maintains WikiPron (Lee et al. 2020), software and databases for building multilingual pronunciation dictionaries, and has organized two SIGMORPHON shared tasks on multilingual grapheme-to-phoneme conversion (Gorman et al. 2020, Ashby et al. 2021). And with funding from the CUNY Professional Staff Congress, PhD student Amal Aissaoui is engaged building diacritization engines for Arabic and Latin, engines which supply missing pronunciation information for these scripts.
Morphological generation systems use machine learning to predict the inflected forms of words. In 2019 I led a team of researchers in an error analysis of the top two systems in the CoNLL-SIGMORPHON 2017 shared task on morphological generation (Gorman et al. 2019). We found that the top models struggled with inflectional patterns which are sensitive to lexeme-inherent morphosyntactic features like gender, animacy, and aspect, which are not provided in the task data. For instance, the top models often inflect Russian perfective verbs as if they were imperfective, or Polish inanimate nouns as if they were animate. Finally, we find that models struggle with abstract morphophonological patterns which cannot be inferred from the citation form alone. For instance, the top models struggle to predict whether or not a Spanish verb will undergo diphthongization under stress (e.g., negar–niego ‘to deny-I deny’ vs. pegar–pego ‘to stick-I stick’). In collaboration with professor Katharina Kann and PhD student Adam Weimerslage at the University of Colorado, Boulder, we are developing an open-source “challenge set” for morphological generation, a set that targets complex inflectional patterns in a diverse sample of 10-20 languages. This challenge set will act as benchmarks for neural network models of inflection, and will allow us to further study inherent features and abstract morphophonological patterns. In designing these challenge sets we have targeted a wide variety of morphological processes, including reduplication and templatic formation in addition to affixation and stem change. MA students Kristysha Chan, Mariana Graterol, and M. Elizabeth Garza, and PhD student Selin Alkan have all contributed to the development of this challenge set thus far.
Inflectional defectivity is the poorly-understood dark twin of productivity. With funding from the CUNY Professional Staff Congress, Emily Charde (MA 2020) is engaged in a computational study of defectivity in Greek nouns and Russian verbs.