Projects

Tłı̨chǫ Online Dictionary

As its name suggests, the Tłı̨chǫ Online Dictionary is an intelligent online dictionary of Tłı̨chǫ, containing over 10 000 entries from dozens of sources, spanning from the late 1960s to 2025. It is searchable in English and Tłı̨chǫ, and is able to recognise Tłı̨chǫ search queries even if their spelling differs from that of the standard orthography. The dictionary also features the ability to search for entries by semantic domains, using a semantic classification system based on the SIL Rapid Word Collection Methodology. Inflectional paradigms are automatically generated for verbs, nouns, and adpositions. The dictionary also contains over 3000 audio recordings, over 4000 example sentences, and supports the ability to add cultural notes to individual entries to clarify usage. The site also allows users to submit their own entries for consideration, as well as to suggest alterations to existing entries.

I created the Tłı̨chǫ Online Dictionary largely as an upgrade to an existing Tłı̨chǫ dictionary site created by the University of Victoria in 2004. Through collaboration with the Tłı̨chǫ Government’s Department of Culture and Lands Protection, I have led the creation of the online dictionary both from a linguistic perspective (that is, eliciting new entries, creating paradigms, and consulting with L1 speakers, among other things) and from the perspective of web development (that is, creating and managing the site and user interface). A full list of collaborators may be found here.

 

The Tłı̨chǫ Online Dictionary itself may be found here.

itwêwina

itwêwina is an intelligent online dictionary of Plains Cree developed by the Alberta Language Technology Lab (ALTLab), searchable both in Plains Cree and English. It combines the contents of three separate, pre-existing Plains Cree dictionaries with a morphological model which is able to automatically generate conjugated forms for each noun and verb entry, as well as recognise any conjugated forms as search queries. In addition to this, itwêwina’s search functionality is able to return semantically related terms for any search query, even if the search query is not an entry in the dictionary, and can even translate certain short English sentences into conjugated Cree words. Furthermore, audio recordings of pronunciations from multiple native speakers are available for thousands of itwêwina’s entries, and it is possible to enable a Cree text-to-speech model to generate theoretical pronunciations for any entries which lack native speaker recordings. Finally, itwêwina’s search functionality is able to recognise systemic variations in the spelling of Cree words, as well as Cree syllabics, allowing users to effectively search for words even if they are unsure of their exact spelling.

 

Although initially created for Plains Cree, the basic framework for itwêwina has been designed to be easily adaptable for use in other languages, and versions of itwêwina have been created for Woods Cree (nīhithawīwin), Sarcee (Tsúùtʼínà), Northern Haida (X̱aat Kíl), and Arapaho (Hinóno’éitíit).

 

My primary contributions to itwêwina include manually annotating large quantities of Plains Cree audio to extract vocabulary recordings and link them to their respective entries, performing preliminary research and heuristic tests for the semantic search system, adding and standardising the contents of two of itwêwina’s three constituent dictionaries, and collecting and processing natural language data for the search spellchecker.

 

itwêwina may be found here.

Maskwacîs Speech Database

The Maskwacîs Speech Database is a freely accessible online database of spoken Plains Cree words and sentences, developed by the Alberta Language Technology Lab and chiefly coded by Jolene Poulin. It contains 20 300 entries and over 150 000 individual recordings, the vast majority of which were elicited from native Cree speakers in Maskwacîs, Alberta over the course of a four-year period from 2014 to 2018. In addition to these entries, the Maskwacîs Speech Database also has a built-in audio recording system, allowing registered users to record their own pronunciations of words and sentences, which may be added to the main database once they are reviewed and approved by a linguist. Although the Speech Database was initially developed for the Maskwacîs dialect of Plains Cree, the Speech Database format has since been expanded to be used with multiple other languages, including other dialects of Plains Cree, Sarcee (Tsúùtʼínà), and Stoney-Nakoda (Îethka Îabi).

 

My primary contributions to the Maskwacîs Speech Database have been annotating the source audio files to extract recordings of relevant words and phrases, standardising the Cree spelling and English definitions for each entry, providing morphological analyses for each entry, and reviewing each entry in the database with native speakers of Cree to ensure that the pronunciations and definitions are correct.

 

The Maskwacîs Speech Database may be found here.

Plains Cree Supplemental Corpus

The Plains Cree Supplemental Corpus is a digital corpus of miscellaneous written Plains Cree texts, chiefly Plains Cree translations of governmental and corporate documents, as well as a large number of stories, songs, and interviews from native speakers. Each of the 457 individual texts in the corpus has been segmented into sentences, with each Cree sentence manually paired with an English translation. The corpus has a total size of ~349 000 tokens, with roughly half of these being Cree, and the other half being English translations. In terms of raw token count, this represents the largest searchable corpus of Plains Cree compiled to date. The materials used to create this corpus were taken from various disparate sources across the internet (including the Cree Literacy Network), with the compilation of this corpus representing the first major centralisation of this ephemera into a single, searchable repository. The metadata for each individual source, which includes speaker age, region of origin, genre, and level of adherence to the Plains Cree Standard Roman Orthography, is catalogued in a spreadsheet alongside the main corpus materials.

 

I have headed this project since its inception, and have been responsible for locating and cataloguing each of the texts in the corpus, pairing (or parallelising) each Cree sentence with its English equivalent, and for maintaining an organised central repository of the texts.

 

I am currently in the process of making an accessible online repository; however, if you would like to access the corpus in the meantime, feel free to email me.

Translation of the Dictionnaire de la langue des Cris (1874)

The Dictionnaire de la langue des Cris is a French dictionary of Plains Cree written by Pr. Albert Lacombe in 1874, constituting the earliest, and to this day one of the largest, written dictionaries of the language. However, use of the dictionary among modern Plains Cree populations is relatively minimal, owing largely to the fact that it was written in French, a language which is currently spoken by only a small portion of the population of the Western Prairies. As such, I translated all 11 411 entries in the dictionary into English, transcribing each into a searchable digital text format. This translation of the dictionary was made available online as a companion dictionary to itwêwina, although the contents of the two are eventually planned to be merged. In addition to this translation, I am also in the process of standardising the Cree spelling and part-of-speech classification system used in the original dictionary to correspond to modern lexicographic standards; as of writing, I have standardised ~61% of the dictionary’s contents.

 

The searchable, non-standardised English translation of the Dictionnaire de la langue des Cris may be found here. If you would like the (unfinished) standardised version, feel free to email me.

ʔayʔaǰuθəm Morphological Parser

The Comox (ʔayʔaǰuθəm) Morphological Parser is a script which I have written to automatically provide inflectional and derivational analyses for morphologically complex words in ʔayʔaǰuθəm. The linguistic basis for the script largely stems from the descriptive works of Dr. Honoré Watanabe (particularly his 2003 morphological description), as well as my personal consultations with Dr. Marianne Huijsmans. Although still firmly in beta development, the current model is able to provide morphological analyses for ~52% of tokens in tests on previously unseen texts.

 

The ʔayʔaǰuθəm parser is not yet publicly available; however, if you would like more information, feel free to email me.