Presenter: Nathan Wolf
Faculty Sponsor: Katrin Erk
School: UMass Amherst
Research Area: Computer Science
Session: Poster Session 6, 4:15 PM - 5:00 PM, Auditorium, A65
ABSTRACT
Tokenization, the segmentation of text into a discrete sequence of “tokens”, is an essential yet often-overlooked step in all NLP tasks. Conventional tokenization schemes such as Byte-Pair Encoding (BPE) segment words without regard to their morphology. This loss of morphological information may hinder neural language models from learning effective representations of a token’s meaning. However, previous results do not conclusively show that incorporating morphological information into tokenization benefits language model performance. These mixed results may stem in part from not accounting for the morphological typology of the languages on which the studies were conducted. To examine this, we compare how morphologically-aware tokenization affects language model performance in fusional languages such as English, which have relatively few morphemes per word, and agglutinative languages such as Finnish, which have many morphemes per word. We hypothesize that conventional tokenization discards more morphological information in agglutinative languages than in fusional ones, and thus that morphologically-aware tokenization will benefit language model performance more in agglutinative languages than in fusional languages. To investigate this hypothesis, we pretrain transformer language models on a multilingual web-crawl dataset covering both fusional languages (English, Czech) and agglutinative languages (Finnish, Turkish). For each language, we pretrain one model with a conventional tokenizer and one with a morphologically-aware tokenizer. We then evaluate all models on text generation as well as a linguistically-motivated task.
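As an illustrative sketch of the contrast the abstract describes, the snippet below compares a morpheme-based segmentation of an agglutinative Finnish word with a BPE-style segmentation. Both segmentations are hand-constructed examples for illustration, not output of the tokenizers used in this work, and the preservation metric is a hypothetical simplification.

```python
# Illustrative example (not from the poster): how a frequency-driven
# subword segmentation can cut across morpheme boundaries in an
# agglutinative language.

# Finnish "taloissammekin" ~ "also in our houses":
# talo (house) + i (plural) + ssa (inessive "in") + mme (our) + kin (also)
morphemes = ["talo", "i", "ssa", "mme", "kin"]

# A hypothetical BPE-style segmentation, driven by corpus frequency
# rather than morphology, might instead produce pieces like:
bpe_like = ["tal", "oissa", "mme", "kin"]

def morphemes_preserved(segmentation, morphemes):
    """Count morphemes that survive as whole tokens in a segmentation.

    A crude, illustrative proxy for how much morphological information
    a tokenizer keeps visible to the language model.
    """
    return sum(1 for m in morphemes if m in segmentation)

print(morphemes_preserved(morphemes, morphemes))  # 5: all boundaries kept
print(morphemes_preserved(bpe_like, morphemes))   # 2: only "mme", "kin" kept
```

Under this toy metric, the morpheme-based segmentation preserves every morpheme, while the BPE-style one obscures the stem and case/number markers, which is the kind of information loss the hypothesis predicts matters more for agglutinative languages.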