Impact of Morphologically-Aware Tokenization on Language Model Performance Across Morphological Types

Presenter: Nathan Wolf

Faculty Sponsor: Katrin Erk

School: UMass Amherst

Research Area: Computer Science

Session: Poster Session 6, 4:15 PM - 5:00 PM, Auditorium, A65

ABSTRACT

Tokenization, the segmentation of text into a discrete sequence of “tokens,” is an essential yet often-overlooked step in nearly all NLP tasks. Conventional tokenization schemes such as Byte-Pair Encoding (BPE) segment words without regard to their morphology. This lack of morphological information in tokens may hinder neural language models from learning effective representations of a token’s meaning. Despite this, previous results do not conclusively show that incorporating morphological information into tokenization benefits language model performance. These mixed results may stem in part from not considering the morphological type of the languages on which the studies were conducted. To examine this, we compare how morphologically-aware tokenization affects language model performance in fusional languages such as English, which have relatively few morphemes per word, and agglutinative languages such as Finnish, which have many morphemes per word. We hypothesize that conventional tokenization discards more morphological information in agglutinative languages than in fusional ones, and thus that morphologically-aware tokenization will improve language model performance more in agglutinative languages than in fusional ones. To investigate this hypothesis, we pretrain transformer language models on a multilingual web-crawl dataset covering both fusional languages (English, Czech) and agglutinative languages (Finnish, Turkish). In each language, we pretrain one model with a conventional tokenizer and one with a morphologically-aware tokenizer. We then evaluate all models on text generation as well as on a linguistically-motivated task.
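To make the contrast concrete, the following is a minimal sketch of BPE training and segmentation on a toy, hypothetical Finnish-flavored corpus (the words, counts, and number of merges are illustrative assumptions, not data from the study). BPE greedily merges the most frequent adjacent symbol pairs, so whether its pieces line up with true morpheme boundaries depends entirely on corpus frequencies rather than on morphology:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a tiny corpus of {word: count} pairs."""
    # Represent each word as a tuple of symbols (initially characters).
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, count in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge throughout the vocabulary.
        new_vocab = {}
        for syms, count in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i < len(syms) - 1 and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Tokenize a word by applying learned merges in training order."""
    syms = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(syms):
            if i < len(syms) - 1 and syms[i] == a and syms[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        syms = out
    return syms

# Toy corpus with hypothetical counts, for illustration only.
corpus = {"talo": 10, "talossa": 5, "taloissa": 3, "autossa": 4}
merges = train_bpe(corpus, num_merges=6)
print(segment("taloissa", merges))  # → ['talo', 'i', 'ssa']
print(segment("autossa", merges))   # → ['a', 'u', 't', 'o', 'ssa']
```

Here the frequency-driven pieces happen to match the morphemes of "taloissa" (talo + i + ssa: house + plural + inessive), but "autossa" (auto + ssa) is over-fragmented because "auto" never occurs on its own in the corpus; a morphologically-aware tokenizer would instead segment at morpheme boundaries in both cases.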