
TrustLLM with an LLM/NLP application

Large Language Models to benefit low-resource Inuktitut

Large Language Models (LLMs) are transformer-based neural architectures that process and generate text by learning patterns from vast amounts of textual training data. Drawing on this acquired understanding of language, these models can be deployed for a wide range of downstream Natural Language Processing (NLP) tasks such as translation, summarization, and text classification. Powerful as they are, the sheer amount of data required to train them is a major obstacle when developing effective NLP solutions for languages with few resources, often referred to as low-resource (LR) languages. One such language, Inuktitut, spoken by roughly 40,000 people in the northernmost regions of Canada, is an especially challenging case: not only are textual resources scarce, but the language itself tends to form very long, complex words packed with linguistic information.


Figure 1: An example of an Inuktitut word written in Inuktitut syllabics, romanized as “Parimunngauniralauqsimanngittunga”, translating to a full sentence in English.

This use case aims to investigate methods for leveraging pre-trained LLMs to improve downstream NLP performance for Inuktitut. The initial focus is on preprocessing methods that break up the language's long words to make them more digestible for downstream tasks. This includes fine-tuning a pre-trained LLM for linguistically informed word segmentation of Inuktitut, as an alternative to language-independent approaches such as Byte-Pair Encoding (BPE) and SentencePiece. This first step is motivated by previous studies demonstrating the potential of such segmentation as a preprocessing method across various NLP tasks in other languages. The resulting model will serve as a preprocessing tool for further research into downstream tasks such as Machine Translation (MT) from Inuktitut to English. Through this research, the hope is to make NLP solutions more accessible for Inuktitut speakers and to help preserve the language.
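To make the language-independent baseline concrete, below is a minimal from-scratch sketch of how BPE learns merge rules from frequency statistics and then segments unseen words. The toy corpus, merge count, and example words are placeholders for illustration only; the project's actual pipeline would train a tokenizer such as SentencePiece on Inuktitut text, and the linguistically informed alternative would replace these frequency-based splits with morpheme boundaries.

```python
# Minimal sketch of Byte-Pair Encoding (BPE): learn merge rules from a toy
# corpus, then segment a new word. Corpus and examples are illustrative
# placeholders, not the project's real data or tokenizer.
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merge rules, starting from character-level symbols."""
    vocab = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair gets merged
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Apply the learned merges, in order, to split a new word into subwords."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


corpus = ["unhappy", "unkind", "unfit", "undo", "happy"]
merges = learn_bpe(corpus, 3)
print(segment("undoing", merges))  # → ['un', 'd', 'o', 'i', 'n', 'g']
```

Note that the splits fall where character pairs are frequent, not at linguistic boundaries; for a polysynthetic language like Inuktitut, a single word can encode a whole sentence, which is exactly why morphologically informed segmentation is being explored as an alternative.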

Forschungszentrum Jülich

Wilhelm-Johnen-Straße, 52428 Jülich (Germany)
e-mail: info [at] fz-juelich.de
www.fz-juelich.de


University of Iceland

Sæmundargata 2, 102 Reykjavík (Iceland)
e-mail: hi [at] hi.is
www.hi.is
