Efficient LLM Adaptation to Low-Resource Languages via Cross-Lingual Semantic Anchoring

Abstract

Most large language models (LLMs) remain heavily English-centric, which leads to high tokenizer fertility: a single word in a low-resource language decomposes into multiple sub-tokens. This inefficiency inflates computational cost and weakens adaptation performance. We introduce bilingual exchange for optimized dense (BXOD) embeddings, a multi-stage framework for efficient cross-lingual adaptation based on cross-lingual semantic anchoring. BXOD uses a cold-start bilingual embedding initialization, guided by the multilingual LaBSE encoder, that anchors the new vocabulary of a low-resource language within the semantic space of a high-resource model. The pipeline expands the tokenizer to reduce fertility and initializes the new embeddings through semantic alignment rather than sub-token composition. We evaluated BXOD on Llama 3.2 (1B, 3B) models adapted to Uzbek and Malay, and on Gemma 2 (2B) models adapted to Uzbek, Malay, Korean, and Spanish. In low-resource settings it achieved significant gains, for example raising Malay news classification accuracy from 29.95% to 50.35% (a 68.1% relative improvement, p < 0.001) and English-to-Uzbek translation BLEU from 3.22 to 4.74 (a 47.2% relative improvement, p < 0.001), while preserving English performance on the Massive Multitask Language Understanding (MMLU) benchmark and accelerating convergence. The results highlight BXOD’s strength for typologically distant, low-resource languages and show reduced benefits for lexically similar, high-resource languages such as Spanish. BXOD thus offers an effective and computationally efficient approach to extending LLMs across linguistically diverse settings. © 2013 IEEE.
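
The two ideas named in the abstract, tokenizer fertility and embedding initialization by semantic alignment, can be illustrated with a minimal sketch. This is not the authors' BXOD pipeline: the abstract does not give the algorithm, so the top-k softmax weighting, the function names `fertility` and `anchor_init`, and the parameters `k` and `temperature` are all assumptions made for illustration. In a real setting the encoder vectors would come from a multilingual sentence encoder such as LaBSE; here they are plain NumPy arrays.

```python
import numpy as np


def fertility(words, tokenize):
    """Average number of sub-tokens per word under a given tokenizer."""
    counts = [len(tokenize(w)) for w in words]
    return sum(counts) / len(counts)


def anchor_init(new_enc, old_enc, old_emb, k=2, temperature=0.1):
    """Hypothetical semantic-anchor initialization (an assumption, not BXOD itself).

    new_enc: (n_new, e) encoder vectors for the newly added tokens
    old_enc: (n_old, e) encoder vectors for tokens already in the vocabulary
    old_emb: (n_old, d) the model's existing input embeddings
    Returns (n_new, d): each new embedding is a softmax-weighted mix of the
    k existing embeddings whose encoder vectors are most similar.
    """
    # Cosine similarity between every new token and every existing token.
    a = new_enc / np.linalg.norm(new_enc, axis=1, keepdims=True)
    b = old_enc / np.linalg.norm(old_enc, axis=1, keepdims=True)
    sim = a @ b.T                                        # (n_new, n_old)

    # Indices and similarities of the k nearest existing tokens (the anchors).
    top = np.argsort(-sim, axis=1)[:, :k]                # (n_new, k)
    top_sim = np.take_along_axis(sim, top, axis=1)       # (n_new, k)

    # Numerically stable softmax over the k anchors.
    w = np.exp((top_sim - top_sim.max(axis=1, keepdims=True)) / temperature)
    w /= w.sum(axis=1, keepdims=True)

    # Weighted average of the anchors' model embeddings.
    return np.einsum("nk,nkd->nd", w, old_emb[top])


# Toy usage: a tokenizer that splits words into 2-character pieces.
toy_tokenize = lambda w: [w[i:i + 2] for i in range(0, len(w), 2)]
toy_fertility = fertility(["kitob", "uy"], toy_tokenize)

# Three existing tokens with orthogonal encoder vectors and 4-dim embeddings.
old_enc = np.eye(3)
old_emb = np.arange(12, dtype=float).reshape(3, 4)
new_emb = anchor_init(np.array([[1.0, 1.0, 0.0]]), old_enc, old_emb, k=2)
```

Because every new embedding is a convex combination of existing ones, it starts inside the region of embedding space the model already uses, which is the intuition behind anchoring rather than random or sub-token-averaged initialization.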

Keywords

Cross-lingual transfer; embedding initialization; language adaptation; large language models (LLMs); low-resource languages; tokenization
Title
Efficient LLM Adaptation to Low-Resource Languages via Cross-Lingual Semantic Anchoring
Authors
Shukhratov, Bekhzod; Baydadaev, Shokhrukh; Kwon, Jang Woo
DOI
10.1109/ACCESS.2026.3679427
Publication date
2026-03
Type
Article
Journal
IEEE Access
Volume
14
Pages
55331-55344