WWW2026

Unlocking the Multilingual Long-Tail Web: A Fused Macro-Micro Framework for Scalable Content Analysis

Jiarui Zhang, Yifan Deng, Qihao Wang

Abstract

Large Language Models (LLMs) enable Web-scale multilingual content analysis, but they face critical challenges in scaling to long-tail languages and in ensuring robustness. Current research is split between two isolated trajectories: a Macro-Paradigm (system-level engineering) and a Micro-Paradigm (internal model intervention). We argue that a true Web-scale solution requires their systematic fusion, balancing large-scale data processing with fine-grained model control. We introduce the Control-Tower Framework (CTF), a novel methodology for systematically enhancing powerful pre-trained base models. Inspired by control-theoretic ideas, CTF transforms a base model into a controllable analysis engine via three synergistic stages: (1) micro-enhanced pre-training, which injects linguistic priors (e.g., syntax) to build a robust semantic foundation; (2) control-inspired fine-tuning, in which a heuristic dynamic feedback loop driven by micro-level error signals (e.g., knowledge editing loss) actively adjusts the macro-scale learning curriculum; and (3) macro-optimized inference, which uses Minimum Bayes Risk (MBR) decoding to enhance robustness on noisy user-generated content (UGC). Extensive experiments show that CTF surpasses the leading open-weights model, Tower+ 9B FT, by +2.18 XCOMET-XXL on low-resource languages (WMT24++). Crucially, CTF unlocks large-scale cross-lingual Web mining by converting unstructured Web text into machine-analyzable assets. We demonstrate this with substantial gains on both document-level (MARC) and aspect-based (SemEval-2016) sentiment analysis tasks. Our work offers a practical pathway toward building more reliable, scalable, and controllable global information ecosystems.
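To make the third stage concrete, the following is a minimal sketch of sampling-based MBR decoding: each candidate output is scored by its average utility against the other candidates, and the candidate with the highest expected utility is returned. The function names `token_f1` and `mbr_decode` are hypothetical, and the token-overlap F1 utility is only an illustrative stand-in; the abstract does not specify which utility metric or candidate-generation procedure CTF uses.

```python
from collections import Counter
from typing import Callable, Sequence


def token_f1(hyp: str, ref: str) -> float:
    """Whitespace-token F1 overlap. Illustrative stand-in for a learned utility metric."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = token_f1,
) -> str:
    """Return the candidate with the highest average utility against all others.

    The other candidates act as pseudo-references, which approximates the
    expected utility under the model's output distribution.
    """
    if not candidates:
        raise ValueError("mbr_decode needs at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        # Score this hypothesis against every other candidate.
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:  # strict comparison: ties keep the earlier candidate
            best, best_score = hyp, score
    return best
```

With the candidates "the battery life is great", "the battery life is really great", "battery life is great", and "terrible screen quality", this sketch selects the first: it agrees most with the rest of the pool, while the outlier scores zero against every other candidate. This consensus effect is the intuition behind using MBR on noisy inputs, where a single highest-probability output may be an outlier.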