Paper: LLM Translation of Compiler Intermediate Representation

John R Levine <johnl@taugh.com>
Tue, 12 May 2026 11:34:57 -0400

          From comp.compilers

From: John R Levine <johnl@taugh.com>
Newsgroups: comp.compilers
Date: Tue, 12 May 2026 11:34:57 -0400
Organization: Compilers Central
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="41016"; mail-complaints-to="abuse@iecc.com"
Keywords: GCC, LLVM
Posted-Date: 12 May 2026 11:35:54 EDT

The authors use an LLM to translate GCC's GIMPLE into LLVM intermediate
representation, a famously hard task, and claim success even though one table
says it's at best 84% correct.


Abstract


GCC and LLVM underpin much of modern software infrastructure, relying on
distinct Intermediate Representations (IRs) to drive optimizations and code
generation. However, the semantic and structural differences between these IRs
create significant barriers for cross-toolchain interaction, limiting the reuse
of compiler frontends, backends, and optimization pipelines across programming
languages and compilation ecosystems. Traditional rule-based translators have
attempted to bridge this gap, but their complexity and maintenance cost have
hindered practical adoption. In this context, Large Language Models (LLMs)
appear to be an emerging technology that offers a data-driven alternative,
capable of learning complex mappings between heterogeneous compiler IRs directly
from sufficiently representative examples. To explore this approach, this paper
presents IRIS-14B, a 14-billion-parameter transformer model fine-tuned to
translate GIMPLE (as emitted by GCC) to LLVM IR (as emitted by LLVM). The model
is trained on paired IRs extracted from C sources and evaluated on the
GIMPLE-to-LLVM IR transformation applied to IRs derived from real-world C code
and competitive programming problems. To the best of our knowledge, IRIS-14B is
the first model trained explicitly for IR-to-IR translation. It outperforms the
accuracy of widely used models, including the largest state-of-the-art open
models available today, ranging from 13 to 1,000 billion parameters, by up to 44
percentage points. The proposed transformation supports the integration of LLMs
as complementary components within hybrid neuro-symbolic compiler architectures,
where models such as IRIS-14B act as interoperability layers enabling
cross-toolchain workflows without modifying existing compiler passes, while
traditional compiler infrastructure continues to perform deterministic
compilation and optimization.


https://arxiv.org/abs/2605.08247
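
The abstract says the model is trained on paired IRs extracted from C sources
but does not say how the pairs are produced. As a rough sketch only, one way to
get a GIMPLE dump and an LLVM IR file from the same C file is to run both
compilers with their standard dump options. The helper names, the output file
names, and the choice of -O0 are my assumptions for illustration, not the
paper's pipeline.

```python
import shutil
import subprocess
from pathlib import Path


def gimple_command(source: Path, out_dir: Path) -> list[str]:
    """GCC command that writes the GIMPLE dump of `source` into out_dir."""
    dump = out_dir / (source.stem + ".gimple")  # hypothetical file name
    return [
        "gcc", "-O0", "-S", str(source),
        "-o", str(out_dir / (source.stem + ".s")),
        # -fdump-tree-gimple=FILE sends the GIMPLE dump to FILE
        f"-fdump-tree-gimple={dump}",
    ]


def llvm_ir_command(source: Path, out_dir: Path) -> list[str]:
    """Clang command that writes textual LLVM IR for `source` into out_dir."""
    ll = out_dir / (source.stem + ".ll")  # hypothetical file name
    return ["clang", "-O0", "-S", "-emit-llvm", str(source), "-o", str(ll)]


def emit_pair(source: Path, out_dir: Path) -> bool:
    """Run both compilers; return False if either tool is not installed."""
    if shutil.which("gcc") is None or shutil.which("clang") is None:
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in (gimple_command(source, out_dir), llvm_ir_command(source, out_dir)):
        subprocess.run(cmd, check=True)  # raises if a compile fails
    return True
```

Calling emit_pair(Path("demo.c"), Path("out")) would leave out/demo.gimple and
out/demo.ll side by side, assuming both compilers are on PATH and that "gcc"
really is GCC rather than an alias for clang.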


Regards,
John Levine, johnl@taugh.com, Taughannock Networks, Trumansburg NY
Please consider the environment before reading this e-mail. https://jl.ly

