Research Article

The algorithm and implementation of an extension to LLVM for solving the blocking between instruction sink and division-modulo combine

Article: 2273219 | Received 01 Jan 2023, Accepted 16 Oct 2023, Published online: 30 Oct 2023

Abstract

Compiler optimization presents a formidable challenge, seeking to enhance code quality through various techniques. One significant obstacle is the phase-ordering problem, where the sequence of optimization algorithms can impact code quality and potentially lead to performance deterioration. While machine learning has shown promise in addressing this issue, it still encounters limitations in certain scenarios.

Instruction Sink is a machine-independent optimization that segregates instructions like modulo and division into separate basic blocks. This separation prevents the application of the machine-dependent optimization called Division-Modulo Combine at the backend, which can lead to a decline in program performance. However, reordering these optimizations to avoid the conflict is not feasible. In this paper, we propose an algorithm that enhances Instruction Sink to resolve the blocking problem, and we implement it within the LLVM optimizer. Leveraging the modularity of LLVM enables us to tackle all affected back-end issues simultaneously.

To assess the effectiveness of our algorithm, we conduct a comprehensive evaluation focusing on feasibility, compatibility, and targeted improvements. The results demonstrate that our approach effectively addresses the blocking problem without conflicting with other optimization algorithms, solely impacting problematic instructions while leaving normal ones unaffected.


1. Introduction

Compiler optimization has been a dynamic research field for several years, aiming to enhance the performance of generated code. This task poses significant challenges, notably the phase-ordering problem between optimization algorithms (Garciarena & Santana, Citation2016). The phase-ordering problem arises when optimization algorithms are executed in different orders, leading to varying performance outcomes. Compilers utilize diverse techniques, such as function inline, code motion, and register allocation, to optimize code. However, the phase-ordering problem between these optimization algorithms can potentially impact performance. For instance, Function Inline can complicate the code and affect other optimization algorithms that analyze the code (Haj-Ali et al., Citation2020).

Recently, numerous studies have leveraged machine learning to discover optimal parameters for optimization algorithms or to address phase-ordering problems. For example, You and Su (Citation2022) simplified the compiler autotuning problem by reducing the O3 subsequence labeling. The implementation of machine learning techniques has yielded promising results in enhancing compiler performance.

Nevertheless, reordering the algorithm alone will not resolve the blocking issue between Instruction Sink and Division-Modulo Combine, which constitutes a phase-ordering problem spanning multiple stages in LLVM (Lattner & Adve, Citation2004).

The compiler architecture typically consists of various components: scanner, syntax, semantics, intermediate language generator, optimizer, and machine code generator (Aho et al., Citation2007). Figure 1 illustrates that LLVM follows a three-phase compiler architecture, where the initial four stages collectively form the frontend, and the machine code generator constitutes the backend. The frontend converts high-level source code into a machine-independent intermediate language, while the backend translates the intermediate language into machine-dependent code.

Figure 1 Three-phase compiler architecture – Three-phase compiler can be divided into front-end, optimizer, and back-end, such as LLVM.


Two types of optimization techniques exist: machine-independent for intermediate languages and machine-dependent for the backend. Instruction Sink (Knoop et al., Citation1994) is a code motion optimization algorithm falling under the category of machine-independent optimizations (Muchnick et al., Citation1997). Conversely, Division-Modulo Combine (Ebner et al., Citation2008) is a machine-dependent optimization for the backend. The order of these two optimizations cannot be swapped.

This paper proposes an algorithm to address the overlooked issue of the phase-ordering problem. Specifically, we focus on resolving the blocking between Instruction Sink and Division-Modulo Combine and implement our algorithm in LLVM. We then evaluate the effectiveness of our algorithm in addressing the problem without causing side effects in normal code where the problem does not occur.

2. Motivation

As mentioned earlier, the conflict between Instruction Sink and Division-Modulo Combine arises from the phase-ordering problem. However, swapping the order of these two algorithms is not a viable solution, since they operate in different phases of the compiler. This section describes both algorithms in depth and elaborates on the conflict between them.

2.1. Instruction sink

Instruction Sink is a machine-independent optimization algorithm designed for code motion. The algorithm analyzes instructions that define variables and those that use variables. If a variable is identified as dead, Instruction Sink will relocate the corresponding instruction to eliminate unnecessary operations.

Sometimes, code conditions lead to the definition instruction and the use instruction residing in different basic blocks, as shown in Figure 2(a). For instance, the definition instruction sinkee = b + c and the use instruction foo(sinkee) of the variable sinkee are separated into different basic blocks by the condition cond. When cond evaluates to false, the program control flow does not enter the basic block of the use instruction, so the variable is defined but never used, rendering the definition instruction redundant.

Figure 2 Instruction Sink with pseudocode. (a). The program control flow chart before Instruction Sink. (b). Apply the code to Instruction Sink to increase the performance of the program.


In such cases, Instruction Sink sinks the definition instruction into the same basic block as the use instruction, as shown in Figure 2(b). Consequently, when cond is true, both instructions execute in the same basic block, while the variable is not defined at all if cond is false, thereby improving program performance.

Applying Instruction Sink within loops can significantly reduce the number of instructions and greatly enhance program performance. Since this optimization algorithm operates on machine-independent intermediate languages, it is applicable to various hardware architectures.
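The transformation of Figure 2 can be sketched in C-style code; the function bodies and names below are illustrative fillers, not taken from the paper's test cases:

```cpp
// Before Instruction Sink: sinkee is computed on every path,
// even when cond is false and the value is never used.
int before_sink(int b, int c, int cond) {
    int sinkee = b + c;      // defined unconditionally
    int result = 0;
    if (cond)
        result = sinkee * 2; // only use of sinkee (stands in for foo(sinkee))
    return result;
}

// After Instruction Sink: the definition moves into the block that
// contains its only use, so the add is skipped when cond is false.
int after_sink(int b, int c, int cond) {
    int result = 0;
    if (cond) {
        int sinkee = b + c;  // sunk next to its use
        result = sinkee * 2;
    }
    return result;
}
```

Both versions compute the same result; the sunk form simply avoids the addition on the cold path.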

2.2. Division-modulo combine

Division-Modulo Combine is a machine-dependent optimization technique. Certain hardware Instruction Set Architectures, such as MIPS R2000 (Kane & Heinrich, Citation1992), x86, etc., provide a division method that calculates both the quotient and the remainder. Division-Modulo Combine leverages this feature during the backend code generation phase to consolidate division and modulo instructions into a single division operation, based on the hardware information.

For example, Figure 3(a) illustrates a basic block where we need to calculate the quotient and remainder of X divided by Y. The resulting assembly language contains separate division instructions for the quotient and the remainder. However, when Division-Modulo Combine is applied, as shown in Figure 3(b), the instructions are combined into a single division, reducing the number of division instructions and improving program performance.
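A minimal sketch of the source pattern the combine targets: on an ISA such as x86, whose DIV instruction yields the quotient and remainder together, the two operations below can be served by a single hardware division. The helper name `divmod` is illustrative:

```cpp
#include <utility>

// Quotient and remainder of x / y from one conceptual division.
// With Division-Modulo Combine, the backend emits a single division
// instruction whose two result registers service both operations.
std::pair<int, int> divmod(int x, int y) {
    int q = x / y;  // quotient
    int r = x % y;  // remainder: same divisor and dividend as above
    return {q, r};
}
```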

Figure 3 Division-Modulo Combine is used on the back-end to generate machine-dependent language. (a). Division-Modulo Combine is not used to generate the combination language. (b). Use the Division-Modulo Combine to generate a combined language.


2.3. Instruction sink blocks division-modulo combine

Both Instruction Sink and Division-Modulo Combine are effective and common optimization algorithms used by compilers. However, there are situations where they conflict with each other. When Instruction Sink moves either the division instruction or the modulo instruction, causing them to end up in different basic blocks, it hinders the application of Division-Modulo Combine during the backend code generation.

Consider Figure 4, which demonstrates the conditions under which Instruction Sink blocks Division-Modulo Combine. Without Instruction Sink, Figure 4(a) shows the division and modulo instructions in the basic block while.body, allowing Division-Modulo Combine to generate a single division in the assembly language.

Figure 4 Division-Modulo Combine optimization cannot be applied after Instruction Sink optimization. (a). Converted to assembly language without Instruction Sink. (b). Apply Instruction Sink to move the variable %div. (c). Converted to assembly language with Instruction Sink.


However, after applying Instruction Sink, as shown in Figure 4(b), the division instruction defining the variable %div is moved to the basic block if.end, because %div is used only in that basic block. Consequently, once the division and modulo instructions are separated into two basic blocks, Division-Modulo Combine cannot be applied, and redundant division instructions are generated in the assembly language, as shown in Figure 4(c). This redundancy can significantly affect program execution efficiency, especially within loops.

Since Instruction Sink is a machine-independent optimization and Division-Modulo Combine is a machine-dependent optimization, swapping their order does not resolve the conflict. Therefore, alternative solutions need to be explored.
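A hypothetical C-style reduction of the conflict in Figure 4: the division and modulo share their divisor and dividend, but the division's only use sits behind a condition, so Instruction Sink separates the pair and the backend must emit two divisions. The function and variable names are illustrative:

```cpp
// Pattern that triggers the blocking problem (cf. Figure 4):
// rem and div share the same operands, but div is used only inside
// the if-block, so Instruction Sink moves its definition there and
// Division-Modulo Combine can no longer pair it with rem.
int blocking_pattern(int x, int y, int flag) {
    int rem = x % y;  // stays in the original block ("while.body")
    int div = x / y;  // candidate Sinkee: used only under flag
    int acc = rem;
    if (flag)
        acc += div;   // sole use of div, in a different block ("if.end")
    return acc;
}
```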

3. Proposal

As mentioned earlier, resolving the conflict between Instruction Sink and Division-Modulo Combine requires more than a simple reordering of optimization sequences. In this section, we introduce our proposed algorithm, SHANE (Sinkee Hook – Applied on Non-positive Effect), which enhances Instruction Sink in the LLVM Intermediate Representation (IR) stage to prevent blocking conditions. We will explain the blocking problem conditions, outline how SHANE helps to avoid them, and discuss its integration with Instruction Sink.

3.1. Blocking problem conditions

The occurrence of the blocking problem between Instruction Sink and Division-Modulo Combine depends on the following conditions:

  1. The hardware’s Instruction Set Architecture must support a division method capable of calculating both the quotient and the remainder, thus enabling Division-Modulo Combine.

  2. Within a basic block, there must be division and modulo instructions sharing the same divisor and dividend.

  3. After applying Instruction Sink optimization, one instruction (termed “Sinkee”) will be sunk to another basic block, while the other instruction remains in the original block.

When these conditions align, Division-Modulo Combine cannot be applied to the quotient and remainder instructions, leading to an inability to reduce the number of division instructions and adversely affecting program performance.

3.2. Shane

Our algorithm, SHANE, proactively identifies and prevents the occurrence of blocking problems. It operates in two main steps:

Step 1: Checking for Division-Modulo Combine Applicability

SHANE leverages LLVM’s Application Programming Interface (API) to access hardware information during the optimizer phase. This enables SHANE to determine whether the Instruction Set Architecture supports the necessary division method for Division-Modulo Combine.

Step 2: Analyzing Instruction Sink Patterns and Partners

SHANE searches for instructions that fit the Instruction Sink pattern and identifies the Sinkee and Partner variables defined by these instructions. Optimization algorithms such as Instruction Sink analyze the LLVM IR at the start of the optimization process to identify the patterns they target.

For a given instruction matching the Instruction Sink pattern, SHANE identifies the variable defined by the instruction as Sinkee. It then looks for another division or modulo instruction within the same basic block that shares the same divisor and dividend as Sinkee, naming this instruction’s variable as Partner, as illustrated in Figure 5(a).

Figure 5 Search for quotient or remainder commands in the basic block. (a). Pairs of quotient and remainder instructions. (b). Partner is defined and used in the same basic block.


Afterward, SHANE examines whether Partner will itself be sunk by Instruction Sink. This analysis assesses the positions of Partner’s definition and use instructions. If both are located in the same basic block, Partner’s definition instruction will not be sunk and remains in its original position, as demonstrated in Figure 5(b).

Assuming all blocking conditions are met, Division-Modulo Combine cannot be applied to the isolated quotient and remainder instructions caused by Instruction Sink. This leads to redundant division instructions, negatively impacting program performance. To prevent this, SHANE cancels the sink operation on Sinkee, effectively avoiding the blocking problem.

To ensure safe execution of Instruction Sink, SHANE also analyzes whether the sinking instructions have side effects before executing the sink operation, as depicted in Figure 6. By integrating SHANE into Instruction Sink at the LLVM IR stage, we can address the blocking problem effectively. LLVM’s flexibility enables SHANE to cover all affected backends, future hardware architectures, and high-level languages.

Figure 6 SHANE in instruction sink.


In the following section, we will provide a detailed implementation of SHANE in LLVM, including technical code examples and pseudocode.

4. Implementation

Instruction Sink is implemented in the instcombine pass of the LLVM optimizer, which hosts several other optimization algorithms. To integrate SHANE into the compiler, we incorporate it into the function TryToSinkInstruction in instcombine. This function detects the side effects of sink instructions before executing the operation, ensuring safe code motion. TryToSinkInstruction is an appropriate location for the SHANE implementation because it lets us address the blocking problem before reaching the backend.

4.1. Pseudocode

The following pseudocode outlines the functionality of SHANE within the TryToSinkInstruction function (Code 1):

Code 1. Pseudocode of SHANE.

In Lines 1 to 3, SHANE is invoked within the TryToSinkInstruction function. A true return value from SHANE indicates that the sink instruction would cause a blocking problem, while a false return value signifies safe sinking.

During the optimization stage, LLVM provides an API that enables us to obtain hardware information in advance and make informed judgments. Line 6 in the pseudocode calls the function hasDivRemOp, and if it returns false, it indicates that the hardware lacks a division method capable of calculating both the remainder and the quotient simultaneously. In such cases, the blocking problem under discussion will not occur.

To identify the target quotient and remainder instructions for Division-Modulo Combine, we create a mapping in Line 8 (PartnerOpMap) that associates division and modulo instructions with their respective partners. In Lines 9–11, we check whether the operator that defines Sinkee exists in the PartnerOpMap, indicating that the sink instruction is indeed a candidate for Division-Modulo Combine.

SHANE then proceeds to traverse all the instructions within the same basic block as Sinkee (Lines 13–19) to locate the desired partner instruction. If such an instruction is found, we validate whether it meets the conditions required for Division-Modulo Combine, including sharing the same operands as Sinkee and belonging to the same basic block. If these conditions are met, SHANE returns true (Line 19), signaling a blocking problem. Otherwise, it returns false (Line 20), indicating that no blocking problem is detected.
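The logic described above can be modelled as self-contained C++. This is a hedged sketch of the described behaviour only: the actual implementation operates on llvm::Instruction inside TryToSinkInstruction, and the struct, function names, and line mapping below are illustrative assumptions:

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for an IR instruction (real SHANE uses llvm::Instruction).
struct Instr {
    std::string opcode;   // "sdiv", "srem", ...
    std::string lhs, rhs; // operand names (dividend, divisor)
};

// Model of the check in Code 1: returns true when sinking `sinkee`
// out of `block` would block Division-Modulo Combine.
bool shane(const Instr& sinkee, const std::vector<Instr>& block,
           bool hardwareHasDivRem) {
    // Line 6: the target has no combined div/rem operation, so the
    // blocking problem cannot occur and sinking is safe.
    if (!hardwareHasDivRem)
        return false;

    // Line 8: PartnerOpMap pairs each candidate opcode with its partner.
    std::map<std::string, std::string> partnerOpMap = {
        {"sdiv", "srem"}, {"srem", "sdiv"},
        {"udiv", "urem"}, {"urem", "udiv"}};

    // Lines 9-11: the sinkee must itself be a division or modulo.
    auto it = partnerOpMap.find(sinkee.opcode);
    if (it == partnerOpMap.end())
        return false;

    // Lines 13-19: scan the basic block for a partner instruction
    // with the identical divisor and dividend.
    for (const Instr& i : block)
        if (i.opcode == it->second && i.lhs == sinkee.lhs && i.rhs == sinkee.rhs)
            return true; // Line 19: sinking would block the combine

    return false;        // Line 20: no partner found, safe to sink
}
```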

4.2. Handling special cases

In practice, we must account for the code structure and its interactions with other optimization algorithms so that SHANE detects all possible blocking problems without introducing new conflicts. Two special cases, integer concatenation/trimming and constant divisors, are discussed below (Code 2):

Code 2. LLVM IR of concatenated and trimmed operators with integer numbers.

In the example LLVM IR code provided above, the variable %divisor is originally a 64-bit integer. However, since %dividend is a 32-bit integer, %divisor is trimmed down to 32 bits to facilitate the operations. In this scenario, the quotient and remainder (Lines 3 and 7) are derived from %dividend and %divisor, making the code optimizable with Division-Modulo Combine in the backend. However, if Instruction Sink separates the two division instructions, Division-Modulo Combine encounters the blocking problem. Detecting this type of blocking problem is difficult if SHANE simply checks whether the operand variables are the same (Figure 7).

Figure 7 Operators are integers converted by bit.


To address this issue, SHANE traces the operands to detect all the code that may lead to blocking problem comprehensively. It identifies whether the division and modulo instructions share the same operands, accounting for integer concatenation or trimming operations. By tracing the definition of different operands, SHANE can identify whether the division and modulo instructions originated from the same source variable after concatenation or trimming.
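The operand tracing can be sketched as a small helper. This is an illustrative model under stated assumptions: cast definitions (trunc/sext/zext results mapped to their input operand) are abstracted into a plain map rather than real LLVM IR, and the function name is hypothetical:

```cpp
#include <map>
#include <string>

// Follow trunc/sext/zext definitions back to the underlying source
// value, so a 64-bit %divisor and its 32-bit truncation are
// recognised as the same operand when matching div/rem pairs.
std::string traceSource(std::string name,
                        const std::map<std::string, std::string>& castDefs) {
    // castDefs maps a value produced by a cast to its input operand.
    auto it = castDefs.find(name);
    while (it != castDefs.end()) {     // walk the cast chain to its root
        name = it->second;
        it = castDefs.find(name);
    }
    return name;                       // original, uncast source value
}
```

Two operands then match when their traced sources are equal, not merely when their names are.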

Another special case occurs when the divisor is a constant. In such instances, the compiler converts the division operation into a multiplication followed by an arithmetic right shift (Granlund & Montgomery, Citation1994), as illustrated in Figure 8. This transformation improves performance by avoiding division instructions. To handle this case, SHANE does not process the division instruction when it detects a constant as the operand of Sinkee’s definition instruction.
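A well-known instance of this transformation, assuming 32-bit unsigned operands: division by the constant 10 becomes a multiply by a precomputed magic number followed by a right shift, so no division instruction remains for SHANE to pair:

```cpp
#include <cstdint>

// Division by the constant 10 in the style of Granlund & Montgomery:
// 0xCCCCCCCD approximates 2^35 / 10, so the wide multiply followed by
// a 35-bit right shift computes x / 10 for every 32-bit unsigned x.
uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDull) >> 35);
}
```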

Figure 8 Divisor is a constant.


With the modularity of LLVM, we compile instcombine with the SHANE algorithm into a Shared Object, enabling its implementation in any LLVM compiler. In the next section, we will evaluate the practical results of this proposal.

5. Evaluation

After implementing SHANE, we conducted evaluations covering the following aspects.

  1. Feasibility evaluation: We assessed whether SHANE can effectively resolve blocking issues.

  2. Compatibility evaluation: We examined whether SHANE conflicts with other optimization algorithms.

  3. Targeted evaluation: We evaluated whether SHANE has any adverse effects on code that does not involve blocking issues.

The evaluations were performed on Intel x86 architecture CPUs; detailed specifications are provided in Table 1. The operating system was Linux Ubuntu 18.04.6. The algorithm was implemented in the instcombine pass of LLVM 14.0.0, and this compiler version was used for all evaluations.

Table 1 Evaluation environment.

5.1. Feasibility evaluation

The feasibility evaluation aims to determine whether SHANE can effectively address the issue of Instruction Sink blocking Division-Modulo Combine. This evaluation involves four codes, all of which contain the blocking problem. Among them, pi.c (Beeler et al., Citation1972) is an algorithm for calculating the value of π, and we have configured it to compute π to 800 decimal places.

In addition, “Testcase_01” to “Testcase_03” are specially designed single-source codes. We utilized division and modulo instructions that shared identical divisors and dividends to separately define the variables “sinkee” and “partner.” The instructions for using these two variables were distributed into separate basic blocks, resulting in the blocking issue. Specifically: “Testcase_01” repeated this process within a loop 30,000 times. “Testcase_02” utilized these variables to calculate Fibonacci numbers and executed a loop 50,000 times. “Testcase_03” introduced more complex arithmetic operations and employed nested loops, culminating in a total of 500,500 iterations of division instructions within the loops.

The evaluation method involves applying the instcombine optimization pass, which includes Instruction Sink, to the codes during the machine-independent optimization phase. Subsequently, the code is compiled into executable files with Division-Modulo Combine enabled. No other optimization algorithms are used.

We utilize the “perf” tool to execute each executable file one million times and calculate the average time. This approach allows us to compare the compilation results of the SHANE optimizer (SHANE-sink) with those of the original optimizer without SHANE’s enhancements (Original-sink). The evaluation results are shown in Table 2. The performance of code optimized with SHANE-sink is superior to that of Original-sink, indicating that SHANE has indeed effectively resolved the blocking problem.

Table 2 Results of feasibility evaluations.

5.1.1. Special cases

Considering the two special cases mentioned in Section 4.2, we have also designed corresponding test cases to evaluate whether SHANE can accurately detect code segments with the blocking issue.

“Testcase_04” evaluates scenarios involving the concatenation and truncation of operands with different integer bit lengths. We predeclare a 64-bit divisor and a 32-bit dividend for defining the division instructions for “sinkee” and “partner.” As a result, the divisor is truncated into two new 32-bit divisors, each used in the division and modulo instructions. We execute the compiled executable files, optimized with SHANE-sink and Original-sink, one million times using the “perf” tool to calculate the average execution time. Table 3 displays the results, demonstrating a performance improvement after SHANE-sink optimization. This indicates that SHANE effectively detects cases where the concatenated and truncated divisors stem from the same source, thereby resolving the associated blocking issue.

Furthermore, we modified the instructions in “Testcase_01” to evaluate using constants as divisors for both division and modulo instructions. After compiling the executable files optimized with SHANE-sink and Original-sink, we used the Linux tool “diff” to compare the executable files. The results showed complete consistency between the two, indicating that SHANE indeed detects that using a constant divisor does not lead to the blocking issue we need to address, and therefore, no inappropriate modifications are made in this scenario.

5.2. Compatibility evaluation

The purpose of the compatibility evaluation is to determine whether SHANE conflicts with other optimizations and to assess its compatibility with them. This step is essential because of the potential phase-ordering problem within the compiler, which can cause modifications to one optimization algorithm to conflict with other existing optimizations. The evaluation process employs the same four codes used in the feasibility evaluation.

Table 3 Results of feasibility evaluations for special cases.

The evaluation methodology consists of applying machine-independent O3 optimizations to the codes. O3 encompasses a variety of optimization algorithms designed to improve performance. In contrast to O0, O1, O2, Os, and Oz, O3 integrates the most extensive set of algorithms. Following the optimization phase, the refined code is compiled into an executable file with Division-Modulo Combine enabled. This resultant executable is subsequently executed one million times, and the average execution time is calculated using the “perf” tool. This process enables a performance comparison between the SHANE-O3 optimizer and the Original-O3 optimizer.

The evaluation results are presented in Table 4. The findings suggest that codes optimized with SHANE-O3 do not display performance degradation compared to those optimized with Original-O3. This implies that SHANE does not interfere with the other optimization algorithms in the O3 sequence.

Table 4 Results of compatibility evaluations.

5.2.1. Comparing feasibility evaluation and compatibility evaluation

When comparing the evaluation results for feasibility and compatibility, an observation emerges: the O3 optimization unexpectedly yields lower performance than employing only the instcombine optimization pass. We attribute this phenomenon to the intricate sequencing of the many algorithms within O3 and the resulting side effects.

Taking the example of testcase_02.c, our analysis reveals that the optimization pass EarlyCSEPass (Early Common Subexpression Elimination) within O3 introduces increased data dependencies. Consequently, this rise in data dependencies adversely impacts the performance of SHANE-O3 when compared to SHANE-sink. While EarlyCSEPass optimizes by removing identical expressions, thereby reducing the count of executed instructions, it simultaneously introduces additional data dependencies.

Table 5 presents performance data acquired by executing the program one million times and averaging the results. The SHANE-O3 optimization minimizes the program’s instruction count. However, due to the heightened data dependencies, instruction-level parallelism decreases. As a result, the Instructions Per Cycle (IPC) declines, ultimately requiring a greater number of cycles for execution.

Table 5 Testcase_02.c information.

5.2.2. Comparison with DivRemPairs

In LLVM’s O3 optimization, an optimization algorithm named DivRemPairs is employed. This algorithm conducts various optimizations based on hardware-specific information. When the hardware supports an operation for acquiring both the quotient and remainder, instructions with the identical operands are relocated to a common basic block. In cases where the hardware lacks this support, the remainder instruction is decomposed into a sequence of operations involving multiplication and subtraction. This decomposition serves to minimize the usage of division instructions. This approach appears to offer a potential solution to addressing blocking issues.
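The decomposition path DivRemPairs takes on hardware without a combined operation can be sketched in C++; the remainder is recovered from the already-computed quotient with a multiply and subtract, so only one division instruction remains. The function name is illustrative:

```cpp
// When the target has no combined div/rem operation, DivRemPairs
// expands the remainder in terms of the quotient:
//   x % y  ==  x - (x / y) * y
int remFromQuotient(int x, int y) {
    int q = x / y;      // the single remaining division
    int r = x - q * y;  // remainder via multiply and subtract
    return r;
}
```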

The experiment proceeds as follows: first, we apply the instcombine optimization to the code, followed by the DivRemPairs optimization. Next, we compile the code into an executable file. Finally, we use “perf” for performance testing and compare the results with those obtained in the feasibility evaluation of Section 5.1.

The results of the two optimization processes show nearly equivalent performance, as presented in Table 6. However, a crucial adjustment to the sequence of optimization algorithms is necessary to ensure that DivRemPairs runs after instcombine. DivRemPairs is affected by optimizations performed before it, such as Instruction Sink and similar techniques, which introduce complexities into the program’s control flow. Consequently, this magnifies the difficulties associated with compiler phase ordering. In contrast, SHANE proactively addresses blocking problems before the program’s control flow becomes more convoluted.

Table 6 Performance comparison of SHANE and DivRemPairs.

5.2.3. Benchmarks: LLVM-test-suite

We conducted an analysis utilizing the LLVM test-suite (Lattner et al., Citation2022). The LLVM test-suite serves as an official benchmark designed to assess the correctness and performance of the code that has been compiled and optimized by the compiler. Our investigation focused on the MultiSource directory, which contains complete programs made up of multiple source files, including both large benchmarks and entire applications. Our findings uncovered instances of blocking problems within three projects housed within this directory. These projects are sqlite3, found in the Applications directory, and cfrac and gs, both situated within the Benchmarks/MallocBench directory.

The sqlite3 project is an embeddable SQL database engine. The cfrac project, developed by Dave Barrett, implements continued fraction factorization and relies on numerous small, short-lived allocations. The gs project covers ghostscript processing of the complete Intel Software Developer’s Manual PDF, spanning approximately 5000 pages. Refer to Table 7 for detailed benchmark-related information.

Table 7 Benchmarks related information: Multi-source codes in LLVM-test-suite.

The evaluation method involves compiling and testing the three projects with O3 optimization. We executed the “perf” tool 10,000 times to obtain the average time. A comparison was made between the optimization results of SHANE-O3 and Original-O3, as shown in Table 8. The results indicate that cfrac and gs did not show significant improvements, but there was a notable performance enhancement for sqlite3. The blocking problem in real-world scenarios is influenced by the project’s actual structure. Our evaluation results confirm the existence of this blocking problem, and SHANE proves to be partially effective in addressing it.

Table 8 Results of compatibility evaluations for multi-source codes.

5.3. Targeted evaluation

The purpose of the targeted evaluation is to validate that SHANE exclusively addresses the blocking problem, without inducing any unintended side effects. We had concerns that integrating SHANE into instcombine could potentially lead to unanticipated alterations in the code.

The evaluation employs the LLVM test-suite (Lattner et al., Citation2022), along with the well-known benchmarks Coremark (Consortium et al., Citation2018), Dhrystone (Weicker, Citation1984), and Whetstone (Longbottom, Citation2005). Table 9 shows the benchmark-related information.

Table 9 Benchmarks related information: CoreMark, dhrystone and whetstone.

The test data of the LLVM test-suite can be categorized into benchmarks designed for performance assessment and those for evaluating correctness. For this evaluation, we focus on the performance testing benchmarks, encompassing a total of 315 code samples.

We compiled the code separately using SHANE-O3 and Original-O3, and then employed the Linux tool “diff” for a comparative analysis. The results are presented in Figure 9. It is important to emphasize that our proposed solution does not affect benchmarks beyond the intended target, as all files remain identical. The modifications we made are limited to addressing blocking problems.

Figure 9 Results of benchmark evaluations. (a). Coremark comparison results. (b). Dhrystone comparison results. (c). Whetstone comparison results.

6. Related work

The quality of generated code can be influenced by the compiler architecture a project is built with. In this section, we investigate whether the blocking problem we identified arises under different architectures of ahead-of-time and just-in-time compilers. Numerous studies have explored the use of machine learning to search efficiently for optimal sequences and algorithm parameters. We discuss these approaches and highlight the significance of our research in this domain. Lastly, we explore the possibility of addressing the blocking problem using intrinsic functions of higher-level languages and explain why they may not offer an appropriate solution.

6.1. Ahead-of-time compilers

A single-phase compiler offers the advantage of fast compilation and low space usage (Ammann, Citation1977). It directly converts high-level languages into machine language but performs little optimization-related analysis and keeps no optimization information. As a result, the blocking problem described in this research does not occur, owing to the limited optimization.

On the other hand, a two-phase compiler is more modular than a single-phase compiler (Bornat, Citation1979), occupies less memory when loaded, and provides some optimization through a machine-independent intermediate language. Whether the blocking problem occurs depends on how optimization is implemented. However, both single-phase and two-phase compilers are inferior in modularity to multi-phase compilers and are becoming less common.

The GNU Compiler Collection (GCC) (Stallman et al., Citation1998) is another well-known open-source compiler; it exhibits higher modularity than single-phase and two-phase compilers and enjoys wider usage. Unlike LLVM, GCC does not encounter the blocking problem, owing to differences in compiler architecture. However, resolving the blocking problem in LLVM is more beneficial for programmers, owing to LLVM’s architecture and open-source licensing terms.

GCC’s code and architecture are highly coupled, making it challenging to retarget the compiler to generate executables for various hardware platforms. This reduces the likelihood of encountering the blocking problem, but it also makes it more difficult to develop new frontends and backends, or to gather information about code, in GCC than in LLVM. Additionally, GCC’s optimization sequence is fixed: one can only choose whether each optimization is applied or not. When attempting to solve the compiler phase-ordering problem, LLVM therefore offers more flexibility than GCC.

Both GCC and LLVM are open-source software. However, GCC adopts the GNU General Public License (GPL), which requires extensions to be GPL-licensed and open-source (Engelfriet, Citation2009). In contrast, LLVM uses the more permissive Berkeley Software Distribution (BSD) license (Engelfriet, Citation2009), which allows software developed under it to have proprietary derivatives. As a result, commercial software frequently builds on open-source software under the BSD license.

To summarize, LLVM’s greater modularity compared with the AOT compiler architectures discussed above allows us to address the blocking problem within LLVM itself while also handling potential tool-related issues. Our solution thereby tackles a problem that many backends could encounter, reducing the likelihood of code-optimization failures.

6.2. Just-in-time compilers

The blocking problem is less likely to occur in just-in-time (JIT) compilers. A JIT compiler is a compiler implemented within an interpreter (Aycock, Citation2003). It creates an execution environment and collects dynamic information during code execution in order to optimize code efficiently. For example, in Java’s JVM, the JIT compiler identifies hotspots during execution and compiles frequently executed instructions or basic blocks into machine language, thereby improving program performance. Since a JIT compiler optimizes based on actual execution conditions, the likelihood of the blocking problem described in this research is reduced.

However, JIT compilers have a complex architecture, require more functionality than AOT compilers, have a large memory footprint, and can be challenging to implement in embedded systems (Wade et al., Citation2017). Our proposal to improve Instruction Sink in the LLVM optimizer therefore remains beneficial for embedded systems, helping to avoid this optimization conflict.

6.3. Machine learning in optimization

Numerous studies have utilized machine learning to accelerate the search for optimization algorithm parameters or the best sequence of optimization algorithms in addressing the phase-ordering problem. Given the vast number of potential sequences and parameters, finding the best result can be a time-consuming process. Our proposal aims to address possible blocking problems to indirectly enhance search efficiency.

Regarding optimization algorithm parameters, Kisuki et al. (Citation2000) used iterative compilation to determine the optimal loop tiling size and loop unrolling count, resulting in improved performance. However, this method involves a lengthy compilation time because it requires executing and analyzing many “versions” of the program for information. To overcome this, Agakov et al. (Citation2006) used Markov chain models to expedite the optimization of search iterations for compilation. Similarly, Park et al. (Citation2013) proposed a learning model that predicts the optimal loop optimization parameters based on dynamic information. Due to the intricacy of modern hardware, relying solely on a simple instruction cost model to determine the best parameters for loop optimization, such as enhancing parallelization, data locality, and vectorization, is challenging.

Ashouri et al. (Citation2016) point out that the compiler phase-ordering problem is NP-complete, due to the enormous number of possible optimization sequences for a compiler, making it impractical to try them all. Almagor et al. (Citation2004) compared the search speeds of greedy algorithms, genetic algorithms, and hill-climbing algorithms. They found that between 200 and 4550 compilations were required to discover an optimized sequence that was 15% to 25% faster than a fixed sequence. Cooper et al. (Citation2005) provided various search algorithms and implemented user-interactive interfaces to optimize code, utilizing virtual execution technology to speed up the collection of dynamic information.

Machine learning is frequently used in studies to improve the efficiency in searching for optimal sequences. Garciarena and Santana (Citation2016) used a genetic algorithm with GCC to search for the best optimization sequence. Ashouri et al. (Citation2017) classified the O3 optimization sequences of LLVM into clusters, rather than directly training all individual optimization algorithms. They also applied dimensionality reduction techniques, such as Principal Component Analysis (Johnson et al., Citation2014), to speed up the search for optimal sequences by reducing the dimensionality of features used by machine learning models.

In recent years, Mammadli et al. (Citation2020) proposed a deep reinforcement learning approach to the phase-ordering problem, which depends only on the statically attainable intermediate representation of the source code. Peeler et al. (Citation2022) utilized Shackleton, a linear genetic programming framework, to optimize sequences of LLVM optimization passes. They also employed genetic improvement to find problem specific optimized LLVM pass sequences (Li et al., Citation2022).

One disadvantage of a machine learning compiler is that training the model requires a measure of code quality. Static information is not sufficient to determine code quality, so dynamic or mixed methods must be used to collect it. This involves executing the same code many times while searching for the best optimization sequence, which is time-consuming.

Our proposal, however, offers an advantage to machine learning compilers: it improves the quality of code produced by LLVM’s fixed optimization sequences O1, O2, and O3, yielding better baseline code before model training. Because the sequence already avoids the conflict between Instruction Sink and Division-Modulo Combine, training and search times are shortened and a better sequence can be found.

6.4. High-level language intrinsic functions

Intrinsic functions are one method that can be used to resolve the blocking problem. These functions are provided by the compiler and allow hardware-related instructions to be used from high-level languages such as C. If programmers know the target hardware and know that its instruction set has special instructions, they can use intrinsic functions to generate more efficient code (Rigger et al., Citation2018).

However, using intrinsic functions can make it difficult for many static analysis tools to analyze the code, as it mixes high-level language and machine language operations. Additionally, intrinsic functions embed machine-dependent operations in high-level language code, reducing the code’s portability and cross-platform capability (Batten et al., Citation2000).

While intrinsic functions can resolve the blocking problem, the problem itself is not exclusive to x86 hardware: it can arise on any instruction set architecture in which a single division operation returns both the quotient and the remainder. Because intrinsic functions reduce the code’s cross-platform capability, they are not a suitable general solution to the blocking problem.

7. Conclusion

Compiler development has a long history. Compilers enable developers to write programs in abstract, high-level languages without interacting directly with unintuitive machine and assembly languages, thereby enabling multi-platform development. Code written by developers may be inefficient owing to complexity or readability concerns, so compilers contain an optimizer that improves the code’s performance without changing its result.

Optimization in compilers suffers from the phase-ordering problem: the sequence of optimizations can affect the quality of the code, and a given sequence may even worsen the code, contrary to the intent of the optimizations. For example, Instruction Sink blocking Division-Modulo Combine can decrease code performance. In LLVM, the former is a machine-independent optimization applied in the LLVM optimizer, while the latter is applied in the LLVM backend; this separation makes it difficult to adjust their order to resolve the blocking problem. We therefore propose an algorithm called SHANE and incorporate it within Instruction Sink to solve the blocking problem at the optimizer phase. Our proposal leverages LLVM’s strong modularity: because the solution lives in the LLVM optimizer, all affected backends benefit simultaneously, including backends yet to be written. Our proposal also selectively withholds sinking only from instructions that exhibit the blocking problem while allowing normal instructions to sink, preserving the benefits of sinking optimization for the rest of the code rather than cancelling Instruction Sink entirely. In this way, our proposal enhances the optimization ability of Instruction Sink.

Regarding future prospects, we intend to explore two directions: applying LLVM intrinsic functions to this proposal, and investigating whether the blocking problem exists on hardware with independent modulo instructions.

7.1. Applying LLVM’s intrinsic functions to this proposal

In the future, we hope to transform division and modulo instructions that appear in the same basic block and share the same divisor and dividend into a single intrinsic function in LLVM IR. This would allow the backend to generate one division that computes both values when emitting machine-dependent code. Instruction Sink is not the only code-motion optimization; if the two instructions were combined into an intrinsic function, no other code-motion optimization could separate them into different basic blocks. Intrinsic functions are often used for hardware-related optimizations, so they are well suited to this proposal. We hope to use intrinsic functions to address the issues discussed in this study, as well as other potential issues, more comprehensively.

7.2. Hardware with independent modulo instructions

The blocking problem discussed in this study pertains to hardware capable of computing both the quotient and the remainder with a single division operation. However, certain hardware platforms, such as RISC-V (Asanović & Patterson, Citation2014), provide a dedicated instruction for obtaining only the remainder. When both division and modulo instructions coexist within the same basic block on this type of hardware, the LLVM optimizer replaces the modulo instruction with the expression (dividend - (quotient * divisor)) to minimize the number of division operations. In the future, our objective is to conduct experiments to assess whether segregating the division and modulo instructions into distinct basic blocks on such hardware impedes the application of this optimization.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Ministry of Science and Technology, Taiwan [Grant Number MOST 110-2221-E-008-020].

References

  • Agakov, F., Bonilla, E., Cavazos, J., Franke, B., Fursin, G., O'Boyle, M. F. P., Thomson, J., Toussaint, M., & Williams, C. K. I. (2006). Using machine learning to focus on iterative optimization. International Symposium on Code Generation and Optimization (CGO’06). IEEE, 11-pp. https://doi.org/10.1109/CGO.2006.37
  • Aho, A. V., Lam, M. S., Sethi, R., & Ullman, J. D. (2007). Compilers: Principles, techniques, & tools. Pearson Education India.
  • Almagor, L., Cooper, K. D., Grosul, A., Harvey, T. J., Reeves, S. W., Subramanian, D., Torczon, L., & Waterman, T. (2004). Finding effective compilation sequences. ACM SIGPLAN Notices, 39(7), 231–239. https://doi.org/10.1145/998300.997196
  • Ammann, U. (1977). On code generation in a PASCAL compiler. Software: Practice and Experience, 7(3), 391–423. https://doi.org/10.1002/spe.4380070311
  • Asanović, K., & Patterson, D. A. (2014). Instruction sets should be free: The case for risc-v. EECS Department, University of California, Berkeley, Tech. Rep. UCB/ EECS-2014-146.
  • Ashouri, A. H., Bignoli, A., Palermo, G., & Silvano, C. (2016). Predictive modeling methodology for compiler phase-ordering. Proceedings of the 7th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and the 5th Workshop on Design Tools and Architectures For Multicore Embedded Computing Platforms, 7–12. https://doi.org/10.1145/2872421.2872424
  • Ashouri, A. H., Bignoli, A., Palermo, G., Silvano, C., Kulkarni, S., & Cavazos, J. (2017). Micomp: Mitigating the compiler phase-ordering problem using optimization subsequences and machine learning. ACM Transactions on Architecture and Code Optimization (TACO), 14(3), 1–28. https://doi.org/10.1145/3124452
  • Aycock, J. (2003). A brief history of just-in-time. ACM Computing Surveys (CSUR), 35(2), 97–113. https://doi.org/10.1145/857076.857077
  • Batten, D., Jinturkar, S., Glossner, J., Schulte, M., & D’Arcy, P. (2000). A new approach to dsp intrinsic functions. Proceedings of the 33rd Annual Hawaii International Conference on System Sciences. IEEE, 10–pp. https://doi.org/10.1109/HICSS.2000.926967
  • Beeler, M. (1972). Item 120 in M. Beeler, R. W. Gosper, and R. Schroeppel, HAKMEM.
  • Bornat, R. (1979). Understanding and writing compilers: A do-it-yourself guide. Macmillan International Higher Education.
  • Consortium, E. M. B. (2018). CoreMark: An EEMBC benchmark.
  • Cooper, K. D., Grosul, A., Harvey, T. J., Reeves, S., Subramanian, D., Torczon, L., & Waterman, T. (2005). Acme: Adaptive compilation made efficient. ACM SIGPLAN Notices, 40(7), 69–77. https://doi.org/10.1145/1070891.1065921
  • Ebner, D., Brandner, F., Scholz, B., Krall, A., Wiedermann, P., & Kadlec, A. (2008). Generalized instruction selection using ssa-graphs. Proceedings of the 2008 ACM SIGPLAN-SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, 43(7), 31–40. https://doi.org/10.1145/1375657.1375663
  • Engelfriet, A. (2009). Choosing an open source license. IEEE Software, 27(1), 48–49. https://doi.org/10.1109/MS.2010.5
  • Garciarena, U., & Santana, R. (2016). Evolutionary optimization of compiler flag selection by learning and exploiting flags interactions. Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion, 1159–1166. https://doi.org/10.1145/2908961.2931696
  • Granlund, T., & Montgomery, P. L. (1994). Division by invariant integers using multiplication. Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, 61–72. https://doi.org/10.1145/178243.178249
  • Haj-Ali, A., Huang, Q. J., Xiang, J., Moses, W., Asanovic, K., Wawrzynek, J., & Stoica, I. (2020). Autophase: Juggling hls phase orderings in random forests with deep reinforcement learning. Proceedings of Machine Learning and Systems, 2, 70–81. https://doi.org/10.48550/arXiv.2003.00671
  • Johnson, R. A., & Wichern, D. W. (2014). Applied multivariate statistical analysis. Pearson London, UK: 2014, vol. 6.
  • Kane, G., & Heinrich, J. (1992). MIPS RISC architectures. Prentice-Hall, Inc.
  • Kisuki, T., Knijnenburg, P. M., & O’Boyle, M. F. (2000). Combined selection of tile sizes and unroll factors using iterative compilation. Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No. PR00622) (pp. 237–246). IEEE. https://doi.org/10.1109/PACT.2000.888348
  • Knoop, J., Rüthing, O., & Steffen, B. (1994). Partial dead code elimination. ACM Sigplan Notices, 29(6), 147–158. https://doi.org/10.1145/773473.178256
  • Lattner, C., & Adve, V. (2004). LLVM: A compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization, 2004. CGO 2004 (pp. 75–86), IEEE. https://doi.org/10.1109/CGO.2004.1281665
  • Lattner, C., contributors. (2022). LLVM test-suite repository. https://github.com/llvm/llvm-test-suite
  • Li, S. S., Peeler, H., Sloss, A. N., Reid, K. N., & Banzhaf, W. (2022). Genetic improvement in the shackleton framework for optimizing LLVM pass sequences. GECCO ‘22: Proceedings of the Genetic and Evolutionary Computation Conference Companion, 1938–1939. GECCO. https://doi.org/10.1145/3520304.3534000
  • Longbottom, R. (2005). Whetstone benchmark history and results.
  • Mammadli, R., Jannesari, A., & Wolf, F. (2020). Static neural compiler optimization via deep reinforcement learning. 2020 IEEE/ACM 6th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) and Workshop on Hierarchical Parallelism for Exascale Computing (HiPar). IEEE, 1–11. https://doi.org/10.1109/LLVMHPCHiPar51896.2020.00006
  • Muchnick, S. S. (1997). Advanced compiler design implementation. Morgan kaufmann.
  • Park, E., Cavazos, J., Pouchet, L. N., Bastoul, C., Cohen, A., & Sadayappan, P. (2013). Predictive modeling in a polyhedral optimization space. International Journal of Parallel Programming, 41(5), 704–750. https://doi.org/10.1109/CGO.2011.5764680
  • Peeler, H., Li, S. S., Sloss, A. N., Reid, K. N., Yuan, Y., & Banzhaf, W. (2022). Optimizing LLVM pass sequences with shackleton: A linear genetic programming framework. GECCO ‘22: Proceedings of the Genetic and Evolutionary Computation Conference Companion (pp. 578–581). GECCO. https://doi.org/10.1145/3520304.3528945
  • Rigger, M., Marr, S., Kell, S., Leopoldseder, D., & Mössenböck, H. (2018). An analysis of x86-64 inline assembly in c programs. Proceedings of the 14th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. ACM SIGPLAN Notices, 53(3), 84–99. https://doi.org/10.1145/3296975.3186418
  • Stallman, R. M. (1998). Using and porting GNU CC (Vol 675). Free Software Foundation.
  • Wade, A. W., Kulkarni, P. A., & Jantz, M. R. (2017). AOT vs. JIT: Impact of profile data on code quality. Proceedings of the 18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, 52(5), 1–10. https://doi.org/10.1145/3140582.3081037
  • Weicker, R. P. (1984). Dhrystone: A synthetic systems programming benchmark. Communications of the ACM, 27(10), 1013–1030. https://doi.org/10.1145/358274.358283
  • You, Y. P., & Su, Y. C. (2022). Reduced O3 subsequence labelling: A stepping stone towards optimisation sequence prediction. Connection Science, 34(1), 2860–2877. https://doi.org/10.1080/09540091.2022.2044761