JVM template interpreter

With some free time on my hands, I compiled and debugged OpenJDK 9 and took a careful look at the template interpreter in HotSpot.

One: What is a template interpreter?

Languages such as C and C++ are compiled directly into platform-dependent machine instructions, and the executable formats differ across platforms: ELF on Linux, PE on Windows, and Mach-O on macOS. Java, as we know, owes its relative platform independence to the fact that it is not compiled directly into machine instructions; instead, it is compiled into an intermediate form: bytecode.

After reading Zhou Zhiming's "In-depth Understanding of the Java Virtual Machine" in 2016, I still wanted more, so I went on to read Zhang Xiuhong's "Handwritten Java Virtual Machine", which explains in detail how to implement a small JVM, including how to execute the bytecode of a method body in a class file.

The bytecode interpreter in "Handwritten Java Virtual Machine" is essentially a direct translation. For example, the iload instruction (push the int-type local variable in the specified slot onto the top of the operand stack) is implemented in Go (the book builds its JVM in Go) as a function that performs the corresponding behavior:

func _iload(frame *rtda.Frame, index uint) {
    val := frame.LocalVars().GetInt(index)
    frame.OperandStack().PushInt(val)
}

When the iload instruction in the method is executed, the _iload() method is called directly.

This kind of interpreter is straightforward and easy to understand. If we were asked to implement a virtual machine ourselves, we would probably do it the same way (not that I have the ability). Early versions of HotSpot executed bytecode instructions in exactly this manner. Such an interpreter is generally called a bytecode interpreter. The bytecode interpreter still exists in HotSpot today, but it is not used by default.
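To make the idea concrete, here is a minimal sketch of a switch-based bytecode interpreter loop. It is written in C++ rather than the Go used by the book, and the opcodes and frame layout are simplified for illustration; it is not HotSpot source.

#include <cstdint>
#include <vector>

// A toy frame: local variable slots plus an operand stack.
struct Frame {
  std::vector<int> locals;
  std::vector<int> stack;
};

enum : uint8_t { OP_ILOAD = 0x15, OP_IADD = 0x60, OP_RETURN = 0xb1 };

void interpret(const uint8_t* code, Frame& f) {
  for (std::size_t pc = 0; ; ) {
    switch (code[pc]) {                       // fetch and decode the current opcode
      case OP_ILOAD: {                        // iload: push the int local in the given slot
        uint8_t slot = code[pc + 1];          // the slot index is the operand byte
        f.stack.push_back(f.locals[slot]);
        pc += 2;
        break;
      }
      case OP_IADD: {                         // iadd: pop two ints, push their sum
        int b = f.stack.back(); f.stack.pop_back();
        int a = f.stack.back(); f.stack.pop_back();
        f.stack.push_back(a + b);
        pc += 1;
        break;
      }
      case OP_RETURN:                         // return: leave the interpreter loop
        return;
      default:                                // anything else is ignored in this sketch
        return;
    }
  }
}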

The advantage of the bytecode interpreter was mentioned above, but its drawback is just as obvious: it is slow. Every bytecode instruction is executed through this layer of translation. Although, in a JVM written in C++, functions like _iload() above are themselves compiled into machine instructions, the machine instructions produced by the compiler carry a lot of overhead. The CPU's job is simply to keep fetching and executing instructions, and the more instructions there are, the longer it takes. The JVM interpreter is itself just such a fetch-and-execute loop, so if every single bytecode instruction executes slowly, overall efficiency is bound to be poor.

Since the early bytecode interpreter could not keep up with the times, what optimization did the JVM engineers come up with? The bytecode interpreter above is slow because the machine instructions produced by the compiler are not ideal, so why not skip the compiler altogether and write the assembly code by hand? That is exactly what today's HotSpot does, and this kind of interpreter is called a template interpreter.

The template interpreter hand-writes, for each bytecode instruction, a piece of assembly code that implements the corresponding behavior. When the JVM is initialized, the assembler translates this assembly code into machine instructions and places them in memory. To execute the iload instruction, for example, the interpreter just runs the corresponding machine code by jumping directly to its address in memory. Many places in HotSpot use hand-written assembly to improve efficiency; my article "The ins and outs of JVM method execution" also mentions that method calls are carried out through hand-written assembly.
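The core trick, generating machine code once at startup and then executing it by jumping straight to its address, can be shown with a tiny stand-alone sketch. This is a POSIX/x86-64 illustration only and has nothing to do with HotSpot's real code generator:

#include <sys/mman.h>
#include <cstring>
#include <cstdio>

int main() {
  // x86-64 machine code for: mov eax, 42 ; ret
  unsigned char code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};

  // Ask the OS for a page of executable memory and copy the instructions into it.
  void* mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) return 1;
  std::memcpy(mem, code, sizeof(code));

  // "Entry address" of the generated code: run it by jumping (calling) to that address.
  auto fn = reinterpret_cast<int (*)()>(mem);
  std::printf("%d\n", fn());   // prints 42
  munmap(mem, 4096);
  return 0;
}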

Two: The creation of the template interpreter

1: Template initialization and machine instruction generation

The iload instruction we usually talk about is just a mnemonic for a bytecode instruction; it exists to help us read the code. The real bytecode instruction is a number: iload, for example, is 21 (0x15), so when the virtual machine executes the instruction 21 it is executing iload. The bytecode instructions are defined in bytecodes.hpp:

class Bytecodes: AllStatic {
 public:
  enum Code {
    _illegal              =  -1,

    // Java bytecodes
    _nop                  =   0, // 0x00
    _aconst_null          =   1, // 0x01
    _iconst_m1            =   2, // 0x02
    _iconst_0             =   3, // 0x03
    _iconst_1             =   4, // 0x04
    _iconst_2             =   5, // 0x05
    _iconst_3             =   6, // 0x06
    _iconst_4             =   7, // 0x07
    _iconst_5             =   8, // 0x08
    _lconst_0             =   9, // 0x09
    ......
  }
}
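As a toy illustration (not HotSpot code), the method body in a class file is nothing more than a byte array of these numeric opcodes. The hypothetical bytes below correspond to iload_1, iload_2, iadd, ireturn:

#include <cstdint>
#include <cstdio>

int main() {
  // A made-up method body: iload_1 (0x1b), iload_2 (0x1c), iadd (0x60), ireturn (0xac)
  const uint8_t code[] = {0x1b, 0x1c, 0x60, 0xac};
  for (uint8_t bc : code)
    std::printf("opcode %3u (0x%02x)\n", (unsigned)bc, (unsigned)bc);   // the interpreter only ever sees numbers
  return 0;
}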

When the JVM is initialized, a template is created for each bytecode instruction, and each template is associated with the assembly-code generation function for that instruction:

void TemplateTable::initialize() {
  ......
  def(Bytecodes::_nop                 , ____|____|____|____, vtos, vtos, nop                 ,  _           );
  def(Bytecodes::_aconst_null         , ____|____|____|____, vtos, atos, aconst_null         ,  _           );
  def(Bytecodes::_iconst_m1           , ____|____|____|____, vtos, itos, iconst              , -1           );
  ......
  def(Bytecodes::_iload               , ubcp|____|clvm|____, vtos, itos, iload               ,  _           );
  ......
}

The def() function is actually used to create a template:

void TemplateTable::def(Bytecodes::Code code, int flags, TosState in, TosState out, void (*gen)(int arg), int arg) {
  ......
  Template* t = is_wide ? template_for_wide(code) : template_for(code);
  // setup entry
  t->initialize(flags, in, out, gen, arg);
}

When calling def(), we pass in a series of parameters, and the second-to-last one is a function pointer. This function pointer points to the assembly-code generation function for that bytecode instruction. Sticking with the iload example: when the iload template is created, the function pointer passed in is iload:

void TemplateTable::iload() {
  ......
  // Get the local variable slot number and put it in rbx
  locals_index(rbx);
  // Move the local variable in that slot into rax
  __ movl(rax, iaddress(rbx));
}

The iload() function generates the machine instructions corresponding to the iload instruction.

After the templates for all bytecodes have been defined, the JVM iterates over every bytecode value and generates a machine-code entry point for each defined bytecode:

void TemplateInterpreterGenerator::set_entry_points_for_all_bytes() {
  for (int i = 0; i < DispatchTable::length; i++) {
    Bytecodes::Code code = (Bytecodes::Code)i;
    if (Bytecodes::is_defined(code)) {
      set_entry_points(code);
    } else {
      set_unimplemented(i);
    }
  }
}

set_entry_points(code) will eventually call TemplateInterpreterGenerator::generate_and_dispatch() to generate the machine instructions:

void TemplateInterpreterGenerator::generate_and_dispatch(Template* t, TosState tos_out) {
  ......
  // generate template
  t->generate(_masm);
  // advance
  if (t->does_dispatch()) {
#ifdef ASSERT
    // make sure execution doesn't go beyond this point if code is broken
    __ should_not_reach_here();
#endif // ASSERT
  } else {
    // dispatch to next bytecode
    __ dispatch_epilog(tos_out, step);
  }
}

generate_and_dispatch() calls the template's generate() method. Because the pointer to the machine-code generation function was recorded in _gen when the template was initialized, generate() can simply call _gen() to emit the machine instructions. For iload this amounts to calling TemplateTable::iload():

void Template::generate(InterpreterMacroAssembler* masm) {
  // parameter passing
  TemplateTable::_desc = this;
  TemplateTable::_masm = masm;
  // code generation
  _gen(_arg);
  masm->flush();
}
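For reference, the fields that generate() relies on (_gen, _arg and so on) live on the Template itself. The following is only a simplified sketch of the class, with details trimmed; it is not the exact definition from templateTable.hpp:

// Simplified sketch of the Template class (see templateTable.hpp for the real definition).
class Template {
 private:
  int      _flags;            // the ubcp/disp/clvm/iswd flags passed to def()
  TosState _tos_in;           // required top-of-stack state on entry
  TosState _tos_out;          // top-of-stack state produced on exit
  void   (*_gen)(int arg);    // generator function: emits this bytecode's machine code
  int      _arg;              // extra argument for the generator (e.g. -1 for iconst_m1)

 public:
  void initialize(int flags, TosState tos_in, TosState tos_out,
                  void (*gen)(int arg), int arg) {
    _flags = flags; _tos_in = tos_in; _tos_out = tos_out; _gen = gen; _arg = arg;
  }
  void generate(InterpreterMacroAssembler* masm);   // calls _gen(_arg), as shown above
};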

2: Creation of the bytecode dispatch table

Generating the machine instructions is not the end of the story, because we still need to record their entry addresses. At the end of set_entry_points(), an EntryPoint is created to record the entry addresses of the generated machine code, and this EntryPoint is stored in the Interpreter::_normal_table table, indexed by the bytecode. Note that because the bytecode values themselves count up from 0 (_nop = 0, _aconst_null = 1, ...), the bytecode can be used directly as the index.

  // set entry points
  EntryPoint entry(bep, zep, cep, sep, aep, iep, lep, fep, dep, vep);
  Interpreter::_normal_table.set_entry(code, entry);
  Interpreter::_wentry_point[code] = wep;

The EntryPoint constructor is defined as follows:

EntryPoint::EntryPoint(address bentry, address zentry, address centry, address sentry, address aentry, address ientry, address lentry, address fentry, address dentry, address ventry) {
  assert(number_of_states == 10, "check the code below");
  _entry[btos] = bentry;
  _entry[ztos] = zentry;
  _entry[ctos] = centry;
  _entry[stos] = sentry;
  _entry[atos] = aentry;
  _entry[itos] = ientry;
  _entry[ltos] = lentry;
  _entry[ftos] = fentry;
  _entry[dtos] = dentry;
  _entry[vtos] = ventry;
}

You will see a lot of btos, ztos and the like here: these are TosState (top-of-stack state) values. A TosState describes the type of the value currently cached at the top of the operand stack; a different top-of-stack type enters through a different entry. This is the top-of-stack caching technique (see the article "Top-of-Stack Caching Technology" for details). For now, just remember that an EntryPoint records the machine-code entry addresses for one bytecode.
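The dispatch table that these EntryPoints are copied into is, conceptually, a two-dimensional array of code addresses indexed by top-of-stack state and by bytecode. The sketch below is a simplified rendering of that idea, not the exact class from templateInterpreter.hpp:

// Simplified sketch of the DispatchTable (the real class lives in templateInterpreter.hpp).
class DispatchTable {
 public:
  enum { length = 1 << BitsPerByte };           // 256 possible bytecode values
 private:
  address _table[number_of_states][length];     // one entry address per (TosState, bytecode)
 public:
  // Copy one bytecode's EntryPoint (one address per TosState) into column i.
  void set_entry(int i, EntryPoint& entry) {
    for (int state = 0; state < number_of_states; state++)
      _table[state][i] = entry.entry((TosState)state);
  }
  // The per-state row handed to dispatch_base(): an array indexed directly by bytecode.
  address* table_for(TosState state) { return _table[state]; }
};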

3: The fetch-and-execute process

Have you ever wondered how the CPU manages to execute instructions one after another? Is there some central manager that keeps handing it the next instruction to run? In fact, once a code segment is loaded into memory it occupies a contiguous region, with the instructions laid out one after another. The CPU uses the CS:IP register pair to record the address of the current instruction; because the instructions are contiguous, when one instruction finishes, the CPU simply offsets by the length of the current instruction to obtain the address of the next one, loads it into IP, and keeps fetching.

HotSpot borrows this idea: at the end of the machine code generated for each bytecode instruction, it appends logic that jumps to the next instruction. This way, once the current bytecode has finished its own work, it automatically fetches the bytecode that follows it in the method body and starts executing that.
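To see how "each handler ends by dispatching to the next opcode" differs from a central loop, here is a minimal stand-alone sketch using a table of handler functions indexed by opcode. It is ordinary C++, not HotSpot code; HotSpot does the equivalent in generated machine code, ending each handler with a jmp rather than a call.

#include <cstdint>
#include <cstdio>

using Handler = void (*)(const uint8_t* pc, int* stack, int* sp);

static Handler dispatch_table[256];              // analogous to HotSpot's DispatchTable

static void dispatch(const uint8_t* pc, int* stack, int* sp) {
  dispatch_table[*pc](pc, stack, sp);            // jump to the handler of the opcode at pc
}

static void op_push1(const uint8_t* pc, int* stack, int* sp) {
  stack[(*sp)++] = 1;                            // do this opcode's own work ...
  dispatch(pc + 1, stack, sp);                   // ... then dispatch to the next opcode
}

static void op_halt(const uint8_t*, int* stack, int* sp) {
  std::printf("top of stack: %d\n", stack[*sp - 1]);
}

int main() {
  dispatch_table[0x01] = op_push1;
  dispatch_table[0x00] = op_halt;
  const uint8_t code[] = {0x01, 0x01, 0x00};     // push1, push1, halt
  int stack[16];
  int sp = 0;
  dispatch(code, stack, &sp);                    // prints "top of stack: 1"
  return 0;
}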

Let's go back to generate_and_dispatch(), the function that generates the machine code for a bytecode:

void TemplateInterpreterGenerator::generate_and_dispatch(Template* t, TosState tos_out) {
  ......
  // generate template
  t->generate(_masm);
  // advance
  if (t->does_dispatch()) {
#ifdef ASSERT
    // make sure execution doesn't go beyond this point if code is broken
    __ should_not_reach_here();
#endif // ASSERT
  } else {
    // dispatch to next bytecode
    __ dispatch_epilog(tos_out, step);
  }
}

Things do not stop right after t->generate(_masm); a dispatch is appended as well. __ dispatch_epilog(tos_out, step) is emitted to move on to the next instruction, and dispatch_epilog() in turn calls dispatch_next():

void InterpreterMacroAssembler::dispatch_next(TosState state, int step) {
  load_unsigned_byte(rbx, Address(_bcp_register, step));
  // advance _bcp_register
  increment(_bcp_register, step);
  dispatch_base(state, Interpreter::dispatch_table(state));
}

load_unsigned_byte() computes the address of the next bytecode from the current bytecode address plus an offset, reads the bytecode at that address, and puts it into the rbx register. _bcp_register is the rsi register; HotSpot uses rsi to hold the address of the current bytecode.

Once the fetch is done, increment(_bcp_register, step) advances the rsi register so that it points to the address of the next instruction.

dispatch_base(state, Interpreter::dispatch_table(state)) then kicks off execution of the next instruction; Interpreter::dispatch_table(state) returns the dispatch table that was built earlier.

void InterpreterMacroAssembler::dispatch_base(TosState state,
                                              address* table,
                                              bool verifyoop) {
  ......
  lea(rscratch1, ExternalAddress((address)table));
  jmp(Address(rscratch1, rbx, Address::times_8));
}

**lea(rscratch1, ExternalAddress((address)table))**: load into rscratch1 the base address of the DispatchTable, the table that stores the machine-code entry address for each bytecode.

**jmp(Address(rscratch1, rbx, Address::times_8))**: because the DispatchTable is indexed directly by the bytecode value, starting from 0, and rbx now holds the next bytecode, the entry can be located as (DispatchTable base address + rbx * size of each address). The jmp instruction then jumps straight to the machine code for that bytecode.
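Written out in C++ rather than assembly, those two instructions boil down to one array lookup followed by a jump. This is a conceptual sketch only (the names next_target, stub_a, stub_b are made up for illustration), and the 8-byte stride assumes a 64-bit build:

#include <cstdint>

using address = uint8_t*;

// table: the dispatch row for the current TosState (an array of code addresses)
// next_bytecode: the value just loaded into rbx
address next_target(address* table, uint8_t next_bytecode) {
  // lea  rscratch1, table        -> take the table's base address
  // jmp  [rscratch1 + rbx * 8]   -> index by bytecode; each slot is one 8-byte address
  return table[next_bytecode];    // equivalent to *(table + next_bytecode); then jump there
}

int main() {
  static uint8_t stub_a, stub_b;  // stand-ins for two generated code blobs
  address table[256] = {};        // hypothetical dispatch row
  table[0x15] = &stub_a;          // pretend 0x15 (iload) maps to stub_a
  table[0x60] = &stub_b;          // pretend 0x60 (iadd) maps to stub_b
  return next_target(table, 0x15) == &stub_a ? 0 : 1;
}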

Three: Summary

That wraps up the general logic of the template interpreter. It boils down to the following steps:

  1. Create a template for each bytecode;
  2. Use the template to generate the corresponding machine instructions for each bytecode;
  3. Store the address of the machine instructions generated for each bytecode in the dispatch table;
  4. At the end of the machine instructions generated for each bytecode, insert the logic that automatically jumps to the next instruction.

HotSpot really is a treasure trove. Studying HotSpot is not just about demystifying the virtual machine; more importantly, it is about learning its ideas so that we can put them to use ourselves!

reference:"Uncovering Java Virtual Machine: JVM Design Principles and Implementation"
