When you start learning Java systematically, the JVM is an unavoidable topic. Which ~country have I not visited~ object is not created by the JVM? In my previous studies I ran into many JVM-related topics, but my understanding stayed superficial and I never studied them in depth. This blog documents my process of learning the JVM. A small portion of the content is excerpted from the internet; most of it is drawn from various technical blogs and from "Understanding the Java Virtual Machine", retold through my own understanding.
Java Class Loading Process#
When a program needs a class during execution, if the class is not found in memory, the JVM will load the class into memory for the program's use through three steps: loading, linking, and initialization.
Loading#
The first step is loading the class. The virtual machine uses a ClassLoader to turn a bytecode file (.class) into a Class object in memory. The bytecode can come from a local .class file, a file inside a jar package, or a byte stream provided by a remote server; in essence it is just a byte[]. The common class loaders are BootStrapClassLoader, ExtensionClassLoader, and AppClassLoader, each responsible for loading classes from a different location.
- BootStrapClassLoader is responsible for loading the core classes of the JVM, located in JAVA_HOME/lib/rt.jar (on JDK 8 and earlier). Typically, classes under java.* are loaded by it.
- ExtensionClassLoader is responsible for loading extension classes, located in JAVA_HOME/lib/ext/*.jar.
- AppClassLoader is responsible for loading the classes we write ourselves: the jar packages on the ClassPath and the class files we compile.
- Custom class loaders are responsible for loading special classes, such as encrypted bytecode transmitted over the network, which needs a custom class loader to decrypt and load it. A custom loader can simply extend the URLClassLoader class.
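To see the loaders described above on a running JVM, the following minimal sketch walks the parent chain of the system class loader. The class name LoaderChainDemo and the helper chainOf are made up for illustration. The printed loader class names depend on the JDK version (on JDK 9 and later the extension loader is replaced by a platform class loader), so no exact output is claimed here.

```java
import java.util.ArrayList;
import java.util.List;

class LoaderChainDemo {
    /** Walks getParent() from the given loader up to the bootstrap loader, which Java reports as null. */
    static List<String> chainOf(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (ClassLoader cl = loader; cl != null; cl = cl.getParent()) {
            names.add(cl.getClass().getName());
        }
        names.add("bootstrap (null)");
        return names;
    }

    public static void main(String[] args) {
        // Core classes such as String are loaded by the bootstrap loader, so this prints null.
        System.out.println(String.class.getClassLoader());
        // Application classes start at the system (application) class loader.
        for (String name : chainOf(ClassLoader.getSystemClassLoader())) {
            System.out.println(name);
        }
    }
}
```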
Parent Delegation Model#
Let's look at the loadClass() method of the ClassLoader class from the Java source code, along with my clumsy translation of its comments.
    protected Class<?> loadClass(String name, boolean resolve)
        throws ClassNotFoundException
    {
        synchronized (getClassLoadingLock(name)) {
            // First, check if the class has already been loaded
            Class<?> c = findLoadedClass(name);
            // If not loaded
            if (c == null) {
                long t0 = System.nanoTime();
                try {
                    // If the parent class loader is not null
                    if (parent != null) {
                        // Load by the parent class loader,
                        // the key point of the parent delegation model
                        c = parent.loadClass(name, false);
                    } else {
                        // If the parent class loader is null, the top has been reached:
                        // let BootStrapClassLoader load the class
                        c = findBootstrapClassOrNull(name);
                    }
                } catch (ClassNotFoundException e) {
                    // ClassNotFoundException thrown if class not found
                    // from the non-null parent class loader
                }
                if (c == null) {
                    // If still not found, then invoke findClass in order
                    // to find the class.
                    long t1 = System.nanoTime();
                    // Custom loaders override findClass; the class is actually loaded here
                    c = findClass(name);
                    // this is the defining class loader; record the stats
                    sun.misc.PerfCounter.getParentDelegationTime().addTime(t1 - t0);
                    sun.misc.PerfCounter.getFindClassTime().addElapsedTimeFrom(t1);
                    sun.misc.PerfCounter.getFindClasses().increment();
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
Understanding this code makes the parent delegation model clear. When loading a class, a loader first checks whether the class has already been loaded, to prevent duplicate loading. If it has not, the loader delegates to its parent class loader. The call is recursive: the parent performs the same already-loaded check and, if necessary, delegates further upwards. When the top is reached and there is no parent class loader, the request is handed to BootStrapClassLoader. If BootStrapClassLoader can load the class, it is returned; if not, control returns downwards and each "child" tries to load the class itself with findClass() until one succeeds. If in the end the AppClassLoader or the custom loader still cannot load it, a ClassNotFoundException is thrown. Two points deserve attention. First, each non-null parent loader that fails throws ClassNotFoundException, but the exception is caught and deliberately ignored so that the child gets its turn. Second, the default findClass() in ClassLoader is not implemented and simply throws ClassNotFoundException, so a custom class loader is expected to override it; ExtensionClassLoader and AppClassLoader throw ClassNotFoundException from findClass() when they cannot find the class.
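The delegation order described above can be observed with a small experiment. This is only a sketch: RecordingLoader is a hypothetical loader that defines no classes and merely records which names reach its findClass(), and com.example.NoSuchClass is assumed not to exist on the class path.

```java
import java.util.ArrayList;
import java.util.List;

class DelegationDemo {
    /** Hypothetical loader: it defines nothing itself and only records which names reach findClass(). */
    static class RecordingLoader extends ClassLoader {
        final List<String> findClassCalls = new ArrayList<>();

        RecordingLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            findClassCalls.add(name); // reached only after every parent has failed
            throw new ClassNotFoundException(name);
        }
    }

    /** Returns true if loading the name fell through to RecordingLoader.findClass(). */
    static boolean reachesFindClass(String className) {
        RecordingLoader loader = new RecordingLoader(ClassLoader.getSystemClassLoader());
        try {
            loader.loadClass(className);
        } catch (ClassNotFoundException e) {
            // expected for names that no loader in the chain can find
        }
        return loader.findClassCalls.contains(className);
    }

    public static void main(String[] args) {
        System.out.println(reachesFindClass("java.lang.String"));        // false: a parent loaded it
        System.out.println(reachesFindClass("com.example.NoSuchClass")); // true: all parents failed first
    }
}
```

Because loadClass() is inherited unchanged, the custom loader only sees requests that the whole parent chain could not satisfy, which is why overriding findClass() rather than loadClass() keeps the delegation model intact.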
Linking#
The second step is linking. Linking consists of three steps: verification, preparation, and resolution.
Verification#
Verification is the first step of linking, ensuring that the information contained in the byte stream of the class file meets the requirements of the current virtual machine and does not compromise the security of the virtual machine. This is especially important for classes not written by oneself, such as those transmitted over the network, requiring a validation check.
- File format verification
- Metadata verification
- Bytecode verification
- Symbol reference verification
Preparation#
After the bytecode file has been verified, the JVM begins to allocate memory for class variables and give them initial values. Two key points need to be noted here:
- What gets memory: Java fields can be classified as "class variables" and "instance variables". "Class variables" are fields modified by static; all other fields are "instance variables". During the preparation phase the JVM allocates memory only for class variables. Memory for instance variables is allocated later, together with the object, when the object is instantiated.
- What kind of initialization: the initial value assigned during the preparation phase is the zero value of the data type, not the value written in user code.

    public static int x = 666;       // At this point x is only assigned 0; 666 is assigned during initialization
    public final static int y = 666; // A compile-time constant is assigned 666 right here
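The zero value assigned during preparation can be made visible. In this minimal sketch (the class and member names are invented for illustration), a static initializer calls a method that reads x before the assignment x = 666 has run:

```java
class PreparationDemo {
    // Static initializers run in textual order during initialization.
    static int seenBeforeAssignment = peek(); // runs before "x = 666" below
    static int x = 666;

    static int peek() {
        return x; // still holds the zero value given in the preparation phase
    }

    public static void main(String[] args) {
        System.out.println(seenBeforeAssignment); // 0
        System.out.println(x);                    // 666
    }
}
```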
Resolution#
The resolution phase is the process by which the virtual machine replaces the symbolic references in the constant pool with direct references. Resolution mainly targets seven kinds of symbolic references: class or interface, field, class method, interface method, method type, method handle, and call site specifier, corresponding to the seven constant types in the constant pool: CONSTANT_Class_info, CONSTANT_Fieldref_info, CONSTANT_Methodref_info, CONSTANT_InterfaceMethodref_info, CONSTANT_MethodType_info, CONSTANT_MethodHandle_info, and CONSTANT_InvokeDynamic_info. The difference between symbolic references and direct references is as follows:
- Symbolic Reference: A symbolic reference describes the referenced target with a set of symbols, which can be any form of literal, as long as it can unambiguously locate the target when used. The target of a symbolic reference does not necessarily have to be loaded into memory yet.
- Direct Reference: A direct reference can be a pointer directly to the target, a relative offset, or a handle that can indirectly locate the target. Direct references depend on the memory layout of the virtual machine implementation, so the same symbolic reference generally translates into different direct references on different virtual machine instances. If a direct reference exists, the referenced target must already exist in memory.
Initialization#
At this step, the code we wrote truly begins to execute. The JVM initializes the class according to the order of statement execution. From the code perspective, the initialization phase is the process of executing the class constructor <clinit>(). The virtual machine specification strictly defines only five situations in which a class must be initialized immediately:
- When encountering one of the four bytecode instructions new, getstatic, putstatic, or invokestatic, if the class has not been initialized, it must be initialized first. Common scenarios include creating an object with new, reading or writing a static field, and calling a static method. (Note that an array-creation instruction only triggers initialization of the array type itself, not of its element type: for example, new String[] directly triggers initialization only of the String[] class, that is, the class [Ljava.lang.String;, and does not directly trigger initialization of the String class.)
- When using methods from the java.lang.reflect package to reflectively invoke a class, if the class has not been initialized, its initialization must be triggered first.
- When initializing a class, if it is found that its parent class has not been initialized, it needs to trigger the initialization of its parent class first.
- When the virtual machine starts, the user needs to specify a main class (the class containing the main() method) to execute, and the virtual machine will first initialize this main class.
- When using JDK 1.7 dynamic language support, if a java.lang.invoke.MethodHandle instance's final resolution result is REF_getstatic, REF_putstatic, or REF_invokeStatic, and the class corresponding to this method handle has not been initialized, it needs to trigger its initialization first.
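A few of the cases above can be checked with a small sketch. All class names here (InitLog, InitParent, InitChild, InitTriggerDemo) are invented for illustration. It assumes javac's usual behavior of inlining compile-time constants, so that reading InitParent.CONSTANT emits no getstatic instruction.

```java
import java.util.ArrayList;
import java.util.List;

class InitLog {
    static final List<String> EVENTS = new ArrayList<>();
}

class InitParent {
    static final int CONSTANT = 42; // compile-time constant, inlined into the caller
    static { InitLog.EVENTS.add("Parent"); }
}

class InitChild extends InitParent {
    static int value = 2; // ordinary static field, read with getstatic
    static { InitLog.EVENTS.add("Child"); }
}

class InitTriggerDemo {
    // Captured once, the first time this demo class itself is initialized.
    static final String FIRST_RUN = run();

    private static String run() {
        int constant = InitParent.CONSTANT;   // does not initialize InitParent
        InitChild[] array = new InitChild[2]; // creates only the array class, InitChild stays uninitialized
        String before = InitLog.EVENTS.toString();
        int value = InitChild.value;          // getstatic: InitParent is initialized first, then InitChild
        String after = InitLog.EVENTS.toString();
        return before + " -> " + after;
    }

    public static void main(String[] args) {
        System.out.println(FIRST_RUN); // [] -> [Parent, Child]
    }
}
```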
The class initialization phase is the last step of the class loading process. In the previous class loading process, except for the loading phase where user applications can participate through custom class loaders, all other actions are completely dominated and controlled by the virtual machine. Only in the initialization phase does the execution of the Java program code (bytecode) defined in the class truly begin.
In the preparation phase, variables have already been assigned the initial value (zero value) required by the system; in the initialization phase, class variables and other resources are initialized according to the plan the programmer expressed in the code. Put more directly, the initialization phase is the process of executing the class constructor <clinit>(). The <clinit>() method is generated automatically by the compiler, which collects all assignments to class variables and the statements in the static blocks (static {}) of the class; the order of collection is the order in which the statements appear in the source file. A static block can only read variables defined before it; variables defined after it can be assigned in the static block but cannot be read.
    public class Test {
        static {
            i = 0;
            System.out.println(i); // Error: Cannot reference a field before it is defined (illegal forward reference)
        }
        static int i = 1;
    }
The class constructor <clinit>() is different from the instance constructor <init>(): it does not need to be called explicitly, and the virtual machine ensures that the parent class's <clinit>() has finished before the subclass's <clinit>() is executed. Because the parent's <clinit>() runs first, the static blocks and static variable assignments defined in the parent class take effect before those of the subclass. The <clinit>() method is also not mandatory for a class or interface; if a class has no static blocks and no assignments to class variables, the compiler may not generate <clinit>() for it.
The virtual machine ensures that a class's <clinit>() is correctly locked and synchronized in a multithreaded environment. If multiple threads attempt to initialize a class simultaneously, only one thread executes the class's <clinit>(); the other threads block and wait until the active thread finishes executing it. It is particularly important to note that although the other threads are blocked, once the thread executing <clinit>() exits, they will not enter <clinit>() again after being awakened, because under the same class loader a type is initialized only once.
We know that in Java, creating an object often goes through the following steps: parent class's <clinit>() -> subclass's <clinit>() -> parent class member variables and instance code blocks -> parent class constructor -> subclass member variables and instance code blocks -> subclass constructor. The two <clinit>() steps happen only the first time the classes are used.
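The creation order above can be verified with a short sketch. The class names (OrderLog, OrderBase, OrderDerived, CreationOrderDemo) are invented for illustration; each static block, instance block, and constructor simply appends a label to a shared list.

```java
import java.util.ArrayList;
import java.util.List;

class OrderLog {
    static final List<String> STEPS = new ArrayList<>();
}

class OrderBase {
    static { OrderLog.STEPS.add("Base static block"); }
    { OrderLog.STEPS.add("Base instance block"); }
    OrderBase() { OrderLog.STEPS.add("Base constructor"); }
}

class OrderDerived extends OrderBase {
    static { OrderLog.STEPS.add("Derived static block"); }
    { OrderLog.STEPS.add("Derived instance block"); }
    OrderDerived() { OrderLog.STEPS.add("Derived constructor"); }
}

class CreationOrderDemo {
    // Captured once, when the first OrderDerived instance is created.
    static final String FIRST_CREATION = record();

    private static String record() {
        new OrderDerived();
        return String.join(" -> ", OrderLog.STEPS);
    }

    public static void main(String[] args) {
        // Base static block -> Derived static block -> Base instance block
        //   -> Base constructor -> Derived instance block -> Derived constructor
        System.out.println(FIRST_CREATION);
    }
}
```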
The above content is excerpted from "Understanding the Java Virtual Machine" by Zhou Zhiming.
JVM Runtime Data Area#
The memory area diagram before JDK 8 is as follows:
The memory area after JDK 8:
Next, let's analyze each data area based on the above diagrams and why they have changed.
- Program Counter: A very small, thread-private area. It records the position of the code a thread is executing, so that the thread can resume at the right place after a context switch; because each thread needs its own position, every thread has its own program counter. At any given moment a thread executes the code of only one method. While a Java method is executing, the program counter holds the address of the bytecode instruction currently being executed; while a native method is executing, its value is undefined. It is similar to the program counter register of a CPU.
- Java Virtual Machine Stack: Also thread-private. It describes the memory model of Java method execution: each method call creates a stack frame that stores the local variable table, the operand stack, dynamic linking information, the method return address, and other information. Calling a method and returning from it correspond to pushing and popping a stack frame. Two error conditions are defined for the virtual machine stack: if the stack depth requested by a thread exceeds the depth the virtual machine allows, a StackOverflowError is thrown; if the virtual machine stack can expand dynamically (the specification allows either a fixed-size or a dynamically expanding stack) and cannot obtain enough memory while expanding, an OutOfMemoryError is thrown.
- Native Method Stack: Similar to the virtual machine stack, it serves the native methods used by the virtual machine; in the HotSpot JVM, the Java virtual machine stack and the native method stack are combined into one.
- Java Heap: Unlike the stack, the Java heap is a shared area among all threads, where almost all object instances are allocated memory. Therefore, in most cases, the heap is also the largest memory area. From the perspective of memory recovery, since most collectors now adopt a generational collection algorithm, the Java heap can also be subdivided into: young generation and old generation; further detailed into Eden space, From Survivor space, To Survivor space, etc. GC will be discussed separately later.
- Method Area: Shared among all threads. It stores class information, constants, static variables, and code compiled by the just-in-time compiler for the classes the virtual machine has loaded. The JVM specification describes it as a logical part of the Java heap, but it also has the alias Non-Heap to distinguish it from the Java heap. In HotSpot before JDK 8 the method area was implemented by the permanent generation (PermGen), which is why the two names are often used interchangeably. Because the permanent generation was prone to memory leaks and to java.lang.OutOfMemoryError: PermGen space, the HotSpot developers planned to remove it starting from JDK 1.7: in JDK 1.7 string constants, symbolic references, and similar data were moved out of the permanent generation, and in Java 8 the permanent generation was removed completely and replaced by metaspace.
- Runtime Constant Pool: Part of the method area. Besides the class version, fields, methods, interfaces, and other descriptive information, a class file also contains a constant pool table that stores the literals and symbolic references generated at compile time. This content enters the runtime constant pool of the method area after the class is loaded. Constants can also be added at run time, for example through the intern() method of the String class.
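Two of the areas above can be observed directly from Java code. The sketch below (class and method names are invented for illustration) recurses until the thread's stack is exhausted, and checks that intern() returns the pooled copy of a string literal. The depth reached depends on the JVM and its stack size setting, so no specific number is claimed.

```java
class RuntimeAreaDemo {
    private int depth = 0;

    private void recurse() {
        depth++;
        recurse(); // each call pushes another stack frame
    }

    /** Recurses until the thread's stack is exhausted and returns the depth reached. */
    static int depthBeforeOverflow() {
        RuntimeAreaDemo demo = new RuntimeAreaDemo();
        try {
            demo.recurse();
        } catch (StackOverflowError e) {
            // the virtual machine stack of this thread is full
        }
        return demo.depth;
    }

    /** intern() returns the canonical pooled copy of a string. */
    static boolean internReturnsPooledCopy() {
        String onHeap = new String("jvm-notes"); // a fresh object, distinct from the literal
        return onHeap != "jvm-notes" && onHeap.intern() == "jvm-notes";
    }

    public static void main(String[] args) {
        System.out.println("frames before StackOverflowError: " + depthBeforeOverflow());
        System.out.println(internReturnsPooledCopy()); // true
    }
}
```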
https://www.cnblogs.com/czwbig/p/11127124.html
https://www.sczyh30.com/posts/Java/jvm-memory/
GC#
Garbage Collection, or GC. One of the main differences between Java and C++ is that Java has a garbage collection mechanism. Generally, programmers do not need to manually release memory.
There are mainly two approaches to garbage collection:
- Reference Counting: The idea is to give each object a reference counter, incrementing it by 1 each time the object is referenced and decrementing it by 1 when a reference goes away. When the counter reaches 0, the object is no longer in use and can be garbage collected. However, this approach has a serious flaw: it cannot handle circular references. For example, if A references B and B references A, neither counter ever reaches 0 even when both objects are no longer useful, so the GC cannot reclaim them and the memory is wasted.
- Reachability Analysis: This approach treats a group of objects as root nodes and traverses the references starting from them. Any object that is still unreachable after the traversal is not being used and can be garbage collected. The JVM's GC uses this approach.
So, which objects can serve as root nodes (GC Roots)?
- Live threads.
- Objects referenced in the local variable table of the virtual machine stack.
- Objects referenced by static fields and constants in the method area.
- Objects referenced by JNI (native methods) in the native method stack.
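Why reachability analysis copes with circular references can be shown on a toy object graph. This is only a model with invented names, not how a real collector is implemented: objects are strings, references are map entries, and marking is a simple traversal from the roots.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ReachabilityDemo {
    /** Returns every node reachable from the roots by following references. */
    static Set<String> reachable(Map<String, List<String>> refs, List<String> roots) {
        Set<String> marked = new TreeSet<>();
        Deque<String> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            String node = pending.pop();
            if (marked.add(node)) { // visit each object only once
                pending.addAll(refs.getOrDefault(node, Collections.<String>emptyList()));
            }
        }
        return marked;
    }

    /** ROOT -> X -> Y, plus an isolated cycle A <-> B. */
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("ROOT", Arrays.asList("X"));
        refs.put("X", Arrays.asList("Y"));
        refs.put("A", Arrays.asList("B")); // A and B reference each other,
        refs.put("B", Arrays.asList("A")); // but nothing reachable points at them
        return refs;
    }

    public static void main(String[] args) {
        System.out.println(reachable(sampleGraph(), Arrays.asList("ROOT"))); // [ROOT, X, Y]
    }
}
```

A and B keep each other's reference count above zero, yet neither is marked, so a tracing collector would reclaim both.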
Four Types of References in Java | Strong, Weak, Soft, Phantom#
StrongReference
A strong reference is the most common type of reference. As long as a strong reference to an object exists, the GC will not collect that object.
SoftReference
SoftReference is used to describe useful but non-essential objects. If an object has only a soft reference, the garbage collector will not collect it as long as there is enough memory; if memory becomes insufficient, it will collect the memory of these objects. As long as the garbage collector has not collected it, the object can be used by the program. Soft references can be used in conjunction with a reference queue (ReferenceQueue); if the object referenced by the soft reference is collected by the garbage collector, the JVM will add this soft reference to the associated reference queue. Soft references can be used to implement memory-sensitive caches.
WeakReference
WeakReference is a type of reference with a shorter lifecycle than a soft reference. Once a GC scan finds an object that has only weak references, the object is collected regardless of whether there is enough memory. However, since the GC thread has a low priority, it may not discover these weakly referenced objects quickly. Weak references can also be used in conjunction with a reference queue.
WeakReference is often used in Android.
PhantomReference
PhantomReference is different from the other three references: it does not affect the lifecycle of an object, and the object cannot be obtained through it. If an object has only phantom references, it is as if it had no references at all and can be collected by the garbage collector at any time. Phantom references are mainly used to track the activity of objects being collected by the garbage collector and must be used in conjunction with a reference queue.
~To be honest, I still don't understand the purpose of this design and have not used it.~
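The difference in what get() returns can be checked without forcing a GC. In this minimal sketch (class and method names are invented for illustration) the object stays strongly reachable the whole time, so only the API behavior is shown, not the collection timing.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

class ReferenceKindsDemo {
    /** Reports what each reference type returns while the referent is still strongly reachable. */
    static String observe() {
        Object strong = new Object(); // the strong reference keeps the object alive
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        SoftReference<Object> soft = new SoftReference<>(strong);
        WeakReference<Object> weak = new WeakReference<>(strong, queue);
        PhantomReference<Object> phantom = new PhantomReference<>(strong, queue);

        boolean enqueued = queue.poll() != null;     // nothing is enqueued while the object is alive
        boolean softSees = soft.get() == strong;     // soft references hand back the object
        boolean weakSees = weak.get() == strong;     // so do weak references
        boolean phantomSees = phantom.get() != null; // phantom references never do
        return "soft=" + softSees + " weak=" + weakSees
                + " phantom=" + phantomSees + " enqueued=" + enqueued;
    }

    public static void main(String[] args) {
        System.out.println(observe()); // soft=true weak=true phantom=false enqueued=false
    }
}
```

Since a phantom reference never returns its referent, the only way to learn anything from it is to wait for it to appear in the reference queue after the object has been collected, which is why the queue is mandatory.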
GC Algorithms#
There are mainly three GC algorithms: MS (Mark-Sweep), MC (Mark-Compact), and Copying.
- MS Algorithm: Mark-Sweep, the basic algorithm. It uses reachability analysis to mark the objects that can be collected and then sweeps them away. It is mostly used in the old generation. It has two main drawbacks:
  - Marking and sweeping are both inefficient.
  - It easily produces memory fragmentation; when a large object needs to be allocated and no contiguous block is big enough, another GC has to be triggered early.
- MC Algorithm: Mark-Compact. After marking via reachability analysis, it moves the surviving objects to one end of the memory and then reclaims everything beyond that boundary; because objects move, references to them must be updated. It is mostly used in the old generation.
- Copying Algorithm: Memory is divided into two blocks and only one block is used at a time. During garbage collection, all surviving objects are copied to the other block, after which the previously used block can be cleared entirely. This algorithm is used in the young generation.
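The fragmentation difference between Mark-Sweep and Mark-Compact can be illustrated on a toy heap. This sketch is purely a model with invented names: the heap is a string in which each letter is one object occupying one slot and '.' is a free slot.

```java
class SweepCompactDemo {
    static final char FREE = '.';

    /** Mark-Sweep: free the dead slots in place; survivors stay where they are. */
    static String sweep(String heap, String live) {
        StringBuilder out = new StringBuilder();
        for (char slot : heap.toCharArray()) {
            out.append(live.indexOf(slot) >= 0 ? slot : FREE);
        }
        return out.toString();
    }

    /** Mark-Compact: slide the survivors to one end so the free space is one block. */
    static String compact(String heap, String live) {
        StringBuilder out = new StringBuilder();
        for (char slot : heap.toCharArray()) {
            if (live.indexOf(slot) >= 0) {
                out.append(slot);
            }
        }
        while (out.length() < heap.length()) {
            out.append(FREE);
        }
        return out.toString();
    }

    /** Length of the largest contiguous run of free slots. */
    static int largestFreeRun(String heap) {
        int best = 0;
        int current = 0;
        for (char slot : heap.toCharArray()) {
            current = (slot == FREE) ? current + 1 : 0;
            best = Math.max(best, current);
        }
        return best;
    }

    public static void main(String[] args) {
        String swept = sweep("ABCDEF", "ACE");
        String compacted = compact("ABCDEF", "ACE");
        System.out.println(swept + " largest free run = " + largestFreeRun(swept));         // A.C.E. largest free run = 1
        System.out.println(compacted + " largest free run = " + largestFreeRun(compacted)); // ACE... largest free run = 3
    }
}
```

Both results have three free slots, but only the compacted heap could hold an object that needs three contiguous slots.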
Current JVM's GC generally uses generational collection, combining several garbage collection algorithms.
Next, let's analyze the generational model of JVM memory:
From the diagram, we can see that the JVM divides memory into the young generation, the old generation, and the permanent generation (the method area, removed in JDK 8). The young generation is further divided into one Eden space and two Survivor spaces. An object that survives 15 GC cycles in the young generation (the default threshold) is moved to the old generation, and large objects can be allocated directly in the old generation; both behaviors are configurable. So why one Eden and two Survivors? The main reason is to avoid memory fragmentation. At any time only Eden and one Survivor (S0) are in use. During a GC, the surviving objects in Eden and S0 are copied to S1, and then Eden and S0 are cleared in one go. Finally, S1 and S0 logically swap roles. In HotSpot the default ratio of Eden to the two Survivors is 8:2 (that is, 8:1:1), so 90% of the young generation is usable at any time.
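The 90% figure follows from simple arithmetic, sketched below with invented names. The survivorRatio parameter mirrors HotSpot's -XX:SurvivorRatio option (Eden size divided by the size of one Survivor); the 100 MB young generation is just an assumed example, and a real JVM may round or adapt the actual sizes.

```java
class YoungGenLayoutDemo {
    /** Size of one Survivor space when Eden : Survivor = survivorRatio : 1. */
    static long survivorSize(long youngGen, int survivorRatio) {
        return youngGen / (survivorRatio + 2); // Eden counts for survivorRatio parts, plus two Survivors
    }

    static long edenSize(long youngGen, int survivorRatio) {
        return youngGen - 2 * survivorSize(youngGen, survivorRatio);
    }

    /** Fraction of the young generation usable at once: Eden plus one Survivor. */
    static double usableFraction(long youngGen, int survivorRatio) {
        long usable = edenSize(youngGen, survivorRatio) + survivorSize(youngGen, survivorRatio);
        return (double) usable / youngGen;
    }

    public static void main(String[] args) {
        long youngGenMb = 100; // assumed example size
        System.out.println("Eden = " + edenSize(youngGenMb, 8) + " MB");         // Eden = 80 MB
        System.out.println("Survivor = " + survivorSize(youngGenMb, 8) + " MB"); // Survivor = 10 MB
        System.out.println("usable = " + usableFraction(youngGenMb, 8));         // usable = 0.9
    }
}
```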
PS: During a Full GC, the JVM safely pauses all executing threads (Stop The World) to reclaim memory. During this time everything in the Java program except garbage collection is halted, which results in significant slowdowns or freezes in system response. ("Understanding the Java Virtual Machine", Zhou Zhiming)
GC Implementation – Garbage Collectors#
- Young Generation Collectors: Serial, ParNew, Parallel Scavenge
- Old Generation Collectors: Serial Old, Parallel Old, CMS
- Whole Heap Collection: G1
First, note that the JVM has two modes: Server (heavyweight) and Client (lightweight).
Serial: Single-threaded, simple and efficient. Stop The World. Used by the JVM in Client mode.
ParNew: The multi-threaded version of Serial, collecting in parallel with multiple threads. Used in Server mode.
Parallel Scavenge: Closely related to throughput, and also known as the throughput-first collector. Its goal is to achieve a controllable throughput. Parallel and multi-threaded.
Serial Old: The old generation version of Serial, using the Mark-Compact algorithm.
Parallel Old: The old generation version of Parallel Scavenge. Multi-threaded, using the Mark-Compact algorithm. Suitable for scenarios that emphasize throughput and are sensitive to CPU resources.
CMS Collector: Concurrent Mark Sweep, a collector that aims for the shortest collection pause time. Its memory recovery process runs concurrently with user threads. It is the first truly concurrent collector in the HotSpot virtual machine, allowing garbage collection threads and user threads to work (almost) simultaneously.
G1 Collector: Garbage-First is a garbage collector aimed at server applications. For details, please refer to "Jvm Garbage Collector (Final Edition)".
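To check which of these collectors a given JVM is actually running, the standard java.lang.management API can be queried. This is a minimal sketch with an invented class name; the names it prints depend on the JVM version and on the collector selected at startup (for example with HotSpot options such as -XX:+UseSerialGC or -XX:+UseG1GC), so no specific output is claimed.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class ActiveCollectorsDemo {
    /** Returns the names of the garbage collectors registered in this JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Typically one entry for the young generation and one for the old generation.
        for (String name : collectorNames()) {
            System.out.println(name);
        }
    }
}
```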