VMware

Fernando Granha Jeronimo

November 29, 2012

VMware - Instituto de Computação

Plan

1 Introduction: VMware, VMM

2 Workstation: I/O

3 ESX Server: Ballooning

4 Hardware Assist: Intel VT-x, Memory Management Virtualization

Introduction

VMware

The importance of the hypervisor:

Under the trap-and-emulate mindset, x86 virtualization was considered impossible

In 1998, VMware was founded by a group of highly skilled professionals

Thanks to a favorable combination of circumstances (x86 processing power had grown and most servers were underutilized), the company has greatly succeeded

VMware has a market share of 80%

Data center virtualization (vMotion is a key component)

Virtualization is the basis of cloud computing

VMMs/hypervisors are becoming commodities, and the focus is now on the management stack

VMM

Virtual Machine Monitor

Category

System virtual machine from x86 to x86.

Groundbreaking

There was a general misunderstanding about whether x86 could be virtualized. The prevailing mindset was that a virtualizable architecture must be able to run the guest operating system at a privilege level lower than the VMM's, so that behaviour- and control-sensitive instructions would generate a trap and be emulated. In fact, trap-and-emulate is only one way of meeting the Popek and Goldberg virtualization criteria.

Question

VMware runs all guest software in a deprivileged mode, so how can it ensure that instructions such as popf, which do not trap in user mode, keep their semantics?

Answer

VMware achieves this through dynamic binary translation (DBT). When the translator encounters an instruction such as popf, the translated code either calls an emulation routine or inlines it. This technique is useful not only for non-virtualizable instructions but also for instructions that would trap, since trap handling carries a major performance penalty on out-of-order architectures.
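To illustrate why popf needs emulation: when executed without sufficient privilege, popf leaves the interrupt flag (IF) unchanged instead of trapping. The sketch below contrasts that behaviour with an emulation routine that translated code could call. The `VirtualCPU` class and both function names are hypothetical, and only the IF bit (bit 9 of EFLAGS) is modelled.

```python
IF_MASK = 1 << 9   # interrupt-enable flag, bit 9 of EFLAGS
RESERVED = 0x2     # bit 1 of EFLAGS always reads as 1


class VirtualCPU:
    """Hypothetical per-guest CPU state kept by the VMM."""

    def __init__(self):
        self.eflags = RESERVED
        self.stack = []   # simplified guest stack


def native_user_mode_popf(cpu):
    """Deprivileged hardware behaviour: IF is silently preserved, no trap."""
    value = cpu.stack.pop()
    preserved_if = cpu.eflags & IF_MASK
    cpu.eflags = (value & ~IF_MASK) | preserved_if | RESERVED


def emulate_popf(cpu):
    """Emulation routine: the virtual IF follows the value the guest popped."""
    value = cpu.stack.pop()
    cpu.eflags = value | RESERVED
```

With the native version, a guest kernel that tries to re-enable interrupts through popf would silently fail; the emulation routine keeps the virtual IF consistent with the guest's intent.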

Question

One important requirement of Popek and Goldberg is that most instructions run natively without any modification. Is it necessary to translate all guest code?

Answer

No. Only the guest operating system (code meant to run at CPL=0) needs to be translated, and it represents a small part of the executed code.

How is the translation done?

As is usual for a DBT, translation is done on demand, which avoids the problem of telling code and data apart.

1 The translator starts at the current source PC and decodes up to 12 instructions, stopping earlier if it finds a control-flow instruction such as a call, jump, or branch

2 These instructions form the translation unit (TU), which is later translated to an intermediate representation

3 Finally, compiled code fragments (CCFs) are generated
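The first step above can be sketched as follows. This is a minimal illustration, not VMware's translator: the `code` dictionary stands in for an instruction decoder, the mnemonic set is simplified, and including the terminating control-flow instruction as the last element of the unit is an assumption.

```python
MAX_TU_INSTRUCTIONS = 12
CONTROL_FLOW = {"call", "jmp", "jcc", "ret"}   # simplified mnemonic set


def form_translation_unit(code, pc):
    """Collect instructions from pc until the limit or a control-flow instruction.

    `code` maps a guest address to a (mnemonic, length) pair.
    """
    unit = []
    while len(unit) < MAX_TU_INSTRUCTIONS and pc in code:
        mnemonic, length = code[pc]
        unit.append((pc, mnemonic))
        if mnemonic in CONTROL_FLOW:
            break              # the terminator closes the unit
        pc += length
    return unit
```

Because decoding only ever starts from a PC that the guest is about to execute, bytes that are data are never decoded as instructions.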

Even in system code, most instructions get what are called IDENT translations: they are copied without modification. The following modifications are desirable or mandatory:

PC-relative instructions: as in other DBTs, the translated code goes to a translation cache (TC), which changes the original code layout

Direct control flow: same reason as for PC-relative instructions

Indirect control flow: needs a hash lookup

Non-virtualizable instructions: replacing these instructions with emulation routines is mandatory for correct execution

Privileged instructions: since the guest OS is deprivileged, such instructions will trap, causing a performance hit. It is therefore desirable to proactively replace them with emulation routines instead of waiting for a trap
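The hash lookup needed for indirect control flow can be pictured as below. This is a sketch under assumptions: the `TranslationCache` class, its `translate` callback, and the miss counter are hypothetical names introduced for illustration.

```python
class TranslationCache:
    """Maps guest PCs to the addresses of their translations in the TC."""

    def __init__(self, translate):
        self._translate = translate   # callback: guest PC -> TC address
        self._map = {}                # hash table: guest PC -> TC address
        self.misses = 0

    def lookup(self, guest_pc):
        """Resolve the target of an indirect jump or call at run time."""
        tc_addr = self._map.get(guest_pc)
        if tc_addr is None:
            self.misses += 1
            tc_addr = self._translate(guest_pc)   # translate on demand
            self._map[guest_pc] = tc_addr
        return tc_addr
```

Direct branches can be patched once at translation time, whereas an indirect target is only known when the instruction runs, which is why it pays for a lookup on every execution.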

Adaptive

Sometimes non-privileged instructions access privileged data, for example loads and stores to the page table. Because the page table is protected by the VMM, the access generates a trap and the VMM has to emulate it. As stated, traps are a major source of performance penalty, so it may be better to replace such an instruction with a call to an emulation routine. The DBT starts from the premise that every instruction is innocent, but after a few traps it adapts aggressively from innocent to guilty, and only cautiously from guilty back to innocent.
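A minimal sketch of the innocent-until-proven-guilty policy follows. The threshold value and the `AdaptiveSite` class are assumptions; the slides only say "a few traps".

```python
TRAP_THRESHOLD = 3   # assumed value for "a few traps"


class AdaptiveSite:
    """Tracks one guest instruction that may touch VMM-protected data."""

    def __init__(self):
        self.traps = 0
        self.guilty = False

    def record_trap(self):
        """Called by the VMM each time this instruction traps."""
        self.traps += 1
        if self.traps >= TRAP_THRESHOLD:
            self.guilty = True   # retranslate with an emulation call

    def translation(self):
        return "call_emulation_routine" if self.guilty else "ident"
```

The IDENT translation stays in place as long as the instruction behaves, so instructions that never touch protected data pay nothing.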

The VMware DBT approach satisfies the Popek and Goldberg virtualization criteria:

Efficiency: all user-mode code, which represents the majority of guest code, runs directly without intervention

Resource Control: the guest runs in a deprivileged state (CPL=3) and, as a result, has no power to change system resources

Equivalence: all semantics are preserved by the emulation routines, and self-modifying code is supported. VMware argues that trap-and-emulate is just one implementation that satisfies the virtualizability condition.

VMM uses segmentation to protect itself.

Most operating systems use paging (segmentation is rarely used)

The VMM is placed in the upper 4 MB of the address space and needs to be protected.

The code in the TC must be accessible, which is achieved by letting cs span the whole address space. However, it is important to prevent guest writes from reaching VMM space, so all segments are truncated except gs, which the VMM uses to access its own data. This forces non-IDENT translations for instructions that use gs.
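The effect of segment truncation can be sketched as a limit check. The sketch assumes a 32-bit address space and flat segments with base 0; the function name and the string segment identifiers are hypothetical.

```python
ADDRESS_SPACE = 1 << 32
VMM_SIZE = 4 << 20                        # upper 4 MB reserved for the VMM
GUEST_LIMIT = ADDRESS_SPACE - VMM_SIZE    # first address that belongs to the VMM


def access_allowed(segment, offset):
    """Return True if a data access through `segment` passes the limit check."""
    if segment == "gs":              # untruncated, reserved for the VMM's own data
        return 0 <= offset < ADDRESS_SPACE
    return 0 <= offset < GUEST_LIMIT   # truncated segments cannot reach the VMM
```

The hardware performs this check on every access at no extra cost, which is what makes segmentation attractive for protecting the VMM.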

Memory Management - Shadow Page Tables

Guest OS: gVA ⇒ gPA

VMM: gPA ⇒ hPA

Shadow: gVA ⇒ hPA
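The shadow mapping is the composition of the two other mappings. A minimal sketch at page granularity, with dictionaries standing in for page tables (the function name is hypothetical):

```python
def build_shadow(guest_pt, vmm_map):
    """Compose gVA -> gPA (guest page table) with gPA -> hPA (VMM map)."""
    shadow = {}
    for gva_page, gpa_page in guest_pt.items():
        if gpa_page in vmm_map:               # skip pages the VMM has not backed
            shadow[gva_page] = vmm_map[gpa_page]
    return shadow
```

The hardware MMU walks only the shadow table, so the VMM must keep it in sync whenever the guest edits its own page table.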

Feature Summary

Binary

Dynamic

On demand

Subsetting

Do not optimize

Chaining

Adaptive

Two modes: BT (kernel mode) and direct execution (user mode)

Translation Example

Workstation

VMware Workstation

In the beginning, VMware was a start-up trying to launch a virtualization technology in a new market: commodity hardware. In those early days, two requirements were very important:

As a new technology, it could not force users to replace their OS

Implementing and maintaining the myriad of PC device drivers would not be feasible

In the first version the target audience was mostly programmers

Given these requirements, the hosted architecture was the best alternative

Hosted Architecture

vmApp: runs in ring 3 (user space) and is responsible for issuing syscalls on behalf of the VMM to access host devices

VMM: runs in ring 0 and is responsible for exposing a uniform virtual hardware layer to the VM

vmDriver: runs in ring 0 and is responsible for the communication between the vmApp and the VMM

I/O

How is the I/O done?

The VMM exposes well-supported standard devices to the VM. For instance, it emulates the AMD Lance NIC.

The VMM is aware of the semantics of each I/O port

The VM uses the IN/OUT instructions, and the VMM translates them into requests to the vmApp, which turns them into system calls in the host operating system (e.g. an IN can become a read)
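The path from a guest port access to a host system call can be sketched as below. All names are hypothetical: `port_owner` models the VMM's knowledge of port semantics, and the `vmapp` callback stands in for the world switch plus the host system call. Returning 0xFF for reads from unclaimed ports is an assumption of this sketch.

```python
def handle_port_access(port, direction, value, port_owner, vmapp):
    """Turn a guest IN/OUT on `port` into a request to the vmApp.

    port_owner maps a port number to the virtual device that owns it;
    vmapp(device, port, direction, value) returns the data read, if any.
    """
    device = port_owner.get(port)
    if device is None:
        # Assumption: unclaimed ports read as all ones, writes are dropped.
        return 0xFF if direction == "in" else None
    return vmapp(device, port, direction, value)
```

Every call to `vmapp` here represents an expensive round trip to the host world, which is what motivates the optimizations discussed later in the deck.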

Network I/O path

Sources of overhead

Because the VMM runs in ring 0 and is not part of the host OS, switching between the two is more expensive than an ordinary context switch. It is called a world switch, since privileged state must also be saved

The VMM has delegated device handling to the host OS, so if an interrupt arrives while the VMM is running it cannot process it. It must make a world switch to the host OS, which processes the interrupt before the vmApp does. For instance, if the vmApp has read a new packet, another world switch must take place so that the VMM can deliver the packet to a VM

Native workloads that were I/O bound can become CPU bound in a virtualized environment due to the extra processing

Improvements

Handle all possible IN/OUTs in the VMM: a significant share of IN and OUT instructions do not require contact with the external world; some ports act merely as latches

Send Combine: when the system is experiencing a high rate of world switches, packets are not sent as soon as possible; instead, up to three are queued in a ring buffer. Since the system is switching to the host world frequently anyway, the packets do not wait long before being sent

Remove select: the vmApp uses select to listen for changes, which unfortunately is expensive. Instead, a shared-memory bitmap is used to communicate IRQs between the actual driver and the vmApp
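The send-combining idea can be sketched as a small queue. The class name, the `world_switch_send` callback, and the rate threshold are assumptions made for illustration; only the limit of three queued packets comes from the slide.

```python
class SendCombiner:
    MAX_QUEUED = 3   # "queued up to three", per the slide

    def __init__(self, world_switch_send, high_rate_threshold):
        self.send = world_switch_send          # callable(packets): one world switch
        self.threshold = high_rate_threshold   # assumed knob: switches/s counted as high
        self.queue = []

    def transmit(self, packet, world_switch_rate):
        if world_switch_rate < self.threshold:
            self.send(self.queue + [packet])   # low rate: send immediately
            self.queue = []
            return
        self.queue.append(packet)              # high rate: piggyback on a later switch
        if len(self.queue) >= self.MAX_QUEUED:
            self.flush()

    def flush(self):
        """Called when the queue fills or when a world switch happens anyway."""
        if self.queue:
            self.send(self.queue)
            self.queue = []
```

Batching trades a small amount of latency for fewer world switches, which only pays off when switches are already frequent.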

Performance measurements after improvements

ESX Server

Native Virtual Machine

Targeting a new and more important market, the commodity server market, VMware created a native virtualization system. With this solution it had to write its own drivers. Luckily, in the server environment only a limited number of devices are certified to run.

Hypervisor vs. VMM

Hypervisor: responsible for multiplexing host system resources and providing policies, such as scheduling. It is built around a vmkernel, which resembles an operating system specially tailored to virtualization

VMM: responsible for creating the virtual hardware layer presented to the VM. Its goal is to provide mechanism, and each VM requires a separate VMM

Ballooning

Motivation

The hypervisor is capable of overcommitting its memory, allowing more guests to run on a single host

The hypervisor must be able to reclaim memory and give it to higher-priority VMs

It could invalidate shadow page table entries and reuse the freed host physical pages (hPP); however, it does not have the information needed to choose which pages the guest can best do without

Balloon Device

A pseudo-device that is installed in the guest OS

It communicates with the VMM through a channel

When memory is needed, the hypervisor inflates the balloon: the device requests pages from the guest OS, forcing its allocation algorithm to run, and these pages are pinned

The balloon informs the hypervisor which pages it managed to allocate. The hypervisor can then use these pages as free pages
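The inflate step can be sketched with a toy guest. `GuestOS` and `BalloonDriver` are hypothetical classes; a real guest may have to page out to its own swap before it can satisfy the allocation, which this sketch does not model.

```python
class GuestOS:
    """Toy guest with a pool of free guest-physical page numbers."""

    def __init__(self, free_pages):
        self.free_pages = list(free_pages)

    def allocate_pinned_page(self):
        """Hand out one free page, or None when the pool is empty."""
        return self.free_pages.pop() if self.free_pages else None


class BalloonDriver:
    def __init__(self, guest):
        self.guest = guest
        self.held = []   # pages currently inside the balloon

    def inflate(self, target_pages):
        """Allocate up to target_pages and report which pages were obtained."""
        obtained = []
        for _ in range(target_pages):
            page = self.guest.allocate_pinned_page()
            if page is None:
                break
            obtained.append(page)
        self.held.extend(obtained)
        return obtained   # the hypervisor may reclaim the backing host pages
```

The key design point is that the guest's own allocator decides which pages to give up, so the hypervisor never has to guess which guest pages are least valuable.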

Hardware Assist

Intel VT-x

Vanderpool Technology

In 2005, Intel launched Vanderpool Technology, VT-x, to address the classical problem of x86 not being virtualizable

With the new virtual machine extensions (VMX), the processor can be virtualized without resorting to dynamic translation or any guest code modification, and the classical trap-and-emulate architecture becomes possible

AMD also created the Pacifica technology to address the same problem, and it resembles VT-x.

Intel VT-x - General Architecture

The processor is multiplexed between two modes, VMX root and VMX non-root, and all rings are available in both. As a result, the guest operating system can run in ring 0 of non-root mode.

Basically, the VMM configures under which circumstances execution must leave non-root mode (a VM exit) and which automated actions to perform when handling certain sensitive operations. When an exit condition is met, the processor stores information describing the exit, so that the VMM can decode it and take the appropriate action. This is exactly the trap-and-emulate architecture.
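The resulting control flow is a simple loop in the VMM. In this sketch `vm_enter` is a hypothetical stand-in for entering the guest (VMLAUNCH or VMRESUME on real hardware), and the exit reasons are illustrative strings rather than architectural codes.

```python
def run_guest(vm_enter, handlers, max_exits):
    """Enter the guest repeatedly, dispatching each exit to its handler.

    vm_enter() returns a (reason, info) pair describing the exit.
    """
    handled = []
    for _ in range(max_exits):
        reason, info = vm_enter()
        if reason == "halt":
            break
        handlers[reason](info)   # emulate, then loop back into the guest
        handled.append(reason)
    return handled
```

Compared with a binary translator, all the complexity of finding sensitive instructions moves into the hardware; the VMM only has to emulate what the exit describes.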

It is important to note that this first generation of hardware assist did not address the problem of memory management.

Intel VT-x - VMCS

The workflow is much simpler than in a sophisticated DBT. The VMM creates a structure called the VMCS (Virtual Machine Control Structure), which is responsible for:

storing guest state

configuring automated actions to be performed without leaving non-root mode (e.g. applying an offset to the TSC)

configuring the conditions for entering and exiting non-root mode

storing meaningful exit information. For an I/O instruction, it holds the port number, the width, and the direction of the access.
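As an example of the exit information, the sketch below decodes the three I/O fields mentioned above from an exit qualification value. It assumes the bit layout documented in the Intel SDM for I/O-instruction exits (size minus one in bits 2:0, direction in bit 3, port in bits 31:16) and should be checked against the manual; the function name is hypothetical and the remaining bits are ignored.

```python
def decode_io_exit(qualification):
    """Decode port, width, and direction from an I/O exit qualification."""
    return {
        "width_bytes": (qualification & 0x7) + 1,              # bits 2:0 hold size - 1
        "direction": "in" if qualification & 0x8 else "out",   # bit 3
        "port": (qualification >> 16) & 0xFFFF,                # bits 31:16
    }
```

With these fields the VMM can emulate the access directly, without having to fetch and decode the guest instruction.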

A 4 KB region must be allocated for the VMCS, and it must be explicitly activated with a VMPTRLD instruction. Each virtual processor has its own VMCS, and a logical processor has at most one current VMCS at a time.

Most of the VMCS layout is implementation-dependent and is not part of the architecture. For this reason it must be accessed not with regular loads and stores but with the VMREAD and VMWRITE instructions.

Figure: VMCS snippet

Intel VT-x - VM Entry and VM Exit

The first time the VMM executes a VM, it must use the VMLAUNCH instruction; afterwards it can simply use VMRESUME.

Performance Analysis

Figure: SPECint 2000 and SPECjbb 2005

Figure: Virtualization nanobenchmarks

Memory Management Virtualization

Intel VT - RVI and EPT

Second Generation

The second generation of hardware assist addressed the memory management problem, the major remaining source of virtualization overhead. Intel created Extended Page Tables (EPT) and AMD created Rapid Virtualization Indexing (RVI) around 2007/08.

Figure: Background on 32-bit paging

Figure: Background on 64-bit paging

Figure: Nested paging

Quadratic worst case: O(l1 · l2)

Needs CPU hardware assist: cannot be used with DBT (“a major VMware sorrow source”)

Benefits: Intel claims that EPT can make virtualization 20% faster on average
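The quadratic bound can be made concrete by counting memory references in a two-dimensional page walk. The sketch reads l1 and l2 as the depths of the guest and nested page tables, which is an assumption, and it ignores TLBs and page-walk caches. The function name is hypothetical.

```python
def nested_walk_references(guest_levels, host_levels):
    """Count memory references for one worst-case nested page walk."""
    refs = 0
    for _ in range(guest_levels):
        refs += host_levels   # translate this guest table's gPA via the nested tables
        refs += 1             # read the guest page-table entry itself
    refs += host_levels       # translate the final gPA of the data page
    return refs
```

For four-level guest and nested tables this gives 4·4 + 4 + 4 = 24 references, compared with 4 for a native walk, which is why a TLB miss is so much more expensive under nested paging.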

Figure: Kernel Microbenchmarks

Figure: Apache compilation (MMU-intensive)

Figure: SPECjbb2005 (stress TLB)

That’s all!

Thanks!

Questions?
