Commit 7580bea8 authored by Florent Gluck

Initial commit

# Welcome to the "Advanced Systems Virtualization (jour)" course!
In this repository you can find:
- The slides explaining the theoretical contents of the course
- The course's practical labs
- Miscellaneous resources/information related to the course's topic
---
author: Florent Gluck - Florent.Gluck@hesge.ch
title: Advanced Systems Virtualization
date: \vspace{.5cm} \footnotesize \today
pandoc-latex-fontsize:
- classes: [tiny]
size: tiny
- classes: [verysmall]
size: scriptsize
- classes: [small]
size: footnotesize
- classes: [huge, important]
size: huge
---
[//]: # ----------------------------------------------------------------
## Course's resources
- Course's portal on Cyberlearn
- [\textcolor{myblue}{https://cyberlearn.hes-so.ch/course/view.php?id=14955}](https://cyberlearn.hes-so.ch/course/view.php?id=14955)
- enrollment key: "asv24"
\vspace{.2cm}
- Course's material on git
\vspace{.2cm}
- Course's chat on Mattermost
[//]: # ----------------------------------------------------------------
## Goals
::: incremental
- Acquire a better understanding of how hypervisors work
- Be able to implement a simple hypervisor featuring emulated and paravirtualized devices
- Study, summarize and present a research article about virtualization
:::
[//]: # ----------------------------------------------------------------
## Topics
::: incremental
- Platform virtualization reminder
- KVM API
- How to use KVM to implement a hypervisor from scratch
- Device paravirtualization using hypercalls
- Device emulation through state machines
- Performance analysis between emulation and paravirtualization
- Study and summary of a research article
:::
[//]: # ----------------------------------------------------------------
## Work method
::: incremental
- **Be present** and **pay attention**: information will often be given "on the spot"
- missing classes means missing explanations and useful advice!
- **Take notes** as slides are **incomplete**
- Work on a **regular basis**: last minute work won't cut it \frownie{}
- **Be proactive**: please **ask questions**, there are no dumb questions \smiley{}
- no questions = I assume everything is understood...
- Don't **blindly** copy/paste code found elsewhere (hello stackoverflow and chatgpt!)
- the goal is that **you understand** what you're doing!
:::
[//]: # ----------------------------------------------------------------
## Labs
- Labs are not directly graded
- Labs help you to:
- **truly understand** the course's concepts
- **improve** your programming skills
- **succeed** in your live programming exams
- \textcolor{myred}{\textbf{Failing} to complete the labs will almost certainly mean failing the class! \frownie{}}
[//]: # ----------------------------------------------------------------
## Grading
- Evaluation:
- Practical exam (40-50%)
- Theory exam (30-40%)
- Article presentation (20%)
[//]: # ----------------------------------------------------------------
## Questions
\centering
![](images/questions.png){ width=80% }
SRCS=$(wildcard *.md)
PDFS=$(SRCS:%.md=%.pdf)
UID=$(shell id -u)
GID=$(shell id -g)
all: $(PDFS)
%.pdf: %.md
	docker run --user $(UID):$(GID) --rm --mount type=bind,src="$(PWD)",dst=/src thxbb12/md2pdf build_slides $<
clean:
	rm -f $(PDFS)
---
author: Florent Gluck - Florent.Gluck@hesge.ch
title: Platform Virtualization - reminder
date: \vspace{.5cm} \footnotesize \today
pandoc-latex-fontsize:
- classes: [tiny]
size: tiny
- classes: [verysmall]
size: scriptsize
- classes: [small]
size: footnotesize
- classes: [huge, important]
size: huge
---
[//]: # ----------------------------------------------------------------
## What is Platform virtualization?
:::::: {.columns}
::: {.column width="50%"}
\small
- Virtualization of a **whole hardware platform** $\rightarrow$ allows concurrent execution of multiple OS on the same physical machine (host system)
- **Virtual machine (VM)**, also called guest domain = efficient, isolated duplicate of the real physical machine
- A VM is supported by a virtualization layer = **virtual machine monitor (VMM)** or **hypervisor**
:::
::: {.column width="50%"}
\vspace{0.5cm}
\centering
![](images/platform_virt.png){ width=100% }
\small
\vspace{.3cm}
- The OS running in the VM is called the **Guest OS**
:::
::::::
[//]: # ----------------------------------------------------------------
## Platform virtualization
- Sometimes called "hardware virtualization"
- Type of virtualization that **virtualizes a whole machine**
- Three main components must be virtualized:
- CPU
- memory (MMU - Memory Management Unit)
- devices (also called Input/Output or I/O): hard drive, disk controllers, display, mouse, keyboard, etc.
[//]: # ----------------------------------------------------------------
# CPU virtualization
[//]: # ----------------------------------------------------------------
## CPU virtualization techniques
The CPU can be virtualized using 4 different techniques:
- Full virtualization using Trap-and-Emulate (historical)
- Full virtualization using Binary Translation
```{.verysmall}
qemu-system-x86_64 ...
```
- Hardware-assisted full virtualization
```{.verysmall}
qemu-system-x86_64 -enable-kvm ...
```
- Paravirtualization
[//]: # ----------------------------------------------------------------
## CPU full virtualization: hardware-assisted
:::::: {.columns}
::: {.column width="45%"}
\footnotesize
- Also called "Accelerated Virtualization" and "Hardware Virtual Machine" (HVM)
- Exists since the release of Intel VT-x & AMD-V Pacifica in 2005:
- \footnotesize solves issue with the 17 "problem" instructions
- adds new modes: \textcolor{myred}{root\textsuperscript{$\star$}}/\textcolor{mygreen}{non-root}
- VMM runs in \textcolor{myred}{root} mode
- Guest OS runs in \textcolor{mygreen}{non-root} mode
:::
::: {.column width="55%"}
\vspace{0.5cm}
\centering
![](images/hardware_assisted_virt.png){ width=100% }
:::
::::::
\vfill
\textcolor{myred}{\textsuperscript{$\star$}}\scriptsize Completely unrelated to root user in Linux/UNIX!
[//]: # ----------------------------------------------------------------
## CPU hardware-assisted virtualization, root/non-root modes
:::::: {.columns}
::: {.column width="53%"}
\small
- Guest OS runs in \textcolor{mygreen}{non-root} mode:
- \footnotesize ring 3: user applications
- ring 0: OS
- VMM runs in \textcolor{myred}{root} mode:
- \footnotesize ring 0: VMM
\vspace{.2cm}
- In \textcolor{mygreen}{non-root} mode, certain privileged operations cause traps (\textcolor{myorange}{VMexits}) $\rightarrow$ trigger switch to \textcolor{myred}{root} mode (VMM)
:::
::: {.column width="47%"}
\centering
![](images/vmentry_vmexit.png){ width=100% }
:::
::::::
[//]: # ----------------------------------------------------------------
## CPU hardware-assisted virtualization: pros and cons
- \textcolor{mygreen}{Pros}
- guest OS kernel's source code does not need to be modified
- guest OS can run on real hardware
- much more **efficient** than Binary Translation thanks to dedicated hardware instructions
\vspace{.2cm}
- \textcolor{myred}{Cons}
- only available if CPU implements the dedicated hardware instructions
[//]: # ----------------------------------------------------------------
# Device virtualization
[//]: # ----------------------------------------------------------------
## Device virtualization techniques
Devices can be virtualized using 4 techniques:
- \textcolor{myblue}{Full virtualization using emulation}
```{.verysmall}
qemu-system-x86_64 -drive file=disk.qcow,index=0,media=disk,format=qcow2 ...
```
- \textcolor{myblue}{Paravirtualization}
```{.verysmall}
qemu-system-x86_64 -drive file=disk.qcow,index=0,media=disk,format=qcow2,if=virtio ...
```
- Hardware-assisted full virtualization (using VT-d hardware)
- Passthrough
[//]: # ----------------------------------------------------------------
## Device virtualization: full virtualization using emulation
:::::: {.columns}
::: {.column width="68%"}
\footnotesize
- VM **presents a "real" device** to the guest OS
- Guest OS must have drivers for the real device
- VMM intercepts all device accesses
- VMM **emulates** a real device that's likely **not physically present** on the host
- **\textcolor{mygreen}{Pros}**
- \footnotesize VM decoupled from physical device
- VM migration
- device sharing
- guest OS can run on real hardware (provided it has the required drivers)
- **\textcolor{myred}{Cons}**
- \footnotesize emulating a real device can be complex
- low performance due to lots of VM exits
:::
::: {.column width="32%"}
\centering
![](images/device_emul.png){ width=100% }
:::
::::::
[//]: # ----------------------------------------------------------------
## Device virtualization: paravirtualization
:::::: {.columns}
::: {.column width="68%"}
\footnotesize
- VM **presents a virtual device** to the guest OS
- Guest OS must have driver for the virtual device
- Driver **much simpler** than for a real device
- Driver uses the virtual device's API to control it
- \scriptsize driver uses **hypercalls and shared memory** to communicate with VMM
- **simple and highly efficient**
- **\textcolor{mygreen}{Pros}**
- \scriptsize VM decoupled from physical device
- VM migration
- device sharing
- no need to emulate a real device
- easy to implement & high performance
- **\textcolor{myred}{Cons}**
- \scriptsize guest OS requires specific driver
- guest OS cannot run on real hardware
:::
::: {.column width="32%"}
\centering
![](images/device_paravirt.png){ width=100% }
:::
::::::
[//]: # ----------------------------------------------------------------
## Platform virtualization nowadays
Nowadays, VMMs that implement platform virtualization use a combination of virtualization types:
- **Hardware-assisted** full virtualization for CPU and devices
- **Paravirtualization** for devices
- typically for performance-critical devices, such as disk and network
- **Full virtualization** (emulation) for devices
- typically used for better compatibility: when guest OS lacks paravirtualized drivers
[//]: # ----------------------------------------------------------------
# Hypervisor examples
[//]: # ----------------------------------------------------------------
## KVM + QEMU
:::::: {.columns}
::: {.column width="50%"}
\small
- KVM means "Kernel-based Virtual Machine"
- KVM is a Linux kernel module
- \footnotesize adds virtualization capabilities (API) to the Linux kernel
- Linux kernel provides hardware management + runs regular Linux applications
- VM support is provided by QEMU, which uses the KVM API
- First released in 2006
:::
::: {.column width="50%"}
\centering
![](images/kvm.png){ width=100% }
:::
::::::
[//]: # ----------------------------------------------------------------
## Resources
\scriptsize
- [\textcolor{myblue}{"Bringing Virtualization to the x86 Architecture with the Original VMware Workstation"}](https://infoscience.epfl.ch/record/183742); E. Bugnion, S. Devine, M. Rosenblum, J. Sugerman, E. Wang; ACM Transactions on Computer Systems, 2012\
- [\textcolor{myblue}{"Virtual Machine Monitors"}](https://pages.cs.wisc.edu/~remzi/OSTEP/vmm-intro.pdf) from "Operating Systems: Three Easy Pieces"; Remzi H. and Andrea C. Arpaci-Dusseau; Arpaci-Dusseau Books\
- "Hardware and Software Support for Virtualization"; E. Bugnion, J. Nieh, D. Tsafrir; Morgan & Claypool Publishers, 2017
- "Virtual Machines: Versatile Platforms for Systems and Processes"; J. Smith, R. Nair; Morgan Kaufmann, 2005
- [\textcolor{myblue}{Understanding Full Virtualization, Paravirtualization, and Hardware Assist}](https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/VMware_paravirtualization.pdf), VMware White Paper, 2007
---
author: Florent Gluck - Florent.Gluck@hesge.ch
title: KVM
date: \vspace{.5cm} \footnotesize \today
pandoc-latex-fontsize:
- classes: [tiny]
size: tiny
- classes: [verysmall]
size: scriptsize
- classes: [small]
size: footnotesize
- classes: [huge, important]
size: huge
---
[//]: # ----------------------------------------------------------------
# Introduction to KVM
[//]: # ----------------------------------------------------------------
## What is KVM?
- **KVM** stands for **K**ernel-based **V**irtual **M**achine
- Linux kernel module providing hardware-assisted virtualization
- **Provides a virtualization API** for VMMs[^5]
- **\textcolor{myred}{Requires}** Intel VT-x or AMD-V
- Originally, KVM virtualized only CPU and memory
- devices (I/O) had to be emulated by QEMU
- Nowadays, KVM supports device virtualization (PIC and probably more)
- Being part of Linux, KVM is open-source software
[^5]: \scriptsize Reminder: a VMM (Virtual Machine Monitor) is also referred to as a hypervisor
[//]: # ----------------------------------------------------------------
## Evolution of KVM
- Introduced to make VT-x/AMD-V available to user space
- expose virtualization features securely
- interface: `/dev/kvm`
- Quickly merged into Linux mainline
- merged in late 2006, available since kernel 2.6.20 (released in February 2007)
- from first LKML posting to kernel merge: only 3 months!
- 7300 lines of C code! (as of Linux 5.8.12)
- Evolved significantly since 2006
- ported to other architectures: ARM, s390, PowerPC, IA64
- became recognized & driving part of Linux kernel
- quick support of latest virtualization features
[//]: # ----------------------------------------------------------------
## Can a host run KVM?
\small
- Check for hardware virtualization support (Intel or AMD):
```{.tiny}
$ lscpu|grep Flags|grep "vmx\|svm"
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm ...
```
- Check the `kvm` module is loaded into the kernel:
```{.verysmall}
$ lsmod|grep kvm
kvm_intel 487424 4
kvm 1437696 3 kvm_intel
irqbypass 12288 1 kvm
```
- To load a module, use the `modprobe` command (as root)
- For instance, to load the `kvm` module:
```{.verysmall}
sudo modprobe kvm
```
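The hardware check above can also be done from C. The following is a minimal sketch, assuming a Linux host that exposes CPU flags in `/proc/cpuinfo`; the function names `has_cpu_flag` and `host_has_hw_virt` are hypothetical helpers, not part of any API.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Returns true if `flag` appears as a whole word in the
// whitespace-separated list `flags`.
bool has_cpu_flag(const char *flags, const char *flag) {
    size_t len = strlen(flag);
    if (len == 0) return false;
    for (const char *p = strstr(flags, flag); p != NULL; p = strstr(p + 1, flag)) {
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\t' || p[len] == '\n';
        if (starts && ends) return true;
    }
    return false;
}

// Returns true if a "flags" line of /proc/cpuinfo lists
// vmx (Intel VT-x) or svm (AMD-V).
bool host_has_hw_virt(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) return false;
    char line[8192];
    bool found = false;
    while (!found && fgets(line, sizeof(line), f)) {
        if (strncmp(line, "flags", 5) == 0)
            found = has_cpu_flag(line, "vmx") || has_cpu_flag(line, "svm");
    }
    fclose(f);
    return found;
}
```

Matching whole words avoids false positives on flags that merely contain `vmx` or `svm` as a substring.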
[//]: # ----------------------------------------------------------------
## Accessing KVM
- To access the KVM device, `/dev/kvm`, one **must** either (depends on the Linux distro):
- be in the `kvm` [\textcolor{myblue}{group}](https://linuxize.com/post/how-to-add-user-to-group-in-linux/)[^1]
- have the proper [\textcolor{myblue}{ACL permissions}](https://www.redhat.com/sysadmin/linux-access-control-lists)[^2]
\vspace{.3cm}
- Typical examples of KVM API use:
- QEMU when launched with `-enable-kvm`
- any other Linux-based VMM using KVM
- any application using `/dev/kvm`, typically a VMM
[^1]: \scriptsize [\textcolor{myblue}{https://linuxize.com/post/how-to-add-user-to-group-in-linux/}](https://linuxize.com/post/how-to-add-user-to-group-in-linux/)
[^2]: \scriptsize [\textcolor{myblue}{https://www.redhat.com/sysadmin/linux-access-control-lists}](https://www.redhat.com/sysadmin/linux-access-control-lists)
[//]: # ----------------------------------------------------------------
## KVM model (1/2)
- A VMM is just a **regular user process** that uses the KVM API to:
- create an empty VM "object"
- create one or more virtual CPUs (vCPUs) for the VM
- define the virtual machine (VM) address space
- run the vCPUs[^9] and handle VMexits
- Any Linux user process can create VMs
- all they need is access to `/dev/kvm`
[^9]: \scriptsize vCPUs are **mapped** to Linux kernel threads
[//]: # ----------------------------------------------------------------
## KVM model (2/2)
\centering
![](images/kvm_model.png){ width=80% }
[//]: # ----------------------------------------------------------------
## Architectural benefits of KVM model
- **Proximity of guest and user space VMM**
- both run in user space $\rightarrow$ lighter context switch
- only one address space switch: guest $\leftrightarrow$ host
- **Massive Linux kernel reuse**
- memory management, scheduler
- I/O stacks, power management, host CPU hot-plugging, etc.
- drivers
- **Massive Linux user space reuse**
- plethora of user libraries
- networking, files, etc.
- tracing, debugging
[//]: # ----------------------------------------------------------------
# KVM API - overview
[//]: # ----------------------------------------------------------------
## Overview
- Device `/dev/kvm` provides access to the KVM API
- `kvm` module must be loaded!
- Requests performed through `ioctl` calls on file descriptors
- Provide 3 types of resources, each accessed by a dedicated file descriptor:
- **kvm**: used for: API version, VM creation
- **VM**: used for: vCPU creation, memory mappings, hardware interrupts
- **vCPU**: used for: read/write vCPU registers, vCPU execution
[//]: # ----------------------------------------------------------------
## KVM workflow from VMM's perspective
::: incremental :::
1. Create a KVM device
1. Create a VM
1. Allocate RAM for the VM
1. Map allocated RAM into the VM's address space
1. Load guest OS binary blob into the VM's RAM
1. Create a vCPU and initialize its registers
1. Run the vCPU on the guest OS' code until a `VMexit` is triggered
1. Handle `VMexits` to either:
- handle a **hypercall** request from the guest OS (paravirtualization)
- **emulate** the guest OS' expected behavior (emulation)
1. Resume vCPU execution in (7)
:::
[//]: # ----------------------------------------------------------------
## (1) Create a KVM device
- To obtain a file descriptor on the kvm device (here, `kvmfd`) and check the stable version of the API is available:
\vspace{.3cm}
```{.c .verysmall}
int kvmfd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
if (kvmfd < 0) err(1, "%s", "/dev/kvm");
int version = ioctl(kvmfd, KVM_GET_API_VERSION, NULL);
if (version < 0) err(1, "KVM_GET_API_VERSION");
if (version != KVM_API_VERSION) errx(1, "Unsupported version of the KVM API"); // errx: errno is not meaningful here
```
[//]: # ----------------------------------------------------------------
## (2) Create a VM
- To obtain a file descriptor (here, `vmfd`) on a newly created VM:
```{.c .verysmall}
int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
if (vmfd < 0) err(1, "KVM_CREATE_VM");
```
- This file descriptor allows us to:
- define the VM's memory address space
- create and associate vCPUs
- create and handle hardware interrupts
[//]: # ----------------------------------------------------------------
## Example of VM memory mapping
![](images/kvm_memory_mapping.png){ width=100% }
[//]: # ----------------------------------------------------------------
## (3) Allocate RAM for the VM
- Memory allocated for the guest **must** be:
- aligned to a page boundary (4KB)
- a multiple of a page size (4KB)
- `malloc` does not fulfill these requirements
- Instead, we must use `mmap` to allocate memory (pages):
\vspace{.3cm}
```{.c .verysmall}
// Alloc 4KB for the guest
u_int ram_size = 4096;
uint8_t *mem = mmap(NULL, ram_size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
if (mem == MAP_FAILED) err(1, "Allocating guest memory"); // mmap returns MAP_FAILED, not NULL
```
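When the desired guest RAM size is not already a multiple of the page size, it can be rounded up before calling `mmap`. A minimal sketch, assuming a Linux host; `round_up_to_page` and `alloc_guest_ram` are hypothetical helper names, and the page size is queried with `sysconf` rather than hard-coded.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

// Rounds `size` up to the next multiple of `page_size`
// (`page_size` must be a power of two).
size_t round_up_to_page(size_t size, size_t page_size) {
    return (size + page_size - 1) & ~(page_size - 1);
}

// Allocates zero-filled, page-aligned memory of at least `size` bytes.
// On success, returns the mapping and stores its real size in *actual_size.
// On failure, returns NULL.
uint8_t *alloc_guest_ram(size_t size, size_t *actual_size) {
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = round_up_to_page(size, page_size);
    void *mem = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return NULL;
    *actual_size = rounded;
    return mem;
}
```

Memory returned by `mmap` is always aligned to a page boundary, so both requirements listed above are met.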
[//]: # ----------------------------------------------------------------
## (4) Map allocated RAM into the VM's address space
- **Define where** in the VM's physical address space the memory is mapped
- The code below maps `mem` at physical address 0 in the VM (`.guest_phys_addr = 0`)
- Each memory mapping (*region* in KVM lingo) must be associated with a **different slot**, here 0 (`.slot = 0`)
\vspace{.3cm}
```{.c .tiny}
struct kvm_userspace_memory_region memreg = {
.slot = 0,
.guest_phys_addr = 0, // MUST be aligned to a page boundary (4KB)
.memory_size = ram_size, // MUST be a multiple of a page size (4KB)
.userspace_addr = (uint64_t)mem,
.flags = 0
};
if (ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg) < 0) err(1, "KVM_SET_USER_MEMORY_REGION");
```
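Since each region needs its own slot and must respect the alignment rules, it can help to validate a region before handing it to KVM. A minimal sketch, assuming 4KB pages as in the slides; `make_memory_region` and `GUEST_PAGE_SIZE` are hypothetical names, while `struct kvm_userspace_memory_region` comes from `<linux/kvm.h>`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <linux/kvm.h>

#define GUEST_PAGE_SIZE 4096u  // assumption: 4KB pages

// Fills *reg for a later KVM_SET_USER_MEMORY_REGION ioctl.
// Returns false if the size or one of the addresses is not page-aligned.
bool make_memory_region(struct kvm_userspace_memory_region *reg, uint32_t slot,
                        uint64_t guest_phys_addr, void *host_mem, uint64_t size) {
    if (size == 0 || size % GUEST_PAGE_SIZE != 0) return false;
    if (guest_phys_addr % GUEST_PAGE_SIZE != 0) return false;
    if ((uintptr_t)host_mem % GUEST_PAGE_SIZE != 0) return false;
    memset(reg, 0, sizeof(*reg));
    reg->slot = slot;
    reg->guest_phys_addr = guest_phys_addr;
    reg->memory_size = size;
    reg->userspace_addr = (uint64_t)(uintptr_t)host_mem;
    reg->flags = 0;
    return true;
}
```

A VMM with several regions would call this once per region, each time with a different `slot` value.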
<!--
[//]: # ----------------------------------------------------------------
## Read-only memory mapped into VM's address space
- Allows VMM to be notified **only** when guest attempts to write
- Reads to this area do not trigger VMexits
- Same as RAM mapping, except mapping must be **marked as read-only**
- triggers a `KVM_EXIT_MMIO` when guest writes to it!
- Below code maps `mmio` at physical address 32 MB in the guest
\vspace{.3cm}
```{.c .tiny}
struct kvm_userspace_memory_region mmioreg = {
.slot = 1,
.guest_phys_addr = 32*1024*1024, // MUST be aligned to a page boundary (4KB)
.memory_size = mmio_size,
.userspace_addr = (uint64_t)mmio,
.flags = KVM_MEM_READONLY // Mandatory for MMIO!
};
if (ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &mmioreg) < 0) err(1, "KVM_SET_USER_MEMORY_REGION");
```
-->
[//]: # ----------------------------------------------------------------
## (5) Load guest OS into VM's RAM
- Guest OS must be "loaded" (by the VMM) into the guest address space
- The simplest guest OS is a mini bare-metal OS
- VMM simply performs the following operations:
1. open the binary file generated from compiling/linking the guest OS' code + data
1. read the file's contents into the pages allocated for the guest RAM
1. easiest is to load it at address 0 in guest physical memory
- later on, we'll set the CPU instruction pointer to 0 as well
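
The loading steps above can be sketched as follows. This is a minimal sketch, assuming the guest is a flat binary loaded at guest physical address 0; `load_guest_image` is a hypothetical helper name and `ram` is the memory previously obtained with `mmap`.

```c
#include <stdint.h>
#include <stdio.h>

// Copies the file at `path` to the start of `ram` (guest physical address 0).
// Returns the number of bytes loaded, or -1 if the file cannot be read
// or does not fit into `ram_size` bytes.
long load_guest_image(const char *path, uint8_t *ram, size_t ram_size) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(ram, 1, ram_size, f);
    // If the buffer is full, check that no data is left in the file
    int too_big = (n == ram_size) && (fgetc(f) != EOF);
    int failed = ferror(f);
    fclose(f);
    return (too_big || failed) ? -1 : (long)n;
}
```

Rejecting an image that does not fit avoids silently running a truncated guest.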
[//]: # ----------------------------------------------------------------
## (6) Create a vCPU
- A vCPU is referenced through a file descriptor, `vcpufd` below
- The vCPU is represented as a memory-mapped file
- the memory-mapped area is a `kvm_run` structure, `run` below
\vspace{.3cm}
```{.c .tiny}
int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
if (vcpufd < 0) err(1, "KVM_CREATE_VCPU");
int vcpu_mmap_sz = ioctl(kvmfd, KVM_GET_VCPU_MMAP_SIZE, NULL);
if (vcpu_mmap_sz < 0) err(1, "KVM_GET_VCPU_MMAP_SIZE");
if (vcpu_mmap_sz < (int)sizeof(struct kvm_run)) errx(1, "KVM_GET_VCPU_MMAP_SIZE unexpectedly small");
struct kvm_run *run = mmap(NULL, vcpu_mmap_sz, PROT_READ | PROT_WRITE, MAP_SHARED, vcpufd, 0);
if (run == MAP_FAILED) err(1, "mmap vcpu"); // mmap returns MAP_FAILED, not NULL
```
[//]: # ----------------------------------------------------------------
## (6) Initialize vCPU registers (1/2)
- Initialize segment registers to zero (Intel/AMD specific):
\vspace{.3cm}
```{.c .verysmall}
struct kvm_sregs sregs;
if (ioctl(vcpufd, KVM_GET_SREGS, &sregs) < 0) err(1, "KVM_GET_SREGS");
sregs.cs.base = 0; sregs.cs.selector = 0;
sregs.ds.base = 0; sregs.ds.selector = 0;
sregs.es.base = 0; sregs.es.selector = 0;
sregs.ss.base = 0; sregs.ss.selector = 0;
if (ioctl(vcpufd, KVM_SET_SREGS, &sregs) < 0) err(1, "KVM_SET_SREGS");
```
[//]: # ----------------------------------------------------------------
## (6) Initialize vCPU registers (2/2)
- Initialize instruction pointer `rip` to point to the beginning of OS' code (address 0)
- Initialize stack pointer `rsp` to point to top of RAM (here, `ram_size`); reminder: stack grows downward!
- Initialize flags register `rflags` (bit 1 is reserved: must be 1)
\vspace{.3cm}
```{.c .verysmall}
struct kvm_regs regs;
memset(&regs, 0, sizeof(regs));
regs.rip = 0;
regs.rsp = ram_size;
regs.rflags = 0x2;
if (ioctl(vcpufd, KVM_SET_REGS, &regs) < 0) err(1, "KVM_SET_REGS");
```
[//]: # ----------------------------------------------------------------
## (7) Run the vCPU
- Use the `KVM_RUN` `ioctl` on the vCPU file descriptor to run it
- This `ioctl` **blocks** until the vCPU triggers a `VMexit`!
\vspace{.3cm}
```{.c .tiny}
bool done = false;
while (!done) {
// Runs the vCPU until encountering a VMexit (blocking call)
if (ioctl(vcpufd, KVM_RUN, NULL) < 0) err(1, "KVM_RUN");
switch (run->exit_reason) { // See struct kvm_run in "(6) Create a vCPU"
case KVM_EXIT_IO: // Encountered a PMIO VMexit
break; // This VMexit should be handled here...
case KVM_EXIT_MMIO: // Encountered a MMIO VMexit
break; // This VMexit should be handled here...
case KVM_EXIT_HLT: // "halt" CPU instruction
printf("Encountered HLT instruction\n");
done = true;
break;
case KVM_EXIT_SHUTDOWN:
case KVM_EXIT_FAIL_ENTRY:
case KVM_EXIT_INTERNAL_ERROR:
default:
fprintf(stderr, "KVM error\n");
done = true;
break;
}
}
```
[//]: # ----------------------------------------------------------------
## vCPU execution and VMexits
- **vCPU execution runs at native speed**, without interruption until guest OS code generates a `VMexit`
- A `VMexit` occurs when guest OS code:
- **\textcolor{myred}{reads/writes}** from/to an **\textcolor{myred}{I/O port (PMIO)}**
- **\textcolor{myred}{reads/writes}** from/to a physical address that has **\textcolor{myred}{no memory mapping}**
- executes specific **privileged machine instructions** (very few)
- triggers special error (very rare)
- After a `VMexit` is handled, the VMM **resumes** the vCPU's execution
[//]: # ----------------------------------------------------------------
# Controlling devices
[//]: # ----------------------------------------------------------------
## Devices
::: incremental
- Examples of devices?
- screen, keyboard, mouse, hard drive, timer, etc.
- To be used, devices must be controlled/programmed
- Devices are controlled by a special piece of code: a **device driver**
- Each device is controlled differently: each requires specific controlling code!
- How to program a device is described in its datasheet/manual
- Each device is controlled through specific device **registers**
- How to read/write to these specific device registers?
:::
[//]: # ----------------------------------------------------------------
## Device addressing
- Devices are programmed using dedicated registers
- Each device has a different set of registers
- usually command, data, status, etc.
- each register has a specific access type (read, write, both)
- **\textcolor{myred}{Device registers can be accessed in two different ways: MMIO or PMIO}**
- Whether a device is accessed in MMIO or PMIO depends on the system
- the **only** way to know is by reading the system's documentation!
[//]: # ----------------------------------------------------------------
## Memory-Mapped I/O devices (MMIO)
- Device registers are **mapped** into the CPU **physical address space**
- **RAM and devices registers share the same address space!**
- Called **MMIO**: Memory-Mapped Input/Output
- Read/write from/to these devices happen exactly like memory (RAM)
- All CPU instructions dealing with memory operands can interact with these devices
- e.g. `mov` instruction (x86)
```{.verysmall .assembler}
mov al,[42] ; reads 8-bits at MMIO address 42
; and stores it into al register
```
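In C, the same kind of MMIO access is usually written with `volatile` pointers so the compiler does not optimize the reads and writes away. A minimal sketch; `mmio_read32` and `mmio_write32` are hypothetical helper names, and `base` stands for the address at which a device's registers are mapped.

```c
#include <stdint.h>

// Reads the 32-bit device register located `offset` bytes after `base`.
uint32_t mmio_read32(volatile void *base, uint32_t offset) {
    return *(volatile uint32_t *)((volatile uint8_t *)base + offset);
}

// Writes `value` to the 32-bit device register located `offset` bytes after `base`.
void mmio_write32(volatile void *base, uint32_t offset, uint32_t value) {
    *(volatile uint32_t *)((volatile uint8_t *)base + offset) = value;
}
```

Because MMIO uses ordinary loads and stores, these helpers can be exercised against a plain array standing in for the device registers.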
[//]: # ----------------------------------------------------------------
## Port-Mapped I/O devices (PMIO)
- Device registers are **mapped** into a **specific** address space (the I/O port space), **distinct** from the CPU physical address space
- Require **specific** CPU instructions to access these devices
- e.g. `in` and `out` instructions (x86)
```{.verysmall .assembler}
in al,42 ; reads 8-bits at PMIO address 42
; and stores it into al register
```
```{.verysmall .assembler}
mov al,17 ; writes 8-bits value 17
out 42,al ; to PMIO address 42
```
[//]: # ----------------------------------------------------------------
## Device address spaces
\centering
![](images/mmio_vs_pmio.png){ width=100% }
\vfill
:::::: {.columns}
::: {.column width="50%"}
\footnotesize
- MMIO address space, accessed using regular memory instructions (`mov`)
:::
::: {.column width="50%"}
\footnotesize
- Distinct address space (I/O ports), **only** accessed using specific machine instructions (`in`, `out`)
:::
::::::
[//]: # ----------------------------------------------------------------
## Example of VM memory mapping, including MMIO and PMIO
![](images/kvm_memory_mapping_with_mmio_pmio.png){ width=100% }
[//]: # ----------------------------------------------------------------
## Example of MMIO driver code: write to UART
```{.tiny .c}
// Base UART register
#define UART_BASE 0x10009000
// Data register
volatile uint32_t *uart_dr = (volatile uint32_t *)(UART_BASE+0x00);
// Flag register
volatile uint32_t *uart_fr = (volatile uint32_t *)(UART_BASE+0x18);
#define FR_TXFF (1 << 5)
void uart_putchar(char c) {
while (*uart_fr & FR_TXFF);
*uart_dr = c;
}
void uart_write(char *data) {
while (*data) {
uart_putchar(*data);
data++;
}
}
```
[//]: # ----------------------------------------------------------------
## Example of PMIO driver: read a sector (1/2)
\footnotesize
The following code reads a sector from the first IDE[^10] disk (on the first bus), using PMIO device registers
```{.tiny .c}
// A sector has a size of 512 bytes
#define SECTOR_SIZE 512
// IDE status register (on first bus)
#define STATUS_PORT 0x1F7
// IDE data port (base port of the first bus)
#define DATA_PORT 0x1F0
// IDE control register (on first bus)
#define CONTROL_PORT 0x3F6
// Assembly function that writes 8-bits data to address port
extern void outb(uint16_t port, uint8_t data);
// Assembly function that reads 8-bits from address port
extern uint8_t inb(uint16_t port);
// Assembly function that reads 16-bits from address port
extern uint16_t inw(uint16_t port);
```
[^10]:\scriptsize IDE is the grandfather of the SATA protocol
[//]: # ----------------------------------------------------------------
## Example of PMIO driver: read a sector (2/2)
\small
This function reads a sector in LBA[^11] mode:
```{.tiny .c}
// Read sector n into buffer data (512 bytes and already allocated).
void read_sector(int n, uint8_t *data) {
while ((inb(STATUS_PORT) & 0xC0) != 0x40); // Wait for drive to be ready
// Prepare disk for read or write at specified sector in 28-bit LBA mode
outb(0x1F2, 1); // Set sector count
outb(0x1F3, n & 0xff); // Set bits 00-07 of LBA
outb(0x1F4, (n >> 8) & 0xff); // Set bits 08-15 of LBA
outb(0x1F5, (n >> 16) & 0xff); // Set bits 16-23 of LBA
outb(0x1F6, ((n >> 24) & 0x0f) | 0xe0); // Set bits 24-27 of LBA
// + set LBA mode
outb(STATUS_PORT, 0x20); // Command: read sector with retry
while ((inb(STATUS_PORT) & 0xC0) != 0x40); // Wait for drive to be ready
uint16_t *dst = (uint16_t *)data; // View the destination buffer as 16-bit words
for (int i = 0; i < SECTOR_SIZE/2; i++) { // Read the sector,
*dst = inw(DATA_PORT); // 16-bits at a time
dst++;
}
}
```
[^11]: \tiny LBA means sectors are indexed from 0 to N-1, as opposed to the old CHS "Cylinder-head-sector" addressing
[//]: # ----------------------------------------------------------------
# KVM API - handling VMexits
[//]: # ----------------------------------------------------------------
## KVM workflow from VMM's perspective
::: incremental :::
1. \textcolor{lightgray}{Create a KVM device}
1. \textcolor{lightgray}{Create a VM}
1. \textcolor{lightgray}{Allocate RAM for the VM}
1. \textcolor{lightgray}{Map allocated RAM into the VM's address space}
1. \textcolor{lightgray}{Load guest OS binary blob into the VM's RAM}
1. \textcolor{lightgray}{Create a vCPU and initialize its registers}
1. \textcolor{lightgray}{Run the vCPU on the guest OS' code until a `VMexit` is triggered}
1. Handle `VMexits` to either:
- handle a **hypercall** request from the guest OS (paravirtualization)
- **emulate** the guest OS' expected behavior (emulation)
1. Resume vCPU execution in (7)
:::
[//]: # ----------------------------------------------------------------
## Reminder: vCPU execution and VMexits
- **vCPU execution runs at native speed**, without interruption until guest OS code generates a `VMexit`
- A `VMexit` occurs when guest OS code:
- **\textcolor{myred}{reads/writes}** from/to an **\textcolor{myred}{I/O port (PMIO)}**
- **\textcolor{myred}{reads/writes}** from/to a physical address that has **\textcolor{myred}{no memory mapping}**
- executes specific **privileged machine instructions** (very few)
- triggers special error (very rare)
- After a `VMexit` is handled, the VMM **resumes** the vCPU's execution
[//]: # ----------------------------------------------------------------
## Most interesting VMexits
- OS code can trigger many types of `VMexits`, but these ones are especially interesting:
- `KVM_EXIT_IO`: the vCPU executed a port I/O instruction (PMIO) that cannot be satisfied by KVM
- means the **\textcolor{myred}{guest read/wrote from/to a port}**
- `KVM_EXIT_MMIO`: the vCPU executed a memory-mapped I/O (MMIO) instruction that cannot be satisfied by KVM
- means the **\textcolor{myred}{guest read/wrote from/to an address that has no RAM mapping}**
\vspace{.5cm}
\definecolor{palechestnut}{rgb}{0.87, 0.68, 0.69}
\setlength{\fboxsep}{6pt}
\fcolorbox{black}{palechestnut!50}{\parbox{10cm}{How to use these VMexits, to either \textbf{paravirtualize} or \textbf{emulate} a device?}}
[//]: # ----------------------------------------------------------------
# Device paravirtualization
[//]: # ----------------------------------------------------------------
## From device emulation to device paravirtualization
::: incremental :::
- Device emulation can be complex and difficult to implement
- Device drivers that trigger many `VMexits` have poor performance
- each `VMexit` triggers a context switch
- many context switches lead to poor performance
- Paravirtualized devices are designed to be simple and efficient, with few `VMexits`
- How?
:::
[//]: # ----------------------------------------------------------------
## Hypercalls
- Mechanism for the guest OS to request the help of the VMM
- The VMM exposes an "API" of what functions are available to the guest OS
- typically used to access devices
- Examples of hypercalls:
- guest OS wants to read a disk sector
- guest OS wants to display something
- guest OS wants to send a network packet
[//]: # ----------------------------------------------------------------
## Hypercalls vs system calls
- Mechanism similar to a system call between an application and an OS:\
\vspace{1cm}
\centering ![](images/hypercalls_vs_syscalls.png){ width=80% }
[//]: # ----------------------------------------------------------------
## Benefits of hypercalls
::: incremental :::
- No need to emulate the real hardware
- **much simpler device drivers!**
\vspace{.3cm}
- Very few `VMexits`
- **much better performance!**
:::
[//]: # ----------------------------------------------------------------
## Hypercall principle
- How does the guest OS request a hypercall from the VMM?
- A hypercall is simply a **specific `VMexit`** triggered by the guest OS, associated with a function **number**
- Hypercall arguments (parameters) are stored in an area of **shared memory** between VMM and guest OS
[//]: # ----------------------------------------------------------------
## Hypercalls: PMIO or MMIO?
- Hypercalls can be triggered by the guest OS using either MMIO or PMIO
- However, given that `KVM_EXIT_IO` (PMIO) is much faster[^6] on Intel/AMD architectures, PMIO is used in the following example
[^6]: \scriptsize [\textcolor{myblue}{From official doc}](https://www.kernel.org/doc/html/latest/virt/kvm/api.html): "KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO"
[//]: # ----------------------------------------------------------------
## Guest OS: basic hypercall implementation example
- Writes hypercall arguments to **hypercall buffer** (shared memory between VMM and guest)
- Triggers a hypercall by writing to a **specific hypercall address**[^7] (PMIO)
- writes the hypercall **number** to indicate what function to call
[^7]: \scriptsize Obviously, this PMIO address must not be used by any real hardware
[//]: # ----------------------------------------------------------------
## VMM: basic hypercall implementation example
- Allocate and map **hypercall buffer** in VM's address space
- this memory area is **shared** between VMM and guest OS
- allocated and mapped similarly to RAM in the VM
- When `KVM_EXIT_IO` is encountered, check the address to determine whether it's a hypercall
- If hypercall:
1. extract hypercall **number** by reading value written by guest
1. extract hypercall **arguments** from hypercall buffer
1. perform the **expected behavior**
- use Linux host libraries/system calls to access devices
- possibly write an output to hypercall buffer
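
The VMM-side steps above can be sketched as a single handler called for each PMIO write. This is only a minimal sketch under stated assumptions: the port `HYPERCALL_PORT`, the hypercall number `HC_ADD`, the `hc_add_t` layout and the function `handle_pmio_write` are all hypothetical names chosen for illustration, not part of KVM.

```c
#include <stdint.h>
#include <string.h>

#define HYPERCALL_PORT 0xABBA // hypothetical I/O port, assumed unused by real hardware

enum { HC_ADD = 1 }; // hypothetical hypercall number

// Hypothetical argument layout written by the guest into the hypercall buffer
typedef struct {
    uint32_t a;
    uint32_t b;
    uint32_t result; // output written back by the VMM
} __attribute__((packed)) hc_add_t;

// To be called for each KVM_EXIT_IO write.
// Returns 1 if a hypercall was handled, 0 if the port is not the
// hypercall port, -1 if the hypercall number is unknown.
int handle_pmio_write(uint16_t port, uint32_t value, uint8_t *hypercall_buf)
{
    if (port != HYPERCALL_PORT) {
        return 0; // regular PMIO access, not a hypercall
    }
    switch (value) { // the value written by the guest is the hypercall number
    case HC_ADD: {
        hc_add_t args;
        memcpy(&args, hypercall_buf, sizeof(args)); // extract the arguments
        args.result = args.a + args.b;              // perform the expected behavior
        memcpy(hypercall_buf, &args, sizeof(args)); // write the output back
        return 1;
    }
    default:
        return -1;
    }
}
```

In a real VMM, `port` and `value` would come from `run->io.port` and the data located at `run->io.data_offset`, and `hypercall_buf` would be the host address of the shared buffer.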
[//]: # ----------------------------------------------------------------
## Hypercalls: shared buffer
\centering
![](images/kvm_memory_mapping_with_hypercall.png){ width=100% }
[//]: # ----------------------------------------------------------------
## Hypercalls: parameters
```{.verysmall}
typedef struct {
uint8_t x;
int32_t val;
uint64_t msg; // a pointer
} __attribute__((packed)) params_t;
```
IMPORTANT: if the VMM and the guest have different word sizes, e.g. a 32-bit guest and a 64-bit VMM, pointer-sized fields in the above structure must use the **largest** data storage type (here `uint64_t`), so that both sides agree on the layout!
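
The layout constraint can be checked at compile time. The sketch below reuses the `params_t` structure shown above; the helpers `params_set_msg` and `params_get_msg` are hypothetical names added for illustration, and the `packed` attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t x;
    int32_t val;
    uint64_t msg; // a pointer, stored as a 64-bit integer on both sides
} __attribute__((packed)) params_t;

// With fixed-width types and no padding, a 32-bit guest and a 64-bit VMM
// see the same layout: 1 + 4 + 8 = 13 bytes
static_assert(sizeof(params_t) == 13, "unexpected params_t size");
static_assert(offsetof(params_t, val) == 1, "unexpected offset of val");
static_assert(offsetof(params_t, msg) == 5, "unexpected offset of msg");

// Store a pointer in the 64-bit field, whatever the native pointer size is
void params_set_msg(params_t *p, const void *ptr)
{
    p->msg = (uint64_t)(uintptr_t)ptr;
}

// Convert the 64-bit field back to a native pointer
void *params_get_msg(const params_t *p)
{
    return (void *)(uintptr_t)p->msg;
}
```

Note that a pointer written by the guest is an address in the guest's address space: the VMM has to translate it to a host address before dereferencing it.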
[//]: # ----------------------------------------------------------------
## Writing/reading to/from shared buffer
- Compiler optimization:
- use `memcpy` if the VMM must write something that must be read back by the guest OS!
- otherwise, the compiler may not read the value again (it may reuse the value it already holds, as it doesn't know the value was modified by another thread/process)
[//]: # ----------------------------------------------------------------
# VMExits: retrieving values
[//]: # ----------------------------------------------------------------
## KVM_EXIT_IO: retrieving data written by the guest
\footnotesize
- Guest wrote a value (8, 16, or 32 bits) to a PMIO address (I/O port)
- VMM retrieves: value, address, size written (8, 16, 32 bits)
```{.c .tiny}
if (run->io.direction == KVM_EXIT_IO_OUT) { // See struct kvm_run in "(6) Create a vCPU"
uint8_t *addr = (uint8_t *)run + run->io.data_offset; // data lives inside the mapped kvm_run area
uint32_t value;
switch (run->io.size) {
case 1: // Retrieve the 8-bit value written by the guest
value = *(uint8_t *)addr;
break;
case 2: // Retrieve the 16-bit value written by the guest
value = *(uint16_t *)addr;
break;
case 4: // Retrieve the 32-bit value written by the guest
value = *(uint32_t *)addr;
break;
default:
fprintf(stderr, "Unsupported size in KVM_EXIT_IO\n");
value = 0;
}
printf("PMIO guest write: size=%d port=0x%x value=0x%x\n", run->io.size, run->io.port, value);
}
```
[//]: # ----------------------------------------------------------------
## KVM_EXIT_MMIO: retrieving data written by the guest
\footnotesize
- Guest wrote a value (8, 16, or 32 bits) to an MMIO address
- VMM retrieves: value, address, size written (8, 16, 32 bits)
```{.c .tiny}
if (run->mmio.is_write) { // See struct kvm_run in "(6) Create a vCPU"
int bytes_written = run->mmio.len;
uint32_t value;
switch (bytes_written) {
case 1: // Retrieve the 8-bit value written by the guest
value = *((uint8_t *)run->mmio.data);
break;
case 2: // Retrieve the 16-bit value written by the guest
value = *((uint16_t *)run->mmio.data);
break;
case 4: // Retrieve the 32-bit value written by the guest
value = *((uint32_t *)run->mmio.data);
break;
default:
fprintf(stderr, "Unsupported size in KVM_EXIT_MMIO\n");
value = 0;
}
printf("MMIO guest write: addr=0x%llx value=0x%x len=%d\n", run->mmio.phys_addr, value, bytes_written);
}
```
[//]: # ----------------------------------------------------------------
# VMExits: injecting values
[//]: # ----------------------------------------------------------------
## KVM_EXIT_IO: injecting data into the guest
\footnotesize
- Guest read a value (8, 16, or 32 bits) from a PMIO address (I/O port)
- VMM retrieves: address, size read (8, 16, 32 bits)
- VMM injects a specific value (the one read by the guest)
```{.c .tiny}
if (run->io.direction == KVM_EXIT_IO_IN) { // See struct kvm_run
uint8_t *addr = (uint8_t *)run + run->io.data_offset; // data lives inside the mapped kvm_run area
uint32_t injected_val = 0; // example value injected into the guest
switch (run->io.size) {
case 1: // Guest is reading 8 bits from the port
injected_val = 0x12;
*addr = (uint8_t)injected_val;
break;
case 2: // Guest is reading 16 bits from the port
injected_val = 0x1234;
*((uint16_t *)addr) = (uint16_t)injected_val;
break;
case 4: // Guest is reading 32 bits from the port
injected_val = 0x12345678;
*((uint32_t *)addr) = injected_val;
break;
default:
fprintf(stderr, "Unsupported size in KVM_EXIT_IO\n");
}
printf("PMIO guest read: size=%d port=0x%x [value injected by VMM=0x%x]\n", run->io.size, run->io.port, injected_val);
}
```
[//]: # ----------------------------------------------------------------
## KVM_EXIT_MMIO: injecting data into the guest
\footnotesize
- Guest read a value (8, 16, or 32 bits) from an MMIO address
- VMM retrieves: address, size read (8, 16, 32 bits)
- VMM injects a specific value (the one read by the guest)
```{.c .tiny}
if (!run->mmio.is_write) { // See struct kvm_run
int bytes_read = run->mmio.len;
uint32_t injected_val = 0; // example value injected into the guest
switch (bytes_read) {
case 1: // Guest is reading 8 bits
injected_val = 0x12;
*((uint8_t *)run->mmio.data) = (uint8_t)injected_val;
break;
case 2: // Guest is reading 16 bits
injected_val = 0x1234;
*((uint16_t *)run->mmio.data) = (uint16_t)injected_val;
break;
case 4: // Guest is reading 32 bits
injected_val = 0x12345678;
*((uint32_t *)run->mmio.data) = injected_val;
break;
default:
fprintf(stderr, "Unsupported size in KVM_EXIT_MMIO\n");
}
fprintf(stderr, "MMIO guest read: addr=0x%llx injected=0x%x len=%d\n", run->mmio.phys_addr, injected_val, bytes_read);
}
```
[//]: # ----------------------------------------------------------------
# Device emulation
[//]: # ----------------------------------------------------------------
## Reminder: MMIO registers
When the VM is created:
- The VMM constructs the VM address space by:
- mapping the RAM into the VM address space
- if the VM exposes some device programmed through MMIO registers, it must ensure there is no memory mapping (RAM) where device registers are located:
- this ensures `VMexits` (`KVM_EXIT_MMIO`) will be triggered when OS driver code reads/writes these addresses (registers)
[//]: # ----------------------------------------------------------------
## Reminder: VMexits
When the VM is being executed:
- PMIO `VMexits` (`KVM_EXIT_IO`) are triggered when guest OS reads/writes from/to I/O ports
- MMIO `VMexits` (`KVM_EXIT_MMIO`) are triggered when guest OS reads/writes from/to an address that has no RAM mapping
[//]: # ----------------------------------------------------------------
## Device emulation principle
- For each PMIO or MMIO `VMexit`, the VMM can retrieve:
- the address the guest OS wrote to or read from
- the value written by the guest OS
- the size of the value written or read by the guest OS (8, 16, 32, 64 bits)
- VMM can **keep track** of where and how the guest OS reads/writes device registers
- this allows the VMM to **infer** what the OS is doing and **emulate** the desired behavior
[//]: # ----------------------------------------------------------------
## Device emulation
- Emulating a real device allows a guest OS implementing a driver for the **real hardware** to use it
- Most OSes implement drivers to support various popular physical devices, e.g.:
- VGA graphic card, PS/2 mouse, PS/2 keyboard, SATA drive, etc.
- Reminder: device drivers write and read to specific device registers
- either MMIO or PMIO or both
- How does a VMM emulate a device?
[//]: # ----------------------------------------------------------------
## Device emulation: example
\small
Graphics card driver code excerpt (using PMIO) from guest OS code, to initialize VGA 400x300 mode:
```{.c .tiny}
// Code excerpt of initialization sequence for VGA 400x300 mode
outb(0x3C2, 0x67);
outw(0x3C4, 0x0F02); // enable writing to all planes
outw(0x3CE, 0x0506); // graphic mode
while (1) {
if (inb(0x3DA) == 0x80) break;
}
outb(0x3C0, 0x20); // enable video
outb(0x3C5, 0x0F);
```
\footnotesize
- The code above would typically be part of the VGA driver in the guest OS
- How can the VMM emulate the behavior of a real PC running this code?
- \footnotesize by analyzing the code run by the guest OS
- if the VMM detects the exact code above, it then emulates the behavior on the host
- for instance by opening a 400x300 pixels window in which pixels will be rendered
[//]: # ----------------------------------------------------------------
## Device emulation: state machine
```{.c .tiny}
outb(0x3C2, 0x67);
outw(0x3C4, 0x0F02); // enable writing to all planes
outw(0x3CE, 0x0506); // graphic mode
while (1) {
if (inb(0x3DA) == 0x80) break;
}
outb(0x3C0, 0x20); // enable video
outb(0x3C5, 0x0F);
```
\centering
![](images/emul_state_machine.png){ width=100% }
[//]: # ----------------------------------------------------------------
## State machine representation
- Each device to emulate corresponds to a specific state machine
- Intuitively, a state machine is a set of states associated with conditions and actions, in a kind of "big switch case"
- However, writing new, but similar code for each device is repetitive, error-prone, non-scalable and difficult to maintain
- How to represent a state machine in a generic way?
- We would like a representation of the state machine that is generic, with actions and conditions easily and clearly expressed
[//]: # ----------------------------------------------------------------
## Generic state machine representation (1/2)
Represent each state by a structure that defines:
- the operation to perform
- the written/read address
- the expected written value or value to inject
- the size of the operation (8, 16, or 32 bits)
- possibly a custom user function that would be executed at the beginning or end of the state
[//]: # ----------------------------------------------------------------
## Generic state machine representation (2/2)
\small
:::::: {.columns}
::: {.column width="60%"}
State machine for previously shown VGA 400x300 graphic initialization driver:
```{.c .tiny}
state_t states[] = {
{ OP_WRITE_EQUAL, 0x3C2, 0x67, 1, NULL },
{ OP_WRITE_EQUAL, 0x3C4, 0xF02, 2, NULL },
{ OP_WRITE_EQUAL, 0x3CE, 0x506, 2, NULL },
{ OP_READ_INJECT, 0x3DA, 0x80, 1, NULL },
{ OP_WRITE_EQUAL, 0x3C0, 0x20, 1, NULL },
{ OP_WRITE_EQUAL, 0x3C5, 0x0F, 1, NULL },
{ OP_EMUL_END, 0, 0, 0, NULL }
};
```
Here, the last field is a pointer to a custom function, which is called only when the field is non-NULL
:::
::: {.column width="40%"}
```{.c .tiny}
outb(0x3C2, 0x67);
// enable writing to all planes
outw(0x3C4, 0x0F02);
// graphic mode
outw(0x3CE, 0x0506);
while (1) {
if (inb(0x3DA) == 0x80) {
break;
}
}
// enable video
outb(0x3C0, 0x20);
outb(0x3C5, 0x0F);
```
:::
::::::
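
One possible way to drive such a table is sketched below. The slides do not define `state_t`, so the field names, the `state_machine_t` type and the `sm_on_write`/`sm_on_read` functions are hypothetical. Resetting to the first state on any mismatch is also an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { OP_WRITE_EQUAL, OP_READ_INJECT, OP_EMUL_END } op_t;

typedef struct {
    op_t op;                // operation expected in this state
    uint64_t addr;          // address (or port) written to or read from
    uint32_t value;         // expected written value, or value to inject
    uint8_t size;           // size of the operation in bytes
    void (*callback)(void); // optional function called when the state matches
} state_t;

// Same table as above, for the VGA 400x300 initialization sequence
const state_t vga_states[] = {
    { OP_WRITE_EQUAL, 0x3C2, 0x67,  1, NULL },
    { OP_WRITE_EQUAL, 0x3C4, 0xF02, 2, NULL },
    { OP_WRITE_EQUAL, 0x3CE, 0x506, 2, NULL },
    { OP_READ_INJECT, 0x3DA, 0x80,  1, NULL },
    { OP_WRITE_EQUAL, 0x3C0, 0x20,  1, NULL },
    { OP_WRITE_EQUAL, 0x3C5, 0x0F,  1, NULL },
    { OP_EMUL_END,    0,     0,     0, NULL }
};

typedef struct {
    const state_t *states; // table terminated by an OP_EMUL_END entry
    size_t current;        // index of the current state
} state_machine_t;

static bool sm_done(const state_machine_t *sm)
{
    return sm->states[sm->current].op == OP_EMUL_END;
}

static void sm_advance(state_machine_t *sm)
{
    const state_t *s = &sm->states[sm->current];
    if (s->callback != NULL) {
        s->callback();
    }
    sm->current++;
}

// Feed a guest write; returns true once the whole sequence has been seen
bool sm_on_write(state_machine_t *sm, uint64_t addr, uint32_t value, uint8_t size)
{
    const state_t *s = &sm->states[sm->current];
    if (s->op == OP_WRITE_EQUAL && s->addr == addr && s->value == value && s->size == size) {
        sm_advance(sm);
    } else if (!sm_done(sm)) {
        sm->current = 0; // unexpected access: restart the sequence
    }
    return sm_done(sm);
}

// Feed a guest read; returns true if *inject holds a value to inject
bool sm_on_read(state_machine_t *sm, uint64_t addr, uint8_t size, uint32_t *inject)
{
    const state_t *s = &sm->states[sm->current];
    if (s->op == OP_READ_INJECT && s->addr == addr && s->size == size) {
        *inject = s->value;
        sm_advance(sm);
        return true;
    }
    if (!sm_done(sm)) {
        sm->current = 0; // unexpected access: restart the sequence
    }
    return false;
}
```

The VMM would call `sm_on_write` from its `KVM_EXIT_IO`/`KVM_EXIT_MMIO` write path and `sm_on_read` from its read path. When `sm_on_write` returns true, the device behavior can be emulated, for instance by opening the 400x300 window.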
[//]: # ----------------------------------------------------------------
# VMM software architecture
[//]: # ----------------------------------------------------------------
## VMM execution: issue
::: incremental :::
- Most of the VMM time is spent **blocked** in the `KVM_RUN` `ioctl`
- **\textcolor{myred}{Issue}**: a single-threaded VMM can only do other things **if** and **when** a `VMexit` occurs
- Solution?
- Use multiple threads
:::
[//]: # ----------------------------------------------------------------
## Recommended VMM architecture
- Dedicate a thread for every vCPU
- If using the SDL library to manage the display, **make sure** that only the main thread calls SDL functions[^8]
- Dedicate a thread to interact with the user
- if using SDL, handle it in the same thread (main) as the one calling SDL functions
- If VMM emulates a timer, have it run in a dedicated thread
[^8]: \scriptsize [\textcolor{myblue}{https://documentation.help/SDL/thread.html}](https://documentation.help/SDL/thread.html)
[//]: # ----------------------------------------------------------------
## Gracefully exiting the VM: case 1
**Guest OS explicitly stops the CPU**
- Guest: executes the `hlt` machine instruction to stop the CPU
- a buggy guest may unknowingly trigger `KVM_EXIT_HLT`!
- VMM: `hlt` triggers the `KVM_EXIT_HLT` `VMexit`
- VMM: must:
- terminate all threads
- deallocate all allocated resources/memory
[//]: # ----------------------------------------------------------------
## Gracefully exiting the VM: case 2
\small
::: incremental :::
**Execution of guest OS never stops** (infinite loop)
- The thread running the `KVM_RUN` `ioctl` never stops either!
- The VMM could detect a specific key press $\rightarrow$ exit **initiated** on the host
- ... But how to stop the thread blocked in the `KVM_RUN` `ioctl`?
- **\textcolor{mygreen}{Solution}**: send a cancel message to terminate the thread:
- \footnotesize configure the thread to be interruptible
```{.c .tiny}
// To be called at the very beginning of the thread function!
pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
```
- \footnotesize when the VM must be stopped, send the thread to be cancelled a cancel notification with:
```{.c .tiny}
pthread_cancel(thread);
```
:::
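
The cancellation mechanism can be tried without KVM. In the sketch below, `sleep` merely stands in for the blocking `KVM_RUN` `ioctl`; `vcpu_thread` and `stop_vcpu_thread` are hypothetical names, not part of any API.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

// Stand-in for the vCPU thread: sleep() plays the role of the
// blocking KVM_RUN ioctl in a real VMM
void *vcpu_thread(void *arg)
{
    (void)arg;
    // Allow the thread to be cancelled at any time, not only at cancellation points
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
    for (;;) {
        sleep(1);
    }
    return NULL;
}

// Cancel the thread and wait for it to terminate.
// Returns true if the thread ended because of the cancellation.
bool stop_vcpu_thread(pthread_t thread)
{
    void *result = NULL;
    if (pthread_cancel(thread) != 0) {
        return false;
    }
    if (pthread_join(thread, &result) != 0) {
        return false;
    }
    return result == PTHREAD_CANCELED;
}
```

Joining the thread after cancelling it matters: only once `pthread_join` returns is it safe to free the resources the thread was using.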
[//]: # ----------------------------------------------------------------
## Shutting down the VMM
- When the VMM is shutting down, make sure to **deallocate** all VM's resources:
- close all open file descriptors (KVM device, VM, vCPUs, etc.)
- use `munmap` to free all memory regions allocated with `mmap`
- All threads must terminate as well
- make sure threads are properly synchronized
- requires communication/notifications between the threads so that:
- they free resources in the correct order
- they terminate in the correct order
[//]: # ----------------------------------------------------------------
# Emulating hardware interrupts
[//]: # ----------------------------------------------------------------
## Hardware interrupts
- Hardware interrupts are **asynchronous** notifications generated by the hardware, e.g.:
- timer generates an interrupt at a fixed interval (e.g. every millisecond)
- mouse generates an interrupt whenever it's moved or a button is pressed
- keyboard generates an interrupt whenever a key is pressed or released
- sound card generates an interrupt when a sound finished playing
- hard drive controller generates an interrupt when a sector is written/read
- etc.
[//]: # ----------------------------------------------------------------
## Why hardware interrupts?
- To avoid constantly polling devices
- Polling must be avoided:
- polling frequently: low latency, but high CPU usage
- polling infrequently: low CPU usage, but high latency
- To handle timers
- A timer offers a time base independent of the CPU frequency
- necessary to implement `sleep` or similar functionalities
[//]: # ----------------------------------------------------------------
## Anatomy of hardware interrupts (simplified)
- Each interrupt is identified by a number ($\geq$ 0)
- Each device generates a different interrupt
- Whenever the CPU receives an interrupt, it must handle it, i.e. execute some particular code, e.g.:
- keyboard: read the pressed key and store it in an internal buffer
- mouse: read the new position and display the cursor at the new position
- etc.
- When the CPU finishes handling the interrupt, it must send a specific command (EOI = End Of Interrupt) to the interrupt controller to indicate it has finished handling the interrupt
- otherwise that interrupt will never be fired again!
[//]: # ----------------------------------------------------------------
## Interrupt Vector Table
- Each hardware interrupt is associated with a specific routine (function)
- This routine is executed by the CPU whenever the corresponding interrupt is received
- such a routine is called an Interrupt Service Routine (ISR), or Interrupt Handler
- A table stored in RAM holds the address of the ISR of each hardware interrupt
- This table is called the Interrupt Vector Table (IVT)
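
Conceptually, the IVT behaves like an array of function pointers indexed by interrupt number. The sketch below is a plain C model of that idea only: it does not reproduce the actual in-memory format used by x86 CPUs, and `ivt_register`, `ivt_dispatch` and `keyboard_isr` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IVT_ENTRIES 256

typedef void (*isr_t)(void);

// One entry per interrupt number; NULL means no ISR is registered
static isr_t ivt[IVT_ENTRIES];

// Associate an ISR with an interrupt number
void ivt_register(uint8_t irq, isr_t isr)
{
    ivt[irq] = isr;
}

// Model of what happens when interrupt `irq` is received.
// Returns false if no ISR is registered for it.
bool ivt_dispatch(uint8_t irq)
{
    if (ivt[irq] == NULL) {
        return false;
    }
    ivt[irq]();
    return true;
}

// Example ISR: counts keyboard events
unsigned key_events = 0;

void keyboard_isr(void)
{
    key_events++;
}
```

For instance, after `ivt_register(1, keyboard_isr)`, each call to `ivt_dispatch(1)` runs the keyboard ISR once.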
[//]: # ----------------------------------------------------------------
## Hardware interrupts and IVT (simplified)
\centering
![](images/hardware_interrupts_IVT.png){ width=100% }
[//]: # ----------------------------------------------------------------
## Hardware interrupts emulation
::: incremental :::
- How to emulate hardware interrupts?
- By having the VMM **inject** hardware interrupts into the guest!
:::
[//]: # ----------------------------------------------------------------
## Hardware interrupts injection - VMM side
1. Create a virtual "Programmable Interrupt Controller" (PIC)
- \textcolor{myred}{important: must be created \textbf{before} creating any vCPU!}
1. Configure the PIC to automatically issue End Of Interrupt (EOI) commands
- this simplifies the guest OS code
1. Create a file descriptor for event notification
1. Link this file descriptor to a KVM interrupt request object, specifying the hardware interrupt number
1. Then, **to trigger a hardware interrupt in the guest, simply write to this file descriptor**
[//]: # ----------------------------------------------------------------
## Injection VMM side: (1) create virtual PIC
A PIC is created using the following `ioctl` call on the VM file descriptor:
\vspace{.3cm}
```{.verysmall .c}
if (ioctl(vmfd, KVM_CREATE_IRQCHIP, 0) < 0) {
err(1, "KVM_CREATE_IRQCHIP failed");
}
```
[//]: # ----------------------------------------------------------------
## Injection VMM side: (2) configure virtual PIC
Configure the PIC to automatically issue End Of Interrupt (EOI) commands:
\vspace{.3cm}
```{.verysmall .c}
struct kvm_irqchip irqchip = {
.chip_id = 0, // 0 = PIC1, 1 = PIC2, 2 = IOAPIC
};
// Retrieve the PIC registers
if (ioctl(vmfd, KVM_GET_IRQCHIP, &irqchip) < 0) {
err(1, "KVM_GET_IRQCHIP failed");
}
// Configure the PIC registers so that the Guest OS doesn't need
// to acknowledge the hardware interrupt by issuing an EOI.
irqchip.chip.pic.auto_eoi = 1;
// Set the PIC registers
if (ioctl(vmfd, KVM_SET_IRQCHIP, &irqchip) < 0) {
err(1, "KVM_SET_IRQCHIP failed");
}
```
[//]: # ----------------------------------------------------------------
## Injection VMM side: (3) fd for event notification
Create a file descriptor for event notification by using the `eventfd` function:
\vspace{.3cm}
```{.verysmall .c}
// Create a file descriptor for event (hardware interrupts) notification
int fd = eventfd(0, 0);
if (fd == -1) {
err(1, "eventfd failed");
}
```
[//]: # ----------------------------------------------------------------
## Injection VMM side: (4) link eventfd to KVM
- Link the event file descriptor to a KVM interrupt request object which specifies which hardware interrupt must be associated to the file descriptor
- Here, we link hardware interrupt 5 to file descriptor `fd`:
\vspace{.3cm}
```{.verysmall .c}
int interrupt_number = 5;
struct kvm_irqfd irqfd = {
.gsi = interrupt_number,
.fd = fd,
};
if (ioctl(vmfd, KVM_IRQFD, &irqfd) < 0) {
err(1, "KVM_IRQFD error");
}
```
[//]: # ----------------------------------------------------------------
## Injection VMM side: (5) trigger hardware interrupt
To trigger a hardware interrupt in the guest, simply write any 64-bit value (other than `0xFFFFFFFFFFFFFFFF`, which `eventfd` rejects) to the event file descriptor:
\vspace{.3cm}
```{.verysmall .c}
uint64_t dummy_val = 0;
write(fd, &dummy_val, sizeof(uint64_t));
```
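
The `write` above can be wrapped in a small helper that checks the result. This is only a sketch: `irq_trigger` and `irq_drain` are hypothetical names, and `irq_drain` is there for experimentation only, since once the descriptor is linked with `KVM_IRQFD`, it is KVM that consumes the events.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Signal the event file descriptor once.
// Returns 0 on success, -1 on error.
int irq_trigger(int fd)
{
    uint64_t one = 1;
    ssize_t n = write(fd, &one, sizeof(one));
    return n == (ssize_t)sizeof(one) ? 0 : -1;
}

// Experimentation only: read (and reset) the number of pending events.
// Returns 0 on success, -1 on error.
int irq_drain(int fd, uint64_t *count)
{
    ssize_t n = read(fd, count, sizeof(*count));
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

A thread emulating a device, e.g. a timer, could call `irq_trigger(fd)` each time the device is supposed to raise its interrupt.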
[//]: # ----------------------------------------------------------------
## Hardware interrupts injection - guest OS side
Perform the same steps as on a real physical system[^4]:
1. Create an Interrupt Vector Table (IVT)
1. Initialize the IVT entries for which hardware interrupts must be handled
- implement ISRs for all potential hardware interrupts that may be triggered
1. Unmask hardware interrupts so that they will be received
\textcolor{myred}{Receiving a hardware interrupt for which there is no properly initialized IVT entry will result in a shutdown/reboot of the guest (as it would on a physical machine)}
[^4]: \scriptsize With one exception: in the ISR, no need to send an EOI command to the PIC
[//]: # ----------------------------------------------------------------
# Miscellaneous
[//]: # ----------------------------------------------------------------
## Memory optimization with KSM
- KSM stands for **K**ernel **S**amepage **M**erging
- Linux kernel feature that deduplicates "identical" pages found across user processes
- **\textcolor{mygreen}{benefit}: massive gain in memory usage!**
- \textcolor{myred}{drawback}: security (theoretical)
- Spawns the `ksmd` daemon, which inspects pages for possible merges
- Kernel must be compiled with `CONFIG_KSM=y` (available since Linux 2.6.32)
- KSM [\textcolor{myblue}{controlled}](https://www.kernel.org/doc/Documentation/vm/ksm.txt)[^3] by writing to `/sys/kernel/mm/ksm/run`:
- 0: stop `ksmd` from running but keep merged pages
- 1: run `ksmd`
- 2: stop `ksmd` and unmerge all pages currently merged
[^3]: \scriptsize [\textcolor{myblue}{https://www.kernel.org/doc/Documentation/vm/ksm.txt}](https://www.kernel.org/doc/Documentation/vm/ksm.txt)
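
Controlling KSM from a program boils down to writing a digit to that file. The helper below is a minimal sketch; `ksm_set_run` is a hypothetical name, and the path is passed as a parameter so that the function can be tried on a regular file first.

```c
#include <stdio.h>

// Write a KSM run mode (0, 1 or 2) to the given control file.
// Returns 0 on success, -1 on invalid mode or I/O error.
int ksm_set_run(const char *path, int mode)
{
    if (mode < 0 || mode > 2) {
        return -1;
    }
    FILE *f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }
    int ok = fprintf(f, "%d\n", mode) > 0;
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

On a real system, the call would be `ksm_set_run("/sys/kernel/mm/ksm/run", 1)`, which requires root privileges.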
[//]: # ----------------------------------------------------------------
## Resources
\small
- KVM FAQ\
\footnotesize [\textcolor{myblue}{https://www.linux-kvm.org/page/FAQ}](https://www.linux-kvm.org/page/FAQ)
\small
- KVM API reference\
\footnotesize [\textcolor{myblue}{https://www.kernel.org/doc/html/latest/virt/kvm/api.html}](https://www.kernel.org/doc/html/latest/virt/kvm/api.html)
\small
- Using the KVM API\
\footnotesize [\textcolor{myblue}{https://lwn.net/Articles/658511/}](https://lwn.net/Articles/658511/)
\small
- *Increasing memory density by using KSM* by A. Arcangeli, I. Eidus, C. Wright\
\footnotesize [\textcolor{myblue}{https://www.kernel.org/doc/ols/2009/ols2009-pages-19-28.pdf}](https://www.kernel.org/doc/ols/2009/ols2009-pages-19-28.pdf)
\small
- \footnotesize Virtio Driver Implementation\
\scriptsize [\textcolor{myblue}{http://www.dumais.io/index.php?article=aca38a9a2b065b24dfa1dee728062a12}](http://www.dumais.io/index.php?article=aca38a9a2b065b24dfa1dee728062a12)
<!--
[//]: # ----------------------------------------------------------------
# Kernel modules
[//]: # ----------------------------------------------------------------
## Linux kernel modules
\small
**Module = kernel code that can be loaded/unloaded at runtime**
- Allows to add/remove kernel features while system is running
- Modules have **full privileges** and control of the system $\rightarrow$ \textcolor{myred}{buggy modules may crash the kernel!}
- Make it easy to develop drivers without rebooting
- Help keep kernel image size to a minimum
- Help reduce boot time: avoid spending time initializing devices and kernel features that will only be needed later
- Modules installed in `/lib/modules/<kernel_version>/kernel` and have the `.ko` extension
[//]: # ----------------------------------------------------------------
## Loading modules
- To load a single module **without** its dependencies:
```{.small}
sudo insmod <module_path>.ko
```
- To load a module **with** its dependencies:
```{.small}
sudo modprobe <module_name>
```
- `modprobe` reads `/lib/modules/<kernel_version>/modules.dep.bin` to determine:
- each module’s location (path)
- each module’s dependencies
[//]: # ----------------------------------------------------------------
## Module utilities
- To get information about a module (parameters, license, description, dependencies, etc.):
```{.small}
modinfo <module_name>
modinfo <module_path>.ko
```
- To display all loaded modules (see `/proc/modules`):
```{.small}
lsmod
```
- To remove a module (and its dependencies with `-r`):
```{.small}
rmmod <module_name>
```
-->
SRCS=$(wildcard *.md)
PDFS=$(SRCS:%.md=%.pdf)
UID=$(shell id -u)
GID=$(shell id -g)
all: $(PDFS)
%.pdf: %.md
docker run --user $(UID):$(GID) --rm --mount type=bind,src="$(PWD)",dst=/src thxbb12/md2pdf build_slides $<
clean:
rm -f $(PDFS)
course/images/device_emul.png

93.2 KiB

course/images/device_paravirt.png

101 KiB

course/images/emul_state_machine.png

287 KiB

course/images/hardware_assisted_virt.png

222 KiB

course/images/hardware_interrupts_IVT.png

292 KiB

course/images/hypercalls_vs_syscalls.png

38.9 KiB

course/images/kvm.png

154 KiB

course/images/kvm_memory_mapping.png

232 KiB

course/images/kvm_memory_mapping_with_hypercall.png

423 KiB

course/images/kvm_memory_mapping_with_mmio_pmio.png

335 KiB

course/images/kvm_model.png

136 KiB

course/images/mmio.png

46.5 KiB

course/images/mmio_vs_pmio.png

151 KiB
