VFIO details¶
Example Usage¶
prepare
unlink device to pci driver in sysfs
link device to vfio driver in sysfs (and bind to vfio group)
provide the user with access to the vfio group if unprivileged operation is desired
user access vfio
Open vfio container (
/dev/vfio/vfio
) and vfio group (e.g./dev/vfio/26
)Add group to container:
ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
Set IOMMU model of the container:
ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
Use
mmap()
to allocate some space and setup a DMA mapping:ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
Get device from vfio_group:
device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
Get device information for userspace driver?
device_info
:VFIO_DEVICE_GET_INFO
region_info
:VFIO_DEVICE_GET_REGION_INFO
irq_info
:VFIO_DEVICE_GET_IRQ_INFO
user access vfio2
vfio container, group, device are all fd.
dma_map:
struct vfio_iommu_type1_dma_map
vfio_device_info
,vfio_region_info
,vfio_irq_info
VFIO User API:
include/linux/vfio.h
-
vfio_add_group_dev()
: vfio core begin tracking the specified iommu_group, register the specified dev by VFIO bus driver.
prepare for vfio-platform¶
echo "vfio-platform" > /sys/bus/$bus/devices/$device/driver_override
echo $device > /sys/bus/$bus/devices/$device/driver/unbind
echo $device > /sys/bus/$bus/drivers_probe
1. set "vfio-platform" to device/driver_override
2. set device_name to driver/unbind
3. set device_name to bus/drivers_probe
device/driver_override
規定只有跟
driver_override
的 value 同名的 driver 才能 binding 到 device. 當 binding 的方式為bus/drivers_probe
時適用. 其他情況不確定.DEVICE_ATTR_RW(driver_override); // => sysfs device_attribute // setter, getter = driver_override_store, driver_override_show // driver_override_store() set parameter to [platform_device] pdev.driver_override // platform_match() use pdev.driver_override // of, acpi, id table match?
driver/unbind
把 driver 跟現在 driver binding 的 device 進行 unbind.
bus_find_device_by_name(driver->bus, buf)
+device_release_driver(dev)
bus/drivers_probe
drivers_probe 會去掃描 bus 下的所有 driver. 如果這些 driver 跟 device 有 match 的話(
platform_match()
), 就讓 driver 跟 device 進行 binding(probe).BUS_ATTR(drivers_probe, S_IWUSR, NULL, store_drivers_probe); store_drivers_probe() => bus_rescan_devices_helper() => device_attach() // device_attach() walks through the list of driver in bus. // calls driver_probe_device() for each pair if driver_match_device() // for each driver in bus: __device_attach_driver() __device_attach_driver() if driver_match_device() // use pdev.driver_override. driver_probe_device() driver_match_device() => driver->bus->match() => platform_match() for platform_device
background API:
device, driver, and bus use DEVICE_ATTR, DRIVER_ATTR, BUS_ATTR to declare sysfs entry
1.a. store and show operations are like setter and getter.
一些整理?¶
vfio_device_fops
裏面的region_info
,irq_info
很像 Linux Driver 使用硬體的 API:request_memory_region()
,request_irq()
.set vfio_group to vfio_container (ioctl vfio_group) => the remaining ioctl becomes available
VFIO IOMMU access
get device fd in vfio_group
vfio device API
describe device
IO region
read/write/mmap offset
mechanism to register interrupt notification
kernel 內部實作¶
VFIO 的介面有三種 fd: container, group, and device.
固對應的 Linux Kernel 也有實作三組 fops 介面: vfio_fops
, vfio_group_fops
, vfio_device_fops
.
kernel 內部元件跟介面的關聯, 可以參考別人整理的一張圖.
VFIO 內部由三個部份所組成: VFIO core + VFIO bus driver + VFIO IOMMU driver. MMIO region map 疑似在 VFIO bus driver 完成的
details¶
vfio_fops:
open, release, read, write
read/write: iommu_driver->ops->read/write
ioctl:
get_api_version, check_extension
set_iommu
vfio_group_fops:
open, release, ioctl
ioctl: get_status, set/unset_container, get_device_fd
vfio_device_fops
abstraction of device->ops
介面 detail¶
fops
vfio_fops
是基於 misc_device:misc_register(&vfio_dev);
vfio_group_fops
是基於 char device:cdev_init(&vfio.group_cdev, &vfio_group_fops);
vfio_device_fops
是基於 anonymous fd, 當 process 呼叫vfio_group_get_device_fd()
API 時會動態產生該介面的 fd 給 process.
vfio_device_fops
的 functions, 會去呼叫參數 vfio_device 裏面的 op structure.比如說
vfio_device_read()
會呼叫vfio_device->ops->read()
vfio_fops
的 ioctl() 除了三個 option 以外, 都會 redirect 到vfio_iommu_driver.ioctl()
內部元件 detail¶
vfio_group_set_container(struct vfio_group *group, int container_fd)
[iommu_driver] driver->ops->attach_group(container->iommu_data, group->iommu_group);
vfio_iommu_driver 實作了 container 的 ioctl options VFIO_IOMMU_MAP_DMA
container fops ioctl() VFIO_IOMMU_MAP_DMA 會 redirect 到
vfio_iommu_driver.ioctl()
vfio_pci (vfio bus driver) 如何跟 vfio interface connect?
vfio_pci 會 implement 一個 pci driver 並註冊:
pci_register_driver()
vfio_pci 註冊後, 會創造對應的
vfio_device
跟vfio_group
, 並且用一個 idr 去 index 該vfio_group
pointer.vfio_group fops open() 時, 會透過 idr 找到 vfio_pci 產生的
vfio_group
other about functionality¶
將 DMA 暴露到 userspace
將 interrupt 暴露到 userspace
userspace 使用 eventfd
透過 VFIO_DEVICE_GET_REGION_INFO 得到 interrupt 訊息
透過
eventfd()
產生一個 eventfd, 並用 VFIO_DEVICE_SET_IRQ 把這個 eventfd 跟 interrupt 相連起來.可以
select/poll/epoll()
該 eventfd
QEMU VFIO (VFIO_PLATFORM)¶
User Interface¶
QEMU device parameter:
1. -device vfio-platform,host=ee300000.sata
2. -device vfio-platform,sysfsdev=/sys/bus/platform/devices/ee300000.sata
Source Code:
hw/vfio/platform.c
Ref: https://elinux.org/R-Car/Virtualization/VFIO#SATA_Pass-Through
Kernel APIs¶
VFIO:
container fd:
- from /dev/vfio/vfio - ioctl() - VFIO_IOMMU_MAP_DMA - VFIO_IOMMU_UNMAP_DMA - VFIO_SET_IOMMUgroup fd:
- from /dev/vfio/$GROUP - ioctl() - VFIO_GROUP_SET_CONTAINER - VFIO_GROUP_GET_DEVICE_FD - VFIO_GROUP_GET_STATUS, VFIO_GROUP_FLAGS_VIABLEdevice fd:
- from VFIO_GROUP_GET_DEVICE_FD - ioctl() - VFIO_DEVICE_GET_REGION_INFO - VFIO_DEVICE_GET_IRQ_INFO - VFIO_DEVICE_SET_IRQS - mmap() - read/write()
KVM:
VM fd:
- from KVM_CREATE_VM - ioctl() - KVM_CREATE_DEVICE - KVM_IRQFDvfio_kvm_device_fd:
- from KVM_CREATE_DEVICE, type=KVM_DEV_TYPE_VFIO - ioctl() - KVM_SET_DEVICE_ATTR(p.s. /dev/kvm and VCPU fd are not used here)
Trace code¶
Data:
FDs:
- vbasedev->fd
- vfio_kvm_device_fd
QEMU MemoryRegion:
- region->mem
- region->mmaps[i].mem
QEMU EventNotifier:
- intp->interrupt
- intp->unmask
QEMU EventNotifier
// Abstraction of eventfd, contains rfd and wfd
// APIs:
// event_notifier_set(): set wfd
// event_notifier_test_and_clear(): check if rfd has been triggered
// event_notifier_get_fd(): return rfd (used by epoll())
Code:
[MMIO]
vfio_populate_device()
vfio_region_setup() // configure vdev->regions[i]
ioctl: VFIO_DEVICE_GET_REGION_INFO
memory_region_init_io(region->mem, &vfio_region_ops)
// configure region->nr_mmaps and region->mmaps
// 1. region->nr_mmaps = 1
// 2. vfio_setup_region_sparse_mmaps()
/*
* vfio_region_ops seems like a fallback mechanism of MMIO,
* if some MMIO registers wouldn't be covered by region->mmaps.
* (but why and which condition the MMIO register isn't covered by region->mmaps?)
*
* Guest OS access of MMIO register will trap and the switch to QEMU,
* then QEMU could access MMIO register via read/write() to VFIO device fd.
* (QEMU is more like a userspace driver using VFIO in this fallback mechanism.)
*/
vfio_region_ops
vfio_region_read()
pread(vbasedev->fd, region->fd_offset + addr)
vbasedev->ops->vfio_eoi(vbasedev);
vfio_region_write()
pwrite(vbasedev->fd, region->fd_offset + addr)
vbasedev->ops->vfio_eoi(vbasedev);
/*
* Create GPA to HPA mapping of MMIO registers
*/
vfio_platform_realize()
for each vdev->regions:
vfio_region_mmap(vdev->regions[i])
for each mmap (nr_mmaps):
// Use VFIO API to create mapping of MMIO register (VA => PA).
// QEMU process could see the MMIO register.
region->mmaps[i].mmap = mmap(region->vbasedev->fd)
// Use KVM API to create mapping of MMIO register (GPA => HPA).
// KVM API argument of MMIO register is VA in QEMU, so QEMU needs to see MMIO registers at first.
// Guest OS could see MMIO registers.
memory_region_init_ram_device_ptr(®ion->mmaps[i].mem, region->mmaps[i].mmap)
memory_region_add_subregion(region->mem, region->mmaps[i].mem)
[IRQ]
vfio_populate_device()
ioctl: VFIO_DEVICE_GET_IRQ_INFO
vfio_init_intp()
event_notifier_init(intp->interrupt)
if VFIO_IRQ_INFO_AUTOMASKED:
event_notifier_init(intp->unmask)
sysbus_connect_irq()
sbd->connect_irq_notifier()
vfio_start_irqfd_injection()
vfio_set_trigger_eventfd()
kvm_vm ioctl: KVM_IRQFD
ioctl: VFIO_DEVICE_SET_IRQS
.flag = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER
.data = fd // from intp->interrupt
[IOMMU]
vfio_memory_listener
vfio_listener_region_add()
vfio_dma_map() => ioctl: VFIO_IOMMU_MAP_DMA
vfio_listener_region_del()
vfio_dma_unmap() => ioctl: VFIO_IOMMU_UNMAP_DMA
[Init]
vfio_base_device_init()
vfio_get_group(groupid)
if new VFIO group:
open() the VFIO group (/dev/vfio/$GROUP)
ioctl: VFIO_GROUP_GET_STATUS, VFIO_GROUP_FLAGS_VIABLE
vfio_connect_container()
// see below
qemu_register_reset(vfio_reset_handler)
vfio_reset_handler => VFIODeviceOps: vfio_compute_needs_reset, vfio_hot_reset_multi
vfio_get_device()
ioctl: VFIO_GROUP_GET_DEVICE_FD
ioctl: VFIO_DEVICE_GET_INFO
vfio_populate_device()
ioctl: VFIO_DEVICE_GET_REGION_INFO
memory_region_init_io()
ioctl: VFIO_DEVICE_GET_IRQ_INFO
vfio_init_intp()
vfio_connect_container()
if existing VFIO container:
ioctl: VFIO_GROUP_SET_CONTAINER (group fd -> container fd)
vfio_kvm_device_add_group()
kvm_vm ioctl: KVM_CREATE_DEVICE, type=KVM_DEV_TYPE_VFIO
vfio_kvm_device_fd ioctl: KVM_SET_DEVICE_ATTR, attr=KVM_DEV_VFIO_GROUP_ADD
if new VFIO container:
open() the VFIO container (/dev/vfio/vfio)
vfio_init_container()
ioctl: VFIO_GROUP_SET_CONTAINER
ioctl: VFIO_SET_IOMMU: iommu_type
vfio_ram_block_discard_disable()
vfio_host_win_add(container, iova_pgsizes) // VFIOHostDMAWindow
vfio_kvm_device_add_group()
register vfio_memory_listener