Wednesday 15 July 2009

tracking dirty pages without patching the kernel?

The main idea is to replace all writable mappings (both anonymous and vm_file) with a VM_SHARED pseudo file mapping that imitates the behavior of the original vm_area.
Each vm_area gets a separate pseudo file/address space which stores the original mapping's properties.

Initially all pages (including the anonymous ones that were COW-ed from a private mapping) are converted to shared mappings of this pseudo file. The reverse mappings and the page cache need to be updated consistently, i.e. each page has to be linked to the pseudo address space and the vm_area has to be inserted into the address space's priority tree.

Each iteration of the incremental update clears the write bit of the PTEs that belong to the dirty pages.

The benefit of the pseudo file mapping is the callbacks it gives us: page_mkwrite() (called from do_wp_page()) and fault() (called from __do_fault()), both of which are vm_operations_struct hooks.
We are always notified the first time a page is written to. We do not rely on the dirty bit (so swapping can still work), and we do not miss page writes if mprotect() is called after the write (since the write itself was logged before the mprotect()).
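
To make the "write-protect, get notified on first write, log it" cycle concrete, here is a minimal userspace analogy. It is not the kernel mechanism described above: it swaps in mprotect() plus a SIGSEGV handler, where the handler plays the role that page_mkwrite() would play. All names (tracking_init, tracking_start_iteration, touch, dirty) are made up for illustration, and the sketch assumes Linux, where a write to a read-only page raises SIGSEGV with si_addr set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 4

static char *region;                         /* tracked memory (hypothetical) */
static size_t page_size;
static volatile sig_atomic_t dirty[NPAGES];  /* write log, one flag per page */

/* Stands in for page_mkwrite(): log the first write, then let it proceed.
 * Calling mprotect() from a handler is common practice for this trick,
 * although POSIX does not list it as async-signal-safe. */
static void on_write_fault(int sig, siginfo_t *si, void *ctx)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)region;
    size_t idx;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + NPAGES * page_size)
        _exit(1);                            /* a real crash, not our fault */
    idx = (addr - base) / page_size;
    dirty[idx] = 1;
    mprotect(region + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* One incremental iteration: reset the log and write-protect every page. */
static int tracking_start_iteration(void)
{
    size_t i;

    for (i = 0; i < NPAGES; i++)
        dirty[i] = 0;
    return mprotect(region, NPAGES * page_size, PROT_READ);
}

static int tracking_init(void)
{
    struct sigaction sa;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return tracking_start_iteration();
}

/* Write through a volatile pointer so the store is not reordered
 * around the reads of the dirty[] log. */
static void touch(size_t page, char value)
{
    ((volatile char *)region)[page * page_size] = value;
}
```

Only the first write to a page in each iteration takes the fault. Later writes to the same page run at full speed until the next call to tracking_start_iteration() protects the page again.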


Considering the different cases of the original mappings:
(anon vs. file backed / private vs. shared, bold text means the original)

vm_file && VM_SHARED:

init: we need to replace the address space operations, all pages are mapped as shared anyway, no COW necessary.
(the pseudo file's writepage() will call the original writepage(), ensuring that we actually modify the original file)

fault(): load the page through the original fault().

page_mkwrite(): calls the original and logs the event.

vm_file && !VM_SHARED:

init: we need to walk the page table and find the pages that are present. The writable ones are anonymous by now (they have been COW-ed), so we have to convert them to shared mappings of the pseudo file; most importantly, the reverse mappings have to be taken care of. The usage counter of these pages will be 1, so we will never copy them on subsequent writes.

(the pseudo file's writepage() will not call the original writepage(), ensuring that we only work on our private copy in memory)

fault(): load the page through the original read

page_mkwrite(): here we have to make a copy of the original page ourselves, because normally COW would ensure that we get our own private copy. However, instead of mapping the copy as an anonymous page, we map it as a shared page of the pseudo file, so that subsequent write faults do not find an anonymous page and page_mkwrite() is called again.
(we either make a new copy of every written page in each iteration, or keep track of the ones we have already copied, since those do not need to be copied again)
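
The second option (copy once, then only log) can be sketched as a small userspace simulation. This is not kernel code: struct sim_private_map, sim_page_mkwrite() and sim_next_iteration() are hypothetical names, and a plain array stands in for the original file's pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SIM_PAGE_SIZE 16
#define SIM_NPAGES    4

struct sim_private_map {
    char  file[SIM_NPAGES][SIM_PAGE_SIZE]; /* stands in for the original file */
    char *priv[SIM_NPAGES];                /* our private copies, NULL = none */
    bool  dirty[SIM_NPAGES];               /* write log of this iteration */
    int   copies_made;
};

/* Simulated page_mkwrite(): copy the page the first time only,
 * but log the write every time we are called. */
static char *sim_page_mkwrite(struct sim_private_map *m, int idx)
{
    if (m->priv[idx] == NULL) {
        char *copy = malloc(SIM_PAGE_SIZE);

        if (copy == NULL)
            return NULL;
        memcpy(copy, m->file[idx], SIM_PAGE_SIZE);
        m->priv[idx] = copy;
        m->copies_made++;
    }
    m->dirty[idx] = true;
    return m->priv[idx];
}

/* A new incremental iteration clears the log but keeps the copies. */
static void sim_next_iteration(struct sim_private_map *m)
{
    memset(m->dirty, 0, sizeof m->dirty);
}
```

Keeping the "already copied" state separate from the per-iteration dirty log is what lets the second write to a page be logged without paying for another copy.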

!vm_file && VM_SHARED:

init: (this is the shared memory case) update the vma so that it is a file mapping now, and see which other vmas map these pages (update those as well? -> we can find all mappings through the reverse priority tree)

BUT: are we supposed to migrate a process that shares memory with someone else?
(if we ensure that all the processes that share the memory are our targets, it could be done)

fault(): creates a new mapping and zeros the page

page_mkwrite(): logs the write


!vm_file && !VM_SHARED:

init: our private malloc()ed memory and memory that we inherited from fork() (if no exec() occurred after)
(this is a problem if our process was fork()ed from a big address space and is actually not using most of it, but is that really a relevant case?)

1.) update the vma that it is a file mapping now with the pseudo file
2.) iterate PTEs and check the ones which are present:
- if writable, we have already COW-ed it: map it as a shared file page of the pseudo file (this is the most common case, malloc()ed memory)
- if not yet writable, make a copy (as COW would) and map it as a shared file page of the pseudo file; this can be very expensive if the address space is big... *

(* to avoid the immediate copy, we could create a fake vma, link the page to it anonymously, store the address the page belonged to, and mark the real PTE as non-present. Then, during fault(), we could check this fake vma to see whether the given address was originally present; if so, we make a copy and actually map it in.)
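
The decision made for each PTE in step 2, including the lazy variant from the footnote, can be sketched as a simulation. Nothing here touches real page tables: struct sim_pte, struct sim_init_stats and sim_init_anon_private() are hypothetical, and the function only counts what an init pass would have to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_pte {
    bool present;
    bool writable;
};

struct sim_init_stats {
    size_t remapped;  /* writable: already our own page, just relinked */
    size_t copied;    /* read-only: copied immediately (eager variant) */
    size_t deferred;  /* read-only: left for fault() (lazy * variant) */
    size_t skipped;   /* not present: nothing to do until fault() */
};

static struct sim_init_stats
sim_init_anon_private(const struct sim_pte *ptes, size_t n, bool lazy)
{
    struct sim_init_stats s = {0, 0, 0, 0};
    size_t i;

    for (i = 0; i < n; i++) {
        if (!ptes[i].present)
            s.skipped++;
        else if (ptes[i].writable)
            s.remapped++;
        else if (lazy)
            s.deferred++;
        else
            s.copied++;
    }
    return s;
}
```

In a process fork()ed from a large parent, most present pages would fall into the read-only branch, which is why the eager variant can be expensive and the lazy one trades that cost for extra work in fault().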

fault(): creates a new mapping and zeros the page (or see *)

page_mkwrite(): we have our own copy, just log the write
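
As a summary, the four cases can be condensed into one small policy function. This is only a restatement of the notes above, not kernel code; the struct, the enum and sim_policy() are hypothetical names. Setting writeback_to_original to false for the anonymous cases is an assumption, made because there is no original file to write to. Every case logs the write in page_mkwrite(), so that is not a field.

```c
#include <assert.h>
#include <stdbool.h>

enum sim_fault_source {
    SIM_FAULT_ORIGINAL,   /* load the page through the original file */
    SIM_FAULT_ZERO_PAGE   /* create a new, zeroed page */
};

struct sim_case_policy {
    enum sim_fault_source fault;
    bool mkwrite_copies;          /* page_mkwrite() makes the private copy */
    bool mkwrite_calls_original;  /* page_mkwrite() forwards to the original */
    bool writeback_to_original;   /* writeback reaches the original file */
};

/* has_file corresponds to vm_file != NULL, shared to VM_SHARED.
 * The lazy (*) variant of the anonymous private case is ignored here. */
static struct sim_case_policy sim_policy(bool has_file, bool shared)
{
    struct sim_case_policy p;

    p.fault = has_file ? SIM_FAULT_ORIGINAL : SIM_FAULT_ZERO_PAGE;
    p.mkwrite_copies = has_file && !shared;
    p.mkwrite_calls_original = has_file && shared;
    p.writeback_to_original = has_file && shared;
    return p;
}
```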