Wednesday 15 July 2009

tracking dirty pages without patching the kernel?

The main idea is to replace all writable mappings (both anonymous and vm_file-backed) with a VM_SHARED pseudo file mapping, which imitates the original behavior of the vm_area.
Each vm_area gets a separate pseudo file/address space which stores the original mapping's properties.

Initially all pages (also the anonymous ones that got COW-ed from a private mapping) are converted to shared mappings of this pseudo file. Reverse mappings and the page cache need to be updated consistently (i.e. each page has to be linked to the pseudo address space and the vm_area has to be included in the address space's priority tree).

Each iteration of the incremental update clears the write bit of all PTEs (belonging to the dirty pages).

The benefit of the pseudo file mapping is that we get the address space callbacks: page_mkwrite() in do_wp_page() and fault() in __do_fault().
We are always notified on the first write to a page, we don't rely on the dirty bit (so swapping can still work), and we don't miss page writes in case of an mprotect() call after the write (since the write itself had already been logged before mprotect()).
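A minimal sketch of these hook points, assuming a 2.6.30-style kernel (on older kernels page_mkwrite() receives the struct page instead of a struct vm_fault); pseudo_log_write(), pseudo_populate_page() and the installation comment are made-up placeholders, not the actual implementation:

#include <linux/mm.h>

/* hypothetical helpers, implemented elsewhere */
extern void pseudo_log_write(struct vm_area_struct *vma, pgoff_t pgoff);
extern int pseudo_populate_page(struct vm_area_struct *vma, struct vm_fault *vmf);

static int pseudo_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        /* populate the page; the per-case behavior is discussed below */
        return pseudo_populate_page(vma, vmf);
}

static int pseudo_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        /* first write to this page since its PTE was write-protected: log it */
        pseudo_log_write(vma, vmf->pgoff);
        return 0;
}

static const struct vm_operations_struct pseudo_vm_ops = {
        .fault        = pseudo_fault,
        .page_mkwrite = pseudo_page_mkwrite,
};

/* init (per vma): remember the original file/vm_ops, point vma->vm_file at the
 * pseudo file, set VM_SHARED and install pseudo_vm_ops as vma->vm_ops */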


Considering the different cases of the original mappings:
(anon vs. file backed / private vs. shared; each heading below denotes the original mapping's type)

vm_file && VM_SHARED:

init: we need to replace the address space operations; all pages are mapped as shared anyway, so no COW is necessary.
(the pseudo file's writepage will call the original writepage, ensuring that we actually modify the original file)

fault(): load the page through the original fault().

page_mkwrite(): calls the original and logs the event.
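For this case, a plausible shape of the two callbacks, building on the sketch above (orig_vm_ops is the original file mapping's vm_operations_struct, saved at init time; again only a sketch):

/* saved at init time (hypothetical) */
static const struct vm_operations_struct *orig_vm_ops;

static int pseudo_shared_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        return orig_vm_ops->fault(vma, vmf);      /* e.g. the file's filemap_fault() */
}

static int pseudo_shared_file_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        pseudo_log_write(vma, vmf->pgoff);        /* log the first write */
        if (orig_vm_ops->page_mkwrite)
                return orig_vm_ops->page_mkwrite(vma, vmf);
        return 0;
}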

vm_file && !VM_SHARED:

init: we need to iterate the page table and find the pages that are present. The writable ones have already been COW-ed and are now anonymous, so we have to convert them to shared mappings of the pseudo file; most importantly, the reverse mappings have to be taken care of. The usage counter of these pages will be 1, so we will never copy them on subsequent writes.

(the pseudo file's writepage will not call the original writepage, ensuring that we only work on our private copy in memory)

fault(): load the page through the original read path

page_mkwrite(): here we have to make a copy of the original page, because normally COW would ensure that we get our own private copy. However, instead of mapping it as an anonymous page, we map it as a shared page of the pseudo file, so that subsequent write faults will not find the page as anonymous and page_mkwrite() will be called again.
(we either make a new copy of all written pages in each iteration, or keep track of the ones which have already been copied by us; it is not necessary to copy those again)

!vm_file && VM_SHARED:

init: (this is the case of shared memory) update the vma so that it is a file mapping now, and see which other vmas map these pages (update those as well? -> we can find all mappings through the reverse mapping priority tree)

BUT: are we supposed to migrate a process that shares memory with someone else??
(if we ensure that all the processes that share the memory are our targets, it could be done..)

fault(): creates a new mapping and zeros the page

page_mkwrite(): logs the write


!vm_file && !VM_SHARED:

init: our private malloc()ed memory and memory that we inherited from fork() (if no exec() occurred after)
(this is a problem if our process was fork()ed from a big address space and it is actually not using the most of it, but is that really a relevant case?)

1.) update the vma so that it is now a file mapping backed by the pseudo file
2.) iterate PTEs and check the ones which are present (see the PTE-walk sketch after this case):
- if writable, we have already COW-ed; map it as a shared page of the pseudo file (this is the most common case, malloc()ed memory)
- if not yet writable, make a copy (as COW would) and map it as a shared page of the pseudo file; this can be very expensive if the address space is big... *

(* to avoid the immediate copy, we could create a fake vma, link the page to it anonymously, store the address it belonged to, and mark the real pte as non-present. In that case, during fault() we could check this fake vma and see if the given address was originally present; if so, we make a copy and actually map it in.)

fault(): creates a new mapping and zeros the page (or see *)

page_mkwrite(): we have our own copy, just log the write
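The init step of the two private-mapping cases boils down to a page-table walk over the vma. A rough sketch, assuming a 2.6.x kernel (locking via pte_offset_map_lock() is omitted, and convert_to_pseudo_shared()/copy_into_pseudo_file() are hypothetical helpers):

#include <linux/mm.h>
#include <asm/pgtable.h>

/* hypothetical helpers, implemented elsewhere */
extern void convert_to_pseudo_shared(struct vm_area_struct *vma, unsigned long addr, pte_t *pte);
extern void copy_into_pseudo_file(struct vm_area_struct *vma, unsigned long addr, pte_t *pte);

static void scan_vma_ptes(struct mm_struct *mm, struct vm_area_struct *vma)
{
        unsigned long addr;

        for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
                pgd_t *pgd = pgd_offset(mm, addr);
                pud_t *pud;
                pmd_t *pmd;
                pte_t *pte;

                if (pgd_none(*pgd) || pgd_bad(*pgd))
                        continue;
                pud = pud_offset(pgd, addr);
                if (pud_none(*pud) || pud_bad(*pud))
                        continue;
                pmd = pmd_offset(pud, addr);
                if (pmd_none(*pmd) || pmd_bad(*pmd))
                        continue;
                pte = pte_offset_map(pmd, addr);
                if (pte_present(*pte)) {
                        if (pte_write(*pte))
                                /* already COW-ed: just rebind the page to the pseudo file */
                                convert_to_pseudo_shared(vma, addr, pte);
                        else
                                /* still shared with the original: copy now (or defer, see *) */
                                copy_into_pseudo_file(vma, addr, pte);
                }
                pte_unmap(pte);
        }
}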

Wednesday 4 March 2009

git: pulling into a dirty tree

When you are in the middle of something, you learn that there are upstream changes that are possibly relevant to what you are doing. If your local changes conflict with the upstream changes, git pull refuses to overwrite your changes. In such a case, you can stash your changes away, perform a pull, and then unstash, like this:
$ git pull  
... 
file foobar not up to date, cannot merge.
$ git stash 
$ git pull 
$ git stash apply

Tuesday 3 March 2009

IPMI, remote servers

ipmitool -I lan -H sun20-sp -U root chassis power on
ipmitool -I lan -H sun20-sp -U root chassis power forceoff
ipmitool -I lan -H sun20-sp -U root chassis power reset
ipmitool -I lan -H sun20-sp -U root chassis power status

ssh root@sun20-sp

start /SP/AgentInfo/console

Sunday 22 February 2009

spin lock variations...

spin_lock_irqsave(spinlock_t *lock, unsigned long flags);

Disables interrupts on the local processor and stores the current interrupt state in flags. Note that all of the spinlock primitives are defined as macros, and that the flags argument is passed directly, not as a pointer.

spin_lock_irq(spinlock_t *lock);

Acts like spin_lock_irqsave, except that it does not save the current interrupt state. This version is slightly more efficient than spin_lock_irqsave, but it should only be used in situations in which you know that interrupts will not have already been disabled.

spin_lock_bh(spinlock_t *lock);

Obtains the given lock and prevents the execution of bottom halves.

spin_unlock(spinlock_t *lock);
spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
spin_unlock_irq(spinlock_t *lock);
spin_unlock_bh(spinlock_t *lock);

These functions are the counterparts of the various locking primitives described previously. spin_unlock unlocks the given lock and nothing else. spin_unlock_irqrestore possibly enables interrupts, depending on the flags value (which should have come from spin_lock_irqsave). spin_unlock_irq enables interrupts unconditionally, and spin_unlock_bh reenables bottom-half processing. In each case, your function should be in possession of the lock before calling one of the unlocking primitives, or serious disorder will result.

spin_is_locked(spinlock_t *lock);
spin_trylock(spinlock_t *lock);
spin_unlock_wait(spinlock_t *lock);

spin_is_locked queries the state of a spinlock without changing it. It returns nonzero if the lock is currently busy. To attempt to acquire a lock without waiting, use spin_trylock, which returns nonzero if the lock was successfully acquired (and zero if it was busy). spin_unlock_wait waits until the lock becomes free, but does not take possession of it.
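A minimal usage sketch (the list and the functions are made up for illustration; the lock protects data that is also touched from interrupt context):

#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/errno.h>

static LIST_HEAD(pending);                 /* hypothetical list shared with an IRQ handler */
static DEFINE_SPINLOCK(pending_lock);

void queue_item(struct list_head *item)
{
        unsigned long flags;

        /* may be called with interrupts on or off: save and restore the state */
        spin_lock_irqsave(&pending_lock, flags);
        list_add_tail(item, &pending);
        spin_unlock_irqrestore(&pending_lock, flags);
}

int try_queue_item(struct list_head *item)
{
        /* non-blocking variant: give up if the lock is busy */
        if (!spin_trylock(&pending_lock))
                return -EBUSY;
        list_add_tail(item, &pending);
        spin_unlock(&pending_lock);
        return 0;
}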


A table for locking among different contexts:
http://www.kernel.org/pub/linux/kernel/people/rusty/kernel-locking/c214.html

Thursday 19 February 2009

tcp_sock.ucopy always clean in BLCR checkpoint

extremely fortunate: while BLCR's checkpoint signal handler is executing, tcp_sock's fastpath is surely not in action and the ucopy field is clear. tcp_recvmsg() breaks out of its main loop (with -ERESTARTSYS) in case of signal_pending(), but ucopy is cleaned up first. As a consequence, during cr_restart we can simply reinject packets by calling ip_rcv_finish() on each; tcp_recvmsg() will be re-executed and will process the queues properly.

on netfilter's NF_QUEUE verdict..

Installing a netfilter hook that returns NF_QUEUE on certain packets will cause the kernel to find an nf_queue_handler and call it with the given packet. If no handler is installed, the packet is discarded. A handler can be registered with nf_register_queue_handler(). (The ip_queue module uses this to expose packets to userspace.) After your queue handler is done, you are supposed to reinject the packet into the network stack by calling nf_reinject().
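For reference, a hook that sends traffic to the queue handler might look like this (a sketch only, assuming a 2.6.25+ kernel where hooks take struct sk_buff * and the NF_INET_* hook numbers exist; the queue-handler side is not shown):

#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/in.h>

/* send every incoming TCP packet to the registered nf_queue_handler */
static unsigned int queue_tcp_hook(unsigned int hooknum, struct sk_buff *skb,
                                   const struct net_device *in,
                                   const struct net_device *out,
                                   int (*okfn)(struct sk_buff *))
{
        if (ip_hdr(skb)->protocol == IPPROTO_TCP)
                return NF_QUEUE;    /* handed to the queue handler (dropped if none) */
        return NF_ACCEPT;
}

static struct nf_hook_ops queue_ops = {
        .hook     = queue_tcp_hook,
        .pf       = PF_INET,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_FIRST,
};

/* registered from module init with nf_register_hook(&queue_ops) */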

Sunday 18 January 2009

TCP send in Linux 2.6

tcp_sendmsg() copies the data from userspace, building socket buffers and calling skb_entail() on each packet. skb_entail() calls tcp_add_write_queue_tail() to add the buffer to the socket's sk_write_queue, and sets sk_send_head to the packet if it is not yet set.
Note the difference between sk_write_queue and sk_send_head: send_head points to the first packet which has not yet been handed to the lower layer (IP) for transmission, while all packets remain on the write_queue (until they are acked; check tcp_ack() on the receive side).
In case the packet just added is the only one waiting for transmission (i.e. skb == sk_send_head), tcp_push_one() is called in order to advance sk_send_head and to call tcp_transmit_skb() on the packet; otherwise it is simply enqueued.
Before returning the number of bytes copied from userspace, tcp_sendmsg() calls tcp_push(), which calls __tcp_push_pending_frames(), which calls tcp_write_xmit(), the general function that iterates packets from sk_send_head and calls tcp_transmit_skb() on each of them.
Both tcp_push_one() and tcp_write_xmit() call tcp_transmit_skb() for the actual transmission of a packet through the socket's icsk_af_ops->queue_xmit() function.
Both tcp_push_one() and tcp_write_xmit() call tcp_event_new_data_sent(), which advances sk_send_head and sets up the socket's TCP_TIME_RETRANS timer if it is not set yet.

TCP receive in Linux 2.6

There are two ways of receiving packets from the NIC in the kernel, either through interrupts or by polling the interface.
When receiving through interrupts, the netif_rx() function is called with the sk_buff; it enqueues the packet on the softnet_data's input_pkt_queue and schedules the NET_RX_SOFTIRQ softirq. (For the case when a network card would generate too many interrupts, the driver can register a poll function and switch off interrupts, using the NAPI interface.)
The NET_RX_SOFTIRQ's handler, net_rx_action(), iterates the per-processor softnet_data's poll queue - on which the backlog pseudo device is present as well - and calls each poll function. The backlog device's poll function, process_backlog(), is the one which actually processes the softnet_data's input_pkt_queue and pushes the sk_buff packets to the upper layers by calling netif_receive_skb() on each packet.
netif_receive_skb() iterates the registered packet handlers matching the protocol type and calls deliver_skb(), which in turn calls the protocol's "func" function; IP's packet handler is ip_rcv().
ip_rcv() eventually calls ip_rcv_finish(), which calls ip_route_input() on the skb in order to fill in the dst field. The dst field is a "rtable" struct which embeds a "dst_entry" struct. The dst_entry's input field is a function pointer which is set to ip_local_deliver() in case of a packet that has to be delivered locally.
There is a global inet_protos array (hash) of "net_protocol" structs which represents the registered transport layer protocols, TCP's net_protocol is called tcp_protocol. ip_local_deliver() dereferences the protocol array and calls its handler function, TCP's handler is tcp_v4_rcv().
There are three receive queues of a socket: the sk_backlog, the ucopy.prequeue and the main sk_receive_queue.
The three queues have the following purposes: the prequeue defers in-order data processing to process context when a process is waiting on the socket (the so-called fastpath); the receive queue is the standard way of receiving packets if no reader process was waiting when the packet arrived; the backlog is used to temporarily store received packets while a reader is processing the receive queue.
tcp_v4_rcv() calls tcp_rcv_established() if the connection is established, and puts a socket buffer on the prequeue if the socket is not locked but there is a user process waiting to read (ucopy.task is set) and the socket buffer contains in-order data according to the expected sequence number.
The main purpose of the prequeue mechanism is to allow processing of socket buffers in process context, thereby decreasing the amount of time spent in bottom-half (softirq) processing.
The process is woken up after putting the buffer on the queue. If there is no user process waiting, tcp_v4_do_rcv() is called directly in softirq context and the skb is placed on the main receive queue. If the socket is locked, the packet goes to the backlog queue. tcp_recvmsg() is the function called from process context in order to copy segments to user space. It processes the receive queue and/or installs itself as a waiter for data in case the queue is empty or the requested amount of bytes is not yet available. Doing so, it calls sk_wait_data(), which will wait until tcp_v4_rcv() receives packets from the IP layer.
Backlog queue is also processed in tcp_recvmsg() right before the socket is released after a read operation.
Note that tcp_v4_do_rcv() can be called either in softirq or process context. Process context is responsible for data transfer to userspace, while softirq will place the buffer on the receive queue. tcp_v4_do_rcv() is called in process context through the sk_backlog_rcv field of the socket when the prequeue is iterated in tcp_prequeue_process().
Note that tcp_rcv_established() calls tcp_ack() as well, which cleans the socket's write_queue (i.e. the retransmit queue) according to the ACK received in the packet (check the send side).