

The Enhanced Network Block Device Linux Kernel Module

(Last revised: 26 May 2005)

    The Enhanced NBD is the result of an industrially funded academic research project with Realm Software of Atlanta, GA, to toughen up the kernel's NBD. It started back in 2.0 times, when I back-ported the nascent NBD by Pavel Machek from the 2.1 development kernel.

    What is an NBD?

    An NBD is "a long pair of wires". It makes a remote disk on a different machine act as though it were a local disk on your machine. It looks like a block device on the local machine, where it will typically appear as /dev/nda. The remote resource doesn't need to be a whole disk or even a partition. It can be a file.

    NBD transports a physical block device over the net

    The intended use for ENBD in particular is for RAID over the net. You can make any NBD device part of a RAID mirror in order to get real time mirroring to a distant (and safe!) backup. To make it clear: start up an NBD connection to a distant NBD server, and use its local device (probably /dev/nda) where you would normally use a local partition in a RAID setup.
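
    For example, with the raidtools that these kernels use, the mirror might be declared like this (a minimal sketch, not from the distribution: /dev/md0 and /dev/hda3 are placeholder names, and your raidtab will differ in detail):

    % cat /etc/raidtab
    raiddev /dev/md0
            raid-level              1
            nr-raid-disks           2
            nr-spare-disks          0
            persistent-superblock   1
            chunk-size              4
            device                  /dev/nda
            raid-disk               0
            device                  /dev/hda3
            raid-disk               1
    % mkraid /dev/md0
    % mke2fs /dev/md0
    % mount /dev/md0 /mnt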

    The original kernel device has been hardened in many ways in moving to ENBD from kernel NBD: the ENBD uses block-journaled multichannel communications; there is internal failover and automatic balancing between the channels; the client and server daemons restart, authenticate and reconnect after dying or loss of contact; the code can be compiled to take the networking transparently over SSL channels (see the Makefile for the compilation options).

    To summarize briefly, the important changes in ENBD with respect to the standard kernel driver are those just described: multichannel communications with internal failover and balancing, self-restarting and reconnecting daemons, and optional SSL transport.


    I haven't been following Pavel's driver closely but we are in friendly contact, and bugfixes pass between the two when the differing architectures permit.

    Requirements

    Kernel versions from around 2.2.10-2.2.15 onwards, or kernels 2.4.0 and 2.6.3 onwards.  "Legacy" branch versions enbd-2.2.25 onwards also work under at least the early 2.4 kernels. The "current" branch versions enbd-2.4.* work on both 2.2 and 2.4 kernels, and probably on 2.0 kernels too (as do the enbd-2.2.* series). The kernel driver is approximately the same in all cases; only the user-side daemons differ - compatibility layers in the driver code emulate the newer kernel interfaces on older kernel architectures.

    Current version as of Mar. 2004

    The last legacy codes for the 2.2 kernel are at 2.2-current (currently version enbd-2.2.29). See the ftp area for the full set of releases. The latest stable release is enbd-2.4.31 (which in theory supports the 2.2 kernels as well as the 2.4 kernels), available here. The development version is currently enbd-2.4.32pre, in the same directory. Yes, this information is probably out of date when you see it.

    I repeat: you can use either code with 2.2 kernels, but the enbd-2.4.* codes are the ones being worked on and developed while the 2.2 codes are no longer developed.

    Documentation

    I'll refer you for general orientation to the Linux Journal article (vol #73, May 2000), but please regard the documentation bundled with the distribution as authoritative. The journal article is accurate only for the procedures of the 2.2 series codes. I keep a copy in postscript format here.

    For a taste, here are some performance measurements:

    NBD performance figures

    These are old figures now - taken under 2.0.36, as I recall, with a much older version of NBD than the current one, but they're still useful. The testbed was a pair of 64MB P200s on a 100BT switched circuit using 3c905 NICs. The best speed I could get out of raw TCP between them was 58.3Mb/s, tested using netperf.

    Of course, the current NFS implementations have improved too.

    HOWTO

    Do all the compilation on the client machine. It has the kernel configuration that we need to match during the compilation - the enbd server never talks to its own kernel.


    Also set the environment variable LINUXDIR to the location of the kernel source directory for your target kernel, because nowadays /usr/src/linux (the classical location) seems to be some fake that points to whatever glibc was compiled against, not to the source of the kernel you are running. The kernel directory must also contain the target kernel's .config file, as it's read during the make.
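
    For instance (the path is only an example - point LINUXDIR at whatever source tree matches the kernel the module will load into, and check the .config is there):

    % export LINUXDIR=/usr/src/linux-2.4.27
    % ls $LINUXDIR/.config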


    1. % tar xzvf nbd-2.2-current.tgz
    2. % cd nbd-2.2.23

    3. And from there on in, I'll quote the INSTALL directions:
       

    4. It should be sufficient to type "make" in this directory. That will build enbd.o, enbd-server, enbd-client in /tmp. Change BUILD in the Makefile to change the build directory. Run "make config all" to really really really make sure that everything is set up for you, but just "make" should normally do the job. Look inside the Makefile if you want to understand why and what.
    5. Edit the Makefile and replace SERVER and CLIENT with the name of your build machine and a willing target machine respectively. The target must be running a kernel into which the enbd.o module will load. You will have to figure out what that means and make it so - running the same kernel as the build machine should be enough. The target can be the build machine too. Setting both SERVER and CLIENT to "localhost" is a safe bet.
    6. Make sure that sudo and ssh are installed on both SERVER and CLIENT machine, and that you are a sudoer, and have done the appropriate ssh-keygen and -export trickery on both sides, so you can log in seamlessly between the two machines (there's a sketch after this list). If you have never heard of these two utilities ... well, really, you are a real rookie of an administrator and you should not be touching this kind of thing! I'd have given odds of 5:1 against you getting this far!
    7. Then run "make test". This depends on the presence of both sudo and ssh for its working. Observe that the module enbd.o is loaded (use /sbin/lsmod for that). Observe also that enbd-server and enbd-client are running (use pstree for that). Check that both server and client have branched off slave servers and clients to handle the connections (again, use pstree to visualize the situation). Check that the state of the device is good by doing a "cat /proc/nbdinfo". You should see indications of /dev/nda being up and running, and several subpartitions - which correspond to connections - also being good. If anything is wrong, look in your system logs for error messages and send me the state shown by /proc/nbdinfo.
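
    The seamless-login setup in step 6 runs along these lines, if you need a reminder (a sketch only - the usernames and hostnames are examples):

    % ssh-keygen -t rsa                    # on each machine; accept the defaults
    % cat ~/.ssh/id_rsa.pub | ssh me@CLIENT 'cat >> ~/.ssh/authorized_keys'
    % ssh me@CLIENT true                   # should now succeed without a password

    Do the same in the other direction, and add yourself to sudoers with visudo on both machines.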

    What happens in the test above?

    Well, "make test" should set up a small file in /tmp on the server machine and serve it to the client machine on the client's nbd. That'll be /dev/nda. The makefile may also compile and run "nbd-test" on it. If all has gone well, you can make a file system with "mke2fs /dev/nda" and play. I suggest:
     
    % mke2fs /dev/nda
    % mount /dev/nda /mnt
    % cd /mnt
    % bonnie ...


    The /dev/ndxN devices (/dev/nda, /dev/nda1, ...) must exist on the client for this to work. I've provided a script called MAKEDEV to make them. On the client, do "cd /dev; sh path_to_MAKEDEV".
     

    Be careful ... there is already a script called MAKEDEV in /dev. Name yours something different or look inside it and see what it does and make the devices it makes by hand. You need block devices /dev/nda, /dev/nda1, /dev/nda2, /dev/nda3, etc, with major 43 (or whatever the kernel sets for NBD_MAJOR) and minors 0, 1, 2, 3, etc. "mknod /dev/nda b 43 0; mknod /dev/nda1 b 43 1; ..." should do the trick.
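
    In other words, something like this (assuming the major really is 43 - check NBD_MAJOR in the kernel source if in doubt):

    % cd /dev
    % mknod nda b 43 0
    % for i in 1 2 3 4; do mknod nda$i b 43 $i; done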
     

    To stop the test, you can try running "make stop" or "make rescue". I don't guarantee a rescue in all circumstances, but it'll try, and you can elaborate the Makefile to suit your circumstances.

    The difficulty is in stopping the self-repairing code! Sending a kill -USR1 to the daemons should shut them down and error out the pending device queue requests. A kill -USR2 will try even harder to shut them down. A kill -TERM should then murder the daemons safely, allowing you to unload the kernel module.
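
    Put together, a shutdown by hand looks something like this (a sketch - the daemon names assume a standard install, and pidof must be able to see them):

    % kill -USR1 `pidof enbd-client enbd-server`   # shut down, error out queued requests
    % kill -USR2 `pidof enbd-client enbd-server`   # try harder, if that didn't take
    % kill -TERM `pidof enbd-client enbd-server`   # murder the daemons safely
    % rmmod enbd                                   # now the module will unload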
     

    Look at the output from /proc/nbdinfo to gauge the state of the device. In particular, you should see the number of active sockets, and the number of active client threads.
     

    Device a:       Open
    [a] State:      initialized, verify, rw, last error 0
    [a] Queued:     +0 curr reqs/+0 real reqs/+10 max reqs
    [a] Buffersize: 86016   (sectors=168)
    [a] Blocksize:  1024    (log=10)
    [a] Size:       2097152
    [a] Blocks:     2048
    [a] Sockets:    4       (+)     (+)     (*)     (+)
    [a] Requested:  2048+0  (602)   (462)   (431)   (553)
    [a] Dispatched: 2048    (602)   (462)   (431)   (553)
    [a] Errored:    0       (0)     (0)     (0)     (0)
    [a] Pending:    0+0     (0)     (0)     (0)     (0)
    [a] Kthreads:   0       (0 waiting/0 running/1 max)
    [a] Cthreads:   4       (+)     (+)     (+)     (+)
    [a] Cpids:      4       (9489)  (9490)  (9491)  (9492)
    Device b-p:     Closed


    In the above I see four client threads (Cthreads) all currently within the kernel (+). They're probably waiting for work to do. I see four network sockets open and known good (+) with the third of them having been active last (*). The first socket seems to have taken more of the work available than the rest, but the difference is not significant. There are no errors reported and no requests waiting in internal queues. If you send in a bug report, make sure to include the output from /proc/nbdinfo.
     

    GOTCHA!

    Some people run into trouble just when they've got a bit of confidence and try setting things up for themselves. They run successfully and then stop the server and the client for a while and then try again. They can't reconnect! What's going on?

    The server generates a signature that is implanted into the client's nbd device at first contact. Any later attempt to connect to a server with a different signature will be rejected. It's an anti-spoofing device. The client doesn't really know the signature either - it's buried in the kernel, and the client can only ask whether it's been given the right signature or not.

    Some find out that they can remove the kernel module and then start again successfully. Of course! That wipes the embedded signature. But it's not the solution. The right thing to do is to
     

    1. generate the same signature in the server every time you start it, using its "-i foobar" option.
    2. if you restart the server without restarting the client, signal the client with SIGPWR ("kill -PWR 19645" or whatever the pid is); see the sketch below.
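
    Something along these lines, that is (a sketch - the port, resource and signature string are only examples):

    % enbd-server 1099 /dev/sda1 -i foobar     # same -i string at every restart
    % kill -PWR `pidof enbd-client`            # after a server-only restart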


    Most people are caught by GOTCHA! #1, but some people hit #2, which is why I mention it here.


    The signal with SIGPWR is normally taken care of by the assistant daemons, nbd-sstatd and nbd-cstatd, but ten to one they haven't been installed yet. I'll explain briefly ... the handshake sequence is longer for a first contact than for a reconnect, and without the SIGPWR the clients will try the short sequence instead of the long.

    HOWTO-2

    I'll lay out in a bit more detail what the "make test" does for you so that you can duplicate it for yourself. The first set of instructions is for an enbd-2.2.* code, and you'll find instructions for the enbd-2.4.* codes immediately after them. Please watch out for command line differences:
     
    1. choose a resource (file or partition) on the serving machine and choose some ports on which to serve it out to the client. Then start the server:

           nbd-server 1100 1101 1102 1103 /dev/sda1

    2. on the client, load the nbd module (make sure to get the right one, using absolute path names if in doubt):

           insmod nbd.o

    3. on the client machine, start the client:

           nbd-client your.server 1100 1101 1102 1103 /dev/nda

    That was for an enbd-2.2.*. For an enbd-2.4.*, the sequence is as follows:
     
    1. choose a resource (file or partition) on the serving machine and a single control port. Then start the server:

           enbd-server 1099 /dev/sda1

    2. on the client, load the enbd module (make sure to get the right one, using absolute path names if in doubt):

           insmod enbd.o

    3. on the client machine, start the client. Note that you give the server control port plus the number of channels you want it to set up. It will find and set up different ports for the data channels on its own:

           enbd-client your.server:1099 -n 4 /dev/nda

    What about resources > 2GB?

    It's really up to the server, and thus a userspace question. Look: if the server system can use lseek() to move across more than 2GB, then the NBD network protocol will support it, because it passes 64 bit offsets. And on the client side, the daemon certainly interacts with the kernel using 64 bit offsets too (whether you can access >2GB files on the client is again a userspace question).

    If you don't have a native 64 bit server system, from what I can find out from the current confused state of affairs in the linux world ... under glibc2 and kernels 2.2.* and 2.4.* you need to compile the nbd-server code with _LARGEFILE64_SOURCE defined. It's all set up for you from nbd-2.2.26 and nbd-2.4.5 on.
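
    If you are hand-building one of those older releases, defining the macro yourself at compile time should do it; for example (the exact make invocation is an assumption - check your release's Makefile):

    % make CFLAGS="-D_LARGEFILE64_SOURCE"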

    If you do not have Large File Support on your system, the ENBD still supports resource aggregation, via either linear or striping RAID, to any size, unlimited by the 2GB file size maximum, provided only that the individual components of the aggregate resource are below 2GB in size. Check out the command line arguments for the server. Just listing multiple resources on the command line is enough to cause some form of aggregation to occur!
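
    For example, listing two sub-2GB partitions serves them out as one aggregated resource (a sketch - see the server's command line documentation for the exact aggregation options):

    % enbd-server 1099 /dev/sda1 /dev/sdb1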

    Setting up for failover

    ... or how to make ENBD work with heartbeat, the well known failover infrastructure. Matthew Sackman has written a very good HOWTO on this subject. You'll find his document here. I've added the necessary scripts to the distribution archive (enbd-2.4.30 on) under the nbd/etc/ha.d directory. Flash: Steve Purkis has adapted the scripts for RedHat-based platforms, and I've included his scripts in the latest archives (enbd-2.4.32 on) in the nbd/etc-RH/ha.d directory.

    Sorry about the links. I hate non-inline documentation myself. In compensation, I'll describe something of what one is trying to achieve with failover; heartbeat is only a means to an end, and in many instances a simple little shell script will be just as good or better, and this description may help you construct it!

    The idea is that server and client are both capable of using a single "floating IP address". This floating IP is normally held by the client, but it moves to the server when the client dies, and it moves back again when the client comes back up and has been brought up to date again. The floating IP is normally that announced in DNS for some vital service such as samba or http.

    Heartbeat is simply a general mechanism for detecting when the client or server has failed, and for running the appropriate scripts in response.

    Overview: the client will normally be running a raid1 (mirror) composed of the NBD device and a local resource. When the client dies, the floating IP is handed off to the server, which then starts serving from the physical resource of which the NBD device is/was a virtual image. When the client comes back up, its local mirror component has to be resynced from the NBD device component, but the client can take the IP immediately, as the mirror resyncs in the background while it continues working.

    Abstraction: There are 4 possible states in which the pair of machines can be: (1) server alive, client alive, (2) server alive, client dead, (3) server dead, client alive, (4) server dead, client dead. Of these, (1) is "normal" and (4) is impossible, for our purposes - failover would have failed. The transitions (1)-(2), (2)-(1) and (1)-(3), (3)-(1) are what we are interested in. Heartbeat initiates actions on the surviving machine or machines after each transition.

    More detail: Let's look at the (1)-(2) transition. The server is the survivor. It will run its 'enbd-client start' script because it now has to take up the role of the client. If the client got the chance before dying, it would also have run its 'enbd-client stop' script. Leave aside what these scripts do for the moment and just focus on the naming convention. In the (2)-(1) transition, when the client comes back up, it runs its 'enbd-client start' script, and the server runs its 'enbd-client stop' script.

    Similarly, in the (1)-(3) transition, where the client is the survivor, it must take up the role of the server and so it runs 'enbd-server start'. The server, if it got the chance before dying, runs 'enbd-server stop'. On the reverse transition, (3)-(1), the client runs 'enbd-server stop' and the server runs 'enbd-server start'.

    What do these scripts do?

    Look at (1)-(2) again, where the client dies and the server survives and takes the client's role. The server has to kill its enbd server daemon, fsck the raw partition if it wasn't journaled, and then mount it in the place where its apache and samba services expect to find it. So that's what 'enbd-client start' does for it.

    The matter of taking the IP is normally handled by heartbeat, but one can do it manually with a simple ifconfig eth0:0 foobar command in the script. The same goes for starting and stopping the apache and samba services - i.e. that's handled by heartbeat too.

    If the client got a chance to run its 'enbd-client stop' script before dying, it would have unmounted the raid mirror, then stopped the mirror and stopped the enbd client daemon that it was running. So that's what 'enbd-client stop' does.

    The (2)-(1) transition is the one that restarts the client in the client role. Usually the server will live to see this transition through, and its 'enbd-client stop' script will unmount the raw partition, start the enbd-server daemon on it, and that's all. The client's 'enbd-client start' script, on the other hand, has to carefully start the enbd-daemon, wait for the NBD device to come up, then start the mirror with the NBD device as primary component. Oh yes, it'll also steal the floating IP address - well, that's normally handled by heartbeat itself.

    The (1)-(3) transition should be thought about in the same way, but it's linked to an easier set of scripts than (1)-(2), since the apache and samba services don't need to be relocated - they stay on the client.

    The client is the survivor. It takes the role of the server with 'enbd-server start', so this script should kill its enbd-client daemon (the mirror component was dead anyway). It does not need to do anything else since the mirror itself has survived. It could take the NBD component out of the mirror with raidhotremove, but it does not need to. If the server got to run 'enbd-server stop' before dying, it should have killed its enbd-server daemon and that's all.

    The reverse transition (3)-(1) is harder. This is where the server has to be reintegrated. It runs 'enbd-server start', which starts up its enbd-server daemon. The client does the reintegration work - it runs 'enbd-server stop', which starts the enbd-client daemon, waits for the NBD device to come up, then integrates it into the mirror as a secondary, using raidhotadd.
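
    The reintegration half of that script might look like this (a sketch, reusing the example names from HOWTO-2; the grep pattern matches the /proc/nbdinfo format shown earlier):

    % enbd-client your.server:1099 -n 4 /dev/nda
    % until grep -q 'Device a:.*Open' /proc/nbdinfo; do sleep 1; done
    % raidhotadd /dev/md0 /dev/nda             # rejoin the mirror as a secondary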

    The scripts are in the HOWTO, and in the distribution archive. Phew!

    Late news - intelligent mirroring

    In enbd 2.4.31, you can run under fr1 instead of plain kernel raid1, and get a "fully integrated" networked RAID1 solution. Get the fr1 patch, apply it to your kernel source, choose fr1 as a raid module in the kernel config (make menuconfig, etc.), and recompile (make modules) for the module. Load the new md.o module, and the fr1.o module on top of it. It's a replacement for raid1.o. It is fully backward compatible with old kernel RAID1.
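
    In outline (a sketch - the patch file name and source path are examples):

    % cd /usr/src/linux
    % patch -p1 < /path/to/fr1-patch
    % make menuconfig                 # select fr1 among the RAID modules
    % make modules
    % insmod drivers/md/md.o
    % insmod drivers/md/fr1.o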

    The intelligence in fr1 is in what happens when one of the enbd servers fails, and what happens afterwards. In ordinary RAID1 mirroring, a failed disk is usually replaced by the operator after a short delay, and then the RAID controller takes it upon itself to resync the new disk from the surviving good disk in the background, while the RAID device pretends to the operating system that all systems are go, as normal. Unfortunately, in the networked scenario

    1. temporary network failures are much more common than real disk blow-ups, so we probably only need to catch up with a bit of the data when the network comes back, not write the whole disk from scratch!
    2. the reason you are working over the network is probably to aggregate a large number of disks, with a total size in the terabytes or petabytes (if you can get there, 16TB is currently the aggregated limit on 32-bit systems on linux), so resyncing all of a disk is something you definitely do not want to do.
    3. unless you are using Gigabit ethernet, or some other very fast medium, the transport is probably slower than the disks, so you want to avoid network transfers when possible.
    The reengineered mirroring in fr1 notices exactly which block writes are missing on the missing server, and when it comes back into contact, updates only those blocks.

    That's a great speed-up and time-saver. It can reduce the time period in which the servers don't have redundancy from a matter of hours to a matter of seconds. And the resync is automatic when contact with an enbd server is reestablished. It doesn't require human intervention, because the enbd client issues the hotadd instruction.

    To set it up, run an enbd device as one component of a fr1 ("raid1") mirror. That's it (so sue me, MacDonalds).

    There is now an fr5 driver too, but even without it you can get at least the automatic resync on reconnect by patching the kernel for fr1 and then using the patched md.o module under ordinary kernel raid5.


    Bugs

    What's that? Oh yes, there are bugs.

    To Do

    Latest News

    Mailing List

    Tummy.com have kindly set up a mailing list for nbd. Send mail containing the word "help" or "subscribe enbd" to enbd-request@lists.community.tummy.com. You will find complete instructions on their web page for the list. The list itself is at enbd@lists.community.tummy.com.
     

    Downloads

    As well as the main site at ftp://oboe.it.uc3m.es/pub/Programs/, tummy.com have set up a mirror at ftp://mirrors.tummy.com/pub/mirrors/linux-ha/enbd/.

    Contacts

    Contact me - not least to encourage me to start a mailing list (yay! done, thanks to SuSE and tummy.com) and improve this page. The change list in the driver source is really impressive.
     

    Peter T. Breuer ptb@inv.it.uc3m.es