A Blog

Forget ssh: mosh and tmux

by on May.01, 2015, under Tinkergeek

Forget SSH? Not really, but two of my favorite remote command line tools are mosh and tmux.

First off, mosh is a “remote terminal application that allows roaming, supports intermittent connectivity, and provides intelligent local echo and line editing of user keystrokes.” It is essentially a screen-synchronizing mechanism based on UDP and wrapped in heavy encryption. I have resumed sessions from different continents: I just opened my laptop and everything was good to go. On top of that, you do not have to run any new external services, since all session spawning is done over SSH. The disadvantages of mosh are that you do not get scrollback or multiple sessions.

Tmux makes up the difference. It is “the better screen”, written from the ground up to manage multiplexed terminal sessions with ease. Not only is the default key combo better (Control-B instead of Control-A), but tmux also makes sharing sessions with a room full of admins a breeze. Give tmux a try next time instead of screen; you will be surprised.
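One way the two tools fit together is to have mosh start (or re-attach to) a tmux session on the remote side, so tmux supplies the scrollback and multiple windows that mosh lacks. The sketch below only builds the command line; the host `example.org` and session name `main` are placeholders, and `tmux new-session -A` assumes a reasonably recent tmux.

```python
import shlex


def mosh_tmux_command(host, session="main"):
    """Build the argv for a mosh connection that lands in a tmux session."""
    # Everything after "--" is the command mosh runs on the remote host.
    # "new-session -A -s NAME" attaches to NAME if it exists, else creates it.
    return ["mosh", host, "--", "tmux", "new-session", "-A", "-s", session]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(mosh_tmux_command("example.org")))
```

Passing the list to `subprocess.run` would launch the session; printing it first makes it easy to check what will be executed.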


Archiving with S3QL and DreamObjects

by on Apr.29, 2015, under Clouds

Besides my NAS and backups, I find it useful to have a third tier of storage. These are files that I rarely access and usually never change after creation, but I still need convenient access to them on occasion. For example, virtual machine images for projects that are completed. When I worked for Purdue, I could save copies into the Fortress HPSS archive system which is backed by fast holding disk and tape. Now, I need to search out an alternative.

I settled eventually on trying the S3QL FUSE file system backed by a cloud object service. Given my interest in Ceph, I have been watching the work of DreamHost and the progress of their DreamObjects service. Their bulk pricing is fairly decent and the cost structure is easy enough to understand.

The following chunk of commands sets up an S3QL file system. For information on setting up an object storage location with DreamObjects please see their wiki.

apt-get install s3ql
mkdir .s3ql
cat > .s3ql/authinfo2 <<EOF
[s3c]
storage-url: s3c://objects.dreamhost.com/bucket/s3ql
backend-login: YOUR_ACCESS_KEY
backend-password: YOUR_SECRET_KEY
fs-passphrase: YOUR_PASSPHRASE
EOF
chmod 600 .s3ql/authinfo2
mkfs.s3ql s3c://objects.dreamhost.com/bucket/s3ql
fsck.s3ql --authfile /home/alex/.s3ql/authinfo2 --cachedir /.archive-cache s3c://objects.dreamhost.com/bucket/s3ql
mount.s3ql --metadata-upload-interval 7200 --authfile /home/alex/.s3ql/authinfo2 --allow-other --threads 4 --log syslog --cachedir /.archive-cache --cachesize 20971520 s3c://objects.dreamhost.com/bucket/s3ql /archive/

First we install the necessary packages, followed by writing out DreamObjects login credentials and a passphrase to encrypt the file system. Then we format the file system and check it. Finally, we mount the file system and it is ready to store our stuff.
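If you would rather generate the credentials file from a script, a minimal sketch might look like the following. The section name `s3c` and all credential values are placeholders of my own; the key names follow the S3QL authinfo2 format as I understand it, so check them against the S3QL documentation for your version.

```python
import os


def write_authinfo(path, section, settings):
    """Write an INI-style authinfo2 file that only its owner can read."""
    lines = ["[%s]" % section]
    lines += ["%s: %s" % (key, value) for key, value in settings.items()]
    # Create the file with mode 0600 from the start, since it holds secrets.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    # Tighten permissions even if the file already existed with a looser mode.
    os.chmod(path, 0o600)
```

A call such as `write_authinfo(".s3ql/authinfo2", "s3c", {"storage-url": "s3c://objects.dreamhost.com/bucket/s3ql", "backend-login": "...", "backend-password": "...", "fs-passphrase": "..."})` would produce the same layout as a hand-written file.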

It is worth reading through the S3QL website and documentation.

How is performance? The most significant impact on performance was, unsurprisingly, my home Internet connection. Thanks to Metronet, I am able to write at 5-6MB/s and read at 13-14MB/s. That is certainly enough for files that are written once and then (mostly) just read.


duplicity and hpss

by on Mar.01, 2013, under Purdue

What if you had a hidey hole where you could store infinite amounts of data? How would that change how you compute?

Welp, that is the situation I find myself in. The lovely place where I work has an installation of HPSS, the High Performance Storage System. It is an HSM, a Hierarchical Storage Manager, system that currently has a pool of cache disk and a tape library behind it. Files are migrated seamlessly between disk and tape making for a file system that looks infinite. The caveat, of course, is that files on tape are not immediately available for use without first pausing a little bit while a robot goes to fetch your data from a storage medium most people assume died decades ago.

Also, given the nature of the beast, it is best not to store small files by the hundreds of thousands. (Really, try putting in a few thousand files and you will hear the angry yells of a storage admin from the depths of the earth.)

So, I have an infinite file store, what should I do with it? I used to do the normal: remember that I probably wanted to keep something forever, package it up in a tarball, and push it to the archive. That is labor intensive and most of all: requires me to remember to do something.

After some recent work I was doing with duplicity for machine backups, it came to me. Using duplicity’s HSI backend (a wrapper that uses the HSI utility to speak with HPSS), I could have a long running “backup” of my home directory (or any directory really) that tracked all the files I ever made or deleted.

  • First step, install duplicity into your home directory. The duplicity site can be found here.
  • Second, edit the hsibackend.py script to point to your HSI binary (ask your user support team if necessary to find this or just have the binary in your PATH). You’ll also want to play with the list function and edit the array slice [3:] to take into account your site’s system header.
    def list(self):
        commandline = '%s "ls -l %s"' % (hsi_command, self.remote_dir)
        l = os.popen3(commandline)[2].readlines()[3:]

    Thanks to sydelko for sharing this bit.

  • Finally, make some cron jobs:
    0 * * * * if [ -f ~/.make_backup ]; then ~/bin/duplicity incr ~ hsi://hostname/home/foobar/home_directory; rm ~/.make_backup; fi
    0 18 * * 5 ~/bin/duplicity incr ~ hsi://hostname/home/foobar/home_directory
    0 18 1 */2 * ~/bin/duplicity full ~ hsi://hostname/home/foobar/home_directory

This gets you a system where incremental backups can be made every hour by creating the ~/.make_backup file. Then regular incrementals are taken every Friday at 6pm and finally a full is taken on the first day of the month every other month. Obviously adjust the intervals according to your preference. My goal is to balance having regular snapshots against the number of files I am putting into the archive.
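The `*/2` in the month field of the last cron line is what produces "every other month". As a rough illustration (a simplified sketch, not a full cron parser), a single cron field can be expanded into the values it matches like this:

```python
def expand_cron_field(field, low, high):
    """Expand one cron field such as "*/2", "1-7" or "5" into matching values."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return sorted(values)


# Month field of the "full" job: months run from 1 to 12.
full_backup_months = expand_cron_field("*/2", 1, 12)
# Day-of-week field of the weekly incremental: 5 is Friday.
weekly_incremental_days = expand_cron_field("5", 0, 7)
```

Here `full_backup_months` comes out as January, March, May, July, September and November, so the full backup runs on the first of each of those months.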

The magic is that duplicity will never expire a backup on its own. If you desire, you can of course expire a backup yourself, but if you find yourself with an infinite amount of space, then your backups can run infinitely long and keep around important files that you find yourself needing long after you thought you would not.

If you think there are a lot of missing steps above, there are. Each site that runs HPSS is going to be a little different, so this is a choose-your-own-adventure sort of thing.
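The `[3:]` slice in the list function is a good example of a site-specific detail: it skips however many banner lines your HSI prints before the real listing. A small sketch of the idea, using made-up banner text and file names (your site's output will differ):

```python
def strip_listing_header(lines, header_lines=3):
    """Drop the banner lines that precede the actual `ls -l` output."""
    return [line.rstrip("\n") for line in lines[header_lines:]]


# Hypothetical output: three header lines followed by two real entries.
sample = [
    "*** hypothetical site banner, line 1 ***\n",
    "*** hypothetical site banner, line 2 ***\n",
    "/home/foobar/home_directory:\n",
    "-rw-------   1 foobar  users  1024 Mar 01 2013 example-full.vol1.difftar.gpg\n",
    "-rw-------   1 foobar  users  2048 Mar 01 2013 example-full.manifest.gpg\n",
]
entries = strip_listing_header(sample)
```

If your site prints five header lines instead of three, `header_lines=5` plays the role of changing `[3:]` to `[5:]` in the backend.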

Do you happen to handle sensitive data (or maybe handle it without necessarily knowing that you are)? Did I mention duplicity has great support for encryption built in? It also handles many other backends people might enjoy.


Running Lustre on top of a Ceph’s RBD

by on Feb.05, 2011, under Tinkergeek

Lustre is a popular parallel file system. It features one nagging problem: it requires reliable underlying disk. Ceph is a new distributed file system that offers redundancy and scalability. Ceph can also deliver a redundant block device. While Ceph has a POSIX file system interface, it requires a Linux kernel 2.6.27 or greater. RHEL5 only has a kernel of version 2.6.18, more or less, but is supported by Lustre. The crazy idea came about to try getting Lustre running on top of Ceph’s RBD layer and export cheap disk to RHEL5 clients.

As an exercise in excitement, let’s try making it happen. For this endeavor, we will need four machines:

  • Ceph-1: Ceph Monitor and OSD running Debian Squeeze with a btrfs file system
  • Ceph-2: Ceph Metadata server and OSD running Debian Squeeze with a btrfs file system
  • Bits: Ceph client and iSCSI target running Debian Squeeze
  • Lustre: Lustre MGS, MDS, OSS server running CentOS 5

Go ahead and install the base OS’s on each, putting the btrfs file systems on ceph-* at /ceph.

Now, let’s install the Ceph server pieces on both ceph-1 and ceph-2 at the same time:
* Handy URLs: Ceph’s Debian Repositories, Basic Ceph Debian HOWTO

cat >> /etc/apt/sources.list <<EOF
deb http://ceph.newdream.net/debian/ squeeze ceph-unstable
deb-src http://ceph.newdream.net/debian/ squeeze ceph-unstable
EOF

apt-get update
apt-get install -y bzip2 ceph linux-headers-$(uname -r) build-essential 

ssh-keygen -t dsa
scp /root/.ssh/id_dsa.pub ceph-2:/root/.ssh/authorized_keys

ceph-1: mkdir -p /ceph/mon0
ceph-1: mkdir /ceph/osd0

ceph-2: mkdir /ceph/osd1

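cat > /etc/ceph/ceph.conf <<'EOF'
# Section names are an assumption based on the host roles listed above;
# fill in mon addr with the address of ceph-1.
[global]
       pid file = /var/run/ceph/$name.pid
[mon]
       mon data = /ceph/mon$id
[mon.0]
       host = ceph-1
       mon addr =
[mds.0]
       host = ceph-2
[osd]
       osd data = /ceph/osd$id
       osd journal = /ceph/osd$id/journal
       osd journal size = 512
[osd.0]
       host = ceph-1
[osd.1]
       host = ceph-2
EOF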

mkcephfs -c /etc/ceph/ceph.conf --allhosts -v -k /etc/ceph/keyring.bin

/etc/init.d/ceph -a start

cclass -a

ceph class activate rbd 1.3

rbd create lustre --size 60000

rbd list

Now let’s install a Ceph-enabled version of the Linux kernel and set up the iSCSI target driver on bits:
* Handy URLs: Making a custom, vanilla kernel for Debian, SCST iSCSI Howto

cd /usr/src
git clone git://ceph.newdream.net/git/ceph-client.git
cd ceph-client
git checkout -b unstable origin/unstable
cp /boot/config-2.6.32-5-amd64 .
make menuconfig # enable libceph, ceph, and rbd
make-kpkg clean
make-kpkg --rootcmd fakeroot --initrd --revision=custom.001 kernel_image kernel_headers
cd ..
dpkg -i *.deb
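# The first field must be the monitor's IP address; look up ceph-1 rather than calling hostname, which would rename this machine.
echo "$(getent hosts ceph-1 | awk '{print $1}') name=admin rbd lustre" > /sys/bus/rbd/add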

svn co https://scst.svn.sourceforge.net/svnroot/scst/trunk
cd trunk
make scst scst_install iscsi iscsi_install scstadm scstadm_install

cat > /etc/scst.conf <<EOF
HANDLER vdisk_fileio {
        DEVICE disk01 {
                filename /dev/rbd0
                nv_cache 1
        }
}

TARGET_DRIVER iscsi {
        enabled 1

        TARGET iqn.2006-10.net.vlnb:tgt {
                LUN 0 disk01
                enabled 1
        }
}
EOF

modprobe scst
modprobe scst_vdisk
modprobe iscsi-scst
scstadmin -config /etc/scst.conf

Now, we have a block device and are exporting it through iSCSI. Let’s set up the Lustre server on lustre:
* Handy URLs: My blog post about Lustre 1.8, Configuring an iSCSI client on RHEL

wget "lustre binaries from Oracle's lustre.org"
rpm -ivh *.rpm

yum install iscsi-initiator-utils
service iscsi start
iscsiadm -m discovery -t sendtargets -p bits
service iscsi restart
fdisk -l

mkfs.lustre --reformat --device-size=250000 --fsname lustre --mdt --mgs /tmp/mdt
mkdir -p /lustre/mds
mount -t lustre -o loop /tmp/mdt /lustre/mds

mkfs.lustre --reformat --fsname lustre --ost --mgsnode=lustre /dev/sda
mkdir -p /lustre/ost
mount -t lustre /dev/sda /lustre/ost

Voilà! At this point you should be able to mount your Lustre file system on your Lustre clients. I performed my work on a bunch of VMs with already slow underlying disk, so things went even more slowly once I mounted the Lustre file system.

Some things still need to be thought about. The number of replicas and the striping strategy for the block devices will impact performance on the Ceph pool; by default these settings are 4MB blocks with no striping. As for the iSCSI layer, tuning for performance will be important, as will exporting the same LUN from multiple iSCSI targets and doing active-passive failover for the iSCSI LUNs. As long as Ceph does not lose too many replicas and the mds/mon services are still operational, your block device should not go away. Configuring failover between Lustre OSS’s will also be important for redundancy.
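To put the 4MB default in perspective, here is a back-of-the-envelope sketch of how the image created earlier maps onto Ceph objects. It assumes `rbd create --size` takes megabytes and that the pool keeps two replicas; both are assumptions to verify against your Ceph version and pool settings.

```python
def rbd_object_count(image_size_mb, object_size_mb=4):
    """Number of backing objects for an RBD image, using ceiling division."""
    return -(-image_size_mb // object_size_mb)


IMAGE_SIZE_MB = 60000   # from "rbd create lustre --size 60000"
REPLICAS = 2            # assumed replica count for the pool

objects = rbd_object_count(IMAGE_SIZE_MB)
raw_storage_mb = IMAGE_SIZE_MB * REPLICAS
```

With these numbers the image is spread over 15000 objects and, once fully written, consumes about 120000MB of raw disk across the OSDs.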

At the time of this post, Ceph required Linux kernel 2.6.37 or newer and Lustre did not support anything newer than 2.6.32. In the future, it will probably be prudent to eliminate the iSCSI translation layer and access the block devices directly from the Lustre OSS’s. Also, Ceph is not production-ready, so there is that small bit to come about.


MooseFS Testing Complete

by on Dec.24, 2010, under Purdue

It has been many months since I began looking at MooseFS. Testing the file system using the same machines for clients and servers was interesting, with some of the results you can find in the last post. During the past few months, my coworkers and I moved the machines around and redesigned the test. What we were looking for was stability, scalability, and to see how a truly distributed file system worked using a bunch of machines.

The cluster was re-architected. In the end, we kept 96 Dell GX745’s with three 160GB SATA disks in the storage cluster. All the disks in a machine were in a RAID0 stripe. This provided MooseFS with 96 ~400GB volumes for a total storage capacity of around 35TB. Each machine connected to a central Cisco 6500 switch using a 1Gbps connection. The switch had four 10Gbps connections into the core of our network on campus. The clients were taken from a random sampling of machines on our network; most clients were AMD-based systems that each had a 10Gbps connection to their resource-level networks, with resources connecting to the core of the network at 20Gbps or 40Gbps.
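The capacity figure can be sanity-checked with a little arithmetic. The sketch below takes the approximate 400GB per machine from above; reading "around 35TB" as binary units (TiB) is my own interpretation rather than something stated in the test notes.

```python
NODES = 96
USABLE_GB_PER_NODE = 400   # approximate usable volume per machine

total_gb = NODES * USABLE_GB_PER_NODE        # decimal gigabytes
total_tb = total_gb / 1000                   # decimal terabytes
total_tib = total_gb * 1000**3 / 1024**4     # binary tebibytes
```

That works out to 38.4TB in decimal units, or roughly 34.9TiB, which lines up with the "around 35TB" quoted above.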

During testing, we lost a total of 13 machines due to hardware failure related to both the disks and other general component failure. No data was lost because of the problems. Testing ran for two months at full-tilt. We did not use a redundant master setup, but we also did not experience a master failure during the testing. Overall, hosting a stable file system on top of 96 machines past their prime is an impressive feat. I remarked in the last post about the file system driver being stable, and to reiterate here, we did not see any client mounts wedge themselves during testing.

The file system performed remarkably well given the type of hardware making up the test system. The highest sustained performance mark was 30Gbps reads and 5Gbps writes to clients. We had clients sustaining over 1Gbps traffic to/from the cluster. MooseFS did not perform in the same ways a Lustre parallel file system would, but that was not expected. The theoretical performance of the underlying disks seems to suggest we could sustain more bandwidth than we saw, but our particular mix of reads and writes probably caused the difference between theoretical and measured performance. Overall, this is admirable performance.

The last category of information we hoped to glean from testing was what impact running a distributed file system would have on the machines hosting it. Running the storage backends full-tilt definitely produces a fair amount of additional system load. At the height of testing, CPU load on any given backend ranged from 20% to 30%. The MooseFS chunk server daemon used approximately 500MB of memory and the system dedicated the remaining memory to file cache. The storage machines moved constantly between 200Mbps and 400Mbps of traffic apiece. This is an appreciable amount of additional load placed on machines; if one wanted to mix storage providers and consumers, then some care should be taken to characterize the impact on the machine’s primary function.
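As a rough cross-check of these numbers, the per-machine traffic can be scaled up to the whole cluster and compared with the uplinks described earlier. This is only arithmetic on the figures quoted in this post, not additional measurement.

```python
NODES = 96
PER_NODE_MBPS = (200, 400)   # observed range per storage machine
UPLINK_GBPS = 4 * 10         # four 10Gbps links from the switch to the core

# Aggregate traffic if every storage machine sits at the low or high mark.
aggregate_gbps = tuple(NODES * mbps / 1000 for mbps in PER_NODE_MBPS)
# Fraction of the switch uplinks that aggregate would occupy.
uplink_share = tuple(gbps / UPLINK_GBPS for gbps in aggregate_gbps)
```

The aggregate lands between 19.2Gbps and 38.4Gbps, which brackets the 30Gbps peak read rate and comes close to saturating the 40Gbps of uplink at the top end.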

As for the general maintainability of the system, no major problems occurred. System failures were handled automatically and the system was very responsive: if a machine was marked administratively offline, it quickly rebalanced and re-replicated the necessary blocks. As well, adding machines back into the cluster caused a rebalancing of blocks to occur. This process was seamless and after the initial replication phase, the additional machines quickly added capacity to the system. The MooseFS status webpages were very handy during testing; there is no doubt that they provide a wealth of performance information.

After two months of poking and prodding the system, we are quite impressed. Integrating a distributed file system into our environment will not be easy if we choose to integrate the file system into machines on a secondary function basis, but that will be another project for down the road.


MooseFS Testing

by on Aug.28, 2010, under Purdue

It’s been a while since anything new popped up on here. But now I have something interesting to talk about. A good many months ago, I was toying around with Lustre and Gluster. Lustre is the standard parallel file system on many large supercomputers today, and it is an open source project now at Oracle. Gluster can be a parallel file system, a distributed file system, or both. Lustre has a centralized manager/metadata server whereas Gluster is much more decentralized. One currently lacking bit of Lustre is that redundancy comes from the underlying OS and hardware, which can be both expensive and more difficult to set up. Gluster has a notion of redundancy built into it and does not necessarily rely on anything else.

There are both good and less good aspects to the designs and implementations of Lustre and Gluster. However, it seems like the holy grail in storage is a distributed file system that can both scale its performance easily and take advantage of left over storage on servers without requiring large amounts of effort to add and remove parts of the storage system. If a backing storage piece in Lustre disappears, all files that resided on that backend become unavailable, which is a problem on less than reliable hardware. Gluster can get around this problem, but its configuration is slightly cumbersome and changing the layout of the storage cluster is fun. These are not necessarily problems in themselves, especially when using dedicated resources, but they can be problems when using willy-nilly, glued-together backing storage.

Given those comments on Lustre and Gluster, there are two distributed file systems I’ve been looking at: Ceph and MooseFS. Ceph is the new kid on the block and is currently undergoing heavy development. I think the designers certainly kept their eye on scalability and redundancy, but getting the file system going has been a pain. The pain mostly focuses around my use of a Linux distribution that is made up of older components (kernel and user land bits). In time, I’ll revisit Ceph and run it through its paces.

Until then, MooseFS is the most recent file system on my chopping block. The other file systems I’ve posted about got tested on a variety of hardware that was sufficiently underpowered. Today, I have more hardware to play with.

To understand the performance that I saw with MooseFS, I should say some words about the hardware I used. There are two clusters I used for testing. The first is a 20 node cluster with 4 cores per machine and 8GB of RAM per machine. The storage is a single SATA disk and the network is a 1Gbps drop. There are 5 nodes on each switch and 4x 1Gbps uplinks from each switch to an aggregation switch. The network is slightly oversubscribed at 5:4, but I don’t believe this was a major problem during testing for this small number of machines.

The next cluster is a 96 node cluster with two cores per machine, 3x 160GB SATA disks in a RAID0, 1Gbps network drop, and every 48 nodes connects to one switch. The two switches are interconnected using a single 10Gbps link.

Testing on the first cluster was done using 20 iozone instances, one per node, that read and wrote files of up to 8GB each at various block sizes. The speed of the file system did not appear to be dependent on block size. That makes sense because MooseFS uses 64MB blocks internally and all operations take place on a whole block. Reads and writes using files smaller than 8GB saw amazing performance, as they all fit into memory cache. At the worst extreme, using 8GB files and seeking to random positions inside the files, the performance was equal to the speed of the hard drives doing sequential reads/writes on 64MB files. This is a potential downside, because iozone may have only wanted a 4KB chunk, but an entire 64MB block was exchanged between client and server, leaving the rest to be thrown out.
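To get a feel for how wasteful that worst case is, the overhead can be worked out directly. This sketch simply takes the 64MB block and 4KB request sizes from the paragraph above at face value.

```python
KIB = 1024
MIB = 1024 * KIB


def read_amplification(request_bytes, block_bytes):
    """How many bytes are moved for every byte the application asked for."""
    return block_bytes / request_bytes


amplification = read_amplification(4 * KIB, 64 * MIB)
```

Under those assumptions each 4KB random read moves 16384 times more data than iozone actually wanted, which explains why random access looked like sequential 64MB transfers.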

A distributed file system able to get native hardware speed is OK, but the most important lesson learned was in the way MooseFS gets mounted on a client. MFS uses a FUSE driver on the client to mount the file system. In the past, I’ve had bad luck with these drivers becoming upset and gumming up the works. However, after running tests on the 20 node cluster for an entire weekend and pushing 60TB of data in and out of the FUSE driver from MFS, everything was still functional and no mounts were hung. This is what pushed me to continue testing and push it onto larger hardware.

I’ve not yet completed a comprehensive test using iozone on the 96 node cluster, but after doing some simple tests, I’m encouraged by the results. Running 48 writers, I was able to get: 540MB/s using a MFS goal of 3, 700MB/s using a goal of 2, 1400MB/s using a goal of 1, and 7000MB/s directly to the underlying file system without going through MFS on 48 nodes. The goal in MFS is the number of replicas of a file that gets stored. Replicas are written in parallel, so the worst case performance was 540MB/s * 3, or about 1620MB/s, which is near the maximum performance I saw using no replicas. There is a discrepancy between the top end performance from MFS and what the underlying disk is able to write. I explain this by the fact that the maximum bandwidth between the two halves of the cluster is only 1250MB/s theoretically, but not all operations had to cross between clients and servers on different halves, so I saw the interlink bandwidth plus whatever local bandwidth was getting used.
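The replica arithmetic is easier to see laid out per goal. The sketch below multiplies each client-visible write rate by its goal to estimate how much data the chunk servers actually had to absorb, and converts the 10Gbps interlink to MB/s for comparison; it uses only the figures quoted above.

```python
# Client-visible write throughput in MB/s, keyed by MFS goal (replica count).
CLIENT_WRITE_MB_S = {3: 540, 2: 700, 1: 1400}

# Every client byte is stored `goal` times, so the backends see goal * rate.
backend_write_mb_s = {goal: rate * goal for goal, rate in CLIENT_WRITE_MB_S.items()}

# Theoretical capacity of the single 10Gbps link between the two switches.
INTERLINK_MB_S = 10 * 1000 / 8
```

All three goals end up pushing between 1400MB/s and 1620MB/s into the backends, just above the 1250MB/s interlink, which is consistent with the link between the two halves being the limiting factor.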

The read performance was better than write: 1520MB/s from 3x replicated files, 1100MB/s from 2x replicated files, 1400MB/s from single copy files, and 7100MB/s from the underlying file systems. If there is more than one copy of a file being read, each reader can pull from different servers. The performance dip from 3x to 2x is probably due to how the different 64MB blocks were distributed between the halves of the cluster, and I still saw a performance maximum near 1300MB/s, the single link between the two halves.

I saw a few things during these tests. The first and again most encouraging was that the FUSE driver held out and nothing blew up. The second was that the master daemon (the process keeping track of 64MB chunks and related metadata) did a lot of DNS queries, I assume for logging of traffic and housekeeping. Overall, nothing terribly surprising came up. I should note that in both tests, the master was a machine with 4 cores, 8GB of memory, and two disks in a mirror, using a single 1Gbps uplink. This machine never seemed terribly busy.

The next step, I believe, will be to increase the internal bandwidth of the cluster and try testing again using iozone.


Link Layer Discovery Protocol

by on Feb.27, 2010, under Tinkergeek

Ever wondered exactly where all your network cabling goes? Have you been using Cisco and wished your computers spoke CDP too? Apparently you and everyone else would love for the computers to just say where they are connected instead of chasing down network cables by hand. That seems to be the goal of the Link Layer Discovery Protocol (LLDP or IEEE 802.1AB). Unlike the Cisco Discovery Protocol (CDP), LLDP is the vendor-neutral attempt to get it all right.

There is an LLDP daemon, lldpd, published at Luffy.cx that implements this under Linux. There are other daemons out there too, but this showed up first when I searched the Debian repositories for precompiled versions. Simply installing it via apt and starting up the service is enough to get your computers and your network devices discovering each other. And if you’re like me and don’t have fancy new networking gear that supports LLDP, lldpd supports a wide range of other network discovery protocols too.

Once I installed lldpd on all my computers and enabled the CDP option (the -c option when starting up lldpd), I saw the magic happen:

c2950-01#show cdp neighbors 
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
                  S - Switch, H - Host, I - IGMP, r - Repeater, P - Phone

Device ID        Local Intrfce     Holdtme    Capability  Platform  Port ID
                 Gig 0/1            107           R       Linux     eth1
                 Gig 0/2            114                   Linux     eth0
                 Fas 0/15           106                   Linux     eth0
                 Fas 0/8            146           R       C831      Eth 2
                 Fas 0/1            142         R S I     871       Fas 0

To query the neighbors discovered by lldpd on the computer side, lldpctl outputs all the current neighbors:

Interface: eth1
 ChassisID: c2950-01 (local)
 SysName:   c2950-01
   cisco WS-C2950G-48-EI running on
   Cisco Internetwork Operating System Software 
   IOS (tm) C2950 Software (C2950-I6K2L2Q4-M), Version 12.1(22)EA8a, RELEASE SOFTWARE (fc1)
   Copyright (c) 1986-2006 by cisco Systems, Inc.
   Compiled Fri 28-Jul-06 17:00 by weiliu
 Caps:      Bridge(E) 

 PortID:    GigabitEthernet0/1 (ifName)
 PortDescr: GigabitEthernet0/1
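If you want to collect this information from many machines, the text output is simple enough to parse. The sketch below assumes the plain `Key: value` layout shown above (other lldpd versions may format their output differently) and keeps only a handful of fields.

```python
WANTED_FIELDS = ("ChassisID", "SysName", "PortID", "PortDescr")


def parse_lldpctl(text):
    """Map each local interface to the neighbor fields reported for it."""
    neighbors = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # free-form lines such as the system description
        value = value.strip()
        if key == "Interface":
            current = neighbors.setdefault(value, {})
        elif current is not None and key in WANTED_FIELDS:
            current[key] = value
    return neighbors


sample = """Interface: eth1
 ChassisID: c2950-01 (local)
 SysName:   c2950-01
 PortID:    GigabitEthernet0/1 (ifName)
 PortDescr: GigabitEthernet0/1
"""
neighbors = parse_lldpctl(sample)
```

For the sample, `neighbors["eth1"]["PortDescr"]` gives the switch port that eth1 is plugged into, which is exactly the cable-tracing answer we were after.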

To finish up the post, I’ll note that discovery protocols have in the past been, and potentially still are, susceptible to attacks that flood devices with too many neighbor relations. Because of this, it might be best to ensure these protocols are disabled on switch ports connected to untrusted machines.



Dirvish

by on Feb.08, 2010, under Tinkergeek

Tinkergeek recently moved to its own dedicated server hosted by fdc servers. The transition wasn’t quite as smooth as one would hope, but they do provide cheap hosting. With moving to a dedicated server based around cheap PC hardware, I thought it’d be a great idea to bring back an old backup solution. Dirvish is a neat set of scripts that combine the goodness of hard-links and rsync. The goal is that Dirvish creates a full backup once and then stores just the changes of the target file system on the backup system.
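The hard-link trick is what keeps this cheap: a file that has not changed between two backups is stored once on disk and simply linked into both backup directories. The following is a toy illustration of that idea only, not how dirvish itself is implemented (it relies on rsync for the real work); the directory and file names are made up.

```python
import os
import tempfile


def snapshot(previous_dir, new_dir):
    """Create new_dir as a hard-linked copy of previous_dir (flat directories only)."""
    os.makedirs(new_dir)
    for name in os.listdir(previous_dir):
        os.link(os.path.join(previous_dir, name), os.path.join(new_dir, name))


with tempfile.TemporaryDirectory() as bank:
    day1 = os.path.join(bank, "20100207")
    day2 = os.path.join(bank, "20100208")
    os.makedirs(day1)
    with open(os.path.join(day1, "notes.txt"), "w") as handle:
        handle.write("unchanged data\n")

    snapshot(day1, day2)

    first = os.stat(os.path.join(day1, "notes.txt"))
    second = os.stat(os.path.join(day2, "notes.txt"))
    same_inode = first.st_ino == second.st_ino   # both names point at one file
    link_count = second.st_nlink                 # number of names for that file
```

Both directories look like full backups, yet the unchanged file occupies disk space only once; only files that differ between runs need new storage.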

Installing dirvish on a Debian system is fairly easy:

apt-get install dirvish

Then, one merely has to copy the example configuration files into place. The first one is the master configuration file that goes in /etc/dirvish and can be found in /usr/share/doc/dirvish/master.conf.

## Example dirvish master configuration file:
bank:
	/backup
exclude:
	lost+found/
Runall:
	root	22:00
expire-default: +15 days
expire-rule:
#	MIN HR    DOM MON       DOW  STRFTIME_FMT
	*   *     *   *         1    +3 months
#	*   *     1-7 *         1    +1 year
#	*   *     1-7 1,4,7,10  1
	*   10-20 *   *         *    +4 days
#	*   *     *   *         2-7  +15 days

Under bank: is going to be the place on your machine that contains all the backups. I’d suggest making this its own file system, as dirvish can eat inodes like there’s no tomorrow. Next, I’d suggest adding /proc and /sys under the global exclude: section just to ensure you don’t back these directories up.

Now, you have to make your first machine directory for backups (known as a vault in dirvish speak). This directory structure will be under the directory in the bank section from above.

mkdir -p /backup/example.com/dirvish

Now, just copy the default.conf example from /usr/share/doc/dirvish/examples into example.com/dirvish and edit.

client: thishost
tree: /
xdev: 1
index: gzip
log: gzip
image-default: %Y%m%d

The key points in this file are the client, xdev, and exclude directives. Merely change your client: to be the IP or hostname of the machine that you’re backing up. Setting xdev: 1 tells rsync to stay on one file system rather than crossing mount points; generally, you’ll want to be careful with this setting, especially if you mount NFS shares. Lastly, update the exclude list for this particular machine. If you’re backing up the local machine using dirvish, be sure to exclude the dirvish bank directory!

Now, you’re all set to create the first backup with:

dirvish --vault example.com --init

If all goes well, you’ll have a new directory under /backup/example.com named for the current date and holding a copy of the target. If there was an error, be sure to remove the failed backup attempt from /backup/example.com and rerun the dirvish command after fixing the error.

Now, the only thing left is to run dirvish-runall via cron at some convenient time and you’re on your way to having a decent backup solution. Be sure to read the remainder of the dirvish documentation to pick up the finer points of configuration.


Quick Howto on Lustre 1.8

by on Nov.24, 2009, under Purdue

Everybody likes fast file performance and recently I’ve been twiddling with different distributed/parallel/clustered file systems for fun and excitement. Tonight was Lustre’s turn to be toyed with. Below is how I got a small/slow Lustre going on my laptop using VirtualBox.

First, install CentOS 5 on some VMs, perhaps three? Then download the RPM packages from Sun/Lustre.org. On the servers:

rpm -ivh kernel-lustre*
rpm -ivh lustre-* lustre-ldiskfs* lustre-modules* e2fsprogs*

On the clients, install the appropriate kernel package for the patch-less client and then the lustre-client packages:

rpm -ivh --force kernel-2.6.18-128.7.1.el5.i686.rpm
rpm -ivh lustre-client-* lustre-client-modules-*

Now, especially if your VMs have one network for the outside world through NAT and one network for inter-VM communication, you should add the following line to /etc/modprobe.conf to make Lustre find and use the correct interface:

options lnet networks=tcp0(eth1)

Now it’s time to set up the meta-data and management servers:

mkfs.lustre --reformat --device-size=250000 --fsname lustre --mdt --mgs /tmp/mdt
mkdir -p /lustre/mds
mount -t lustre -o loop /tmp/mdt /lustre/mds

Now it’s time to set up the servers and storage targets (in this simple example, there is only one server per target). Run this on all the storage servers:

mkfs.lustre --reformat --device-size=10000000 --fsname lustre --ost --mgsnode=@tcp0 /tmp/ost
mkdir -p /lustre/ost
mount -t lustre -o loop /tmp/ost /lustre/ost

After the client has finished rebooting into its proper kernel, mounting the file system is straightforward:

mount -t lustre @tcp0:/lustre /mnt

Of course, now it’s time to write files into our file system to see that it really works:

dd if=/dev/zero of=/mnt/foo bs=1M count=100
lfs getstripe /mnt/*

At this point, if you have multiple storage targets, then you’ll see that your large file got written to just a single target. That’s sad, since we were hoping for super-ultra-fast file I/O from our VMs running Lustre. Thankfully, this can be easily fixed:

mkdir /mnt/super-fast
lfs setstripe -c 2 /mnt/super-fast
dd if=/dev/zero of=/mnt/super-fast/bar bs=1M count=100
lfs getstripe /mnt/super-fast/*

Aaah, there we go. You should have gotten “twice” the performance in MB/s reported from dd and you should see that now your large file of zeros got written out to two different targets. Of course, the Lustre manual has all sorts of useful tuning suggestions and things to try to get working (like shared targets and proper failover). Many things to try, but I just thought I should document all this in a place where I could later find it easily. As for where I got this, I mostly found quick bits at this blog and the stranger stuff in the Lustre 1.8 manual when I got stuck on my network problems.
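Conceptually, `lfs setstripe -c 2` makes Lustre deal the file out across two targets in fixed-size pieces, round-robin. The toy model below shows where the 100MB test file ends up; the 1MB stripe size is an assumption on my part, so check `lfs getstripe` for the value your file system actually uses.

```python
def stripe_layout(file_mb, stripe_count, stripe_size_mb=1):
    """Return the MB written to each target under round-robin striping."""
    per_target = [0] * stripe_count
    full_stripes, remainder = divmod(file_mb, stripe_size_mb)
    for index in range(full_stripes):
        per_target[index % stripe_count] += stripe_size_mb
    if remainder:
        # The final partial stripe goes to the next target in the rotation.
        per_target[full_stripes % stripe_count] += remainder
    return per_target


single_target = stripe_layout(100, stripe_count=1)
two_targets = stripe_layout(100, stripe_count=2)
```

With one stripe the whole 100MB lands on a single target, while with two stripes each target receives 50MB, which is where the "twice" the throughput comes from when the targets can work in parallel.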


Super Computing 2009 – Final Day

by on Nov.20, 2009, under Purdue, Tinkergeek


