Purdue
MooseFS Testing
by Alex on Aug.28, 2010, under Purdue
It’s been a while since anything new popped up on here, but now I have something interesting to talk about. A few months ago, I was toying around with Lustre and Gluster. Lustre is the standard parallel file system on many large supercomputers today, and it is an open source project now at Oracle. Gluster can act as a parallel file system, a distributed file system, or both. Lustre has a centralized manager/metadata server, whereas Gluster is much more decentralized. One thing Lustre currently lacks is built-in redundancy: it has to come from the underlying OS and hardware, which can be both expensive and more difficult to set up. Gluster has a notion of redundancy built in and does not necessarily rely on anything else.
There are both good and less good aspects to the designs and implementations of Lustre and Gluster. However, it seems like the holy grail in storage is a distributed file system that can both scale its performance easily and take advantage of leftover storage on servers, without requiring large amounts of effort to add and remove parts of the storage system. If a backing storage piece in Lustre disappears, all files that resided on that backend become unavailable, which is a problem on less than reliable hardware. Gluster can get around this problem, but its configuration is slightly cumbersome and changing the layout of the storage cluster is “fun.” These are not necessarily problems in themselves, especially when using dedicated resources, but they can become problems when using willy-nilly, glued-together backing storage.
Given those comments on Lustre and Gluster, there are two distributed file systems I’ve been looking at: Ceph and MooseFS. Ceph is the new kid on the block and is currently undergoing heavy development. I think the designers certainly kept their eye on scalability and redundancy, but getting the file system going has been a pain. The pain mostly centers on my use of a Linux distribution made up of older components (kernel and userland bits). In time, I’ll revisit Ceph and put it through its paces.
Until then, MooseFS is the most recent file system on my chopping block. The other file systems I’ve posted about got tested on a variety of hardware that was sufficiently underpowered. Today, I have more hardware to play with.
To understand the performance that I saw with MooseFS, I should say some words about the hardware I used. There are two clusters I used for testing. The first is a 20 node cluster with 4 cores and 8GB of RAM per machine. The storage is a single SATA disk and the network is a 1Gbps drop. There are 5 nodes on each switch and 4x 1Gbps uplinks from each switch to an aggregation switch. The network is slightly oversubscribed at 5:4, but I don’t believe this was a major problem during testing for this small number of machines.
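As a quick sanity check on the 5:4 figure, the oversubscription ratio is just the total node-facing bandwidth on a switch divided by its total uplink bandwidth. The snippet below is a minimal sketch of that arithmetic; the `oversub` function name is made up for illustration and is not part of any tool mentioned here.

```shell
# Hypothetical helper: oversubscription ratio for one edge switch.
# Arguments: node ports, Gbps per node port, uplinks, Gbps per uplink.
oversub() {
    awk -v n="$1" -v ns="$2" -v u="$3" -v us="$4" \
        'BEGIN { printf "%.2f\n", (n * ns) / (u * us) }'
}

# 5 nodes at 1Gbps each sharing 4 uplinks at 1Gbps each
oversub 5 1 4 1
```

A result of 1.25 is the same thing as 5:4, so in the worst case each node could still push 80% of its line rate through the uplinks.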
The next cluster is a 96 node cluster with two cores per machine, 3x 160GB SATA disks in a RAID0, and a 1Gbps network drop. Each group of 48 nodes connects to one switch, and the two switches are interconnected using a single 10Gbps link.
Testing on the first cluster was done using 20 iozone instances, one per node, that each read and wrote files of up to 8GB at various block sizes. The speed of the file system did not appear to depend on block size. That makes sense because MooseFS uses 64MB blocks internally and all operations take place on a whole block. Reads and writes using files smaller than 8GB saw amazing performance, as they all fit into the memory cache. At the worst extreme, using 8GB files and seeking to random positions inside the files, the performance was equal to the speed of the hard drives doing sequential reads/writes on 64MB files. This is a potential downside: iozone may have only wanted a 4KB chunk, but an entire 64MB block was exchanged between client and server, leaving the rest to be thrown out.
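To put a number on that downside, here is a rough sketch of the overhead, assuming (as described above) that a whole 64MB block really does move for every small request. The `amplification` helper is a made-up name for illustration only.

```shell
# Hypothetical helper: how many times more data is moved than was
# requested when a whole block is transferred per request (sizes in KB).
amplification() {
    block_kb=$1
    request_kb=$2
    echo $(( block_kb / request_kb ))
}

# 64MB block (65536KB) versus a 4KB iozone record
amplification 65536 4
```

Under that assumption, a random 4KB read costs 16384 times the data it asked for, which would explain why random I/O on large files looked like sequential disk speed.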
A distributed file system able to get native hardware speed is OK, but the most important lesson learned was in the way MooseFS gets mounted on a client. MFS uses a FUSE driver on the client to mount the file system. In the past, I’ve had bad luck with these drivers becoming upset and gumming up the works. However, after running tests on the 20 node cluster for an entire weekend and pushing 60TB of data in and out of the FUSE driver from MFS, everything was still functional and no mounts were hung. This is what pushed me to continue testing and move on to larger hardware.
I’ve not yet completed a comprehensive test using iozone on the 96 node cluster, but after doing some simple tests, I’m encouraged by the results. Running 48 writers, I was able to get 540MB/s using an MFS goal of 3, 700MB/s using a goal of 2, 1400MB/s using a goal of 1, and 7000MB/s writing directly to the underlying file systems on 48 nodes without going through MFS. The goal in MFS is the number of replicas of a file that get stored. Replicas are written in parallel, so the worst-case figure actually represents 540MB/s * 3, or about 1620MB/s of data hitting disk, which is near the maximum performance I saw using no replicas. There is a discrepancy between the top-end performance from MFS and what the underlying disks are able to write. I attribute this to the maximum bandwidth between the two halves of the cluster, which is only 1250MB/s theoretically. Not all operations had to cross between clients and servers on different halves, so what I saw was the interlink bandwidth plus whatever local bandwidth was getting used.
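The arithmetic behind those write numbers can be sketched as below. This is only back-of-the-envelope math on the figures quoted above, not anything MooseFS reports itself, and `raw_write_rate` is a hypothetical helper name.

```shell
# Hypothetical helper: data actually landing on disk, in MB/s, given the
# client-visible write rate and the MFS goal (replica count).
raw_write_rate() {
    echo $(( $1 * $2 ))
}

# Client-visible write rates from the tests above, paired with their goals
while read -r rate goal; do
    echo "goal ${goal}: $(raw_write_rate "$rate" "$goal") MB/s hitting disk"
done <<EOF
540 3
700 2
1400 1
EOF

# Theoretical ceiling of the 10Gbps inter-switch link, in MB/s
echo "interlink: $(( 10 * 1000 / 8 )) MB/s"
```

All three goals work out to somewhere between 1400 and 1620MB/s of raw writes, which sits just above the 1250MB/s interlink and far below the 7000MB/s the disks can take.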
The read performance was better than write: 1520MB/s from 3x replicated files, 1100MB/s from 2x replicated files, 1400MB/s from single-copy files, and 7100MB/s from the underlying file systems. If there is more than one copy of a file being read, each reader can pull from different servers. The performance dip from 3x to 2x is probably due to how the different 64MB blocks were distributed between the halves of the cluster, and I still saw a performance maximum near 1300MB/s, about what the single link between the two halves can carry.
I saw a few things during these tests. The first, and again most encouraging, was that the FUSE driver held out and nothing blew up. The second was that the master daemon (the process keeping track of 64MB chunks and related metadata) did a lot of DNS queries, I assume for logging of traffic and housekeeping. Overall, nothing terribly surprising came up. I should note that in both tests, the master machine had 4 cores, 8GB of memory, two disks in a mirror, and a single 1Gbps uplink. This machine never seemed terribly busy.
The next step, I believe, will be to increase the internal bandwidth of the cluster and try testing again using iozone.
Quick Howto on Lustre 1.8
by Alex on Nov.24, 2009, under Purdue
Everybody likes fast file performance and recently I’ve been twiddling with different distributed/parallel/clustered file systems for fun and excitement. Tonight was Lustre’s turn to be toyed with. Below is how I got a small/slow Lustre going on my laptop using VirtualBox.
First, install CentOS 5 on some VMs, perhaps three? Then download the RPM packages from Sun/Lustre.org. On the servers:
rpm -ivh kernel-lustre*
rpm -ivh lustre-1.8.1.1* lustre-ldiskfs* lustre-modules* e2fsprogs*
On the clients, install the appropriate kernel package for the patch-less client and then the lustre-client packages:
rpm -ivh --force kernel-2.6.18-128.7.1.el5.i686.rpm
rpm -ivh lustre-client-1.8.1.1-2.6.18_128.7.1.el5_lustre.1.8.1.1.i686.rpm lustre-client-modules-1.8.1.1-2.6.18_128.7.1.el5_lustre.1.8.1.1.i686.rpm
Now, especially if your VMs have one network for the outside world through NAT and one network for inter-VM communication, you should add the following line to /etc/modprobe.conf to make Lustre find and use the correct interface:
options lnet networks=tcp0(eth1)
Now it’s time to set up the meta-data and management servers:
mkfs.lustre --reformat --device-size=250000 --fsname lustre --mdt --mgs /tmp/mdt
mkdir -p /lustre/mds
mount -t lustre -o loop /tmp/mdt /lustre/mds
Now it’s time to set up the servers and storage targets (in this simple example, there is only one server per target). Run this on all the storage servers:
mkfs.lustre --reformat --device-size=10000000 --fsname lustre --ost --mgsnode=@tcp0 /tmp/ost
mkdir -p /lustre/ost
mount -t lustre -o loop /tmp/ost /lustre/ost
After the client has finished rebooting into its proper kernel, mounting the file system is straightforward:
mount -t lustre @tcp0:/lustre /mnt
Of course, now it’s time to write files into our file system to see that it really works:
dd if=/dev/zero of=/mnt/foo bs=1M count=100
lfs getstripe /mnt/*
At this point, if you have multiple storage targets, then you’ll see that your large file got written to just a single target. That’s sad, since we were hoping for super-ultra-fast file I/O from our VMs running Lustre. Thankfully, this can be easily fixed:
mkdir /mnt/super-fast
lfs setstripe -c 2 /mnt/super-fast
dd if=/dev/zero of=/mnt/super-fast/bar bs=1M count=100
lfs getstripe /mnt/super-fast/*
Aaah, there we go. You should have gotten “twice” the performance in MB/s reported from dd, and you should see that your large file of zeros now got written out to two different targets. Of course, the Lustre manual has all sorts of useful tuning suggestions and things to try to get working (like shared targets and proper failover). Many things to try, but I just thought I should document all this in a place where I could easily find it later. As for where I got this, I mostly found quick bits at this blog and the stranger stuff in the Lustre 1.8 manual when I got stuck on my network problems.
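To see why a stripe count of 2 spreads the file over both targets, here is a small sketch of plain round-robin (RAID0-style) striping. It assumes a 1MB stripe size, which I believe is the Lustre default but treat as an assumption here, and `stripe_index` is a made-up helper rather than an `lfs` command.

```shell
# Hypothetical helper: which stripe (0-based) holds a given file offset
# under round-robin striping. All sizes are in MB.
stripe_index() {
    offset_mb=$1
    stripe_size_mb=$2
    stripe_count=$3
    echo $(( (offset_mb / stripe_size_mb) % stripe_count ))
}

# First few 1MB blocks of a file striped across 2 targets
for offset in 0 1 2 3; do
    echo "offset ${offset}MB -> stripe $(stripe_index "$offset" 1 2)"
done
```

With dd writing 1MB blocks, consecutive writes alternate between the two targets, which is where the "twice" the throughput comes from when the targets sit on separate disks.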
Super Computing 2009 – Final Day
by Alex on Nov.20, 2009, under Purdue, Tinkergeek
So, I didn’t have much of a chance to update the blog with blow-by-blow action updates at SC this year. Oops. However, the show was a blast. The Purdue team won the “lowest power consumption” award. It was certainly a surprise, given another team had a generally more efficient power design. In any case, we’re certainly happy we won an award.
I did not get much of a chance to walk around the floor between working both of Purdue’s booths. However, I did hear about and see one or two interesting things. The first was the new Ethernet gear coming out of Voltaire. They can confederate several of their switch chassis to form a virtual switch that allows an LACP bond between the chassis to be active-active, which seems to be all the rage these days. After visiting the Voltaire booth, I found that Cisco also has various solutions, from the virtual port channel on the Nexus gear to the virtual switch chassis on the Catalyst gear. Also, it appears that perhaps Cisco will be implementing something like Woven’s multi-path layer 2 magic, so that could be fun too.
Lastly, the most interesting company on the floor that I saw was the “Two Guys and a Cluster” folks. Really, it’s just two guys that live on opposite coasts that help people acquire and run clusters. They seem to be a nice middle ground between a value-added cluster vendor and just buying hardware off a website and doing it all yourself. Their web address, with the start of an interesting blog, can be found here.
Super Computing 2009 – Day 2
by Alex on Nov.16, 2009, under Purdue
The day started early with the team wandering to the convention center to start validating all our applications. We found that the center’s power dropped voltage under load enough to cause us to badly overshoot our current limit. After taking account of all our hardware, we removed some memory and turned off our spare hardware, which solved our problems.
At the end of the day, everything was good to go for the morning when we start running our benchmark for the judging. I’ll leave you with a picture of our booth this year:

Super Computing 2009 – Day 1
by Alex on Nov.16, 2009, under Purdue
After getting up at 3AM eastern time to make our flight at 6AM, the Purdue Student Cluster Challenge team made it to Portland, Oregon, on time and not terribly tired. The challenge organizers had our box sitting in our booth and we took no time in starting to unbox it.
For now, here is a link that talks about the challenge for this year: InsideHPC.
GlusterFS
by Alex on Oct.13, 2009, under Purdue
At work we’ve been evaluating different distributed file systems in our spare time. Currently, we use one large, centralized filer and have seen problems being able to push as many input/output operations through as we’d like. While that’s mostly a backend disk problem, wouldn’t it be great to have a storage system that grew as we added more cluster nodes?
In that hope, we tested some pretty alpha-level pNFS code, some Hadoop, and now GlusterFS. All of these seem to have some faults, but this is what I found for Gluster…
Downloading the RPMs from the main FTP repository and installing them was pretty painless on RHEL5. The documentation is pretty sparse and misleading, but eventually whipping up these config files made it all go:
#glusterfsd.vol
volume posix
  type storage/posix
  option directory /glusterfs
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow *
  subvolumes brick
end-volume
and
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host foobar-0.example.com
  option remote-subvolume brick
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host foobar-2.example.com
  option remote-subvolume brick
end-volume

volume remote3
  type protocol/client
  option transport-type tcp
  option remote-host pfnstest-003.example.com
  option remote-subvolume brick
end-volume

volume remote4
  type protocol/client
  option transport-type tcp
  option remote-host foobar-4.example.com
  option remote-subvolume brick
end-volume

volume remote5
  type protocol/client
  option transport-type tcp
  option remote-host foobar-5.example.com
  option remote-subvolume brick
end-volume

volume remote6
  type protocol/client
  option transport-type tcp
  option remote-host foobar-6.example.com
  option remote-subvolume brick
end-volume

volume replicate1
  type cluster/replicate
  subvolumes remote1 remote2 remote3
end-volume

volume replicate2
  type cluster/replicate
  subvolumes remote4 remote5 remote6
end-volume

volume distribute
  type cluster/distribute
  subvolumes replicate1 replicate2
end-volume

volume writebehind
  type performance/write-behind
  option window-size 4MB
  subvolumes distribute
end-volume

volume cache
  type performance/io-cache
  option cache-size 1024MB
  subvolumes writebehind
end-volume
The backing storage for this GlusterFS was an Ext3 file system carved out of LVM and housed on HP SATA disk trays. Mounting up that file system and running some simplistic tests, I found that with large files the file system performance was about at the maximum network speed, and that with small files the performance was in the 5-10MB/s range. Not bad for an hour or two’s worth of effort.
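One consequence of the client config above is worth spelling out: six bricks grouped into two 3-way replicate sets only give you two bricks’ worth of usable space. A minimal sketch of that arithmetic follows; `usable_gb` is a hypothetical helper, and the 500GB brick size is an invented figure, since the real brick sizes are not given here.

```shell
# Hypothetical helper: usable capacity (in GB) of a distribute-over-replicate
# layout where every brick has the same size.
usable_gb() {
    bricks=$1
    replicas=$2
    brick_gb=$3
    echo $(( bricks / replicas * brick_gb ))
}

# 6 bricks in 3-way replica sets; 500GB per brick is an invented number
usable_gb 6 3 500
```

That is the price of the redundancy: three copies of everything, so one third of the raw space.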
Another Install Day
by Alex on Jul.19, 2009, under Purdue
It appears our last cluster was such a hit that once again my department at Purdue is going to do a single-day, multi-hundred-node cluster installation. Last year, the work went very quickly and we had the cluster racked before lunch. This year, we asked for fewer volunteers and added an extra step in the process: installing 10GigE network adapters. Hopefully that will slow things down enough so the VIPs can actually see us working.
As always, the marketing folks put together a short little clip to push the day:
The racking and stacking is pretty much just a matter of manpower. Getting the software onto the systems is done using RHEL’s Anaconda and several scripts. All the magic was described by some of my coworkers in a paper for USENIX’s LISA 2008 conference. A copy can be found here. It’s pretty fun stuff.
OSG Storage Forum – Fermilab
by Alex on Jul.05, 2009, under Purdue
Yeah, I’m a tad behind on updating my blog… So, just this week several of us from work went to Fermilab to participate in the OSG Storage Forum. Mostly, it was a couple of days of talks where interested parties participating in the grid discussed their various storage solutions. The biggest presentations came from the folks involved with the CMS and ATLAS projects, stemming from the work at the LHC. To support physicists around the nation and the globe, there are a whole series of sites dedicated to providing access to the terabytes of data flowing from the LHC, all of it just waiting to be analyzed.
The biggest reason I went was to hear about other people using Hadoop as a replacement for a piece of software called dCache… Oh yeah, and it was held in the most awesome office building in the world:

That’s pretty much the interesting bits of that visit. Though, I thought Fermilab was pretty nifty.
TeraGrid 2009 – Arlington, VA
by Alex on Jul.05, 2009, under Purdue
During June, I traveled to Arlington, VA, for the TeraGrid 2009 conference for work. This is certainly an interesting conference. For those that don’t know, the TeraGrid is “an open scientific discovery infrastructure combining leadership class resources at eleven partner sites to create an integrated, persistent computational resource.” In other words, a fairly big project put together by the NSF to create a national cyberinfrastructure to serve the needs of science.
I was part of the student program, which included 120 other high school, undergraduate, and graduate students from all around the nation. It was certainly interesting listening to the talks given by middle and high school educators about how they are trying to integrate computer simulation and visualization into the classroom.
During the conference, I mostly had two goals: listen to talks and present some of the work being done at Purdue. I had a small poster in the poster session about deploying Hadoop and how Purdue envisions using the Hadoop Distributed File System to support high throughput computing (which, I hear we’re known for doing quite well). Also, we had a small talk in the Education and Outreach Track about how Purdue is using the SC Cluster Challenge to support undergraduate exploration of High Performance Computing. Thankfully, both my little poster and the talk drew a lot of positive attention.
Sadly, the TeraGrid conference seems very focused on talking about the science being done on the grid and not so much about the technologies that have been deployed and developed to make “the grid.” (Not to say the science isn’t important, just that most of it is way over my head.) The best parts of the conference were finding other site admins to talk with and the poster session.
And, here’s a picture of the Capitol building from when we ventured away from the hotel on the last day:

Virtualization Options
by Alex on Feb.01, 2009, under Purdue
From Amazon to Linode and to scientific clouds, everyone seems to love using Xen to run a service for others. It appears the open source and easily scriptable nature, not to mention the lack of licensing problems and cost, make this a compelling option when you need to host a lot of VMs. On the other hand, it appears VMware has made headway against Xen when it comes to mainline IT and the desire to decrease cost. Nothing beats a manager’s opinion after getting flogged by the VMware sales stick.
After running VMware Server 1.0.5 for work’s small physical-server-to-VM effort and deploying Xen for a scientific cloud, I have started to look into what’s next.
VMware Server 1.x is getting old and support has begun to wear thin. So, I hitched up an Ubuntu box with VMware Server 2.0 and started poking around. The first major difference is the command and control interface; it is now a web page with a lot of JavaScript and a plugin to view machine consoles. This is quite a bit different from the native X11-based interface that “just worked” over SSH with X forwarding. I could not for the life of me get the plugin working on my Mac natively.
The next stop for this train was more free VMware products: VMware ESXi. I had deployed an evaluation copy of ESX before, and I was taken aback to find that the only way to manage ESXi was from a Windows-only client. After some digging, I found that to truly manage ESXi installations you need a Windows-based server; only then can you get even a web interface.
It just seems wrong to rely on a Windows box to manage a fleet of VM-ized servers. VMware Server 1.x was still looking pretty good.
Then, I remembered the Xen folks were doing a lot of work toward supporting unmodified guests. The Xen HVM stuff looks very promising. It does require having a recent processor and maintaining a primary host operating system, but it allows fine-grained control over the network bridges and is completely open source. Using Xen-tools it is nearly trivial to deploy a new Linux distribution in an image, and by giving just a couple of parameters in a config file, you can boot Windows. Getting an encrypted VNC session for a console is a whole lot easier to deal with than attempting to get some sort of silly browser plugin functioning.
Now, just to figure out if sacrificing VMware image compatibility is worth abandoning VMware’s rather ridiculous management concepts for a truly free choice.