A Blog

MooseFS Testing

by Alex on Aug.28, 2010, under Purdue

It’s been a while since anything new popped up on here.. But, now I have something interesting to talk about. A few many months ago, I was toying around with Lustre and Gluster. Lustre is the standard parallel file system on many large supercomputers today, and it is an open source project now at Oracle. Gluster can be both or either a parallel or a distributed file system. Lustre has a centralized manager/metadata server whereas Gluster is much more decentralized. One currently lacking bit of Lustre is that redundancy comes from the underlying OS and hardware, which can be both expensive and more difficult to set up. Gluster has a notion about redundancy built into it and does not necessarily rely on anything else.

There are both good and less good aspects to the designs and implementations of Lustre and Gluster. However, it seems like the holy grail in storage is a distributed file system that can both scale its performance easily and take advantage of left over storage on servers without requiring large amounts of effort to add and remove parts of the storage system. If a backing storage piece in Lustre disappears, all files that resided on that backend become unavailable, which is a problem on less than reliable hardware. Glustre can get around this problem, but its configuration is slightly combersome and changing the layout of the storage cluster is fun. These are not necessarily problems in themselves, specially when using dedicated resources however can be problems when using willy-nilly, glued together backing storage.

Given those comments on Lustre and Gluster, there are two distributed file systems I’ve been looking at: Ceph and MooseFS. Ceph is the new kid on the block and is currently undergoing heavy development. I think the designers certainly kept their eye on scalability and redundancy, but getting the file system going has been a pain. The pain mostly focuses around my use of a Linux distribution that is made up of older components (kernel and user land bits). In time, I’ll revisit Ceph and run it through its paces.

Until then, MooseFS is the most recent file system on my chopping block. The other file systems I’ve posted about got tested on a variety of hardware that was sufficiently underpowered. Today, I have more hardware to play with..

I think to understand the performance that I saw with MooseFS, I should say some words about the hardware I used. There are two clusters I used for testing. The first is a 20 node cluster with 4 cores per machine and 8GB of RAM per machine. The storage is a single SATA disk and the network is a 1Gbps drop. There are 5 nodes on each switch and 4x1gbps uplinks from each switch to an aggregation switch. The network is slightly oversubscribed at 5:4, but I don’t believe this was a major problem during testing for this small number of machines.

The next cluster is a 96 node cluster with two cores per machine, 3x 160GB SATA disks in a RAID0, 1Gbps network drop, and every 48 nodes connects to one switch. The two switches are interconnected using a single 10Gbps link.

Testing done on the first cluster was done using 20 iozone instances, one per node, that read and wrote upto 8gb files each at various block sizes. The speed of the file system did not appear to be dependent on block size. That makes sense because MooseFS uses 64MB blocks internally and all operations take place on a whole block. Reads and writes using files smaller than 8GB saw amazing performance, as they all fit into memory cache. At the worse extreme using 8GB files and seeking to random positions inside the files, the performance was equal to the speed of the hard drives doing sequencial reads/writes on 64MB files. This is a potential downside, because iozone may have only wanted a 4KB chunk, but an entire 64MB block was exchanged between client and server leaving the rest to be thrown out.

A distributed file system able to get native hardware speed is ok, but the most important lesson learned what in the way MooseFS gets mounted on a client. MFS uses a FUSE driver on the client to mount the file system. In the past, I’ve had bad luck with these drivers becoming upset and gumming up the works. However, after running tests on the 20 node cluster for an entire weekend and pushing 60TB of data in and out of the FUSE driver from MFS, everything was still functional and no mounts were hung. This is what pushed me to continue testing and push it onto larger hardware.

I’ve not yet completed a comprehensive test using iozne on the 96 node cluster, but after doing some simple tests, I’m encouraged by the results. Running 48 writers, I was able to get: 540MB/s using a MFS goal of 3, 700MB/s using a goal of 2, 1400MB/s using a goal of 1, and 7000MB/s directly to the underlying file system without going through MFS on 48 nodes. The goal in MFS is the number of replicas of a file that gets stored. Replicas are written in parallel, so the worst case performance was 540MB/s * 3 or near the maximum performance I saw using no replicas. There is a discrepancy between the top end performance from MFS and what the underlying disk is able to write. I explain this because the maximum bandwidth between the two halves of the cluster is only 1250MB/s theoretically but that not all operations had to complete between client/servers on different halves, so I saw the interlink bandwidth + whatever local bandwidth was getting used.

The read performance was better than write.. 1520MB/s from 3x replicated files, 1100MB/s from 2x replicated files, 1400MB/s from single copy files, and 7100MB/s from the underlying file systems. If there is more than one copy of a file being read, each reader can pull from different servers. The performance dip from 3x to 2x is probably how the different 64MB blocks were distributed between the halves of the cluster, and I still saw a performance maximum near 1300MB/s, the single link between the two halves.

I saw a few things during these tests. The first and again most encouraging was that the FUSE driver held out and nothing blew up. The second was that the master daemon (the process keeping track of 64MB chunks and related metadata) did a lot of DNS queries, I assume for logging of traffic and house keeping. Overall, nothing terribly surprising came up. I should note that in both tests, the master machine with 4 cores, 8gb of memory, and two disks in a mirror using a single 1gbps uplink. This machine never seemed terribly busy.

The next step I believe will be to increase the internal bandwidth of the cluster and try testing again using the iozone.

Leave a Comment more...

Link Layer Discovery Protocol

by Alex on Feb.27, 2010, under Tinkergeek

Ever wondered exactly where all your network cabling goes? Have you been using Cisco and wished your computers spoke CDP too? Apparently you and everyone else would love for the computers to just say where they connected instead of chasing down network cables by hand. That seems to be the goal of the Link Layer Discovery Protocol (LLDP or 802.11ab). Unlike the Cisco Discovery Protocol (CDP), LLDP is the vender-neutral attempt to get it all right.

There is a LLDP daemon that is published at Luffy.cx that implements this under Linux. There are other daemons out there too, but this showed up first when I searched the Debian repositories for precompiled versions. Simply installing it via apt and starting up the service is enough to get your computers and your network devices discovering themselves. Although, if you’re like me and don’t have fancy new networking gear that supports LLDP, lldpd supports a wide range of other network discovery protocols too.

Once I installed lldpd on all my computers and enabled the CDP option (the -c option when starting up lldpd), I saw the magic happen:

c2950-01#show cdp neighbors
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
                  S - Switch, H - Host, I - IGMP, r - Repeater, P - Phone

Device ID        Local Intrfce     Holdtme    Capability  Platform  Port ID
ospf-01
                 Gig 0/1            107           R       Linux     eth1
home-1
                 Gig 0/2            114                   Linux     eth0
storage
                 Fas 0/15           106                   Linux     eth0
c831
                 Fas 0/8            146           R       C831      Eth 2
c871
                 Fas 0/1            142         R S I     871       Fas 0

To query the neighbors discovered by lldpd on the computer side, lldpctl outputs all the current neighbors:

Interface: eth1
 ChassisID: c2950-01 (local)
 SysName:   c2950-01
 SysDescr:
   cisco WS-C2950G-48-EI running on
   Cisco Internetwork Operating System Software
   IOS (tm) C2950 Software (C2950-I6K2L2Q4-M), Version 12.1(22)EA8a, RELEASE SOFTWARE (fc1)
   Copyright (c) 1986-2006 by cisco Systems, Inc.
   Compiled Fri 28-Jul-06 17:00 by weiliu
 MgmtIP:    172.0.0.0
 Caps:      Bridge(E) 

 PortID:    GigabitEthernet0/1 (ifName)
 PortDescr: GigabitEthernet0/1

To finish up the post, I’ll note that discovery protocols have in the past, and potentially still now, have been susceptible to attacks by flooding devices with too many neighbor relations. Because of this, it might be best to ensure these protocols are disabled on switches ports connected to untrusted machines.

Comments Off more...

Dirvish

by Alex on Feb.08, 2010, under Tinkergeek

Tinkergeek recently moved to its own dedicated server hosted by fdc servers. The transition wasn’t quite as smooth as one would hope, but they do provide cheap hosting. With moving to a dedicated server based around cheap PC hardware, I thought it’d be a great idea to bring back an old backup solution. Dirvish is a neat set of scripts that combine the goodness of hard-links and rsync. The goal is that Dirvish creates a full backup once and then stores just the changes of the target file system on the backup system.

Installing dirvish on a Debian system is fairly easy:

apt-get install dirvish

Then, one merely has to copy the example configuration files into place. The first one is the master configuration file that goes in /etc/dirvish and can be found in /usr/share/doc/dirvish/master.conf.

## Example dirvish master configuration file:
bank:
	/backup
exclude:
	lost+found/
	core
	*~
	.nfs*
Runall:
	root	22:00
expire-default: +15 days
expire-rule:
#       MIN HR    DOM MON       DOW  STRFTIME_FMT
	*   *     *   *         1    +3 months
#	*   *     1-7 *         1    +1 year
#	*   *     1-7 1,4,7,10  1
	*   10-20 *   *         *    +4 days
#	*   *     *   *         2-7  +15 days

Under bank: is going to be the place on your machine that contains all the backups. I’d suggest making this its own file system, as dirvish can eat inodes like there’s no tomorrow. Next, I’d suggest adding /proc and /sys under the global exclude: section just to ensure you don’t back these directories up.

Now, you have to make your first machine directory for backups (known as a vault in dirvish speak). This directory structure will be under the directory in the bank section from above.

mkdir -p /backup/example.com/dirvish

Now, just copy the default.conf example from /usr/share/doc/dirvish/examples into example.com/dirvish and edit.

client: thishost
tree: /
xdev: 1
index: gzip
log: gzip
image-default: %Y%m%d
exclude:
	/var/cache/apt/archives/*.deb
	/var/cache/man/**
	/tmp/**
	/var/tmp/**
        *.bak

The key points in this file file are the client, xdev, and exclude directives. Merely change your client: to be the machine IP or hostname that you’re backing up. Xdev tells rsync to traverse file systems on the machine; generally, you’ll want to be careful with this setting, specially if you mount NFS shares. Lastly, update the exclude list for this particular machine. If you’re backing up the local machine using dirvish, be sure to exclude the dirvish bank directory!

Now, you’re all set to create the first backup with:

dirvish --vault example.com --init

If all goes well, you’ll have a new directory under /backup/example.com with the current data and a copy of the target. If there was an error, be sure to remove the failed backup attempt from /backup/example.com and rerun the dirvish command after fixing the error.

Now, the only thing left is to run dirvish-runall via cron at some convenient time and you’re on your way to having a decent backup solution. Besure to read the remainder of the dirvish documentation to pick up the finer points of configuration.

Comments Off more...

Quick Howto on Lustre 1.8

by Alex on Nov.24, 2009, under Purdue

Everybody likes fast file performance and recently I’ve been twiddling with different distributed/parallel/clustered file systems for fun and excitement. Tonight was Lustre’s turn to be toyed with. Below is how I got a small/slow Lustre going on my laptop using VirtualBox.

First, install CentOS 5 on some VMs, perhaps three? Then download the RPM packages from Sun/Lustre.org. On the servers:

rpm -ivh kernel-lustre*
rpm -ivh lustre-1.8.1.1* lustre-ldiskfs* lustre-modules* e2fsprogs*

On the clients, install the appropriate kernel package for the patch-less client and then the lustre-client packages:

rpm -ivh --force kernel-2.6.18-128.7.1.el5.i686.rpm
rpm -ivh lustre-client-1.8.1.1-2.6.18_128.7.1.el5_lustre.1.8.1.1.i686.rpm lustre-client-modules-1.8.1.1-2.6.18_128.7.1.el5_lustre.1.8.1.1.i686.rpm

Now, specially if your VMs have one network for the outside world through NAT and one network for inter-VM communication, you should add the following line to /etc/modprobe.conf to make Lustre find and use the correct interface:

options lnet networks=tcp0(eth1)

Now it’s time to set up the meta-data and management servers:

mkfs.lustre --reformat --device-size=250000 --fsname lustre --mdt --mgs /tmp/mdt
mkdir -p /lustre/mds
mount -t lustre -o loop /tmp/mdt /lustre/mds

Now it’s time to set up the servers and storage targets (in this simple example, there is only one server per target). Run this on all the storage servers:

mkfs.lustre --reformat --device-size=10000000 --fsname lustre --ost --mgsnode=@tcp0 /tmp/ost
mkdir -p /lustre/ost
mount -t lustre -o loop /tmp/ost /lustre/ost

After the client has finished rebooting into it proper kernel, mounting the file system is straight forward:

mount -t lustre @tcp0:/lustre /mnt

Of course, now it’s time to write files into our file system to see that it really works:

dd if=/dev/zero of=/mnt/foo bs=1M count=100
lfs getstripe /mnt/*

At this point, if you have multiple storage targets, then you’ll see that your large file got written to just a single target. That’s sad, since we were hoping for super-ultra-fast file I/O from our VMs running Lustre. Thankfully, this can be easily fix:

mkdir /mnt/super-fast
lfs setstripe -c 2 /mnt/super-fast
dd if=/dev/zero of=/mnt/super-fast/bar bs=1M count=100
lfs getstripe /mnt/super-fast/*

Aaah, there we go. You should have gotten “twice” the performance in MB/s reported from dd and you should see that now your large file of zero’s got written out to two different targets. Of course, the Lustre manual has all sorts of useful tuning suggestions and things to try to get working (like shared targets and proper failover). Many things to try, but I just thought I should document all this in a place that I could later find it easily. As for where I got this, I mostly found quick bits at this blog and the stranger stuff in the Lustre 1.8 manual when I got stuck on my network problems.

Comments Off more...

Super Computing 2009 – Final Day

by Alex on Nov.20, 2009, under Purdue, Tinkergeek

So, I didn’t have much of a chance to update the blog with blow-by-blow action updates at SC this year.. Oops. However, the show was a blast. The Purdue team won the “lowest power consumption” award. It was certainly a surprise, given another team had a generally more efficient power design. In any case, we’re certainly happy we won an award.

I did not get much of a chance to walk around the floor between working both of Purdue’s booths. However, I did hear about and see one or two interesting things. The first was the new ethernet gear coming out of Voltaire. They can confederate several of their switch chassises to form a virtual switch that allows an LACP bond between the chassises to be active-active, which seems to be all the rage these days. After visiting the Voltaire booth, Cisco also has various solutions.. from the virtual port channel on the nexus gear and the virtual switch chassis on the catalyst gear. Also, it appears that perhaps Cisco will be implementing something like Woven’s multi-path layer 2 magic, so that could be fun too.

Lastly, the most interesting company on the floor that I saw was the “Two Guys and a Cluster” folks. Really, it’s just two guys that live on opposite coasts that help people acquire and run clusters. They seem to be a nice middle ground between a value-added cluster vendor and just buying hardware off a website and doing it all yourself. Their web address, with the start of an interesting blog, can be found here.

Comments Off : more...

Super Computing 2009 – Day 2

by Alex on Nov.16, 2009, under Purdue

The day started early with the team wondering to the convention center to start validating all our applications. We found that the center’s power dropped voltage under load enough to cause us to overshoot our current problems badly. After taking account of all our hardware, we removed some memory and turned off our spare hardware which solved our problems.

At the end of the day, everything was good to go for the morning when we start running our benchmark for the judging. I’ll leave you with a picture of our booth this year:

sc09 booth photo

Comments Off : more...

Super Computing 2009 – Day 1

by Alex on Nov.16, 2009, under Purdue

After getting up at 3AM eastern time to make our flight at 6AM, the Purdue Student Cluster Challenge team made it to Portland, Oregon, on time and not terribly tired. The challenge organizers had our box sitting in our booth and we took no time in starting to unbox it.

For now, here is a link that talks about the challenge for this year: InsideHPC.

Comments Off : more...

Transparent Squid Proxy using WCCP

by Alex on Oct.13, 2009, under Tinkergeek

Using a web caching proxy can help save bandwidth and provide a good log of web traffic. At the house, I have a slow-ish DSL link to the world and would certainly love to have video not freeze when also browsing the web. And as a friend points out, sometimes it’s easier to have a log of traffic when digging into problems. Although, my one biggest pet peeve is having to manually reconfigure my laptop when moving between campus and home. OSX does have the concept of “network locations,” but usually half the time I forget to change it. So, the only solution to this problem is a transparent web-caching proxy server. Thankfully, I run a server in the basement for fun and excitement anyway.
First things first, one needs to set up the “Web Cache Control Protocol” on the router. At home, I use a Cisco 871, so I issued these changes:

config terminal

ip wccp web-cache
interface vlan1000
  ip wccp web-cache redirect in

exit

This sets up WCCP and applies it to the virtual network my wireless network uses. Next, we need to set up the Ubuntu box with Squid, thankfully that’s pretty easy. “apt-get install squid” is good enough to get the package. Then we need to ensure these lines are in squid.conf, the rest of the default options are probably good enough for a first go:

wccp2_router IP_OF_ROUTER
wccp_version 4
wccp2_forwarding_method 1
wccp2_return_method 1
wccp2_service standard 0
http_port 3128 transparent
acl our_networks src WORKSTATION_SUBNET/24
http_access allow our_networks

So, at this point, the router is using WCCP and our server has Squid going. Now, we just need some magic to tie the two together. This is done using a magical combination of iptables and a GRE tunnel. First things first, we need to load the “ip_gre” kernel module on boot: “echo ip_gre >> /etc/modules” should do the trick. For now, “modprobe ip_gre” is good.
Then we need to construct the tunnel by issuing the following commands:

/sbin/ip link set wccp1 mtu 1476
/sbin/ip tunnel add wccp1 mode gre remote ROUTER_IP local MACHINE_IP dev ETHERNET_DEVICE
/sbin/ip addr add MACHINE_IP dev wccp1
/sbin/ip link set wccp1 up
/sbin/sysctl -w net.ipv4.conf.wccp1.rp_filter=0
/sbin/sysctl -w net.ipv4.conf.ETHERNET_DEVICE.rp_filter=0
/sbin/iptables-restore < /etc/default/iptables

This constructs the tunnel and brings it up. At the house, the server has a separate connection on a separate virtual network from everything else. WCCP should handle the situation where your web cache is in the subnet whose traffic is getting redirected, but I wanted to ensure there weren't going to be any difficulties.
Now, the final piece of the puzzle are the iptables rules that search for traffic coming down the tunnel and redirects them to Squid:

# /etc/default/iptables
# Allow in everything, from everywhere
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
COMMIT

*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
# Reroute HTTP requests to the proxy server
-A PREROUTING -i wccp1 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 3128
-A PREROUTING -i wccp1 -p tcp -m tcp --dport 8000 -j REDIRECT --to-ports 3128
-A PREROUTING -i wccp1 -p tcp -m tcp --dport 8080 -j REDIRECT --to-ports 3128
COMMIT

Once all of this is in place and running, you should be able to see web traffic get cached and logged into /var/log/squid3/access.log. From my logs after several days, I'm seeing the cache serve about 4% of my traffic. Given the amount of dynamic content on the Internet today, that's not very surprising. What's it worth all this effort to get going? Probably not, but it was sure fun.
I need to point out that most of the content of this post can be found on the Squid site and wiki, and that community should receive any positive credit for their fine work.

Comments Off : more...

GlusterFS

by Alex on Oct.13, 2009, under Purdue

At work we’ve been evaluating different distributed file systems in our spare time. Currently, we use one large, centralized filer and have seen problems being able to push as many input/output operations through as we’d like. While that’s mostly a backend disk problem, wouldn’t it be great to have a storage system that grew as we added more cluster nodes?
In that hope, we tested some pretty alpha-level pNFS code, some Hadoop, and now GlusterFS. All these seems to have some faults, but this is what I found for Gluster…
Downloading the RPMs from the main FTP repository and installing them was pretty painless on RHEL5. The documentation is pretty spare and misleading, but eventually whipping up these config files made it all go:

#glusterfsd.vol
volume posix
  type storage/posix
  option directory /glusterfs
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow *
  subvolumes brick
end-volume

and

volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host foobar-0.example.com
  option remote-subvolume brick
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host foobar-2.example.com
  option remote-subvolume brick
end-volume

volume remote3
  type protocol/client
  option transport-type tcp
  option remote-host pfnstest-003.example.com
  option remote-subvolume brick
end-volume

volume remote4
  type protocol/client
  option transport-type tcp
  option remote-host foobar-4.example.com
  option remote-subvolume brick
end-volume

volume remote5
  type protocol/client
  option transport-type tcp
  option remote-host foobar-5.example.com
  option remote-subvolume brick
end-volume

volume remote6
  type protocol/client
  option transport-type tcp
  option remote-host foobar-6.example.com
  option remote-subvolume brick
end-volume

volume replicate1
  type cluster/replicate
  subvolumes remote1 remote2 remote3
end-volume

volume replicate2
  type cluster/replicate
  subvolumes remote4 remote5 remote6
end-volume

volume distribute
  type cluster/distribute
  subvolumes replicate1 replicate2
end-volume

volume writebehind
  type performance/write-behind
  option window-size 4MB
  subvolumes distribute
end-volume

volume cache
  type performance/io-cache
  option cache-size 1024MB
  subvolumes writebehind
end-volume

The backing storage for this GlusterFS was an Ext3 file system carved out of LVM and housed on HP SATA disk trays. Mounting up that file system and running some simplistic tests, I found that using large file sizes the file system performance was about at the maximum network speed and that using small file sizes the performance was in the 5-10MB/s range. Not bad for an hour or two’s worth of effect.

3 Comments : more...

Another Install Day

by Alex on Jul.19, 2009, under Purdue

It appears our last cluster was such a hit that once again my department at Purdue is going to do a single day, multi-hundred node cluster installation. Last year, the work went very quickly and we had the cluster racked before lunch. This year, we asked for fewer volunteers and added an extra step in the process, installing 10GigE network adapters. Hopefully that will slow things down enough so the VIPs can actually see us working.

As always, the marketing folks put together a short little clip to push the day:

The racking and stacking is pretty much just a matter of man power. Getting the software onto the systems is done using RHEL’s Anaconda and several scripts. All the magic was described by some of my coworkers in a paper for the USENIX’s LISA 2008 Conference.. A copy can be found here. It’s pretty fun stuff.

Comments Off :, more...

Looking for something?

Use the form below to search the site:

Still not finding what you're looking for? Drop a comment on a post or contact us so we can take care of it!

Visit our friends!

A few highly recommended friends...