Running Lustre on top of a Ceph’s RBD
by Alex on Feb.05, 2011, under Tinkergeek
Lustre is a popular parallel file system. It features one nagging problem, it requires reliable underlying disk. Ceph is a new distributed file system that offers redundancy and scalability. Ceph also can deliver a redundant block device. While Ceph has a POSIX file system interface, it requires a Linux kernel 2.6.27 or greater. RHEL5 only has a kernel of version 2.6.18, more or less, but is supported by Lustre. The crazy idea came about to try getting Lustre running on top of Ceph’s RBD layer and export cheap disk to RHEL5 clients.
As an exercise in excitement, let’s try making it happen.. For this endeavor, we will need four machines:
- Ceph-1: Ceph Monitor and OSD running Debian Squeeze with a btrfs file system
- Ceph-2: Ceph Metadata server and OSD running Debian Squeeze with a btrfs file system
- Bits: Ceph client and iSCSI target running Debian Squeeze
- Lustre: Lustre MGS, MDS, OSS server running CentOS 5
Go ahead and install the base OS’s on each, putting the btrfs file systems on ceph-* at /ceph.
Now, let’s install the Ceph server pieces on both ceph-1 and ceph-2 at the same time:
* Handy URLs: Ceph’s Debian Repositories, Basic Ceph Debian HOWTO
cat> /etc/apt/sources.list deb http://ceph.newdream.net/debian/ squeeze ceph-unstable deb-src http://ceph.newdream.net/debian/ squeeze ceph-unstable EOF apt-get update apt-get install -y bzip2 ceph linux-headers-$(uname -r) build-essential ssh-keygen -t dsa scp /root/.ssh/id_dsa.pub ceph-2:/root/.ssh/authorized_keys ceph-1: mkdir -p /ceph/mon0 ceph-1: mkdir /ceph/osd0 ceph-2: mkdir /ceph/mon1 cat /etc/ceph/ceph.conf [global] pid file = /var/run/ceph/$name.pid [mon] mon data = /ceph/mon$id [mon0] host = ceph-1 mon addr = 172.18.8.100:6789 [mds] [mds0] host = ceph-2 [osd] osd data = /ceph/osd$id osd journal = /ceph/osd$id/journal osd journal size = 512 [osd0] host = ceph-1 [osd1] host = ceph-2 EOF mkcephfs -c /etc/ceph/ceph.conf --allhosts -v -k /etc/ceph/keyring.bin /etc/init.d/ceph -a start cclass -a ceph class activate rbd 1.3 rbd create lustre --size 60000 rbd list
Now let’s install a Ceph-enabled version of the Linux kernel and setup the iSCSI target drvier on bits:
* Handy URLs: Making a custom, vanilla kernel for Debian, SCST iSCSI Howto
cd /usr/src git clone git://ceph.newdream.net/git/ceph-client.git cd ceph-client git checkout -b unstable origin/unstable cp /boot/config-2.6.32-5-amd64 . make menuconfig # enable libceph, ceph, and rbd make-kpkg clean export CONCURRENCY_LEVEL=2 make-kpkg --rootcmd fakeroot --initrd --revision=custom.001 kernel_image kernel_headers cd .. dpkg -i *.deb reboot echo "`hostname ceph-1` name=admin rbd lustre" > /sys/bus/rbd/add svn co https://scst.svn.sourceforge.net/svnroot/scst/trunk cd trunk make scst scst_install iscsi iscsi_install scstadm scstadm_install cat/etc/scst.conf HANDLER vdisk_fileio { DEVICE disk01 { filename /dev/rbd0 nv_cache 1 } } TARGET_DRIVER iscsi { enabled 1 TARGET iqn.2006-10.net.vlnb:tgt { LUN 0 disk01 enabled 1 } } EOF modprobe scst modprobe scst_vdisk modprobe iscsi-scst iscsi-scstd scstadmin -config /etc/scst.conf
Now, we have a block device and are exporting it through iSCSI. Let’s set up the Lustre server on lustre:
* Handly URLs: My blog post about Lustre 1.8, Configuring an iSCSI client on RHEL
wget "lustre binaries from Oracle's lustre.org" rpm -ivh *.rpm reboot yum install iscsi-initiator-utils service iscsi start iscsiadm -m discovery -t sendtargets -p bits service iscsi restart fdisk -l mkfs.lustre --reformat --device-size=250000 --fsname lustre --mdt --mgs /tmp/mdt mkdir -p /lustre/mds mount -t lustre -o loop /tmp/mdt /lustre/mds mkfs.lustre --reformat --fsname lustre --ost --mgsnode=lustre /dev/sda mkdir -p /lustre/ost mount -t lustre -o loop /tmp/ost /lustre/ost
Viola! At this point you should be able to mount your Lustre file system on your Lustre clients.. I performed my work on a bunch of VMs with already slow underlying disk, so things went even considerably more slowly after I mounted Lustre on a file system.
Some things that need to be thought about.. The number of replicas and the striping strategy for the block devices will impact performance on the Ceph pool, by default these settings are 4MB blocks with no striping. As for the iSCSI layer, tuning here for performance will be important as well as exporting the same LUN from multiple iSCSI targets and doing active-passive failover for the iSCSI LUNs. As long as Ceph does not loose too many replicas and the mds/mon services are still operational, your block device should not go away. As well, configuring redundancy between Lustre OSS’s will be important for redundancy.
At the time of this post, Ceph required Linux kernel 2.6.37 or newer and Lustre did not support anything newer than 2.6.32. In the future, it will probably be prudent to eliminate the iSCSI translation layer and access the block devices directly from the Lustre OSS’s. Also, Ceph is not production-ready, so there is that small bit to come about.
MooseFS Testing Complete
by Alex on Dec.24, 2010, under Purdue
It has been many months since I began looking at MooseFS. Testing the file system using the same machines for clients and servers was interesting, with some of the results you can find in the last post. During the past few months, my coworkers and I moved the machines around and redesigned the test. What we were looking for was stability, scalability, and to see how a truly distributed file system worked using a bunch of machines.
The cluster was re-architected. In the end, we kept 96 Dell GX745′s with three 160GB SATA disks in the storage cluster. All the disks in a machine were in a RAID0 stripe. This provided MooseFS with 96 ~400GB volumes for a total storage capacity of around 35TB. Each machine connected to a central Cisco 6500 switch using a 1Gbps connection. The switch had four 10Gbps connections into the core of our network on campus. The clients were taken from a random sampling of machines on our network, most clients were AMD-based systems that each had a 10Gbps connection to their resource-level networks, with resources connecting to the core of the network at 20Gbps or 40Gbps.
During testing, we lost a total of 13 machines due to hardware failure related to both the disks and other general component failure. No data was lost because of the problems. Testing ran for two months at full-tilt. We did not use a redundant master setup, but we also did not experience a master failure during the testing. Overall, hosting a stable file system on top of 96 machines past their prime is an impressive feat. I remarked in the last post about the file system driver being stable, and to reiterate here, we did not see any client mounts wedge themselves during testing.
The file system performed remarkably well given the type of hardware making up the test system. The highest sustained performance mark was 30Gbps reads and 5Gbps writes to clients. We had clients sustaining over 1Gbps traffic to/from the cluster. MooseFS did not perform in the same ways a Lustre parallel file system would, but that was not expected. The theoretical performance of the underlying disks seems to suggest we could sustain more bandwidth than we saw, but our particular mix of reads and writes probably caused the difference in achieved versus measured performance. Overall, this is admirable performance.
The last category of information we hoped to glean from testing is what impact running a distributed system would exact from the system hosting it. Definitely running the storage backends full-tilt will produce a fair amount of additional system load. At the height of testing, CPU load on any given backend ranged from 20% to 30%. The MooseFS chunk server daemon used approximately 500MB of memory and the system dedicated the remaining amount of memory to file cache. The storage machines moved constantly between 200Mbps and 400Mbps of traffic a piece. This an appreciable amount of additional load placed on machines, if one wanted to mix storage providers and consumers then some care should be taken to characterize the impact on the machine’s primary function.
As for the general maintainability of the system, no major problems occurred. System failures were handled automatically and the system was very responsive if a machine was marked administratively offline, it quickly re-balance and re-replicated the necessary blocks. As well, adding machines back into the cluster caused a rebalancing of blocks to occur. This process was seamless and after the initial replication phase, the additional machines quickly added capacity to the system. The MooseFS status webpages were very handy during testing, there is no doubt that they provide a wealth of performance information.
After two months of poking and prodding the system, we are quite impressed. Integrating a distributed file system into our environment will not be easy if we choose to integrate the file system into machines on a secondary function basis, but that will be another project for down the road.
MooseFS Testing
by Alex on Aug.28, 2010, under Purdue
It’s been a while since anything new popped up on here.. But, now I have something interesting to talk about. A few many months ago, I was toying around with Lustre and Gluster. Lustre is the standard parallel file system on many large supercomputers today, and it is an open source project now at Oracle. Gluster can be both or either a parallel or a distributed file system. Lustre has a centralized manager/metadata server whereas Gluster is much more decentralized. One currently lacking bit of Lustre is that redundancy comes from the underlying OS and hardware, which can be both expensive and more difficult to set up. Gluster has a notion about redundancy built into it and does not necessarily rely on anything else.
There are both good and less good aspects to the designs and implementations of Lustre and Gluster. However, it seems like the holy grail in storage is a distributed file system that can both scale its performance easily and take advantage of left over storage on servers without requiring large amounts of effort to add and remove parts of the storage system. If a backing storage piece in Lustre disappears, all files that resided on that backend become unavailable, which is a problem on less than reliable hardware. Glustre can get around this problem, but its configuration is slightly combersome and changing the layout of the storage cluster is fun. These are not necessarily problems in themselves, specially when using dedicated resources however can be problems when using willy-nilly, glued together backing storage.
Given those comments on Lustre and Gluster, there are two distributed file systems I’ve been looking at: Ceph and MooseFS. Ceph is the new kid on the block and is currently undergoing heavy development. I think the designers certainly kept their eye on scalability and redundancy, but getting the file system going has been a pain. The pain mostly focuses around my use of a Linux distribution that is made up of older components (kernel and user land bits). In time, I’ll revisit Ceph and run it through its paces.
Until then, MooseFS is the most recent file system on my chopping block. The other file systems I’ve posted about got tested on a variety of hardware that was sufficiently underpowered. Today, I have more hardware to play with..
I think to understand the performance that I saw with MooseFS, I should say some words about the hardware I used. There are two clusters I used for testing. The first is a 20 node cluster with 4 cores per machine and 8GB of RAM per machine. The storage is a single SATA disk and the network is a 1Gbps drop. There are 5 nodes on each switch and 4x1gbps uplinks from each switch to an aggregation switch. The network is slightly oversubscribed at 5:4, but I don’t believe this was a major problem during testing for this small number of machines.
The next cluster is a 96 node cluster with two cores per machine, 3x 160GB SATA disks in a RAID0, 1Gbps network drop, and every 48 nodes connects to one switch. The two switches are interconnected using a single 10Gbps link.
Testing done on the first cluster was done using 20 iozone instances, one per node, that read and wrote upto 8gb files each at various block sizes. The speed of the file system did not appear to be dependent on block size. That makes sense because MooseFS uses 64MB blocks internally and all operations take place on a whole block. Reads and writes using files smaller than 8GB saw amazing performance, as they all fit into memory cache. At the worse extreme using 8GB files and seeking to random positions inside the files, the performance was equal to the speed of the hard drives doing sequencial reads/writes on 64MB files. This is a potential downside, because iozone may have only wanted a 4KB chunk, but an entire 64MB block was exchanged between client and server leaving the rest to be thrown out.
A distributed file system able to get native hardware speed is ok, but the most important lesson learned what in the way MooseFS gets mounted on a client. MFS uses a FUSE driver on the client to mount the file system. In the past, I’ve had bad luck with these drivers becoming upset and gumming up the works. However, after running tests on the 20 node cluster for an entire weekend and pushing 60TB of data in and out of the FUSE driver from MFS, everything was still functional and no mounts were hung. This is what pushed me to continue testing and push it onto larger hardware.
I’ve not yet completed a comprehensive test using iozne on the 96 node cluster, but after doing some simple tests, I’m encouraged by the results. Running 48 writers, I was able to get: 540MB/s using a MFS goal of 3, 700MB/s using a goal of 2, 1400MB/s using a goal of 1, and 7000MB/s directly to the underlying file system without going through MFS on 48 nodes. The goal in MFS is the number of replicas of a file that gets stored. Replicas are written in parallel, so the worst case performance was 540MB/s * 3 or near the maximum performance I saw using no replicas. There is a discrepancy between the top end performance from MFS and what the underlying disk is able to write. I explain this because the maximum bandwidth between the two halves of the cluster is only 1250MB/s theoretically but that not all operations had to complete between client/servers on different halves, so I saw the interlink bandwidth + whatever local bandwidth was getting used.
The read performance was better than write.. 1520MB/s from 3x replicated files, 1100MB/s from 2x replicated files, 1400MB/s from single copy files, and 7100MB/s from the underlying file systems. If there is more than one copy of a file being read, each reader can pull from different servers. The performance dip from 3x to 2x is probably how the different 64MB blocks were distributed between the halves of the cluster, and I still saw a performance maximum near 1300MB/s, the single link between the two halves.
I saw a few things during these tests. The first and again most encouraging was that the FUSE driver held out and nothing blew up. The second was that the master daemon (the process keeping track of 64MB chunks and related metadata) did a lot of DNS queries, I assume for logging of traffic and house keeping. Overall, nothing terribly surprising came up. I should note that in both tests, the master machine with 4 cores, 8gb of memory, and two disks in a mirror using a single 1gbps uplink. This machine never seemed terribly busy.
The next step I believe will be to increase the internal bandwidth of the cluster and try testing again using the iozone.
Link Layer Discovery Protocol
by Alex on Feb.27, 2010, under Tinkergeek
Ever wondered exactly where all your network cabling goes? Have you been using Cisco and wished your computers spoke CDP too? Apparently you and everyone else would love for the computers to just say where they connected instead of chasing down network cables by hand. That seems to be the goal of the Link Layer Discovery Protocol (LLDP or 802.11ab). Unlike the Cisco Discovery Protocol (CDP), LLDP is the vender-neutral attempt to get it all right.
There is a LLDP daemon that is published at Luffy.cx that implements this under Linux. There are other daemons out there too, but this showed up first when I searched the Debian repositories for precompiled versions. Simply installing it via apt and starting up the service is enough to get your computers and your network devices discovering themselves. Although, if you’re like me and don’t have fancy new networking gear that supports LLDP, lldpd supports a wide range of other network discovery protocols too.
Once I installed lldpd on all my computers and enabled the CDP option (the -c option when starting up lldpd), I saw the magic happen:
c2950-01#show cdp neighbors
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
S - Switch, H - Host, I - IGMP, r - Repeater, P - Phone
Device ID Local Intrfce Holdtme Capability Platform Port ID
ospf-01
Gig 0/1 107 R Linux eth1
home-1
Gig 0/2 114 Linux eth0
storage
Fas 0/15 106 Linux eth0
c831
Fas 0/8 146 R C831 Eth 2
c871
Fas 0/1 142 R S I 871 Fas 0
To query the neighbors discovered by lldpd on the computer side, lldpctl outputs all the current neighbors:
Interface: eth1 ChassisID: c2950-01 (local) SysName: c2950-01 SysDescr: cisco WS-C2950G-48-EI running on Cisco Internetwork Operating System Software IOS (tm) C2950 Software (C2950-I6K2L2Q4-M), Version 12.1(22)EA8a, RELEASE SOFTWARE (fc1) Copyright (c) 1986-2006 by cisco Systems, Inc. Compiled Fri 28-Jul-06 17:00 by weiliu MgmtIP: 172.0.0.0 Caps: Bridge(E) PortID: GigabitEthernet0/1 (ifName) PortDescr: GigabitEthernet0/1
To finish up the post, I’ll note that discovery protocols have in the past, and potentially still now, have been susceptible to attacks by flooding devices with too many neighbor relations. Because of this, it might be best to ensure these protocols are disabled on switches ports connected to untrusted machines.
Dirvish
by Alex on Feb.08, 2010, under Tinkergeek
Tinkergeek recently moved to its own dedicated server hosted by fdc servers. The transition wasn’t quite as smooth as one would hope, but they do provide cheap hosting. With moving to a dedicated server based around cheap PC hardware, I thought it’d be a great idea to bring back an old backup solution. Dirvish is a neat set of scripts that combine the goodness of hard-links and rsync. The goal is that Dirvish creates a full backup once and then stores just the changes of the target file system on the backup system.
Installing dirvish on a Debian system is fairly easy:
apt-get install dirvish
Then, one merely has to copy the example configuration files into place. The first one is the master configuration file that goes in /etc/dirvish and can be found in /usr/share/doc/dirvish/master.conf.
## Example dirvish master configuration file: bank: /backup exclude: lost+found/ core *~ .nfs* Runall: root 22:00 expire-default: +15 days expire-rule: # MIN HR DOM MON DOW STRFTIME_FMT * * * * 1 +3 months # * * 1-7 * 1 +1 year # * * 1-7 1,4,7,10 1 * 10-20 * * * +4 days # * * * * 2-7 +15 days
Under bank: is going to be the place on your machine that contains all the backups. I’d suggest making this its own file system, as dirvish can eat inodes like there’s no tomorrow. Next, I’d suggest adding /proc and /sys under the global exclude: section just to ensure you don’t back these directories up.
Now, you have to make your first machine directory for backups (known as a vault in dirvish speak). This directory structure will be under the directory in the bank section from above.
mkdir -p /backup/example.com/dirvish
Now, just copy the default.conf example from /usr/share/doc/dirvish/examples into example.com/dirvish and edit.
client: thishost
tree: /
xdev: 1
index: gzip
log: gzip
image-default: %Y%m%d
exclude:
/var/cache/apt/archives/*.deb
/var/cache/man/**
/tmp/**
/var/tmp/**
*.bak
The key points in this file file are the client, xdev, and exclude directives. Merely change your client: to be the machine IP or hostname that you’re backing up. Xdev tells rsync to traverse file systems on the machine; generally, you’ll want to be careful with this setting, specially if you mount NFS shares. Lastly, update the exclude list for this particular machine. If you’re backing up the local machine using dirvish, be sure to exclude the dirvish bank directory!
Now, you’re all set to create the first backup with:
dirvish --vault example.com --init
If all goes well, you’ll have a new directory under /backup/example.com with the current data and a copy of the target. If there was an error, be sure to remove the failed backup attempt from /backup/example.com and rerun the dirvish command after fixing the error.
Now, the only thing left is to run dirvish-runall via cron at some convenient time and you’re on your way to having a decent backup solution. Besure to read the remainder of the dirvish documentation to pick up the finer points of configuration.
Quick Howto on Lustre 1.8
by Alex on Nov.24, 2009, under Purdue
Everybody likes fast file performance and recently I’ve been twiddling with different distributed/parallel/clustered file systems for fun and excitement. Tonight was Lustre’s turn to be toyed with. Below is how I got a small/slow Lustre going on my laptop using VirtualBox.
First, install CentOS 5 on some VMs, perhaps three? Then download the RPM packages from Sun/Lustre.org. On the servers:
rpm -ivh kernel-lustre* rpm -ivh lustre-1.8.1.1* lustre-ldiskfs* lustre-modules* e2fsprogs*
On the clients, install the appropriate kernel package for the patch-less client and then the lustre-client packages:
rpm -ivh --force kernel-2.6.18-128.7.1.el5.i686.rpm rpm -ivh lustre-client-1.8.1.1-2.6.18_128.7.1.el5_lustre.1.8.1.1.i686.rpm lustre-client-modules-1.8.1.1-2.6.18_128.7.1.el5_lustre.1.8.1.1.i686.rpm
Now, specially if your VMs have one network for the outside world through NAT and one network for inter-VM communication, you should add the following line to /etc/modprobe.conf to make Lustre find and use the correct interface:
options lnet networks=tcp0(eth1)
Now it’s time to set up the meta-data and management servers:
mkfs.lustre --reformat --device-size=250000 --fsname lustre --mdt --mgs /tmp/mdt mkdir -p /lustre/mds mount -t lustre -o loop /tmp/mdt /lustre/mds
Now it’s time to set up the servers and storage targets (in this simple example, there is only one server per target). Run this on all the storage servers:
mkfs.lustre --reformat --device-size=10000000 --fsname lustre --ost --mgsnode=@tcp0 /tmp/ost mkdir -p /lustre/ost mount -t lustre -o loop /tmp/ost /lustre/ost
After the client has finished rebooting into it proper kernel, mounting the file system is straight forward:
mount -t lustre@tcp0:/lustre /mnt
Of course, now it’s time to write files into our file system to see that it really works:
dd if=/dev/zero of=/mnt/foo bs=1M count=100 lfs getstripe /mnt/*
At this point, if you have multiple storage targets, then you’ll see that your large file got written to just a single target. That’s sad, since we were hoping for super-ultra-fast file I/O from our VMs running Lustre. Thankfully, this can be easily fix:
mkdir /mnt/super-fast lfs setstripe -c 2 /mnt/super-fast dd if=/dev/zero of=/mnt/super-fast/bar bs=1M count=100 lfs getstripe /mnt/super-fast/*
Aaah, there we go. You should have gotten “twice” the performance in MB/s reported from dd and you should see that now your large file of zero’s got written out to two different targets. Of course, the Lustre manual has all sorts of useful tuning suggestions and things to try to get working (like shared targets and proper failover). Many things to try, but I just thought I should document all this in a place that I could later find it easily. As for where I got this, I mostly found quick bits at this blog and the stranger stuff in the Lustre 1.8 manual when I got stuck on my network problems.
Super Computing 2009 – Final Day
by Alex on Nov.20, 2009, under Purdue, Tinkergeek
So, I didn’t have much of a chance to update the blog with blow-by-blow action updates at SC this year.. Oops. However, the show was a blast. The Purdue team won the “lowest power consumption” award. It was certainly a surprise, given another team had a generally more efficient power design. In any case, we’re certainly happy we won an award.
I did not get much of a chance to walk around the floor between working both of Purdue’s booths. However, I did hear about and see one or two interesting things. The first was the new ethernet gear coming out of Voltaire. They can confederate several of their switch chassises to form a virtual switch that allows an LACP bond between the chassises to be active-active, which seems to be all the rage these days. After visiting the Voltaire booth, Cisco also has various solutions.. from the virtual port channel on the nexus gear and the virtual switch chassis on the catalyst gear. Also, it appears that perhaps Cisco will be implementing something like Woven’s multi-path layer 2 magic, so that could be fun too.
Lastly, the most interesting company on the floor that I saw was the “Two Guys and a Cluster” folks. Really, it’s just two guys that live on opposite coasts that help people acquire and run clusters. They seem to be a nice middle ground between a value-added cluster vendor and just buying hardware off a website and doing it all yourself. Their web address, with the start of an interesting blog, can be found here.
Super Computing 2009 – Day 2
by Alex on Nov.16, 2009, under Purdue
The day started early with the team wondering to the convention center to start validating all our applications. We found that the center’s power dropped voltage under load enough to cause us to overshoot our current problems badly. After taking account of all our hardware, we removed some memory and turned off our spare hardware which solved our problems.
At the end of the day, everything was good to go for the morning when we start running our benchmark for the judging. I’ll leave you with a picture of our booth this year:

Super Computing 2009 – Day 1
by Alex on Nov.16, 2009, under Purdue
After getting up at 3AM eastern time to make our flight at 6AM, the Purdue Student Cluster Challenge team made it to Portland, Oregon, on time and not terribly tired. The challenge organizers had our box sitting in our booth and we took no time in starting to unbox it.
For now, here is a link that talks about the challenge for this year: InsideHPC.
Transparent Squid Proxy using WCCP
by Alex on Oct.13, 2009, under Tinkergeek
Using a web caching proxy can help save bandwidth and provide a good log of web traffic. At the house, I have a slow-ish DSL link to the world and would certainly love to have video not freeze when also browsing the web. And as a friend points out, sometimes it’s easier to have a log of traffic when digging into problems. Although, my one biggest pet peeve is having to manually reconfigure my laptop when moving between campus and home. OSX does have the concept of “network locations,” but usually half the time I forget to change it. So, the only solution to this problem is a transparent web-caching proxy server. Thankfully, I run a server in the basement for fun and excitement anyway.
First things first, one needs to set up the “Web Cache Control Protocol” on the router. At home, I use a Cisco 871, so I issued these changes:
config terminal ip wccp web-cache interface vlan1000 ip wccp web-cache redirect in exit
This sets up WCCP and applies it to the virtual network my wireless network uses. Next, we need to set up the Ubuntu box with Squid, thankfully that’s pretty easy. “apt-get install squid” is good enough to get the package. Then we need to ensure these lines are in squid.conf, the rest of the default options are probably good enough for a first go:
wccp2_router IP_OF_ROUTER wccp_version 4 wccp2_forwarding_method 1 wccp2_return_method 1 wccp2_service standard 0 http_port 3128 transparent acl our_networks src WORKSTATION_SUBNET/24 http_access allow our_networks
So, at this point, the router is using WCCP and our server has Squid going. Now, we just need some magic to tie the two together. This is done using a magical combination of iptables and a GRE tunnel. First things first, we need to load the “ip_gre” kernel module on boot: “echo ip_gre >> /etc/modules” should do the trick. For now, “modprobe ip_gre” is good.
Then we need to construct the tunnel by issuing the following commands:
/sbin/ip link set wccp1 mtu 1476 /sbin/ip tunnel add wccp1 mode gre remote ROUTER_IP local MACHINE_IP dev ETHERNET_DEVICE /sbin/ip addr add MACHINE_IP dev wccp1 /sbin/ip link set wccp1 up /sbin/sysctl -w net.ipv4.conf.wccp1.rp_filter=0 /sbin/sysctl -w net.ipv4.conf.ETHERNET_DEVICE.rp_filter=0 /sbin/iptables-restore < /etc/default/iptables
This constructs the tunnel and brings it up. At the house, the server has a separate connection on a separate virtual network from everything else. WCCP should handle the situation where your web cache is in the subnet whose traffic is getting redirected, but I wanted to ensure there weren't going to be any difficulties.
Now, the final piece of the puzzle are the iptables rules that search for traffic coming down the tunnel and redirects them to Squid:
# /etc/default/iptables # Allow in everything, from everywhere *filter :INPUT ACCEPT [0:0] :FORWARD ACCEPT [0:0] :OUTPUT ACCEPT [0:0] COMMIT *nat :PREROUTING ACCEPT [0:0] :POSTROUTING ACCEPT [0:0] :OUTPUT ACCEPT [0:0] # Reroute HTTP requests to the proxy server -A PREROUTING -i wccp1 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 3128 -A PREROUTING -i wccp1 -p tcp -m tcp --dport 8000 -j REDIRECT --to-ports 3128 -A PREROUTING -i wccp1 -p tcp -m tcp --dport 8080 -j REDIRECT --to-ports 3128 COMMIT
Once all of this is in place and running, you should be able to see web traffic get cached and logged into /var/log/squid3/access.log. From my logs after several days, I'm seeing the cache serve about 4% of my traffic. Given the amount of dynamic content on the Internet today, that's not very surprising. What's it worth all this effort to get going? Probably not, but it was sure fun.
I need to point out that most of the content of this post can be found on the Squid site and wiki, and that community should receive any positive credit for their fine work.