A strange rrdtool error; Linux conntrack documentation

Last week i made some fairly significant changes on a client’s production firewall/routing cluster during our maintenance window.  The next morning there were reports of file server drives not connecting correctly and of inaccessible web sites.  Because all wireless-to-wired and Internet traffic goes through this cluster, the firewall changes were the obvious culprit.  The logs showed that we had run out of space in the connection tracking table:

May 26 08:55:05 corella1 kernel: ip_conntrack: table full, dropping packet.
May 26 08:55:13 corella1 kernel: ip_conntrack: table full, dropping packet.
May 26 08:55:15 corella1 kernel: ip_conntrack: table full, dropping packet.

I checked the counters in /proc/sys/net/ipv4/netfilter/, raised the limit for net.ipv4.netfilter.ip_conntrack_max in /etc/sysctl.conf to four times its previous value, and loaded the new value into /proc.
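As a rough illustration of the check described above, here is a minimal Python sketch that reads the two counters and reports how full the table is. The function names and the proc_dir parameter are invented for this example. On newer kernels the equivalent files are named nf_conntrack_count and nf_conntrack_max and live under /proc/sys/net/netfilter/, so adjust the path and names to suit.

```python
from pathlib import Path

# Location of the counters on the kernel discussed in this post.
# Assumption: newer kernels need /proc/sys/net/netfilter and nf_conntrack_* names.
PROC_DIR = Path("/proc/sys/net/ipv4/netfilter")


def read_counter(name, proc_dir=PROC_DIR):
    """Read one integer value from a file under proc_dir."""
    return int((Path(proc_dir) / name).read_text().strip())


def conntrack_usage(proc_dir=PROC_DIR):
    """Return (count, maximum, percent_used) for the conntrack table."""
    count = read_counter("ip_conntrack_count", proc_dir)
    maximum = read_counter("ip_conntrack_max", proc_dir)
    return count, maximum, 100.0 * count / maximum
```

Calling conntrack_usage() on the firewall itself returns the current count, the limit, and the percentage in use, which is enough to decide whether the limit needs raising before packets start being dropped.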

Then i started to hack up a few little scripts to monitor and graph ip_conntrack_count against ip_conntrack_max using rrdtool. I’ve used rrdtool a little before, so i thought it would be pretty straightforward.  I created my RRD file and started updating it every minute with the latest counters from netfilter.  However, as soon as i tried to graph it i got the error

ERROR: parameter 'cnt' does not represent a number in line AREA:cnt#00FF00:count\n

A Google search brought up a lot of hits which contained the same text but were not relevant – most of them were caused by specifying the variable incorrectly.  However, i came across one very similar problem: https://lists.oetiker.ch/pipermail/rrd-users/2007-November/013277.html

Unfortunately, this post on the rrdtool users mailing list had no responses, so i was down to solving it myself.  It took me some time before i realised that the original poster of that message and i had both made exactly the same elementary mistake: forgetting to include a filename for the graph output.  This rudimentary error is not picked up by rrdtool’s command line parser (at least not as at version 1.2.12 on SUSE Linux Enterprise Server), resulting in a very confusing error message.
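To make the mistake concrete, here is a hedged Python sketch that builds rrdtool command lines as argument lists and refuses to build a graph command whose output filename is missing or looks like a graph element. The RRD file name, data source names, colours, heartbeat and RRA sizing are all assumptions made for illustration; they are not taken from the scripts linked in this post. A guess at the mechanism, not verified against the rrdtool source: without a filename, the first DEF is probably consumed as the output file, so cnt is never defined by the time AREA refers to it.

```python
import re

RRD = "conntrack.rrd"  # hypothetical RRD file name
STEP = 60              # seconds, matching the one-minute update interval

# Graph elements that should never appear where the output filename belongs.
SPEC = re.compile(
    r"^(DEF|CDEF|VDEF|AREA|STACK|LINE\d*|PRINT|GPRINT|COMMENT|HRULE|VRULE|TICK|SHIFT):"
)


def create_cmd(rrd=RRD, step=STEP):
    """argv for `rrdtool create`: two gauges, one day of per-step averages."""
    return [
        "rrdtool", "create", rrd, "--step", str(step),
        f"DS:count:GAUGE:{2 * step}:0:U",   # assumed data source name
        f"DS:max:GAUGE:{2 * step}:0:U",     # assumed data source name
        f"RRA:AVERAGE:0.5:1:{86400 // step}",
    ]


def update_cmd(timestamp, count, maximum, rrd=RRD):
    """argv for `rrdtool update` with one sample for each data source."""
    return ["rrdtool", "update", rrd, f"{timestamp}:{count}:{maximum}"]


def graph_cmd(output, rrd=RRD):
    """argv for `rrdtool graph`; the output filename must come first."""
    if not output or SPEC.match(output):
        raise ValueError("rrdtool graph needs an output filename before the DEFs")
    return [
        "rrdtool", "graph", output,
        f"DEF:cnt={rrd}:count:AVERAGE",
        f"DEF:max={rrd}:max:AVERAGE",
        "AREA:cnt#00FF00:count\\n",  # literal backslash-n, as in the error above
        "LINE1:max#FF0000:max\\n",
    ]
```

Each list can be handed to subprocess.run(..., check=True) to actually invoke rrdtool. Building the list in a function means the missing-filename case fails early with a readable message instead of the confusing one rrdtool gives.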

So then i had a working rrd graph on my firewall, which seems to have settled down nicely.  You can find the current (very rough) state of the scripts at https://github.com/paulgear/puppet/tree/2b5363a3fbc1e73d5d88158e93ab5d879910173b/modules/netfilter/files.

At the moment i’m only graphing the connection tracking count vs. its maximum (see the graph below).  Note the interesting minor variation in the graphed max value, even though the value itself isn’t actually changing.  This seems to be due to rrdtool’s consolidation of data points – the change to a solid line came when i started truncating the date of each update to an exact multiple of the step interval that the rrd was set up with (in this case, 60 seconds).

Sample conntrack rrd graph
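The timestamp truncation mentioned above can be sketched in a couple of lines of Python. The function name is made up for this example, and the 60-second default simply mirrors the step interval described in the text.

```python
STEP = 60  # seconds; must match the step the rrd was created with


def align_to_step(timestamp, step=STEP):
    """Round a Unix timestamp down to an exact multiple of the step."""
    timestamp = int(timestamp)
    return timestamp - timestamp % step
```

Passing the aligned value as the timestamp of each update, rather than letting rrdtool use the current time, should mean every sample lands exactly on a step boundary, so rrdtool has nothing to interpolate between neighbouring intervals.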

After getting this working, i wondered whether there were other conntrack values i should be checking (ip_conntrack_tcp_be_liberal and ip_conntrack_tcp_loose sounded particularly interesting), so i went looking for documentation on the files in /proc/sys/net/ipv4/netfilter/.  Initial searches came up with very little.  The best description i could find of them was at http://netfilter.linux-kernel.at/documentation/pomlist/pom-extra.html#tcp-window-tracking, but i must admit that i crave more detail.  If anyone can point me to a better reference, or suggest which conntrack items really need monitoring, please drop me a line.

(Incidentally, i’ve discovered that collectd has a netfilter conntrack plugin, so i will probably not develop the scripts i created any further, but will try to adapt that plugin to my needs.)



Source: libertysys.com.au

Another HP product added to my "do not buy" list: LaserJet P2035n

 

I tweeted about the HP LaserJet P2035n a while back, and things have only gotten worse for me since.  To summarise: it has no SSL support for administration, its SNMP response is patchy (see graph below), and it isn’t supported by JetAdmin.  This last point was underscored for me yesterday: i realised that the particular printer i’ve been monitoring is running an older firmware version (from 2008), so i went looking for an updated one.  I found it on HP’s web site (eventually – that remains a rant for another day), downloaded it to my JetAdmin VM, and promptly found that the update insists on the printer being locally connected via USB.

Fails like this are unfortunately becoming increasingly common with HP’s product range as they try to compete on price with everyone and get products to market quickly.  (See my rant about the ProCurve 1810 switch.)  My open letter to HP follows:

Dear HP,

Please stop trying to compete with Dell on price.  Purchase price is not absolutely everything, nor is time to market.  Concentrate on making yourselves useful and manageable in the medium-large enterprise, and we’ll keep buying your products.

Yours sincerely,
A (mostly) happy, long-term ProCurve and LaserJet customer.


Graph from my SNMP monitoring package showing this printer going up and down like a yo-yo: