ntpq: write to ::1 failed: Operation not permitted

 

The other day I got a bug report about check_ntpmon, which was reporting UNKNOWN status back to Nagios even though everything seemed to be working fine. A bit of debugging revealed that it was receiving this message from ntpq on standard error:

ntpq: write to ::1 failed: Operation not permitted

This was a bit strange, because the various links I found indicated that this message is usually caused by firewalls.

But this host was not blocking anything, and in any case check_ntpmon’s use of ntpq only ever touches the loopback interface, which firewalls rarely filter. A bit of further digging showed that the culprit was indeed not the firewall but a full conntrack table, with dmesg showing:

Aug 4 03:04:19 hostname kernel: [5226949.016837] nf_conntrack: table full, dropping packet

Increasing the conntrack limit fixed the problem.
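
For anyone wanting to check how close a host is to this limit, here is a minimal Python sketch. It is not part of check_ntpmon. It assumes the counters live under /proc/sys/net/netfilter (as they do on kernels using the nf_conntrack module), and the 80% warning threshold is an arbitrary value chosen for illustration.

```python
from pathlib import Path

# Assumed location of the counters on kernels that use the nf_conntrack module.
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(base_dir=CONNTRACK_DIR):
    """Return (count, maximum, fraction_used) read from the conntrack counters."""
    base = Path(base_dir)
    count = int((base / "nf_conntrack_count").read_text().strip())
    maximum = int((base / "nf_conntrack_max").read_text().strip())
    return count, maximum, count / maximum


def describe(count, maximum, warn_at=0.8):
    """Build a one-line status; warn_at is an arbitrary illustrative threshold."""
    fraction = count / maximum
    state = "WARNING" if fraction >= warn_at else "OK"
    return f"{state}: conntrack table {count}/{maximum} ({fraction:.0%} full)"


if __name__ == "__main__":
    try:
        count, maximum, _ = conntrack_usage()
    except (OSError, ValueError) as exc:
        # The files are absent when the conntrack module is not loaded.
        print(f"UNKNOWN: cannot read conntrack counters: {exc}")
    else:
        print(describe(count, maximum))
```

If the count sits near the maximum, raising the net.netfilter.nf_conntrack_max sysctl is the kind of fix described above.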

(Just thought I’d document this here for posterity, since none of the links I found suggested this issue.)
Source: libertysys.com.au

BackupPC incorrect "no ping response" error message

I discovered a bug with BackupPC’s error reporting.  This has hit me more than once (evidenced by the deja vu i experienced when debugging the problem), but i mustn’t have written down the solution previously.  A quick Google (by which i mean “skimming the first two screens of hits”) doesn’t show any obvious signs of people having the same issue, so i thought i’d document it here for search engine posterity.

The basic issue is that backups to certain systems fail, and the diagnostics shown in the web interface look like this:

  • Last status is state “idle” (no ping) as of 11/2 14:00.
  • Last error is “no ping response”.
  • Pings to laptop1 have failed 39 consecutive times.

However, the lack of a ping response is not the real problem.  If i log in as the backuppc user on my backup server, it is able to both ping and ssh to the host in question just fine.

Digging deeper in the logs, i found this in /var/lib/backuppc/log/LOG:

2012-11-02 14:52:05 laptop1: mkdir /var/lib/backuppc/trash: Permission denied at /usr/share/backuppc/lib/BackupPC/Lib.pm line 629

I fixed this by chowning /var/lib/backuppc to the backuppc user, and the backup proceeded as normal.  So it seems that backuppc will not do the right thing without a trash directory in place, and if it doesn’t have permissions to create it, it gives a misleading error message.
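
A quick way to test for this condition before a backup run is to try creating a directory where BackupPC needs to. The following is a rough Python sketch, not anything BackupPC itself provides. The helper name can_create_subdir is made up for illustration, the path comes from the log line above, and the script must be run as the backuppc user for the result to mean anything.

```python
import os
import tempfile


def can_create_subdir(parent):
    """Return True if the current user can create (and remove) a directory in parent."""
    try:
        probe = tempfile.mkdtemp(prefix=".probe-", dir=parent)
    except OSError:
        # Covers a missing parent directory as well as permission problems.
        return False
    os.rmdir(probe)
    return True


if __name__ == "__main__":
    top_dir = "/var/lib/backuppc"  # path taken from the log line above
    if can_create_subdir(top_dir):
        print(f"{top_dir}: ok")
    else:
        print(f"{top_dir}: cannot create directories here")
```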

In my case, this happened because my removable drive for backups died and i replaced it with a new one without fully recreating the directory structure as required by backuppc.  So i guess it’s my fault, but a more helpful error message would have been good.


Bizarre error message with Novell Certificate Server

It must be the week for stupid error messages.  I just tried to create a private SSL certificate for one of our server VMs from our Novell eDirectory iManager server.  The CSR was created by OpenSSL on the command line on the Ubuntu server, and i copied it to my laptop with the filename “req”.  When i tried to issue the certificate through iManager > Novell Certificate Server > Issue Certificate, it gave me the singularly unenlightening error:

Exception occurred processing WizardPage_CreateCert_Key.jsp

Google searches for this exact string resulted in zero hits.  The error log on the iManager server (/var/opt/novell/tomcat5/logs/catalina.out) showed a similar error:

WizardPage_005fCre.1594 java.lang.StringIndexOutOfBoundsException

After playing around with a few different certificate parameters and trying again, i decided to try something stupid: i added the filename extension “.csr”.  Unbelievably, this worked, and i was able to create and download the certificate without problems.  It seems that the iManager code makes some assumptions about the content based on the filename.

I’m glad i solved my problem, but i do have to wonder whether there are any vulnerabilities (at least of the denial-of-service persuasion) which might be possible due to these sorts of assumptions.


A strange rrdtool error; Linux conntrack documentation

Last week i made some fairly significant changes on a client’s production firewall/routing cluster during our maintenance window.  The next morning there were reports of file server drives not connecting correctly and inaccessible web sites.  Because all wireless-to-wired and Internet traffic goes through this cluster, the firewall changes were the obvious culprit.  Looking at the logs it turned out we had run out of space in the connection tracking table:

May 26 08:55:05 corella1 kernel: ip_conntrack: table full, dropping packet.
May 26 08:55:13 corella1 kernel: ip_conntrack: table full, dropping packet.
May 26 08:55:15 corella1 kernel: ip_conntrack: table full, dropping packet.

I checked the counters in /proc/sys/net/ipv4/netfilter/, upped the limit for net.ipv4.netfilter.ip_conntrack_max in /etc/sysctl.conf to 4 times its previous value, and loaded the new value into /proc.

Then i started to hack up a few little scripts to monitor and graph ip_conntrack_count against ip_conntrack_max using rrdtool. I’ve used rrdtool a little before, so i thought it would be pretty straightforward.  I created my RRD file and started updating it every minute with the latest counters from netfilter.  However, as soon as i tried to graph it i got this error:

ERROR: parameter ‘cnt’ does not represent a number in line AREA:cnt#00FF00:count\n

A search of Google brought up a lot of hits which contained the same text but were not relevant – most of them were errors in not specifying the variable correctly.  However, i came across one very similar problem: https://lists.oetiker.ch/pipermail/rrd-users/2007-November/013277.html

Unfortunately, this post on the rrdtool users mailing list had no responses, so i was down to solving it myself.  It took me some time before i realised that the original poster of that message and i had both made exactly the same elementary mistake: forgetting to include a filename for the graph output.  This rudimentary error is not picked up by rrdtool’s command line parser (at least not as of version 1.2.12 on SUSE Linux Enterprise Server), resulting in a very confusing error message.
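
To make the mistake concrete: in rrdtool graph, the output filename is the first argument after the "graph" keyword, ahead of the DEF and AREA elements. Below is a minimal Python sketch that builds such a command and refuses to proceed without a plausible filename. The function name, the file names conntrack.png and conntrack.rrd, and the data source name "count" are all placeholders invented for illustration. Only the AREA element mirrors the one in the error message above.

```python
import shlex

# If the first argument looks like one of these, the filename was probably forgotten.
GRAPH_ELEMENT_PREFIXES = ("DEF:", "CDEF:", "VDEF:", "AREA:", "LINE1:", "LINE2:", "LINE3:")


def build_graph_command(output_file, rrd_file, ds_name):
    """Return an argument list for `rrdtool graph` with the output file first."""
    if not output_file or output_file.startswith(GRAPH_ELEMENT_PREFIXES):
        raise ValueError(f"missing or suspicious output filename: {output_file!r}")
    return [
        "rrdtool", "graph", output_file,
        # Bind the virtual name 'cnt' to a data source in the RRD file.
        f"DEF:cnt={rrd_file}:{ds_name}:AVERAGE",
        # Draw 'cnt' as a green area with the legend 'count'.
        "AREA:cnt#00FF00:count",
    ]


if __name__ == "__main__":
    # Hypothetical file and data source names.
    print(shlex.join(build_graph_command("conntrack.png", "conntrack.rrd", "count")))
```

Leaving out the filename shifts every argument along by one, which is presumably why the complaint lands on the AREA line rather than on the missing file.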

So then i had a working rrd graph on my firewall, which seems to have settled down nicely.  You can find the current (very rough) state of the scripts at https://github.com/paulgear/puppet/tree/2b5363a3fbc1e73d5d88158e93ab5d879910173b/modules/netfilter/files.

At the moment i’m only graphing the connection tracking count vs. its maximum (see the graph below).  Note the minor variation in the max line on the graph, even though the max value itself isn’t actually changing.  This seems to be due to rrdtool’s consolidation of data points: the line became solid once i truncated the timestamp of each update to an exact multiple of the step interval that the rrd was set up with (in this case, 60 seconds).
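
The timestamp truncation can be sketched in a few lines of Python. This is an illustration rather than the code from the scripts linked above. It assumes an RRD with two data sources in the order count, max, and the function names are made up.

```python
import time

STEP = 60  # seconds; the step interval the rrd was created with


def align_to_step(timestamp, step=STEP):
    """Round a Unix timestamp down to an exact multiple of the step interval."""
    whole = int(timestamp)
    return whole - whole % step


def update_argument(timestamp, count, maximum, step=STEP):
    """Format a 'timestamp:value:value' string in the form `rrdtool update` takes."""
    # Assumes the RRD defines its data sources in the order: count, max.
    return f"{align_to_step(timestamp, step)}:{count}:{maximum}"


if __name__ == "__main__":
    # The counter values here are made-up sample numbers.
    print(update_argument(time.time(), 1234, 65536))
```

With every update landing exactly on a step boundary, rrdtool has no need to interpolate between neighbouring samples, which is consistent with the constant max value showing up as a flat line.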

Sample conntrack rrd graph

After getting this working, i wondered whether there were other conntrack values i should be checking (ip_conntrack_tcp_be_liberal and ip_conntrack_tcp_loose sounded particularly interesting), so i went looking for documentation on the files in /proc/sys/net/ipv4/netfilter/. Initial searches came up with very little. The best description i could find of them was at http://netfilter.linux-kernel.at/documentation/pomlist/pom-extra.html#tcp-window-tracking, but i must admit that i crave more detail.  If anyone can point me to a better reference, or suggest which conntrack items really need monitoring, please drop me a line.

(Incidentally, i’ve discovered that collectd has a netfilter conntrack plugin, so i will probably not develop the scripts i created any further, but will try to adapt that plugin to my needs.)

