even more Python vs Perl performance

Following my last two posts (http://blog.entek.org.uk/?p=106 and http://blog.entek.org.uk/?p=112) I ran some profilers over my code.

First up my revised Python implementation:

> python -m cProfile checkmail2
...
         16225093 function calls in 8.663 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    8.663    8.663 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 UserDict.py:17(__getitem__)
        1    5.906    5.906    8.662    8.662 checkmail2:3(<module>)
        1    0.000    0.000    8.663    8.663 {execfile}
       37    0.000    0.000    0.000    0.000 {method '__enter__' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       22    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'iteritems' of 'dict' objects}
 16224990    2.755    0.000    2.755    0.000 {method 'startswith' of 'str' objects}
       37    0.001    0.000    0.001    0.000 {open}
        1    0.000    0.000    0.000    0.000 {posix.listdir}

Quite clearly there is a huge number (16 million!) of calls to startswith, which is the biggest time-sink outside the main script itself.

Comparing the Perl implementation:

> perl -d:DProf checkmail3
...
> dprofpp
Total Elapsed Time = 3.426896 Seconds
  User+System Time = 3.401494 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 0.12   0.004  0.007      4   0.0010 0.0016  main::BEGIN
 0.03   0.001  0.001      5   0.0002 0.0002  File::Basename::BEGIN
 0.03   0.001  0.001      1   0.0009 0.0009  warnings::BEGIN
 0.03   0.001  0.001     37   0.0000 0.0000  File::Basename::_strip_trailing_sep
 0.03   0.001  0.001     37   0.0000 0.0000  File::Basename::fileparse
 0.00   0.000  0.002     37   0.0000 0.0000  File::Basename::basename
 0.00   0.000  0.000      1   0.0003 0.0003  File::Glob::doglob
 0.00   0.000  0.000      1   0.0001 0.0001  DynaLoader::dl_load_file
 0.00   0.000  0.000      1   0.0001 0.0003  XSLoader::load
 0.00   0.000  0.000      1   0.0001 0.0001  File::Basename::fileparse_set_fstype
 0.00   0.000  0.000      1   0.0001 0.0001  Exporter::import
 0.00   0.000  0.000      2   0.0000 0.0000  warnings::import
 0.00   0.000  0.000      3   0.0000 0.0000  strict::import
 0.00   0.000  0.000      1   0.0000 0.0003  File::Glob::csh_glob
 0.00   0.000  0.000      1   0.0000 0.0000  strict::bits

Ignoring the actual times, which are not directly comparable due to the profiling overheads, we can clearly see that Perl benefits hugely from its built-in regex engine: there are no function calls associated with each line check.

I did replace the ‘str.startswith’ implementation of the Python script with a version that used ‘re’ regex objects, but this performed even worse:

> time python checkmail4
...
python checkmail4 8.42s user 0.33s system 99% cpu 8.765 total
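
For reference, the regex version of the per-line check looked roughly like this (a sketch of the approach rather than the exact checkmail4 script; ‘count_unread’ is just an illustrative helper):

import re

# Patterns are compiled once up front, but every line still costs a call
# into the regex engine via the compiled pattern's match() method.
FROM_RE = re.compile(r'^From ')
STATUS_RE = re.compile(r'^Status: ')

def count_unread(lines):
	# Count messages with no Status header in an iterable of mbox lines.
	unread = 0
	no_status = False
	for line in lines:
		if FROM_RE.match(line):
			if no_status:
				unread += 1
			no_status = True
		elif STATUS_RE.match(line):
			no_status = False
	if no_status:
		unread += 1
	return unread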

Profiling this one, we see the overhead of using ‘re.match’ was about double that of ‘str.startswith’ and, obviously, the number of calls remained the same. On top of this I introduced the additional overhead of two calls to ‘re.compile’ at the start of the script, which the profiler showed incurred a not-insignificant number of function calls of their own:

> python -m cProfile checkmail4
...
         16225312 function calls in 12.416 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.416   12.416 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 UserDict.py:17(__getitem__)
        1    6.662    6.662   12.415   12.415 checkmail4:3(<module>)
        2    0.000    0.000    0.000    0.000 re.py:186(compile)
        2    0.000    0.000    0.000    0.000 re.py:227(_compile)
        2    0.000    0.000    0.000    0.000 sre_compile.py:367(_compile_info)
        2    0.000    0.000    0.000    0.000 sre_compile.py:38(_compile)
        4    0.000    0.000    0.000    0.000 sre_compile.py:480(isstring)
        2    0.000    0.000    0.000    0.000 sre_compile.py:486(_code)
        2    0.000    0.000    0.000    0.000 sre_compile.py:501(compile)
       15    0.000    0.000    0.000    0.000 sre_parse.py:144(append)
        2    0.000    0.000    0.000    0.000 sre_parse.py:146(getwidth)
        2    0.000    0.000    0.000    0.000 sre_parse.py:184(__init__)
       21    0.000    0.000    0.000    0.000 sre_parse.py:188(__next)
        2    0.000    0.000    0.000    0.000 sre_parse.py:201(match)
       19    0.000    0.000    0.000    0.000 sre_parse.py:207(get)
        2    0.000    0.000    0.000    0.000 sre_parse.py:307(_parse_sub)
        2    0.000    0.000    0.000    0.000 sre_parse.py:385(_parse)
        2    0.000    0.000    0.000    0.000 sre_parse.py:669(parse)
        2    0.000    0.000    0.000    0.000 sre_parse.py:73(__init__)
        2    0.000    0.000    0.000    0.000 sre_parse.py:96(__init__)
        2    0.000    0.000    0.000    0.000 {_sre.compile}
 16224990    5.751    0.000    5.751    0.000 {built-in method match}
        1    0.000    0.000   12.416   12.416 {execfile}
        6    0.000    0.000    0.000    0.000 {isinstance}
       44    0.000    0.000    0.000    0.000 {len}
       37    0.000    0.000    0.000    0.000 {method '__enter__' of 'file' objects}
       59    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       24    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'iteritems' of 'dict' objects}
        4    0.000    0.000    0.000    0.000 {min}
       37    0.002    0.000    0.002    0.000 {open}
       13    0.000    0.000    0.000    0.000 {ord}
        1    0.000    0.000    0.000    0.000 {posix.listdir}

Quite clearly from the profiler output, each call to either ‘str.startswith’ or ‘re.match’ uses a very small amount of processor time (too small to be output), but the cumulative effect of 16 million calls is where the big slowdown was occurring. To get around this I tried implementing the ‘str.startswith’ version using string slicing (i.e. “line[:5] == ‘From ‘” rather than “line.startswith(‘From ‘)”) and the result was dramatic:

> time python checkmail4
...
python checkmail4 3.86s user 0.31s system 99% cpu 4.186 total
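
For what it’s worth, the difference between the three checks is easy to see in isolation with a quick timeit micro-benchmark (a hypothetical snippet, not part of the checkmail scripts; exact numbers will vary by machine and Python version):

import timeit

setup = "import re; line = 'From someone@example.com Sat Jan  1 00:00:00 2011\\n'; pat = re.compile(r'^From ')"

for name, stmt in [
	('re.match  ', "pat.match(line)"),
	('startswith', "line.startswith('From ')"),
	('slice     ', "line[:5] == 'From '"),
]:
	t = timeit.timeit(stmt, setup=setup, number=1000000)
	print('%s %.2fs per million calls' % (name, t))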

The profiler output for this version shows that the number of function calls is now on a par with the Perl implementation:

> python -m cProfile checkmail4
...
         110 function calls in 4.311 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.311    4.311 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 UserDict.py:17(__getitem__)
        1    4.308    4.308    4.311    4.311 checkmail4:3(<module>)
        1    0.000    0.000    4.311    4.311 {execfile}
       37    0.000    0.000    0.000    0.000 {method '__enter__' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       29    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'iteritems' of 'dict' objects}
       37    0.003    0.000    0.003    0.000 {open}
        1    0.000    0.000    0.000    0.000 {posix.listdir}

This puts the Python version within 0.6s of the Perl version, which is close enough for me, especially considering this is effectively comparing Perl to Python on Perl’s home turf of text matching.

I think Perl would probably still outperform Python if I wanted to do something fancier involving regex substitutions, but Python’s performance issues in this case seem to be purely down to function call overheads, which Perl sidesteps by incorporating the regex engine into the core language.

Perl vs Python speed cont’d

Following my post yesterday I decided both to slightly refine my implementation and to see if I could improve the speed of the Python version of my script. All that is really required is a simple search of an mbox file for messages with no ‘Status’ header (which is added by mutt when it has seen the message). Both the Perl and Python scripts I was using yesterday were doing far more than necessary, as both (at least partially) parsed the messages, which is not required with the mboxes I have. I have therefore re-implemented the Python version as a simple loop over each line in the mbox files which counts the number of messages with no Status header:

#!/usr/bin/env python

from os import environ, listdir

MAILHOME=environ['HOME'] + '/Mail'

new_mailboxes={}
for file in listdir(MAILHOME):
	no_status=False
	with open('%s/%s' % (MAILHOME, file)) as f:
		for line in f:
			if line.startswith('From '):
				if no_status:
					new_mailboxes[file]=new_mailboxes.get(file, 0) + 1
				no_status=True
			elif line.startswith('Status: '):
				no_status=False
	# Loop ended, make sure we count the last message if we need to.
	if no_status:
		new_mailboxes[file]=new_mailboxes.get(file, 0) + 1

for box, count in new_mailboxes.iteritems():
	print "%s (%d)" % (box, count)

This revised script completes in a much more respectable time, substantially quicker than the original Perl script I was comparing my first attempt to:

> time python checkmail2
...
python checkmail2 5.30s user 0.32s system 99% cpu 5.625 total

Curiosity got the better of me and I implemented the exact same algorithm in Perl and timed that:

#!/usr/bin/env perl

use strict;
use warnings;

use File::Basename;

my %new_mailboxes;
for my $file (glob("$ENV{HOME}/Mail/*")) {
	my $no_status = 0;
	my $basename = basename($file);
	open INPUT, $file;
	while(<INPUT>) {
		if (/^From /) {
			$new_mailboxes{$basename} += 1 if $no_status;
			$no_status = 1;
		} elsif ( /^Status: / ) {
			$no_status = 0;
		}
	}
	close INPUT;
	$new_mailboxes{$basename} += 1 if $no_status;
}

print $_, ' (', $new_mailboxes{$_}, ")\n" for keys(%new_mailboxes);


> time perl checkmail3
...
perl checkmail3 2.96s user 0.42s system 97% cpu 3.465 total

Interestingly, Perl was still faster by quite some way (the Python version took around 1.6 times as long to run). The question is: is this purely down to the overhead of object-oriented vs procedural code, or is Perl faster at IO and/or pattern matching (although the Python version should have been quicker here, as it was not using a full-blown regex engine to match the start of strings)?
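
One way to start separating the IO cost from the matching cost (a hypothetical probe, not one of the scripts above) would be to time the bare read loop with no line checks at all; whatever is left over when this is subtracted from the full run is roughly the cost of the matching:

#!/usr/bin/env python

# Hypothetical probe: read every line of every mbox but do no matching,
# so the elapsed time is (roughly) just the IO and loop overhead.
import time
from os import environ, listdir

MAILHOME = environ['HOME'] + '/Mail'

start = time.time()
count = 0
for name in listdir(MAILHOME):
	with open('%s/%s' % (MAILHOME, name)) as f:
		for line in f:
			count += 1
print('%d lines read in %.2fs' % (count, time.time() - start))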

Perl vs Python speed

I needed to write a quick script to find which mailboxes in “~/Mail” had unread messages in them. I decided to knock it up in Python, but the script was not performing very well:

> time ./checkmail
...
./checkmail 56.17s user 1.86s system 98% cpu 59.181 total

A quick google found a Perl program which did pretty much the same thing (at http://www.perlmonks.org/?node_id=552218 if you’re interested) which, run unaltered on the exact same files, performed significantly better:

> time perl checkmail2
...
perl checkmail2 16.66s user 1.27s system 99% cpu 18.043 total

Not only did the Perl version take about a third of the time of the Python implementation, it also counted the number of unread messages and displayed it, while my simple Python script broke out of its loop on the first match to avoid needlessly looping over the rest of the messages. The Perl version manages to avoid fully decoding every message through its use of the Mail::MboxParser library; however, I could not find a way to achieve the same result in a straightforward manner with the standard Python libraries. Indeed, looking at the documented examples in the Python docs, http://docs.python.org/library/mailbox.html#examples, it appears this is the suggested way of doing it (in essence all I need to do is examine the ‘Status’ header, and the example uses a very similar loop to examine just the ‘Subject’ header).

My Python script is here:

#!/usr/bin/env python

import mailbox
from os import listdir, environ

MAILHOME=environ['HOME'] + '/Mail'

new_mail_mailboxes=[]
for file in listdir(MAILHOME):
	for message in mailbox.mbox(MAILHOME + '/' + file):
		# mutt adds a 'Status' header once it has seen a message,
		# so a message with no Status header is unread.
		if not message['status']:
			new_mail_mailboxes.append(file)
			break
print "\n".join(new_mail_mailboxes)

The problem with ‘middleware’

Both Python’s WSGI and Perl’s PSGI (and presumably Ruby’s Rack, but I have no experience of that) have a concept of ‘middleware’, which is a part of a web application that sits between the server interface (WSGI or PSGI) and the application itself. This middleware can act as a filter or manipulate the environment (using PSGI’s terminology) before the application sees it. This makes it great for implementing features such as authorisation and sessions, and indeed there are pre-built middlewares for both platforms which will do this.

The problem is that there is no standard for which parts of the environment get set by these useful middlewares, which means (for the most part) they cannot be instantly swapped out for an alternative. I think what is needed is a simple definition of the bare minimum API for a given object (i.e. what Java would term an interface) and a defined location within the environment where the object will be found. Obviously objects could implement additional methods to provide bells and whistles specific to the implementation, which applications can then use at the cost of no longer being able to do a straight swap of the middleware.

For example, a ‘session’ object might implement ‘get(key)’ and ‘set(key, value)’ and be found under ‘session’ in the environment hash. A ‘user’ object (as part of a larger authentication middleware) might implement ‘login_id’ and ‘roles’ attributes and be found under ‘auth.user’ in the environment hash.
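
In WSGI terms the idea might look something like the sketch below. The ‘session’ key and the get/set interface are the hypothetical standard being proposed here, not an existing convention, and a real middleware would obviously persist the session (via a cookie, database, etc.) rather than creating a fresh one per request:

class DictSession(object):
	# Bare-minimum session object: just get/set over a dict.
	def __init__(self):
		self._data = {}
	def get(self, key):
		return self._data.get(key)
	def set(self, key, value):
		self._data[key] = value

class SessionMiddleware(object):
	# Puts a session object into the environment under the agreed key.
	def __init__(self, app):
		self.app = app
	def __call__(self, environ, start_response):
		environ['session'] = DictSession()
		return self.app(environ, start_response)

def app(environ, start_response):
	# The application relies only on the agreed interface, so any
	# middleware providing the same 'session' object could be swapped in.
	session = environ['session']
	session.set('visited', True)
	start_response('200 OK', [('Content-Type', 'text/plain')])
	return ['visited: %s\n' % session.get('visited')]

application = SessionMiddleware(app)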

An application developer would then be able to choose between just using the published interface standard or also using some of the specific bells and whistles of a middleware. Even with the use of the extra features, the amount of refactoring involved in switching between middlewares would be limited to just the places where the extra features had been used. It would make switching from a generic authentication middleware to an in-house single sign-on solution very straightforward, for example.

Comments, suggestions?

Python fail

Rather annoyingly the pycrypto website is down, which means easy_install cannot download the code, which means I cannot build my code, which means I cannot deploy the bugfix I’ve just incorporated. For all its failings, at least with CPAN the code is always available from CPAN’s servers, so even if a project’s own site is down you can still get hold of the modules you’ve used in your code.

Python--, Perl++

Searching for pycrypto>=1.9
Reading http://cheeseshop.python.org/pypi/pycrypto/
Reading http://cheeseshop.python.org/pypi/pycrypto/2.2
Reading http://www.pycrypto.org/
error: Download error: (104, 'Connection reset by peer')

Catalyst, Catalyst::Authentication::Credential::Remote and mod_perl

mod_perl does not pass REMOTE_USER as an environment variable to Catalyst, so the Catalyst::Authentication::Credential::Remote plugin, which allows convenient offloading of authentication to Apache, will not work.

This is easily solved by adding this code to the root auto function:

$c->req->remote_user($c->apache->user) if $c->can('apache');

Now I can offload authentication to Apache’s mod_kerb whilst using Catalyst for authorisation.

Migrating from fastcgi to cgi

I’ve just migrated all of the sites on my VM (including this blog) from running on a fastcgi backend to running on a cgi backend. Why, you may ask? Well, although fastcgi is substantially more responsive than plain old cgi (since it keeps the processes running between requests, so the process start-up and take-down times are removed), it consumes much more memory (due to keeping those processes around). Nowadays this is not normally a problem, but on a virtual machine with only 128MB available and no swap, memory usage becomes a big issue.

By moving from fastcgi to “normal” cgi and tweaking my mysql config I have increased the free memory when the machine is idle from 4MB to 80MB, and now have lots of headroom if a single process (e.g. DenyHosts, which is idling at 20MB) decides it wants loads of RAM before the Linux low-memory reaper starts killing processes.

As a slight aside, the biggest memory hogs on my box are Python programs (DenyHosts [~20M] and rss2email [~45M]). I don’t know if this is a fault with Python or with the way the scripts are written (or both; not being intimately familiar with Python, I don’t know whether it encourages conservative use of memory or not).

Coping when languages change their API between minor versions

I have a Rails application (yes, I know, I’ve been regretting it for some time) which I wrote over 2 years ago using the (then) stable Rails version, 1.2.3, and whatever Ruby version was around at the time (Debian Sarge was the OS of choice for the server). Then Debian Etch was released, so I dist-upgraded to that, including the new version of Ruby (1.8.5), without any major headaches. Now Lenny is the stable version, but the version of Rails I used originally does not work with the current version of Ruby (1.8.7) because, apparently, `[]’ is no longer a valid method for ‘Enumerable::Enumerator’. This error is thrown in the Rails libraries themselves, not my code.

There are two obvious solutions: upgrade the version of Rails (which involves re-writing large portions of my code due to API changes in Rails, which is now at v2.3 (http://rubyonrails.org/)), or stick with the old version of Rails and install an older version of Ruby.

I did the latter. Originally I did this by <cringe>holding back the version of Ruby, and its dependencies, when doing the dist-upgrade from Etch to Lenny</cringe>. (I apologise for the kittens that were inevitably killed by me doing this!) This did, despite the horribleness (is that a word?) of the method, work.

Today I am installing a new server. Not only am I installing Lenny, which makes manually fetching the old versions a pain, but the new box is “amd64” (it’s an Intel, actually, so x86_64 is more accurate, but Debian refers to it as amd64) so I can’t just steal the packages from the cache on the old box. Thankfully all this means that I have been forced to install the old version in some sort of sane manner, by installing Etch in a chroot and calling the old Rails app from there. Here are the steps I took:
(prerequisites: debootstrap, dchroot)

# mkdir -p /var/chroot/etch # Make the new chroot directory
# debootstrap --arch amd64 etch /var/chroot/etch http://ftp.uk.debian.org/debian/
# mkdir -p /var/chroot/etch/var/rails # Where the rails app is going to live (well, it'll actually live outside the chroot and be mounted here)
# mkdir -p /var/chroot/etch/XXXXXXX # removed to protect the innocent
# mkdir -p /var/chroot/etch/var/run/mysqld # this will be bound outside the chroot so that rails can access the mysql instance on the server

I added the following to /etc/fstab, and mounted them:

# Etch chroot (for rails)
/proc /var/chroot/etch/proc none rw,bind 0 0
/tmp /var/chroot/etch/tmp none rw,bind 0 0
/dev /var/chroot/etch/dev none rw,bind 0 0
/var/rails /var/chroot/etch/var/rails none rw,bind 0 0
XXXXXXX /var/chroot/etch/XXXXXXX none rw,bind 0 0
/var/run/mysqld /var/chroot/etch/var/run/mysqld none rw,bind 0 0

From here I could enter the chroot and install some needed applications:

# chroot /var/chroot/etch
# apt-get install ruby
# apt-get install rubygems
# apt-get install libfcgi-ruby
# apt-get install rake
# gem install -y -v=1.2.3 rails
# gem install -y pdf-writer

Then I can configure dchroot by adding this to /etc/schroot/schroot.conf:

[etch]
description=Debian etch (oldstable) for rails
location=/var/chroot/etch
groups=www-data

And finally a quick change to the lighttpd config which runs the fcgi program:
Old:

"bin-path" => "/var/rails/$app/public/dispatch.fcgi",

New:

"bin-path" => "/usr/bin/dchroot -c etch -d -q /var/rails/$app/public/dispatch.fcgi",

and it all works quite nicely. Now I have a stable Lenny system which I can keep up to date and an etch chroot for the legacy code.

Spot the deliberate mistake…

I just returned from tea to find a computer, which I’d left copying a large amount of data (~300GB worth of backups) from one RAID array to another, displaying a friendly message:
cp: cannot create directory `./cpool/0': No space left on device.

In order to get to this state I’d done the following:

# mdadm --create -n3 -x1 /dev/md2 /dev/sdc1 /dev/sdd1 /dev/sdf1 missing
(several steps to setup lvm on and format the new device)
# cd /var/lib/backuppc
(some more steps, including adding the new mountpoint to /etc/fstab)
# mount .
# cp -a /mnt/oldbackuppc/pool ./pool
# cp -a /mnt/oldbackuppc/cpool ./cpool

In case you’ve not spotted it: after `mount`ing ‘.’ I needed to `cd .` to actually move onto the new mount. As it was, I was still on the device that /var is on (~40G), not my nice new 1.8TB RAID device!

Kerberised authenticated printing to Windows printers within a 2003 Active Directory with smbclient

I needed to print to a printer shared via a Windows Server 2003 print server from my GNU/Linux box. Allegedly this should be possible using smbspool, which is provided by Samba as a CUPS backend for printing to such devices, but despite spending some time looking at it I was unable to hit upon the right incantation to make it do this. In the end I wrote a short script which uses smbclient and the given user’s Kerberos ticket to authenticate, based upon a similar script which used stored credentials to print (to avoid putting a username and password in the device URI in CUPS).


#!/bin/bash

# CCLAH 30-March-2009
# Kerberised CUPS printing
# Based upon http://willem.engen.nl/projects/cupssmb/smbc (http://willem.engen.nl/projects/cupssmb/)

if [ "$1" = "" ]; then
	# list supported output types
	echo 'network smbc "Unknown" "Windows Printer using smbclient"'
	exit 0
fi

job="$1"
account="$2"
title="$3"
numcopies="$4"
options="$5"
filename="$6"

if [ "$filename" = "" ]; then
	filename=-
fi

# strip protocol from printer
printer=`echo "${DEVICE_URI}" | sed 's/^.*://'`

# Obtain the user's id in order to determine the kerberos cache file name
uid=`id -u $account`

echo "NOTICE: Account: $account uid: $uid" 1>&2

# and print using smbclient
echo "NOTICE: KRB5CCNAME=/tmp/krb5cc_$uid smbclient -k -c \"print ${filename}\" \"${printer}\"" 1>&2

errtxt=`KRB5CCNAME=/tmp/krb5cc_$uid smbclient -k -c "print ${filename}" "${printer}" 2>&1`
ret=${?}

echo "NOTICE: Return value: $ret" 1>&2

#
# Handle errors
# see backend(7) for error codes

# log message
if [ "$ret" = "0" ]; then
	echo "$errtxt" | sed 's/^/NOTICE: /' 1>&2
else
	echo "$errtxt" | sed 's/^/ERROR: /' 1>&2
fi

# "NT_STATUS_LOGON_FAILURE" -> CUPS_BACKEND_AUTH_REQUIRED
echo "$errtxt" | grep -i 'LOGON_FAILURE' >/dev/null && exit 2
# "NT_STATUS_BAD_NETWORK_NAME" -> CUPS_BACKEND_STOP
echo "$errtxt" | grep -i 'BAD_NETWORK_NAME' >/dev/null && exit 4

# something went wrong, don't know what -> CUPS_BACKEND_FAILED
[ "$ret" != "0" ] && exit 1

echo "NOTICE: Everything OK"

# success! -> CUPS_BACKEND_OK
exit 0

To use it (at least on my Debian box): save the script as ‘/usr/lib/cups/backend/smbc’, then add a “Windows Printer using smbclient” type printer with a URI of the form ‘smbc://server/printer’ and the appropriate driver for the printer at the other end. The only problem is that, at the moment, if smbclient fails the script exits with status 1 and the CUPS print queue enters a stopped state, which means that a given (unprivileged) user could theoretically craft a print job which would stop printing for all users on the local machine.