Perl vs Python speed cont’d

Following my post yesterday I decided to both slightly refine my implementation and see if I could improve the speed of the Python version of my script. All that is really required is a simple search of an mbox file, looking for messages with no ‘Status’ header (which is added by mutt when it has seen the message). Both the Perl and Python scripts I was using yesterday were doing far more than necessary, as both (at least partially) parsed the messages, which is not required with the mboxes I have. I have therefore re-implemented the Python version as a simple loop over each line in the mbox files which counts the number of messages with no status header:

#!/usr/bin/env python

from os import environ, listdir

MAILHOME=environ['HOME'] + '/Mail'

new_mailboxes={}
for file in listdir(MAILHOME):
	no_status=False
	with open('%s/%s' % (MAILHOME, file)) as f:
		for line in f:
			if line.startswith('From '):
				if no_status:
					new_mailboxes[file]=new_mailboxes.get(file, 0) + 1
				no_status=True
			elif line.startswith('Status: '):
				no_status=False
	# Loop ended, make sure we count the last message if we need to.
	if no_status:
		new_mailboxes[file]=new_mailboxes.get(file, 0) + 1

for box, count in new_mailboxes.iteritems():
	print "%s (%d)" % (box, count)

This revised script completes in a much more respectable time, substantially quicker than the original Perl script I was comparing my first attempt to:

> time python checkmail2
...
python checkmail2 5.30s user 0.32s system 99% cpu 5.625 total
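
To sanity-check the counting logic without a real mail directory, the same loop can be wrapped in a function and fed a few hand-written lines. This is only a minimal sketch, assuming Python 3; the function name `count_unseen` and the sample messages are made up for illustration and are not part of the script timed above.

```python
import io


def count_unseen(lines):
    """Count messages with no 'Status: ' line, using the same loop as the script."""
    count = 0
    no_status = False
    for line in lines:
        if line.startswith('From '):
            # A new message starts; count the previous one if it had no status.
            if no_status:
                count += 1
            no_status = True
        elif line.startswith('Status: '):
            no_status = False
    # Loop ended, count the last message if it had no status line.
    if no_status:
        count += 1
    return count


# Hypothetical three-message mbox: only the second message has been seen.
SAMPLE = io.StringIO(
    "From alice@example.com Mon Jan  1 10:00:00 2024\n"
    "Subject: first\n"
    "\n"
    "body\n"
    "From bob@example.com Mon Jan  1 11:00:00 2024\n"
    "Status: RO\n"
    "Subject: second\n"
    "\n"
    "body\n"
    "From carol@example.com Mon Jan  1 12:00:00 2024\n"
    "Subject: third\n"
    "\n"
    "body\n"
)

print(count_unseen(SAMPLE))
```

Note that, as in the script itself, any line starting with ‘Status: ’ clears the flag, including one that happens to appear in a message body.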

Curiosity got the better of me and I implemented the exact same algorithm in Perl and timed that:

#!/usr/bin/env perl

use strict;
use warnings;

use File::Basename;

my %new_mailboxes;
for my $file (glob("$ENV{HOME}/Mail/*")) {
	my $no_status = 0;
	my $basename = basename($file);
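	# Three-argument open with a lexical filehandle; skip files we cannot read.
	open(my $input, '<', $file) or do { warn "Cannot open $file: $!\n"; next; };
	while (<$input>) {
		if (/^From /) {
			$new_mailboxes{$basename} += 1 if $no_status;
			$no_status = 1;
		} elsif (/^Status: /) {
			$no_status = 0;
		}
	}
	close $input;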
	$new_mailboxes{$basename} += 1 if $no_status;
}

print $_, ' (', $new_mailboxes{$_}, ")\n" for keys(%new_mailboxes);


> time perl checkmail3
...
perl checkmail3 2.96s user 0.42s system 97% cpu 3.465 total

Interestingly, Perl was still faster by quite some way: the Python version took around 1.6 times as long to run (about 5.6s against 3.5s wall-clock). The question is whether this is purely down to the overhead of object-orientated versus procedural code, or whether Perl is simply faster at IO and/or pattern matching. If anything, the Python version should have been quicker at the matching, as it was not using a full-blown regex engine to match the start of strings.
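
One way to start separating these factors is to time the line test on its own, away from any file IO. The following is a rough sketch rather than a measurement I have made: it assumes Python 3, uses synthetic lines instead of my real mailboxes, and the function names are invented for the comparison. It times `str.startswith` against precompiled regexes over the same in-memory list.

```python
import re
import timeit

# Synthetic sample lines (an assumption; not taken from a real mailbox).
LINES = [
    "From alice@example.com Mon Jan  1 00:00:00 2024\n",
    "Status: RO\n",
    "Subject: hello\n",
    "Just some body text that matches neither prefix.\n",
] * 1000

# re.match only matches at the start of the string, like ^ in the Perl script.
FROM_RE = re.compile(r"From ")
STATUS_RE = re.compile(r"Status: ")


def count_with_startswith(lines):
    """Count prefix hits using plain string methods, as the Python script does."""
    hits = 0
    for line in lines:
        if line.startswith("From "):
            hits += 1
        elif line.startswith("Status: "):
            hits += 1
    return hits


def count_with_regex(lines):
    """Count prefix hits using precompiled regexes, closer to the Perl script."""
    hits = 0
    for line in lines:
        if FROM_RE.match(line):
            hits += 1
        elif STATUS_RE.match(line):
            hits += 1
    return hits


if __name__ == "__main__":
    for func in (count_with_startswith, count_with_regex):
        seconds = timeit.timeit(lambda: func(LINES), number=100)
        print("%s: %.3fs" % (func.__name__, seconds))
```

If the two variants come out close to each other, the matching itself is unlikely to explain the gap, which would point towards IO or general interpreter overhead instead.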

5 thoughts on “Perl vs Python speed cont’d”

  1. Actually, I already did that one and the performance is even worse. It takes ~8s to run. I’ve done some profiling, however, and gotten the Python implementation down to ~4s, which I think is about as close to Perl as makes no difference.

  2. I guess that the overhead might be mostly due to the Python startup, it’d be nice to have a time comparison on a much bigger mailbox ;)

    Could you give some estimate of the number of files and their size?

  3. That’s run against 39 mailboxes totalling 562MB. Unfortunately that’s all the mail I’ve got so I’m not going to be trying it on anything much bigger.

  4. I’m currently learning Python.

    I have a parser script that takes lines, manipulates them, and constructs text output.

    My Perl script takes 2 seconds to do the job on a 500K-line file (with statistics computation); the python3 script takes 13 seconds on the same file without optimization, and 8 with many optimizations. The optimizations were the following:
    – string: don’t use += but join
    – hash (dictionary): don’t use double dimension [stringA][stringB] but single dimension [stringA+','+stringB]

    The difference between the implementations is that I used classes in the python3 script.

    Regards.
