Turning off GMail’s draconian spam filter

Okay, so there is no way to turn off GMail’s spam filter (or even turn it down to the point where it stops putting more legitimate email than spam into the “Spam” folder).

To fix this behavior I have thrown together a short Python script that simply moves any email found in the Spam folder into the Inbox. I used to achieve the same thing with a filter which told GMail never to mark as spam any email with an ‘@’ in the ‘From’ address, but GMail has suddenly decided to start ignoring that filter, so a more permanent solution was required.

#!/usr/bin/env python

import imaplib

IMAP_USER='<your_user_name>@gmail.com'
IMAP_PASSWORD='<your_password>'


if __name__ == '__main__':
  imap4 = imaplib.IMAP4_SSL('imap.gmail.com')
  imap4.login(IMAP_USER, IMAP_PASSWORD)
  imap4.select('[Gmail]/Spam')
  # Find every message currently in the Spam folder.
  typ, data = imap4.search(None, 'ALL')
  for num in data[0].split():
    # Fetch a few headers for logging; BODY.PEEK avoids marking the message as read.
    message_subj = imap4.fetch(num, '(BODY.PEEK[HEADER.FIELDS (SUBJECT FROM TO DATE)])')[1]
    print "Moving message '%s' from Spam to INBOX" % (', '.join(message_subj[0][1].rstrip().split("\r\n")))
    # Copy the message to the Inbox, then flag the original for deletion.
    imap4.copy(num, 'INBOX')
    imap4.store(num, '+FLAGS', '\\Deleted')
  # Permanently remove the flagged originals from Spam.
  imap4.expunge()
  imap4.close()
  imap4.logout()

Even more Python vs Perl performance

Following my last two posts (http://blog.entek.org.uk/?p=106 and http://blog.entek.org.uk/?p=112) I took some profilers to my code.

First up my revised Python implementation:

> python -m cProfile checkmail2
...
         16225093 function calls in 8.663 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    8.663    8.663 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 UserDict.py:17(__getitem__)
        1    5.906    5.906    8.662    8.662 checkmail2:3(<module>)
        1    0.000    0.000    8.663    8.663 {execfile}
       37    0.000    0.000    0.000    0.000 {method '__enter__' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       22    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'iteritems' of 'dict' objects}
 16224990    2.755    0.000    2.755    0.000 {method 'startswith' of 'str' objects}
       37    0.001    0.000    0.001    0.000 {open}
        1    0.000    0.000    0.000    0.000 {posix.listdir}

Quite clearly there is a huge number of calls (16 million!) to ‘startswith’, which is the biggest time-sink outside the main script.

Comparing the Perl implementation:

> perl -d:DProf checkmail3
...
> dprofpp
Total Elapsed Time = 3.426896 Seconds
  User+System Time = 3.401494 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 0.12   0.004  0.007      4   0.0010 0.0016  main::BEGIN
 0.03   0.001  0.001      5   0.0002 0.0002  File::Basename::BEGIN
 0.03   0.001  0.001      1   0.0009 0.0009  warnings::BEGIN
 0.03   0.001  0.001     37   0.0000 0.0000  File::Basename::_strip_trailing_sep
 0.03   0.001  0.001     37   0.0000 0.0000  File::Basename::fileparse
 0.00   0.000  0.002     37   0.0000 0.0000  File::Basename::basename
 0.00   0.000  0.000      1   0.0003 0.0003  File::Glob::doglob
 0.00   0.000  0.000      1   0.0001 0.0001  DynaLoader::dl_load_file
 0.00   0.000  0.000      1   0.0001 0.0003  XSLoader::load
 0.00   0.000  0.000      1   0.0001 0.0001  File::Basename::fileparse_set_fstype
 0.00   0.000  0.000      1   0.0001 0.0001  Exporter::import
 0.00   0.000  0.000      2   0.0000 0.0000  warnings::import
 0.00   0.000  0.000      3   0.0000 0.0000  strict::import
 0.00   0.000  0.000      1   0.0000 0.0003  File::Glob::csh_glob
 0.00   0.000  0.000      1   0.0000 0.0000  strict::bits

Ignoring the actual times, which are not directly comparable due to the profiling overheads, we can clearly see that Perl benefits hugely from its built-in regex engine: there are zero function calls associated with each line check.

I did replace the ‘str.startswith’ implementation of the Python script with a version which used ‘re’ regex objects, but this showed even worse performance:

> time python checkmail4
...
python checkmail4 8.42s user 0.33s system 99% cpu 8.765 total

Profiling this version shows the per-call overhead of ‘re.match’ was about double that of ‘str.startswith’ and, obviously, the number of calls remained the same. On top of this, the two calls to ‘re.compile’ at the start of the script incurred a fair number of function calls of their own, as the profiler shows:

> python -m cProfile checkmail4
...
         16225312 function calls in 12.416 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.416   12.416 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 UserDict.py:17(__getitem__)
        1    6.662    6.662   12.415   12.415 checkmail4:3(<module>)
        2    0.000    0.000    0.000    0.000 re.py:186(compile)
        2    0.000    0.000    0.000    0.000 re.py:227(_compile)
        2    0.000    0.000    0.000    0.000 sre_compile.py:367(_compile_info)
        2    0.000    0.000    0.000    0.000 sre_compile.py:38(_compile)
        4    0.000    0.000    0.000    0.000 sre_compile.py:480(isstring)
        2    0.000    0.000    0.000    0.000 sre_compile.py:486(_code)
        2    0.000    0.000    0.000    0.000 sre_compile.py:501(compile)
       15    0.000    0.000    0.000    0.000 sre_parse.py:144(append)
        2    0.000    0.000    0.000    0.000 sre_parse.py:146(getwidth)
        2    0.000    0.000    0.000    0.000 sre_parse.py:184(__init__)
       21    0.000    0.000    0.000    0.000 sre_parse.py:188(__next)
        2    0.000    0.000    0.000    0.000 sre_parse.py:201(match)
       19    0.000    0.000    0.000    0.000 sre_parse.py:207(get)
        2    0.000    0.000    0.000    0.000 sre_parse.py:307(_parse_sub)
        2    0.000    0.000    0.000    0.000 sre_parse.py:385(_parse)
        2    0.000    0.000    0.000    0.000 sre_parse.py:669(parse)
        2    0.000    0.000    0.000    0.000 sre_parse.py:73(__init__)
        2    0.000    0.000    0.000    0.000 sre_parse.py:96(__init__)
        2    0.000    0.000    0.000    0.000 {_sre.compile}
 16224990    5.751    0.000    5.751    0.000 {built-in method match}
        1    0.000    0.000   12.416   12.416 {execfile}
        6    0.000    0.000    0.000    0.000 {isinstance}
       44    0.000    0.000    0.000    0.000 {len}
       37    0.000    0.000    0.000    0.000 {method '__enter__' of 'file' objects}
       59    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       24    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'iteritems' of 'dict' objects}
        4    0.000    0.000    0.000    0.000 {min}
       37    0.002    0.000    0.002    0.000 {open}
       13    0.000    0.000    0.000    0.000 {ord}
        1    0.000    0.000    0.000    0.000 {posix.listdir}

Quite clearly from the profiler output, each call to either ‘str.startswith’ or ‘re.match’ uses a very small amount of processor time (too small to show in the output), but the cumulative effect of 16 million calls is where the big slowdown occurs. To get around this I re-implemented the ‘str.startswith’ version using string slicing (i.e. “line[:5] == ‘From ’” rather than “line.startswith(‘From ’)”) and the result was dramatic:

> time python checkmail4
...
python checkmail4 3.86s user 0.31s system 99% cpu 4.186 total

The profiler output for this version shows that the number of function calls is now on a par with the Perl implementation:

> python -m cProfile checkmail4
...
         110 function calls in 4.311 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.311    4.311 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 UserDict.py:17(__getitem__)
        1    4.308    4.308    4.311    4.311 checkmail4:3(<module>)
        1    0.000    0.000    4.311    4.311 {execfile}
       37    0.000    0.000    0.000    0.000 {method '__enter__' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       29    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'iteritems' of 'dict' objects}
       37    0.003    0.000    0.003    0.000 {open}
        1    0.000    0.000    0.000    0.000 {posix.listdir}

This puts the Python version within 0.6s of the Perl version, which is close enough for me, especially considering this is effectively comparing Perl to Python on Perl’s home turf of text matching.

I think Perl would probably still outperform Python if I wanted to do something fancier involving regex substitutions, but Python’s performance issues in this case seem to be purely down to function-call overheads, which Perl sidesteps by incorporating the regex engine into the core language.
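As a rough illustration of the per-call costs, the three line tests can be compared directly with the ‘timeit’ module. This is just a sketch (the sample line and iteration count are my own; absolute numbers will vary by machine and Python version), but it mirrors the ~16 million line checks in the real script:

```python
import timeit

# One representative mbox 'From ' line; a million iterations each.
setup = ("import re; "
         "line = 'From someone@example.com Sat Jan  1 2009'; "
         "pat = re.compile(r'^From ')")

for name, stmt in [
    ('str.startswith', "line.startswith('From ')"),
    ('re.match',       "pat.match(line)"),
    ('slice',          "line[:5] == 'From '"),
]:
    elapsed = timeit.timeit(stmt, setup=setup, number=1000000)
    print('%-14s %.3fs per million calls' % (name, elapsed))
```

The slice wins for the reason the profiles suggest: it avoids the per-line attribute lookup and call overhead that dominates with ‘startswith’ and ‘re.match’.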

Perl vs Python speed cont’d

Following my post yesterday I decided both to slightly refine my implementation and to see if I could improve the speed of the Python version of my script. All that is really required is a simple search of an mbox file for messages with no ‘Status’ header (which mutt adds once it has seen a message). Both the Perl and Python scripts I was using yesterday were doing far more than necessary, as both (at least partially) parsed the messages, which is not required for the mboxes I have. I have therefore re-implemented the Python version as a simple loop over each line of the mbox files, counting the messages with no Status header:

#!/usr/bin/env python

from os import environ, listdir

MAILHOME=environ['HOME'] + '/Mail'

new_mailboxes={}
for mbox_file in listdir(MAILHOME):
	no_status=False
	with open('%s/%s' % (MAILHOME, mbox_file)) as f:
		for line in f:
			if line.startswith('From '):
				# Start of a new message; count the previous one if it had no Status header.
				if no_status:
					new_mailboxes[mbox_file]=new_mailboxes.get(mbox_file, 0) + 1
				no_status=True
			elif line.startswith('Status: '):
				no_status=False
	# Loop ended, make sure we count the last message if we need to.
	if no_status:
		new_mailboxes[mbox_file]=new_mailboxes.get(mbox_file, 0) + 1

for box, count in new_mailboxes.iteritems():
	print "%s (%d)" % (box, count)

This revised script completes in a much more respectable time, substantially quicker than the original Perl script I was comparing my first attempt to:

> time python checkmail2
...
python checkmail2 5.30s user 0.32s system 99% cpu 5.625 total

Curiosity got the better of me and I implemented the exact same algorithm in Perl and timed that:

#!/usr/bin/env perl

use strict;
use warnings;

use File::Basename;

my %new_mailboxes;
for my $file (glob("$ENV{HOME}/Mail/*")) {
	my $no_status = 0;
	my $basename = basename($file);
	open(INPUT, '<', $file) or die "Cannot open $file: $!\n";
	while(<INPUT>) {
		if (/^From /) {
			$new_mailboxes{$basename} += 1 if $no_status;
			$no_status = 1;
		} elsif ( /^Status: / ) {
			$no_status = 0;
		}
	}
	close INPUT;
	$new_mailboxes{$basename} += 1 if $no_status;
}

print $_, ' (', $new_mailboxes{$_}, ")\n" for keys(%new_mailboxes);


> time perl checkmail3
...
perl checkmail3 2.96s user 0.42s system 97% cpu 3.465 total

Interestingly, Perl was still faster by quite some way (the Python version took around 1.6 times as long to run). The question is: is this purely down to the overhead of object-oriented vs procedural code, or is Perl faster at I/O and/or pattern matching (although the Python version should have been quicker here, as it was not using a full-blown regex engine to match the start of strings)?

Perl vs Python speed

I needed to write a quick script to find which mailboxes in “~/Mail” had unread messages in them. I decided to knock it up in Python, but the script was not performing very well:

> time ./checkmail
...
./checkmail 56.17s user 1.86s system 98% cpu 59.181 total

A quick google found a Perl program which did pretty much the same thing (at http://www.perlmonks.org/?node_id=552218 if you’re interested). Run unaltered on the exact same files, it performed significantly better:

> time perl checkmail2
...
perl checkmail2 16.66s user 1.27s system 99% cpu 18.043 total

Not only did the Perl version take about a third of the time of the Python implementation, it also counted the number of unread messages and displayed it, while my simple Python script broke out of the loop on the first match to avoid needlessly looping over the rest of the messages. The Perl version manages to avoid fully decoding every message through the use of the Mail::MboxParser library; I could not find a way to achieve the same result in a straightforward manner with the standard Python libraries. Indeed, looking at the documented examples in the Python docs (http://docs.python.org/library/mailbox.html#examples), this appears to be the suggested way of doing it: in essence all I need to do is examine the ‘Status’ header, and the example uses a very similar loop to examine just the ‘Subject’ header.

My Python script is here:

#!/usr/bin/env python

import mailbox
from os import listdir, environ

MAILHOME=environ['HOME'] + '/Mail'

new_mail_mailboxes=[]
for mbox_file in listdir(MAILHOME):
	for message in mailbox.mbox(MAILHOME + '/' + mbox_file):
		# A message with no 'Status' header has not been seen by mutt yet.
		if not message['status']:
			new_mail_mailboxes.append(mbox_file)
			break
print "\n".join(new_mail_mailboxes)

The UK Government strikes again

Since my previous rant on the UK Government’s inability to understand how technology works, it would appear the Government has still not advanced its understanding.

Apparently they are going to force ISPs to record the time, to and from details of all emails. Aside from the fact that I fail to see how this will possibly prevent terrorism, one has to ask: “what about those of us who do not use ISPs to send email?” Am I going to be marked as a possible terrorist simply because I run my own mail server, rather than use my [parents’] ISP’s?

Also, what information are they using as the from and to? If they use the IP addresses of the sender and receiver then they will neither be able to readily find the actual identity of the sender/receiver, nor be able to record the IP address of the receiver until they collect their mail from the mail server by POP/IMAP/webmail. If, alternatively, they record the from/to email addresses then the information will be useless due to how trivial it is to forge from addresses (as anyone who has received spam claiming to be from their bank can testify). If they record the IP of the sender and the email address of the receiver (probably the most sensible combination) then they will still be unable to easily determine who sent the email, since a single IP may have many computers behind it (due to NAT routers) and IP addresses are constantly being re-allocated (especially by ISPs which allocate them dynamically).

Yet another brilliant, and useless, idea from our illustrious leaders.
</rant>

Courier-imap-ssl woes

In order to be able to resize RAID5 arrays in my mailserver, I upgraded from Debian Stable->Testing, as that broke less than trying to manually install the relevant packages from experimental and unstable. To resize RAID5, according to Steinar H. Gunderson, you need a 2.6.17-rc* kernel and mdadm tools >= 2.4.1. Thankfully the updated mdadm tools are in unstable, so installing them on testing was trivial. linux-image-2.6.17-rc3 is in experimental, so installing that was also straightforward: just a case of adding an experimental source, aptitude update, aptitude install linux-image-2.6.17-rc3, and removing the experimental source again.

After the stable->testing upgrade everything seemed to be working fine. My mail was still being fetched and delivered locally, mutt was working fine, apache2 was still running and the imaps daemon was still going. This morning I tried to access my email through the copy of SquirrelMail I have installed for easy access without having to ssh into the box. It failed to log in with the message:

Error connecting to IMAP server: tls://localhost.
115 : Operation now in progress

To see if the courier-imap-ssl daemon was simply not accepting connections I fired up Thunderbird (which I haven’t used in some time, since setting up my mailserver). Thunderbird connected successfully and happily talked to the mailserver, fetched my current inbox and allowed me to poke my emails, although it didn’t seem to like some locally created emails with attachments (it just refused to show the attachments). Starting the non-ssl daemon and telling SquirrelMail to use that instead worked, but it should be able to use the ssl daemon. It was working fine under stable!

According to the DirectAdmin Knowledge Base the error is caused by a bug in PHP. Their solution seems to be to rebuild everything from source. I think I’ll try some less drastic solutions first, such as downgrading SquirrelMail to the version in Stable, and if that doesn’t work downgrading PHP too. Or I could try installing PHP5 (I assume it’s still using 4.something at the moment).

Anyway, I have two exams in the next 24 hours, so more pokeage of this will have to wait until the weekend.

**UPDATE**
Following some interesting reading on php.net, freebsd.org and bugs.debian.org on the matter, I decided to try installing PHP5 (those seem to indicate that, on Debian, the problem is an openssl<->php incompatibility). After installing PHP5 it all worked as expected. Hurray! Now for some revision, honest.

Weblinks

In the good old days(tm), before I started blogging and stuff, I used to e-mail myself weblinks if I didn’t want to lose them. Since I’m now blogging and have recently completely replaced the way I get my email, I’m sorting out the contents of my inbox and removing these links. Anything which is still useful or relevant is here:

A simple text list of ‘language notes’ on Python. It’s really useful and concise, as well as listing most of the nifty things you can do with the Python language.

A comprehensive guide to the stuff you can do with .htaccess. It lists just about everything you could do (or at least want to do) with the .htaccess file in a straightforward way. One of the better references for doing stuff with a .htaccess file.

Lots of shiny icons.

The exceptionally shiny Flurry screensaver from Mac OS X, for Windows

How to use 3rd-party encoders (e.g. FLAC) with Exact Audio Copy

Nifty settings and stuff for mutt

An interesting article on Gentoo in the server room

OMG H4X!

I’ve just finished writing possibly the hackiest (if that’s a word) script I’ve ever written.

I’ve switched to using mutt as my primary mail client and, since it’s pretty much a stock mutt setup, it doesn’t particularly like the HTML emails that the ‘Daily Dilbert’ comes as. In order to read the strips with as little effort as possible (always a good thing ;) ), I’ve written this filter. It extracts the URL of the image from the email, fetches the image and then replaces the original email with just the image as an attachment (no body).

This is the filter from my .mailfilter (which courier-maildrop uses):

if ( /^Subject: Your Daily Dilbert$/ )
{
# This incredibly hacky script extracts just the picture from the 'Daily Dilbert' email so I can view it through
# my non-html email client (mutt).
# It's in an exception block, so if the hacky script fails the email will (hopefully) still get delivered.
exception {
TEMPFILE=`tempfile`
FROMFILE=`tempfile`
OUTFILE=`tempfile`
xfilter "tee $FROMFILE \
| grep 'http://www\.comics\.com/comics/dilbert/archive/images/.*\.gif' \
| sed -e 's#^.*\(http://www\.comics\.com/comics/dilbert/archive/images/.*\.gif\).*$#\1#' > $TEMPFILE \
&& sed -e '/^Subject: Your Daily Dilbert$/d; /^$/Q;' $FROMFILE | tee $FROMFILE > /dev/null \
&& echo 'X-Haxed-For-Piccie-Only: Yes' >> $FROMFILE \
&& curl `cat $TEMPFILE` > /tmp/`cat $TEMPFILE | sed -e 's#^.*/\(.*\.gif\).*$#\1#'` \
&& rm $OUTFILE \
&& mpack -s 'Your Daily Dilbert' -o $OUTFILE /tmp/`cat $TEMPFILE | sed -e 's#^.*/\(.*\.gif\).*$#\1#'` \
&& rm /tmp/`cat $TEMPFILE | sed -e 's#^.*/\(.*\.gif\).*$#\1#'` $TEMPFILE \
&& cat $FROMFILE $OUTFILE \
&& rm $OUTFILE $FROMFILE"
}
to $MAILDIR/.MailingLists/
}

Note the subject gets deleted, and then re-added when mpack creates the email with the attachment. This is because mpack refuses to create mail without a subject, even though I’m adding all the original headers (mainly so I can see SpamAssassin’s report) back in later.
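For what it’s worth, the grep/sed stages of the pipeline could be collapsed into a single regex in Python. This is just an illustrative sketch (the function name and sample body are my own, not part of the filter):

```python
import re

# The same URL pattern as the grep/sed stages in the filter above.
STRIP_RE = re.compile(
    r'(http://www\.comics\.com/comics/dilbert/archive/images/[^"\s]*\.gif)')

def extract_strip_url(body):
    """Return the first strip image URL found in the email body, or None."""
    match = STRIP_RE.search(body)
    if match:
        return match.group(1)
    return None

body = '<img src="http://www.comics.com/comics/dilbert/archive/images/strip.gif">'
print(extract_strip_url(body))  # prints the URL from the src attribute
```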

Hmm, Blog

I recently (re-)discovered I’d actually installed a blog script, and never written anything down! Oh well, no time like the present to start – I wonder how long I will be able to keep writing entries before I:

  1. get bored with the whole ‘blog’ idea
  2. simply forget about or neglect the blog to the point that it disappears from my mind (again!)
  3. get distracted by some project or other

I recently managed to set up a Debian-based mail server. I originally searched google and came up with a number of guides which looked quite good, albeit long, but it’s a project I’ve been planning for a while so I decided to bite the bullet and have a go. After installing Debian and playing around with various different approaches for a bit, I discovered an entry on another blog at The Tech Terminal explaining how the author had set up a Debian mail server. This simply said that all I had to do was enter this:

# apt-get install courier-imap
# apt-get install postfix
# postconf -e 'home_mailbox = Maildir/'
# postconf -e 'mailbox_command ='
# /etc/init.d/postfix restart

at the command line. This was certainly a lot easier than the 8-page guide I had been following previously, and it worked :).

Using other guides I installed SpamAssassin and SquirrelMail, and it was all working very nicely. Fetchmail and gotmail were easy to install and configure using the man pages, so I didn’t need to enlist google’s help with them. I now have a single server with 2x40GB HDDs (configured for RAID 1 using a PCI PATA RAID card) which goes and fetches emails from my 2 POP accounts and my hotmail account and delivers them to my local user on the machine (for my purposes I decided LDAP was overkill and that dropping the mail to a local user’s Maildir made more sense). This means I can now access my mail using an IMAP client on either my desktop or laptop, or I can use a web-browser from any other location.

One small snag I did run into is that Maildir creates a directory for each folder on the server (as you’d expect) but doesn’t nest them. I was expecting them to nest, and it took a while (and some head-banging) for me to discover that Maildir actually uses a ‘.’ to represent sub-directories.
e.g. this structure:

Inbox
:-New
:-Badgers
: :-Mushroom
: :-Snake
:-Llama

becomes this Maildir structure:

/Inbox
/Inbox.New
/Inbox.Badgers
/Inbox.Badgers.Mushroom
/Inbox.Badgers.Snake
/Inbox.Llama
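In other words, the nesting is flattened into dot-separated names. A trivial helper shows the translation (this is just an illustrative sketch of the naming scheme, not part of courier or any of the tools above):

```python
def maildir_name(folders):
    """Flatten a nested folder path into courier-style Maildir naming,
    where '.' separates levels instead of nested directories."""
    return '/' + '.'.join(folders)

print(maildir_name(['Inbox']))                         # /Inbox
print(maildir_name(['Inbox', 'Badgers', 'Mushroom']))  # /Inbox.Badgers.Mushroom
print(maildir_name(['Inbox', 'Llama']))                # /Inbox.Llama
```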