mailbox_date_trimmer

Scenario description

You are a mailing list administrator, or you are somebody who keeps mailing list archives for a user group, or you just have a fetish for email archives. Don't you wonder why, after running Hypermail or MHonArc on your archives, some emails are dated back in 1980 or far in the future, like 2011, even though you started collecting emails in the year 2000 and it's still that very same year? While there are many answers to this question, there is a very easy way to fix many, if not all, of these messages, which results in a much more consistent email archive without broken discussion threads.

How does it work?

Mailing lists with some activity register at least some messages every month, and luckily most of these emails have correct dates. This program iterates through the email archive of your choice (in Unix mailbox format) checking the date header of each email and comparing it to the date of the previous email. If the difference in time is greater than a month, the current email's date is considered invalid.
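
As a rough illustration of the check described above, here is a minimal sketch in modern Python 3. It is not the program's actual code. The names parse_date and is_suspicious are made up for this example, "a month" is assumed to mean 30 days, and timezones are simply ignored.

```python
from datetime import datetime, timedelta
from email.utils import parsedate

# Assumption: "a month" is approximated as 30 days.
ONE_MONTH = timedelta(days=30)


def parse_date(value):
    """Parse a Date header value, ignoring the timezone.

    Returns a naive datetime, or None when the value cannot be parsed.
    """
    parts = parsedate(value) if value else None
    if parts is None:
        return None
    try:
        return datetime(*parts[:6])
    except ValueError:
        # Parsed, but with impossible values (e.g. day 31 of February).
        return None


def is_suspicious(previous, current):
    """True when current is missing or more than a month away from previous."""
    if current is None:
        return True
    return abs(current - previous) > ONE_MONTH
```

An email whose Date header cannot be parsed at all is treated here like one with a wildly wrong date, which seems the natural reading of the description but is still an assumption.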

When an email is sent to a mailing list, it very likely hops through several computers before it reaches its audience. The good thing about this is that each hop adds a timestamp to the email's headers. People running email servers connected day and night to the internet usually set them up correctly (or face the consequences), so it is very unlikely that one of these added timestamps is incorrect. Also, email delivery from one of these servers to another tends to be pretty quick, with delays of no more than minutes, or even seconds, in most circumstances.

So when the current email's date is considered invalid, mailbox_date_trimmer finds all the date timestamps in the headers and, reading them in reverse order (servers add their headers to the beginning of the email), picks the first one whose difference to the previous email is smaller than one month. About 99% of the time this is the very first candidate.
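
The selection step could look roughly like the sketch below, written for modern Python 3 with the standard email package. It assumes the hop timestamps live in Received headers, after the last semicolon, which is the usual layout; the real program may look at other headers too. The function names and the 30 day window are assumptions made for this example.

```python
from datetime import datetime, timedelta
from email import message_from_string
from email.utils import parsedate

# Assumption: "one month" is approximated as 30 days.
ONE_MONTH = timedelta(days=30)


def received_dates(message):
    """Collect the timestamps of all Received headers, top to bottom."""
    dates = []
    for header in message.get_all("Received", []):
        # The timestamp conventionally follows the last semicolon.
        stamp = header.rsplit(";", 1)[-1].strip()
        parts = parsedate(stamp)
        if parts is None:
            continue
        try:
            dates.append(datetime(*parts[:6]))  # timezone ignored
        except ValueError:
            continue
    return dates


def pick_replacement(message, previous):
    """Return the first hop timestamp, read bottom-up, within the window."""
    for candidate in reversed(received_dates(message)):
        if abs(candidate - previous) <= ONE_MONTH:
            return candidate
    return None
```

Given a message parsed with message_from_string and the date of the previous email, pick_replacement returns either a datetime to use as the new date or None when nothing fits.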

In the broken cases where an email doesn't have ANY timestamp header at all, mailbox_date_trimmer gives this email the time of the previous email plus one second. In the cases where the closest match doesn't fall within the expected one month timeframe, mailbox_date_trimmer gives up and doesn't add any header at all. The latter could happen with legitimate emails which you moved incorrectly to a folder, or if you unsubscribed for the holidays and resubscribed much later, etc.
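
The three possible outcomes can be summed up in a few lines. This is only a sketch in modern Python 3 under stated assumptions: choose_date and the outcome labels are hypothetical names, the candidates are expected to be already ordered the way the program reads them, and the window is taken to be 30 days.

```python
from datetime import datetime, timedelta

# Assumption: "one month" is approximated as 30 days.
ONE_MONTH = timedelta(days=30)


def choose_date(previous, candidates):
    """Return (new_date, outcome) for an email with an invalid date.

    previous   -- datetime of the previous email in the mailbox
    candidates -- datetimes found in the headers, in reading order
    """
    if not candidates:
        # No timestamps at all: previous email plus one second.
        return previous + timedelta(seconds=1), "no timestamps"
    for candidate in candidates:
        if abs(candidate - previous) <= ONE_MONTH:
            return candidate, "hop timestamp"
    # Timestamps exist but none is plausible: leave the email alone.
    return None, "gave up"
```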

If your mailbox contains messages which fall into this category, tough luck, you will have to weed them out manually. For everybody else: rejoice, your calvary has come to an end and you can finally enjoy email archives with consistent dates.

Software requirements

This software requires Python (http://www.python.org). It is known to work with versions 1.5.2 and 2.2.3. You also need my mailbox_reader module, which you should be able to get from:

Usage

mailbox_date_trimmer is a command-line tool with very few options. Running mailbox_date_trimmer with the -h or --help argument should bring up a help screen showing you how to use the program and which switches it accepts. The program can read mailboxes from the hard disk or through standard input. It can also write new mailboxes or dump everything through standard output. Reading from standard input means that if you run the program without arguments it will sit there idle waiting for your input, just like the grep command.

You can run the program as a filter inside a more complex command chain. It consumes and produces data one email at a time, so you can feed it gigabytes of data and it should not run out of memory unless you have single emails which don't fit in your available free memory, or you have another heavyweight process consuming all your memory.
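
To show why memory use stays flat, here is a minimal sketch of the one-email-at-a-time idea in modern Python 3. It does not use the mailbox_reader module; iter_messages, copy_mailbox and main are made-up names, and it assumes that every line starting with "From " begins a new message (body lines are expected to be quoted as ">From ", as is usual in Unix mailboxes).

```python
import sys


def iter_messages(lines):
    """Yield each message as a list of lines, holding only one in memory."""
    current = []
    for line in lines:
        # Assumption: a line starting with "From " begins a new message.
        if line.startswith("From ") and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def copy_mailbox(in_stream, out_stream):
    """Filter skeleton: read, (optionally fix), and write one email at a time."""
    for message in iter_messages(in_stream):
        # A real filter would repair the Date header of `message` here.
        out_stream.writelines(message)


def main():
    copy_mailbox(sys.stdin, sys.stdout)
```

Calling main() turns the script into a plain pass-through filter, so it can sit in the middle of a shell pipeline.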

Note that while I have run this over all my personal mailing list archives, and the program is written in such a way that it should never do stupid things, hey, I'm a stupid human, and the computer just follows my instructions. So better make a safe backup of your mail archives before you use this software on them. Anyway, it has worked correctly with about 500MB of mail archives, which is all I have been able to get from the internet and friends.

Checking the generated output

In order to verify that mailbox_date_trimmer didn't break anything seriously, the first thing you should do is inspect the generated mail archive and count the number of emails. It should equal the number of emails in your original mailbox. If this is not the case, I'm sorry, I must have done something terrible. Drop me an email.
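
If you don't have a favourite way of counting emails, a few lines of modern Python 3 will do. This is only a sketch: count_messages and same_message_count are hypothetical helpers, and the count assumes that every line starting with "From " is a message separator, which holds for mailboxes where such body lines are quoted as ">From ".

```python
def count_messages(lines):
    """Count message separators in an iterable of byte lines."""
    return sum(1 for line in lines if line.startswith(b"From "))


def same_message_count(original_path, generated_path):
    """True when both mailbox files contain the same number of emails."""
    with open(original_path, "rb") as original:
        before = count_messages(original)
    with open(generated_path, "rb") as generated:
        after = count_messages(generated)
    return before == after
```

Files are opened in binary mode so that odd character encodings inside old emails cannot make the count fail.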

The second thing you can do is go through the emails one after another, checking which dates were modified. When mailbox_date_trimmer modifies the date of an email, and the verbose switch has been used, the original date is stored in the header X-DT. The reason for the change is stored in the header X-pi. You can therefore extract all modified messages with the following command, if you have grepmail available on your machine:

grepmail -h X-pi mailbox > changed

Now you can open this new mailbox and see more easily which messages were modified. Don't like these extra headers? Well, currently you have to grep them out yourself.
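
Removing the two extra headers by hand could be done with something like the following sketch in modern Python 3. It is not part of the program; strip_extra_headers is a made-up name, and it assumes that each message starts with a "From " line and that its headers end at the first empty line. Folded continuation lines of a removed header are removed too, and anything in the message body is left alone.

```python
# Header names taken from the text above, lowercased for comparison.
EXTRA_HEADERS = ("x-dt:", "x-pi:")


def strip_extra_headers(lines):
    """Yield mailbox lines without X-DT and X-pi headers."""
    in_headers = False
    skipping = False
    for line in lines:
        if line.startswith("From "):
            # Assumption: a "From " line starts a message and its headers.
            in_headers = True
            skipping = False
        elif in_headers:
            if line in ("\n", "\r\n"):
                # Empty line: headers are over, the body begins.
                in_headers = False
                skipping = False
            elif line[:1] in (" ", "\t"):
                # Continuation of the previous header line.
                if skipping:
                    continue
            else:
                skipping = line.lower().startswith(EXTRA_HEADERS)
                if skipping:
                    continue
        yield line
```

Since it works on any iterable of lines, it can be fed an open mailbox file and its output written with writelines.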

You will notice that most of the dates the program generates are not accurate. I didn't bother to parse timezones, since an error of some hours is irrelevant when the acceptance time frame is one month in both directions. Also, time operations are done using the local time of your machine, when they should be using UTC.
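
For anyone tempted to fix this, the standard library of current Python 3 versions already does the heavy lifting. The sketch below is not how the program works today; parse_date_utc is a hypothetical helper showing how a date header could be converted to UTC with email.utils.parsedate_tz and email.utils.mktime_tz.

```python
from datetime import datetime, timezone
from email.utils import mktime_tz, parsedate_tz


def parse_date_utc(value):
    """Parse a date header value into a timezone-aware UTC datetime.

    Returns None when the value cannot be parsed.
    """
    parts = parsedate_tz(value)
    if parts is None:
        return None
    # mktime_tz applies the parsed timezone offset and returns UTC seconds.
    return datetime.fromtimestamp(mktime_tz(parts), tz=timezone.utc)
```

With dates normalised like this, two emails sent at the same instant from different timezones compare as equal, regardless of the local time of the machine running the program.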

However, small differences in time didn't cause me any problems at all. You are welcome to send me patches in unified diff format to improve this or any other aspect of the program. The current version satisfies all my needs, so it is quite unlikely that I will actively improve this software (no itch to scratch).

Contact information

You should be able to reach me at gradha@users.sourceforge.net. If this fails, try going to my web page (currently at http://gradha.sdf-eu.org/), where my current email address is stamped at the bottom of most pages. If that URL fails, you could try Googling for "Grzegorz Adam Hankiewicz" (don't forget the quotes). Am I narcissistic or what? As if you ever wanted to know that much...

License

This software is covered under the GPL. See the full license text in the provided LICENSE file.