Sunday, February 01, 2009

A month of spam

So my month of analyzing how well the spam detectors for Gmail, Yahoo Mail and Hotmail work is over. The results are a little hard to interpret, but they are interesting anyway. Before you look at the table, here is some nomenclature:

True negatives: Email that I've received on that account that was not classified as spam and was not spam
True positives: Email that I've received on that account that was correctly classified as spam
False positives: Email that was classified as spam but was not spam
False negatives: Email that was spam but was not classified as spam

Provider   True negatives   True positives   False positives   False negatives
Gmail      251              555              6                 2
Yahoo      72               71               1                 31
Hotmail    20               25               1                 2

So, what is the conclusion? Well, first, Yahoo's spam classifier isn't very good at catching spam: about 1/3 of the spam that I received ended up in my inbox. Only one real email ended up in the spam folder (which is statistically the same as Gmail or Hotmail). Gmail seems to do a very good job at classifying spam, but it errs more toward throwing things into my spam folder than toward letting spam pass into my inbox, and that does annoy me a lot. As you can see, it's the account I use the most and the one that receives the most spam. If I don't check the spam folder for a couple of days, it can be hard to sift through 70 spam emails to find the one non-spam email in there. Unfortunately I don't have much to say about Hotmail because that account is mostly dead.
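To put numbers on those impressions, here is a minimal Python sketch that turns the four counts per provider into two rates: the share of spam that reached the inbox, and the share of real email that landed in the spam folder. The counts are my reading of the table above, and the names `counts`, `rates` and `summarize` are made up for this illustration.

```python
# Counts per provider, as read from the table in this post:
# (true negatives, true positives, false positives, false negatives)
counts = {
    "Gmail":   (251, 555, 6, 2),
    "Yahoo":   (72, 71, 1, 31),
    "Hotmail": (20, 25, 1, 2),
}


def rates(tn, tp, fp, fn):
    """Return (missed-spam rate, lost-real-mail rate) as fractions."""
    missed_spam = fn / (tp + fn)  # share of all spam that reached the inbox
    lost_real = fp / (tn + fp)    # share of all real email sent to the spam folder
    return missed_spam, lost_real


def summarize(table):
    """Compute both rates for every provider in the table."""
    return {name: rates(*row) for name, row in table.items()}


if __name__ == "__main__":
    for name, (missed, lost) in summarize(counts).items():
        print(f"{name:8s} spam in inbox: {missed:6.1%}   real email in spam folder: {lost:6.1%}")
```

With these counts, Yahoo lets 31 of 102 spam emails through, which is where the "about 1/3" comes from, while Gmail misses only 2 of 557. The false positive rates rest on very few events (6, 1 and 1), so they are too noisy to rank the providers by.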

What else can we say about this? Well, we can look for trends by date. Does spam arrive more often on weekdays or on weekends? (Unfortunately my data is not split by time of day: the date on the email often doesn't make much sense, and I haven't checked my email often enough to annotate the time.) Let's look only at Gmail, where there was enough data to make it interesting:
Sunday      56
Monday      94
Tuesday     70
Wednesday   76
Thursday    106
Friday      87
Saturday    66

Or as a graph:



I wish there were more to show there. I'll probably need to look at more than a month of data to see a clearer trend. Here is the raw day-by-day data for the month for Gmail:



Two of the spikes you see are Mondays and one is a Thursday. The interesting trend I see right now is that I seem to be getting significantly less spam in the last few days. Let's see if this trend continues.
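One caveat about the weekday totals from earlier: a month doesn't contain the same number of each weekday, so the raw totals favor the days that occur five times. Here is a rough Python sketch that divides each total by the number of times that weekday occurred. It assumes the counts cover exactly January 1 to January 31, 2009, which is my guess at the window, not something I recorded. The function names are made up for this illustration.

```python
from collections import Counter
from datetime import date, timedelta

# Names indexed the way date.weekday() numbers them (Monday is 0).
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Gmail spam totals per weekday, from the list earlier in this post.
gmail_spam_by_weekday = {
    "Sunday": 56, "Monday": 94, "Tuesday": 70, "Wednesday": 76,
    "Thursday": 106, "Friday": 87, "Saturday": 66,
}


def weekday_occurrences(start, end):
    """Count how often each weekday occurs from start to end, inclusive."""
    occurrences = Counter()
    day = start
    while day <= end:
        occurrences[WEEKDAY_NAMES[day.weekday()]] += 1
        day += timedelta(days=1)
    return occurrences


def average_per_day(totals, start, end):
    """Divide each weekday total by the number of such days in the window."""
    occurrences = weekday_occurrences(start, end)
    return {name: total / occurrences[name] for name, total in totals.items()}


if __name__ == "__main__":
    # ASSUMPTION: the measurement window is calendar January 2009.
    averages = average_per_day(gmail_spam_by_weekday,
                               date(2009, 1, 1), date(2009, 1, 31))
    for name, average in averages.items():
        print(f"{name:10s} {average:5.1f} spam emails per day")
```

Under that assumption, Thursday, Friday and Saturday each occur five times and the other days four times, so Thursday's 106 works out to 21.2 per day and Monday's 94 to 23.5. If the window was different, the averages shift accordingly.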

Well, I guess that's it. It was fun! I should do things like this more often. Now it's time to start my day.
