Testing email security products: Results and analysis

Kevin Tolly of the Tolly Group offers a look at how his company set out to test several email security products and the challenges it faced in coming up with sound methodologies.

Editor's note: This is part two of a series on email security products. Part one examined the challenges of testing such products and the methodology and approach the Tolly Group used for a recent vendor product evaluation project. Here, Kevin Tolly explains the final steps of the test process, as well as his impressions from the results. The actual test results will be discussed further in a sponsored webinar this month.

Avoiding shutdown triggers is crucial when testing email security products. For the actual testing of the four products we evaluated, we had to take measures to make the inbound email test traffic look natural and normal.

If, for example, we sent 100% malicious traffic to a given user, the systems under test (SUTs) might simply decide to shut down or block all messages to that user -- and the test would be over. Put another way, we had to avoid triggers that would either shut down or otherwise invalidate our test.

Much of what we know about triggers comes from what we have heard anecdotally. Email security vendors are not about to tell us anything concrete, but there are some examples of things to watch out for.

It makes sense that attackers might use a newly registered domain to carry out attacks. Thus, if you set up a new domain for your test, that might tip off the system that it is more likely to be a source of malicious traffic. For our research, we used domains that were a year old.

We'd also heard that a user receiving too many threats too quickly could be a trigger. Thus, we sent five benign, cleansing messages to each test user between malicious messages. We also avoided using email automation, which can send messages too rapidly, and instead allowed several minutes to pass between messages.
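We paced our traffic by hand, but for readers who do choose to script their test flow, the same cadence can be expressed in a few lines. Below is a minimal Python sketch under assumed placeholders (the relay, domains and mailbox names are invented); the point is the deliberate pauses, which avoid the rapid-fire pattern that gives automation away.

```python
import smtplib
import time
from email.message import EmailMessage

# Hypothetical placeholders -- not the relay, domains or accounts used in our test.
SMTP_HOST = "mail.example-test-domain.com"
SENDER = "tester@example-test-domain.com"
RECIPIENT = "target.user@example-tenant.com"
PAUSE_SECONDS = 5 * 60  # several minutes between messages

def send_message(subject: str, body: str) -> None:
    """Send one plain-text message through the test relay."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = RECIPIENT
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

def run_sample(threat: tuple[str, str], cleansing: list[tuple[str, str]]) -> None:
    """Send one threat sample, then five benign 'cleansing' messages,
    pausing between every message so the traffic does not look automated."""
    send_message(*threat)
    for subject, body in cleansing[:5]:
        time.sleep(PAUSE_SECONDS)
        send_message(subject, body)
    time.sleep(PAUSE_SECONDS)  # pause again before the next threat sample
```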

Did we avoid triggers? We don't know. Are there other characteristics that might trigger an SUT to shut your test down? Probably. You will simply need to be aware of this when you review the results and try to make your test flow resemble normal email traffic as much as possible.

Test results and analysis

Ultimately, we built and ran some 25 threat samples and 100 benign samples through four prominent email security products. Here's a brief look at some things we found.

First, it should be noted that where antispam products decide the fate of a message just once -- as it enters the system -- antiphishing products have two chances to judge a message. When a message containing a URL enters the SUT, the security product evaluates the URL.

Assuming it allows the message to pass into the user's inbox, the SUT rewrites every URL so that it points back to the SUT itself. If and when a user clicks on the URL, the SUT gets a second chance to evaluate the link and decide whether it constitutes a threat. Seconds, minutes or even days could elapse between the first and second evaluation.
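Conceptually, that rewrite step works something like the Python sketch below. This is a generic illustration of the technique, not any vendor's actual implementation; the gateway URL and query parameter are invented for the example.

```python
import re
from urllib.parse import quote

# Hypothetical click-time gateway -- illustrative only, not a real vendor endpoint.
GATEWAY = "https://clicktime.example-sut.com/v1/url"

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def rewrite_urls(html_body: str) -> str:
    """Replace every URL in a message body with a link back to the SUT's gateway.

    The original destination travels along as a query parameter, so the SUT
    can re-evaluate it at click time, possibly long after delivery.
    """
    def _rewrite(match: re.Match) -> str:
        original = match.group(0)
        return f"{GATEWAY}?u={quote(original, safe='')}"

    return URL_PATTERN.sub(_rewrite, html_body)

# At click time, the gateway decodes the original URL, re-checks it against
# current threat intelligence and either redirects the user or blocks the request.
```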

Thus, if an SUT is unaware of a zero-day threat as it comes through the system, it has the opportunity to judge it again at a later time, perhaps when that threat has become known to a company's threat intelligence system. Here's a look at some of the tests we used.

Link to known virus: This was a homegrown test. We built a test website, found a copy of a known virus and placed the EXE file on the site. We didn't change the name or obfuscate the virus in any way. Then, we sent an email containing a link to the infected file to our SUTs.

Only one of the systems blocked the message completely. The other three delivered the message; one routed it to junk mail, though it was still accessible to the user. We logged in and tried to download the virus. Two of the three detected and blocked the virus at that point. One system, however, failed to detect this old virus and allowed us to download the infected EXE file.

Fake WhatsApp link: As we began our testing, a new zero-day message arrived in a tester's inbox, so we grabbed it and sent it out to our SUTs. The HTML message stated that we had a new WhatsApp voicemail message. Clicking the URL took us to a fake tech support site that requested we call a scam 1-800 number.

While two of the systems stopped this message outright, the other two delivered it with the links rewritten. Minutes later, however, when we logged in to review the message, both systems allowed us to click through to the phishing site.

Fake Office 365 link: The 50% failure rate for the fake WhatsApp link was a surprise, so we decided to find an older, non-zero-day phishing message to use. This message simply displayed the text 'Confirm account' followed by a link. The URL went to a fake Office 365 account site prompting the user to log in.

The results were identical, with the same two systems catching this old phishing URL and the two others allowing the user to click through and enter credential data on the login screen.

These results were typical of what we found. Of the 25 samples we came up with, only five were detected or blocked by every one of the systems.

The results for benign messages containing links to sites, some of which pointed to normal PDF, XLS and ZIP files, were also somewhat surprising. Only one of the email security products handled all 100 correctly, while the others had issues.

One product selected certain benign messages for scan-on-click instead of scanning them when they first entered the system. Thus, the user had access to the message, but only after a wait. Another product took this same approach with every benign attachment. The user was delayed at click time, though eventually he was able to access the file.

Initially, we were concerned that our small-scale research test would be too easy, resulting in 100% correct results and telling us nothing of interest -- then the test results came back.

Email security product scoring

Any comparison needs to end up with a bottom line, and with email security testing, that usually involves generating a score. With antispam products, an evaluation can calculate the spam accuracy rate and offset it with a penalty based on false positives. With antiphishing products, it isn't that easy.
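For reference, that antispam calculation can be as simple as the sketch below; the false-positive weighting is an assumption chosen for illustration, not an industry standard.

```python
def antispam_score(spam_caught: int, spam_total: int,
                   false_positives: int, ham_total: int,
                   fp_weight: float = 5.0) -> float:
    """Combine spam accuracy with a false-positive penalty into one score.

    The 5x false-positive weight is an illustrative assumption; real
    evaluations pick their own trade-off between missed spam and
    blocked legitimate mail.
    """
    accuracy = spam_caught / spam_total    # e.g., 24 of 25 threats caught = 0.96
    fp_rate = false_positives / ham_total  # e.g., 2 of 100 benign blocked = 0.02
    return max(0.0, accuracy - fp_weight * fp_rate) * 100

# Example: antispam_score(24, 25, 2, 100) -> (0.96 - 5 * 0.02) * 100 = 86.0
```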

What constitutes a pass or a fail? The easy pass is when the product blocks a threat outright and it never arrives in the user's inbox.

What about the case, which we saw frequently, where the SUT marked the message as Definite Spam -- or junk -- and delivered it to the inbox, and then, when the user clicked on the URL, he was taken to the malicious site? In my view, that is a fail. A vendor might argue that the message was correctly identified and classified as not good and that the user should never have clicked on it. You will have to decide what constitutes a pass/fail.

Similarly, if an SUT delays access to a benign message at click time -- a delay that might tell the user to come back later -- is that a false positive, or is that acceptable behavior? To me, it is less than ideal, but because there is no category for a partial false positive, how do we score it? Questions abound with no easy answers.
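One pragmatic answer is to score on more than two levels. The sketch below shows the kind of rubric that question points toward; the category names and weights are illustrative choices, not a standard.

```python
from enum import Enum

class BenignOutcome(Enum):
    """How the SUT handled a benign message; weights are illustrative, not a standard."""
    DELIVERED = 1.0  # delivered and immediately accessible
    DELAYED = 0.5    # delivered, but access held at click time (a "partial" false positive)
    BLOCKED = 0.0    # blocked outright, a full false positive

def benign_score(outcomes: list[BenignOutcome]) -> float:
    """Average the per-message weights into a 0-100 benign-handling score."""
    return 100 * sum(o.value for o in outcomes) / len(outcomes)

# Example: 98 clean deliveries, one delayed, one blocked -> 98.5
print(benign_score([BenignOutcome.DELIVERED] * 98
                   + [BenignOutcome.DELAYED]
                   + [BenignOutcome.BLOCKED]))
```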

Conclusion

As stated earlier, we were concerned that our test would be too easy, resulting in every vendor scoring 100%. That was clearly not the case. Our small research project underscored the fact that, as with any area in technology, there are significant differences between email security products that only become evident when tested.

All of the areas touched upon in this article are evolving and deserving of more attention in the new year. Trust but verify -- especially with email security -- is a good resolution for every year.

For more on these test results, stay tuned for the Tolly Group's forthcoming webinar.
