How Dropbox dropped the ball with anonymized data
Dropbox found itself in hot water this week over an academic study that used anonymized data to analyze the behavior and activity of thousands of customers.
The situation seemed innocent enough at first. In an article in Harvard Business Review, researchers at the Northwestern University Institute on Complex Systems (NICO) detailed an extensive two-year study of best practices for collaboration and communication on the cloud file hosting platform. Specifically, the study examined how thousands of academic scientists used Dropbox, which gave the NICO researchers project-folder data from more than 1,000 university departments.
But it wasn’t long before serious issues were revealed. The article, titled “A Study of Thousands of Dropbox Projects Reveals How Successful Teams Collaborate,” initially claimed that Dropbox gave the research team raw user data, which the researchers then anonymized. After Dropbox was hit with a wave of criticism, the article was revised to say the original version was incorrect: Dropbox anonymized the user data first and then gave it to the researchers.
That’s an extremely big error for the authors to make (if indeed it was an error) about who anonymized the data and when, especially considering the article was co-authored by a Dropbox manager (Rebecca Hinds, head of Enterprise Insights at Dropbox). I have to believe the article went through some kind of review process at Dropbox before it was published.
But let’s assume one of the leading cloud collaboration companies in the world simply screwed up the article rather than the process of handling and sharing customer data. There are still issues and questions for Dropbox, starting with the anonymized data itself. A Dropbox spokesperson told WIRED the company “randomized or hashed the dataset” before sharing the user data with NICO.
Why did Dropbox randomize *or* hash the datasets? Why did the company use two different approaches to anonymizing the user data? And how did it decide which types of data to hash and which types to randomize?
Furthermore, how was the data hashed? Dropbox didn’t say, but that’s an important question. I’d like to believe that a company like Dropbox wouldn’t use an insecure, deprecated hashing algorithm like MD5 or SHA-1, but there’s plenty of evidence those algorithms are still used by many organizations today.
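The algorithm is only part of the question. As a rough illustration, here is a minimal Python sketch of why the way identifiers are hashed matters. It rests on assumptions: Dropbox has not described its process, so the idea that email-style identifiers were hashed, the example addresses, and the function names below are all hypothetical. The sketch shows that an unkeyed hash, even a modern one like SHA-256, can be reversed for any identifier an outsider is able to guess, while a keyed hash (HMAC) with a secret that never leaves the company resists the same lookup.

```python
import hashlib
import hmac
import secrets


def unkeyed_hash(identifier: str) -> str:
    """Plain SHA-256 of the identifier, with no secret key or salt."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def keyed_hash(identifier: str, key: bytes) -> str:
    """HMAC-SHA-256 of the identifier under a secret key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def reidentify(hashed_values, candidates):
    """Dictionary attack: hash every guess and look for matches."""
    lookup = {unkeyed_hash(c): c for c in candidates}
    return {h: lookup[h] for h in hashed_values if h in lookup}


# Hypothetical "anonymized" dataset, shared as unkeyed hashes.
shared = [unkeyed_hash("alice@example.edu"), unkeyed_hash("bob@example.edu")]

# An outsider guesses likely addresses, e.g. from a public staff directory.
guesses = ["alice@example.edu", "carol@example.edu"]
recovered = reidentify(shared, guesses)

# The same data hashed under a secret key that is never shared.
key = secrets.token_bytes(32)
shared_keyed = [keyed_hash("alice@example.edu", key)]
recovered_keyed = reidentify(shared_keyed, guesses)

print(sorted(recovered.values()))
print(recovered_keyed)
```

In this toy run the unkeyed hash gives up every address the outsider guessed correctly, and the keyed hash gives up none. That is why “hashed” on its own says little about how anonymous the data actually is.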
The Dropbox spokesperson also told WIRED it grouped the dataset into “wide ranges” so no identifying information could be derived. But Dropbox’s explanation of the process is short on details. As a number of people in the infosec community have pointed out this week, anonymized data may not always be truly anonymous. And while some techniques work better than others, the task of de-anonymization appears to be getting easier.
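To make the concern concrete, here is a small Python sketch of grouping values into ranges and then checking how many records share each combination of ranges. Everything in it is assumed for illustration: the record fields (team size and file count), the values, and the range widths are invented, and nothing here reflects how Dropbox actually grouped its data.

```python
from collections import Counter


def to_range(value: int, width: int) -> str:
    """Replace an exact value with the range of the given width that contains it."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"


def group_sizes(records, width: int) -> Counter:
    """Count how many records share each combination of ranges."""
    return Counter(tuple(to_range(v, width) for v in rec) for rec in records)


# Hypothetical records: (team size, number of files in the project folder).
records = [(3, 12), (4, 15), (8, 18), (47, 310)]

narrow = group_sizes(records, 5)
wide = group_sizes(records, 10)

# The smallest group is the weakest point: a group of one is a unique record.
k_narrow = min(narrow.values())
k_wide = min(wide.values())

print(k_narrow, k_wide)
```

With ranges of width 5, all four records are unique. Widening the ranges to 10 merges the three small teams into one group, but the unusually large project is still alone in its group, so anyone who knows roughly how big that team is can still pick it out. Wide ranges help only if every group ends up with enough members.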
And these are just the issues relating to the anonymized data; there are also serious questions about Dropbox’s privacy policy. The company claims its privacy policy covers the academic research, which has since sparked a debate about the requirements of informed consent. The policy states Dropbox may share customer data with “certain trusted third parties (for example, providers of customer support and IT services) to help us provide, improve, protect, and promote our services,” and includes a list of those trusted third parties like Amazon, Google and Salesforce. NICO, however, is not on the list. It’s also not entirely clear whether the anonymized data was given to NICO to improve the Dropbox service or to advance scientific research.
And while this isn’t close to the gross abuse of personal data we’ve seen with the Cambridge Analytica scandal, it’s nevertheless concerning. These types of questionable decisions regarding data usage and sharing can lead to accidental breaches, which can be just as devastating as any malicious attack that breaches and exposes user data. If companies in the business of storing and protecting data — like Dropbox — don’t have clear policies and procedures for sharing and anonymizing data, then we’re in for plenty more unforced errors.