I almost cried this morning.
Last summer we debunked a study by the University of Ballarat’s Internet Commerce Security Laboratory (ICSL). Carefully we spelled out the many obvious mistakes that were made, both in data collection and the research design in general. In addition, we contacted the lead researcher, offering our help.
Several news outlets that published the story were kind enough to acknowledge our critique, but the researchers themselves went silent and didn’t respond directly to the errors we pointed out. Today, the same researchers are again making headlines, and it seems that they haven’t learned a thing.
In a replication of the study they conducted earlier this year, the researchers have studied what’s being downloaded on BitTorrent. Among other things they want to find out which files are popular on BitTorrent at the moment, and how many of these are infringing.
But there’s a problem. Again.
Just as with last year’s study, the researchers have no clue what they are doing. Mistake after mistake has been made, as we will point out below. The worst part is that some media outlets appear to be taking this research seriously, while it is in fact a disgrace for anyone who works in academia.
In large part the methodology is the same as last time, so we won’t report all the painful mistakes that were pointed out before. Instead, we will just sum up some of the new findings, and point out why these are clearly wrong.
1. Most downloaded files
The data collected for the new study was gathered in July 2010, and the researchers used the number of active seeders at the time to determine what files are ‘most downloaded’. One would assume that such a list would be dominated by new titles, but according to the Australian researchers this is not the case.
In their top 10 most downloaded (read ‘seeded’) movies, we find the following titles that have been available for years:
At TorrentFreak we have years of experience at tracking BitTorrent downloads, and we’ve never seen any old titles in our weekly lists. Older titles do show up as popular in tracker scrapes sometimes, but they are always from fake torrent files or manipulated trackers. Common sense should have alerted the researchers that something might have been wrong with their data collection methods or sample.
The report also claims that the aXXo release of the film Wanted had a massive 50,582 seeders two years after it was released. Aside from the fact that we haven’t seen such a high seeder count in weeks, it is absolutely impossible that a download would have these impressive figures two years after it first became available.
The inaccuracy of the most downloaded film list is nicely illustrated by the researchers themselves. Aside from gathering data from BitTorrent trackers, they also looked at the 100 most searched for terms on the BitTorrent search engine isoHunt at the time. Interestingly, none of the older movies listed in their top 10 most downloaded list was present in the list of popular searches.
2. Popular Categories
To determine the popularity of various categories the researchers followed our earlier suggestion and used a random sample of torrents this time, instead of the sample of popular torrents they previously selected. Despite this change the gathered data differs significantly from what most torrent sites report.
Based on a sample of 127,600 torrent files they conclude that nearly 70% of the torrents are video content and less than 2% are software.
If we look at the >10 million torrent files (unique hashes) that are available on a quality torrent site such as BitSnoop, we see a different picture. On BitSnoop 9% of all torrents are categorised as software, while video adds up to ‘just’ 52%. This leads us to believe that the sample the researchers used is heavily biased towards video content, or that their categorization algorithms are flawed.
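To make the gap concrete, here is a rough back-of-the-envelope sketch of how many torrents each category would hold in a sample of 127,600 under the two sets of percentages. The figures are the rounded ones quoted above ("nearly 70%" and "less than 2%" are treated as 70 and 2), and the function and variable names are our own, purely for illustration.

```python
# Rounded category shares quoted in the article, in whole percent.
# "nearly 70%" and "less than 2%" are approximated as 70 and 2 (assumption).
SAMPLE_SIZE = 127_600
STUDY_PCT = {"video": 70, "software": 2}
BITSNOOP_PCT = {"video": 52, "software": 9}


def expected_count(pct, sample_size=SAMPLE_SIZE):
    """Number of torrents a category would hold at the given percentage."""
    return round(sample_size * pct / 100)


if __name__ == "__main__":
    for category in ("video", "software"):
        study = expected_count(STUDY_PCT[category])
        bitsnoop = expected_count(BITSNOOP_PCT[category])
        print(f"{category}: study ~{study}, at BitSnoop's share ~{bitsnoop}")
```

If the sample resembled BitSnoop's index, it would contain roughly 11,484 software torrents rather than the 2,552 or fewer the study implies, a gap far too large to blame on chance in a sample of this size.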
3. Multiplying Trackers
The last point that we want to address is again an illustration of the incompetence of the researchers. What we missed last time is that they simply added up the counts reported by the different BitTorrent trackers they scraped. If “torrent A” is tracked by 5 individual trackers, the researchers add up the seeder counts of all five, while in fact the same downloaders are often connected to several of these trackers at once.
Or put differently, most torrent clients allow people to use multiple trackers, which means that a single user can be listed as a seeder at several trackers at the same time. The researchers didn’t account for this and are therefore overestimating the download counts, which were already suspicious to begin with.
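The double counting is easy to show with a toy example. The sketch below is not the researchers' code; the tracker names and peer IDs are made up, and it assumes we could see which peers each tracker lists, something a plain tracker scrape does not reveal because it only returns counts.

```python
# Hypothetical data: which seeders (by made-up peer ID) each tracker lists
# for the same torrent. Real scrapes only give the counts, not the IDs.
tracker_peers = {
    "tracker_a": {"p1", "p2", "p3"},
    "tracker_b": {"p2", "p3", "p4"},
    "tracker_c": {"p1", "p3"},
}


def summed_seeders(peers_by_tracker):
    """Add up per-tracker seeder counts, as the study appears to have done."""
    return sum(len(peers) for peers in peers_by_tracker.values())


def unique_seeders(peers_by_tracker):
    """Count each seeder once, no matter how many trackers list it."""
    return len(set().union(*peers_by_tracker.values()))


if __name__ == "__main__":
    print("summed:", summed_seeders(tracker_peers))   # 3 + 3 + 2
    print("unique:", unique_seeders(tracker_peers))   # p1, p2, p3, p4
```

Here the summed figure is 8 while only 4 distinct seeders exist. With counts alone, all anyone can say is that the real number lies somewhere between the largest single tracker count and the sum of all counts, so treating the sum as the answer always picks the most inflated value possible.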
Sadly enough we have to conclude that this new study is just as bad as the previous one, and totally unusable to describe the BitTorrent landscape. We’re not exaggerating if we say that the researchers are incompetent, lack common sense, and are too stubborn to take advice when we offered it.
When I contacted researcher Dr. Paul Watters last time he sent the following reply: “I would be happy to send you a complimentary of my O’Reilly ‘Statistics in a Nutshell’ book that might give further insight into statistical methodology.” I chuckled, since I’ve worked as an academic myself for years, publishing in high impact peer-reviewed journals.
Perhaps the State Government of Victoria, IBM, Westpac Banking Corporation, the Australian Federal Police and Village Roadshow should ask for a refund, as they all supported the research financially.