Every so often a new study surfaces that attempts to describe the BitTorrent landscape. Yesterday a study by the Internet Commerce Security Laboratory (ICSL) was publicized (pdf) and the researchers found that only 0.3% of all torrents were confirmed legal. Good enough for a catchy headline, but how accurate is the study really?
Unfortunately, the results of these types of studies are pushed by anti-piracy outfits and taken for granted by outsiders, even by respected news outlets on the Internet such as Ars Technica and ZDNet. In this case their reporters were completely taken in by the report.
Just a few minutes into reading the study we were shaking our heads here at the TorrentFreak headquarters. Mistake after mistake is made in the report and conclusions are drawn based on painfully inaccurate data and methodologies. We’ll lay out the most critical errors below, which represent just the tip of the iceberg.
The study aims to answer four questions. We will state each question and indicate what’s wrong with the answers.
1. How many files are shared using BitTorrent and what are the categories of shared files?
ICSL claims that there are slightly more than a million torrent files to be found online, according to data obtained from 17 BitTorrent trackers this spring. They further come up with an overview of categories where applications account for 2.3% of all torrents, while movies and TV-shows are good for more than 70%.
Both conclusions are horribly wrong.
We’re not sure how the researchers came up with the one million torrents, because the OpenBitTorrent tracker, which is included in their sample, reports 2.5 million torrents on its own. In addition, sites such as isoHunt index over 5 million unique torrents. Needless to say, ICSL’s data collection methods are far from accurate.
An even bigger flaw is found in the categorization process. The categories are not based on the entire set of torrents, but only on the most-seeded ones, which heavily skews the data. Books and applications generally have a lower seed count than movies and TV shows, which means that they are underrepresented in the category overview.
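To make the skew concrete, here is a minimal sketch with made-up numbers. The torrent list, the category names and the helpers `category_shares` and `top_seeded` are all hypothetical and are not taken from the ICSL report; the point is only to show how categorizing the most-seeded torrents can erase categories that exist in the full set.

```python
from collections import Counter

# Hypothetical (category, seeders) pairs. None of these numbers come from the study.
torrents = (
    [("movies", 900)] * 3
    + [("tv", 700)] * 3
    + [("applications", 40)] * 2
    + [("books", 15)] * 2
)


def category_shares(items):
    """Return each category's share of the given torrents."""
    counts = Counter(category for category, _ in items)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def top_seeded(items, n):
    """Return the n torrents with the highest seeder counts."""
    return sorted(items, key=lambda item: item[1], reverse=True)[:n]


full_shares = category_shares(torrents)                 # based on every torrent
top_shares = category_shares(top_seeded(torrents, 5))   # based on the 5 most-seeded only

print("full set:   ", full_shares)
print("top-seeded: ", top_shares)
```

In this toy dataset applications and books make up 40% of all torrents, yet they vanish completely from the top-seeded sample, which then consists of movies and TV shows only.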
2. At a given point in time, how much sharing of files is actually occurring using BitTorrent?
“For the trackers that we scraped, we recorded a minimum of 117,420,061 current seeds. This value is calculated by determining the highest available seeder count for each torrent from any tracker that was scraped,” the researchers answer in their report.
Again, this figure is bogus, but this time it’s wrong on the other end of the scale. As will become clear later in our analysis, the researchers have made a critical mistake by including various trackers that report false seed counts. We had to chuckle when we saw 2-year-old torrents with more than a million seeders in their report. The real seed count at any given time lies between 10 and 20 million.
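The method quoted above takes the highest seeder count reported by any tracker, which means a single tracker reporting false numbers is enough to inflate the total. The sketch below uses invented tracker names and counts to show the effect. The median is included only as a contrast that is less sensitive to one outlier; it is our own illustration, not something the study used.

```python
from statistics import median

# Hypothetical seeder counts for ONE torrent as reported by four trackers.
# The names and numbers are invented for illustration.
reported = {
    "tracker_a": 1200,
    "tracker_b": 1350,
    "tracker_c": 1100,
    "bogus_tracker": 1_000_000,  # a tracker reporting a false seed count
}

highest = max(reported.values())     # the "highest available seeder count" approach
typical = median(reported.values())  # a value that one outlier cannot dominate

print(f"highest reported: {highest}")
print(f"median reported:  {typical}")
```

With these numbers the highest-count approach credits the torrent with a million seeders, while three of the four trackers agree on a figure a little above a thousand.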
3. For each shared file, how many times has it been shared in total?
Here’s where the researchers make total fools out of themselves. In their answer to the question they refer to a table of the top 10 most seeded torrents. As noted before, the most seeded file was uploaded nearly two years ago (The Incredible Hulk) and has a massive 1,112,628 seeders. The torrent in 10th place is not doing badly either with 277,043 seeds. All false data.
Top 10 of Fake Torrents?
We’re not sure where these numbers originate from, but the best-seeded torrent at the moment has only 13,739 seeders, roughly 1% of what the study reports. Also, the fact that the release is nearly two years old should have sounded some alarm bells. It appears that the researchers have pulled data from a bogus tracker, and it wouldn’t be a big surprise if all the torrents in their top 10 are actually fake.
4. Overall, what is the number and percentage of shared files which are infringing, both by number of files and total downloads?
Here the researchers conclude that 97.9% of all files on BitTorrent are copyright infringing, and only 0.3% confirmed ‘legal’. Based on our previous conclusions it is hard to believe that these figures are even remotely accurate, and they aren’t. There are too many flaws in the methodology to list here, but for one this statistic is grossly inaccurate because it’s based on the most popular files, of which many are fake.
The researchers should have at least tried to determine the percentage of infringing files on their whole (inaccurate) dataset instead of the most seeded ones (of which many are fake). We’re not trying to argue that the majority of the torrents are legit, but the selection of torrents and sources is extremely biased towards discovering copyright infringing torrents.
To back this up, we only have to take a look at isoHunt. According to isoHunt, their site indexes 5,451,959 unique torrent files, and 85,457 of these come from Jamendo, a site that publishes only Creative Commons licensed music. So that’s already more than 1.5% of torrents that can be shared legally, and that’s without counting any Linux distros.
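For anyone who wants to check the arithmetic, the share follows directly from the two isoHunt figures quoted above. The variable names are ours; the 0.3% comes from the study’s ‘confirmed legal’ claim.

```python
# Figures quoted from isoHunt in the paragraph above.
total_torrents = 5_451_959
jamendo_torrents = 85_457

# Share of isoHunt's index that comes from Jamendo alone.
jamendo_share = jamendo_torrents / total_torrents

# The study's "confirmed legal" figure, for comparison.
study_confirmed_legal = 0.003

print(f"Jamendo share of isoHunt index: {jamendo_share:.2%}")
print(f"Times larger than the study's figure: {jamendo_share / study_confirmed_legal:.1f}")
```

Jamendo alone works out to about five times the 0.3% that the study reports as confirmed legal.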
Bottom line is that this ‘Academic’ paper is one of the most inaccurate reports we’ve seen thus far, and the mainstream tech media either didn’t spend long reading the report or simply didn’t have the specialist knowledge to read the results and come to their own conclusions. Even worse, the Australian anti-piracy outfit AFACT will probably use this ‘credible’ report in court to convince the court that the local ISP iiNet is responsible for the copyright infringements of its customers.
Let’s hope that Ars and others will update their reports accordingly.
Update: They did..!
We’ve contacted Paul Watters, one of the researchers, for a comment but haven’t heard back from him yet.
Update: Watters replied to me, stating that he stands by his findings. He ignored all questions and offered to send a copy of a statistics manual instead. Since I taught statistics and research methods to PhD students myself, I kindly declined his offer.