For many people, Google is the go-to starting point when they need to find something on the web. With just a few keystrokes, the search engine can find virtually anything.
This makes life easier, but for copyright holders there’s a drawback too: the web is littered with pirate sites.
Over the past decade, Google has removed more than five billion ‘infringing’ URLs from its search results. A few days ago the company hit a new milestone after receiving DMCA notices for more than four million unique domain names.
To mark this event we decided to take a closer look at the submitted URLs to see who the worst offenders are. This leads to some interesting conclusions and puts the four million number in perspective.
The domain most flagged for infringement is 4shared.com. This popular file-sharing service has had more than 68 million of its URLs removed from Google. The majority of these were removed several years ago. More recently, 4shared started to actively work with rightsholders to prevent piracy by deploying filtering technologies.
The runner-up on the list, with 51 million removed URLs, is the relatively unknown mp3toys.xyz. This domain has been inactive for roughly half a decade but previously hosted pirated MP3s. Rapidgator.net completes the top three, with more than 42 million of its URLs removed from Google’s search results.
What stands out is that the majority of the reported URLs are linked to a tiny fraction of the four million domain names. Just 400 domains (0.01%) are responsible for 41% of all links removed by Google over the years.
Hundreds of Pirate Bays
The Pirate Bay is ranked 66th based on the number of URLs Google had to remove. That ranking covers only the site’s main .org domain; hundreds of Pirate Bay proxy domains are also frequently targeted.
There are currently close to 900 domain names that include the phrase “piratebay” in Google’s list of copyright infringing domain names. On top of that, there are more than 5,000 that use other variations of the word “bay,” many of which are inspired by the notorious pirate site.
Legitimate Domains in the Long Tail
It’s clear that a relatively small number of domains generate the bulk of all takedown requests. This means that there’s a long tail of domains that are flagged just a few times.
This long tail includes many marginal pirate sites, but there are also thousands of domain names that are not typical copyright infringers. Some of these are simply flagged in error.
There are numerous examples of legitimate sites that were reported. These include the FBI (22x), the RIAA (2x), the Vatican (3x), and the White House (17x), but there are thousands more we could add to the list. The good news is that Google is usually quite good at spotting these errors.
There are also legitimate sites that are flagged relatively often. The movie and TV database IMDb, for example, was reported 5,564 times. Another popular target is Wikipedia, which was mentioned 3,492 times in takedown notices. Interestingly, Google.com was also targeted over 700,000 times.
All in all, it’s safe to conclude that the four million domain names are not all blatant infringers. Instead, the bulk of all pirated content is centered around just a few hundred sites.