- Content scraping websites steal work to pad sites set up for ads
- The scheme is cost-effective because the copying is done by software robots
- The piracy sites are automated and capture programmatic advertising
- The thefts dilute the value of original work
BitDepth#1295 for April 01, 2021
Anyone who thinks publishing on the Internet is easy is likely to be in for a big surprise, particularly after the profile of their website begins to rise.
At technewstt.com, the news website for which I serve as both editor and webmaster, I get to enjoy all the hassles of big profile publishers with none of the perks.
As a super-niche site, we don’t win the impressive numbers generated by general interest online news publications, but I do get the professional attention of hackers, whose efforts briefly bore fruit a month ago, creating a flood of spam from my contact page.
It’s odd seeing your email client filling with spam from your own email address. Fixing that was a quick two-step process: adding a digital lure (a honeypot field) for automated bots and upgrading the Google reCAPTCHA used on the site.
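For the curious, the mechanics of both fixes are simple enough to sketch. Below is a minimal, hedged illustration in Python: the honeypot field name and the secret key are placeholders of my own invention, but the siteverify endpoint is Google’s documented server-side verification URL.

```python
import json
import urllib.parse
import urllib.request

RECAPTCHA_SECRET = "your-secret-key"  # placeholder; Google issues the real one

def is_probable_bot(form_data):
    """Honeypot check: the 'website_url' field is hidden with CSS, so a
    human leaves it empty while a form-filling bot obligingly completes it.
    The field name is arbitrary -- anything tempting to a bot will do."""
    return bool(form_data.get("website_url", "").strip())

def recaptcha_passed(token, remote_ip=None):
    """Server-side check against Google's documented siteverify endpoint.
    Returns True only if Google confirms the token."""
    params = {"secret": RECAPTCHA_SECRET, "response": token}
    if remote_ip:
        params["remoteip"] = remote_ip
    request = urllib.request.Request(
        "https://www.google.com/recaptcha/api/siteverify",
        data=urllib.parse.urlencode(params).encode(),
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("success", False)

def accept_contact_message(form_data):
    # Drop bot submissions silently, so the bot learns nothing useful
    # from the response it gets back.
    if is_probable_bot(form_data):
        return False
    return recaptcha_passed(form_data.get("g-recaptcha-response", ""))
```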
While I do get regular attention from the hacker community — a few hundred intrusion attempts a month — it’s the increasingly sophisticated web robots that are the real headache.
Bots, as they are commonly called, are bits of code that roam the web automatically; the ones knocking at my door search websites for vulnerabilities.
They are an integral part of the architecture of the internet, first created to assess the size and number of interconnected computers that comprised the earliest versions of the global network.
That code was refined into the web crawlers that constantly search the internet for changes to add to search engine databases, but as with most code created to be useful, it can also be a nuisance.
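At its heart, a crawler is a short loop: fetch a page, note what’s there, queue the links it finds and move on. This stripped-down sketch, using only Python’s standard library, honours robots.txt, which is the courtesy separating legitimate crawlers from the nuisance variety; a real crawler would also diff each page against its last visit and feed the changes to an index.

```python
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl from start_url, visiting at most max_pages."""
    robots = urllib.robotparser.RobotFileParser(urljoin(start_url, "/robots.txt"))
    robots.read()
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen or not robots.can_fetch("*", url):
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable pages are simply skipped
        extractor = LinkExtractor()
        extractor.feed(html)
        queue.extend(urljoin(url, link) for link in extractor.links)
    return seen
```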
The latest irritation comes in the form of content scrapers: code that hunts for new posts matching certain criteria, copies them, and uses them to pad dummy websites set up to attract viewers who will be served programmatic advertising.
Traditional anti-copy tools are no match for this kind of code-level theft.
I usually find out a story has been stolen when the scraped copy triggers an automatic pingback: the internal links to other TechNewsTT stories embedded in the original request approval when they reappear on the pirate site.
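The mechanics are worth a moment. WordPress pingbacks travel over XML-RPC: the copying site calls pingback.ping with a source and target URL, and the receiving site fetches the source page to confirm the link really exists before queuing it for approval. That verification step, sketched below in simplified form, is exactly where a wholesale copy gives itself away.

```python
import urllib.request

def verify_pingback(source_uri, target_uri):
    """The receiving blog's half of the Pingback protocol: fetch the page
    that claims to link to us and confirm our URL actually appears in it.
    When a scraper copies a story wholesale, the internal links come along
    for the ride, and this check is what surfaces the pirate copy."""
    try:
        with urllib.request.urlopen(source_uri, timeout=10) as response:
            page = response.read().decode("utf-8", errors="replace")
    except OSError:
        return False
    return target_uri in page
```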
Is this fair? No.
Is this annoying? Oh yes.
For Grant Taylor, Managing Director of Newsday, the constant leak of pirated stories is a source of frustration.
“The hardest thing is identifying who is behind these websites,” Taylor said.
“Because it’s only a copyright issue, you can’t get the police involved. It isn’t a criminal matter, so the mechanisms to deal with it just aren’t that strong.”
Much of the most galling piracy of Newsday stories thrives on Facebook, where “news” pages offer up copied stories in their entirety.
Taylor is currently concerned about Alert T&T, a Facebook page that’s particularly enthusiastic about strip-mining Newsday stories to fill its feed.
“Some of them might put a link to the site, but who’s going to click on that when the whole story is there?”
“I wouldn’t mind if it was a short excerpt and a link, but…”
While that’s an accepted form of linking back to a story, even that is steadily losing value, as many readers are content to skim a headline and a paragraph and consider the story read.
In this cheerful video, Octoparse describes how their tool makes it easy to scrape content from Reuters.
It’s an untidy situation for web publishers, with hacking attacks an almost daily occurrence and outright theft of IP an ongoing issue.
The TechNewsTT experience with professional content scrapers is disturbingly thorough. Content is taken whole and attributed to an obviously fake author on the destination site in an automated operation.
The same tools that make online content delivery so cost-effective also make widespread theft and automated piracy equally accessible.
Anyone willing to run a website off stolen intellectual property isn’t, as you might expect, going to respond to emails of protest.
There is little that can be done that doesn’t require an exhausting amount of work, and there’s no automation for that.
This isn’t good news for anyone trying to find an alternative, internet-based model for journalism or, for that matter, any other creative endeavour.
In the long run, it makes producing original work unsustainable, as the value of the work is constantly leached away by piracy and well-intentioned sharing.
The only positive thing to report on this trend is that the first five stories that were stolen from TechNewsTT now point to domains that are empty.
If I decide to take legal action on any site, I’ve taken one pre-emptive first step by adding WordProof to the website.
It’s a tool that establishes first authorship by logging an unerasable token on the EOS blockchain at the moment of publication.
As a defensive measure, its value only emerges after legal action is taken. While Google’s search algorithms tend to be smart about identifying first publication, authorship still counts for very little in the climate of wanton sharing that prevails on social media.
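WordProof’s internals are its own affair, but the idea underneath is easy to illustrate: hash the published text and anchor that hash, with a timestamp, somewhere immutable. Anyone holding the original can recompute the hash later and prove the text existed in exactly that form at that moment. A minimal sketch of the concept, not WordProof’s actual API:

```python
import hashlib
import json
import time

def timestamp_record(title, body):
    """Build the record a timestamping service would anchor on a blockchain:
    a SHA-256 digest of the content plus the moment of publication. Change
    a single character of the story and the digest no longer matches."""
    return {
        "title": title,
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "published_utc": int(time.time()),
    }

print(json.dumps(timestamp_record("BitDepth#1295", "Anyone who thinks..."), indent=2))
```

None of this stops the copying, of course; it just makes the paper trail much harder to dispute.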