How YouTube, Wikipedia use machine moderation for crowdsourced content
Blog: Tech Tips, Tricks & Trivia
Date: 8/14/2012 3:03:00 AM
Search engines rely on bots to index web pages, but did you know that Wikipedia uses more than 700 active bots to keep its content clean? Wikipedia, which is 50 times larger than the Encyclopaedia Britannica, currently has 4,005,000 articles in the English edition and 22.8 million articles across 285 language editions. The bots delete vandalism and foul language, organise and catalogue entries, and handle the reams of behind-the-scenes work that keep the encyclopaedia running smoothly and efficiently.
BBC News Magazine has a neat writeup on what these bots do:
- "Interwiki" bots link articles on the same subject in different languages
- Flag potential copyright violations and other irregularities for human review
- Add dates to "cleanup" tags so human editors know what needs attention
- Add articles to category lists, and lists of categories to articles
- Format and repair citations and references
- Compare ISBN numbers
- Flag images that need more licensing details
- Behind the scenes:
  - Maintain Wikipedia archives
  - Handle evidence in arbitration and administrative matters
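To make one of these chores concrete, here is a minimal sketch of how a bot might add dates to undated "cleanup" tags. It assumes the common wikitext form `{{cleanup}}` becoming `{{cleanup|date=August 2012}}`. The tag list and the function name `date_cleanup_tags` are hypothetical and chosen for illustration, not taken from any real bot.

```python
import re
from datetime import date

# Hypothetical, simplified list of maintenance tags; real bots track many more.
MAINTENANCE_TAGS = ("cleanup", "unreferenced")

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Matches only tags with no parameters, e.g. {{cleanup}} but not {{cleanup|date=...}}.
_UNDATED_TAG = re.compile(
    r"\{\{\s*(" + "|".join(re.escape(t) for t in MAINTENANCE_TAGS) + r")\s*\}\}",
    re.IGNORECASE,
)


def date_cleanup_tags(wikitext: str, today: date) -> str:
    """Return wikitext with a month-and-year date added to undated tags."""
    stamp = f"{MONTHS[today.month - 1]} {today.year}"
    return _UNDATED_TAG.sub(
        lambda m: "{{" + m.group(1) + "|date=" + stamp + "}}", wikitext
    )
```

Tags that already carry a date are left alone, so running the function twice on the same page changes nothing the second time.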
Bots have been around almost as long as Wikipedia itself.
The site was founded in 2001, and the next year, one called rambot created about 30,000 articles - at a rate of thousands per day - on individual towns in the US.
The bot pulled data directly out of US Census tables. The articles read as if they had been written by a robot. They were short and formulaic and contained little more than strings of demographic statistics.
But once they had been created, human editors took over and filled out the entries with historical details, local governance information, and tourist attractions.
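The formulaic style described above is what you get when every article is filled in from one template. The following is only a rough sketch of that idea, not rambot's actual code. The `TownRecord` fields and the sentence wording are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TownRecord:
    # Hypothetical fields; real census tables hold far more columns.
    name: str
    state: str
    population: int
    households: int


def stub_article(town: TownRecord) -> str:
    """Render a short, formulaic stub from one row of demographic data."""
    return (
        f"{town.name} is a town in {town.state}. "
        f"As of the census, there were {town.population:,} people "
        f"and {town.households:,} households residing in the town."
    )
```

Looping a function like this over every row of a table is how a single bot can produce thousands of near-identical entries in a day.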
In 2008, another bot created thousands of tiny articles about asteroids, pulling a few items of data for each one from an online Nasa database.
One anti-vandalism bot, known as ClueBot NG, resides on a computer from which it sallies forth into the vast encyclopaedia to detect and clean up vandalism almost as soon as it occurs.
YouTube relies on its automated copyright detection system, Content ID, to check whether an uploaded video contains material that belongs to a rights owner. It compares each upload against all the reference files in its database.
In YouTube's own words: "The scale and speed of this system is truly breathtaking -- we're not just talking about a few videos, we're talking about over 100 years of video every day between new uploads and the legacy scans we regularly do across all of the content on the site. And when we compare those 100 years of video, we're comparing it against millions of reference files in our database. It'd be like 36,000 people staring at 36,000 monitors each and every day without as much as a coffee break."
The official documentation explains how the system works:
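The "36,000 people" figure follows from simple arithmetic, sketched below. The only input is the 100 years per day from the quote. The assumption that each viewer watches around the clock, and the function name `viewers_needed`, are mine.

```python
DAYS_PER_YEAR = 365
HOURS_PER_DAY = 24


def viewers_needed(years_of_video_per_day: int) -> int:
    """People required to watch the daily volume, each watching 24 hours a day."""
    video_hours = years_of_video_per_day * DAYS_PER_YEAR * HOURS_PER_DAY
    # Assumption: one viewer covers 24 hours of footage per day, with no breaks.
    return video_hours // HOURS_PER_DAY


daily_viewers = viewers_needed(100)  # 100 years of video scanned per day
```

With these assumptions the result is 36,500 viewers, which the quote rounds down to 36,000.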
If Content ID identifies a match between a user upload and material in the reference library, it applies the usage policy designated by the content owner. The usage policy tells the system what to do with the video. Matches can be to only the audio portion of an upload, the video portion only, or both.
There are three usage policies -- Block, Track or Monetize. If a rights owner specifies a Block policy, the video will not be viewable on YouTube. If the rights owner specifies a Track policy, the video will continue to be made available on YouTube and the rights owner will receive information about the video, such as how many views it receives. For a Monetize policy, the video will continue to be available on YouTube and ads will appear in conjunction with the video. The policies can be region-specific, so a content owner can allow a particular piece of material in one country and block the material in another.
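The policy rules above can be modelled in a few lines. This is a hedged sketch of the behaviour the documentation describes, not YouTube's implementation. Every name in it (`Policy`, `UsagePolicy`, `Outcome`, `apply_policy`) is hypothetical, and so is the use of two-letter region codes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Policy(Enum):
    BLOCK = "block"
    TRACK = "track"
    MONETIZE = "monetize"


@dataclass
class UsagePolicy:
    """A rights owner's choice: a default policy plus per-region overrides."""
    default: Policy
    by_region: Dict[str, Policy] = field(default_factory=dict)

    def for_region(self, region: str) -> Policy:
        return self.by_region.get(region, self.default)


@dataclass
class Outcome:
    viewable: bool
    show_ads: bool


def apply_policy(policy: UsagePolicy, viewer_region: str) -> Outcome:
    """Decide what happens to a matched upload for a viewer in one region."""
    chosen = policy.for_region(viewer_region)
    if chosen is Policy.BLOCK:
        return Outcome(viewable=False, show_ads=False)
    if chosen is Policy.MONETIZE:
        return Outcome(viewable=True, show_ads=True)
    # TRACK: the video stays up and the owner receives viewing statistics.
    return Outcome(viewable=True, show_ads=False)
```

For example, `UsagePolicy(Policy.MONETIZE, {"DE": Policy.BLOCK})` would keep a matched video available with ads everywhere except the region coded `"DE"`, where it would be blocked.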