CAPTCHAs Help Read Old Texts

A new program is using those annoying CAPTCHAs to help digitize old texts.

You might not know the term, but you’ve dealt with CAPTCHAs many times. They’re the annoying, fuzzy, distorted words or numbers you need to copy correctly to gain access to a site. About 60 million of them are typed every single day.

Their aim is to stop automated programs from gaining access and posting ads; it takes a human to type the code.

Researchers at Carnegie Mellon University have found a new use for the CAPTCHA, or call it a dual use, if you like. As well as acting as a password to a site, CAPTCHAs can now also help with the digitizing of old books.

It all stems from the problem that when old texts are scanned, computers are unable to decipher about 10% of the words, meaning the human touch is necessary to make sense of them. With literally thousands of pages scanned every month, that becomes a gigantic task.

So researchers decided to farm out the work. The words are sent out to web sites to be used as CAPTCHAs. These are known as reCAPTCHAs, and once they’re deciphered, the result is returned to Carnegie Mellon.

But how does anyone know the answer is correct? Well, as a test, users are given two words to type, and the content of one is already known. If that one is typed correctly, the assumption is that the other is correct too. For extra proof, the unknown word is sent to two different sites to be used as a CAPTCHA. If both answers are the same, then that’s good enough for the researchers.

With the proliferation of CAPTCHAs, about a million words a day are being deciphered.

“There’s no danger of us running out of words,” Luis von Ahn, a professor at CMU, told the BBC. “There’s still about 100 million books to be digitized, which at the current rate will take us about 400 years to complete.”
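For the curious, the two-step check described above (a known control word, then agreement between two independent answers) can be sketched in a few lines of Python. This is only an illustration of the idea as reported here, not the researchers' actual code. The names `UnknownWord`, `submit` and `normalize`, the sample words, and the choice to ignore letter case are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


def normalize(text: str) -> str:
    """Compare answers ignoring case and surrounding spaces (an assumption)."""
    return text.strip().lower()


@dataclass
class UnknownWord:
    """A scanned word the computer could not read (hypothetical structure)."""
    word_id: str
    answers: list = field(default_factory=list)

    def record(self, answer: str) -> None:
        self.answers.append(normalize(answer))

    def accepted(self, required_agreement: int = 2):
        """Return the reading once enough independent answers agree, else None."""
        if not self.answers:
            return None
        text, count = Counter(self.answers).most_common(1)[0]
        return text if count >= required_agreement else None


def submit(control_answer: str, control_typed: str,
           unknown_typed: str, unknown: UnknownWord) -> bool:
    """Handle one user's attempt at a two-word challenge.

    The user passes only if the known control word is typed correctly.
    Only then is their reading of the unknown word recorded.
    """
    if normalize(control_typed) != normalize(control_answer):
        return False  # failed the control word: discard the other answer
    unknown.record(unknown_typed)
    return True


# Two visitors on two different sites see the same unknown word.
word = UnknownWord("scan-001")
submit("harbour", "Harbour", "morning", word)
submit("quill", "quill", "Morning", word)
print(word.accepted())
```

A visitor who gets the control word wrong contributes nothing, so a guess from a bot or a careless typist never reaches the pool of answers for the unknown word.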

Showing 2 comments

  1. Tim Stevens at 12:16pm 4th October 2007 Good point! We could end up with some pretty bad translations at that point.

    Other than that, it's a pretty cool idea!
  2. Finder Keeper at 2:41pm 3rd October 2007 This solves the problem of the known unknowns. What about the "unknown unknowns" - scanned text that the OCR software thinks it deciphered, but got wrong?