Every little bit helps. - The Thrawn Rickle

You are probably aware of how many websites now use distorted letters that the user must type in order to gain access. These distorted letters are virtually impossible for a bot to read, and so spammers are thwarted in their efforts to gain entry to sites where they can cause malicious damage, steal email addresses, etc. Although the process is irritating, most of us accept it as part of doing business on the Internet.

Carnegie Mellon (CM) devised this test, known as a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), to help keep out automated programs known as “bots.”

Now CM has devised a way to make the process much less painful, and to benefit a significant project at the same time. CM is digitizing old books and manuscripts supplied by the non-profit Internet Archive. Approximately one word in ten is illegible to CM’s OCR software, and has to be deciphered by a human.

To solve this problem the CM team takes words that the OCR software can’t read, and uses images of them as CAPTCHAs. They distribute these reCAPTCHAs, as they call them, to websites around the world to be used in place of conventional CAPTCHAs. When visitors decipher the reCAPTCHAs to gain access to the website, the results are sent back to CM. Thus, every time an Internet user deciphers a reCAPTCHA, another word from an old book or manuscript is digitized.

Typing in two words correctly results in the digitisation of one word

Website visitors are presented with two words to examine, one of which is already known. Prof. Luis von Ahn at CM explains, “If a person types the correct answer to the one we already know, we have confidence that they will give the correct answer to the other. We send the same unknown words to two different people, and if they both provide the same answer then effectively we can be sure that it is correct. If they don’t agree then we send it to a lot more people to examine.”
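For the curious, the checking scheme von Ahn describes can be sketched in a few lines of Python. This is only an illustration of the idea, not CM's actual code: the function names (`trusted_guess`, `resolve_unknown`) and the case-insensitive comparison are my own assumptions, and the sketch simply reports "no agreement yet" rather than guessing at what happens once a word is sent to more people.

```python
def trusted_guess(known_word, typed_known, typed_unknown):
    """Return the visitor's reading of the unknown word, but only if
    they typed the already-known word correctly; otherwise None.
    (Hypothetical helper; ignoring case and spaces is an assumption.)"""
    if typed_known.strip().lower() == known_word.strip().lower():
        return typed_unknown.strip().lower()
    return None


def resolve_unknown(guesses):
    """Accept a word once the first two trusted visitors agree on it.
    Returns None when there is no agreement yet, meaning the word
    should be shown to more people."""
    trusted = [g for g in guesses if g is not None]
    if len(trusted) >= 2 and trusted[0] == trusted[1]:
        return trusted[0]
    return None


# Two visitors see the known word "morning" plus one unreadable word.
first = trusted_guess("morning", "Morning", "overlooks")
second = trusted_guess("morning", "morning", " Overlooks ")
print(resolve_unknown([first, second]))
```

If the two visitors had typed different readings, or if either had fumbled the known word, `resolve_unknown` would return `None` and the word would go back into circulation.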

Facebook, Twitter and StumbleUpon, among other popular sites, use reCAPTCHAs. Consequently, about one million words every day are being deciphered for CM’s book archiving project. It takes about 10 seconds to decipher a reCAPTCHA and type in the answer, which adds up to nearly three thousand man-hours a day spent deciphering words for CM.
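The man-hours figure is easy to check. The quick calculation below uses only the two round numbers quoted above (one million words a day, ten seconds each); the variable names are mine.

```python
WORDS_PER_DAY = 1_000_000   # "about one million words every day"
SECONDS_PER_WORD = 10       # "about 10 seconds" per reCAPTCHA

# Total typing time per day, converted from seconds to hours.
daily_hours = WORDS_PER_DAY * SECONDS_PER_WORD / 3600
print(round(daily_hours))   # 2778
```

That is roughly 2,800 hours of human effort every day, which is where "nearly three thousand" comes from.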

Even so, “There’s no danger of us running out of words,” says von Ahn. “There’s still about 100 million books to be digitised, which at the current rate will take us about 400 years to complete.”