CAPTCHAs work—for digitizing old, damaged texts, manuscripts

From Ars Technica: CAPTCHAs work—for digitizing old, damaged texts, manuscripts.

Over the course of history, humanity has suffered some horrifying damage to our collective cultural legacy in the form of books and other texts lost to accident or neglect. The digitization of text holds out the promise of permanently preserving the written word in an archive that can be distributed widely and kept safe from accidental damage. This presents archivists with a challenge: the works that are most in need of preservation are likely to already be damaged or distorted, making the use of automated scanning and text processing less likely to succeed. Researchers are now reporting on a successful way to identify the words that computers can’t handle: turn them into CAPTCHAs, and get people to do the work.

For those who haven’t heard the term, CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. In practical terms, a CAPTCHA takes the form of a string of characters subjected to distortions that make it difficult for computerized character recognition to identify them. Humans, who have a visual recognition capacity that vastly outperforms even the best computers, generally do pretty well in identifying these distorted characters. That has made the CAPTCHA a useful tool (although the bad guys are catching up) for keeping spam bots from harvesting e-mail addresses or posting spam-filled messages to public forums.
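
As a rough illustration of the challenge-and-response half of that description, here is a minimal Python sketch. It leaves out the distortion step entirely (rendering the string as a warped image), and the names ALPHABET, new_challenge and is_human are hypothetical, not taken from any real CAPTCHA service.

```python
import hmac
import secrets
import string

# Assumption: drop characters that people easily confuse (0/O, 1/I/L).
ALPHABET = "".join(
    c for c in string.ascii_uppercase + string.digits if c not in "0O1IL"
)


def new_challenge(length=6):
    """Return a random string; a real CAPTCHA would render it as a distorted image."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def is_human(expected, typed):
    """Accept the answer if it matches the challenge, ignoring case and outer spaces."""
    return hmac.compare_digest(
        expected.upper().encode(), typed.strip().upper().encode()
    )
```

A site would call new_challenge() when it serves the form, keep the string on the server, and pass the visitor's typed answer to is_human() when the form comes back. The whole scheme rests on the image being hard for software to read, which is the part this sketch does not attempt.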

Researchers at Carnegie Mellon noticed a while back that [continue]