Archive for the tag 'OCR'

Alumni Horae Digital Archive Launch

May 29th, 2009

Lisa Laughy – Archives Assistant

Ohrstrom Library is proud to announce a special Anniversary Weekend preview of the newly launched Alumni Horae Digital Archive.

Alumni Horae, the St. Paul’s School alumni magazine, is published four times a year by the Alumni Association in order to engage the alumni community of SPS, to connect alumni to each other, and to enrich the School community. The magazine contains alumni news, features, book reviews, Form notes, and obituaries as well as information about current School life and athletics.

The entire print run of the St. Paul’s School alumni magazine has been scanned and is now accessible online. Every issue of the Alumni Horae from 1921 to the present has been professionally scanned and processed with Optical Character Recognition (OCR) to create a searchable online database. The articles are also available in PDF format, which reproduces every page of the Alumni Horae as it was originally published, including all diagrams, tables, and photographs. The PDF files are available for downloading and printing.

Click HERE to access the Alumni Horae Digital Archive.

Click HERE to access the user’s guide to searching and browsing the archive.

Periodical Picks: The Science of CAPTCHAs

September 30th, 2008

Lisa Laughy – Archives Assistant

The September 12th issue of Science, recently out on the shelf in Ohrstrom Library’s periodical room, features a cover article about the combination of new tech and old books. Five researchers have tested how well the CAPTCHA web security measure can pick up the slack in OCR book digitization. If you regularly browse the web, you have encountered a CAPTCHA, which asks you to decipher a difficult-to-read section of text and type the letters into a box. Now researchers are finding a way to repurpose your small efforts for something rather useful. Science describes the project:

“Millions of books written before the computer era are being digitized for preservation. Because the ink has faded, optical character recognition software cannot decipher many words. Through a repurposing of an existing online security technology called CAPTCHA, these words are being manually transcribed by millions of Web users.”

Here is the abstract from the published paper:

“CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are widespread security measures on the World Wide Web that prevent automated programs from abusing online services. They do so by asking humans to perform a task that computers cannot yet perform, such as deciphering distorted characters. Our research explored whether such human effort can be channeled into a useful purpose: helping to digitize old printed material by asking users to decipher scanned words from books that computerized optical character recognition failed to recognize. We showed that this method can transcribe text with a word accuracy exceeding 99%, matching the guarantee of professional human transcribers. Our apparatus is deployed in more than 40,000 Web sites and has transcribed over 440 million words.”

The article estimates that over 100 million CAPTCHAs are typed each day, amounting to hundreds of thousands of human hours. Tapping into that resource to accomplish such a useful task as the digital preservation of old books is a fascinating prospect. Come into Ohrstrom Library’s periodical room and read the full text of the article in the September 12th issue of Science, starting on page 1465.