How to Digitize Eight Million Books
This is an interview from The Book and the Computer with Michael Keller, who is currently overseeing Stanford University's effort to digitize part of its vast library with a custom-built robotic scanner.
Keller obviously "gets it" and understands what's going on: "Books are copyrighted, and we can't violate those copyrights. So what we have to do is persuade the rights holders -- the publishers, the authors, the agents and so forth -- that there's a benefit to them in digitizing their books. The first project we're doing is digitizing the books at the Stanford University Press. Over the last 80 years, we've published about 2,500 titles. We're digitizing them all. We'll put some on the Web for free and put others up for sale through a company like Ebrary, where they can be sold around the world, costing us only a pittance to produce on a per-page basis. We can take books that have been long out of print and put them back on the market, benefiting authors who might not have seen a dime since their first sale long ago. This will be an example to other publishers, to see if we can't persuade them to let us do the same with their titles."
To give an idea of the magnitude of this task, Keller does the math: "At its current rate of 1,000 pages per hour, we can scan about 5,000 books in one eight-hour shift. Let's say we put people to work three shifts a day. Now we're talking about 15,000 books. Previously, we might have been able to do 3,000 books a day. But one machine is not going to be enough to digitize eight million books in my lifetime. It would take about 300 years, as a matter of fact. So something else has to happen."
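For anyone who wants to check the timeline, here is a rough back-of-the-envelope sketch. The 1,000 pages per hour and the eight million books come from the interview; the 300-page average book and the assumption that the machine runs three shifts a day, every day of the year, are mine and not Keller's. Under those assumptions a single shift yields only a few dozen books, and the total lands close to his "about 300 years" figure.

```python
# Rough check of the scanning timeline for a single machine.
PAGES_PER_HOUR = 1_000    # from the interview
TOTAL_BOOKS = 8_000_000   # from the interview
PAGES_PER_BOOK = 300      # assumption: average book length
HOURS_PER_SHIFT = 8
SHIFTS_PER_DAY = 3
DAYS_PER_YEAR = 365       # assumption: no downtime at all


def books_per_shift(pages_per_hour=PAGES_PER_HOUR, pages_per_book=PAGES_PER_BOOK):
    """Books one machine can scan in a single eight-hour shift."""
    return pages_per_hour * HOURS_PER_SHIFT / pages_per_book


def years_to_scan(total_books=TOTAL_BOOKS, shifts_per_day=SHIFTS_PER_DAY):
    """Years one machine needs to work through the whole collection."""
    books_per_year = books_per_shift() * shifts_per_day * DAYS_PER_YEAR
    return total_books / books_per_year


if __name__ == "__main__":
    print(f"{books_per_shift():.1f} books per shift")       # 26.7 books per shift
    print(f"{years_to_scan():.0f} years for one machine")   # 274 years for one machine
```

Change `PAGES_PER_BOOK` or `SHIFTS_PER_DAY` to see how sensitive the estimate is; either way, one machine is nowhere near enough.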
Are you wondering about the computational resources needed for this effort? Keller has those answers, too: "But it's not just the scanning robot that's needed. There are the servers, the software, the network, the storage...With eight million volumes, if we were to digitize everything, we would end up with about a petabyte and a half of data. A petabyte is 10 to the 15th power. Managing the metadata for each individual bibliographic entity and each volume, the coding that allows you to search in a book, or in a collection of books associated by various parameters -- classification, subject heading, author, publisher, place of publication and so forth -- is another petabyte and a half. We're talking about gargantuan-sized memories and massively parallel supercomputers to whiz through this stuff."
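To put the storage numbers in more familiar terms, here is a small sketch that breaks the petabyte and a half down per volume and per page. The totals are Keller's; the 300-page average volume is my own assumption, used only to get a per-page figure.

```python
# Break the quoted storage totals down into per-volume and per-page sizes.
PETABYTE = 10**15                     # bytes, as defined in the interview
CONTENT_BYTES = 3 * PETABYTE // 2     # "a petabyte and a half" of scanned data
METADATA_BYTES = 3 * PETABYTE // 2    # "another petabyte and a half" for metadata and indexing
VOLUMES = 8_000_000                   # from the interview
PAGES_PER_VOLUME = 300                # assumption: average volume length


def bytes_per_volume(content_bytes=CONTENT_BYTES, volumes=VOLUMES):
    """Average scanned data per volume, in bytes."""
    return content_bytes / volumes


def bytes_per_page(pages_per_volume=PAGES_PER_VOLUME):
    """Average scanned data per page, in bytes."""
    return bytes_per_volume() / pages_per_volume


def total_petabytes():
    """Scanned data plus metadata, in petabytes."""
    return (CONTENT_BYTES + METADATA_BYTES) / PETABYTE


if __name__ == "__main__":
    print(f"{bytes_per_volume() / 10**6:.1f} MB per volume")  # 187.5 MB per volume
    print(f"{bytes_per_page() / 10**3:.0f} KB per page")      # 625 KB per page
    print(f"{total_petabytes():.1f} PB in total")             # 3.0 PB in total
```

That works out to a bit under 200 MB per volume, which seems plausible for page images plus searchable text.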
Keller is obviously one of those people who is excited about what the Internet and computing make possible for the future, and who is personally doing something about it. It's a refreshing break to read this story amid all the noise from other quarters.