alacy wrote:
I like the technique Distributed Proofreaders (www.pgdp.net) use
...
That way the entire set of images doesn't have to be sent all at once, keeping bandwidth down. Of course, Singularity couldn't use pgdp itself because all their work goes to Project Gutenberg. They would have to get software to do it, and I don't know if they have enough money to buy it or pay to have it built for them.
I like it too. But I'm not sure S&Co would need to re-implement the site code: it is open source, and in fact available right here: sourceforge.net/projects/dproofreaders/
Distributed Proofreaders has an elaborate workflow, involving good word/bad word lists, workflow-specific mark-up for capturing things like block quotes, poetry indentation, etc. S&Co probably doesn't need all of that, but I do think they could use several iterations of people reading unformatted text and comparing it to the page images. I'm halfway through "A Plunge into Space", and have flagged a good number of potential OCR errors. I would love to help proofread and reduce these, pre-release.
Some of these errors are of the type DP calls "stealth scannos". These are errors that an OCR engine is likely to make, but that pass spell checkers, such as he/be or he/lie. I've also found some inconsistency in hyphenation (for-wards or forwards?). DP has a great deal of experience finding these kinds of errors (20,000 titles and counting). I think the S&Co people would do well to talk with them and discover what processes they can borrow.
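To make the idea concrete, here is a rough Python sketch of how both kinds of problem could be flagged automatically for a human to check against the page image. This is not DP's actual code. The function names and the suspect-word list are my own assumptions, and only he/be and he/lie come from the examples above. A real list would be much longer.

```python
import re
from collections import defaultdict

# Hypothetical suspect list: words that are valid English but that OCR
# often confuses with one another (only he/be and he/lie are from the post).
SUSPECT_WORDS = {"he", "be", "lie"}

# A word is a run of letters, optionally joined by internal hyphens.
WORD_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def flag_stealth_scannos(text, suspects=SUSPECT_WORDS):
    """Return (line_number, word, line) for every occurrence of a suspect word.

    Nothing is corrected automatically; each hit is only a candidate
    for a proofreader to compare with the page image.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in WORD_RE.finditer(line):
            word = match.group(0)
            if word.lower() in suspects:
                hits.append((lineno, word, line.strip()))
    return hits


def find_hyphenation_variants(text):
    """Group words that differ only by hyphens, e.g. 'for-wards' vs 'forwards'."""
    variants = defaultdict(set)
    for match in WORD_RE.finditer(text):
        word = match.group(0).lower()
        variants[word.replace("-", "")].add(word)
    # Keep only the words that appear in more than one form.
    return {key: sorted(forms) for key, forms in variants.items() if len(forms) > 1}


sample = "Then be walked for-wards.\nHe looked forwards and saw lie was gone."
scanno_hits = flag_stealth_scannos(sample)
hyphen_variants = find_hyphenation_variants(sample)
```

A list like this will flag plenty of perfectly good words too (every legitimate "he", for a start), which is exactly why a person still has to look at each one next to the scan.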