OCR, ICR or Rekeying? Digitizing Legacy Archives Right

Every organization with history has the same problem in a different costume: ledgers in a basement, case files in deep storage, journal backruns, microfilm, registration cards, or correspondence that someone, eventually, will need to search. Digitization projects fail less often at scanning than at the decisions around it: which capture method suits which document class, how accuracy will be verified, how the output will be indexed so anyone can find anything, and how originals survive the process.

This guide walks through those decisions in the order a project actually meets them. At Emayyam we run document digitization and data conversion programmes for publishers, government bodies, and enterprises, and the pattern we see repeatedly is that archives are mixtures: clean print next to faded carbon copies next to handwritten marginalia. The right answer is almost never one technique; it is a routing rule that sends each document class to the method that earns its cost.

OCR: Strong on Print, Sensitive to Condition

Optical character recognition converts images of machine-printed text into encoded characters, and on clean, modern print it performs admirably at very low cost per page. Legacy archives, however, attack its assumptions from every direction: faded ink, foxing and stains, bleed-through from thin paper, skewed scans, broken typefaces, tight bindings that curve text lines, and historical typography such as ligatures or older spelling conventions. Each defect chips away at recognition accuracy, and the damage compounds.

The practical implication is that OCR output quality must be measured, not assumed, and measured per document class rather than per project. A useful discipline is to run a representative sample through the engine early, score the results against careful manual transcription, and let that evidence assign each class to a workflow: straight OCR for clean print, OCR with human correction for middling material, and full manual capture where recognition collapses. Image preparation, including deskewing, despeckling, and binarization tuned to the material, often buys more accuracy than switching engines.

ICR and the Hard Problem of Handwriting

Intelligent character recognition extends recognition to handwriting, and its reliable home is constrained hand-print: forms where each character sits in its own box, written in capitals, as on census forms, application forms, and examination sheets. In that setting, ICR plus validation rules, such as checking a field against a list of valid codes or date formats, can be very productive. Free-flowing cursive is a different matter entirely; despite genuine progress in handwriting recognition research, unconstrained historical cursive still produces output that needs so much correction that skilled rekeying is frequently faster and cheaper.

Our rule of thumb from production work: the more constrained the writing and the more predictable the field content, the better ICR pays off. Structured forms with dictionaries and check digits, yes; a nineteenth-century minute book in flowing script, almost always no. As with OCR, a scored pilot on real sample pages settles the question for a few days of effort and prevents a budget-breaking surprise at page two hundred thousand.

Rekeying and Double-Key Verification

Manual rekeying sounds archaic until you need accuracy that recognition cannot deliver on difficult material. The professional standard is double-key verification: two operators key the same content independently, software compares the two versions character by character, and every discrepancy goes to a senior adjudicator who resolves it against the source image. Since two trained operators rarely make the same mistake in the same place, the method catches precisely the errors that single-pass keying leaves behind.

In our projects, double-keying with adjudication routinely achieves accuracy levels well beyond what single-key or corrected OCR reaches on degraded material, and crucially, it achieves them predictably, which is what contractual accuracy commitments require. The craft lies in the supporting apparatus: keying guides that specify how to treat illegible characters, stamps, annotations, and corrections in the source; field-level validation during entry; and honest conventions for marking the genuinely unreadable rather than guessing.

Indexing and Taxonomies: Making the Archive Findable

A perfectly captured archive that nobody can search is an expensive hard drive. Indexing design deserves as much attention as capture: which metadata fields each document class needs, which fields use controlled vocabularies rather than free text, how dates, names, and identifiers are normalized, and how the hierarchy of collection, series, folder, and item is preserved from the physical arrangement. The discipline is to index for the questions users will actually ask, then stop; every additional field multiplies keying cost across the entire corpus.

Controlled vocabularies are the quiet hero here. Free-text indexing produces five spellings of the same department name and a search experience that misses records that plainly exist, which users experience as a failed project no matter how accurate the capture was. Pick lists, authority files for personal and organizational names, and validation at entry time cost little during keying and pay back every single day the archive is searched. Agree the taxonomy with the people who will actually retrieve documents, and freeze it before volume indexing begins.

Define metadata fields per document class, not per project
Use controlled vocabularies and authority files for names and categories
Normalize dates and identifiers to a single declared format
Preserve the physical hierarchy in the digital structure
Index for real user questions; resist speculative fields

Quality Sampling and Acceptance Criteria

Nobody can proofread two million captured fields, so quality is managed statistically. Production is divided into batches, a random sample from each batch is inspected against source images, and the batch is accepted or returned for rework based on a defect threshold agreed in advance, the approach long used in acceptance sampling for manufacturing. The essential groundwork is definitional: client and vendor must agree what counts as an error, at character level or field level, whether a wrong index value weighs the same as a typo in body text, and how marked illegibles are scored.

Two practices keep sampling honest in production. First, inspect early batches at higher intensity, then relax the sampling rate once the process proves stable, tightening again whenever the source material or the team changes. Second, feed every rejected batch into root-cause review, because repeated defects usually trace back to an ambiguous keying rule or a misrouted document class, problems that a process fix removes permanently and that rework alone never will.

Handling Fragile Originals

Legacy capture is physical work on irreplaceable objects, and the scanning method must follow the condition of the material. Automatic document feeders are for healthy modern paper only; bound volumes, brittle pages, photographs, and anything of archival value call for flatbed or overhead capture, book cradles that support spines at a gentle opening angle, and careful page handling by trained operators. Condition assessment belongs at the start of the project, so fragile classes are routed to conservation-aware handling and, where needed, a conservator's attention before imaging rather than after damage.

The practical takeaway for any digitization owner: survey and classify the archive first, pilot each capture method on real samples and score the results, choose OCR, ICR, or double-key rekeying per document class rather than wholesale, design the index around real retrieval questions with controlled vocabularies, and write sampling-based acceptance criteria into the contract. Projects that settle these points before volume production starts finish quietly; projects that defer them become the cautionary tales.