
XML-First Publishing: Economics and Migration

By the Emayyam Infotech team

Most publishers still run some version of a print-first workflow: content is edited in Word, composed into pages, and only then converted into XML, EPUB or HTML for digital channels. The alternative, XML-first, captures content in structured form early and derives every output, including print, from that single source. The difference sounds procedural, but it changes the economics of publishing in ways that compound with every title and every new delivery channel.

We have run conversion and composition services for publishers on both models, and the pattern is consistent: print-first looks cheaper per title until the digital outputs, corrections and reuse requests start arriving, at which point its costs multiply quietly across departments. This post lays out how the two models differ, what JATS and BITS bring to the table, and how to migrate without disrupting a live publishing programme.

The hidden costs of print-first

In a print-first workflow, the page is the product and the XML is an afterthought, produced by back-conversion from final page files. Back-conversion is inherently lossy and interpretive: the converter must infer structure from visual formatting, deciding whether bold text is a heading, a defined term or mere emphasis. Every inference is a chance for error, which is why back-converted XML needs heavy QA and still surprises downstream users.
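To make the inference problem concrete, here is a minimal sketch of the kind of rule a back-converter has to apply. The `Run` record, the `guess_role` function and the thresholds are hypothetical and invented for illustration; real converters use far more signals, but they face the same ambiguity.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A run of text as a back-converter might see it on a composed page."""
    text: str
    bold: bool
    own_line: bool     # True if the run fills a whole line by itself
    font_size: float   # in points


def guess_role(run: Run, body_size: float = 10.0) -> str:
    """Guess the structural role of a run from its appearance alone.

    A trailing "?" marks a guess that formatting cannot confirm.
    The rules and the 3-word threshold are illustrative assumptions.
    """
    if not run.bold:
        return "text"
    if run.own_line and run.font_size > body_size:
        return "heading"
    if run.own_line:
        return "heading?"        # could equally be a bold one-line paragraph
    if len(run.text.split()) <= 3:
        return "defined-term?"   # could equally be short emphasis
    return "emphasis?"


samples = [
    Run("Methods", bold=True, own_line=True, font_size=12.0),
    Run("Note", bold=True, own_line=True, font_size=10.0),
    Run("back-conversion", bold=True, own_line=False, font_size=10.0),
    Run("never skip the validation step", bold=True, own_line=False, font_size=10.0),
]
for run in samples:
    print(f"{run.text!r}: {guess_role(run)}")
```

Three of the four samples come back with a question mark. Each of those is a decision that someone has to check by hand, which is where the QA cost of back-conversion comes from.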

The deeper cost is corrections. When an error is found after conversion, it must either be fixed separately in the page files and in the XML, or be fixed in the pages and the conversion rerun. In practice the versions drift apart over time. Multiply that across reprints, digital updates and licensing deliveries, and a publisher ends up paying repeatedly for work that a single structured source would have absorbed once.

What XML-first actually means

XML-first does not require authors to write XML. It means the manuscript is converted to structured XML at or shortly after copyediting, and that XML becomes the version of record. Editors work against the structure, composition systems pour it into page templates, and digital products are generated from the same files. Corrections happen once, in one place, and flow automatically to every output.

The discipline this demands is real: a DTD or schema must be chosen and enforced, editorial and production staff need new habits, and composition must be template-driven rather than hand-crafted page by page. The reward is that the expensive human judgment, the structural and editorial decision-making, happens once at the point of capture instead of being re-performed at every conversion for every format.

JATS for journals, BITS for books

For scholarly publishers the schema question is largely settled. JATS, the Journal Article Tag Suite standardized through NISO as ANSI/NISO Z39.96, is the lingua franca for journal articles: aggregators, indexing services and hosting platforms all consume it, so producing clean JATS is effectively a condition of participating in the scholarly supply chain. Its tag set covers article metadata, full text, references and the editorial apparatus journals depend on.

BITS, the Book Interchange Tag Suite, extends the JATS approach to books and book parts, handling front matter, chapters, indexes and the collection structures that monographs and reference works require. Publishers outside scholarly publishing often adapt DocBook or build constrained custom schemas, but staying close to an established standard pays off every time content moves between vendors, platforms or successor systems.

Single source, many channels

The strategic payoff of XML-first is multi-channel output from one source. The same files drive print PDF through automated composition, EPUB for retail, HTML for platform hosting, and feeds and excerpts for marketing, with accessibility features such as structural semantics carried through rather than re-created for each format. When a new channel appears, adding it becomes a transformation project, not a re-keying project.

Single-sourcing also changes what a backlist is worth. Structured content can be licensed, recombined into custom editions, mined for new products and updated incrementally, none of which is practical when the canonical version of a book is a folder of page files. In our experience the publishers extracting the most value from their archives are, without exception, the ones who control their content as structured data.

  • Print PDF via automated composition
  • EPUB 3 for retail and libraries
  • HTML for hosting platforms
  • Chunked content for licensing and reuse
  • Accessibility semantics preserved across outputs
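As a rough illustration of single-sourcing, the sketch below derives two outputs from one small fragment using only the Python standard library. The element names (`sec`, `title`, `p`, `italic`) follow JATS, but the fragment, the `to_html` and `to_text` functions and the output conventions are our own assumptions for the example. A production pipeline would more likely use XSLT and a composition engine.

```python
import xml.etree.ElementTree as ET
from html import escape

# A tiny JATS-style section; the content and id are made up for the example.
SOURCE = """<sec id="s1">
  <title>Methods</title>
  <p>We sampled <italic>twelve</italic> sites.</p>
  <p>Each site was visited twice.</p>
</sec>"""


def inline_html(elem):
    """Render mixed content as HTML, mapping <italic> to <em>."""
    out = escape(elem.text or "")
    for child in elem:
        inner = inline_html(child)
        out += f"<em>{inner}</em>" if child.tag == "italic" else inner
        out += escape(child.tail or "")
    return out


def to_html(sec):
    """Output 1: an HTML section that keeps the heading and emphasis semantics."""
    lines = [f'<section id="{escape(sec.get("id", ""))}">']
    lines.append(f"<h2>{inline_html(sec.find('title'))}</h2>")
    for p in sec.findall("p"):
        lines.append(f"<p>{inline_html(p)}</p>")
    lines.append("</section>")
    return "\n".join(lines)


def to_text(sec):
    """Output 2: a plain-text excerpt, for example for a marketing feed."""
    title = "".join(sec.find("title").itertext())
    paras = ["".join(p.itertext()) for p in sec.findall("p")]
    return "\n\n".join([title.upper()] + paras)


sec = ET.fromstring(SOURCE)
print(to_html(sec))
print(to_text(sec))

# A correction is made once, in the source, and reaches both outputs.
corrected = ET.fromstring(SOURCE.replace("twelve", "thirteen"))
print(to_html(corrected))
print(to_text(corrected))
```

Adding a third channel here means writing one more function over the same tree, while the source itself is left untouched.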

The economics, honestly stated

XML-first is not free. Upfront costs include schema selection, template development, tool changes and training, and the first titles through a new pipeline are always slower. Per-title composition can also cost slightly more than the cheapest print-only alternative. These costs are visible and easy to object to, which is why migrations stall when leadership sponsorship is weak or the business case is framed only around print.

The savings sit elsewhere: digital editions become a by-product rather than a project, corrections are made once, vendor switching costs fall because the content is portable, and rights and licensing deliveries stop requiring bespoke conversion. For a publisher producing more than one output per title, the crossover usually arrives quickly; for a print-only list with no digital ambitions, print-first remains defensible, though that position is increasingly rare.
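The crossover can be estimated with a few lines of arithmetic. The sketch below is a simplified model rather than a costing method: every figure is a placeholder, the function names are hypothetical, and it assumes that in a print-first workflow each correction is applied separately to the page files and to every converted output.

```python
import math


def print_first_cost(composition, digital_outputs, back_conversion, fixes, cost_per_fix):
    """Per-title cost when each digital output is back-converted from pages."""
    versions = 1 + digital_outputs  # page files plus each converted output
    return (composition
            + digital_outputs * back_conversion
            + fixes * cost_per_fix * versions)


def xml_first_cost(composition, digital_outputs, transform_run, fixes, cost_per_fix):
    """Per-title cost when outputs are generated from one XML source."""
    return composition + digital_outputs * transform_run + fixes * cost_per_fix


def breakeven_titles(setup_cost, legacy_per_title, xml_per_title):
    """Titles needed to recover the setup cost, or None if there is no saving."""
    saving = legacy_per_title - xml_per_title
    if saving <= 0:
        return None
    return math.ceil(setup_cost / saving)


# Placeholder figures in any currency; substitute your own.
legacy = print_first_cost(1000, digital_outputs=2, back_conversion=400, fixes=5, cost_per_fix=30)
xml = xml_first_cost(1200, digital_outputs=2, transform_run=25, fixes=5, cost_per_fix=30)
print(legacy, xml, breakeven_titles(40_000, legacy, xml))

# The same list with no digital outputs: print-first stays cheaper per title.
legacy_print_only = print_first_cost(1000, 0, 400, 5, 30)
xml_print_only = xml_first_cost(1200, 0, 25, 5, 30)
print(breakeven_titles(40_000, legacy_print_only, xml_print_only))
```

With these made-up inputs the two-output list recovers its setup cost after 48 titles, and the print-only list never does. The numbers are placeholders, but that is the shape of the argument above.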

Migration tips for working publishers

Migrate by pipeline, not by big bang. Choose one list or content type with high digital demand, define the schema and templates for it, and run new titles through the XML-first pipeline while the legacy workflow continues elsewhere. Convert backlist opportunistically, when a title earns it through a new edition, a licensing deal or an accessibility obligation, rather than paying to convert everything speculatively.

Invest early in two unglamorous things: a written tagging specification with examples, so vendors and staff make consistent decisions, and automated validation at every handoff, so structural defects are caught at the gate rather than discovered downstream. Most failed migrations we are asked to rescue went wrong in exactly these two places, with ambiguity in the specification quietly accumulating into expensive rework.

  • Start with one pipeline, not the whole list
  • Pick standard schemas: JATS, BITS
  • Write a tagging spec with examples
  • Validate at every vendor handoff
  • Convert backlist on demand, not speculatively
  • Keep print templates automated and boring
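A validation gate can start small. The sketch below uses only the Python standard library to check that a delivery is well-formed and meets a few house rules. The `check_handoff` function and the rules in `REQUIRED_PATHS` are hypothetical examples of what a tagging specification might require, and the sample documents are trimmed fragments, not schema-valid JATS. This does not replace validating against the JATS or BITS DTD or schema with a validating parser. It only shows the idea of a gate.

```python
import xml.etree.ElementTree as ET

# Hypothetical house rules; a real tagging specification would define these.
REQUIRED_PATHS = [
    "front/article-meta/article-id",
    "front/article-meta/title-group/article-title",
    "body",
]


def check_handoff(xml_text):
    """Return a list of problems found in a delivery; an empty list means it passes."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    problems = []
    if root.tag != "article":
        problems.append(f"unexpected root element <{root.tag}>")
    for path in REQUIRED_PATHS:
        if root.find(path) is None:
            problems.append(f"missing required element: {path}")
    # Example house rule: every section must carry a title.
    for sec in root.iter("sec"):
        if sec.find("title") is None:
            problems.append(f"<sec id={sec.get('id')!r}> has no <title>")
    return problems


GOOD = """<article>
  <front><article-meta>
    <article-id pub-id-type="doi">10.0000/example</article-id>
    <title-group><article-title>Sample</article-title></title-group>
  </article-meta></front>
  <body><sec id="s1"><title>Intro</title><p>Text.</p></sec></body>
</article>"""

BAD = """<article>
  <front><article-meta>
    <title-group><article-title>Sample</article-title></title-group>
  </article-meta></front>
  <body><sec id="s1"><p>Text.</p></sec></body>
</article>"""

for name, doc in [("good", GOOD), ("bad", BAD), ("broken", "<article><body>")]:
    print(name, check_handoff(doc) or "passes")
```

Run at each handoff, a check like this turns the tagging specification into something a vendor delivery can fail, so defects are caught before they reach composition.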

The bottom line

The choice between print-first and XML-first is really a choice about where structure gets created and how many times you pay for it. Print-first defers the cost and then charges interest at every conversion, correction and channel launch. XML-first pays once, up front, and amortizes the investment across every output and every year the content stays in service or earns licensing revenue.

If you are weighing the move, start with arithmetic, not ideology: count the outputs per title you produce today, the ones you expect in three years, and what conversion and corrections cost you annually. Then pilot one pipeline end to end. The pilot will give you real numbers, a trained team and a working template set, which is a far stronger basis for the wider rollout than any slide deck.
