Improved Formatting for Bilingual Texts, and an Algorithm for its Creation

Bilingual books (ideally with audio) are very useful for language study.

Typically a text and its translation are presented in side-by-side columns. When comparing a text against its translation, it's not always easy to find the corresponding sentence in the other language.

Here’s an example:

(You can read the second page of the text here if you are interested.)

Usually a paragraph and its corresponding translation are arranged such that they start at the same vertical position on the page, but the two texts 'go out of sync' the further into a paragraph you go. If the reader is reading in one column/language, and referring frequently to the second column/language to support comprehension of what they are reading in the first, significant mental resources are required to continually locate the corresponding text in the second column, especially when paragraphs are longer.

I tried to see if I could come up with an improved formatting.

Without detailing all the different ideas and variations I tried, here's what I ended up with:

(A longer example here.)

Some notes, in no particular order:

Algorithm

I hope the following is clear, although I'm aware it's a bit difficult to explain well in plain text, and I'm not sure I have succeeded.

The task of the algorithm is to break up a paragraph and its corresponding translation into 'blocks.' That's to say, it decides how to combine pairs of sentences (one in each language/column) into 'blocks.' We assume we are dealing with a faithful translation where the number and order of sentences in the translation match those of the original text. With each sentence pair we place, we can either include it in the current 'block' (unless it's the first sentence pair in a paragraph), or start a new block. Two options.

We assign a 'badness' score to each block we create. This score consists of two components. One component is the 'looseness' score. This gets higher the longer the block is. If the blocks are too long, it makes it harder to find the corresponding text in the other language/column. This is bad. The other component is the 'ugliness' score, which is a function of how much we must stretch or squash the text in order to make it fit fully-justified into a fixed number of lines (a block). This makes typographers upset and is indeed unattractive and reduces legibility. We affix fudge factors to the two components and sum them to get the overall 'badness' score for a block. Finding the optimal layout for a paragraph (the optimal tradeoff between 'ugliness' and 'looseness') is the question of finding the arrangement of blocks with the minimum sum of block 'badness' scores squared (or cubed; pick a function). Squaring/cubing is a good idea, as it's preferable that all blocks have a reasonably low 'badness' score, rather than one block being awful and the others being good, a situation that otherwise wouldn't be penalised if we just summed the block scores.
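To make the scoring concrete, here is a minimal Python sketch. It is not the code behind the books: the function names, the two weights, and the choice of 'line count' for looseness and 'relative stretch' for ugliness are all assumptions made for illustration.

```python
def block_badness(n_lines, stretch, w_loose=1.0, w_ugly=1.0):
    # Looseness: grows with block height (assumed here to be just the line count).
    looseness = n_lines
    # Ugliness: how far the text is stretched or squashed
    # (assumed: a relative amount, so 0.05 means 5%).
    ugliness = abs(stretch)
    # The two weights play the role of the "fudge factors".
    return w_loose * looseness + w_ugly * ugliness


def paragraph_cost(blocks, power=2):
    # blocks: list of (n_lines, stretch) tuples, one per block.
    return sum(block_badness(n, s) ** power for n, s in blocks)


even = [(3, 0.0), (3, 0.0)]      # two middling blocks
lopsided = [(1, 0.0), (5, 0.0)]  # one tiny block and one long one

# With a plain sum the two layouts tie; squaring prefers the even one.
print(paragraph_cost(even, power=1), paragraph_cost(lopsided, power=1))
print(paragraph_cost(even), paragraph_cost(lopsided))
```

With these made-up numbers the plain sums are equal, while the squared sums are 18 for the even layout and 26 for the lopsided one, which is the effect the squaring is meant to produce.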

Let's say we have a long paragraph, 20 sentences long. We have 2^19 different ways to arrange the sentence pairs into blocks; that's over half a million potential solutions to evaluate. Trying each arrangement to get the overall 'badness' score for the paragraph can be very time consuming, and perhaps infeasible for very long paragraphs. However, we can use a dynamic programming/shortest path algorithm to cull the number of combinations. The approach I took is the same as that used in Knuth's Line Breaking Algorithm, only we are fitting sentences into 'blocks', and blocks into paragraphs, as opposed to words onto lines, and lines into paragraphs. His approach is described in his book 'Digital Typography', and in the paper 'Breaking paragraphs into lines' (Donald E. Knuth, Michael F. Plass - November 1981). Some Python code examples here. It's very clever: instead of an exponential number of arrangements (2^(n-1) for n sentences), we only need to score on the order of n^2 candidate blocks, and discarding hopeless candidates early brings the work close to linear in practice.
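The dynamic-programming idea can be sketched in a few lines of Python. This is the plain O(n^2) version, without the pruning of hopeless candidates that Knuth and Plass describe, and it is not the actual layout code. The toy cost function is an assumption: it takes the block height as the taller column's line count, and uses the difference between the two columns as a stand-in for stretching/squashing.

```python
def best_blocks(n, block_cost):
    """Split sentence pairs 0..n-1 into consecutive blocks at minimum total cost.

    block_cost(i, j) returns the (already squared) badness of a block
    holding sentence pairs i..j-1.
    """
    best = [0.0] + [float("inf")] * n  # best[j]: cheapest layout of the first j pairs
    prev = [0] * (n + 1)               # prev[j]: where the last block of that layout starts
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + block_cost(i, j)
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # Walk the back-pointers to recover the blocks.
    blocks, j = [], n
    while j > 0:
        blocks.append((prev[j], j))
        j = prev[j]
    return best[n], blocks[::-1]


# Hypothetical line counts per sentence in each column.
left = [2, 1, 3, 1]
right = [1, 2, 1, 3]


def toy_cost(i, j, w_loose=1, w_ugly=3):
    l, r = sum(left[i:j]), sum(right[i:j])
    height = max(l, r)   # looseness proxy: block height in lines
    waste = abs(l - r)   # ugliness proxy: mismatch between the columns
    return (w_loose * height + w_ugly * waste) ** 2


cost, blocks = best_blocks(len(left), toy_cost)
print(cost, blocks)
```

For this toy input the cheapest layout pairs sentences 0-1 and 2-3 into two blocks, because in each pair the two columns come out the same height. Each `best[j]` is computed once and reused, which is what avoids re-scoring all 2^(n-1) arrangements.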

We can run this algorithm over the entire book, and sum the total 'badness' for all the paragraphs. We can then vary the column widths (the proportion of the page each column occupies) in small increments, to find the column widths that result in the lowest overall 'badness' score for the book. In the example page given above, the German column is a good deal wider than the English. Overall I'm quite happy with the output of the program; it gives the student a kind of 'superpower' over the language being studied. I have some ideas for minor improvements though.
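The column-width search mentioned above amounts to a simple sweep. A rough Python sketch, assuming a hypothetical `book_badness(fraction)` function that lays out the whole book with the left column taking that fraction of the page and returns the summed badness; the range, the step size, and the stand-in cost curve are invented for the example:

```python
def best_column_split(book_badness, lo=0.35, hi=0.65, step=0.01):
    # Try each left-column fraction from lo to hi in small increments
    # and keep the one with the lowest total badness.
    n_steps = int(round((hi - lo) / step))
    candidates = [round(lo + k * step, 10) for k in range(n_steps + 1)]
    return min(candidates, key=book_badness)


# Stand-in for a real layout run: pretend the book is happiest at 56%.
def fake_book_badness(fraction):
    return (fraction - 0.56) ** 2


split = best_column_split(fake_book_badness)
print(split)
```

In a real run each call to the badness function means re-laying-out every paragraph, so the step size is a tradeoff between precision and run time.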

I know it's customary for Show HN posts to include the code, but frankly, although it is functional with the right nudges, it is a mess that I didn't get around to cleaning up. It was continually modified and 'kludged' as I experimented with different approaches. The layout code is still attached to a complex, although vestigial and unnecessary, GUI that can be used to align and edit translations, and creates an intermediate format that is imported into InDesign to create a PDF output. I hope to rewrite it sometime as a command-line program with its own PDF rendering code and put it on GitHub. I'm a bit hesitant though, as the code is used to create books that are currently my only source of income.

I hope you enjoyed reading about my project, and I'm always happy for an email, or interesting projects to work on. hobodrifterdavid at [the big G] dot com