In a standard PDF viewer, suppose you’re reading a two-column document and you zoom into the left column. Now suppose that the font size is still too small to read comfortably. You zoom in further:
The font is now the size you want, but the left and right edges of the text are cut off.
Scribd’s new reflow zoom
With the new Scribd Android reader, what happens instead is this:
As soon as the left and right edges of the text hit the border of the screen, the app starts ‘reflowing’ the text, fitting it to the screen width. Essentially, the document has been reformatted into a one-column document with no page breaks for mobile reading.
For a clearer understanding of what this means, please watch the video at the beginning of this post.
How it works
In order to render a ‘reflowed’ version of the document text, we have to analyze the document beforehand (we actually do this offline, on our servers).
In particular, we have to:
- Analyze the layout and detect the reading order of the text
- Detect and join back words where hyphens were used for line-wrapping
- Remove page numbers, headers/footers, table of contents etc.
- Interleave images with the text
I’d like to talk about two of them here:
Detecting the reading order of the text
For starters, we need to figure out the reading order of the content on a page. In other words, given a conglomeration of characters on a page, how do we “connect the dots” so that all the words and sentences make sense and are in the right order?
Thankfully, PDF tends to store characters in reading order in its content stream.
It doesn’t always (and what to do if it doesn’t is a topic for a whole blog post),
but when it does, determining the reading order is as easy as reading the index of
characters in the page content stream from the PDF.
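As a rough illustration of the easy case, here is a minimal Python sketch. It is not our production code: the `Glyph` record, its fields, and the toy page are assumptions made up for this example, and it presumes some PDF parser has already handed us each character along with its index in the content stream.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one character extracted from a PDF page."""
    char: str          # the character that was drawn
    x: float           # horizontal position on the page
    y: float           # vertical position (larger values are higher up)
    stream_index: int  # position of the character in the page content stream


def text_in_stream_order(glyphs):
    """Read the page text by following the content-stream index."""
    ordered = sorted(glyphs, key=lambda g: g.stream_index)
    return "".join(g.char for g in ordered)


# Toy two-column page: "A" above "B" in the left column,
# "C" above "D" in the right column.
glyphs = [
    Glyph("A", 10, 100, 0), Glyph("C", 200, 100, 2),
    Glyph("B", 10, 80, 1), Glyph("D", 200, 80, 3),
]

# Naive geometric order (top to bottom, then left to right) mixes the columns.
by_position = "".join(
    g.char for g in sorted(glyphs, key=lambda g: (-g.y, g.x))
)

# Stream order keeps each column together, assuming the PDF stored it that way.
by_stream = text_in_stream_order(glyphs)
```

On this toy page the geometric sort reads straight across both columns, while the stream order reads the left column first and then the right one. That only holds when the PDF actually stores its characters in reading order.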
Detecting hyphenation, and joining back words
Determining whether a hyphen at the end of a line is there because a word was hyphenated, or whether it’s just a so-called em dash, is trickier, especially since not everybody uses the typographically correct version of the em dash (Unicode U+2014). Consider these example sentences:
The grass would be only rustling in the wind, and the pool rippling to the waving of the reeds— the rattling teacups would change to tinkling sheep- bells. The Cat grinned when it saw Alice. It looked good- natured, she thought.
When implementing an algorithm for detecting all these cases, it’s useful to have a dictionary handy, preferably in all the languages you’re supporting (for Scribd, that’s quite a few). That allows you to look up that “sheep-bell” is a word, whereas “reedsthe” is not.
It’s even better if the dictionary also stores word probabilities, allowing you to determine that “good-natured” is more probable than “natured”.
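To make the idea concrete, here is a small Python sketch of a dictionary-based decision. It is only an illustration under stated assumptions, not our actual implementation: the `WORD_PROB` table and its numbers are invented, the function name `join_line_break` is hypothetical, and the scoring (a single word probability versus a product of two) is a deliberately crude heuristic.

```python
# Hypothetical word probabilities. A real system would load a large
# dictionary for each supported language instead of this toy table.
WORD_PROB = {
    "sheep-bells": 1e-7,
    "good-natured": 1e-6,
    "natured": 1e-8,
    "rattling": 1e-6,
    "reeds": 1e-6,
    "the": 5e-2,
}

DASHES = ("-", "\u2014")  # plain hyphen and em dash (U+2014)


def join_line_break(left, right, probs=WORD_PROB):
    """Join `left` (last word of a line, ending in a dash) with `right`
    (first word of the next line), choosing the most probable reading."""
    dash = left[-1]
    if dash not in DASHES:
        raise ValueError("left must end with a hyphen or an em dash")
    stem = left[:-1]

    def lookup(word):
        return probs.get(word.lower(), 0.0)

    as_punctuation = f"{stem}{dash} {right}"
    candidates = [
        # The hyphen was only inserted for line-wrapping: drop it.
        (lookup(stem + right), stem + right),
        # The hyphen belongs to a compound word: keep it, drop the break.
        (lookup(stem + "-" + right), stem + "-" + right),
        # The dash is punctuation between two separate words.
        (lookup(stem) * lookup(right), as_punctuation),
    ]
    best_score, best_text = max(candidates, key=lambda c: c[0])
    # If the dictionary knows none of the readings, leave the text unchanged.
    return best_text if best_score > 0 else as_punctuation
```

With this toy table, `join_line_break("sheep-", "bells")` gives `"sheep-bells"`, `join_line_break("rat-", "tling")` gives `"rattling"`, and `join_line_break("reeds\u2014", "the")` leaves the dash in place as punctuation between two words.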
If you want to try this out for yourself, you can download our implementation
from the Android Market.
Right now, a select set of books and documents offers this functionality. Soon, we will roll it out to a large share of our content.