The Perils of Stacking

This is the second of a four-part series on the technology behind Scribd’s HTML viewing experience. You might like to read part 1, “Facing Fonts in HTML” and part 3, “Repolygonizing Fonts,” if you haven’t already. Part 4, “Plan B: Font Fallbacks” is coming soon.

A document page, unlike an image, isn’t really just a two-dimensional thing.

It’s not until you’ve been forced to dig into the internals of the PDF format that you come to appreciate the rich structure an innocent looking document page gives you. Vector fills, gradient patterns and semi-transparent bitmaps fight over dominance in the z-order stack, while clip-polygons slice away whole portions of the page, only to be faded into the background by an omnipotent hierarchical transparency group afterwards. So how does one convert this multitude of multi-layer objects into an html page, which is basically nothing more than a background image with a bit of text on top?

To understand the problem better, here’s a stacked diagram of a page:

At the bottom of the stack we have a bitmap (drawn first), then some text, followed by vector graphics, and finally another block of text on top. We don’t currently support vector graphics in HTML (stay tuned …); instead, we convert polygons to images which presents us with the challenge of finding a z-order of bitmaps and text elements that preserves the drawing order of the original page, while also simplifying the structure.

An optimal solution of transforming the above document page into a bitmap/text layering might look like this:

You see that here we merged two images into one even though they were not adjacent in the rendering stack, by using the fact that the text between the two images didn’t intersect with both of them.

This was a simple case where it’s actually enough to put one solitary bitmap at the background of a page. It also may happen that you have to put transparent images on top of the text (i.e, give them an higher z-index value). Notice that this requires the IE6 transparency hack.

In order to figure out whether or not an element on the page shares display space (i.e., pixels) with another element, we keep a boolean bitmap around during conversion:

Element on page Corresponding boolean bitmap

This bitmap tells us which regions of a page currently have been drawn by e.g. polygon operations, and thus which pixels need to be checked against new text objects in preceding layers. In fact, we actually keep two of those bitmaps around; one for keeping track of the area currently occupied by the next bitmap layer we’re going to add to the display stack, and one for keeping track of the same thing for text objects.

There’s an interesting fact about this approach: As long as the things drawn so far onto a bitmap and the html text fields don’t overlap, we’re free to chose the order we draw them.

In other words, it’s not until we’ve drawn the first object intersecting with another layer that we decide which of those two layers to dump out to the page first.

Here’s an example of a document page being rendered step by step:

The background image is put on the lowermost layer. Notice that the background also contains graphical elements for the equations on this page:

This is the text layer, consisting of normal HTML markup using fonts extracted from the document:

The two layers are combined to produce the final output:

Take a look at the actual technical paper as converted. And of course, if you want to see your own docs htmlified, just upload them to Scribd.com!

—Matthias Kramm

Next: Repolygonizing Fonts

Facing Fonts in HTML

This is the first of a four-part series on the technology behind Scribd’s HTML viewing experience. You might like to read part 2, “The Perils of Stacking” and part 3, “Repolygonizing Fonts,” if you haven’t already. Part 4, “Plan B: Font Fallbacks” is coming soon.

PDF to HTML converters have been around for ages. They take a document, analyze its structure, and produce (with the help of some images) an HTML document that has roughly the same text at roughly the same position. This is fun to play with, but not really useful for practical applications since fonts, layout and style are often very different from the original.

With the new Scribd HTML document conversion, we aimed for a higher goal: Let’s create a version of a PDF file that looks exactly the same, but is also completely valid HTML. It turns out that since all major browsers now support @font-face, this is actually possible.

Encoding documents in this way has numerous advantages: no proprietary plugins like Flash or Silverlight are required to view documents; we take full advantage of built-in browser functionality like zooming, searching, text selection, etc.; state-of-the-art embedded devices are supported out of the box; and even on older browsers it degrades gracefully (to HTML text with built-in fonts).

So let’s back up a bit and review what @font-face is:

Font face is a way of making a browser use a custom TrueType font for text on a web page. For example, this text uses the Scrivano typeface:

font-face

Try selecting this text and right-clicking to assure yourself that this is indeed just standard browser element

This is accomplished by introducing a @font-face css declaration in the page style block:

@font-face {font-family: Scrivano;
            src: url("http://someserver.com/scrivano.ttf");}

Now you can use the “Scrivano” font family as if it was a built-in font. Before we dive into how one can use this for converting documents to HTML, there’s a caveat: in order for @font-face to really work, we need to store three font files: One for Internet Explorer (.eot) one for embedded devices (.svg) and one for Firefox, Safari, Chrome et al (.ttf). Once this is done, however, @font-face works in pretty much every modern browser, including the ones that come with mobile phones (iPhone and Android). So this is how the above looks, fully-fledged:

@font-face {
  font-family: 'Scrivano';
  src: url('scrivano.eot');
  src: url("scrivano.svg") format('svg');
  src: local('\u263a'), url('scrivano.otf')
  format('truetype');
}

(For a detailed discussion about the \u263a hack in there, see Paul Irish’s bulletproof font-face syntax)

Now, let’s talk about documents again. As in most documents, text is the predominant element, making @font-face invaluable for transcoding PDF files to @font-face-enabled HTML. Of course, the fonts we need for displaying HTML text as closely to the original as possible need to be constructed from scratch: PDF documents store fonts in a myriad number of formats (Type1, Type3, OpenType etc.), none of which is directly usable as a web font. Additionally, the same font may be used in different places with different encodings or even different transforms.

I don’t want to bore anyone with the specifics of decoding arbitrary PDF font and build a TTF, EOT and SVG out of it. However, the transformed font variant is actually fun to talk about. Diagonal text in documents is surprisingly common; let’s look at, for example, this figure with diagonal axis descriptors, from a document about fractals:

How do you encode the diagonal text in this document in a HTML page?

Short of using element transformations (-moz-transform, DXImageTransform etc.) which we found to be rather impractical, we encode the above HTML with a custom font created by transforming the original font. Here’s how our generated font looks in FontForge:

From the above font screenshot you also notice that we reduce fonts to only the characters that are actually used in the document; that helps save space and network bandwidth. Usually, fonts in the pdfs are already reduced, so this is not always necessary.

Naturally, for fonts with diagonal characters every character needs to be offset to a different vertical position (we encode fonts as left-to-right). In fact, this is how other HTML converters basically work: they place every single character on the page using a div with position:absolute:

<!-- crude pdf to html conversion -->
<div style="position:absolute;top:237px;left:250px;">H</div>
<div style="position:absolute;top:237px;left:264px;">e</div>
<div style="position:absolute;top:237px;left:271px;">l</div>
<!-- etc. -->

At Scribd, we invested a lot of time in optimizing this, to the degree that we can now convert almost all documents to “nice” HTML markup. We detect character spacing, line-heights, paragraphs, justification and a lot of other attributes of the input document that can be encoded natively in the HTML. So a PDF document uploaded to Scribd may, in it’s HTML version, look like this (style attributes omitted for legibility):

Original document:

HTML version:

<p>
  <span>domain block is in a different image than the range block), as</span>
  <span>opposed to mappings which stay in the image (domain block</span>
  <span>and range block are in the same image) - also see Fig. 5.</span>
  <span>It's also possible to, instead of counting the number of</span>
</p>

Together with <img> tags for graphic elements on pages, we can now represent every PDF document in HTML while preserving fonts, layout and style, with text selectability, searchability, and making full use of the optimized rendering engines built into browsers.

Want to see it in action? Check out this technical paper, browse our featured documents or upload your own!

-Matthias Kramm

Next: The Perils of Stacking

The Future of Reading is Open

Today, Scribd is changing the way you read documents online. Over the next few weeks and months, Scribd will convert our entire content corpus — tens of millions of documents, books and presentations — into native HTML5 web pages so that we can offer the best online reading experience. Scribd documents in HTML5 load instantly, support native browser functions (zoom, search, scroll, select text), and deliver an impressive reading experience across all browsers and web-enabled devices, without requiring add-ons or plug-ins.

What does this mean? It means that any document in any format (MS Office Docs, Google Docs, PDF, ePub) will now be readable on any device with a browser. This move — nearly six months in the making — represents billions of pages of content that will become part of the fabric of the web. To find out more, read this presentation:

http://www.scribd.com/html5

The Scribd Open Reading Platform: The future of reading is open. Join us.
-Jared Friedman, Scribd CTO and co-founder

Follow

Get every new post delivered to your Inbox.

Join 44 other followers