Hackday: Keynote!

At least once a year, we have a Scribd “Hack Day”.
Everybody picks something they really want to work on, and then we hack for 24 hours straight.

It’s always a lot of fun: there are treats around the clock (ice cream deliveries, a crêpe cart and a specially stocked snack bar), and if you’re still around (and awake!) during random late-night “raffle hours”, you get a raffle ticket. (Prizes included a Kindle, a jalapeño growing kit, a RipStik, and a flying shark.)

Our last Hack Day was March 29th, 2013, with the following top 5:

  1. Jared’s In-Doc Search (17 votes)
  2. Matthias’ Keynote Conversion (12 votes)
  3. Jesse’s ScribdBugz (10 votes)
  4. Gabriel’s Bounce Analytics (9 votes)
  5. Kevin’s Leap Motion Navigation (6 votes)

We’re proud to announce that we’ve now open sourced some of this work: the Scribd Keynote converter, now running in production for us, is available from http://github.com/scribd/keynote .
Keynote is Apple’s presentation format, and the above project enables you to convert it to PDF on non-Mac systems.

Open sourcing keynote follows in the footsteps of many of our earlier open source contributions; see http://github.com/scribd/ for our other projects.

Happy Hacking!

Why zooming on mobile is broken (and how to fix it)

Traditional zooming

In a standard PDF viewer, suppose you’re reading a two-column document and you zoom into the left column. Now suppose that the font size is still too small to read comfortably. You zoom in further:

What happens is that even though the font is now the size you want, the left and right edges of the text are cut off.

Scribd’s new reflow zoom

With the new Scribd Android reader, what happens instead is this:

As soon as the left and right half of the text hit the border, the app starts ‘reflowing’ the text, nicely matching it to the screen size. Essentially, the document has been reformatted into a one-column document with no pagebreaks for mobile reading.

For a clearer understanding of what this means, please watch the video at the beginning of this post.

How it works

In order to render a ‘reflowed’ version of the document text, we have to analyze the document beforehand (we actually do this offline, on our servers).

In particular, we have to:

  1. Analyze the layout and detect the reading order of the text
  2. Detect and join back words where hyphens were used for line-wrapping
  3. Remove page numbers, headers/footers, table of contents etc.
  4. Interleave images with the text

I’d like to talk about two of these steps right here:

Detecting the reading order of the text

For starters, we need to figure out the reading order of the content on a page. In other words, given a conglomeration of characters on a page, how do we “connect the dots” so that all the words and sentences make sense and are in the right order?

Thankfully, PDF tends to store characters in reading order in its content stream.
It doesn’t always (and what to do if it doesn’t is a topic for a whole blog post),
but when it does, determining the reading order is as easy as reading the index of
characters in the page content stream from the PDF.

Detecting hyphenation, and joining back words

Determining whether a hyphen at the end of a line is there because a word was hyphenated, or whether it’s just a so-called em dash, is trickier, especially since not everybody uses the typographically correct em dash character (Unicode 0x2014). Consider these example sentences:

The grass would be only rustling in the wind, and 
the pool rippling to the waving of the reeds—
the rattling teacups would change to tinkling sheep-
bells. The Cat grinned when it saw Alice. It looked good-
natured, she thought.

When implementing an algorithm for detecting all these cases, it’s useful to have a dictionary handy, preferably in all the languages you’re supporting (for Scribd, that’s quite a few). That allows you to look up that “sheep-bell” is a word, whereas “reedsthe” is not.

It’s even better if the dictionary also stores word probabilities, allowing you to determine that “good-natured” is more probable than “natured”.
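A minimal sketch of this dictionary-based decision (the word list and frequencies below are made-up stand-ins for a real multi-language dictionary):

```python
# Hypothetical word frequencies standing in for a real dictionary.
WORD_FREQ = {"rustling": 20, "sheep-bells": 4, "good-natured": 5, "natured": 1}

def rejoin(left, right):
    """left ends a line with '-'; right starts the next line.

    Returns the rejoined word, or None if the hyphen is probably a dash."""
    plain = left + right            # the hyphen was a line-wrap: "rust" + "ling"
    hyphened = left + "-" + right   # the hyphen is part of the word: "sheep-bells"
    p_plain = WORD_FREQ.get(plain, 0)
    p_hyphened = WORD_FREQ.get(hyphened, 0)
    if p_plain == 0 and p_hyphened == 0:
        return None                 # neither is a word: treat it as a dash
    return plain if p_plain > p_hyphened else hyphened

print(rejoin("rust", "ling"))    # "rustling"
print(rejoin("sheep", "bells"))  # "sheep-bells"
print(rejoin("reeds", "the"))    # None: probably a dash, don't join
```

Frequencies are what resolve the ambiguous cases: both “good-natured” and “natured” exist as dictionary entries, and the counts decide which reading wins.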

Learn more

If you want to try this out for yourself, you can download our implementation
from the Android market.

Right now, we have a choice selection of books and documents that offer this functionality. Soon, we will roll it out to a much larger portion of our content.

Matthias Kramm

Plan B: Font Fallbacks

This is the fourth post in our series about Scribd’s HTML5 conversion. The whole process is neatly summarized in the following flowchart:

In our previous post we wrote about how we encode glyph polygons from various document formats into browser fonts. We described how an arbitrary typeface from a document can be sanitized and converted to a so-called “@font-face”: a font that browsers can display.

The next challenge the aspiring HTML5 engineer faces is that even after hand-crafting a @font-face (including self-intersecting all the font polygons and throwing together all the required .ttf, .eot and .svg files), a browser may still refuse to render the font. After all, there are still browsers out there that just don’t support custom fonts, most importantly mobile devices like Google’s Android, or e-book readers like Amazon’s Kindle.

Luckily enough, HTML has for ages had a syntax for specifying font fallbacks in case a @font-face (or, for that matter, a system font) can’t be displayed:

    <style type="text/css">
    .p {
        font-family:
            myfontface, /* preferred typeface */
            Arial,      /* fallback 1 */
            sans-serif; /* fallback 2 */
    }
    </style>

There are a number of fonts one can always rely on to be available for use as fallbacks:

Arial (+ bold,italic)
Courier (+ bold,italic)
Georgia (+ bold,italic)
Times (+ bold,italic)
Trebuchet (+ bold,italic)
Verdana (+ bold,italic)
Comic Sans MS (+ bold)

(Yes, that’s right: every single browser out there supports Comic Sans MS.)

However, it’s not always trivial to replace a given font with a font from this list. In the worst case (i.e., when an array of polygons for a subset of the font’s glyphs is really all we have; not all documents store proper font names, let alone a complete set of glyphs or font attributes), we don’t really know much about the font face at hand: Is it bold? Is it italic? Does it have serifs? Is it maybe a script face?

Luckily though, those features can be derived from the font polygons with reasonable effort:

Detecting bold face glyph polygons

The boldness of a typeface is also referred to as its “blackness”. This suggests a simple detection scheme: find out how much of a given area is covered by a couple of “representative” glyphs.
The easiest way to do this is to just render the glyph to a tiny bitmap and add up the pixels:

A more precise way is to measure the area of the polygon directly, e.g. using a scanline algorithm.

A mismatch between the area we expect for, say, the letter F at a given size and the actual measured area is an indicator that we’re dealing with a bold face.
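A sketch of the area-based variant: measure the polygon area directly with the shoelace formula and relate it to the em square. The glyph outlines below are made up for illustration.

```python
def polygon_area(points):
    """Absolute polygon area via the shoelace formula."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def blackness(glyph_polygons, em_size):
    """Fraction of the em square covered by the glyph's outline polygons."""
    return sum(polygon_area(p) for p in glyph_polygons) / (em_size * em_size)

# A hypothetical letter "I" in a 100-unit em square: regular vs. bold stem.
regular_I = [[(45, 0), (55, 0), (55, 100), (45, 100)]]
bold_I = [[(38, 0), (62, 0), (62, 100), (38, 100)]]
print(blackness(regular_I, 100))  # 0.1
print(blackness(bold_I, 100))     # 0.24
```

Comparing these ratios against per-letter reference values (measured from known regular fonts) yields the bold/not-bold decision.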

Detecting italic face glyph polygons

A trivial italic typeface (more precisely: an oblique typeface) can be created from a base font by slanting every character slightly to the right. In other words, the following matrix is applied to every character:

    ( 1  s )
    ( 0  1 )

(with s the horizontal displacement)
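Applied to outline points, this matrix simply shifts each point to the right in proportion to its height; a one-line sketch:

```python
# x' = x + s*y, y' = y: shear each outline point by the slant factor s.
def slant(points, s):
    return [(x + s * y, y) for (x, y) in points]

# A vertical stem becomes slanted: its top edge shifts right by s * height.
stem = [(0, 0), (10, 0), (10, 100), (0, 100)]
print(slant(stem, 0.2))  # [(0.0, 0), (10.0, 0), (30.0, 100), (20.0, 100)]
```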

In order to find out whether a typeface at hand is slanted in such a way, we use the fact that a normal (non-italic) typeface has a number of vertical edges, for example in the letters L,D,M,N,h,p:

In an italic typeface, these vertical edges “disappear” (become non-vertical):

In other words, we can spot an italic typeface by the relative absence of strict vertical polygon segments, or, more generally, the mean (or median) angle of all non curved segments that are more vertical than horizontal.
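A sketch of this slant test: take all straight outline segments that are “more vertical than horizontal” and measure their mean deviation from a true vertical. The 6-degree threshold is an assumption, not a value from our production system.

```python
import math

def slant_angle(segments):
    """Mean deviation from vertical (degrees) of near-vertical segments."""
    angles = []
    for (x1, y1), (x2, y2) in segments:
        dx, dy = x2 - x1, y2 - y1
        if abs(dy) > abs(dx):  # more vertical than horizontal
            angles.append(math.degrees(math.atan2(abs(dx), abs(dy))))
    return sum(angles) / len(angles) if angles else 0.0

def looks_italic(segments, threshold=6.0):
    return slant_angle(segments) > threshold

upright = [((0, 0), (0, 100)), ((50, 0), (50, 100))]   # strict verticals
slanted = [((0, 0), (20, 100)), ((50, 0), (70, 100))]  # verticals "disappeared"
print(looks_italic(upright))  # False
print(looks_italic(slanted))  # True
```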

Detecting the font family

As for the actual font family, we found that two features are fairly characteristic of a given font:

  • The number of corners (i.e., singularities of the first derivative) of all the glyph outlines
  • The sign of (w1-w2) for all pairs of glyphs with widths w1 and w2

For example, displayed below are the corners of two fonts (Trebuchet and Courier) and the extracted feature string:

Of course, for a font to be mapped against a browser font, we typically only have a subset of n glyphs, hence we can only use the number of corners of a few glyphs.

The second feature, comparing signs of glyph-width differences, gives us more data to work with, as n glyphs generate n*(n-1)/2 differences (entries in the difference matrix, with the lower left half and upper right half symmetric):
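The second feature can be sketched as a bit string, one bit per glyph pair. The advance widths below are hypothetical values for the glyphs “m”, “i”, “L” in a proportional and a monospaced font:

```python
def width_sign_feature(widths):
    """One bit per glyph pair (i < j): 1 if widths[i] > widths[j]."""
    n = len(widths)
    return "".join("1" if widths[i] > widths[j] else "0"
                   for i in range(n) for j in range(i + 1, n))

proportional = [833, 278, 556]  # "m" wider than "i" and "L"
monospaced = [600, 600, 600]    # all glyphs the same width

print(width_sign_feature(proportional))  # "110": n*(n-1)/2 = 3 bits for n = 3
print(width_sign_feature(monospaced))    # "000"
```

Because the feature only uses the *signs* of the differences, it is invariant under font scaling, which makes it suitable for matching a document font against candidate browser fonts.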

Notice that our detection approach assumes that we actually know what a given glyph represents (i.e., that glyph 76 in a font is supposed to look like an “L”). This is not always the case; we’ll write about that in one of the next posts.

Here’s a random selection of fonts from our documents (left) and the corresponding replacement (right):

Comparison of original font and new font

And, as always, if you want to see the results of these algorithms for yourself, just grab a PDF (or any other document format), upload it to Scribd, and then download it to a (non @font-face-enabled?) mobile device of your choice.

-Matthias Kramm

Repolygonizing Fonts

This is the third of a four-part series on the technology behind Scribd’s HTML viewing experience. You might like to read part 1, “Facing Fonts in HTML” and part 2, “The Perils of Stacking,” if you haven’t already. Part 4, “Plan B: Font Fallbacks” is coming soon.

Every single day, Scribd processes over 150,000,000 polygons in order to convert your uploaded documents into our new HTML format (among others).

So why on earth would something like this be necessary? In order to get to that, we first have to talk a bit about fonts. A font, in its most simplistic form, is just a bundle of polygons, with a Unicode index attached to each one. It’s these font polygons this post is about.

Of course, TrueType fonts and their descendants (.eot fonts, OpenType fonts etc.) store much more than that, with a typical font containing thousands of lines of vertex-adjusting bytecode (hence the name “font program”), multiple encoding tables, a vast amount of miscellaneous font metrics, etc. That is what this post is not about.

For our HTML conversion, we had to solve the following problem: How do you encode an arbitrary polygon into a font glyph for use in a @font-face declaration? The answer is font repolygonization, which is the process of optimizing the polygons in a given font. It turns out that to repolygonize a font properly, you first have to know a thing or two about fill-rules.

Polygons in computer graphics generally come in two “flavors”: even-odd-filled or nonzero-filled. These two schemes use different semantics to determine whether a given point is inside or outside of the glyph. For even-odd, every line in the polygon separates something on the inside from something on the outside:

Even-odd filled glyphs

For nonzero, on the other hand, the direction of the line matters. Two lines in the same direction mean the area after the lines is still filled; drawing a hole into a polygon needs to be done by a line going into the other direction:

Nonzero-filled glyphs (with direction indicators)

For even-odd polygons, the segment direction doesn’t matter, but for nonzero-filled polygons it’s also necessary to draw the segments in the right direction. TrueType and embedded OpenType (.eot) fonts use the nonzero fill style. For .svg fonts, you get to choose (see the fill-rule property). So to properly encode an arbitrary polygon as a font glyph, we need to “fix” the fill-rule.
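The difference between the two rules can be made concrete with a point-in-polygon test (a toy sketch; real rasterizers work per scanline):

```python
# The two fill-rules as point-in-polygon tests: count how a rightward ray
# from the query point crosses the outline. Even-odd uses the crossing
# count's parity; nonzero uses the sum of signed crossings. The glyph data
# is a toy example: an outer square with an inner square drawn in the SAME
# direction, which is a hole under even-odd but stays filled under nonzero.

def crossings(polygon, px, py):
    """(total, signed) crossings of a rightward ray from (px, py)."""
    total = signed = 0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 <= py) != (y2 <= py):                    # edge spans the ray's height
            x = x1 + (py - y1) * (x2 - x1) / (y2 - y1)  # x where edge meets the ray
            if x > px:
                total += 1
                signed += 1 if y2 > y1 else -1          # direction matters for nonzero
    return total, signed

def inside(polygons, px, py, rule):
    total = signed = 0
    for poly in polygons:
        t, s = crossings(poly, px, py)
        total, signed = total + t, signed + s
    return total % 2 == 1 if rule == "evenodd" else signed != 0

outer = [(0, 0), (10, 0), (10, 10), (0, 10)]   # counter-clockwise
inner = [(3, 3), (7, 3), (7, 7), (3, 7)]       # same direction as outer
glyph = [outer, inner]
print(inside(glyph, 5, 5, "evenodd"))  # False: the center is a hole
print(inside(glyph, 5, 5, "nonzero"))  # True: same winding keeps it filled
```

Reversing the inner square’s direction would make the two rules agree again, which is exactly the kind of relabeling the conversion described below performs.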

Another thing that’s important for properly encoding a Truetype font is that all segments need to be continuous. Depending on where the source polygon comes from, this isn’t necessarily the case. In some vector graphic formats (e.g. SWF), polygons can easily be drawn in an arbitrary order and still define a filled shape:

degenerate glyph (even-odd filled)

We found that while we’re recoding and fixing a polygon, a convenient way to store the end result is as a hybrid polygon which is, at the same time, both even-odd and nonzero-filled:

Self intersected, both even-odd as well as nonzero-filled glyphs

At Scribd, we greatly rely on this hybrid approach when repolygonizing fonts in order to be able to produce output files for both our Flash and HTML reader platforms at the same time. “But wait!” you cry, “you said you generate HTML fonts from fonts stored in PDF files? Why aren’t those already encoded with the right fill-rule and continuous segments?” Good question.

First of all, we simplify font polygons, thus potentially invalidating the fill-rule. Secondly, some “fonts” in a PDF are no more than a vector drawing in a font’s clothing. Finally, font characters might get overlapped, transformed, combined, intersected etc. during conversion.

So how do you convert a polygon to a different fill-rule? This gets us back to the intersection of those 150,000,000 polygons per day: you intersect the polygon with itself and relabel the segments at the same time with a fill-rule of your choice. We actually intersect the polygon not only with itself but also with various other elements like clip shapes, in order to remove invisible or half-hidden elements on the page.

In theory, this is a rather simple approach: For every character element to be drawn on the page, intersect that character’s polygon with itself and all other page polygons it possibly interacts with, then store the result of that intersection in the font. There are two perhaps non-obvious problems, though:

  1. This polygon intersection needs to be fast. We process hundreds of thousands of pages every day.
  2. It needs to be very stable. We want to introduce zero display errors.

So while slow processing speed can easily be countered by buying some more fancy hardware, getting a polygon intersection to be stable is something entirely different. For example, here are some classic pathological cases that need to be handled correctly. Notice that cases like these are actually the rule rather than the exception.

Horizontal lines throw off scanline algorithms, multiple intersections in a single point lead to subtle numeric errors, and intersections taking place on top of lines can introduce hairline fragments that will cause the rendering engine displaying these polygons to “bleed.”

While it’s possible to deal with each of these cases, floating point arithmetic can actually cause them to occur after or during the computation: for example, if you add a point to a polygon segment because of an intersection, the resulting small change in the segment slope can cause it to overlap another segment it was previously disjoint from.

There are basically three ways to deal with this precision problem:

  • Introduce some random perturbation to all polygon endpoints. This nicely takes care of horizontal lines and three-point intersections, and makes the probability of floating point errors introducing extraneous overlaps very small. However, when concatenating intersections, the perturbations tend to accumulate. It’s possible to counteract this with symbolic perturbation, but still, this is cheating: we wouldn’t be solving the problem, only making it harder to reproduce.
  • Work with infinite precision. In other words, do away with floating point errors by using enough bits in the numeric representation of the coordinates so that we never have to do any rounding. This approach has been verified in practice (see e.g. Fortune & Van Wyk). The main problem is going back from the infinite-precision representation to the finite-precision final version (which we need in order to actually store the data).
  • Work on an integer grid, and use snap-rounding. Don’t let the floating point numbers lure us into the security of false precision, and explicitly deal with all special cases that can occur.

Let’s look at the last item in more detail. So in the example of these glyphs:

A snap-rounded version would look like this:

You can see that snap-rounding doesn’t just reposition all polygon endpoints; it also puts all the intersection points on the integer grid. In addition, all lines that come into the vicinity of a used grid point are routed through that grid point as well. This way, no polygon edge can ever come to lie too close to the vertex of another edge. And while snap-rounding makes things look blocky in the example above, that’s because we chose a very coarse-grained grid for illustration. In our production system we operate with a grid spacing that is a tiny fraction of a pixel (or, for glyphs, a tiny fraction of an em square).
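The core of the idea can be sketched in a few lines: compute segment intersections with exact rational arithmetic, then snap every vertex (endpoint or intersection) to the nearest integer grid point. The full algorithm additionally reroutes segments through every “hot” grid point they pass near; that part is omitted here.

```python
from fractions import Fraction

def intersect(seg_a, seg_b):
    """Exact intersection point of two segments, or None."""
    (ax1, ay1), (ax2, ay2) = seg_a
    (bx1, by1), (bx2, by2) = seg_b
    d = (ax2 - ax1) * (by2 - by1) - (ay2 - ay1) * (bx2 - bx1)
    if d == 0:
        return None                            # parallel segments
    # Parameters along each segment, as exact rationals (no rounding yet).
    t = Fraction((bx1 - ax1) * (by2 - by1) - (by1 - ay1) * (bx2 - bx1), d)
    u = Fraction((bx1 - ax1) * (ay2 - ay1) - (by1 - ay1) * (ax2 - ax1), d)
    if not (0 <= t <= 1 and 0 <= u <= 1):
        return None                            # lines cross outside the segments
    return (ax1 + t * (ax2 - ax1), ay1 + t * (ay2 - ay1))

def snap(point):
    """Round a point to the nearest integer grid vertex."""
    x, y = point
    return (round(x), round(y))

p = intersect(((0, 0), (7, 3)), ((2, 3), (5, -1)))
print(p, "->", snap(p))  # (119/37, 51/37) snaps to (3, 1)
```

Doing the rounding only at this final, explicit step is what keeps the special cases enumerable instead of letting them hide inside floating point noise.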

Intersecting two polygons while at the same time grid-rounding the results also uses only slightly more computation time than standard scanline algorithms: while a Bentley & Ottmann scanline sweep needs O(n log n + k log n) time, snap-rounding can be done in O(E log n), with E the description complexity of the crossing pattern, which for typical polygons is of order (n+k). The nice thing about intersection using grid-rounding is that it’s rock-solid: there’s no corner case that isn’t explicitly handled, and overcomplicated intersection patterns automatically simplify themselves. If you’re interested in the nitty-gritty details of our implementation, we basically used the algorithm from Hobby with a few additional ideas from Hershberger and Bhattacharya et al. in order to produce a grid-rounded glyph polygon intersection in a single scanline pass.

Also, even though we process those hundred million polygons a day, only 10% of document conversion time is actually spent on that. All the rest goes into image processing. We’ll talk about that in one of our next posts.

—Matthias Kramm

Coming Soon: Plan B: Font Fallbacks

The Perils of Stacking

This is the second of a four-part series on the technology behind Scribd’s HTML viewing experience. You might like to read part 1, “Facing Fonts in HTML” and part 3, “Repolygonizing Fonts,” if you haven’t already. Part 4, “Plan B: Font Fallbacks” is coming soon.

A document page, unlike an image, isn’t really just a two-dimensional thing.

It’s not until you’ve been forced to dig into the internals of the PDF format that you come to appreciate the rich structure an innocent looking document page gives you. Vector fills, gradient patterns and semi-transparent bitmaps fight over dominance in the z-order stack, while clip-polygons slice away whole portions of the page, only to be faded into the background by an omnipotent hierarchical transparency group afterwards. So how does one convert this multitude of multi-layer objects into an html page, which is basically nothing more than a background image with a bit of text on top?

To understand the problem better, here’s a stacked diagram of a page:

At the bottom of the stack we have a bitmap (drawn first), then some text, followed by vector graphics, and finally another block of text on top. We don’t currently support vector graphics in HTML (stay tuned…); instead, we convert polygons to images, which presents us with the challenge of finding a z-order of bitmaps and text elements that preserves the drawing order of the original page while also simplifying the structure.

An optimal solution of transforming the above document page into a bitmap/text layering might look like this:

You see that here we merged two images into one even though they were not adjacent in the rendering stack, by using the fact that the text between the two images didn’t intersect with both of them.

This was a simple case where it’s actually enough to put one solitary bitmap at the background of a page. It may also happen that you have to put transparent images on top of the text (i.e., give them a higher z-index value). Notice that this requires the IE6 transparency hack.

In order to figure out whether or not an element on the page shares display space (i.e., pixels) with another element, we keep a boolean bitmap around during conversion:

Element on page (left); corresponding boolean bitmap (right)

This bitmap tells us which regions of a page have been drawn on so far by e.g. polygon operations, and thus which pixels need to be checked against newly added text objects. In fact, we actually keep two of those bitmaps around: one for keeping track of the area currently occupied by the next bitmap layer we’re going to add to the display stack, and one for keeping track of the same thing for text objects.
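A toy sketch of such an occupancy bitmap (the grid resolution and bounding-box granularity are illustrative; our production code tracks actual drawn pixels):

```python
class OccupancyBitmap:
    """Coarse boolean grid recording which pixels a layer has drawn on."""

    def __init__(self, width, height):
        self.w, self.h = width, height
        self.bits = [[False] * width for _ in range(height)]

    def mark_rect(self, x0, y0, x1, y1):
        """Mark an element's bounding box as drawn."""
        for y in range(max(0, y0), min(self.h, y1)):
            for x in range(max(0, x0), min(self.w, x1)):
                self.bits[y][x] = True

    def overlaps_rect(self, x0, y0, x1, y1):
        """Would a new element share pixels with what's already drawn?"""
        return any(self.bits[y][x]
                   for y in range(max(0, y0), min(self.h, y1))
                   for x in range(max(0, x0), min(self.w, x1)))

# One bitmap tracks the pending image layer, another the pending text layer.
image_layer = OccupancyBitmap(80, 60)
image_layer.mark_rect(10, 10, 40, 30)             # a polygon was drawn here
print(image_layer.overlaps_rect(35, 25, 60, 40))  # True: text would collide
print(image_layer.overlaps_rect(50, 40, 70, 55))  # False: safe to reorder
```

An overlap result of False is what licenses the reordering freedom described below: the element can go on either layer without changing the rendered page.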

There’s an interesting fact about this approach: as long as the things drawn so far onto a bitmap and the HTML text fields don’t overlap, we’re free to choose the order in which we draw them.

In other words, it’s not until we’ve drawn the first object intersecting with another layer that we decide which of those two layers to dump out to the page first.

Here’s an example of a document page being rendered step by step:

The background image is put on the lowermost layer. Notice that the background also contains graphical elements for the equations on this page:

This is the text layer, consisting of normal HTML markup using fonts extracted from the document:

The two layers are combined to produce the final output:

Take a look at the actual technical paper as converted. And of course, if you want to see your own docs htmlified, just upload them to Scribd.com!

—Matthias Kramm

Next: Repolygonizing Fonts

Facing Fonts in HTML

This is the first of a four-part series on the technology behind Scribd’s HTML viewing experience. You might like to read part 2, “The Perils of Stacking” and part 3, “Repolygonizing Fonts,” if you haven’t already. Part 4, “Plan B: Font Fallbacks” is coming soon.

PDF to HTML converters have been around for ages. They take a document, analyze its structure, and produce (with the help of some images) an HTML document that has roughly the same text at roughly the same position. This is fun to play with, but not really useful for practical applications since fonts, layout and style are often very different from the original.

With the new Scribd HTML document conversion, we aimed for a higher goal: Let’s create a version of a PDF file that looks exactly the same, but is also completely valid HTML. It turns out that since all major browsers now support @font-face, this is actually possible.

Encoding documents in this way has numerous advantages: no proprietary plugins like Flash or Silverlight are required to view documents; we take full advantage of built-in browser functionality like zooming, searching, text selection, etc.; state-of-the-art embedded devices are supported out of the box; and even on older browsers it degrades gracefully (to HTML text with built-in fonts).

So let’s back up a bit and review what @font-face is:

Font face is a way of making a browser use a custom TrueType font for text on a web page. For example, this text uses the Scrivano typeface:


Try selecting this text and right-clicking to assure yourself that this is indeed just a standard browser element.

This is accomplished by introducing a @font-face css declaration in the page style block:

@font-face {font-family: Scrivano;
            src: url("http://someserver.com/scrivano.ttf");}

Now you can use the “Scrivano” font family as if it were a built-in font. Before we dive into how one can use this for converting documents to HTML, there’s a caveat: in order for @font-face to really work, we need to store three font files: one for Internet Explorer (.eot), one for embedded devices (.svg) and one for Firefox, Safari, Chrome et al. (.ttf). Once this is done, however, @font-face works in pretty much every modern browser, including the ones that come with mobile phones (iPhone and Android). So this is how the above looks, fully fledged:

@font-face {
  font-family: 'Scrivano';
  src: url('scrivano.eot');
  src: local('\u263a'),
       url('scrivano.ttf') format('truetype'),
       url('scrivano.svg#Scrivano') format('svg');
}

(For a detailed discussion about the \u263a hack in there, see Paul Irish’s bulletproof font-face syntax)

Now, let’s talk about documents again. As in most documents, text is the predominant element, making @font-face invaluable for transcoding PDF files to @font-face-enabled HTML. Of course, the fonts we need for displaying HTML text as closely to the original as possible have to be constructed from scratch: PDF documents store fonts in a myriad of formats (Type1, Type3, OpenType etc.), none of which is directly usable as a web font. Additionally, the same font may be used in different places with different encodings or even different transforms.

I don’t want to bore anyone with the specifics of decoding an arbitrary PDF font and building a TTF, EOT and SVG out of it. However, the transformed font variant is actually fun to talk about. Diagonal text in documents is surprisingly common; let’s look at, for example, this figure with diagonal axis descriptors, from a document about fractals:

How do you encode the diagonal text of this document in an HTML page?

Short of using element transformations (-moz-transform, DXImageTransform etc.) which we found to be rather impractical, we encode the above HTML with a custom font created by transforming the original font. Here’s how our generated font looks in FontForge:

From the font screenshot above you can also see that we reduce fonts to only the characters actually used in the document; that saves space and network bandwidth. Usually, fonts in PDFs are already subsetted, so this is not always necessary.

Naturally, for fonts with diagonal characters, every character needs to be offset to a different vertical position (we encode fonts as left-to-right). In fact, this is basically how other HTML converters work: they place every single character on the page using a div with position:absolute:

<!-- crude pdf to html conversion -->
<div style="position:absolute;top:237px;left:250px;">H</div>
<div style="position:absolute;top:237px;left:264px;">e</div>
<div style="position:absolute;top:237px;left:271px;">l</div>
<!-- etc. -->

At Scribd, we invested a lot of time in optimizing this, to the degree that we can now convert almost all documents to “nice” HTML markup. We detect character spacing, line heights, paragraphs, justification and a lot of other attributes of the input document that can be encoded natively in the HTML. So a PDF document uploaded to Scribd may, in its HTML version, look like this (style attributes omitted for legibility):

Original document:

HTML version:

  <span>domain block is in a different image than the range block), as</span>
  <span>opposed to mappings which stay in the image (domain block</span>
  <span>and range block are in the same image) - also see Fig. 5.</span>
  <span>It's also possible to, instead of counting the number of</span>

Together with <img> tags for graphic elements on pages, we can now represent every PDF document in HTML while preserving fonts, layout and style, with text selectability, searchability, and making full use of the optimized rendering engines built into browsers.

Want to see it in action? Check out this technical paper, browse our featured documents or upload your own!

-Matthias Kramm

Next: The Perils of Stacking