Facing Fonts in HTML

This is the first of a four-part series on the technology behind Scribd’s HTML viewing experience. You might like to read part 2, “The Perils of Stacking” and part 3, “Repolygonizing Fonts,” if you haven’t already. Part 4, “Plan B: Font Fallbacks” is coming soon.

PDF to HTML converters have been around for ages. They take a document, analyze its structure, and produce (with the help of some images) an HTML document that has roughly the same text at roughly the same position. This is fun to play with, but not really useful for practical applications since fonts, layout and style are often very different from the original.

With the new Scribd HTML document conversion, we aimed for a higher goal: Let’s create a version of a PDF file that looks exactly the same, but is also completely valid HTML. It turns out that since all major browsers now support @font-face, this is actually possible.

Encoding documents in this way has numerous advantages: no proprietary plugins like Flash or Silverlight are required to view documents; we take full advantage of built-in browser functionality like zooming, searching, text selection, etc.; state-of-the-art embedded devices are supported out of the box; and even on older browsers it degrades gracefully (to HTML text with built-in fonts).

So let’s back up a bit and review what @font-face is:

Font face is a way of making a browser use a custom TrueType font for text on a web page. For example, this text uses the Scrivano typeface:


Try selecting this text and right-clicking to assure yourself that this is indeed just a standard browser element.

This is accomplished by introducing a @font-face CSS declaration in the page's style block:

@font-face {font-family: Scrivano;
            src: url("http://someserver.com/scrivano.ttf");}

Now you can use the “Scrivano” font family as if it were a built-in font. Before we dive into how one can use this for converting documents to HTML, there’s a caveat: in order for @font-face to really work, we need to store three font files: one for Internet Explorer (.eot), one for embedded devices (.svg), and one for Firefox, Safari, Chrome et al. (.ttf). Once this is done, however, @font-face works in pretty much every modern browser, including the ones that come with mobile phones (iPhone and Android). So this is how the above looks, fully fledged:

@font-face {
  font-family: 'Scrivano';
  src: url('scrivano.eot');
  src: url('scrivano.svg') format('svg');
  src: local('\u263a'), url('scrivano.ttf') format('truetype');
}

(For a detailed discussion about the \u263a hack in there, see Paul Irish’s bulletproof font-face syntax)
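Since a converter has to emit one such rule per font, it helps to generate it. The following is only a minimal sketch of that idea in Python, not Scribd's actual code: the function name `font_face_rule` and the assumption that all three files share one base URL are made up for illustration.

```python
def font_face_rule(family: str, base_url: str) -> str:
    """Build a @font-face rule pointing at .eot, .svg and .ttf variants.

    Assumption (hypothetical): the three font files live at the same
    base URL and differ only in their extension.
    """
    lines = [
        "@font-face {",
        "  font-family: '" + family + "';",
        # Internet Explorer reads the .eot file.
        "  src: url('" + base_url + ".eot');",
        # Embedded devices read the .svg file.
        "  src: url('" + base_url + ".svg') format('svg');",
        # Firefox, Safari, Chrome et al. read the .ttf file; the local()
        # entry mirrors the \u263a hack from the rule shown above.
        "  src: local('\\u263a'), url('" + base_url + ".ttf') format('truetype');",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(font_face_rule("Scrivano", "scrivano"))
```

Calling `font_face_rule("Scrivano", "scrivano")` reproduces the rule shown above, so the same helper could be called once for every font extracted from a document.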

Now, let’s talk about documents again. As in most documents, text is the predominant element, making @font-face invaluable for transcoding PDF files to @font-face-enabled HTML. Of course, the fonts we need for displaying HTML text as closely to the original as possible need to be constructed from scratch: PDF documents store fonts in a myriad of formats (Type1, Type3, OpenType etc.), none of which is directly usable as a web font. Additionally, the same font may be used in different places with different encodings or even different transforms.

I don’t want to bore anyone with the specifics of decoding an arbitrary PDF font and building TTF, EOT and SVG files out of it. However, the transformed font variant is actually fun to talk about. Diagonal text in documents is surprisingly common; let’s look at, for example, this figure with diagonal axis descriptors, from a document about fractals:

How do you encode the diagonal text in this document in an HTML page?

Short of using element transformations (-moz-transform, DXImageTransform etc.), which we found to be rather impractical, we encode the above text using a custom font created by transforming the original font. Here’s how our generated font looks in FontForge:

From the above font screenshot you may also notice that we reduce fonts to only the characters that are actually used in the document; that helps save space and network bandwidth. Usually, fonts in PDFs are already reduced, so this is not always necessary.
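To make the idea of a transformed, reduced font more concrete, here is a rough Python sketch. It assumes a deliberately simplified glyph model that is not from the original pipeline: each glyph is just a list of (x, y) outline points keyed by its character, whereas real outlines consist of several contours with curve control points. The names `rotate_point` and `transformed_subset` are hypothetical.

```python
import math


def rotate_point(x: float, y: float, angle_deg: float) -> tuple[float, float]:
    """Rotate one outline point counter-clockwise around the origin."""
    a = math.radians(angle_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


def transformed_subset(glyphs: dict[str, list[tuple[float, float]]],
                       text: str,
                       angle_deg: float) -> dict[str, list[tuple[float, float]]]:
    """Keep only the glyphs used in `text` and rotate their outlines.

    `glyphs` maps a character to its outline points (simplified,
    hypothetical representation of a font).
    """
    used = set(text)
    return {
        ch: [rotate_point(x, y, angle_deg) for x, y in outline]
        for ch, outline in glyphs.items()
        if ch in used
    }


if __name__ == "__main__":
    font = {"A": [(0.0, 0.0), (1.0, 0.0)], "B": [(0.0, 1.0)]}
    print(transformed_subset(font, "A", 45.0))
```

The result would still have to be written out as TTF, EOT and SVG files, which this sketch leaves out entirely.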

Naturally, for fonts with diagonal characters every character needs to be offset to a different vertical position (we encode fonts as left-to-right). In fact, this is how other HTML converters basically work: they place every single character on the page using a div with position:absolute:

<!-- crude pdf to html conversion -->
<div style="position:absolute;top:237px;left:250px;">H</div>
<div style="position:absolute;top:237px;left:264px;">e</div>
<div style="position:absolute;top:237px;left:271px;">l</div>
<!-- etc. -->
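The per-character placement described above, including the vertical offset that diagonal text needs, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any converter's real code: the function name `place_characters`, the list of advance widths in pixels, and the rounding to whole pixels are all made up for the example.

```python
import math
from html import escape


def place_characters(text: str, advances: list[float],
                     left: float, top: float,
                     angle_deg: float = 0.0) -> list[str]:
    """Emit one absolutely positioned div per character.

    `advances` holds the (hypothetical) advance width of each character
    in pixels; `angle_deg` is the counter-clockwise baseline angle.
    """
    a = math.radians(angle_deg)
    divs = []
    pen = 0.0  # distance travelled along the baseline so far
    for ch, advance in zip(text, advances):
        x = left + pen * math.cos(a)
        y = top - pen * math.sin(a)  # screen y grows downward
        divs.append(
            f'<div style="position:absolute;top:{round(y)}px;'
            f'left:{round(x)}px;">{escape(ch)}</div>'
        )
        pen += advance
    return divs


if __name__ == "__main__":
    for div in place_characters("Hel", [14, 7, 7], left=250, top=237):
        print(div)
```

With an angle of 0 and the advances 14, 7, 7 this yields the three divs from the crude example; with a non-zero angle every character ends up at a different `top` value, which is exactly why this markup gets large so quickly.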

At Scribd, we invested a lot of time in optimizing this, to the degree that we can now convert almost all documents to “nice” HTML markup. We detect character spacing, line heights, paragraphs, justification and a lot of other attributes of the input document that can be encoded natively in the HTML. So a PDF document uploaded to Scribd may, in its HTML version, look like this (style attributes omitted for legibility):

Original document:

HTML version:

<p>
  <span>domain block is in a different image than the range block), as</span>
  <span>opposed to mappings which stay in the image (domain block</span>
  <span>and range block are in the same image) - also see Fig. 5.</span>
  <span>It's also possible to, instead of counting the number of</span>
</p>
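As a toy illustration of how individually positioned characters might be merged into markup like this, the Python sketch below groups characters that share (roughly) the same baseline into one span per line. It is not Scribd's detection logic: the input format of `(character, x, baseline_y)` tuples, the `tolerance` parameter and the function name `chars_to_paragraph` are assumptions, and spaces are expected to be present in the input already.

```python
from html import escape


def chars_to_paragraph(chars: list[tuple[str, float, float]],
                       tolerance: float = 2.0) -> str:
    """Group positioned characters into one <span> per text line.

    `chars` is a list of (character, x, baseline_y) tuples. Characters
    whose baseline lies within `tolerance` of the first character seen
    on a line are treated as belonging to that line.
    """
    lines: list[tuple[float, list[tuple[float, str]]]] = []
    # Visit characters top to bottom so lines come out in reading order.
    for ch, x, y in sorted(chars, key=lambda c: (c[2], c[1])):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, ch))
        else:
            lines.append((y, [(x, ch)]))

    spans = []
    for _, items in lines:
        # Within a line, order characters left to right.
        text = "".join(ch for _, ch in sorted(items))
        spans.append(f"  <span>{escape(text, quote=False)}</span>")
    return "\n".join(["<p>", *spans, "</p>"])


if __name__ == "__main__":
    sample = [("i", 10, 100), ("H", 0, 100.5), ("o", 0, 120), ("k", 8, 120)]
    print(chars_to_paragraph(sample))
```

A real converter would additionally have to infer word spacing, line heights and justification, as described above, and carry them in the style attributes that are omitted here.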

Together with <img> tags for graphic elements on pages, we can now represent every PDF document in HTML while preserving fonts, layout and style, with text selectability, searchability, and making full use of the optimized rendering engines built into browsers.

Want to see it in action? Check out this technical paper, browse our featured documents or upload your own!

-Matthias Kramm

Next: The Perils of Stacking

47 responses to “Facing Fonts in HTML”

  1. Pingback: How Scribd’s HTML document reader works (part 1)

  2. Are there licensing issues in regards to fonts being generated from the PDF? Or is this issue avoided by the creation of the PDF in the first place?

  3. Dan

    Is this PDF to HTML tech going to be included in your API now or anytime down the road?

  4. The generation of rotated fonts is quite clever–thanks for sharing!

    Have you found that font kerning gets mucked up with that technique, though?

    Kudos to your adoption of html5!

    • matthiaskramm

      Thanks for the positive feedback! As for kerning: we do reposition elements in the HTML if we detect kerning-caused negative character spacing (i.e. glyphs being closer together than what would be dictated by their advance values) in the original document. There’s a tradeoff with regard to HTML size, though.

  5. Pingback: Ajaxian » Scribd: Font face trickery and more

  6. mario

    The one thing I do not get about this PDF to HTML conversion is, why go to that at all? What’s wrong with just serving the PDF? It’s not like there are no PDF viewers on any platform. Evince does quite a good job, and it’s faster than the flash thingy you have on scribd. It will probably still be faster than having my future HTML5 browser load a dozen fonts and svg resources..


  7. With Epiphany webkit, firefox and Opera 10.10 I only see blank pages on http://www.scribd.com/documents/5/Image-Cluster-Compression

  8. Maybe that’s because my system color scheme has white text on a black background. Maybe the text is white on white in your pages, but I don’t know because I can’t select any text in the pages.

  9. kl

    Please don’t rely on proprietary webkit and moz extensions – it screws up in Opera, even though Opera *does* support CSS transformations.

    Besides, using not-yet-standard CSS transformations just for image zoom is overkill — image scaling is easy to do in plain ‘ol 1993 HTML.

  10. Pingback: Top Posts — WordPress.com

  11. Is there any chance it gets open-sourced?
    It would be nice to add LaTeX equations to my website instead of embedding PNGs.

  12. Pingback: iPad Links: Wednesday, May 19, 2010 « Mike Cane's iPad Test

  13. Just note that Scrivano’s licence doesn’t allow any installation of these files on a server, nor the creation of derivative font files. Its usage as a webfont is only allowed through the Typekit service.

    The use on your website is OK, but the method that you describe is illegal for this particular typeface. This method can be used with open source fonts, but that is not the case here.

    For more information visit the foundry’s EULA:
    http://www.outrasfontes.com/eula.htm

  14. Pingback: Max’ Lesestoff zum Wochenende | PHP hates me - Der PHP Blog

  15. Pingback: Destillat #45 Web- und Softwareentwicklung | Open Source und Wetware

  16. Josh Feathwood

    Wow, that’s incredible. I’ve been waiting for those things for years.

    It’s a pity that it doesn’t work in Opera.

    Cheers, Josh

  17. mithilesh

    i am mithilesh mishra


