Hackday: Keynote!

At least once a year, we have a Scribd “Hack Day”.
Everybody picks something they really want to work on, and then we hack for 24 hours straight.

It’s always a lot of fun- there are treats around the clock (ice cream deliveries, a Crêpe cart and and a specially stocked snack bar), and if you’re still around (and awake!) during random late-night “raffle hours”, you get a raffle ticket. (Prices were, among others: a Kindle, a jalapeno growing kit, a rip stick, a flying shark)

Our last Hack Day was March 29st 2013, with the following top #5:

  1. Jared’s In-Doc Search (17 votes)
  2. Matthias’ Keynote Conversion (12 votes)
  3. Jesse’s ScribdBugz (10 votes)
  4. Gabriel’s Bounce Analytics (9 votes)
  5. Kevin’s Leap Motion Navigation (6 votes)

We’re proud to announce that we now open sourced some of this work: The Scribd Keynote converter, by now running in production for us, is open sourced and available from http://github.com/scribd/keynote .
Keynote is Apple’s presentation format, and the above project enables you to convert it to PDF on non-Mac systems.

Open sourcing keynote follows in the footsteps of many of our open source contributions- see http://github.com/scribd/ for our other projects.

Happy Hacking!

Compete Against Our Developers!

We think we’ve invented a new game. Imagine a grid with fruit scattered upon it.
game screenshot
There are two robots that start on the same cell. Given simultaneous turns, they decide if they want to move or pick up the fruit from the cell they’re standing on. They are told where all the fruit are, what the score is, and where both bots are every turn. The objective of the game is to win as many fruit groups as they can. To win a fruit group, they need to have picked up the most of that type of fruit. Ties can happen when both bots decide to pick up the same piece of fruit on the same turn.

Our developers have created some bots and we challenge you to create a bot to beat them. Bots are written in Javascript. Our backend servers will throw each bot into a sandbox. When we are ready to begin a game, we notify the bots and allow them to access information about the board. At each turn, we call their make_move functions to find out what they want to do. (Details can be found in our bot api.)

To make development easier, we put up a little local development framework on github. Here, we give you a little starting bot that randomly chooses somewhere to go, but always picks up a piece of fruit if he’s standing on it. We also mocked out the server in way to allow you to play against our “simplebot”. “simplebot” is really just a breadth-first search. It’s actually written recursively, though that seems to be okay for the given time and memory contraints given that boards can at most be 15×15 and have 25 items.

The bots written by our developers employ a number of different strategies. “fruitwalk-ab” tries to predict the opponent’s moves and does some minimax with alpha-beta pruning. “Blender” and “Flexo” mostly ignore the opponent and go by a number of heuristics trying to decide the best piece of fruit to go for at a given time. Some of our other developers prefer to remain hush-hush on how their bots are written.

We hope you enjoy our game; we’ve certainly enjoyed writing it. Read more about it here.

Why zooming on mobile is broken (and how to fix it)

Traditional zooming

In a standard PDF viewer, suppose you’re reading a two-column document and you zoom into the left column. Now suppose that the font size is still to small to read comfortably. You zoom in further:

What happens is that even though the font is now the size you want, it also cuts off the left and right half of the text.

Scribd’s new reflow zoom

With the new Scribd Android reader, what happens instead is this:

As soon as the left and right half of the text hit the border, the app starts ‘reflowing’ the text, nicely matching it to the screen size. Essentially, the document has been reformatted into a one-column document with no pagebreaks for mobile reading.

For a clearer understanding of what this means, please watch the video at the beginning of this post.

How it works

In order to render a ‘reflowed’ version of the document text, we have to analyze the document beforehand (we actually do this offline, on our servers).

In particular, we have to:

  1. Analyze the layout and detect the reading order of the text
  2. Detect and join back words where hyphens were used for line-wrapping
  3. Remove page numbers, headers/footers, table of contents etc.
  4. Interleave images with the text

I’d like to talk about at least two of them right here-

Detecting the reading order of the text

For starters, we need to figure out the reading order of the content on a page. In other words, given a conglomeration of characters on a page, how to “connect the dots” so that all the words and sentences make sense and are in the right order.

Thankfully, PDF tends to store characters in reading order in its content stream.
It doesn’t always (and what to do if it doesn’t is a topic for a whole blog post),
but when it does, determining the reading order is as easy as reading the index of
characters in the page content stream from the PDF.

Detecting hyphenation, and joining back words

Determining whether a hyphen at the end of a line is there because a word was hyphenated, or whether it’s just a so-called em dash is more tricky— especially since not everybody uses the typographically correct version of the em dash (Unicode 0x2014). Consider these example sentences:

The grass would be only rustling in the wind, and 
the pool rippling to the waving of the reeds—
the rattling teacups would change to tinkling sheep-
bells. The Cat grinned when it saw Alice. It looked good-
natured, she thought.

When implementing a algorithm for detecting all these cases, it’s useful to have a dictionary handy, (preferably in all the languages you’re supporting— for Scribd, that’s quite a few.) That allows you to look up that “sheep-bell” is a word, whereas “reedsthe” is not.

It’s even better if the dictionary also stores word probabilities, allowing you to determine that “good-natured” is more probable than “natured”.

Learn more

If you want to try this out for yourself, you can download our implementation
from the Android market.

Right now, we have a choice selection of books and documents that offer this functionality. Soon, we will roll it out to a major percentage of our content.

Matthias Kramm

What Start-Ups Like to See in Resumes from College Students and Entry-Level Candidates

This post was originally posted on jenny.webs.com.

Last week, I represented Scribd at my alma mater Carnegie Mellon’s job fair — the TOC. While the quality of students that we met were incredible beyond what our recruiters had expected, the overall quality of the resume writing was honestly atrocious. I’m sure college seniors feel like they don’t have a lot yet to put on a resume, but there are certainly ways to stand-out. This post is specifically about how to make your resume stand-out to a start-up.

Why a start-up? Start-ups are a lot of fun. The atmosphere is also closer to college life, which makes an easier transition. Working at a start-up is also a really great way to learn a ton since they are generally small with a lot of work and interesting problems. The start-ups I’ve worked at include Overture Technologieswebs.com, and now Scribd. The hours are extremely flexible (anything before 11am is considered “early”), the offices are filled with toys, snacks and drinks, and I really respected the intelligence, abilities, and passion of my co-workers. The last point goes to show why these companies look more for these qualities rather than for a list of skills. This post describes how to showcase these points on your resume.

List Personal Projects
Personal projects are a way to differentiate yourself from other candidates; they show that programming is part of who you are in life, not just your job. Personal projects can be as small and simple as a little game or a tutorial for a new language or framework you went through not because you needed to know it, but because you wanted to learn more about it. We weigh these things heavier than academic or work projects. I was flabbergasted to find that when I asked one student why his personal projects weren’t on his resume, he answered, “They weren’t real projects; I just did them for fun.” That is exactly the reason they should be on your resume! Not everyone programs for fun. We want people for whom programming is fun, not just work.

List Only Academic Projects that Differentiate You
Most companies send alumni back to represent them at college job fairs. Being a Carnegie Mellon School of Computer Science alumna, I could easily recognize all the academic projects listed on various students’ resumes. While “Diffusing a Binary Bomb” is really a great project and in fact the one I like to use as an example of why I thought our homework projects were really well-written, it is something that every CMU computer science student has to do. Meanwhile, projects in elective classes or ones where you must self-design your project helps a potential employer understand what sort of problems interest you. Specifying that it was an elective helps recruiters less familiar with various academic programs parse your resume.

Specify Links that Reference You and Your Work
Especially for college students, it is huge when we see someone who has a github account. Whether the github account shows original code, forked projects, or simply following other projects, it shows not only interest in the industry but also that you are already a part of the community. Links to projects are also useful; we can not only read about your work, we can see it first hand. Even links to twitter accounts or blogs are good as well. Particularly if you are applying to a social media start-up, it is good to see that you are a user of social media tools yourself and already have some domain knowledge.

List your Hobbies
We review so many resumes. Hobbies make us feel like you’re more of a person than just a list of skills and qualifications. It also may help us determine whether you’d be a culture fit with the company.

Realize the Skills List is Not the Most Important Part of Your Resume
While more traditional companies may look for a checklist of skills, this is not the start-up mentality; start-ups look for smart, passionate people who can learn and pick-up anything. I remember being an entry-level candidate and listing skills that I had, but didn’t really want to pursue at a company (like my sysadmin experience). I thought it was better to fill my resume with anything I could do rather than leave it off. As a result of course, I piqued the interest of several companies wanting me to do a role I was not interested in. I like to state it this way to candidates: if you are in the middle of figuring out a problem on Friday and you really have to leave work to go to a friend’s birthday dinner, what sort of problem would make you more likely to want to continue figuring it out over the weekend rather than waiting until Monday to get back to it? I’m not saying don’t like all the various skills you have, but know your passions and be sure to be forthright about what you are passionate about versus what you just “know how to do” or are simply “willing to do”.

The start-up hiring mentality is just different from the traditional hiring mentality; what your parents advise you or what your college teaches may not be applicable here. Passion, intelligence, and love of the industry are what matter. If you are looking to apply to both start-ups and non-start-ups, I actually advise you to make two different resumes that emphasize things differently and make your own choice on what type of company you prefer after you get to see their offices and meet the employees.

If you are interested in working at Scribd (located in San Francisco) or webs.com (located in the DC area), please check out their job pages at http://www.scribd.com/jobs and http://webs.com/Careers/.

FlashHeed: Fixing the Flash Z-index Problem For Ads

Here at Scribd, we’ve moved on from Flash and are embracing HTML5 as the open standard for reading on the web. Unfortunately, the ad industry has not quite caught up yet, as many ads are still flash, and probably will be for some time.

The dreaded problem that most web developers come across is the z-index issue with flash elements. When the wmode param is not set, or is set to window, flash elements will always be on top of your DOM content. No matter what kind of z-index voodoo you attempt, your content will never break through the flash. This is because flash, when in window mode, is actually rendered on a layer above all web content.

There is a lot of chatter about this issue, and the simple solution is to specify the wmode parameter to opaque or transparent. This works when you control and deliver the flash content yourself. However, this is not the case for flash ads.

The majority of flash ads don’t specify a wmode parameter, which will put flash into window mode. Why they don’t specify transparent or opaque is a mystery to me. This is a nightmare for pages that have UI elements that depend on z-index, like dropdown menus and lightboxes. Google even has an article about avoiding these sorts of elements altogether when they are in close proximity to ads.

I personally disagree with the notion of redesigning your UI because of display ordering issues with flash ads. It just doesn’t make sense from a product standpoint. And look! Even YouTube, owned by Google, has z-index flash issues with their own ads:

So, to solve all this, I wrote some javascript that will dynamically add the correct wmode parameter. I call it FlashHeed. You can get it now on the GitHub repo.

It works reliably in all major browsers, and has no dependencies, so feel free to drop it into your Prototype or jQuery dependent website.

The usage is simple: just include the FlashHeed javascript in the head of your page, and call it like so:


FlashHeed.heed();

And you’re done. All the flash ads on the page will now heed to the z-index ordering. No more embarassing lightbox and dropdown menu occlusions.

Under the hood, FlashHeed injects the correct wmode parameter and actually forces the flash to re-render. This is the only reliable way that I’ve found to kick the flash into the correct wmode.

Update 11/14/10:

Note that FlashHeed will not work on flash ads or elements that are embedded inside iframes, due to cross domain policies. Unfortunately, I don’t have a solution for those. If anyone has a suggestion, please comment below.

James Yu, Lead Developer at Scribd

Vanity Profile URLs in Rails

One feature shared by many social networking sites is "vanity" short profile
URLs. My Twitter page could have easily been the RESTfully predictable http://twitter.com/users/riscfuture, but thanks to short profile URLs it is http://twitter.com/riscfuture.

Even Facebook got in the game recently with their "Facebook Usernames" feature. Of course, in classic Facebook style, getting the vanity URL is a multi-step process with an application and the associated land-grab. At Scribd I kept it a little simpler, and I'm assuming you'd like to keep it simple for your Rails website as well.

In order for this system to work, we're going to have to lay down a few ground rules:

  • No user whose username conflicts with a controller name can have a short URL. You can't sign up on Scribd with the username "documents" and prevent anyone from seeing their document list.
  • No user whose username conflicts with another defined route can have a short URL. Remember that the routes file defines named or custom routes and resources, but with the default routes, normal controllers do not need an entry in that file.
  • Users with reserved characters in their names must have these characters escaped or dealt with. If I sign up with the username "foo/bar", that slash can't be left unescaped, or the router will misunderstand the address.
  • Usernames must be case-insensitively unique. Every browser expects scribd.com/foo to be the same as scribd.com/FOO.
  • Any user who cannot be given a short URL for the above reasons must have a fallback URL. This is where you fall back to your less pretty /users/123 URL. (Or perhaps /users/123-foo-bar for SEO purposes.)

Note that it's not enough to simply build a list of your controllers and stick them in a validates_exclusion_of validation. You want to be able to claim new routes for yourself even if users have already signed up with conflicting logins, and gracefully revert those users to a fallback profile URL.

Ultimately the question we need to answer is this: Given a user name, will a vanity URL conflict with an existing route? There are a lot of really hard ways of going about this, many of which will break over time. I opted to go with the a reliable (if somewhat slow) way of doing this: I build a list of known routes, strip them down to their first path component, then build an array of these reserved names. A known route might be, for instance, /documents/:id; its first path component is "documents." Thus, a user whose login is "documents" cannot have a vanity URL.

There are some points to note for this system:

  • You'll get a few false positives. If /documents/:id is a valid route, but /documents is not (say you had no index action), this system would still disallow a user named "documents". You can easily solve this by tweaking the code below, though.
  • No attention is paid to HTTP methods. Theoretically, if you had a route like /upload whose only acceptable method is POST, you could still use GET /upload to refer to a user named "upload". I have intentionally avoided doing this, however; good web design dictates that varying the HTTP method of a request only varies the manner in which you interact with the resource represented by the URL; a single URL should represent the same resource regardless of which method is used in the request.

In order to eke speed out wherever we can, we generate the list of reserved routes once, at launch, and cache it for the lifetime of the process. We do this in a module in lib/:

 
module FancyUrls
  def self.generate_cached_routes
    # Find all routes we have, take the first part (/xxx/) and remove some unwanted ones
    @cached_routes = ActionController::Routing::Routes.routes.map do |route|
      segs = route.segments.inject("") { |str, s| str << s.to_s }
      segs.sub! /^\/(.*?)\/.*$/, '\\1'
 
      # Some routes accept a :format parameter (ratings.:format).
      segs.sub! /\.:format$/, ''
      segs
    end
 
    # All possible controllers for /:controller/:action/:id route
    @cached_routes += ActionController::Routing.possible_controllers.map do |c|
      # Use only the first path component for controllers with multiple path components
      c.sub /^(.*?)\/.*$/, '\\1'
    end
    @cached_routes.uniq!
    # Remove routes whose first path component is a variable or wildcard
    @cached_routes.reject! { |route| route.starts_with?(':') or route.starts_with?('*') }
    # Remove the root route.
    @cached_routes.delete '/'
  end
 
  def self.cached_routes
    @cached_routes
  end
end

The top method combines two arrays: the first, a list of routes from the defined routes, and the second, a list of the app's controllers. It then filters out some non-applicable routes and stores the list in an instance variable. The list consists of only the first path component of a route.

The method is called generate_cached_routes because it's called when the server process starts, as part of the environment.rb file. The cached results are accessed with the cached_routes method.

So given this method, how do we test if a user is eligible for URL "vanitization?" It's simple:

 
module FancyUrls
  def user_name_valid_for_short_url?(login)
    not FancyUrls.cached_routes.include?(login)
  end
end

The method is simple: If the user's name is in our list of reserved routes, then it's not valid for URL shortening. Easy peasy.

So now we can reasonably quickly determine whether or not a user gets a vanity profile URL. The next step is to write a user_profile_url method that, given a user, returns either the vanity or full profile URL, as appropriate. To do this, first we will need to add our vanity URLs to the bottom of our routes.rb file:

 
# Install the non-vanity user profile route above the vanity route so people
# who don't have shortenable logins can still have a URL to their profile page.
map.long_profile 'users/:id', :controller => 'users', :action => 'show', :conditions => { :method => :get }
# Install the vanity user profile route above the default routes but below all
# resources.
map.short_profile ':login', :controller => 'users', :action => 'show', :conditions => { :method => :get }
 
# Install the default routes as the lowest priority.
map.connect ':controller/:action/:id'
map.connect ':controller/:action/:id.:format'

What's going on here? Well, at the very bottom of the routes.rb file, we are installing the old Rails standby, the :controller/:action routes. Newer Rails ideology is often to leave these routes out, so adjust your routes file as appropriate. Above those routes, but otherwise of the lowest priority, is our vanity route. Anywhere above that route is our traditional profile URL. (If you have a RESTful users controller, you could of course replace the top route with a resources call.)

At first glance there's a chicken-and-egg problem: We're checking if a user is "vanitizable" using the routes file, but now the routes file contains the vanity URL route. We solved this problem earlier in the generate_cached_routes method:

 
# Remove routes whose first path component is a parameter or wildcard
regular_routes.reject! { |route| route.starts_with?(':') or route.starts_with?('*') }

This line of code filters out any routes that start with a parameter or wildcard, among them the short_profile named route.

With the routes squared away, we move on to the problem of users with logins containing reserved characters. RFC 1738 defines what characters must be encoded in a URL:

Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.

Characters aside from these in usernames must either be encoded or otherwise dealt with. Beyond RFC 1738, we should additionally consider the dollar sign and plus characters ("$" and "+") reserved because they often serve special roles in URLs as well. And because this is a Rails app, we should consider the period (".") reserved as well, as it is used by Rails to indicate the format parameter.

So if a user has any reserved character in his login, what do we do? The obvious solution is to percent-encode it, creating a string like "foo%2Fbar", but some might find that ugly. You could also replace these characters with dashes (or some other stand-in character), creating "foo-bar", but then you run into trouble if someone actually signs up with the username "foo-bar". If you're making a new website, you may opt to disallow these characters from usernames. At Scribd we use a combination of approaches: Some reserved characters (like spaces) are simply not allowed in usernames; others are allowed but by using one of these characters you "give up" your vanity URL, instead using the fallback profile URL.

If you choose to allow certain reserved characters in your usernames, but disallow those people vanity URLs, you will have to modify the user_name_valid_for_short_url? like so:

 
def user_name_valid_for_short_url?(login)
  not (login.include?('.') and FancyUrls.cached_routes.include?(login))
end

This example allows users to have periods in their login, but disallows those users their vanity URLs.

With our vanity routes defined, we can implement the user_profile_url method:

 
module FancyUrls
  def user_profile_url(person, options={})
    login = login_for_user(person)
    raise ArgumentError, "No such user #{person.inspect}" unless login
 
    if user_name_valid_for_short_url?(login) then
      short_profile_url options.merge(:id => login)
    else
      long_profile_url options.merge(:id => person)
    end
  end
 
  private
 
  def login_for_user(user_or_id)
    return (if user_or_id.is_a?(User) then
      user_or_id.login
    else
      Rails.cache.get("login:#{user_or_id}") { User.find_by_id(user_or_id, :select => 'login').try(:login) }
    end)
  end
end

The method is simple enough: We check if the user an have a vanity URL, and if so, we return it; otherwise we return the standard profile URL. I included two small optimizations: We cache the login to avoid database lookups with each method call, and we only select the fields we care about from our users table.

And with that, we've got our URLs! Simply include your module as a helper and call user_profile_url to generate profile URLs as opposed to url_for or the named resource routes or whatever else you might have been using.

We're not quite done yet, though. What happens when a user who haplessly registered the username "ratings" gets screwed because we just launched our ratings feature? With the system I've shown above, the moment we deploy our new feature, any links to that user's profile page would automatically revert to the normal profile URLs.

Good web practice teaches us that when we change the URL for a resource, we should respond with a 301 to any client that tries to access the old URL. Obviously, since the /ratings URL now points to a different web page, we can't do that. Any users who visit external web pages and click a link to that user's profile URL will find themselves on your brand new ratings page. I have implemented no particular fix for this problem, as I believe most websites add very, very few controllers and named routes in comparison to the number of users they have. In other words, the problem is small enough that it's probably not worth solving.

We can solve the flip side of this problem, though: Once a website launches its vanity URL feature, there will still be bunches of external links to the old, longer profile URLs. We can respond to these requests with 301s to inform people that those links are now outdated. This also helps assist with SEO, getting people's new profile URLs on the Google index and getting the old ones off.

We do this by including code in the profile page's controller action to redirect if necessary:

 
class UsersController
  def show
    if params[:id] then
      @user = User.find(params[:id])
      return head(:moved_permanently, :location => user_profile_url(@user)) if user_name_valid_for_short_url?(@user)
    elsif params[:login] then
      @user = User.with_login(params[:login]).first || raise ActiveRecord::RecordNotFound
    else
      raise ActiveRecord::RecordNotFound
    end
  end
end

We have this if statement at the start of our show method because the method is doing double-duty: It responds to both the short_profile and long_profile named routes. In the former, the variadic portion of the URL is stored in the id parameter; in the latter, the login parameter. You could of course opt to dispatch the two URLs to two separate actions; either way, make sure you respond to unnecessarily long profile URLs with a 301.

And with that, you've got your vanity URLs. All it comes down to is a little bit of route-foo and some speed optimizations here and there. The solution here is tailored to the needs of Scribd; I've done my best to outline those needs and how they impacted our code. You should think about how you want to do vanity URLs on your website and take this code as a guide to implementing your own solution. Vanity URLs take a little extra time to implement, but in return you are rewarded with users who are more willing to share their profile pages, improved SEO, and that glowy feeling you get when you increase your site's Web 2.0-ishness.

Plan B: Font Fallbacks

This is the fourth post in our series about Scribd’s HTML5 conversion. The whole process is neatly summarized in the following flowchart:

In our previous post we wrote about how we encode glyph polygons from various document formats into browser fonts. We described how an arbitrary typeface from a document can be sanitized and converted to a so called “@font-face”- a font that browsers can display.

The next challenge the aspiring HTML5 engineer faces is if even after hand-crafting a @font-face (including self-intersecting all the font polygons and throwing together all the required .ttf, .eot and .svg files ), a browser still refuses to render the font. After all, there still are browsers out there that just don’t support custom fonts- most importantly, mobile devices like Google’s Android, or e-book readers like Amazon’s Kindle.

Luckily enough, HTML has for ages had a syntax for specifying font fallbacks in case a @font-face (or, for that matter, a system font) can’t be displayed:

    <style type="text/css">
    .p {
	font-family:
	    myfontface, /* preferred typeface */
	    Arial,      /* fallback 1 */
	    sans-serif; /* fallback 2 */
    }
    </style>

There’s a number of fonts one can always rely on to be available for use as fallback:

Arial (+ bold,italic)
Courier (+ bold,italic)
Georgia (+ bold,italic)
Times (+ bold,italic)
Trebuchet (+ bold,italic)
Verdana (+ bold,italic)
Comic Sans MS (+ bold)

(Yes, that’s right- every single browser out there supports Comic Sans MS)

However, it’s not always entirely trivial to replace a given font with a font from this list. In the worst case (i.e., in the case where an array of polygons for a subset of the font’s glyphs is really all we have- not all documents store proper font names, let alone a complete set of glyphs or font attributes), we don’t really know much about the font face at hand: Is it bold? Is it italic? Does it have serifs? Is it maybe script?

Luckily though, those features can be derived from the font polygons with reasonable effort:

Detecting bold face glyph polygons

The boldness of a typeface is also referred to as the “blackness”. This suggests a simple detection scheme: Find out how much of a given area will be covered by a couple of “representative” glyphs.
The easiest way to do this is to just render the glyph to a tiny bitmap and add up the pixels:

A more precise way is to measure the area of the polygon directly, e.g. using a scanline algorithm.

A mismatch between the area we “expect” e.g. for the letter F at a given size and the actual area is an indicator that we’re dealing with a bold face.

Detecting italic face glyph polygons

A trivial italic typeface (more precisely: an oblique typeface) can be created from a base font by slanting every character slightly to the right. In other words, the following matrix is applied to every character:

(  1   s  )
 0   1 

(With s the horizontal displacement)

In order to find out whether a typeface at hand is slanted in such a way, we use the fact that a normal (non-italic) typeface has a number of vertical edges, for example in the letters L,D,M,N,h,p:

In an italic typeface, these vertical edges “disappear” (become non-vertical):

In other words, we can spot an italic typeface by the relative absence of strict vertical polygon segments, or, more generally, the mean (or median) angle of all non curved segments that are more vertical than horizontal.

Detecting the font family

As for the actual font family, we found that two features are fairly characteristic of a given font:

  • The number of corners (i.e., singularities of the first derivative) of all the glyph outlines
  • The sign of (w1-w2) for all pairs of glyphs with widths w1 and w2

For example, displayed below are the corners of two fonts (Trebuchet and Courier) and the extracted feature string:

Of course, for a font to be mapped against a browser font, we typically only have a subset of n glyphs, hence we can only use the number of corners of a few glyphs.

The second feature, comparing signs of glyph-width differences, gives us more data to work with, as n glyphs generate n*(n-1)/2 differences (entries in the difference matrix, with the lower left half and upper right half symmetric):

Notice that we assume in our detection approach that we actually know what a given glyph represents (i.e., that glyph 76 in a font is supposed to look like an “L”). This is not always the case- we’ll write about that in one of the next posts.

Here’s a random selection of fonts from our documents (left) and the corresponding replacement (right):

Comparison of original font and new font

And, as always, if you want to see the results of these algorithms for yourself, just grab a PDF (or any other document format), upload it to Scribd, and then download it to a (non @font-face-enabled?) mobile device of your choice.

-Matthias Kramm

The Future of Reading is Open

Today, Scribd is changing the way you read documents online. Over the next few weeks and months, Scribd will convert our entire content corpus — tens of millions of documents, books and presentations — into native HTML5 web pages so that we can offer the best online reading experience. Scribd documents in HTML5 load instantly, support native browser functions (zoom, search, scroll, select text), and deliver an impressive reading experience across all browsers and web-enabled devices, without requiring add-ons or plug-ins.

What does this mean? It means that any document in any format (MS Office Docs, Google Docs, PDF, ePub) will now be readable on any device with a browser. This move — nearly six months in the making — represents billions of pages of content that will become part of the fabric of the web. To find out more, read this presentation:

http://www.scribd.com/html5

The Scribd Open Reading Platform: The future of reading is open. Join us.
-Jared Friedman, Scribd CTO and co-founder