Wish: Retina Display Image Magnification Service

I am writing this on a Retina MacBook Pro. Text and anything rendered from vectors looks absolutely, I’ll-have-a-hard-time-going-back-to-non-retina, amazing. But poorly scaled bitmap graphics look terrible. The contrast between the ultra crisp text and the pixelated, fuzzy scaled graphics makes the problem even more glaring and annoying.

For Polyvore’s Mobile HTML site, we had to deal with this problem. We switch to serving 2x dpi images when the client device uses a high dpi display. We could easily do this because our backend image rendering pipeline is able to generate images in different resolutions.

But what about all our other image assets? And what about all the other services out there that can’t serve high dpi versions of their images?

A useful service would be some kind of image scaling proxy that sits in front of your image servers and dynamically decides what image resolution it should serve.

If you have high resolution image sources, serving lower resolution images is not a problem. There are lots of algorithms that can down-scale images and produce great looking output.

The problem arises when you have to magnify images from standard resolution sources. For pixel art images, there are algorithms like hqnx.

NN interpolation

hqnx scaling

For photos and images of natural things, there are good magnification algorithms like fractal based scaling.

Fractal Resizing

But there is another class of hard-to-algorithmically-magnify images: things like icons with gradients, or a mixture of pixel art and natural images. For these, it would be interesting to offer an automated best-effort option but also create a marketplace where graphic designers can redo your image assets in high resolution. The proxy server would use the best option available to it.
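The "best option available" rule could be sketched roughly like this. It is a minimal illustration in Python, not an existing service: the `Variant` type, the source labels ("native", "designer", "upscaled") and `choose_variant` are all hypothetical names, and the ranking of sources is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical ranking of where a variant came from: lower is preferred.
SOURCE_RANK = {"native": 0, "designer": 1, "upscaled": 2}


@dataclass(frozen=True)
class Variant:
    scale: float  # pixel density relative to the standard asset (1.0, 2.0, ...)
    source: str   # "native", "designer" or "upscaled"


def choose_variant(variants: List[Variant], device_pixel_ratio: float) -> Variant:
    """Pick the best available variant for a client's display density."""
    if not variants:
        raise ValueError("no variants available")
    # Prefer variants with enough pixels; down-scaling from them looks good.
    sufficient = [v for v in variants if v.scale >= device_pixel_ratio]
    if sufficient:
        # Smallest sufficient scale first, then the most trusted source.
        return min(sufficient, key=lambda v: (v.scale, SOURCE_RANK[v.source]))
    # Nothing is dense enough: fall back to the densest variant we have.
    return max(variants, key=lambda v: (v.scale, -SOURCE_RANK[v.source]))
```

With a standard asset, an algorithmically upscaled 2x version and a designer-made 2x version on hand, a Retina client would get the designer version, and a standard display would get the original. When only the standard asset exists, the proxy gets it back and would have to magnify it on the fly.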

imgix seems to be headed in this direction. Services like CloudFlare, Strange Loop and InStart could also offer this type of service.


Credit Rolls for Software

Software should have credit rolls the way films do.

At the end of every film, there are the credit rolls that acknowledge the work of everyone who was involved in the making of the film:

Star Wars Episode IV Credit Roll

Why not the same for software? In the case of films, all sorts of people get credit for their contributions but in most software and software services, we don’t know the contributors and usually we can’t even find out. I use tons of online services but I have no idea who the true contributors really are. At most, if the service is very popular, I might know the name of the founders (Larry/Sergey, Jerry/David, Adam D’Angelo, …).

This opacity is a huge shame because people who write software should rightfully be recognized for their work.

It is even technically possible to produce a credit roll for the software that we use. Here is a version control log (svn blame) showing the latest contributor to each line of code in a Polyvore js function:

  8358      kwyee // Monitor calls f every interval ms.  
  8358      kwyee // It fires a change event when f() changes.
  8482      kwyee // f() is passed to the trigger event handler.
  2073      pasha function Monitor(f, interval) {
  2623      pasha     var curr = f();
  2073      pasha     if (!interval) { interval = 100; }
  8656      kwyee     this.check = function() {
  2073      pasha         var now = f();
  2073      pasha         if (now != curr) {
  2073      pasha             curr = now;
  2073      pasha             Event.trigger(this, 'change', curr);
  2073      pasha         }
  8656      kwyee         return now;
  8656      kwyee     };
 17189      pasha     this.timer = new Interval(interval, this.check, this);
  2073      pasha }

It is possible to take this information and instrument the code such that each execution keeps track of all the people who contributed and produce an exhaustive credit roll of sorts. There could be a dialog that shows all the people whose code was executed to make this app / page / REST call etc… possible. I actually remember that old System 7 era Apple software used to have credits in the ‘About X…’ dialogs.
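As a starting point, the blame output alone is enough for a static credit roll. The sketch below, in Python, simply counts lines per author; `credit_roll` is a hypothetical helper, and it assumes each blame line starts with a revision number followed by an author name. A per-execution credit roll would additionally need to record which lines actually ran, which this does not attempt.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def credit_roll(blame_lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count lines per author in `svn blame` output, most lines first."""
    counts = Counter()
    for line in blame_lines:
        parts = line.split(None, 2)  # revision, author, rest of the line
        if len(parts) >= 2 and parts[0].isdigit():
            counts[parts[1]] += 1
    return counts.most_common()


blame = """\
  8358      kwyee // Monitor calls f every interval ms.
  2073      pasha function Monitor(f, interval) {
  2623      pasha     var curr = f();
"""
print(credit_roll(blame.splitlines()))
```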

Polyvore’s Awesome Crawler System

Note: This is cross-posted to Polyvore’s Engineering Blog.

Polyvore’s product index spans millions of items. The bulk of these arrive via our awesome user community who are constantly scouring the web for interesting products using our clipper bookmarklet.

Our clipper is quite smart — it auto-detects the correct price, landing page, etc… We also use a background task to scrape the Facebook open graph meta information for gleaning the correct description and title for each product.  However, this information is essentially a snapshot taken at the time of clipping.  We don’t get notified about price changes and the availability of the product.  Since Polyvore is a social commerce platform, we felt it was important to have up to date price and availability information about the products that are present in our index.

To augment our product index, we started by integrating data feeds directly from retailers that offered them.  But we soon found that these feeds were constantly breaking, out of date and missing useful metadata. So, we decided to write our own crawlers to regularly crawl retail sites and extract accurate, up to date product catalogue data.

We split the problem into two parts: a crawler framework and site specific definitions. The crawler framework’s job is to start at a URL, fetch it, optionally extract product data from it, find new URLs to crawl, and repeat the process until it has visited all parts of a given site.  
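The framework's core loop can be sketched in a few lines. This is an illustration in Python rather than our Perl framework, and every name in it is hypothetical: `fetch_links` stands in for fetching a page and finding its URLs, `scrape` stands in for a site specific definition, and a small in-memory dictionary stands in for a real site.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional


def crawl(
    start: str,
    fetch_links: Callable[[str], Iterable[str]],
    scrape: Callable[[str], Optional[dict]],
) -> List[dict]:
    """Visit every URL reachable from `start` once, collecting scraped products."""
    seen = {start}
    frontier = deque([start])
    products = []
    while frontier:
        url = frontier.popleft()
        product = scrape(url)  # site specific; None for non-product pages
        if product is not None:
            products.append(product)
        for link in fetch_links(url):  # framework: discover new URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return products


# A tiny in-memory "site" standing in for real HTTP fetches.
SITE: Dict[str, List[str]] = {
    "/": ["/socks", "/about"],
    "/socks": ["/socks/red", "/socks/blue", "/"],
    "/about": ["/"],
    "/socks/red": [],
    "/socks/blue": ["/socks/red"],
}

found = crawl(
    "/",
    fetch_links=lambda url: SITE.get(url, []),
    scrape=lambda url: {"url": url} if url.startswith("/socks/") else None,
)
```

The `seen` set is what keeps the crawl from revisiting pages that link back to each other.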

The site specific definitions tell the crawler where to start and how to extract information from a subset of the pages that have been crawled.  We outsourced writing these definitions to an external team.

Here is a sample crawler definition:

my $scraper = new Polyvore::Scraper({
  start => 'http://www.happysocks.com/us/',

  # invoke the scraper whenever the URL matches this pattern
  scrape_re => qr{\/us\/[a-z].*},

  # declare what needs to be extracted using CSS or XPATH expressions
  scraper => scraper {
    process 'div#content h1',            'title' => \&html2text;
    process 'div.product_container img', 'imgurl[]'   => '@src';
    process 'div#product_properties h2', 'price'      => 'TEXT';
    process 'div.size span.single',      'sizes[]'    => 'TEXT';
    process 'select#form_size option',   'sizes[]'    => 'TEXT';
    process 'span.sold_out_big',         'outofstock' => 'TEXT';
  },
});

The crawler framework takes care of the rest.  It spins up EC2 instances as needed, deploys crawlers, performs the crawls, monitors the health of each crawler (using our stats collection system) and automatically opens trouble tickets for our outsourced team whenever it detects an issue.


Extracting information from HTML pages

Anyone who has ever dealt with extracting structured information from HTML pages knows that it is a total pain in the ass.  It is tedious to specify what elements of the page content you want to extract.  Most definitions are very fragile and susceptible to slight site changes or the presence or absence of additional page content.  Even though we were outsourcing this part of the process, we still wanted to make it easier. Fortunately, we found Tatsuhiko Miyagawa (@miyagawa)’s amazing Web::Scraper.  It allows you to declare what you want extracted using XPATH or CSS selectors (internally translated to XPATH expressions).  Declarative systems make life a lot easier for developers because you are letting the machine work out what needs to happen to satisfy your declarations.  We have also found that most sites are fairly stable in their CSS structure and therefore our crawlers are less fragile.

Infinite Crawl Loops

Writing generic crawlers is hard.  Your crawler can get stuck in infinite loops because of dynamically generated sites that offer infinite combinations of pagination / sort / filter permutations.  As a safeguard against this, we adopted the following strategy: As we crawl and discover new URLs to crawl, we always prioritize processing URLs that need to be scraped (typically product detail pages).  We also keep track of how many pages we have crawled since the last time we encountered a detail page.  If we have been crawling for a while (say 5000 pages) and have not encountered a new product detail page, we assume we are stuck in a loop and abort.
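Here is a rough sketch of that safeguard, in Python for illustration. The names `guarded_crawl`, `CrawlLoopSuspected`, `fetch_links` and `is_detail` are hypothetical, and a priority heap is just one way to make detail pages jump the queue.

```python
import heapq
from typing import Callable, Iterable, Set


class CrawlLoopSuspected(Exception):
    """Raised when too many pages go by without a new detail page."""


def guarded_crawl(
    start: str,
    fetch_links: Callable[[str], Iterable[str]],
    is_detail: Callable[[str], bool],
    max_pages_without_detail: int = 5000,
) -> Set[str]:
    """Crawl from `start`, detail pages first; abort if none turn up for too long."""
    seen = {start}
    # Heap entries are (priority, sequence, url); priority 0 (detail) pops before 1.
    heap = [(0 if is_detail(start) else 1, 0, start)]
    sequence = 1
    details = set()
    pages_since_detail = 0
    while heap:
        _, _, url = heapq.heappop(heap)
        if is_detail(url):
            details.add(url)
            pages_since_detail = 0
        else:
            pages_since_detail += 1
            if pages_since_detail >= max_pages_without_detail:
                raise CrawlLoopSuspected(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (0 if is_detail(link) else 1, sequence, link))
                sequence += 1
    return details
```

A site with endless pagination never runs out of new URLs, so the `seen` set alone would not stop the crawl; the counter is what eventually raises `CrawlLoopSuspected`.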

Error Detection

We are now crawling hundreds of sites.  When you are crawling that many sites, there are always a few crawlers that are misbehaving.  It is similar to managing a data center with thousands of machines.  Probabilistically, something is going to be broken at any given moment.  We value our time way too much to spend it baby-sitting hundreds of crawlers and looking for breakage.  So, using our stats system we built a statistical model of what the results from a healthy crawler look like.  Then, we compare stats about each completed crawl against our model.  If they are out of whack, we know there is a problem.  Our system automatically shoots off a trouble ticket to the crawler team for further investigation.
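To give a flavor of what "out of whack" can mean, here is one simple possibility, sketched in Python: compare each metric from the latest crawl against the mean and standard deviation of previous crawls. The z-score test, the threshold of 3, the metric names and the numbers are all assumptions made for illustration, not a description of our actual model.

```python
from statistics import mean, stdev
from typing import Dict, List


def find_anomalies(
    history: Dict[str, List[float]],
    latest: Dict[str, float],
    max_z: float = 3.0,
) -> List[str]:
    """Return the names of metrics whose latest value is far from its history."""
    anomalies = []
    for name, values in history.items():
        mu = mean(values)
        sigma = stdev(values)  # needs at least two historical crawls
        value = latest.get(name, 0.0)  # a missing metric counts as zero
        if sigma == 0:
            if value != mu:
                anomalies.append(name)
        elif abs(value - mu) / sigma > max_z:
            anomalies.append(name)
    return anomalies


# Made-up numbers: the latest crawl found far fewer products than usual.
history = {
    "products_found": [980, 1010, 1005, 995, 1000],
    "pages_crawled": [5200, 5100, 5300, 5250, 5150],
}
latest = {"products_found": 120, "pages_crawled": 5180}
print(find_anomalies(history, latest))
```

Any metric returned here would be the trigger for opening a trouble ticket.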

Guaranteed Updates aka Update SLA

Even though our crawlers run continuously and keep our shopping data fresh and up to date, we don’t have a mechanism to guarantee that updates posted to a retailer’s site will make it into the Polyvore index within a given period of time. This guaranteed time to update, or update SLA, is especially important during the holiday season, Black Friday and Cyber Monday, but it is also a useful feature to have year-round. We are considering a couple of options for designing and implementing this feature, but if you have ideas or suggestions we would love to hear from you!


Accurate, up to date data about products is important to providing an awesome user experience. We built a highly scalable, easy to maintain crawler system that enables us to keep our product index updated on an ongoing basis. We made our own lives easier by using declarative tools that are less fragile and also taking the time to develop robust monitoring systems that do not require our constant attention. Creating a crawler framework, combined with crawler rules per site allowed us to keep our design and implementation simple, yet scalable and manageable. Designing for simplicity and efficiency is a principle that guides us in many of the decisions we make in engineering at Polyvore.

Measure Everything

Note: This is cross-posted to Polyvore’s Engineering Blog.

Measuring and acting on stats is an essential part of building successful products. There are many direct and indirect benefits to pervasive measurement and tracking of stats:

  • Accurate, real-time data enables better and faster decisions.
  • It empowers a data-driven culture where ideas can come from anyone — ideas can be easily tested, and the best ones can be chosen based on their merit, instead of pure intuition, or because it’s the HiPPO.
  • Tracking stats and watching them improve in response to our iterations helps teams stay focused and is incredibly motivating.
  • Historical stats are a great way to keep an eye on how code changes affect the health of a product and processes.

Early on in the life of Polyvore, we decided to weave stats into everything we did. One of our engineering philosophies is to invest a fair amount of our resources into making ourselves more efficient. This approach introduces some latency into our projects but this is made up in the future by an increase in overall team bandwidth. Consistent with this philosophy, the first thing we did was to invest in building a system that made it easy to collect stats and made it immediately useful.

We started by identifying different types of stats that people were already collecting in ad hoc ways or wished that they could collect. Based on these requirements, we came up with the following, super simple API:

my $stats = Polyvore::Stats->new({
    name   => 'db_layer',
    roleup => MINUTE,
});

# observe a value for a given key
$stats->observe(45, 'file_size');

# observe the occurrence of events
$stats->add(5, 'facebook_disconnects');

# utility hi-res timing functions that observe elapsed time for a given key

To measure anything, a developer on our team instantiates a stats object with a given name and granularity. Arbitrary stats can then be recorded using this object. Once a measurement is set up and running, we can browse the collected data in a generic web dashboard, graph it, or access it programmatically through an API.
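For readers who want to see what might sit behind such an API, here is a small in-memory sketch in Python. It is not our implementation: the class, the single-key timer methods, the injectable `clock` and the bucket layout are assumptions, loosely modeled on the cnt / min / max / sum columns that show up in our stats output.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple

MINUTE = 60  # roll-up granularity, in seconds


class Stats:
    """Rolls measurements up into per-interval cnt/min/max/sum buckets."""

    def __init__(self, name: str, rollup: int = MINUTE,
                 clock: Callable[[], float] = time.time) -> None:
        self.name = name
        self.rollup = rollup
        self.clock = clock  # injectable so that tests can control time
        self.buckets: Dict[Tuple[str, int], Dict[str, float]] = {}
        self.counters: Dict[Tuple[str, int], int] = defaultdict(int)
        self._timers: Dict[str, float] = {}

    def _bucket_start(self) -> int:
        """Start of the roll-up interval that contains the current time."""
        return int(self.clock() // self.rollup) * self.rollup

    def observe(self, value: float, key: str) -> None:
        """Record one measured value for `key`."""
        slot = (key, self._bucket_start())
        bucket = self.buckets.get(slot)
        if bucket is None:
            self.buckets[slot] = {"cnt": 1, "min": value, "max": value, "sum": value}
        else:
            bucket["cnt"] += 1
            bucket["min"] = min(bucket["min"], value)
            bucket["max"] = max(bucket["max"], value)
            bucket["sum"] += value

    def add(self, count: int, key: str) -> None:
        """Record `count` occurrences of an event."""
        self.counters[(key, self._bucket_start())] += count

    def timer_begin(self, key: str) -> None:
        self._timers[key] = time.perf_counter()

    def timer_end(self, key: str) -> None:
        """Observe the seconds elapsed since the matching timer_begin."""
        self.observe(time.perf_counter() - self._timers.pop(key), key)
```

Keeping only four numbers per key per interval is what makes the roll-up cheap to store, at the cost of losing the individual values.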

Some Examples

Polyvore’s stats collection system is integrated into almost every subsystem (except itself!)

One of the earliest applications was in instrumenting our page generation times. Site performance is very important to us and we needed to profile where we were spending time in generating the response to each request. Tracking the # of calls and the time spent on each call allowed us to identify bottlenecks and to prioritize our time on optimizing the most important ones (via tuning SQL, caching, parallelization, pre-computation, etc). We make all our back-end calls through an abstraction layer and this allowed us to easily instrument our DB reads by collecting stats in just a few places:

sub _read_sql {
    # ($self, $sql, $data, $shard_name and $instance are set up earlier; elided here)

    # retrieve a stats object from a factory
    # (these are reused across persistent server processes)
    my $stats = $instance->stats({ name => 'backend_stats' });

    # find the calling method that is requesting the read so that we can log it.
    # eg: select_user_by_id
    my $caller = Polyvore::Util::first_caller_inside_pkg();

    # begin hi-res timer
    $stats->timer_begin($caller, 'read');

    # execute the query
    my $result = $self->_read($sql, $data, $shard_name);

    # end hi-res timer
    $stats->timer_end($caller, 'read');

    return $result;
}

We did the same thing for our other back-ends — Memcached, Solr and Cassandra.

Collecting this data for each request allows us to do other cool things, like dumping it at the end of every page as an HTML comment, so that we can view source and see what happened during a particular request:

shard ‘read’
shard main read: db17.polyvore.com cnt max min sum
are_contacts 1 0.001073 0.001073 0.001073
count_collection_comments 1 0.001174 0.001174 0.001174
count_collection_favorites 1 0.00403 0.00403 0.00403
… some stats omitted …
select_sponsored_brands_for_collections 1 0.001372 0.001372 0.001372
select_sponsored_hosts_for_collections 1 0.00511 0.00511 0.00511
select_template_collection_basedon 1 0.000991 0.000991 0.000991
select_user_preferences 1 0.000869 0.000869 0.000869
set.handle_counter 25
write_counter 0

This request took 180ms to generate (somewhat average for a typical page) and performed 25 reads, including some memcached hits.

We can also look at historical data in the web UI for each of the calls (past 24 hours):

path avg
select_collection_comments 0.00170376

select_collection_comments on average takes 1.7ms to execute.



We started out by storing the stats archives in MySQL using a very simple auto-generated table schema. This approach got us off the ground quickly and worked reasonably well until the stats system became a victim of its own success. We started to collect stats in every aspect of our system and as the volume of data grew, we ran into scalability issues on the storage system. Given the simple schema (key, timestamp, value) we decided to use Cassandra as the storage system.

Today, we are logging and archiving 35M data points / day from over 100 different subsystems.

Write Performance

We have been using the stats system to collect data in our production environment and very early on it became clear that we could not simply write out all the stats without overwhelming our storage system.

Fortunately, we had already started working on a distributed job queue system. Our stats objects internally buffer the stats and occasionally flush to a RabbitMQ based job queue. The data is picked up by a bank of workers that write it to Cassandra. This approach has allowed us to not worry about the volume of data we are collecting and let the job queue smooth out and manage the write volume. The other benefit is that writing to a queue is essentially async and very fast. Our front end code never has to wait while stats are being flushed.
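The buffer-then-enqueue pattern looks roughly like this. The sketch is in Python and uses stand-ins throughout: an in-process `queue.Queue` plays the role of the RabbitMQ job queue, a plain list plays the role of Cassandra, and `BufferedStats` and `worker` are hypothetical names.

```python
import queue
import threading
from typing import List, Tuple

Point = Tuple[str, float, float]  # (key, timestamp, value)


class BufferedStats:
    """Buffers data points in memory and enqueues them in batches."""

    def __init__(self, jobs: "queue.Queue", flush_every: int = 100) -> None:
        self.jobs = jobs
        self.flush_every = flush_every
        self.buffer: List[Point] = []

    def record(self, key: str, timestamp: float, value: float) -> None:
        self.buffer.append((key, timestamp, value))
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.jobs.put(self.buffer)  # enqueueing returns immediately
            self.buffer = []


def worker(jobs: "queue.Queue", storage: List[Point]) -> None:
    """Drain batches from the queue and write them to storage until told to stop."""
    while True:
        batch = jobs.get()
        if batch is None:  # sentinel: shut down
            break
        storage.extend(batch)  # a real worker would write to the database here


jobs = queue.Queue()
storage: List[Point] = []
thread = threading.Thread(target=worker, args=(jobs, storage))
thread.start()

stats = BufferedStats(jobs, flush_every=3)
for i in range(7):
    stats.record("page_time", float(i), 0.18)
stats.flush()   # push the final partial batch
jobs.put(None)  # stop the worker
thread.join()
```

The code that records stats only ever appends to a list or puts a batch on the queue, so it never waits on the storage system; the worker absorbs the write latency.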

Getting Everyone Onboard

We made it easy, simple and useful to collect stats. But we’ve also made a point of celebrating stats. Having actual data that showed how a metric was improved was very gratifying, and we made a point of encouraging everyone to present their work in the context of the metrics they had improved.

A Work in Progress

As with everything else we do, our stats system is a work in progress.

The first area for improvement is to keep track of how values are distributed. Because of the way we’re rolling up data, we’re only looking at average measurements over a given time period. Looking at averages alone can be misleading because it hides patterns in how the values are distributed. For example, if we’re measuring the performance of a single database select over time in order to spot slow queries (something that we do!), it would be useful to know that 99% run in under 4ms, but that 1% take 200ms, vs an overall average value of 6ms. Knowing that the query is generally fast but slows down dramatically in 1% of cases would suggest a need for a fix, whereas the average of 6ms might seem just fine.
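The arithmetic is easy to check with made-up timings that match the example: 99 queries at about 4ms and one at 200ms. The sketch below is in Python, and `percentile` is a hypothetical nearest-rank helper, not part of our stats system.

```python
import math
from statistics import mean
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Illustrative data only: 99 fast queries and one slow one, in milliseconds.
timings_ms = [4.0] * 99 + [200.0]

average = mean(timings_ms)           # just under 6ms, looks fine
p50 = percentile(timings_ms, 50)     # the typical query
p99 = percentile(timings_ms, 99)     # still fast
slowest = percentile(timings_ms, 100)  # the outlier the average hides
```

The average comes out just under 6ms while the median and 99th percentile stay at 4ms and the slowest query is 200ms. Storing a few percentiles alongside the average would be enough to surface that kind of tail.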

Another interesting possibility for our stats system is integration with Splunk. We’ve recently started using Splunk, an amazing tool for slicing and dicing log data in real time, and it would be interesting to tee our stats off to Splunk for real-time monitoring and ad hoc queries.


Measuring and collecting stats is an essential tool for building successful products. The key to success is to make it easy, immediately useful, ensure you’re making decisions based on the data and to celebrate the results.
