Polyvore’s Awesome Crawler System

Note: This is cross-posted to Polyvore’s Engineering Blog.

Polyvore’s product index spans millions of items. The bulk of these arrive via our awesome user community who are constantly scouring the web for interesting products using our clipper bookmarklet.


Our clipper is quite smart: it auto-detects the correct price, landing page, and other details. We also use a background task to scrape the Facebook Open Graph meta tags to glean the correct description and title for each product. However, this information is essentially a snapshot taken at the time of clipping. We don't get notified about price changes or changes in product availability. Since Polyvore is a social commerce platform, we felt it was important to have up-to-date price and availability information for the products in our index.
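To give a feel for the Open Graph lookup, here is a minimal sketch of pulling og:title and og:description out of a clipped page. This is illustrative only, not our actual background task; it assumes Web::Scraper (introduced later in this post) and the URL is made up.

# Illustrative sketch only: grab Open Graph title/description from a page.
# The URL is hypothetical and this is not our production background task.
use Web::Scraper;
use URI;

my $og = scraper {
    process '//meta[@property="og:title"]',       'title'       => '@content';
    process '//meta[@property="og:description"]', 'description' => '@content';
};

my $meta = $og->scrape( URI->new('http://www.example-retailer.com/product/123') );
print "$meta->{title}: $meta->{description}\n";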



To augment our product index, we started by integrating data feeds directly from retailers that offered them. But we soon found that these feeds were constantly breaking, out of date, and missing useful metadata. So we decided to write our own crawlers to regularly crawl retail sites and extract accurate, up-to-date product catalogue data.



We split the problem into two parts: a crawler framework and site specific definitions. The crawler framework’s job is to start at a URL, fetch it, optionally extract product data from it, find new URLs to crawl, and repeat the process until it has visited all parts of a given site.  

The site-specific definitions tell the crawler where to start and how to extract information from a subset of the crawled pages. We outsourced writing these definitions to an external team.



Here is a sample crawler definition:



my $scraper = new Polyvore::Scraper({
    start => 'http://www.happysocks.com/us/',

    # invoke the scraper whenever the URL matches this pattern
    scrape_re => qr{\/us\/[a-z].*},

    # declare what needs to be extracted using CSS or XPATH expressions
    scraper => scraper {
        process 'div#content h1',            'title'      => \&html2text;
        process 'div.product_container img', 'imgurl[]'   => '@src';
        process 'div#product_properties h2', 'price'      => 'TEXT';
        process 'div.size span.single',      'sizes[]'    => 'TEXT';
        process 'select#form_size option',   'sizes[]'    => 'TEXT';
        process 'span.sold_out_big',         'outofstock' => 'TEXT';
    },
});

$scraper->crawl();



The crawler framework takes care of the rest.  It spins up EC2 instances as needed, deploys crawlers, performs the crawls, monitors the health of each crawler (using our stats collection system) and automatically opens trouble tickets for our outsourced team whenever it detects an issue.
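To give a feel for the shape of the framework, the orchestration loop looks roughly like the sketch below. The helper functions here (provision_instance, deploy, run_crawl, stats_look_healthy, open_trouble_ticket) are hypothetical stand-ins for our internal EC2, deployment, stats, and ticketing glue, not an actual API.

# Rough sketch of the orchestration loop; all helpers are hypothetical
# stand-ins for internal EC2, deployment, stats and ticketing code.
for my $site (@site_definitions) {
    my $instance = provision_instance();        # spin up an EC2 worker as needed
    deploy($instance, $site);                   # push the crawler + site definition

    my $stats = run_crawl($instance, $site);    # crawl the site and collect stats

    # compare this crawl's stats against our model of a healthy crawl
    open_trouble_ticket($site, $stats)
        unless stats_look_healthy($site, $stats);
}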

Challenges

Extracting information from HTML pages

Anyone who has ever dealt with extracting structured information from HTML pages knows that it is a total pain in the ass.  It is tedious to specify which elements of the page content you want to extract.  Most definitions are very fragile and susceptible to slight site changes or the presence or absence of additional page content.  Even though we were outsourcing this part of the process, we still wanted to make it easier. Fortunately, we found Tatsuhiko Miyagawa (@miyagawa)‘s amazing Web::Scraper.  It allows you to declare what you want extracted using XPATH or CSS selectors (internally translated to XPATH expressions).  Declarative systems make life a lot easier for developers because you let the machine map your declarations to whatever needs to happen to satisfy them.  We have also found that most sites are fairly stable in their CSS structure, so our crawlers are less fragile.
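For example, here is what a small, standalone Web::Scraper invocation looks like. The URL and selectors are made up for illustration; the point is that you declare what you want and the module does the extraction.

# A tiny standalone Web::Scraper example; the URL and selectors are made up.
use Web::Scraper;
use URI;

# declaratively describe what to extract: product names, links and prices
my $listing = scraper {
    process 'div.product', 'products[]' => scraper {
        process 'h2 a',       'name'  => 'TEXT', 'url' => '@href';
        process 'span.price', 'price' => 'TEXT';
    };
};

my $result = $listing->scrape( URI->new('http://www.example-shop.com/socks') );

for my $product (@{ $result->{products} || [] }) {
    print "$product->{name}\t$product->{price}\t$product->{url}\n";
}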

Infinite Crawl Loops

Writing generic crawlers is hard.  Your crawler can get stuck in infinite loops on dynamically generated sites that offer endless combinations of pagination, sort, and filter permutations.  As a safeguard, we adopted the following strategy: as we crawl and discover new URLs, we always prioritize processing URLs that need to be scraped (typically product detail pages).  We also keep track of how many pages we have crawled since the last time we encountered a detail page.  If we have been crawling for a while (say 5,000 pages) and have not encountered a new product detail page, we assume we are stuck in a loop and abort.
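In sketch form, the guard looks something like this. The $queue and $self helpers below are hypothetical stand-ins for the crawler internals, not our actual code.

# Sketch of the loop guard; $queue and $self are hypothetical stand-ins.
my $MAX_PAGES_WITHOUT_DETAIL = 5000;
my $pages_since_last_detail  = 0;

while (my $url = $queue->next_url()) {   # detail-page URLs are dequeued first
    my $page = $self->fetch($url);

    if ($url =~ $self->{scrape_re}) {
        $self->scrape_product($page);
        $pages_since_last_detail = 0;    # progress: reset the counter
    }
    else {
        $pages_since_last_detail++;
    }

    $queue->add($_) for $self->extract_links($page);

    # a long stretch with no new product detail pages probably means we are
    # stuck in a pagination/sort/filter loop, so bail out
    die "crawl aborted: likely infinite loop\n"
        if $pages_since_last_detail > $MAX_PAGES_WITHOUT_DETAIL;
}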

Error Detection

We are now crawling hundreds of sites.  When you are crawling that many sites, there are always a few crawlers misbehaving.  It is similar to managing a data center with thousands of machines: probabilistically, something is going to be broken at any given moment.  We value our time way too much to spend it baby-sitting hundreds of crawlers and looking for breakage.  So, using our stats system, we built a statistical model of what the results from a healthy crawler look like.  Then we compare stats about each completed crawl against our model.  If they are out of whack, we know there is a problem, and our system automatically shoots off a trouble ticket to the crawler team for further investigation.
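As a rough illustration of the idea (the metrics, numbers, and three-sigma threshold below are invented for the example and are not our actual model), the health check boils down to comparing a crawl's stats against historical baselines:

# Illustrative only: flag an unhealthy crawl by comparing its stats against
# a per-site historical baseline.  The metrics and numbers are made up.
my %model = (
    products_scraped => { mean => 12_000, stddev => 800 },
    pct_with_price   => { mean => 0.98,   stddev => 0.01 },
);

sub crawl_looks_healthy {
    my ($stats) = @_;    # stats collected from the crawl that just finished
    for my $metric (keys %model) {
        my ($mean, $stddev) = @{ $model{$metric} }{qw(mean stddev)};
        my $z = $stddev ? abs($stats->{$metric} - $mean) / $stddev : 0;
        return 0 if $z > 3;    # far off the baseline: open a trouble ticket
    }
    return 1;
}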

Guaranteed Updates aka Update SLA

Even though our crawlers run continuously and keep our shopping data fresh and up to date, we don't have a mechanism to guarantee that updates posted to a retailer's site will make it into the Polyvore index within a given period of time. This guaranteed time to update, or update SLA, is especially important during the holiday season, Black Friday, and Cyber Monday, but it is also a useful feature to have year-round. We are considering a couple of options for how to design and implement this feature, but if you have ideas or suggestions we would love to hear from you!


Summary

Accurate, up-to-date product data is important to providing an awesome user experience. We built a highly scalable, easy-to-maintain crawler system that enables us to keep our product index updated on an ongoing basis. We made our own lives easier by using declarative tools that are less fragile and by taking the time to develop robust monitoring systems that do not require our constant attention. Creating a crawler framework, combined with per-site crawler rules, allowed us to keep our design and implementation simple, yet scalable and manageable. Designing for simplicity and efficiency is a principle that guides many of the decisions we make in engineering at Polyvore.


Comments

  1. Hi Pasha,
    Great to see Polyvore updating its web data regularly.

    A smart technologist, Anand Rajaram, once told me that if you have site-specific regular expressions for more than 1000 sites, one will break every day! Curious if you guys see the same thing?

    -Shashikant
