Polyvore’s Image Masking (or someone I am really proud of)

Before and after: automatic image background removal

Cindy is a member of Polyvore’s awesome engineering team.  She just put up a post on our eng blog about how she improved Polyvore’s automatic image background removal tool. This level of attention to detail and polish has always made me proud. The change was part of a larger project to enhance the overall image quality on Polyvore and I hope she will get around to writing the entire series of posts.

Under the Hood: How we Mask our Images

No Hack Days at Polyvore

Do Whatever You Want Week

Instead we have: Do Whatever You Want Weeks.

Hack Days are a great idea in theory: they give people who otherwise wouldn't get the chance an opportunity to quickly prototype projects, showcase their ideas, or simply work on something other than their usual workload.

But there are a few problems:

  • A single day is often not enough for realizing a serious idea.
  • People tend to work on whiz-bangy projects that are easy to demo but not very useful.
  • Sometimes, what people need most is not the variety of working on a new project but a complete break from programming.

To fix these problems, we created our own version of Hack Days: Do Whatever You Want Weeks.

DWYWW’s are exactly what the name suggests — a week of doing whatever you want. We chose a week because we felt it was enough time to get something serious done. People can do literally whatever they want — if someone feels what they need most is a week off, they can take it, no strings attached. If they want to code all week, they can do that too.

In practice, we work at a reasonable pace on stuff we really care about. We work on refactoring code, fixing bugs, mocking up new designs and of course creating new features that our community loves.

In some ways, DWYWW’s are not very different from our regular work weeks. And that is the point. Great companies shouldn’t need Hack Days to let their staff work on cool stuff; they create an environment where it can happen every day.

Great Office Work Environments

Note: This is based on an answer to a question on Quora about what makes for a good office space

At Polyvore we made our fair share of mistakes but the office work environment is one of the things we mostly got right.


Lots and lots of plants being installed at our office

When we were starting out, I knew that we’d be spending most of our waking hours there for the foreseeable future and we wanted to make our work environment as great as possible.

Our guiding principle has been that great people deserve to work in a great environment.

Here is a partial list of what we did:

Office Building
  • Modern building
  • Look out the window and you see trees
  • Within walking distance of downtown and train station
Interior
  • Lots of natural light
  • Warm-toned interior lighting, not fridge-like cold fluorescent
  • High ceilings
  • High quality desks / seats
  • Lots and lots of plants
  • Ample space for everyone — not like a sardine can
  • Open layout — although there are people who prefer offices
Equipment
  • A general policy of you can get whatever you need to best do your job
  • X-Large monitors
  • Latest/fast hardware — I hated how Yahoo laptops took 15 minutes to boot up
  • MiFi dongles for commuters
  • Dedicated IT staff to make sure everything just works
Benefits / Perks
  • Health benefits better than most startups’, …
  • CalTrain Go passes
  • Gogo inflight passes
  • Health and fitness allowance
Food
  • Free home cooked catered lunch everyday
  • Organic fruits + snacks + drinks, …
  • Chocolate deliveries from The Chocolate Garage

I am probably missing some things. You can read more about it at Polyvore’s About Page.

There is, of course, the even more important element of human interaction, which is the topic of entirely different posts / threads / books.

Pareto’s Principle at Play in Polyvore

One of Fred Wilson’s great insights is the importance of mobile first design — mobile form factor constraints force design simplicity. This insight remains valid despite recent criticisms of a mobile first, web second strategy.

Distribution of requests across Polyvore’s endpoints

The graph above is a good illustration of this insight. It shows the distribution of requests across Polyvore’s endpoints (I have removed the scales and endpoint names). We compiled this data using Splunk during a recent audit of our services to identify places where we could simplify.

The bulk of Polyvore’s activity is concentrated over a handful of endpoints. The 50th endpoint barely registers on the graph and I am embarrassed to confess that we have 630+ endpoints and only the first 100 are plotted :(

This graph tells me that our product could be far simpler and still deliver the bulk of its utility to most of our users. In fact, this is exactly what is happening in our recently launched iOS app.

The best part about simplifying products is that it allows concentration of effort on improving the parts that are used the most. Going deep is a better long term strategy than going wide.

Polyvore 2012 Infographic

2012 has been another amazing year of growth for Polyvore (20M monthly UV and 2.3x revenue growth) and the team created a great infographic to share it with the world.

On a personal note, when we started out, we never imagined milestones like this. It was always heads down, focused on the next 3 months. Today, however, after watching the company grow year over year, I am convinced that Polyvore will continue to grow far beyond where it is today. It is all thanks to our amazing team and equally amazing community.

Polyvore Infographic

Polyvore’s Awesome Crawler System

Note: This is cross-posted to Polyvore’s Engineering Blog.

Polyvore’s product index spans millions of items. The bulk of these arrive via our awesome user community who are constantly scouring the web for interesting products using our clipper bookmarklet.


Our clipper is quite smart — it auto-detects the correct price, landing page, etc… We also use a background task to scrape the Facebook open graph meta information for gleaning the correct description and title for each product.  However, this information is essentially a snapshot taken at the time of clipping.  We don’t get notified about price changes and the availability of the product.  Since Polyvore is a social commerce platform, we felt it was important to have up to date price and availability information about the products that are present in our index.



To augment our product index, we started by integrating data feeds directly from retailers that offered them.  But we soon found that these feeds were constantly breaking, out of date and missing useful meta data. So, we decided to write our own crawlers to regularly crawl retail sites and extract accurate, up to date product catalogue data.



We split the problem into two parts: a crawler framework and site specific definitions. The crawler framework’s job is to start at a URL, fetch it, optionally extract product data from it, find new URLs to crawl, and repeat the process until it has visited all parts of a given site.  

The site specific definitions tell the crawler where to start and how to extract information from a subset of the pages that have been crawled.  We outsourced writing these definitions to an external team.



Here is a sample crawler definition:



my $scraper = new Polyvore::Scraper({
    start => 'http://www.happysocks.com/us/',

    # invoke the scraper whenever the URL matches this pattern
    scrape_re => qr{\/us\/[a-z].*},

    # declare what needs to be extracted using CSS or XPATH expressions
    scraper => scraper {
        process 'div#content h1',            'title'      => \&html2text;
        process 'div.product_container img', 'imgurl[]'   => '@src';
        process 'div#product_properties h2', 'price'      => 'TEXT';
        process 'div.size span.single',      'sizes[]'    => 'TEXT';
        process 'select#form_size option',   'sizes[]'    => 'TEXT';
        process 'span.sold_out_big',         'outofstock' => 'TEXT';
    }
});


$scraper->crawl();



The crawler framework takes care of the rest.  It spins up EC2 instances as needed, deploys crawlers, performs the crawls, monitors the health of each crawler (using our stats collection system) and automatically opens trouble tickets for our outsourced team whenever it detects an issue.

Challenges

Extracting information from HTML pages

Anyone who has ever dealt with extracting structured information from HTML pages knows that it is a total pain in the ass.  It is tedious to specify which elements of the page content you want to extract, and most definitions are very fragile, susceptible to slight site changes or the presence or absence of additional page content.  Even though we were outsourcing this part of the process, we still wanted to make it easier. Fortunately, we found Tatsuhiko Miyagawa (@miyagawa)‘s amazing Web::Scraper.  It lets you declare what you want extracted using XPATH or CSS selectors (internally translated to XPATH expressions).  Declarative systems make life a lot easier for developers because you let the machine figure out how to satisfy your declarations.  We have also found that most sites are fairly stable in their CSS structure, which makes our crawlers less fragile.
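For those who haven’t used it, here is a minimal, standalone sketch of the Web::Scraper style we rely on. The URL and selectors below are placeholders for illustration, not from one of our real crawlers.

use strict;
use warnings;
use Web::Scraper;
use URI;

# Declare what to extract: the page title and the href of every link.
# CSS selectors are translated into XPATH expressions internally.
my $extractor = scraper {
    process 'title', 'title'   => 'TEXT';
    process 'a',     'links[]' => '@href';
};

# Fetch and scrape a page (placeholder URL).
my $result = $extractor->scrape( URI->new('http://www.example.com/') );

print "Title: $result->{title}\n";
print "Links: ", scalar @{ $result->{links} || [] }, "\n";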

Infinite Crawl Loops

Writing generic crawlers is hard.  Your crawler can get stuck in infinite loops because of dynamically generated sites that offer infinite combinations of pagination / sort / filter permutations.  As a safeguard against this, we adopted the following strategy: As we crawl and discover new URLs to crawl, we always prioritize processing URLs that need to be scraped (typically product detail pages).  We also keep track of how many pages we have crawled since the last time we encountered a detail page.  If we have been crawling for a while (say 5000 pages) and have not encountered a new product detail page, we assume we are stuck in a loop and abort.
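Here is a rough sketch of that safeguard in isolation. The helper functions and the 5000-page threshold are illustrative; the real logic lives inside our crawler framework.

use strict;
use warnings;

# Illustrative threshold: give up after this many pages without a new product.
use constant MAX_PAGES_WITHOUT_PRODUCT => 5000;

my @queue = ('http://www.happysocks.com/us/');   # seed URL
my %seen;                                        # never crawl the same URL twice
my $pages_since_last_product = 0;

while ( my $url = shift @queue ) {
    next if $seen{$url}++;

    my ($product, @discovered) = crawl_page($url);   # hypothetical helper

    if ($product) {
        $pages_since_last_product = 0;               # progress, reset the counter
    }
    elsif ( ++$pages_since_last_product > MAX_PAGES_WITHOUT_PRODUCT ) {
        warn "no new products for a while - assuming an infinite loop, aborting\n";
        last;
    }

    # Prioritize URLs that look like product detail pages.
    unshift @queue, grep {  looks_like_detail_page($_) } @discovered;
    push    @queue, grep { !looks_like_detail_page($_) } @discovered;
}

# Stubs for the hypothetical helpers referenced above.
sub crawl_page             { ... }   # fetch a page, return ($product_or_undef, @new_urls)
sub looks_like_detail_page { ... }   # e.g. does the URL match the site's scrape_re?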

Error Detection

We are now crawling hundreds of sites.  At that scale, there are always a few crawlers misbehaving.  It is similar to managing a data center with thousands of machines: probabilistically, something is going to be broken at any given moment.  We value our time way too much to spend it baby-sitting hundreds of crawlers and looking for breakage.  So, using our stats system, we built a statistical model of what the results from a healthy crawler look like.  Then we compare stats about each completed crawl against our model.  If they are out of whack, we know there is a problem, and our system automatically shoots off a trouble ticket to the crawler team for further investigation.
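The comparison itself is nothing fancy. Here is a sketch of the idea with made-up metrics and thresholds; our real model and ticketing hook are internal.

use strict;
use warnings;

# Hypothetical summary of a just-completed crawl.
my %crawl = (
    products_extracted => 37,
    error_rate         => 0.002,
);

# Hypothetical model built from previous healthy crawls of the same site.
my %model = (
    products_extracted => { mean => 950,   stddev => 120    },
    error_rate         => { mean => 0.001, stddev => 0.0005 },
);

# Flag any metric that is more than three standard deviations from its mean.
my @problems;
for my $metric ( sort keys %model ) {
    my $z = ( $crawl{$metric} - $model{$metric}{mean} ) / $model{$metric}{stddev};
    push @problems, sprintf( '%s is off (z-score %.1f)', $metric, $z )
        if abs($z) > 3;
}

# In production this would open a trouble ticket for the crawler team.
warn "crawl looks unhealthy: @problems\n" if @problems;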

Guaranteed Updates aka Update SLA

Even though our crawlers run continuously and keep our shopping data fresh and up to date, we don’t have a mechanism to guarantee that updates posted to a retailer site will make it into the Polyvore index within a given period of time. This guaranteed time to update, or update SLA, is especially important during the holiday season, Black Friday and Cyber Monday, but it is also a useful feature to have year-round. We are considering a couple of options for how to design and implement this feature, and if you have ideas or suggestions we would love to hear from you!


Summary

Accurate, up to date data about products is important for providing an awesome user experience. We built a highly scalable, easy to maintain crawler system that lets us keep our product index updated on an ongoing basis. We made our own lives easier by using declarative tools that are less fragile and by taking the time to develop robust monitoring systems that do not require our constant attention. Creating a crawler framework combined with per-site crawler rules allowed us to keep our design and implementation simple, yet scalable and manageable. Designing for simplicity and efficiency is a principle that guides many of the decisions we make in engineering at Polyvore.

Measure Everything

Note: This is cross-posted to Polyvore’s Engineering Blog.

Measuring and acting on stats is an essential part of building successful products. There are many direct and indirect benefits to pervasive measurement and tracking of stats:

  • Accurate, real-time data enables better and faster decisions.
  • It empowers a data-driven culture where ideas can come from anyone: ideas can be easily tested, and the best ones chosen on their merit instead of on pure intuition or the HiPPO (highest-paid person's opinion).
  • Tracking stats and watching them improve in response to our iterations helps teams stay focused and is incredibly motivating.
  • Historical stats are a great way to keep an eye on how code changes affect the health of a product and processes.

Early on in the life of Polyvore, we decided to weave stats into everything we did. One of our engineering philosophies is to invest a fair amount of our resources into making ourselves more efficient. This approach introduces some latency into our projects, but it pays for itself later through an increase in overall team bandwidth. Consistent with this philosophy, the first thing we did was invest in a system that made it easy to collect stats and made them immediately useful.

We started by identifying different types of stats that people were already collecting in ad hoc ways or wished that they could collect. Based on these requirements, we came up with the following, super simple API:

my $stats = Polyvore::Stats->new({
    name   => 'db_layer',
    rollup => MINUTE,
});

# observe a value for a given key
$stats->observe(45, 'file_size');

# observe the occurrence of events
$stats->inc('facebook_post');
$stats->add(5, 'facebook_disconnects');

# utility hi-res timing functions that observe elapsed time for a given key
$stats->timer_begin('select_user_by_id');
$stats->timer_end('select_user_by_id');

To measure anything, a developer on our team instantiates a stats object with a given name and granularity, and can then record arbitrary stats using that object. Once the measurement is set up and running, the collected data can be accessed via a generic web dashboard, programmatically through an API, or graphed on a dashboard.

Some Examples

Polyvore’s stats collection system is integrated into almost every subsystem (except itself!).

One of the earliest applications was in instrumenting our page generation times. Site performance is very important to us and we needed to profile where we were spending time in generating the response to each request. Tracking the # of calls and the time spent on each call allowed us to identify bottlenecks and to prioritize our time on optimizing the most important ones (via tuning SQL, caching, parallelization, pre-computation, etc). We make all our back-end calls through an abstraction layer and this allowed us to easily instrument our DB reads by collecting stats in just a few places:

sub _read_sql {
    ...

    # retrieve a stats object from a factory
    # (these are reused across persistent server processes)
    my $stats = $instance->stats({ name => 'backend_stats' });

    # find the calling method that is requesting the read so that we can log it
    # eg: select_user_by_id
    my $caller = Polyvore::Util::first_caller_inside_pkg();

    # begin hi-res timer
    $stats->timer_begin($caller, 'read');

    # execute the query
    my $result = $self->_read($sql, $data, $shard_name);

    # end hi-res timer
    $stats->timer_end($caller, 'read');

    return $result;
}

We did the same thing for our other back-ends — Memcached, Solr and Cassandra.

Collecting this data for each request allows us to do other cool things, like dumping it at the end of every page as an HTML comment, so that we can view source and see what happened during a particular request:

shard 'read'
shard main read: db17.polyvore.com          cnt   max        min        sum
are_contacts                                1     0.001073   0.001073   0.001073
count_collection_comments                   1     0.001174   0.001174   0.001174
count_collection_favorites                  1     0.00403    0.00403    0.00403
... some stats omitted ...
select_sponsored_brands_for_collections     1     0.001372   0.001372   0.001372
select_sponsored_hosts_for_collections      1     0.00511    0.00511    0.00511
select_template_collection_basedon          1     0.000991   0.000991   0.000991
select_user_preferences                     1     0.000869   0.000869   0.000869
set.handle_counter                          25
write_counter                               0

This request took 180ms to generate (somewhat average for a typical page) and performed 25 reads, including some memcached hits.
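Producing that dump is just a matter of serializing the request’s counters before the response goes out. Here is a simplified sketch; the per_request_data accessor is hypothetical and not part of the API shown above.

# Assumes $stats->per_request_data returns a hashref like
# { call_name => { cnt => ..., max => ..., min => ..., sum => ... } }
# collected during the current request (hypothetical accessor).
sub stats_html_comment {
    my ($stats) = @_;
    my $data = $stats->per_request_data;

    my @lines = sprintf( '%-40s %5s %10s %10s %10s', 'call', 'cnt', 'max', 'min', 'sum' );
    for my $call ( sort keys %$data ) {
        my $s = $data->{$call};
        push @lines, sprintf( '%-40s %5d %10.6f %10.6f %10.6f',
            $call, $s->{cnt}, $s->{max}, $s->{min}, $s->{sum} );
    }

    # "--" is not allowed inside an HTML comment, so defang it just in case.
    my $body = join "\n", @lines;
    $body =~ s/--/- -/g;

    return "<!--\n$body\n-->";
}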

We can also look at historical data in the web UI for each of the calls (past 24 hours):

path avg
select_collection_comments 0.00170376

select_collection_comments on average takes 1.7ms to execute.

Challenges

Scalability

We started out by storing the stats archives in MySQL using a very simple auto-generated table schema. This approach got us off the ground quickly and worked reasonably well until the stats system became a victim of its own success. We started to collect stats in every aspect of our system, and as the volume of data grew, we ran into scalability issues with storage. Given the simple schema (key, timestamp, value), we decided to move to Cassandra as the storage system.

Today, we are logging and archiving 35M data points / day from over 100 different subsystems.

Write Performance

We have been using the stats system to collect data in our production environment and very early on it became clear that we could not simply write out all the stats without overwhelming our storage system.

Fortunately, we had already started working on a distributed job queue system. Our stats objects internally buffer the stats and occasionally flush to a RabbitMQ based job queue. The data is picked up by a bank of workers that write it to Cassandra. This approach has allowed us to not worry about the volume of data we are collecting and let the job queue smooth out and manage the write volume. The other benefit is that writing to a queue is essentially async and very fast. Our front end code never has to wait while stats are being flushed.
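In outline, the buffering looks roughly like this. The class name, flush threshold and queue client are placeholders; the real implementation flushes to RabbitMQ, and workers on the other side write the points to Cassandra.

package BufferedStatsSketch;   # illustrative, not the real Polyvore::Stats
use strict;
use warnings;
use JSON::PP qw(encode_json);

use constant FLUSH_EVERY => 500;   # made-up threshold

sub new {
    my ( $class, %args ) = @_;
    # $args{queue} is any object with a publish() method, e.g. a RabbitMQ client.
    return bless { queue => $args{queue}, buffer => [] }, $class;
}

# Record a data point in memory; nothing touches the network yet.
sub observe {
    my ( $self, $value, $key ) = @_;
    push @{ $self->{buffer} }, { key => $key, value => $value, ts => time() };
    $self->flush if @{ $self->{buffer} } >= FLUSH_EVERY;
}

# Hand the whole batch to the job queue in one cheap, effectively async publish;
# workers drain the queue and persist the points.
sub flush {
    my ($self) = @_;
    return unless @{ $self->{buffer} };
    $self->{queue}->publish( encode_json( $self->{buffer} ) );
    $self->{buffer} = [];
}

1;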

Getting Everyone Onboard

We made it easy, simple and useful to collect stats. But we’ve also made a point of celebrating stats. Having actual data that showed how a metric was improved was very gratifying, and we made a point of encouraging everyone to present their work in the context of the metrics they had improved.

A Work in Progress

As with everything else we do, our stats system is a work in progress.

The first area for improvement is keeping track of the distribution of values. Because of the way we’re rolling up data, we’re only looking at average measurements over a given time period. Looking at averages alone can be misleading because it hides patterns in how the values are distributed. For example, if we’re measuring the performance of a single database select over time in order to spot slow queries (something that we do!), it would be useful to know that 99% run in under 4ms but that 1% take 200ms, versus an overall average value of 6ms. Knowing that the query is generally fast but dramatically slows down in 1% of cases would suggest a need for a fix, where the average of 6ms might seem just fine.
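To make the point concrete, here is a tiny, self-contained sketch of the kind of percentile roll-up we would like to add. It is plain Perl with an exaggerated, made-up sample, nothing Polyvore-specific.

use strict;
use warnings;

# Simple rank-based percentile: $p is 0..100, @values are raw observations.
sub percentile {
    my ( $p, @values ) = @_;
    my @sorted = sort { $a <=> $b } @values;
    my $rank   = int( ( $p / 100 ) * @sorted + 0.5 );
    $rank = @sorted if $rank > @sorted;
    $rank = 1       if $rank < 1;
    return $sorted[ $rank - 1 ];
}

# 98 fast queries plus a couple of pathological ones: the average alone looks fine.
my @timings = ( (0.004) x 98, 0.200, 0.200 );

my $avg = 0;
$avg += $_ for @timings;
$avg /= @timings;

printf "avg: %.3fs  p50: %.3fs  p99: %.3fs\n",
    $avg, percentile( 50, @timings ), percentile( 99, @timings );
# prints: avg: 0.008s  p50: 0.004s  p99: 0.200s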

Another interesting possibility for our stats system is integration with Splunk. We’ve recently started using Splunk, an amazing tool for slicing and dicing log data in real time, and it would be interesting to tee our stats off to Splunk for real-time monitoring and ad hoc queries.

Summary

Measuring and collecting stats is an essential tool for building successful products. The key to success is to make it easy and immediately useful, to ensure you’re making decisions based on the data, and to celebrate the results.


Polyvore Jobs

Polyvore Growth Curve

At Polyvore, we strive to build products that delight people.

By design, we keep things simple on the surface and yet go to great lengths to make them run better under the hood.  We prefer simple solutions to technical problems but always try to include novelty where it matters — in our products.

If you care about the same things and enjoy working with like-minded people, drop me a line: pasha [at] polyvore [dot] com.

Polyvore Badges

We recently introduced Polyvore Badges.  We wanted something that reflected each person’s profile on Polyvore and was also simple and streamlined.  I am happy with what we came up with: a strip of colors reflective of one’s recent activity on Polyvore, refreshed once a day.  Enjoy :)