No Hack Days at Polyvore

Instead we have: Do Whatever You Want Weeks.

Hack Days are in theory a great idea: let people who otherwise don’t get a chance quickly prototype projects, showcase their ideas, or simply work on something other than their usual workload.

But there are a few problems:

  • A single day is often not enough for realizing a serious idea.
  • People tend to work on whiz-bangy projects that are easy to demo but not very useful.
  • Sometimes, what people need most is not the variety of working on a new project but a complete break from programming.

To fix these problems, we created our own version of Hack Days: Do Whatever You Want Weeks.

DWYWWs are exactly what the name suggests: a week of doing whatever you want. We chose a week because we felt it was enough time to get something serious done. People can do literally whatever they want. If someone feels what they need most is a week off, they can take it, no strings attached. If they want to code all week, they can do that too.

In practice, we work at a reasonable pace on stuff we really care about. We work on refactoring code, fixing bugs, mocking up new designs and of course creating new features that our community loves.

In some ways, DWYWWs are not very different from our regular work weeks. And that is the point. Great companies shouldn’t need Hack Days to let their staff work on cool stuff; they create an environment where it can happen every day.

Measure Everything

Note: This is cross-posted to Polyvore’s Engineering Blog.

Measuring and acting on stats is an essential part of building successful products. There are many direct and indirect benefits to pervasive measurement and tracking of stats:

  • Accurate, real-time data enables better and faster decisions.
  • It empowers a data-driven culture where ideas can come from anyone; ideas can be easily tested, and the best ones chosen based on their merit rather than on pure intuition or because it’s the HiPPO (highest-paid person’s opinion).
  • Tracking stats and watching them improve in response to our iterations helps teams stay focused and is incredibly motivating.
  • Historical stats are a great way to keep an eye on how code changes affect the health of a product and processes.

Early on in the life of Polyvore, we decided to weave stats into everything we did. One of our engineering philosophies is to invest a fair amount of our resources into making ourselves more efficient. This approach introduces some latency into our projects but this is made up in the future by an increase in overall team bandwidth. Consistent with this philosophy, the first thing we did was to invest in building a system that made it easy to collect stats and made it immediately useful.

We started by identifying different types of stats that people were already collecting in ad hoc ways or wished that they could collect. Based on these requirements, we came up with the following, super simple API:

my $stats = Polyvore::Stats->new({
    name   => 'db_layer',
    rollup => MINUTE
});

# observe a value for a given key
$stats->observe(45, 'file_size');

# observe the occurrence of events
$stats->inc('facebook_post');
$stats->add(5, 'facebook_disconnects');

# utility hi-res timing functions that observe elapsed time for a given key
$stats->timer_begin('select_user_by_id');
$stats->timer_end('select_user_by_id');

To measure anything, a developer on our team would instantiate a stats object with a given name and rollup granularity, then record arbitrary stats using that object. Once the measurement was set up and running, we could access the collected data via a generic web dashboard, retrieve it programmatically through an API, or graph it.
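
For illustration, here is roughly how a feature outside the database layer might be instrumented with the calls shown above. The surrounding code and stat names are hypothetical; only the Polyvore::Stats methods come from the API described in this post.

# Hypothetical example: instrumenting a Facebook posting feature.
# Everything except the Polyvore::Stats calls is made up for illustration.
my $stats = Polyvore::Stats->new({
    name   => 'facebook_integration',
    rollup => MINUTE
});

sub post_to_facebook {
    my ($self, $post) = @_;

    # time the external API call
    $stats->timer_begin('facebook_post_api');
    my $ok = $self->_facebook_client->publish($post);
    $stats->timer_end('facebook_post_api');

    # count successes and failures separately
    $stats->inc($ok ? 'facebook_post' : 'facebook_post_failed');

    # record the size of the message we sent
    $stats->observe(length($post->{message}), 'facebook_post_size');

    return $ok;
}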

Some Examples

Polyvore’s stats collection system is integrated into almost every subsystem (except itself!).

One of the earliest applications was in instrumenting our page generation times. Site performance is very important to us and we needed to profile where we were spending time in generating the response to each request. Tracking the number of calls and the time spent on each call allowed us to identify bottlenecks and to prioritize our optimization efforts on the most important ones (via tuning SQL, caching, parallelization, pre-computation, etc.). We make all our back-end calls through an abstraction layer, which allowed us to easily instrument our DB reads by collecting stats in just a few places:

sub _read_sql {
…

# retrieve a stats object from a factory
# (these are reused across persistent server processes)
my $stats = $instance->stats({ name => 'backend_stats' });

# find the calling method that is requesting the read so that we can log it.
# eg: select_user_by_id
my $caller = Polyvore::Util::first_caller_inside_pkg();

# begin hi-res timer
$stats->timer_begin($caller, 'read');

# execute the query
my $result = $self->_read($sql, $data, $shard_name);

# end hi-res timer
$stats->timer_end($caller, 'read');

return $result;
}

We did the same thing for our other back-ends — Memcached, Solr and Cassandra.

Collecting this data for each request allows us to do other cool things, like dumping it at the end of every page as an HTML comment, so that we could view source and see what had happened during a particular request:

shard 'read'
shard main read: db17.polyvore.com cnt max min sum
are_contacts 1 0.001073 0.001073 0.001073
count_collection_comments 1 0.001174 0.001174 0.001174
count_collection_favorites 1 0.00403 0.00403 0.00403
… some stats omitted …
select_sponsored_brands_for_collections 1 0.001372 0.001372 0.001372
select_sponsored_hosts_for_collections 1 0.00511 0.00511 0.00511
select_template_collection_basedon 1 0.000991 0.000991 0.000991
select_user_preferences 1 0.000869 0.000869 0.000869
set.handle_counter 25
write_counter 0

This request took 180ms to generate (roughly average for a typical page) and performed 25 reads, including some memcached hits.
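
A minimal sketch of how a dump like the one above could be appended to a page as an HTML comment. The report_as_text() method here is a stand-in for whatever formats the per-request counters and timers; it is not Polyvore’s actual API.

# Hypothetical sketch: emit the per-request stats as an HTML comment.
sub append_stats_comment {
    my ($self, $html) = @_;

    # report_as_text() is a made-up name for a method that renders the
    # counters and timers collected during this request
    my $report = $self->stats({ name => 'backend_stats' })->report_as_text();

    # escape "--" so the HTML comment stays well-formed
    $report =~ s/--/- -/g;

    return $html . "\n<!--\n" . $report . "\n-->\n";
}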

We can also look at historical data in the web UI for each of the calls (past 24 hours):

path avg
select_collection_comments 0.00170376

select_collection_comments on average takes 1.7ms to execute.

Challenges

Scalability

We started out by storing the stats archives in MySQL using a very simple auto-generated table schema. This approach got us off the ground quickly and worked reasonably well until the stats system became a victim of its own success. We started to collect stats in every aspect of our system and, as the volume of data grew, we ran into scalability issues on the storage system. Given the simple schema (key, timestamp, value), we decided to switch to Cassandra as the storage system.
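
For reference, the original MySQL layout described above amounts to a three-column table per archive. The sketch below shows the general shape under that (key, timestamp, value) assumption; the table and column names are illustrative, not the actual auto-generated schema.

# Illustrative only: a simple (key, timestamp, value) table like the one
# the stats system originally auto-generated in MySQL.
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=stats', 'stats_user', 'secret',
                       { RaiseError => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS stats_db_layer (
        stat_key VARCHAR(255) NOT NULL,
        ts       INT UNSIGNED NOT NULL,  -- rollup bucket, unix time
        value    DOUBLE       NOT NULL,
        PRIMARY KEY (stat_key, ts)
    )
});

# one row per key per rollup interval
$dbh->do('INSERT INTO stats_db_layer (stat_key, ts, value) VALUES (?, ?, ?)',
         undef, 'select_user_by_id.read.avg', time(), 0.0017);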

Today, we are logging and archiving 35M data points / day from over 100 different subsystems.

Write Performance

We have been using the stats system to collect data in our production environment and very early on it became clear that we could not simply write out all the stats without overwhelming our storage system.

Fortunately, we had already started working on a distributed job queue system. Our stats objects internally buffer the stats and occasionally flush them to a RabbitMQ-based job queue. The data is picked up by a bank of workers that write it to Cassandra. This approach has allowed us to not worry about the volume of data we are collecting and to let the job queue smooth out and manage the write volume. The other benefit is that writing to a queue is essentially asynchronous and very fast. Our front-end code never has to wait while stats are being flushed.
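
A rough sketch of the buffering idea, under the assumption of a job queue object with an enqueue method; the flush threshold, method names, and worker side are stand-ins for Polyvore’s actual distributed job queue.

# Illustrative sketch of buffered, asynchronous stat flushing.
package Polyvore::Stats;

use constant FLUSH_THRESHOLD => 1000;

sub observe {
    my ($self, $value, $key) = @_;

    # buffer the data point in memory instead of writing it immediately
    push @{ $self->{buffer} }, [ $key, time(), $value ];
    $self->_flush() if @{ $self->{buffer} } >= FLUSH_THRESHOLD;
}

sub _flush {
    my ($self) = @_;
    return unless @{ $self->{buffer} };

    # hand the buffered points to the job queue; a bank of workers picks
    # them up and writes them to Cassandra, smoothing out write volume
    $self->{job_queue}->enqueue('stats.write', { points => $self->{buffer} });
    $self->{buffer} = [];
}

1;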

Getting Everyone Onboard

We made it easy, simple and useful to collect stats. But we’ve also made a point of celebrating stats. Having actual data that showed how a metric was improved was very gratifying, and we made a point of encouraging everyone to present their work in the context of the metrics they had improved.

A Work in Progress

As with everything else we do, our stats system is a work in progress.

The first area for improvement is keeping track of the distribution of values. Because of the way we’re rolling up data, we’re only looking at average measurements over a given time period. Looking at averages alone can be misleading because it hides patterns in how the values are distributed. For example, if we’re measuring the performance of a single database select over time in order to spot slow queries (something that we do!), it would be useful to know that 99% of executions run in under 4ms but 1% take 200ms, versus seeing only an overall average of 6ms. Knowing that the query is generally fast but dramatically slows down in 1% of cases would suggest a need for a fix, whereas the 6ms average might seem just fine.
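
For example, keeping the raw timings within a rollup window and reporting a high percentile next to the average would expose that slow tail. A quick sketch of the idea (not part of the current system):

# Sketch: average vs. 99th percentile for a set of raw timings (in seconds).
sub summarize_timings {
    my (@timings) = @_;
    my @sorted = sort { $a <=> $b } @timings;

    my $sum = 0;
    $sum += $_ for @sorted;
    my $avg = $sum / @sorted;

    # simple percentile: the value 99% of the way up the sorted list
    my $p99 = $sorted[ int(0.99 * @sorted) ];

    return { avg => $avg, p99 => $p99 };
}

# 99 queries at 4ms and one at 200ms: avg comes out around 6ms,
# while p99 surfaces the 200ms outlier.
my $summary = summarize_timings((0.004) x 99, 0.200);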

Another interesting possibility for our stats system is integration with Splunk. We’ve recently started using Splunk, an amazing tool for slicing and dicing log data in real time, and it would be interesting to tee our stats off to Splunk for real-time monitoring and ad hoc queries.

Summary

Measuring and collecting stats is an essential tool for building successful products. The key to success is to make it easy and immediately useful, to ensure you’re making decisions based on the data, and to celebrate the results.
