Skip to main content
  1. Posts/

Building-Analytics-at-500px-Samson-Hu-Medium

358 words·2 mins

Building-Analytics-at-500px-Samson-Hu-Medium #

A place where:

  • Everyone understands company metrics and knows how their work fits into the measurement framework
  • We measure the impact of decisions and acknowledge their success or failure
  • Teams use self service tools to answer their own questions with data
  • We can predict the future by digging into the data and finding leading indicators I knew that it wouldn’t be easy.

The Dimensional Data Warehouse Schema #

The schemas of data warehouses have two types of tables:

  • Fact tables record measurements in time of a process.

ETL #

In order to populate our data warehouse with data from our production databases and logs, we apply a process called ETL. An ETL script that has to turn messy production data into clean data warehouse data will naturally be extremely messy.

The Finished Data Warehouse #

So that’s our data model — stored on Amazon Redshift, and fed new data every night by ETL pipelines run on Luigi. You can think of three funnels at 500px:

  • visitor -> signup -> daily active -> daily engaged -> paid subscriber
  • visitor -> signup -> photo upload -> photo submit to marketplace -> photo sold on marketplace
  • visitor -> signup -> purchase photo from marketplace Each team owns different parts of this funnel for different products:
  • The marketing teams own (page views) and the top and bottom (revenue) of this funnel
  • The product teams has less of an emphasis on top of funnel metrics
  • The development teams (web and mobile), want to see the entire funnel with respect to their own products. There’s so much room for failure in this system:
  • Logs could get parsed wrong, and you undercount or over-count an event
  • The log sender on an API server could fail and you don’t notice and you miss a fraction of your events
  • There might be network issues one day, and 10% of your log entries might fail to send to S3
  • Some metrics that are important might be hard to query in the database and you might pull the wrong number
  • etc This doesn’t include hard fail events like your ETL’s failing and the metrics not being refreshed.