Archives For Developer’s Corner

If there's one thing I've learned this summer, it's this: "start before you're ready." Which makes it ironic, then, that I'm writing this after my internship has technically ended. The problem was, I wasn't sure what to write about, and I was also really busy shipping to production. I could have easily written at length about the endless office snacks, the Razor scooters or the incredibly comfortable egg chairs here at the office. But from what I've heard, these are pretty standard fare for elite Seattle tech companies, and they're the type of thing that doesn't really matter in the long run. What matters to me is community, education and excellence, three values I found everywhere at WhitePages.

This summer was my first computer science internship; I only discovered the field late in my college career, as is not uncommon for women. Like Nick wrote in his intern blog, WhitePages gave me an interview when many other companies said only, "see you next year." Even before accepting my offer, I was connected with my team and a mentor, Owyn Richen. After accepting, I was in weekly correspondence with Owyn. That winter I had decided to tackle Rails for a personal project, and he helped me work out model-data organization and possible offline storage solutions. I was taking a databases class, and he explained how the SQL vs. NoSQL vs. NewSQL question is answered at WhitePages; it was incredibly cool to see how my coursework applied in the real world. During my internship, Owyn checked in with me almost every day: "How's it going, what can I help you with, when are you shipping?" Even though he used Emacs, he created a fast-paced and safe learning environment.

Owyn wasn't the only one at the company eager to help me learn and ship. When I submitted my pull request, at least five people reviewed every single line of my code, which, although incredibly nerve-wracking, helped me learn more about Ruby in a day than I might have otherwise in a year. It also proved that WhitePages was serious about excellent code.

The commenters were all people I had met before or would soon meet at one of our twice-monthly company happy hours. The happy hours were a great place to learn more about the tech industry in Seattle, make friends with my coworkers, ramble with Devin Ben-Hur about databases, and pester the design team to do a caricature of me.

In addition to the happy hours, I had another opportunity for growth at the Intern Coffee Chats. At these meetings, the interns were given an hour to talk with the leaders at WhitePages. We met with the CEO, CFO, CTO, CRO and team leaders to talk about everything from career paths to Snapchat's business model. WhitePages takes advantage of its small size to educate all engineers about the business. At least twice this summer, the Mobile team's business leader gave a presentation about the plans for our products and their impact on the business as a whole. All the engineers were in attendance, asking tough questions that made me realize they really cared. As a new programmer, it set a great precedent for me: I was given a very challenging back-end project, but I was also responsible for understanding why the project was important and what impact it would have. Furthermore, I was expected to give an hour-long presentation about it!

A year ago, the longest I'd talked about a computer science project was five minutes, and the longest I'd worked on a project was five hours. Now, thanks to the encouragement and opportunity at WhitePages, I've spent three months writing and shipping a high-quality service in a new language. I learned how to use Chef for dev-driven deployment and how Lucene indexes work. Most importantly, I've connected with mentors who will influence me for my whole career. Next month, WhitePages is sponsoring me, a former intern, to attend the Grace Hopper Celebration of Women in Computing. When I'm there, I'm excited to share what I've learned: "Start before you're ready. And consider starting at WhitePages."

Introduction

Here at WhitePages we are fond of Postgres. We use it to store most of our data. When we recently decided to rebuild our entire Email and SMS system, Postgres was the natural choice for storing user contact information in that system. What was not an easy choice, however, was the Ruby ORM to talk to the database. In the end we went with Sequel, but not without a few modifications to a Ruby gem to allow for testing without a database connection. Here's how it all came together:

Why ActiveRecord wasn’t the solution

Being (predominantly) a Ruby on Rails shop, the first place we looked was ActiveRecord. Though it has become a very mature and full-featured ORM, we felt it was packed with far more functionality than we needed, especially considering that we were not building this new Email/SMS system on Rails. Had we decided to build on Rails we would have taken a much more serious look at ActiveRecord, but we had enough experience using it across several other products to know that it was more bloated than we wanted to deal with for this project. There has also been quite a bit of discussion about performance issues in recent ActiveRecord versions, and that certainly played a part in our skepticism about using it.

Choosing Sequel

What we needed was an ORM that gave us a reasonably well-organized model representation of our data, was lightweight, supported multiple connection configurations (to quickly switch between read-only and read-write), included basic migration tools, and was easy to test. We found everything we needed in Sequel… well, almost everything. Though the Sequel ORM is simple, elegant, and very easy to tie to Postgres, it was not obvious how to test against anything other than a live, connected database.

That wasn't good enough for us. We wanted developers and automated test systems to be able to perform service tests that quickly generated their own test data, threw it away when they were done with it, and did it all without a connection to a database. Basically, we wanted in-memory fixtures.

Sequel Fixture

Fortunately, we came across a public Ruby gem called Sequel-Fixture. It was built to hook into Sequel and allow data to be created and destroyed easily from RSpec tests. Unfortunately, the gem had not been maintained for several months, was not Ruby 2.0 compatible, and lacked the key feature we wanted: an in-memory mode. It still depended on a connection to the database. So it was only part of the way there.

Not wanting to start from scratch, we decided that there was enough useful functionality in the gem for us to extend it into what we needed. With the original author's permission (thank you, Xavier) we did just that.

What We Updated

Dynamically Defined Schema

A key part of our automated database testing is being able to easily respond to schema changes. As we iterate on our database, especially early in the project, we want to be able to extend and maintain our tests with as little work as possible. More importantly, however, to be fully managed in-memory we need to be able to specify exactly what the database tables are supposed to look like on every test run.

So we extended Sequel-Fixture to support a “schema” section right inside of the fixture file.
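Here's a minimal sketch of such a fixture file for the users table (the column names and exact attribute spelling are illustrative, not lifted from our production schema):

    # users.yaml
    schema:
      - name: id
        type: integer
        primary_key: true
      - name: name
        type: string
      - name: email
        type: string
    data:
      - id: 1
        name: John
        email: john@example.com
      - id: 2
        name: Jane
        email: jane@example.com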

As you can see, the data definition moved into a new "data" section, while the new "schema" section is a YAML array of column definitions for the "users" table. Each entry in the schema specifies the name and type of a column.

There is also support for specifying which column is the "primary_key". That is not especially interesting for most testing, but it is something the Sequel ORM needs when creating a table. If the primary_key attribute is not specified, we just pick the first column.

Rows in the Data Section

In the fixture example above you will note that the "data" section is an array of entries. Each of these entries represents an individual row of fixture data in the "users" table. In the original version of the gem, each data entry was a Hash value made up of a key and a blob of data for the value.

As you can see below, this allowed for a neat way of referencing the data in your tests.
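Roughly, with the original hash-based data, a keyed entry could be referenced by name straight from the fixture (the names here are illustrative):

    # Original fixture data keyed by name:
    #   john:
    #     name: John
    #     email: john@example.com

    database = Sequel.sqlite                        # any Sequel connection
    fixture  = Sequel::Fixture.new :simple, database
    fixture.users.john.name    # => "John"
    fixture.users.john.email   # => "john@example.com"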

However, we felt that this was not the most accurate way to represent a table of rows, so we updated the internal Table definition to look more like an array.
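With the array-style Table, the same data is addressed by index instead, along the lines of:

    fixture = Sequel::Fixture.new :simple, database
    fixture.users[0].name     # => "John"
    fixture.users[0].email    # => "john@example.com"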

The Result

Today we have a gem that represents the database layer of our Email/SMS service, and its spec folder contains several fixture files. Our test coverage is great, the updates to the tests have been painless as we have iterated on the database, and best of all, our Sequel-driven service code is tested daily on our Jenkins server without a connection to a real database… just what we were aiming for.

Taking a snippet from the gem's sample tests, you can see how nice and clean this turned out.
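The shape of those specs is roughly the following sketch (UsersRepository and its methods are hypothetical stand-ins for our service's database layer, and an in-memory SQLite handle stands in here for the gem's in-memory support):

    require "sequel"
    require "sequel-fixture"

    describe UsersRepository do
      let(:database) { Sequel.sqlite }                           # in-memory, no external connection
      let(:fixture)  { Sequel::Fixture.new :simple, database }   # builds tables from the fixture schema

      it "finds a user loaded from the fixture" do
        repo = UsersRepository.new(database)
        user = repo.find_by_email(fixture.users[0].email)
        expect(user.name).to eq fixture.users[0].name
      end
    end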


Take a look at the updated Sequel Fixtures Gem

When designing a resourceful object hierarchy in Rails, often a single model has meaning in multiple different contexts. Take for example an application managing rental properties. This application has two main models, Properties and Tenants, which have obvious relationships with each other. These models also have value independent of this relationship.

At WhitePages.com we are often modeling entities that exist in many different contexts, somewhat like a graph. The question then becomes: how can I model a relationship such as has_many/belongs_to using resourceful routes, while still allowing direct root access to a model, or potentially access through several different model connections? Can I access the same controller resource through different routes? The answer is yes. Let's take a look at a basic example to see how we can make this happen.

Let’s start with the following models:
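A minimal version of those models, assuming the conventional Rails naming, looks like this:

    # app/models/property.rb
    class Property < ActiveRecord::Base
      has_many :tenants
    end

    # app/models/tenant.rb
    class Tenant < ActiveRecord::Base
      belongs_to :property
    end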

The default scaffolding generator gives us the following routes with both resources at root:
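That is, config/routes.rb ends up with both resources declared at the root (the application class name here is illustrative):

    # config/routes.rb
    RentalApp::Application.routes.draw do
      resources :properties
      resources :tenants
    end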

These routes are workable, but do not correctly illustrate our designed model. Modeling the routes to match our belongs_to/has_many relationship, we would generate the following nested routes in config/routes.rb:
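Nesting tenants inside properties gives us something like:

    # config/routes.rb
    RentalApp::Application.routes.draw do
      resources :properties do
        resources :tenants
      end
    end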

Nesting our resources like this, we now have the following routes:
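Abbreviated to the GET routes (the full rake routes output also includes the usual create, update and destroy routes), this looks roughly like:

    GET  /properties                             properties#index
    GET  /properties/:id                         properties#show
    GET  /properties/:property_id/tenants        tenants#index
    GET  /properties/:property_id/tenants/new    tenants#new
    GET  /properties/:property_id/tenants/:id    tenants#show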

Much better! This routing structure allows us to access our list of properties at /properties, a specific property at /properties/:id, a list of a property's Tenants at /properties/:property_id/tenants, and a specific tenant at /properties/:property_id/tenants/:id. This now models the relationship we've created between Properties and Tenants. The only problem now is that our Tenants controller does not know how to use the :property_id parameter to set our scope correctly. We need to make a few modifications to make use of the provided property_id; the majority of our changes are in 'index' and 'new'.
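Sketched out, the relevant TenantsController actions look something like this:

    # app/controllers/tenants_controller.rb
    class TenantsController < ApplicationController
      # GET /properties/:property_id/tenants
      def index
        @property = Property.find(params[:property_id])
        @tenants  = @property.tenants
      end

      # GET /properties/:property_id/tenants/new
      def new
        @property = Property.find(params[:property_id])
        @tenant   = @property.tenants.build
      end
    end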

You can see from the code above that we are now using the property_id parameter provided by our route to tell ActiveRecord the scope of our search, as well as to initialize new models. Hooray! But this isn't quite the goal we're after. What we want is the ability to view Tenants in the context of a Property, but also to view Tenants without any context. This will allow us to view all our Tenants without regard to which Property they are assigned to, and to provide a Tenant details path without having to go through the Property relationship. Let's start by adding the route to our routes file:
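Adding a second, root-level resources :tenants declaration alongside the nested one:

    # config/routes.rb
    RentalApp::Application.routes.draw do
      resources :properties do
        resources :tenants
      end

      resources :tenants
    end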

This gives us the following routes:
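Again abbreviated to the GET routes, rake routes now reports roughly:

    GET  /properties                             properties#index
    GET  /properties/:id                         properties#show
    GET  /properties/:property_id/tenants        tenants#index
    GET  /properties/:property_id/tenants/:id    tenants#show
    GET  /tenants                                tenants#index
    GET  /tenants/new                            tenants#new
    GET  /tenants/:id                            tenants#show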

Now we have both root access to our Tenants as well as routes through our properties relationship! Success! Well, not yet. If we use any of the root routes to Tenants, our controller is going to throw an error, as property_id is not being sent along. Let’s fix that right now:
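One way to handle both cases is to make the property lookup conditional on the presence of :property_id (a sketch, not necessarily how the original controller was written):

    # app/controllers/tenants_controller.rb
    class TenantsController < ApplicationController
      # GET /tenants  or  GET /properties/:property_id/tenants
      def index
        if params[:property_id]
          @property = Property.find(params[:property_id])
          @tenants  = @property.tenants
        else
          @tenants = Tenant.all
        end
      end

      # GET /tenants/new  or  GET /properties/:property_id/tenants/new
      def new
        if params[:property_id]
          @property = Property.find(params[:property_id])
          @tenant   = @property.tenants.build
        else
          @tenant = Tenant.new
        end
      end
    end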

Now we can declare success. We have the ability to access all our Tenants through a root route, as well as through their relationship with Properties using a nested route.

Reducing data consumption for the end customer is a paramount task when building a mobile app, and it was one of our biggest challenges with Current Caller ID. Since Current feeds in information from social networks, news and weather, it naturally delivers a lot of data to clients. Our solution for reducing data was twofold: the first part was using Apache Thrift as the API interface to shrink payload size, even when gzipping the payloads. The second was ensuring the server could identify and deliver new items incrementally.

Here’s how it all came together…

An intro to Current Caller ID

WhitePages has provided standard caller ID on the Android platform for several years, but caller ID is only useful for calls you receive from folks you don’t know. Looking at our data we saw that only about 40% of the calls a person made or received were from unknown callers. We saw a huge opportunity to improve the value we deliver to our customers by including more information. Our goal was to inform you about who you communicate with, and how you communicate with them.

Broadly, we grouped the features into:

  • Social network activity from your contacts
  • Location-specific news and alerts based on your contacts’ location
  • Statistics about your communication patterns with your contacts


Feeding social updates, news and weather into our app required us to deliver lots of data to our clients, and we were concerned with how delivering this extra data across the wire would impact customers’ data plans.

Apache Thrift to the Rescue

Most of our existing client-facing APIs were semi-RESTful with JSON payloads, and when retrieving or sending 100 contact deltas at a time the numbers were poor. The average JSON payload was around 100-120k. Gzipped, these payloads were around 12-15k, which is good, but we thought we could do better. That led us to look at other avenues for APIs.

Some teams at WhitePages had been exploring Thrift as a replacement for a legacy RPC mechanism we had for internal services.  Its abstract service and structure definition, client generation for multiple languages and binary payload made it a natural choice for exploration in mobile client interfaces, too.

One thing we were concerned about at the time was that the only other company we knew of exposing Thrift APIs publicly was Evernote. But we decided to press on anyway, and found that over the wire (for our data types), Thrift with the standard binary serializer gave us a savings of 50% or more on average uncompressed (40-50k payloads for 100 contacts, versus 100-120k). Gzipped, the savings were less significant, but on average we still saw roughly 20-30% savings (10-12k payloads versus 12-15k for JSON). The CPU cost to process Thrift binary payloads was also lower than JSON in many cases on Android devices, which reduces battery consumption.

 

Thrift also gave us more freedom to iterate on the service and structures during development with a reduced impact on client engineers, since the clients and the structures could be re-generated, and only the encapsulating models expressing those structures needed to be changed.  Compared to the amount of time we’d spent in the past building client SDKs for our REST APIs, this process saved us several weeks of engineering effort for these new APIs.

Current's interface definitions have been used to generate services and models in Ruby (for our services and for the clients used in unit/integration testing of those services), Java (for our Android application in the market), and Objective-C (for some iOS prototypes we've built).

There are a few downsides to our choice of Thrift and Ruby. The first is that Thrift doesn't support inheritance directly, so to extend existing structures you have to build composites, which in some cases creates a deep hierarchy. The second is that, because of how the Ruby runtime allocates objects, this compositing tends to allocate a lot of objects, and at scale, Ruby systems whose primary job is to fetch and transmit Thrift can spend a lot of CPU time in GC.

Only Delivering When Important Data has Changed

To keep data usage down, we also wanted the client to only get records with changes important enough to spark conversation. In our world, examples of an important change would be:

  1. Address and phone changes.  “Congrats on your new home!”

  2. Social profile updates. “Your job title changed, congrats on the promotion!”

  3. New status and check-in data. “I was just at Disneyland too!  The Matterhorn stands the test of time.”

  4. News and weather changes about the contact's locale. "Wow, Lindsay Lohan, why are you calling me, and in the news too?  At least it's still sunny in California."

In all cases, the important changes above only matter for people we can actually resolve against the people you communicate with. Because of this, we don't fetch and deliver statuses and check-ins for people we haven't matched to a contact, nor do we deliver news and weather for locales that none of your contacts are associated with.

Statuses and check-ins are fairly transient and time-sensitive, so we schedule updates for this information on the server more frequently than profile changes.  News and weather are time sensitive but don’t change frequently during the day, so we fetch that twice a day.  Profile data changes happen less often, so depending on the social provider we fetch that anywhere from twice a day to every 5 days.  Even though we fetch at these periods, we only update the client when something has actually changed.


Also, to reduce the cost of spinning up network connections, all of our APIs accept and return batches of data.

To help identify true changes, we rely on the Ruby implementation of Thrift, which includes a hash function that is a true checksum of the data elements of the entire structure. Some of the fields in our structure are not relevant to a change (like the last-updated time), so we made some modifications so that our dupe-check code could ignore the fields we wanted to exclude, but it saved us the effort of implementing our own mechanism from scratch.
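As a minimal sketch of the idea (working on plain Ruby hashes rather than the Thrift-generated structs, and not the code we shipped), the dupe check boils down to fingerprinting only the fields that matter:

    require "digest"

    # Fields that shouldn't count as "real" changes (hypothetical example).
    EXCLUDED_FIELDS = [:last_updated_at].freeze

    # Compute a stable fingerprint of a record's relevant fields.
    def change_fingerprint(record)
      relevant = record.reject { |field, _| EXCLUDED_FIELDS.include?(field) }
      # Sort by field name so ordering never affects the checksum.
      Digest::SHA1.hexdigest(relevant.sort.map { |field, value| "#{field}=#{value}" }.join("|"))
    end

    # A record only counts as changed when its fingerprint differs.
    def changed?(old_record, new_record)
      change_fingerprint(old_record) != change_fingerprint(new_record)
    end

    old_rec = { name: "Lindsay", city: "Los Angeles", last_updated_at: 1378000000 }
    new_rec = { name: "Lindsay", city: "Los Angeles", last_updated_at: 1378090000 }
    changed?(old_rec, new_rec)   # => false, only the excluded field moved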

When data comes in from the server, we compare it to existing data and only update items that have true differences.  Each of these updated items then gets an updated timestamp, which we store along with the contact.

Clients are aware of the latest timestamp of the data they have locally and submit it as part of their request for updates; we use it to filter out anything older than that timestamp before returning. The client then records the latest timestamp for its next periodic request.
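The server side of that exchange boils down to something like the following sketch (the field names are hypothetical, not our actual API):

    # Return only items updated after the client's last-seen timestamp,
    # plus the new high-water mark the client should store for next time.
    def updates_since(items, client_timestamp)
      fresh  = items.select { |item| item[:updated_at] > client_timestamp }
      latest = (fresh.map { |item| item[:updated_at] } + [client_timestamp]).max
      { items: fresh, latest_timestamp: latest }
    end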

The Result

The broad features of Current required us to design several new systems, and ensure those services scaled to support a much larger customer base.  Nine months after launch, we’ve proven that our system can scale up effectively.

My last blog post, Example of a Real-Time Web App, was an overview of the architecture of WhitePages’s Mailer. We used Ruby on Rails, Spine.js, Faye, and Sidekiq to build an asynchronous web app. Rails is a server-side MVC framework, Spine.js is a client-side MVC framework, Sidekiq processes jobs, and Faye ties those three pieces together with HTTP server push. This blog dives deeper into Faye, which can be the basis for implementing a publish-subscribe design pattern.

Continue Reading...