Grace Hopper 2013 – A Guy’s Perspective


I recently went with several co-workers to the Grace Hopper Celebration of Women in Computing. My teammate, Kristine Delossantos, was giving a lightning talk about the technology behind Current Caller ID, and I wanted to go support the team and help with recruiting. I was a little intimidated thinking about being in the gender minority at the conference, and it made me better understand how a woman might feel in our male-dominated industry. Once I got there, all I really felt was excitement about having the opportunity to learn from leaders in both academia and industry, who happen to be women.

Women make up only 23% of computer science professionals, according to Telle Whitney, head of the Anita Borg Institute. We have a growing gap between the number of computing jobs available in the US and the number of workers skilled enough to fill them. Code.org projects that there will be 1.4 million computing jobs by 2020 but only 400,000 computer science students, and increasing those numbers is essential if the US technology industry is to remain a leader in global innovation. At the keynote, Maria Klawe of Harvey Mudd College and Sheryl Sandberg of Facebook discussed how essential it is to encourage women to go into the field earlier in the education cycle to help bridge that gap, which I wholeheartedly agree with.


I met quite a few interesting people and had a ton of great discussions with women in all areas of computing, from students to senior leaders. I also got a chance to hear about some really interesting research. One woman, Emma from Harvard, sat next to me and we talked about her PhD research in computer vision. Apparently the brain can process data from the eyes with only a few watts of energy, and current computer vision techniques that rely on large neural networks running on general-purpose hardware are nowhere near that efficient. She was exploring how to increase energy efficiency using specialized hardware sensors with the various convolution filters built in. She mentioned that one motivation for the research was to improve the vision capabilities of the Robobee project at Harvard, which itself sounds pretty cool.

The mixture of academics and industry leaders created a great blend of pure innovation and innovation applied to business goals in a single location. In the same lightning talk panel where Kristine was presenting, another woman, Denise Koessler from the University of Tennessee, presented an abridged version of her PhD thesis. It described how each phone customer has a unique calling and texting pattern based on the times and durations of calls and the lengths of texts, which Denise theorized could be used as a social fingerprint. Current, our product, actually presents some of the same types of information to our customers, and it was insightful to hear how mobile carriers could leverage that data for customer identification and acquisition at scale.

All in all, the experience was really enlightening and wonderful. I learned some things about myself and about how gender is still an issue in our industry. I learned that even though I think of myself as not gender-biased, I need to work hard to be better and to recognize that there are still unconscious biases, in both women and men, that can be actively corrected. I got to hang out with many of my female co-workers and get to know them better. I got to learn about really interesting technical research and hear about women’s experiences in our industry. I was further convinced how essential it is for us all to get our children, both daughters and sons, to learn about computer science earlier in school to ensure they are set up for success in the workforce of tomorrow. I would strongly encourage both men and women to attend GHC next year.

Click here to view Kristine’s presentation

Efficient Mobile APIs Using Apache Thrift


Reducing data consumption for the end customer is a paramount task when building a mobile app, and it was one of our biggest challenges with Current Caller ID. Since Current feeds in information from social networks, news, and weather, it naturally delivers a lot of data to clients. Our solution for reducing data was twofold: first, using Apache Thrift as the API interface to shrink payload sizes, even after gzipping; second, ensuring the server could identify and deliver new items incrementally.

Here’s how it all came together…

An intro to Current Caller ID

WhitePages has provided standard caller ID on the Android platform for several years, but caller ID is only useful for calls you receive from folks you don’t know. Looking at our data, we saw that only about 40% of the calls a person made or received involved unknown callers; the rest were with people already in their contacts. We saw a huge opportunity to improve the value we deliver to our customers by including more information about the people they already know. Our goal was to inform you about who you communicate with, and how you communicate with them.

Broadly, we grouped the features into:

  • Social network activity from your contacts
  • Location-specific news and alerts based on your contacts’ location
  • Statistics about your communication patterns with your contacts


Feeding social updates, news and weather into our app required us to deliver lots of data to our clients, and we were concerned with how delivering this extra data across the wire would impact customers’ data plans.

Apache Thrift to the Rescue

Most of our existing client APIs were semi-RESTful with JSON payloads, and when retrieving or sending 100 contact deltas at a time the numbers were poor. The average JSON payload was around 100-120 KB. Gzipped, these payloads shrank to around 12-15 KB, which is good, but we thought we could do better. That led us to look at other avenues for APIs.

Some teams at WhitePages had been exploring Thrift as a replacement for a legacy RPC mechanism used by our internal services. Its abstract service and structure definitions, client generation for multiple languages, and binary payloads made it a natural choice to explore for mobile client interfaces, too.

One concern at the time was that the only other company we knew of exposing Thrift APIs publicly was Evernote. But we decided to trudge on anyway, and found that over the wire (for our data types), Thrift with the standard binary serializer gave us a >=50% savings on average uncompressed: 40-50 KB payloads for 100 contacts, versus 100-120 KB for JSON. Gzipped, the savings were less significant, but on average we still saw roughly 20-30% savings (10-12 KB payloads versus 12-15 KB for JSON). The CPU cost to process Thrift binary payloads was also lower than JSON in many cases on Android devices, which reduces battery consumption.
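As a rough sketch of how such a comparison can be run, the snippet below serializes the same record both ways with the Ruby Thrift library. ContactDelta and its fields are hypothetical stand-ins for our generated structs, and Zlib deflate stands in for gzip; it is an illustration, not our actual benchmark.

    require 'thrift'
    require 'json'
    require 'zlib'

    # Hypothetical generated struct; real structs come from `thrift --gen rb`.
    delta = ContactDelta.new(name:   'Jane Doe',
                             phone:  '206-555-0100',
                             status: 'Just landed in Seattle!')

    # Thrift binary encoding via the standard serializer.
    binary = Thrift::Serializer.new(Thrift::BinaryProtocolFactory.new).serialize(delta)

    # Equivalent JSON encoding of the same fields.
    json = { name: delta.name, phone: delta.phone, status: delta.status }.to_json

    puts "thrift: #{binary.bytesize} bytes (#{Zlib::Deflate.deflate(binary).bytesize} compressed)"
    puts "json:   #{json.bytesize} bytes (#{Zlib::Deflate.deflate(json).bytesize} compressed)"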


Thrift also gave us more freedom to iterate on the service and structures during development with reduced impact on client engineers, since the clients and structures could simply be regenerated and only the encapsulating models expressing those structures needed to change. Compared to the amount of time we’d spent in the past building client SDKs for our REST APIs, this saved us several weeks of engineering effort on these new APIs.

Current’s interface definitions have been used to generate services and models in Ruby (for our services, and for the clients used in unit and integration testing of those services), Java (for our Android application in the market), and Objective-C (for some iOS prototypes we’ve built).
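For a sense of what the generated Ruby side looks like in use, here is a minimal sketch. The service name, method, and endpoint are hypothetical; the transport/protocol/client plumbing is the standard pattern for the Ruby Thrift library.

    require 'thrift'

    # Hypothetical endpoint; ContactFeedService::Client would be generated
    # from the interface definition with `thrift --gen rb`.
    transport = Thrift::HTTPClientTransport.new('https://api.example.com/thrift/contacts')
    protocol  = Thrift::BinaryProtocol.new(transport)
    client    = ContactFeedService::Client.new(protocol)

    # Ask for everything newer than the timestamp the device last saw
    # (hypothetical method and arguments).
    updates = client.fetch_updates('device-abc123', 1380672000)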

There are a few downsides to our choice of Thrift and Ruby. The first is that Thrift doesn’t support inheritance directly, so to extend existing structures you have to build composites, which in some cases creates a deep hierarchy. The second is that, because of how the Ruby runtime allocates objects, this compositing tends to allocate a lot of objects, and at scale Ruby systems whose primary job is to fetch and transmit Thrift structures can spend a lot of CPU time in GC.

Only Delivering When Important Data has Changed

To keep data usage down, we also wanted the client to only get records that have important changes, the kind that spark conversation. In our world, examples of an important change would be:

  1. Address and phone changes.  “Congrats on your new home!”

  2. Social profile updates. “Your job title changed, congrats on the promotion!”

  3. New status and check-in data. “I was just at Disneyland too!  The Matterhorn stands the test of time.”

  4. News and weather changes about the contact’s locale. “Wow, Lindsay Lohan, why are you calling me, and in the news too?  At least it’s still sunny in California.”

In all cases, the important changes above only matter for people we can actually resolve against the people you communicate with. Because of this, we don’t fetch and deliver statuses and check-ins for people we haven’t been able to match to a contact, nor do we deliver news and weather for locales you don’t have contacts associated with.

Statuses and check-ins are fairly transient and time-sensitive, so we schedule server-side updates for this information more frequently than profile changes. News and weather are time-sensitive but don’t change frequently during the day, so we fetch them twice a day. Profile data changes happen less often, so depending on the social provider we fetch that anywhere from twice a day to every 5 days. Even though we fetch on these schedules, we only update the client when something has actually changed.
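One way to picture this cadence is as a per-source refresh schedule. The source names and intervals below are illustrative rather than our exact production values.

    # Seconds between server-side fetches per data source (illustrative values).
    REFRESH_INTERVALS = {
      statuses_and_checkins: 1  * 3600,   # transient and time-sensitive, refreshed most often
      news_and_weather:      12 * 3600,   # twice a day
      profile_data_min:      12 * 3600,   # twice a day for some providers...
      profile_data_max:     120 * 3600    # ...up to every 5 days for others
    }

    def due_for_refresh?(source, last_fetched_at, now = Time.now)
      now - last_fetched_at >= REFRESH_INTERVALS.fetch(source)
    end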


Also, to reduce the cost of spinning up network connections, all of our APIs accept and return batches of data.

To help identify true changes, we rely on the Ruby implementation of Thrift, which includes a hash function that acts as a true checksum of the data elements of the entire structure. Some of the fields in our structures are not relevant to a change (like the last-updated time), so we had to modify our dupe-check code to ignore the fields we wanted to exclude, but it saved us the effort of implementing our own mechanism from scratch.
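A minimal sketch of that dupe check, assuming hypothetical field names: since a generated Ruby Thrift struct hashes over all of its fields, one simple approach is to blank the fields you want to ignore on a copy before hashing. This illustrates the idea rather than our exact modification.

    # Fields that shouldn't count as a "real" change (hypothetical names).
    IGNORED_FIELDS = [:last_updated_at, :fetched_at]

    def content_hash(struct)
      copy = struct.dup
      IGNORED_FIELDS.each do |field|
        setter = "#{field}="
        copy.send(setter, nil) if copy.respond_to?(setter)
      end
      copy.hash
    end

    def changed?(old_record, new_record)
      content_hash(old_record) != content_hash(new_record)
    end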

When data comes in from the server, we compare it to existing data and only update items that have true differences.  Each of these updated items then gets an updated timestamp, which we store along with the contact.

Clients are aware of the latest timestamp of data that they’ve got locally, and submit that to us as part of their request for updates, which we use to filter out data older than that timestamp before returning.  The client then records the latest timestamp for the next periodic request.
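Put together, the exchange looks roughly like the sketch below (method and field names are hypothetical): the client sends the newest timestamp it has, and the server returns only items updated after it, along with the new high-water mark for the next request.

    # Server side: filter stored items down to true deltas for this client.
    def updates_since(items, client_timestamp)
      fresh  = items.select { |item| item.updated_at > client_timestamp }
      latest = fresh.map(&:updated_at).max || client_timestamp
      { updates: fresh, latest_timestamp: latest }
    end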

The Result

The broad features of Current required us to design several new systems, and ensure those services scaled to support a much larger customer base.  Nine months after launch, we’ve proven that our system can scale up effectively.