TL;DR: How I lost a 3-day weekend but won a stronger team by overlooking the need for an atomic feed write in an early MVP served by an eager CDN.
It’s been a few years, and a different place of employment, yet I remember the conversation and the events that followed as if it were yesterday …
… it was back in March, I had added to the backlog a technical item for our mobile app which read “Trust, but verify the feed received is valid.”
Being an old fart with a background in real-time systems integration, there’s a part of me that sees all feeds as files. More pointedly, what plays back in my mind are horrible flashbacks of narrow escapes, of learning the hard way how files and feeds are sometimes deceptive in their completeness and fidelity.
With the advent of April, I found myself in a refinement meeting being passionately challenged by a developer on said technical story. He argued that the open source library we’d selected to make API calls and ingest our JSON feeds checked HTTP response codes for 200 OK, 400 Bad Request, 404 Not Found, and a variety of 500 server errors. It even accommodated the 304 Not Modified our content delivery network (CDN) delivered.
Being a product person of a certain age, I work very hard never to reply or respond to a technical disagreement with “… well in my day, blah, blah, huff, puff, cough, cough, it was uphill both ways, in the snow, with no shoes!” All the more so when working with incredibly talented developers who are of a different generational culture than my own.
So instead, I attempted to float the concept of the mobile app interrogating incoming JSON like a file, adding that, like any file, there is the remote possibility of it getting damaged in transit. I suggested that along with the status code check, we could at least look for null and/or incomplete data objects.
The coder, who was both brilliant and well versed in the technology stack in which he worked, very nicely and with diplomacy beyond his years, politely responded with “… I get what you’re saying, but this doesn’t quite function like feeds you used to work with. The feed ingestion library should be able to handle such situations.” At least he didn’t call me “pops” … like some others … in some other context … but I digress.
Taking a step back and thinking about adding an additional layer of feed validation from the point-of-view of the mobile app developer, I got it. I heard the voice of my former real-time device programmer self asking my current product persona “Why re-invent the wheel? Why introduce this complexity? If speed is of the essence, then let’s not add additional processing. Less is always more!”
So I let it go.
The month of May rolled around, and with it, our MVP white-label mobile app was working great in smaller news markets. The conversation about treating feeds like files a distant memory. So we branded and launched a variant of the app for one of our much larger news organizations. A large city with a well-known professional sports enterprise on an amazing hot streak.
All was right with the world … the app was working … the grill was cleaned of melted cheese … the leftover baked beans put away… and then it hit … on Memorial Day Weekend.
Crap started hitting the fan at about 9pm Saturday night when I got a call from the panicked support peer on call. The news organization from the aforementioned major market was receiving complaints from users who were periodically getting blank pages for the sports section of their news app.
A ghost screen if you will … sometimes appearing … then not … with no apparent rhyme or reason.
I was hoping it was a feed problem. That’s something I knew I could fix. I dove into the content management system (CMS) that ingested stories from The Sports Network and The Associated Press. Something big had indeed happened in said larger market, causing a significant increase in the volume and size of incoming stories. Nonetheless, the CMS was functioning as expected.
I then moved over to the feeds. Since this was an MVP, we were still experimenting, so rather than a dynamic and RESTful API, we were simply writing JSON files every 5 minutes to an HTTP-exposed path on our load-balanced server farm for pickup by our CDN. A “try before you buy MVP API endpoint” if you will.
A bit of tail and grep fed into Puppet indicated that the sports stories in the feeds all aligned with the stories in the CMS throughout the entire server farm. So I took to hammering out fast requests on our test app with wget, using a variety of request headers, and interrogating our logs afterward to see if anything was amiss with server delivery.
I was frustrated, and a little panicked. It being the Memorial Day weekend, and with the MVP successfully in flight, many of the developers were on the road in various states of travel. And while I left emails and messages I still needed a stop-gap fix.
While I couldn’t figure out the root cause at the time, I did get lucky, tripping over a workaround: using the feed file written just before the latest one. I had previously asked that we keep 4 predecessor feed files available for debugging purposes. We had also defined a configuration file that determined which files got picked up.
Now I’d use these archival feeds in combination to avert any further disaster until we could reconvene after the holiday weekend. (Yes, I know it all sounds a bit hackish, but it kept our investments very low while we learned exactly what to build.)
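For what it’s worth, that weekend I did the switching by hand. A minimal sketch of how such a fallback could be automated in Ruby, assuming the candidate feed files are listed newest first; the helper name `newest_usable_feed` and the file names are hypothetical, and this is not the configuration mechanism we actually had.

```ruby
require "json"

# Hypothetical helper: given feed paths ordered newest first, return the first
# path whose contents parse as a non-empty JSON object, or nil if none do.
def newest_usable_feed(paths)
  paths.find do |path|
    begin
      data = JSON.parse(File.read(path))
      data.is_a?(Hash) && !data.empty?
    rescue JSON::ParserError, Errno::ENOENT
      # A truncated, blank, or missing file is skipped in favor of an older one.
      false
    end
  end
end
```

Called with something like `newest_usable_feed(["sports-latest.json", "sports-prev-1.json"])`, it would skip a half-written latest file and hand back its predecessor.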
During the truncated week that followed we discovered there were two culprits which we had not seen with our initial offerings … as there wasn’t the huge volume of news we had seen created by the recently added larger market.
First, remember I said we were periodically writing JSON files for our CDN to pick up and distribute? Well, it seems we weren’t writing those with atomicity in mind. I won’t get into the gory details of atomic vs. non-atomic file reads and writes using Ruby, but let’s just say that our eager CDN was picking up JSON files faster than they could be completely written. Something we’d keep in mind once we evolved to a dynamic and fully RESTful API.
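Gory details aside, a common remedy is to write the feed to a temporary file and then rename it into place, so a reader sees either the old file or the new one, never half of one. A minimal sketch in Ruby, assuming a POSIX filesystem where a rename within the same directory is atomic; the helper name `write_feed_atomically` is hypothetical and this is not our actual code.

```ruby
require "json"
require "tempfile"

# Hypothetical helper: publish a feed so that readers never observe a partial file.
def write_feed_atomically(path, payload)
  # Create the temp file next to the target so the rename stays on one filesystem.
  tmp = Tempfile.create(["feed", ".json.tmp"], File.dirname(path))
  begin
    tmp.write(JSON.generate(payload))
    tmp.flush
    tmp.fsync                         # push the bytes to disk before publishing
    tmp.close
    File.chmod(0o644, tmp.path)       # temp files start out private to the owner
    File.rename(tmp.path, path)       # the atomic swap
  rescue StandardError
    # On any failure, clean up the temp file and leave the old feed untouched.
    tmp.close unless tmp.closed?
    File.unlink(tmp.path) if File.exist?(tmp.path)
    raise
  end
  path
end
```

The point of the design is that the slow part (serializing and writing) happens under a name nobody is polling, and only the final rename touches the published path.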
Second, remember when I said it might be good to trust but verify incoming data? So the CDN is serving up this partial — and sometimes blank — file with a nice “hey, hey, you’re ok” HTTP 200 message in the header. So along with the feed ingestion library, we added some checks to ensure the JSON was indeed valid, and not empty.
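The kind of check I had in mind looks roughly like the sketch below. It is written in Ruby only to match the server side of this story; the helper name `parse_feed` and the `"stories"` key are assumptions for illustration, not our real schema or the app’s actual code.

```ruby
require "json"

# Hypothetical check: return the parsed feed, or nil if the response can't be trusted.
# A 200 status alone is not enough; the body must parse and carry at least one item.
def parse_feed(status, body, required_key: "stories")
  return nil unless status == 200
  return nil if body.nil? || body.strip.empty?   # blank file served with a 200

  data = JSON.parse(body)
  return nil unless data.is_a?(Hash)             # e.g. a bare "null"

  items = data[required_key]
  return nil unless items.is_a?(Array) && !items.empty?

  data
rescue JSON::ParserError
  # A truncated (partially written) file ends up here.
  nil
end
```

When the result is nil, the app can keep showing the last good feed it had rather than a ghost screen.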
As failures go, not a horrible one.
No irreparable damage was done, and as a team LEANing into a mobile app, we learned early in the development of our mobile app infrastructure about the issues of scale and ingestion we’d need to guard and test against moving forward.
At a personal level, it reinforced the notion that when it comes to product, and when dealing with developers of a different generational culture, it pays to pick carefully the hills on which I choose not to die.
In this case, showing patience with the feed==file system==fragility debate in the context of an early and limited MVP ultimately led to more thorough testing, a more stable app, and a stronger team.
Put another way, we still got what was best for the product while learning much, without paying the steep price of dictating to or talking down to one another.