Using Go to Stream the NYTimes Newswire

UPDATE: Mark Bates kindly refactored this to make it more idiomatic. If you’re interested in the differences, I blogged about it!

Valentine’s Day is the day every year that I celebrate my longest relationship: the one with the NY Times. While I still subscribe to the WSJ, the NYT is undoubtedly my #1. Love you bae.

So this year, I decided I’d play around with the NYTimes API, a public API that includes article search, top stories, a few NYC-specific services, and the one I chose to use, the Newswire API.

From developer.nytimes.com: “The Times Newswire API provides an up-to-the-minute stream of published items”

This is pretty misleading, because they don’t actually provide a stream of anything. It’s a standard REST API that returns a list of the newest articles the NYTimes has published. There’s an “offset” parameter you can hand it, but when I tried to set it based on the number of records available, I got a 500 error. So unless you compare every item you’ve pulled down against what you’ve already seen, it’s hard to know from the API alone what has just been published.

To complicate things further, published items are NOT given a unique identifier, nor does the API expose any sort of global integer for each piece of content.

Today’s project was figuring out how to consume their API and then expose it as a streaming API. I chose to keep this really simple. If this were actually going to be used I’d want to have, oh I don’t know, a persistence layer (something I’ll probably try at a later date).

I’ve been flirting with Go on and off for a while, so I decided it was a good time to take Golang out for a spin. Go is also really well suited for these sorts of proxies, as it’s lightning fast and when/if I need to add concurrency, it’s trivial.

Learning to Stream With Go

My first step was figuring out how in the world I was going to stream anything. I stumbled upon go-json-rest, which is a really cool, easy-to-use wrapper on top of Go’s net/http library. Turns out you don’t need it at all. With Go, if your handler just keeps writing and doesn’t close the connection, it’ll stream.
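Here is a minimal sketch of that idea using only the standard library. It is not the code from the repo: streamTicks and tryStream are made-up names, and the "tick" lines stand in for articles. One detail worth knowing is that net/http buffers writes, so the sketch calls Flush after each line to push it out to the client right away.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// streamTicks returns a handler that writes n lines, one per interval.
// The handler does not return between writes, so the connection stays open,
// and it flushes after each line so the client sees it immediately.
func streamTicks(n int, interval time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		for i := 0; i < n; i++ {
			select {
			case <-r.Context().Done(): // the client went away
				return
			case <-time.After(interval):
			}
			fmt.Fprintf(w, "tick %d\n", i)
			flusher.Flush()
		}
	}
}

// tryStream (hypothetical helper) runs the handler against an in-memory
// recorder and returns everything that was written.
func tryStream() string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	streamTicks(3, time.Millisecond)(rec, req)
	return rec.Body.String()
}
```

To try it against a real client, registering the handler with something like http.Handle("/stream", streamTicks(10, time.Second)) and calling http.ListenAndServe from main should be enough; curl will then print one line per second.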

Understanding the NYTimes Data Format

This was a little trickier, but the payload returned by the NYT was relatively straightforward. As I mentioned before, they don’t provide you with unique IDs; however, you can sort of make these yourself if you wanted, by combining the dates and titles.
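One way to build such a homemade ID, sketched under assumptions: articleKey is a hypothetical helper, and hashing the created date together with the title is just one reasonable choice. If an editor changes the headline, the key changes too, so treat it as a best-effort identifier.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
)

// articleKey derives a stable, fixed-length key from an article's created
// date and title. The "|" separator keeps the two fields from running
// together.
func articleKey(createdDate, title string) string {
	sum := sha1.Sum([]byte(createdDate + "|" + title))
	return hex.EncodeToString(sum[:])
}
```

The same date and title always produce the same 40-character key, which makes it usable as a map key or, later, as a primary key in a persistence layer.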

The top level of the response has a Status field, which I assume mirrors the HTTP status code, though it’s weird for that to be in the body. I believe NumResults is the number of articles in their database, but I’m not sure how it’s useful, because while you can use the offset, that number can change.

For instance: there are 40 records. I use an offset of 30 to grab the last 10 records. However, by the time I make that call, a new record has been added, so everything shifts by one and I’m now missing the last record. That seems weird for a realtime API; I’d rather give it a range.

The Results field is all I needed. Those are all of my articles, which I unmarshal into the Article type. I have to say, Go makes dealing with REST APIs so much easier: you know exactly what you’re going to get.
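To make the shape concrete, here is a rough sketch of the types involved. The JSON key names and the sample payload are assumptions for illustration (check them against what the API console actually returns), and decode is a made-up helper name.

```go
package main

import (
	"encoding/json"
	"strings"
)

// Article holds the handful of fields this sketch cares about.
// The JSON keys are assumptions, not copied from the repo.
type Article struct {
	Title       string `json:"title"`
	URL         string `json:"url"`
	CreatedDate string `json:"created_date"`
	UpdatedDate string `json:"updated_date"`
}

// Response mirrors the top level of the payload described above.
type Response struct {
	Status     string    `json:"status"`
	NumResults int       `json:"num_results"`
	Results    []Article `json:"results"`
}

// sample is an invented payload, trimmed to the fields above.
const sample = `{
  "status": "OK",
  "num_results": 2,
  "results": [
    {"title": "First", "url": "https://example.com/1",
     "created_date": "2016-02-14T08:00:00-05:00",
     "updated_date": "2016-02-14T09:30:00-05:00"},
    {"title": "Second", "url": "https://example.com/2",
     "created_date": "2016-02-14T07:00:00-05:00",
     "updated_date": "2016-02-14T07:00:00-05:00"}
  ]
}`

// decode unmarshals a JSON body into a Response. Keys that the structs
// do not mention are silently ignored by encoding/json.
func decode(body string) (Response, error) {
	var r Response
	err := json.NewDecoder(strings.NewReader(body)).Decode(&r)
	return r, err
}
```

Calling decode(sample) fills in Results with two Articles; anything else in the payload is simply dropped, which is what makes the "only look at what you need" approach so pleasant.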

Sorting the Articles

I discovered pretty quickly that the NYTimes API was being sorted by the created_at, not the updated_at. I decided I wanted to sort by the updated_at, because if an Article changes, I want to push out the update (I’d let the client deal with conflicts). There are certainly use cases where you want the created_at though, specifically ones where you don’t want to handle updates.
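A sketch of that re-sort, assuming the dates arrive as RFC 3339 strings and that "sorted" means newest update first. The field names, updatedAt, and sortByUpdated are illustrative, not taken from the repo.

```go
package main

import (
	"sort"
	"time"
)

// Article is trimmed to what the sort needs. UpdatedDate is assumed to be
// an RFC 3339 timestamp string.
type Article struct {
	Title       string
	UpdatedDate string
}

// updatedAt parses the updated date. Unparseable dates become the zero
// time, so those articles sort as the oldest.
func updatedAt(a Article) time.Time {
	t, err := time.Parse(time.RFC3339, a.UpdatedDate)
	if err != nil {
		return time.Time{}
	}
	return t
}

// sortByUpdated reorders articles in place, most recently updated first.
// SliceStable keeps the API's original order for ties.
func sortByUpdated(articles []Article) {
	sort.SliceStable(articles, func(i, j int) bool {
		return updatedAt(articles[i]).After(updatedAt(articles[j]))
	})
}
```

Swapping UpdatedDate for a created date field in updatedAt would give the other behavior mentioned above, for clients that don’t want to handle updates.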

Calling the NYTimes

This was the easiest part. The NYTimes provides a very nice API Console and it’s trivial to create an API key with a very generous limit (5K calls/day!). I didn’t really love the structure of the API, but the tooling made up for that. We just grab the data from the API and decode it directly into a struct that only looks at the “results” field in the response body, which holds the Articles.
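Roughly, the call looks like the sketch below. The endpoint is passed in rather than hard-coded (use whatever URL and API key the console gives you), fetchArticles and tryFetch are made-up names, and tryFetch talks to a local stand-in server with an invented payload instead of the real NYTimes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Article is trimmed to two fields; the JSON keys are assumptions.
type Article struct {
	Title string `json:"title"`
	URL   string `json:"url"`
}

// newswire only declares "results", so everything else in the body is
// ignored during decoding.
type newswire struct {
	Results []Article `json:"results"`
}

// fetchArticles GETs the endpoint and decodes the "results" array.
func fetchArticles(client *http.Client, endpoint string) ([]Article, error) {
	resp, err := client.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("newswire: unexpected status %s", resp.Status)
	}
	var payload newswire
	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
		return nil, err
	}
	return payload.Results, nil
}

// tryFetch (hypothetical helper) exercises fetchArticles against a local
// test server that returns a canned response.
func tryFetch() ([]Article, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"status":"OK","num_results":1,"results":[{"title":"Hello","url":"https://example.com/hello"}]}`)
	}))
	defer srv.Close()
	return fetchArticles(srv.Client(), srv.URL)
}
```

Against the real API you would pass http.DefaultClient (or a client with a timeout) and the Newswire URL with your key appended, and call it on a timer that stays within the daily quota.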

Sending it to the Client

Now we just loop through the articles and send down anything that’s new directly to the ResponseWriter. We do a funny thing and reverse the natural order of the articles: instead of sending the newest articles first, we send those last. While most of the time there’s just one new article, on the initial payload and whenever several arrive at once, you want anyone hooking onto the API to always be able to grab the newest article off the top of the stack.
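The loop can be sketched in two small pieces. This is an illustration, not the repo’s code: the in-memory seen map stands in for real persistence, keying on title plus updated date is an assumption (it means an updated article gets pushed again), and unseenOldestFirst and send are made-up names.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Article is trimmed to the fields this sketch uses.
type Article struct {
	Title       string `json:"title"`
	UpdatedDate string `json:"updated_date"`
}

// unseenOldestFirst expects articles sorted newest first. It returns the
// ones not yet in seen, reversed so the newest comes last, and records
// them in seen.
func unseenOldestFirst(articles []Article, seen map[string]bool) []Article {
	var batch []Article
	for i := len(articles) - 1; i >= 0; i-- {
		a := articles[i]
		key := a.Title + "|" + a.UpdatedDate
		if seen[key] {
			continue
		}
		seen[key] = true
		batch = append(batch, a)
	}
	return batch
}

// send writes each article as one JSON line and flushes, so the batch
// reaches the client without waiting for the handler to return.
func send(w http.ResponseWriter, batch []Article) error {
	enc := json.NewEncoder(w)
	for _, a := range batch {
		if err := enc.Encode(a); err != nil {
			return err
		}
	}
	if f, ok := w.(http.Flusher); ok {
		f.Flush()
	}
	return nil
}
```

On the first poll everything is unseen, so the whole list goes down oldest to newest; on later polls only the new arrivals are sent, and the last line a client receives is always the newest article.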

And that’s it! Successfully streaming the NYTimes. What’s the use for this? Well, probably a tweetbot that tweets all the newest articles that meet certain criteria. That’s probably what I’ll build next, after I add some persistence to this.

You can view the full thing here: https://github.com/scottefein/nytimes-streaming-proxy

Happy Valentine’s Day Everyone!
