How I built ArePeopleTalkingAboutIt.com

This is the story of how I built arepeopletalkingaboutit.com.

The idea came to me when I was advocating for Git on my team at work (we're still stuck on SVN). I decided to use the question counts for tags on Stack Overflow to get a sense of whether or not people were talking about certain tools or technologies, and to what extent.

I manually visited the pages on Stack Overflow for a few popular source and version control tools to count the number of questions. Two important values were the total number of questions and how many of those questions had answers. I threw the numbers into a Google spreadsheet, graphed it out, then took a screenshot of the graph and uploaded it to the blog post as a PNG.

As I was manually visiting those pages and compiling the chart I realized how easy it would be to write a crawler to do so. After all, my last pet project involved a lot of web scraping (I tried Ruby's Mechanize, PhantomJS, and CasperJS). Wouldn't it be neat to have a website where you could type in a few tags and instantly get a graph? And then I realized how silly it would be to crawl a technology site that probably has an API.

The API

The documentation for the Stack Exchange API is over here. I glanced through it to see if I could get the total number of questions and the total number of answers from a single API call, but wasn't able to find anything that would do the trick. I ended up combining two API calls.
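
In spirit, they were something like these two (a sketch; the exact endpoints and filters here are an approximation, not necessarily verbatim what I used):

    # Total questions for a tag (the response includes a "count" field):
    GET https://api.stackexchange.com/2.2/tags/cvs/info?site=stackoverflow

    # How many of those questions have no answers:
    GET https://api.stackexchange.com/2.2/questions/no-answers?tagged=cvs&site=stackoverflow&filter=total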

The API had a pretty neat page for testing out the different queries and seeing what the responses would look like. I'm a big believer that the faster you can get feedback, the faster you can iterate and test new ideas, so I commend Stack Exchange for providing that feature.

The minimum viable product

I'm used to contributing to big enterprise Java applications at work. The way that we make web applications is kind of painful, complicated, and involved. They're arguably more maintainable this way, but that's beside the point. I wanted to try something different for this project, because it's mine and I get to choose! (I don't always get to choose things at work.)

Fortunately, I did lots of research the last time I architected a web application, so it was easy for me to do it again.

I picked Node.js + Express running on a single Amazon t2.micro EC2 machine. We're not allowed to use these trendy, fancy new cloud services at work, so it always feels liberating to do it at home.

I've already set up web applications this way in the past, so I was able to get a minimum viable product working in a few hours. Here's what it looked like:

The very first thing I did was create a bare-bones page with a single text input field. I used jQuery (because I was familiar with it) to submit the form and send a GET request to the server, where Express retrieved the incoming request and parsed the request parameter. I then made the two requests to the Stack Exchange API. Originally the two requests ran sequentially, but then I figured out how to use promises to send them both at the same time. I'm still wrapping my head around the "asynchronous event-driven callback" magic JavaScript and Node.js love to do (and I'm starting to appreciate it too!). Finally, I combined the two Stack Exchange API responses and sent them back to the client web browser, which displayed them on the screen.
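
A stripped-down sketch of that server flow, assuming a /counts route and Node's built-in fetch (the names and response shape are illustrative, not the exact production code):

    // server.js: a minimal sketch of the MVP flow, not the exact production code.
    // Assumes Node 18+ for built-in fetch; the route name and JSON shape are illustrative.
    const express = require('express');
    const app = express();

    const API = 'https://api.stackexchange.com/2.2';

    app.get('/counts', async (req, res) => {
      const tag = req.query.tag; // e.g. /counts?tag=cvs

      try {
        // Fire both Stack Exchange requests in parallel instead of sequentially.
        const [infoRes, noAnswersRes] = await Promise.all([
          fetch(`${API}/tags/${encodeURIComponent(tag)}/info?site=stackoverflow`),
          fetch(`${API}/questions/no-answers?tagged=${encodeURIComponent(tag)}&site=stackoverflow&filter=total`),
        ]);
        const info = await infoRes.json();
        const noAnswers = await noAnswersRes.json();

        // Combine the two responses into one payload for the browser.
        const totalQuestions = info.items[0].count;
        const totalUnanswered = noAnswers.total;
        res.json({
          totalQuestions,
          totalAnswers: totalQuestions - totalUnanswered,
          totalUnanswered,
        });
      } catch (err) {
        res.status(502).json({ error: 'Stack Exchange request failed' });
      }
    });

    app.listen(54001);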

You must have noticed the chart. I'm a big believer in charts and visualization tools; humans are bad at quickly comparing numbers without some sort of graphical representation. For the charts I debated using D3.js because a) all the cool kids are using it and b) I have some experience with it.

And then I remembered how long it took me to make a bar chart with it.

Let's be real, D3.js is a wonderful visualization library where the sky is the limit. All I wanted was a simple bar chart. I checked out Google's visualization library (Google Charts), which had great documentation and examples. In under half an hour I was able to take the server response and throw it into a chart.
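
The wiring ended up being something in this spirit (a sketch: the element IDs, the /counts route, and the exact options are illustrative):

    // Client-side sketch. Assumes the Google Charts loader <script> is on the page,
    // plus a <div id="chart"> and a form with #tag-form / #tag-input.
    google.charts.load('current', { packages: ['corechart'] });

    function drawChart(tag, counts) {
      var data = google.visualization.arrayToDataTable([
        ['Tag', 'Answered', 'Unanswered'],
        [tag, counts.totalAnswers, counts.totalUnanswered]
      ]);
      new google.visualization.BarChart(document.getElementById('chart'))
        .draw(data, { isStacked: true });
    }

    // jQuery submits the form, fetches the counts, then charts them.
    $('#tag-form').on('submit', function (e) {
      e.preventDefault();
      var tag = $('#tag-input').val();
      $.getJSON('/counts', { tag: tag }, function (counts) {
        // Callbacks registered after the library has loaded fire immediately.
        google.charts.setOnLoadCallback(function () { drawChart(tag, counts); });
      });
    });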

Let's recap what's going on, in a very crude diagram:

Optimizing

I had some ideas to refine the product. For starters, some of the Stack Exchange responses came with a little bit of extra information, specifically:

{"items":[{"has_synonyms":false,"is_moderator_only":false,"is_required":false,"count":1321,"name":"cvs"}],"has_more":false                                                                                     ,"quota_max":300,"quota_remaining":248}

Spot an issue? I was only allowed 300 requests per day. After reading the throttling documentation I learned I can eventually get 10,000 requests per day by requesting an API key.

But how often are these numbers changing? What if I cached the responses? Would a six-hour, 12-hour, three-day, seven-day, or one-month-old stale count of answered/unanswered questions still be helpful and relevant? Probably not a month old. Ideally I'd like to watch these trends to pick a good number, but off the top of my head I decided to cache any response for one day. It's a starting point.

What's the tool for this job? Redis! I really like working with Redis.

I've installed, configured, and used Redis on a t2.micro VM in the past, so it was pretty quick to do again. I would use the SET command with an expiry, like SET cvs '{"totalQuestions":1321,"totalAnswers":996,"totalUnanswered":325}' EX 86400, to store it in memory for one day.

Then I modified my requests to the Stack Exchange API to first check Redis for a value (using the GET command) and only make the expensive query across the internet to the API if the key wasn't found in my local Redis cache. Upon getting a response from Stack Exchange, I would stick it in Redis (using that SET command) in addition to sending it back to the client web browser.
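
In code, that cache-aside check looked something like this (a sketch against the current node-redis client; my original used the older callback API, and fetchCountsFromStackExchange is a hypothetical helper standing in for the Promise.all code from earlier):

    const { createClient } = require('redis');

    const redisClient = createClient(); // defaults to localhost:6379
    redisClient.connect().catch(console.error); // node-redis v4+ connects explicitly

    const ONE_DAY_SECONDS = 86400;

    // Cache-aside: serve from Redis when we can, otherwise hit the Stack Exchange
    // API and cache the combined result for a day. The `cached` flag is the
    // verification trick described below.
    async function getCounts(tag) {
      const hit = await redisClient.get(tag);
      if (hit) {
        return { ...JSON.parse(hit), cached: true };
      }

      const counts = await fetchCountsFromStackExchange(tag); // hypothetical helper
      await redisClient.set(tag, JSON.stringify(counts), { EX: ONE_DAY_SECONDS });
      return { ...counts, cached: false };
    }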

In another rough diagram:

How do you verify the cache is working? The easiest way is to attach a flag to the response saying whether it came from the cache, and then make two requests. The second request should come from the cache, and if I time the HTTP request/response (maybe using the Chrome/Firefox dev tools in the browser) it should be significantly faster.
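
For example, from the browser console (assuming the hypothetical /counts route and cached flag from the sketches above):

    // Run this twice; the second run should be served by Redis.
    console.time('counts');
    fetch('/counts?tag=cvs')
      .then(function (res) { return res.json(); })
      .then(function (body) {
        console.timeEnd('counts'); // the cache hit should be dramatically faster
        console.log('from cache?', body.cached);
      });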

Making it production ready

As I was testing the web application I used nodemon to listen for changes and restart the web application so I could iterate quickly. I tested it by pointing a web browser at my host:port address, which was firewalled to only accept connections from my IP address.

This is not ideal for the real world. Before I deployed to the World Wide Web, I wanted to stick a robust webserver in front of the Node/Express webserver. This way I could put up multiple Node/Express applications and use the Internet-facing webserver to load balance (hah, if I ever get so popular), but more importantly I could update and deploy any fixes or changes to the Node/Express web apps without downtime.

What's the tool for this job? I'm familiar with Apache, but it looks like NGINX is more relevant for my use case. Also, I wanted to learn something new.

Here's the latest version, along with a few of the other technologies I used.

Now, for example, if I want to deploy a hot fix I can (see the NGINX sketch after these steps):

  1. Edit the NGINX configuration to only forward requests to port 54001, followed by an nginx reload.
  2. On the web app at port 54002, do a forever stop, git pull, then forever start to start it back up.
  3. Edit the NGINX configuration to only forward requests to port 54002, followed by an nginx reload.
  4. On the web app at port 54001, repeat the forever stop, git pull, and forever start to get it updated.
  5. Edit the NGINX configuration to forward requests to either web application, since they're both now on the same version.
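
For reference, here's roughly what the NGINX side of that dance looks like (a sketch; the file path and upstream name are made up):

    # Hypothetical /etc/nginx/conf.d/arepeopletalkingaboutit.conf
    upstream node_apps {
        server 127.0.0.1:54001;
        server 127.0.0.1:54002;
        # During a deploy, comment out the instance being updated,
        # then reload with: nginx -s reload
    }

    server {
        listen 80;
        server_name arepeopletalkingaboutit.com;

        location / {
            proxy_pass http://node_apps;
            proxy_set_header Host $host;
        }
    }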

Obviously there are many improvements that would make this more fault tolerant and scalable, the most glaring of which is that everything is sitting on a single t2.micro EC2 machine. Considering this is a stateless web application, it should be simple enough to throw more EC2 machines at the problem (maybe in different availability zones or regions), perhaps by resolving to different hostnames via DNS or by putting them behind Amazon's Elastic Load Balancing. I'll add that feature to my TODO list as soon as this becomes a raging viral success.

Polishing

The last step was making sure the web UI looked pretty. I wanted a simple FAQ and an explanation of how it worked. I imagined developers would approach it differently than business people and managers would, and I wanted it to appeal to both.

I also wanted to make sure it was easy to interact with and share those results with friends/colleagues/decision makers. It was important that the URL was readable and that you could send it in an instant message or email to someone and both be looking at the same thing.

I also wanted a way for people to get in touch with me and offer feedback or suggestions. I welcome constructive criticism, and the only way I'm going to get better at these things is by learning, doing, and listening. It's probably how I got a job as an entry-level developer two years ago without much prior experience (I have a B.S. in Optics). I've been coding for a few years, but only recently professionally.

Oh, and the site needed to be mobile friendly, of course.

What's the tool for this job? I've used Twitter Bootstrap in the past and liked it a lot, so I'm picking it again. On the server side, Node.js had a few templating engines to choose from, and after some research Jade looked more favorable than EJS. But don't take my word for it:

Heeheehee. See an interactive version here.

Ship it

This is always the hardest part for me. It plagued me while I was designing pricepatient.com (and still does today-- there's so much room for improvement on that website). But it's important to get the product out there and then iterate like crazy.

For one thing, it forces you to get used to deploying regularly. You have to learn how to check in code changes in a dev environment, pull them down in a production environment, and deploy without any downtime. That's the goal anyway, right? That's what this whole agile idea is about. I'm not building a rocket, nor am I dealing with sensitive or private data. This is the perfect place to fail fast, or to succeed iteratively and slowly.

I registered the domain, deployed my code, and opened up the firewall to port 80 for the whole world to see. Check it out and let me know what you think!