Iron Viz entry: Blame the Weather

6/14/2015 Matt Chambers 4 Comments


Tableau launched the second Iron Viz feeder contest, and the requirements were to use a data set of at least 1 million records.  Ben Jones from Tableau posted a pretty cool viz and data set on flight data for 2014.  When I looked at the viz, I kept wondering what was contributing to the flight delays.

I decided to build a viz looking at the impact of the weather, specifically precipitation, on flight delays.  Since Ben's data set only contained flight data, I needed a way to join the weather data for each day to all of the flights in the data set.  The only issue with this is I couldn't find a data set that contained weather specifically for airports.  Since nothing was available, I started looking at possible data sources, and I found an API endpoint that offers CSV data for airports from WeatherUnderground.  Now I needed to be able to scrape all of that data for every airport and compile it into one file.

To do this, I used the trusty curl command.  In order to retrieve the csv and store it in a file, I used the following command:

curl "http://www.wunderground.com/history/airport/KPDT/2014/1/1/CustomHistory.html?dayend=1&monthend=1&yearend=2015&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=&format=1" >> output.csv

Since I already had a list of all of the airport IDs in the flight file, I used Excel to generate all of the commands and stored them in a shell script that I ran to retrieve all of the weather data for every airport for 2014.

Once I had the complete weather data for all of the airports, I needed to be able to join all of the weather data to each flight.  With over 5 million records, using Excel was out of the question.  I tried to do a data blend in Tableau, but I was not able to get the weather data tied to the flight records in a way that gave me the drill-down I was looking for.  I've been using Alteryx a bit lately, and I decided that was my best course for this project.

I fired up Alteryx and started building a workflow.  The first thing I needed to do was convert all of the dates to proper date formats so I could join the data sets.  Then, I joined the data sets by date and airport ID.  Since I wanted to be able to see how precipitation affects the flights, I needed to filter out the days that didn't have precipitation.  Next, I joined another data set that contained geocoordinates for the airports.

For this viz, I wanted do do a small multiple map that showed the flight paths.  In order to do that, I needed to be able to create a path order as Andy Kriebel's tutorial shows.  To do this in Alteryx, I first joined the geocoordinates from the airports file to the origin airport records, and then I joined the file again to get the geoocordinates for the destinations as well.  Once I had everything joined, I was able to create new fields for latitude, longitude, path order, and route as indicated in Andy's tutorial.

After creating the new fields, I was simply able to union the files back together.  I attempted to do this in Tableau with custom SQL, but I kept getting an error about "bad input", so I'm not sure if Tableau can handle 3 million+ records in custom SQL.

Now that I finally had my complete data set built, it was time to start vizzing.  I had an idea about doing small multiple maps.  I had seen another post that Ben Jones had done on small multiple maps, and I've really been wanting to do a viz like that.  After looking at the data, I identified the top 10 airports in terms of overall flights, and I decided to narrow my viz to just those.  With only the top 10 airports, there are still 1.2 million flight records, so this met the requirements for the contest.

I wanted to be able to look at the data from a statistical perspective to identify the airports and flights that are more affected by precipitation.  I started thinking about how to create a metric that would make this type of analysis possible.  I came up with a metric called Weather Delay Minutes per Inch of Precipitation (WDMIP), and it is simply the average weather delay divided by the average precipitation in inches.  I'm a big fan of using Z scores, so I used the tutorial provided by Tableau to calculate the Z scores based on WDMIP.


Now that I had the Z scores calculated, I put together a quick heatmap of the airports by month to look at which months were more affected by the weather.  As soon as I saw the heatmap, I started thinking about how to build a user interface with the heatmap as the primary control for the dashboard.  Since you can click the columns for the months, rows for the airports, and individual cells for the intersection of the rows and columns, you could conceivably control a dashboard without the use of filters.  I thought this would be a much more immersive user experience since the user can simply click the heatmap to filter the views.


As I previously mentioned, I wanted to be able to build a small multiple view of the top 10 airports. Once I had the map put together, I was able to color the paths of each flight by the Z scores I had previously calculated.  The Z score easily lets you see which flights have longer flight delays when there is precipitation.

This was a really fun viz to build even though it was quite a bit of work to get the data set together. I'm really happy Tableau upgraded Public so that it would allow over a million records.  I hope this viz helps people find new insights about the impact of precipitation on flight delays. Check it out!

4 comments:

  1. Very cool. I need to get Alteryx! I'm hearing so many great things about it.

    ReplyDelete
  2. Good job! I think it's about time to fire up Alteryx!

    ReplyDelete
  3. Hey Matt Please check out my viz at
    https://public.tableau.com/profile/clinton.johnson6689#!/vizhome/LeagueofLegendsStronkestADCs2/Dashboard1
    it has 1000 views in 2 days and I'm looking to get on 'viz of the day' or something...thanks

    ReplyDelete

Note: Only a member of this blog may post a comment.