Indexes

Today we're going to be talking about pandas' Indexes. They're essential to pandas, but can be a difficult concept to grasp at first. I suspect this is partly because they're unlike what you'll find in SQL or R.

Indexes offer

  • a metadata container
  • easy label-based row selection and assignment
  • easy label-based alignment in operations

One of my first tasks when analyzing a new dataset is to find a unique identifier for each observation and set that as the index. It could be a simple integer, or, like in our first chapter, it could be several columns (carrier, origin, dest, tail_num, date).
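
To make that concrete, here's a minimal sketch on a made-up, flights-like table (toy data and values, not the dataset we use below): set the composite key as the index once, and label-based selection falls out of it.

import pandas as pd

# Toy, made-up flights-like data (for illustration only).
df = pd.DataFrame({
    'carrier':   ['AA', 'AA', 'DL', 'DL'],
    'origin':    ['ORD', 'DSM', 'ATL', 'ORD'],
    'dest':      ['DSM', 'ORD', 'ORD', 'ATL'],
    'dep_delay': [5, -2, 11, 0],
})

# Set the composite key as the index once...
df = df.set_index(['carrier', 'origin', 'dest']).sort_index()

# ...and label-based selection reads naturally afterwards.
df.loc[('AA', 'DSM')]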

To demonstrate the benefits of proper Index use, we'll first fetch some weather data from sensors at a bunch of airports across the US. See here for the example scraper I based this off of. Those uninterested in the details of fetching and prepping the data can skip past it.

At a high level, here's how we'll fetch the data: the sensors are broken up by "network" (states). We'll make one API call per state to get the list of airport IDs per network (using get_ids below). Once we have the IDs, we'll again make one call per state getting the actual observations (in get_weather). Feel free to skim the code below; I'll highlight the interesting bits.

In [30]:
%matplotlib inline

import os
import json
import glob
import datetime
from io import StringIO

import requests
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import prep

sns.set_style('ticks')
pd.options.display.max_rows = 10
# States are broken into networks. The networks have a list of ids, each representing a station.
# We will take that list of ids and pass them as query parameters to the URL we built up earlier.
states = """AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD ME
 MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT
 WA WI WV WY""".split()

# IEM has Iowa AWOS sites in its own labeled network
networks = ['AWOS'] + ['{}_ASOS'.format(state) for state in states]
In [107]:
def get_weather(stations, start=pd.Timestamp('2017-01-01'),
                end=pd.Timestamp('2017-01-31')):
    '''
    Fetch weather data from MESONet between ``start`` and ``end``.
    '''
    url = ("http://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?"
           "&data=tmpf&data=relh&data=sped&data=mslp&data=p01i&data=v"
           "sby&data=gust_mph&data=skyc1&data=skyc2&data=skyc3"
           "&tz=Etc/UTC&format=comma&latlon=no"
           "&{start:year1=%Y&month1=%m&day1=%d}"
           "&{end:year2=%Y&month2=%m&day2=%d}&{stations}")
    stations = "&".join("station=%s" % s for s in stations)
    weather = (pd.read_csv(url.format(start=start, end=end, stations=stations),
                           comment="#")
                 .rename(columns={"valid": "date"})
                 .rename(columns=str.strip)
                 .assign(date=lambda df: pd.to_datetime(df['date']))
                 .set_index(["station", "date"])
                 .sort_index())
    float_cols = ['tmpf', 'relh', 'sped', 'mslp', 'p01i', 'vsby', "gust_mph"]
    weather[float_cols] = weather[float_cols].apply(pd.to_numeric, errors="coerce")
    return weather
In [108]:
def get_ids(network):
    url = "http://mesonet.agron.iastate.edu/geojson/network.php?network={}"
    r = requests.get(url.format(network))
    md = pd.io.json.json_normalize(r.json()['features'])
    md['network'] = network
    return md

There isn't too much in get_weather worth mentioning, just grabbing some CSV files from various URLs. The server puts metadata at the top of each "CSV" as lines prefixed by a #; pandas will ignore these thanks to the comment='#' parameter.
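
As a quick illustration of that parameter, here's a self-contained sketch with made-up CSV text (not the actual MESONet response):

import pandas as pd
from io import StringIO

# Lines that start with '#' are treated as comments and skipped entirely.
fake_csv = StringIO(
    "# station metadata would go here\n"
    "# more metadata\n"
    "station,valid,tmpf\n"
    "DSM,2014-01-01 00:54,10.94\n"
)
pd.read_csv(fake_csv, comment="#")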

I do want to talk briefly about the gem of a method that is json_normalize. The weather API returns some slightly-nested data.

In [109]:
url = "http://mesonet.agron.iastate.edu/geojson/network.php?network={}"
r = requests.get(url.format("AWOS"))
js = r.json()

js['features'][:2]
Out[109]:
[{'geometry': {'coordinates': [-94.2724, 43.0796], 'type': 'Point'},
  'id': 'AXA',
  'properties': {'climate_site': 'IA0133',
   'country': 'US',
   'county': 'Kossuth',
   'elevation': 368.8,
   'ncdc81': 'USC00130133',
   'sid': 'AXA',
   'sname': 'ALGONA',
   'state': 'IA',
   'tzname': 'America/Chicago',
   'ugc_county': 'IAC109',
   'ugc_zone': 'IAZ005',
   'wfo': 'DMX'},
  'type': 'Feature'},
 {'geometry': {'coordinates': [-93.5695, 41.6878], 'type': 'Point'},
  'id': 'IKV',
  'properties': {'climate_site': 'IA0241',
   'country': 'US',
   'county': 'Polk',
   'elevation': 270.7,
   'ncdc81': 'USC00130241',
   'sid': 'IKV',
   'sname': 'ANKENY',
   'state': 'IA',
   'tzname': 'America/Chicago',
   'ugc_county': 'IAC153',
   'ugc_zone': 'IAZ060',
   'wfo': 'DMX'},
  'type': 'Feature'}]

If we just pass that list off to the DataFrame constructor, we get this.

In [111]:
pd.DataFrame(js['features']).head()
Out[111]:
geometry id properties type
0 {'type': 'Point', 'coordinates': [-94.2724, 43... AXA {'ugc_county': 'IAC109', 'sname': 'ALGONA', 'u... Feature
1 {'type': 'Point', 'coordinates': [-93.5695, 41... IKV {'ugc_county': 'IAC153', 'sname': 'ANKENY', 'u... Feature
2 {'type': 'Point', 'coordinates': [-95.0465, 41... AIO {'ugc_county': 'IAC029', 'sname': 'ATLANTIC', ... Feature
3 {'type': 'Point', 'coordinates': [-94.9204, 41... ADU {'ugc_county': 'IAC009', 'sname': 'AUDUBON', '... Feature
4 {'type': 'Point', 'coordinates': [-93.8486, 42... BNW {'ugc_county': 'IAC015', 'sname': 'BOONE MUNI'... Feature

In general, DataFrames don't handle nested data that well. It's often better to normalize it somehow. In this case, we can "lift" the nested items (geometry.coordinates, properties.sid, and properties.sname) up to the top level.

In [112]:
pd.io.json.json_normalize(js['features'])
Out[112]:
geometry.coordinates geometry.type id properties.climate_site properties.country properties.county properties.elevation properties.ncdc81 properties.sid properties.sname properties.state properties.tzname properties.ugc_county properties.ugc_zone properties.wfo type
0 [-94.2724, 43.0796] Point AXA IA0133 US Kossuth 368.8 USC00130133 AXA ALGONA IA America/Chicago IAC109 IAZ005 DMX Feature
1 [-93.5695, 41.6878] Point IKV IA0241 US Polk 270.7 USC00130241 IKV ANKENY IA America/Chicago IAC153 IAZ060 DMX Feature
2 [-95.0465, 41.4059] Point AIO IA0364 US Cass 351.7 USC00130364 AIO ATLANTIC IA America/Chicago IAC029 IAZ070 DMX Feature
3 [-94.9204, 41.6994] Point ADU IA0385 US Audubon 399.3 USC00130385 ADU AUDUBON IA America/Chicago IAC009 IAZ057 DMX Feature
4 [-93.8486, 42.0486] Point BNW IA0807 US Boone 349.3 USC00130807 BNW BOONE MUNI IA America/Chicago IAC015 IAZ047 DMX Feature
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
40 [-95.4112, 40.7533] Point SDA IA7613 US Fremont 296.0 USC00137613 SDA SHENANDOAH MUNI IA America/Chicago IAC071 IAZ090 OAX Feature
41 [-95.2399, 42.5972] Point SLB IA7979 US Buena Vista 449.9 USC00137979 SLB Storm Lake IA America/Chicago IAC021 IAZ301 FSD Feature
42 [-92.0248, 42.2176] Point VTI IA8568 US Benton 255.0 USC00138568 VTI VINTON IA America/Chicago IAC011 IAZ051 DVN Feature
43 [-91.6748, 41.2751] Point AWG IA8688 US Washington 228.9 USC00138688 AWG WASHINGTON IA America/Chicago IAC183 IAZ077 DVN Feature
44 [-93.8691, 42.4392] Point EBS IA8806 US Hamilton 337.4 USC00138806 EBS Webster City IA America/Chicago IAC079 IAZ036 DMX Feature

45 rows × 16 columns

Sure, it's not that difficult to write a quick for loop or list comprehension to extract those, but that gets tedious. If we were using the latitude and longitude data, we would want to split the geometry.coordinates column into two. But we aren't, so we won't.
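
For the record, here's a sketch of what that by-hand version looks like, using the js variable from the cell above (only a handful of fields, to make the point):

# Manually lifting a few nested fields that json_normalize handles for us.
pd.DataFrame({
    'id':    [f['id'] for f in js['features']],
    'sname': [f['properties']['sname'] for f in js['features']],
    'lon':   [f['geometry']['coordinates'][0] for f in js['features']],
    'lat':   [f['geometry']['coordinates'][1] for f in js['features']],
}).head()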

Going back to the task, we get the airport IDs for every network (state) with get_ids. Then we pass those IDs into get_weather to fetch the actual weather data.

In [117]:
import os

ids = pd.concat([get_ids(network) for network in networks], ignore_index=True)
gr = ids.groupby('network')

store = 'data/weather.h5'

if not os.path.exists(store):
    os.makedirs("data/weather", exist_ok=True)

    for k, v in gr:
        weather = get_weather(v['id'])
        weather.to_csv("data/weather/{}.csv".format(k))

    weather = pd.concat([
        pd.read_csv(f, parse_dates=['date'], index_col=['station', 'date'])
        for f in glob.glob('data/weather/*.csv')
    ]).sort_index()

    weather.to_hdf("data/weather.h5", "weather")
else:
    weather = pd.read_hdf("data/weather.h5", "weather")
In [ ]:
weather.head()

OK, that was a bit of work. Here's a plot to reward ourselves.

In [36]:
airports = ['W43', 'AFO', '82V', 'DUB']

g = sns.FacetGrid(weather.loc[airports].reset_index(),
                  col='station', hue='station', col_wrap=2, size=4)
g.map(sns.regplot, 'sped', 'gust_mph')
Out[36]:
<seaborn.axisgrid.FacetGrid at 0x11e9917f0>

Set Operations

Indexes are set-like (technically multisets, since you can have duplicates), so they support most Python set operations. Since indexes are immutable you won't find any of the in-place set operations. One other difference is that since Indexes are also array-like, you can't use some infix operators like - for difference: with a numeric index it would be unclear whether you intend to perform math operations or set operations. You can use & for intersection, | for union, and ^ for symmetric difference though, since there's no ambiguity there.
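
Here's a tiny sketch of those operations on a pair of throwaway indexes:

a = pd.Index(['DSM', 'ORD', 'SFO'])
b = pd.Index(['ORD', 'SFO', 'JFK'])

a & b            # intersection -> ORD, SFO
a | b            # union -> DSM, JFK, ORD, SFO
a ^ b            # symmetric difference -> DSM, JFK
a.difference(b)  # no infix `-` for Indexes; use the method -> DSM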

For example, let's find the set of airports that we have both weather and flight information on. Since weather has a MultiIndex of (airport, datetime), we'll use the levels attribute to get at the airport data, separate from the date data.

In [37]:
# Bring in the flights data

flights = pd.read_hdf('data/flights.h5', 'flights')

weather_locs = weather.index.levels[0]
# The `categories` attribute of a Categorical is an Index
origin_locs = flights.origin.cat.categories
dest_locs = flights.dest.cat.categories

airports = weather_locs & origin_locs & dest_locs
airports
Out[37]:
Index(['ABE', 'ABI', 'ABQ', 'ABR', 'ABY', 'ACT', 'ACV', 'ACY', 'AEX', 'AGS',
       ...
       'TUL', 'TUS', 'TVC', 'TWF', 'TXK', 'TYR', 'TYS', 'VLD', 'VPS', 'XNA'],
      dtype='object', length=265)
In [38]:
print("Weather, no flights:\n\t", weather_locs.difference(origin_locs | dest_locs), end='\n\n')

print("Flights, no weather:\n\t", (origin_locs | dest_locs).difference(weather_locs), end='\n\n')

print("Dropped Stations:\n\t", (origin_locs | dest_locs) ^ weather_locs)
Weather, no flights:
	 Index(['01M', '04V', '04W', '05U', '06D', '08D', '0A9', '0CO', '0E0', '0F2',
       ...
       'Y50', 'Y51', 'Y63', 'Y70', 'YIP', 'YKM', 'YKN', 'YNG', 'ZPH', 'ZZV'],
      dtype='object', length=1910)

Flights, no weather:
	 Index(['ADK', 'ADQ', 'ANC', 'BET', 'BQN', 'BRW', 'CDV', 'FAI', 'FCA', 'GUM',
       'HNL', 'ITO', 'JNU', 'KOA', 'KTN', 'LIH', 'MQT', 'OGG', 'OME', 'OTZ',
       'PPG', 'PSE', 'PSG', 'SCC', 'SCE', 'SIT', 'SJU', 'STT', 'STX', 'WRG',
       'YAK', 'YUM'],
      dtype='object')

Dropped Stations:
	 Index(['01M', '04V', '04W', '05U', '06D', '08D', '0A9', '0CO', '0E0', '0F2',
       ...
       'Y63', 'Y70', 'YAK', 'YIP', 'YKM', 'YKN', 'YNG', 'YUM', 'ZPH', 'ZZV'],
      dtype='object', length=1942)

Flavors

Pandas has many subclasses of the regular Index, each tailored to a specific kind of data. Most of the time these will be created for you automatically, so you don't have to worry about which one to choose.

  1. Index
  2. Int64Index
  3. RangeIndex: Memory-saving special case of Int64Index
  4. Float64Index
  5. DatetimeIndex: datetime64[ns] precision data
  6. PeriodIndex: Regularly-spaced, arbitrary precision datetime data.
  7. TimedeltaIndex
  8. CategoricalIndex
  9. MultiIndex

You will sometimes create a DatetimeIndex with pd.date_range (pd.period_range for PeriodIndex). And you'll sometimes make a MultiIndex directly too (I'll have an example of this in my post on performance).
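
For reference, those constructors look roughly like this (the values here are arbitrary):

pd.date_range('2014-01-01', periods=4, freq='H')       # DatetimeIndex
pd.period_range('2014-01', periods=3, freq='M')        # PeriodIndex
pd.MultiIndex.from_product(
    [['DSM', 'ORD'], pd.date_range('2014-01-01', periods=2)],
    names=['station', 'date'])                          # MultiIndex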

Some of these specialized index types are purely optimizations; others use information about the data to provide additional methods. And while you might occasionally work with indexes directly (like the set operations above), most of the time you'll be operating on a Series or DataFrame, which in turn makes use of its Index.

Row Slicing

We saw in part one that they're great for making row subsetting as easy as column subsetting.

In [39]:
weather.loc['DSM'].head()
Out[39]:
tmpf relh sped mslp p01i vsby gust_mph skyc1 skyc2 skyc3
date
2014-01-01 00:54:00 10.94 72.79 10.4 1024.9 0.0 10.0 NaN FEW M M
2014-01-01 01:54:00 10.94 72.79 11.5 1025.4 0.0 10.0 NaN OVC M M
2014-01-01 02:54:00 10.94 72.79 8.1 1025.3 0.0 10.0 NaN BKN M M
2014-01-01 03:54:00 10.94 72.79 9.2 1025.3 0.0 10.0 NaN OVC M M
2014-01-01 04:54:00 10.04 72.69 9.2 1024.7 0.0 10.0 NaN BKN M M

Without indexes we'd probably resort to boolean masks.

In [40]:
weather2 = weather.reset_index()
weather2[weather2['station'] == 'DSM'].head()
Out[40]:
station date tmpf relh sped mslp p01i vsby gust_mph skyc1 skyc2 skyc3
884855 DSM 2014-01-01 00:54:00 10.94 72.79 10.4 1024.9 0.0 10.0 NaN FEW M M
884856 DSM 2014-01-01 01:54:00 10.94 72.79 11.5 1025.4 0.0 10.0 NaN OVC M M
884857 DSM 2014-01-01 02:54:00 10.94 72.79 8.1 1025.3 0.0 10.0 NaN BKN M M
884858 DSM 2014-01-01 03:54:00 10.94 72.79 9.2 1025.3 0.0 10.0 NaN OVC M M
884859 DSM 2014-01-01 04:54:00 10.04 72.69 9.2 1024.7 0.0 10.0 NaN BKN M M

Slightly less convenient, but still doable.
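
Since the index here is a (station, date) MultiIndex, you can also select on both levels at once. A sketch with pd.IndexSlice, assuming the index is sorted (it is, since we called sort_index above):

# Station 'DSM', the first two days of January, two columns.
idx = pd.IndexSlice
weather.loc[idx['DSM', '2014-01-01':'2014-01-02'], ['tmpf', 'sped']].head()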

Indexes for Easier Arithmetic, Analysis

It's nice to have your metadata (labels on each observation) next to your actual values. But if you store them in an array, they'll get in the way of your operations. Say we wanted to translate the Fahrenheit temperature to Celsius.

In [41]:
# With indexes
temp = weather['tmpf']

c = (temp - 32) * 5 / 9
c.to_frame()
Out[41]:
tmpf
station date
01M 2014-01-01 00:15:00 1.0
2014-01-01 00:35:00 0.8
2014-01-01 00:55:00 0.3
2014-01-01 01:15:00 -0.1
2014-01-01 01:35:00 0.0
... ... ...
ZZV 2014-01-30 19:53:00 -2.8
2014-01-30 20:53:00 -2.2
2014-01-30 21:53:00 -2.2
2014-01-30 22:53:00 -2.8
2014-01-30 23:53:00 -1.7

3303647 rows × 1 columns

In [42]:
# without
temp2 = weather.reset_index()[['station', 'date', 'tmpf']]

temp2['tmpf'] = (temp2['tmpf'] - 32) * 5 / 9
temp2.head()
Out[42]:
station date tmpf
0 01M 2014-01-01 00:15:00 1.0
1 01M 2014-01-01 00:35:00 0.8
2 01M 2014-01-01 00:55:00 0.3
3 01M 2014-01-01 01:15:00 -0.1
4 01M 2014-01-01 01:35:00 0.0

Again, not terrible, but not as good. And what if you had wanted to keep Fahrenheit around as well, instead of overwriting it like we did? Then you'd need to make a copy of everything, including the station and date columns. We don't have that problem, since indexes are immutable and safely shared between DataFrames / Series.

In [43]:
temp.index is c.index
Out[43]:
True

Indexes for Alignment

I've saved the best for last. Automatic alignment, or reindexing, is fundamental to pandas.

All binary operations (add, multiply, etc.) between Series/DataFrames first align and then proceed.
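
Here's the idea in miniature, with two throwaway Series that only partially share labels:

a = pd.Series([1, 2, 3], index=['DSM', 'ORD', 'SFO'])
b = pd.Series([10, 20, 30], index=['ORD', 'SFO', 'JFK'])

# Aligned on labels first: 'DSM' and 'JFK' appear on only one side, so the
# result is NaN there; 'ORD' and 'SFO' line up and add as expected.
a + b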

Let's suppose we have hourly observations on temperature and windspeed. And suppose some of the observations were invalid, and not reported (simulated below by sampling from the full dataset). We'll assume the missing windspeed observations were potentially different from the missing temperature observations.

In [44]:
dsm = weather.loc['DSM']

hourly = dsm.resample('H').mean()

temp = hourly['tmpf'].sample(frac=.5, random_state=1).sort_index()
sped = hourly['sped'].sample(frac=.5, random_state=2).sort_index()
In [45]:
temp.head().to_frame()
Out[45]:
tmpf
date
2014-01-01 00:00:00 10.94
2014-01-01 02:00:00 10.94
2014-01-01 03:00:00 10.94
2014-01-01 04:00:00 10.04
2014-01-01 05:00:00 10.04
In [46]:
sped.head()
Out[46]:
date
2014-01-01 01:00:00    11.5
2014-01-01 02:00:00     8.1
2014-01-01 03:00:00     9.2
2014-01-01 04:00:00     9.2
2014-01-01 05:00:00    10.4
Name: sped, dtype: float64

Notice that the two indexes aren't identical.

Suppose that the windspeed : temperature ratio is meaningful. When we go to compute that, pandas will automatically align the two by index label.

In [47]:
sped / temp
Out[47]:
date
2014-01-01 00:00:00         NaN
2014-01-01 01:00:00         NaN
2014-01-01 02:00:00    0.740402
2014-01-01 03:00:00    0.840951
2014-01-01 04:00:00    0.916335
                         ...   
2014-01-30 13:00:00         NaN
2014-01-30 14:00:00    0.590416
2014-01-30 17:00:00         NaN
2014-01-30 21:00:00         NaN
2014-01-30 23:00:00         NaN
Length: 550, dtype: float64

This lets you focus on doing the operation, rather than manually aligning things and ensuring that the arrays are the same length and in the same order. By default, missing values are inserted where the two don't align. You can use the method version of any binary operation to specify a fill_value instead.

In [48]:
sped.div(temp, fill_value=1)
Out[48]:
date
2014-01-01 00:00:00     0.091408
2014-01-01 01:00:00    11.500000
2014-01-01 02:00:00     0.740402
2014-01-01 03:00:00     0.840951
2014-01-01 04:00:00     0.916335
                         ...    
2014-01-30 13:00:00     0.027809
2014-01-30 14:00:00     0.590416
2014-01-30 17:00:00     0.023267
2014-01-30 21:00:00     0.035663
2014-01-30 23:00:00    13.800000
Length: 550, dtype: float64

And since I couldn't find anywhere else to put it, you can control the axis the operation is aligned along as well.

In [49]:
hourly.div(sped, axis='index')
Out[49]:
tmpf relh sped mslp p01i vsby gust_mph
date
2014-01-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN
2014-01-01 01:00:00 0.951304 6.329565 1.0 89.165217 0.0 0.869565 NaN
2014-01-01 02:00:00 1.350617 8.986420 1.0 126.580247 0.0 1.234568 NaN
2014-01-01 03:00:00 1.189130 7.911957 1.0 111.445652 0.0 1.086957 NaN
2014-01-01 04:00:00 1.091304 7.901087 1.0 111.380435 0.0 1.086957 NaN
... ... ... ... ... ... ... ...
2014-01-30 19:00:00 NaN NaN NaN NaN NaN NaN NaN
2014-01-30 20:00:00 NaN NaN NaN NaN NaN NaN NaN
2014-01-30 21:00:00 NaN NaN NaN NaN NaN NaN NaN
2014-01-30 22:00:00 NaN NaN NaN NaN NaN NaN NaN
2014-01-30 23:00:00 1.588406 4.502174 1.0 73.434783 0.0 0.724638 NaN

720 rows × 7 columns

The non row-labeled version of this is messy.

In [50]:
temp2 = temp.reset_index()
sped2 = sped.reset_index()

# Find rows where the operation is defined
common_dates = pd.Index(temp2.date) & sped2.date
pd.concat([
    # concat to not lose date information
    sped2.loc[sped2['date'].isin(common_dates), 'date'],
    (sped2.loc[sped2.date.isin(common_dates), 'sped'] /
     temp2.loc[temp2.date.isin(common_dates), 'tmpf'])],
    axis=1).dropna(how='all')
Out[50]:
date 0
1 2014-01-01 02:00:00 0.740402
2 2014-01-01 03:00:00 0.840951
3 2014-01-01 04:00:00 0.916335
4 2014-01-01 05:00:00 1.035857
8 2014-01-01 13:00:00 NaN
... ... ...
351 2014-01-29 23:00:00 0.541495
354 2014-01-30 05:00:00 0.493440
356 2014-01-30 09:00:00 NaN
357 2014-01-30 10:00:00 0.624643
358 2014-01-30 14:00:00 NaN

170 rows × 2 columns

And we have a bug in there. Can you spot it? I only grabbed the dates from sped2 in the line sped2.loc[sped2['date'].isin(common_dates), 'date']. Really that should be sped2.loc[sped2.date.isin(common_dates)] | temp2.loc[temp2.date.isin(common_dates)]. But I think leaving the buggy version states my case even more strongly. The temp / sped version where pandas aligns everything is better.
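
For completeness, one way to patch the manual version up is to line the rows up explicitly with an inner merge before dividing. Still clunkier than letting the index do the work, but at least correct (a sketch using the sped2 and temp2 frames from above):

# Merge on the date column so both sides are guaranteed to refer to the
# same observations, then divide.
both = pd.merge(sped2, temp2, on='date', how='inner')
(both['sped'] / both['tmpf']).head()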

Merging

There are two ways of merging DataFrames / Series in pandas.

  1. Relational Database style with pd.merge
  2. Array style with pd.concat

Personally, I think in terms of the concat style. I learned pandas before I ever really used SQL, so it comes more naturally to me I suppose.

Concat Version

In [51]:
pd.concat([temp, sped], axis=1).head()
Out[51]:
tmpf sped
date
2014-01-01 00:00:00 10.94 NaN
2014-01-01 01:00:00 NaN 11.5
2014-01-01 02:00:00 10.94 8.1
2014-01-01 03:00:00 10.94 9.2
2014-01-01 04:00:00 10.04 9.2

The axis parameter controls how the data should be stacked: 0 for vertically, 1 for horizontally. The join parameter controls the merge behavior on the shared axis (the Index for axis=1). By default it's like a union of the two indexes, or an outer join.
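
For contrast, a quick sketch of the axis=0 case with the same two Series; keys adds an outer index level so you can still tell which Series each row came from:

# axis=0 stacks vertically into one long Series with a (key, date) MultiIndex.
pd.concat([temp, sped], axis=0, keys=['tmpf', 'sped'])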

In [52]:
pd.concat([temp, sped], axis=1, join='inner')
Out[52]:
tmpf sped
date
2014-01-01 02:00:00 10.94 8.100
2014-01-01 03:00:00 10.94 9.200
2014-01-01 04:00:00 10.04 9.200
2014-01-01 05:00:00 10.04 10.400
2014-01-01 13:00:00 8.96 13.825
... ... ...
2014-01-29 23:00:00 35.96 18.400
2014-01-30 05:00:00 33.98 17.300
2014-01-30 09:00:00 35.06 16.100
2014-01-30 10:00:00 35.06 21.900
2014-01-30 14:00:00 35.06 20.700

170 rows × 2 columns

Merge Version

Since we're joining by index here the merge version is quite similar. We'll see an example later of a one-to-many join where the two differ.

In [53]:
pd.merge(temp.to_frame(), sped.to_frame(), left_index=True, right_index=True).head()
Out[53]:
tmpf sped
date
2014-01-01 02:00:00 10.94 8.100
2014-01-01 03:00:00 10.94 9.200
2014-01-01 04:00:00 10.04 9.200
2014-01-01 05:00:00 10.04 10.400
2014-01-01 13:00:00 8.96 13.825
In [54]:
pd.merge(temp.to_frame(), sped.to_frame(), left_index=True, right_index=True,
         how='outer').head()
Out[54]:
tmpf sped
date
2014-01-01 00:00:00 10.94 NaN
2014-01-01 01:00:00 NaN 11.5
2014-01-01 02:00:00 10.94 8.1
2014-01-01 03:00:00 10.94 9.2
2014-01-01 04:00:00 10.04 9.2

Like I said, I typically prefer concat to merge. The exception here is one-to-many type joins. Let's walk through one of those, where we join the flight data to the weather data. To focus just on the merge, we'll aggregate the hourly weather data to daily, rather than trying to find the closest recorded weather observation to each departure (you could do that, but it's not the focus right now). We'll then join the one (airport, date) record to the many (airport, date, flight) records.

Quick tangent: to get the weather data to daily frequency, we'll need to resample (more on that in the timeseries section). The resample essentially splits the recorded values into daily buckets and computes the aggregation function on each bucket. The only wrinkle is that we have to resample by station, so we'll use the pd.TimeGrouper helper.

In [56]:
idx_cols = ['unique_carrier', 'origin', 'dest', 'tail_num', 'fl_num', 'fl_date']
data_cols = ['crs_dep_time', 'dep_delay', 'crs_arr_time', 'arr_delay',
             'taxi_out', 'taxi_in', 'wheels_off', 'wheels_on']

df = flights.set_index(idx_cols)[data_cols].sort_index()
In [57]:
def mode(x):
    '''
    Arbitrarily break ties.
    '''
    return x.value_counts().index[0]

aggfuncs = {'tmpf': 'mean', 'relh': 'mean',
            'sped': 'mean', 'mslp': 'mean',
            'p01i': 'mean', 'vsby': 'mean',
            'gust_mph': 'mean', 'skyc1': mode,
            'skyc2': mode, 'skyc3': mode}
# TimeGrouper works on a DatetimeIndex, so we move `station` to the
# columns and then groupby it as well.
daily = (weather.reset_index(level="station")
                .groupby([pd.TimeGrouper('1d'), "station"])
                .agg(aggfuncs))

daily.head()
Out[57]:
tmpf relh sped mslp p01i vsby gust_mph skyc1 skyc2 skyc3
date station
2014-01-01 01M 35.747500 81.117917 2.294444 NaN 0.0 9.229167 NaN CLR M M
04V 18.350000 72.697778 11.250000 NaN 0.0 9.861111 31.603571 CLR M M
04W -9.075000 69.908056 3.647222 NaN 0.0 10.000000 NaN OVC M M
05U 26.321127 71.519859 3.829577 NaN 0.0 9.929577 NaN CLR M M
06D -11.388060 73.784179 5.359722 NaN 0.0 9.576389 NaN CLR M M

Now that we have daily flight and weather data, we can merge. We'll use the on keyword to indicate the columns we'll merge on (this is like a USING (...) SQL clause); we just have to make sure the names align.

In [64]:
flights.fl_date.unique()
Out[64]:
array(['2017-01-01T00:00:00.000000000', '2017-01-02T00:00:00.000000000',
       '2017-01-03T00:00:00.000000000', '2017-01-04T00:00:00.000000000',
       '2017-01-05T00:00:00.000000000', '2017-01-06T00:00:00.000000000',
       '2017-01-07T00:00:00.000000000', '2017-01-08T00:00:00.000000000',
       '2017-01-09T00:00:00.000000000', '2017-01-10T00:00:00.000000000',
       '2017-01-11T00:00:00.000000000', '2017-01-12T00:00:00.000000000',
       '2017-01-13T00:00:00.000000000', '2017-01-14T00:00:00.000000000',
       '2017-01-15T00:00:00.000000000', '2017-01-16T00:00:00.000000000',
       '2017-01-17T00:00:00.000000000', '2017-01-18T00:00:00.000000000',
       '2017-01-19T00:00:00.000000000', '2017-01-20T00:00:00.000000000',
       '2017-01-21T00:00:00.000000000', '2017-01-22T00:00:00.000000000',
       '2017-01-23T00:00:00.000000000', '2017-01-24T00:00:00.000000000',
       '2017-01-25T00:00:00.000000000', '2017-01-26T00:00:00.000000000',
       '2017-01-27T00:00:00.000000000', '2017-01-28T00:00:00.000000000',
       '2017-01-29T00:00:00.000000000', '2017-01-30T00:00:00.000000000',
       '2017-01-31T00:00:00.000000000'], dtype='datetime64[ns]')
In [65]:
flights.origin.unique()
Out[65]:
[ORD, LAS, DCA, TPA, PHL, ..., YAK, ELM, MOT, LSE, TKI]
Length: 298
Categories (298, object): [ORD, LAS, DCA, TPA, ..., ELM, MOT, LSE, TKI]
In [66]:
daily.reset_index().date.unique()
Out[66]:
array(['2014-01-01T00:00:00.000000000', '2014-01-02T00:00:00.000000000',
       '2014-01-03T00:00:00.000000000', '2014-01-04T00:00:00.000000000',
       '2014-01-05T00:00:00.000000000', '2014-01-06T00:00:00.000000000',
       '2014-01-07T00:00:00.000000000', '2014-01-08T00:00:00.000000000',
       '2014-01-09T00:00:00.000000000', '2014-01-10T00:00:00.000000000',
       '2014-01-11T00:00:00.000000000', '2014-01-12T00:00:00.000000000',
       '2014-01-13T00:00:00.000000000', '2014-01-14T00:00:00.000000000',
       '2014-01-15T00:00:00.000000000', '2014-01-16T00:00:00.000000000',
       '2014-01-17T00:00:00.000000000', '2014-01-18T00:00:00.000000000',
       '2014-01-19T00:00:00.000000000', '2014-01-20T00:00:00.000000000',
       '2014-01-21T00:00:00.000000000', '2014-01-22T00:00:00.000000000',
       '2014-01-23T00:00:00.000000000', '2014-01-24T00:00:00.000000000',
       '2014-01-25T00:00:00.000000000', '2014-01-26T00:00:00.000000000',
       '2014-01-27T00:00:00.000000000', '2014-01-28T00:00:00.000000000',
       '2014-01-29T00:00:00.000000000', '2014-01-30T00:00:00.000000000'], dtype='datetime64[ns]')
In [67]:
daily.reset_index().station.unique()
Out[67]:
array(['01M', '04V', '04W', ..., 'NR02', '0CO', 'AXV'], dtype=object)

The merge version

In [71]:
daily_ = (
    daily
    .reset_index()
    .rename(columns={'date': 'fl_date', 'station': 'origin'})
    .assign(origin=lambda x: pd.Categorical(x.origin,
                                            categories=flights.origin.cat.categories))
)
In [72]:
m = pd.merge(flights, daily_,
             on=['fl_date', 'origin']).set_index(idx_cols).sort_index()

m.head()
Out[72]:
airline_id origin_airport_id origin_airport_seq_id origin_city_market_id origin_city_name dest_airport_id dest_airport_seq_id dest_city_market_id dest_city_name crs_dep_time ... tmpf relh sped mslp p01i vsby gust_mph skyc1 skyc2 skyc3
unique_carrier origin dest tail_num fl_num fl_date
OO YUM PHX N423SW 3068 2017-01-31 20304 16218 1621801 33785 Yuma 14107 1410702 30466 Phoenix 2017-01-31 11:30:00 ... 35.747500 81.117917 2.294444 NaN 0.0 9.229167 NaN CLR M M
2017-01-31 20304 16218 1621801 33785 Yuma 14107 1410702 30466 Phoenix 2017-01-31 11:30:00 ... 18.350000 72.697778 11.250000 NaN 0.0 9.861111 31.603571 CLR M M
2017-01-31 20304 16218 1621801 33785 Yuma 14107 1410702 30466 Phoenix 2017-01-31 11:30:00 ... -9.075000 69.908056 3.647222 NaN 0.0 10.000000 NaN OVC M M
2017-01-31 20304 16218 1621801 33785 Yuma 14107 1410702 30466 Phoenix 2017-01-31 11:30:00 ... 26.321127 71.519859 3.829577 NaN 0.0 9.929577 NaN CLR M M
2017-01-31 20304 16218 1621801 33785 Yuma 14107 1410702 30466 Phoenix 2017-01-31 11:30:00 ... -11.388060 73.784179 5.359722 NaN 0.0 9.576389 NaN CLR M M

5 rows × 37 columns

Since data-wrangling on its own is never the goal, let's do some quick analysis. Seaborn makes it easy to explore bivariate relationships.

Looking at the various sky coverage states:

In [76]:
m.groupby('skyc1').dep_delay.agg(['mean', 'count']).sort_values(by='mean')
Out[76]:
mean count
skyc1
BKN -8.0 276
CLR -8.0 3033
FEW -8.0 258
M -8.0 180
OVC -8.0 1545
SCT -8.0 246
VV -8.0 57
In [77]:
import statsmodels.api as sm
/Users/taugspurger/miniconda3/envs/modern-pandas/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Statsmodels (via patsy) can automatically convert categorical data to dummy variables in a formula with the C function.

In [78]:
mod = sm.OLS.from_formula('dep_delay ~ C(skyc1) + tmpf + relh + sped + mslp', data=m)
res = mod.fit()
res.summary()
Out[78]:
OLS Regression Results
Dep. Variable: dep_delay R-squared: 0.000
Model: OLS Adj. R-squared: -0.004
Method: Least Squares F-statistic: 4.527e-14
Date: Fri, 01 Sep 2017 Prob (F-statistic): 1.00
Time: 17:21:21 Log-Likelihood: -4390.8
No. Observations: 2487 AIC: 8804.
Df Residuals: 2476 BIC: 8868.
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -8.0000 3.438 -2.327 0.020 -14.742 -1.258
C(skyc1)[T.CLR] 2.782e-15 0.122 2.28e-14 1.000 -0.240 0.240
C(skyc1)[T.FEW] 1.943e-16 0.166 1.17e-15 1.000 -0.326 0.326
C(skyc1)[T.M] 3.166e-15 0.194 1.63e-14 1.000 -0.381 0.381
C(skyc1)[T.OVC] 3.383e-16 0.124 2.73e-15 1.000 -0.243 0.243
C(skyc1)[T.SCT] -2.024e-14 0.284 -7.12e-14 1.000 -0.557 0.557
C(skyc1)[T.VV ] 3.973e-15 0.224 1.77e-14 1.000 -0.440 0.440
tmpf -3.123e-17 0.002 -1.77e-14 1.000 -0.003 0.003
relh -3.144e-17 0.002 -1.41e-14 1.000 -0.004 0.004
sped -9.498e-17 0.007 -1.38e-14 1.000 -0.013 0.013
mslp -2.251e-16 0.003 -6.76e-14 1.000 -0.007 0.007
Omnibus: 170.859 Durbin-Watson: 0.002
Prob(Omnibus): 0.000 Jarque-Bera (JB): 440.406
Skew: -0.707 Prob(JB): 2.33e-96
Kurtosis: 1.500 Cond. No. 1.24e+05
In [79]:
fig, ax = plt.subplots()
ax.scatter(res.fittedvalues, res.resid, color='k', marker='.', alpha=.25)
ax.set(xlabel='Predicted', ylabel='Residual')
sns.despine()

Those residuals should look like white noise. Looks like our linear model isn't flexible enough to model the delays, but I think that's enough for now.


We'll talk more about indexes in the Tidy Data and Reshaping section. Let me know if you have any feedback. Thanks for reading!