Introduction to Spatial Data Analysis in Python

In [1]:
import os, sys, glob
from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

# vector / visualisation packages
import geopandas as gpd
import geoplot as gplt
import mapclassify as mc
import geoplot.crs as gcrs
from earthpy import clip as cl

# raster packages
import rasterio as rio
import georasters as gr
from rasterstats import zonal_stats

# spatial econometrics 
import pysal as ps
import esda
import libpysal as lps
from pysal.lib.weights.weights import W


from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
/home/alal/anaconda3/envs/gds/lib/python3.7/site-packages/pysal/explore/segregation/network/network.py:16: UserWarning: You need pandana and urbanaccess to work with segregation's network module
You can install them with  `pip install urbanaccess pandana` or `conda install -c udst pandana urbanaccess`
  "You need pandana and urbanaccess to work with segregation's network module\n"
/home/alal/anaconda3/envs/gds/lib/python3.7/site-packages/pysal/model/spvcm/abstracts.py:10: UserWarning: The `dill` module is required to use the sqlite backend fully.
  from .sqlite import head_to_sql, start_sql

Vector data

In [2]:
cities = gpd.read_file("https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_populated_places.zip")
In [3]:
countries = gpd.read_file('https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip')
In [4]:
countries.head()
Out[4]:
featurecla scalerank LABELRANK SOVEREIGNT SOV_A3 ADM0_DIF LEVEL TYPE ADMIN ADM0_A3 ... NAME_KO NAME_NL NAME_PL NAME_PT NAME_RU NAME_SV NAME_TR NAME_VI NAME_ZH geometry
0 Admin-0 country 5 2 Indonesia IDN 0 2 Sovereign country Indonesia IDN ... 인도네시아 Indonesië Indonezja Indonésia Индонезия Indonesien Endonezya Indonesia 印度尼西亚 MULTIPOLYGON (((117.70361 4.16341, 117.70361 4...
1 Admin-0 country 5 3 Malaysia MYS 0 2 Sovereign country Malaysia MYS ... 말레이시아 Maleisië Malezja Malásia Малайзия Malaysia Malezya Malaysia 马来西亚 MULTIPOLYGON (((117.70361 4.16341, 117.69711 4...
2 Admin-0 country 6 2 Chile CHL 0 2 Sovereign country Chile CHL ... 칠레 Chili Chile Chile Чили Chile Şili Chile 智利 MULTIPOLYGON (((-69.51009 -17.50659, -69.50611...
3 Admin-0 country 0 3 Bolivia BOL 0 2 Sovereign country Bolivia BOL ... 볼리비아 Bolivia Boliwia Bolívia Боливия Bolivia Bolivya Bolivia 玻利維亞 POLYGON ((-69.51009 -17.50659, -69.51009 -17.5...
4 Admin-0 country 0 2 Peru PER 0 2 Sovereign country Peru PER ... 페루 Peru Peru Peru Перу Peru Peru Peru 秘鲁 MULTIPOLYGON (((-69.51009 -17.50659, -69.63832...

5 rows × 95 columns

In [5]:
cities.head()
Out[5]:
SCALERANK NATSCALE LABELRANK FEATURECLA NAME NAMEPAR NAMEALT DIFFASCII NAMEASCII ADM0CAP ... name_ja name_ko name_nl name_pl name_sv name_tr name_vi wdid_score ne_id geometry
0 8 10 3 Admin-0 capital Vatican City None None 0 Vatican City 1.0 ... バチカン 바티칸 시국 Vaticaanstad Watykan Vatikanstaten Vatikan Thành Vatican 4 1159127243 POINT (12.45339 41.90328)
1 7 20 0 Admin-0 capital San Marino None None 0 San Marino 1.0 ... サンマリノ市 산마리노 San Marino San Marino San Marino San Marino Thành phố San Marino 4 1159146051 POINT (12.44177 43.93610)
2 7 20 0 Admin-0 capital Vaduz None None 0 Vaduz 1.0 ... ファドゥーツ 파두츠 Vaduz Vaduz Vaduz Vaduz Vaduz 4 1159146061 POINT (9.51667 47.13372)
3 6 30 8 Admin-0 capital alt Lobamba None None 0 Lobamba 0.0 ... ロバンバ 로밤바 Lobamba Lobamba Lobamba Lobamba Lobamba 4 1159146343 POINT (31.20000 -26.46667)
4 6 30 8 Admin-0 capital Luxembourg None None 0 Luxembourg 1.0 ... ルクセンブルク市 룩셈부르크 Luxemburg Luksemburg Luxemburg Lüksemburg Luxembourg 4 1159146437 POINT (6.13000 49.61166)

5 rows × 120 columns

Constructing Geodataframe from lon-lat

In [6]:
df = pd.DataFrame({'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
     'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
     'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})
lat_am_capitals = gpd.GeoDataFrame(df, geometry = gpd.points_from_xy(df.Longitude, df.Latitude))

Making Maps

In [7]:
f, ax = plt.subplots(dpi = 200)
countries.plot(edgecolor = 'k', facecolor = 'None', linewidth = 0.6, ax = ax)
cities.plot(markersize = 0.5, facecolor = 'red', ax = ax)
lat_am_capitals.plot(markersize = 0.5, facecolor = 'y', ax = ax)
ax.set_title('World Map')
ax.set_axis_off()
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fc6154b00>
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fc6154b00>
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fc6154b00>
Out[7]:
Text(0.5, 1, 'World Map')

Static Webmaps

In [8]:
ax = gplt.webmap(countries, projection=gplt.crs.WebMercator(), figsize = (16, 12))
gplt.pointplot(cities, ax=ax, hue = 'POP2015')
/home/alal/anaconda3/envs/gds/lib/python3.7/site-packages/geoplot/geoplot.py:685: UserWarning: Cound not set plot extent successfully due to numerical instability. Try setting extent manually. Defaulting to a global extent.
  'Cound not set plot extent successfully due to numerical instability. '
Out[8]:
<cartopy.mpl.geoaxes.GeoAxesSubplot at 0x7f8fb1447198>

Aside on Projections

Map projections flatten a globe's surface onto a 2D plane. This necessarily distorts the surface (one of Gauss' lesser known results), so one must choose specific form of 'acceptable' distortion.

By convention, the standard projection in GIS is World Geodesic System(lat/lon - WGS84). This is a cylindrical projection, which stretches distances east-west and results in incorrect distance and areal calculations. For accurate distance and area calculations, try to use UTM (which divides map into zones). See epsg.io

In [9]:
countries.crs
Out[9]:
{'init': 'epsg:4326'}
In [10]:
countries_2 = countries.copy()
countries_2 = countries_2.to_crs({'init': 'epsg:3035'})
In [11]:
f, ax = plt.subplots(dpi = 200)
countries_2.plot(edgecolor = 'k', facecolor = 'None', linewidth = 0.6, ax = ax)
ax.set_title('World Map - \n Lambert Azimuthal Equal Area')
ax.set_axis_off()
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb13810f0>
Out[11]:
Text(0.5, 1, 'World Map - \n Lambert Azimuthal Equal Area')

Choropleths

Maps with color-coding based on value in table

  • scheme=None—A continuous colormap.
  • scheme=”Quantiles”—Bins the data such that the bins contain equal numbers of samples.
  • scheme=”EqualInterval”—Bins the data such that bins are of equal length.
  • scheme=”FisherJenks”—Bins the data using the Fisher natural breaks optimization procedure.

(Example from geoplots gallery)

In [12]:
cali = gpd.read_file(gplt.datasets.get_path('california_congressional_districts'))
cali['area'] =cali.geometry.area

proj=gcrs.AlbersEqualArea(central_latitude=37.16611, central_longitude=-119.44944)
fig, axarr = plt.subplots(2, 2, figsize=(12, 12), subplot_kw={'projection': proj})

gplt.choropleth(
    cali, hue='area', linewidth=0, scheme=None, ax=axarr[0][0]
)
axarr[0][0].set_title('scheme=None', fontsize=18)

scheme = mc.Quantiles(cali.area, k=5)
gplt.choropleth(
    cali, hue='area', linewidth=0, scheme=scheme, ax=axarr[0][1]
)
axarr[0][1].set_title('scheme="Quantiles"', fontsize=18)

scheme = mc.EqualInterval(cali.area, k=5)
gplt.choropleth(
    cali, hue='area', linewidth=0, scheme=scheme, ax=axarr[1][0]
)
axarr[1][0].set_title('scheme="EqualInterval"', fontsize=18)

scheme = mc.FisherJenks(cali.area, k=5)
gplt.choropleth(
    cali, hue='area', linewidth=0, scheme=scheme, ax=axarr[1][1]
)
axarr[1][1].set_title('scheme="FisherJenks"', fontsize=18)

plt.subplots_adjust(top=0.92)
plt.suptitle('California State Districts by Area, 2010', fontsize=18)
Out[12]:
<cartopy.mpl.geoaxes.GeoAxesSubplot at 0x7f8fb1343b00>
Out[12]:
Text(0.5, 1.0, 'scheme=None')
Out[12]:
<cartopy.mpl.geoaxes.GeoAxesSubplot at 0x7f8fb136c160>
Out[12]:
Text(0.5, 1.0, 'scheme="Quantiles"')
Out[12]:
<cartopy.mpl.geoaxes.GeoAxesSubplot at 0x7f8fb1317940>
Out[12]:
Text(0.5, 1.0, 'scheme="EqualInterval"')
Out[12]:
<cartopy.mpl.geoaxes.GeoAxesSubplot at 0x7f8fb133b5c0>
Out[12]:
Text(0.5, 1.0, 'scheme="FisherJenks"')
Out[12]:
Text(0.5, 0.98, 'California State Districts by Area, 2010')

Spatial Merge

Subset to Africa

In [13]:
afr = countries.loc[countries.CONTINENT == 'Africa']
afr.plot()
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb1203390>

Subset cities by merging with African boundaries

In [14]:
afr_cities = gpd.sjoin(cities, afr, how='inner')
In [15]:
ax = gplt.webmap(afr, projection=gplt.crs.WebMercator(), figsize = (10, 14))
gplt.pointplot(afr_cities, ax=ax, hue = 'NAME_EN')
Out[15]:
<cartopy.mpl.geoaxes.GeoAxesSubplot at 0x7f8fb10ce780>
In [16]:
afr_cities.head()
Out[16]:
SCALERANK NATSCALE LABELRANK_left FEATURECLA NAME_left NAMEPAR NAMEALT DIFFASCII NAMEASCII ADM0CAP ... NAME_JA NAME_KO NAME_NL NAME_PL NAME_PT NAME_RU NAME_SV NAME_TR NAME_VI NAME_ZH
3 6 30 8 Admin-0 capital alt Lobamba None None 0 Lobamba 0.0 ... スワジランド 스와질란드 Swaziland Suazi eSwatini Свазиленд Swaziland Svaziland Swaziland 斯威士兰
16 4 50 8 Admin-0 capital Mbabane None None 0 Mbabane 1.0 ... スワジランド 스와질란드 Swaziland Suazi eSwatini Свазиленд Swaziland Svaziland Swaziland 斯威士兰
9 6 30 0 Admin-0 capital alt Bir Lehlou None None 0 Bir Lehlou 0.0 ... 西サハラ 서사하라 Westelijke Sahara Sahara Zachodnia Saara Ocidental Западная Сахара Västsahara Batı Sahra Tây Sahara 西撒哈拉
12 6 30 0 Admin-0 capital Moroni None None 0 Moroni 1.0 ... コモロ 코모로 Comoren Komory Comores Коморы Komorerna Komorlar Comoros 葛摩
15 4 50 0 Admin-0 capital Kigali None None 0 Kigali 1.0 ... ルワンダ 르완다 Rwanda Rwanda Ruanda Руанда Rwanda Ruanda Rwanda 卢旺达

5 rows × 215 columns

Distance Calculations

In [17]:
rivers = gpd.read_file('https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/50m/physical/ne_50m_rivers_lake_centerlines.zip')
In [18]:
rivers.geometry.head()
Out[18]:
0    LINESTRING (51.93713 55.70107, 51.88087 55.686...
1    LINESTRING (53.69385 58.20632, 53.67715 58.273...
2    LINESTRING (37.11301 11.85499, 37.15037 11.893...
3    LINESTRING (38.56119 35.86264, 38.36534 35.903...
4    MULTILINESTRING ((-86.52177 33.03212, -86.5209...
Name: geometry, dtype: geometry
In [19]:
rivers.plot()
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb10c1668>
In [20]:
def min_distance(point, lines = rivers):
    return lines.distance(point).min()

afr_cities['min_dist_to_rivers'] = afr_cities.geometry.apply(min_distance)
In [21]:
f, ax = plt.subplots(dpi = 200)
afr.plot(edgecolor = 'k', facecolor = 'None', linewidth = 0.6, ax = ax)
rivers.plot(ax = ax, linewidth = 0.5)
afr_cities.plot(column = 'min_dist_to_rivers', markersize = 0.9, ax = ax)
ax.set_ylim(-35, 50)
ax.set_xlim(-20, 55)
ax.set_title('Cities by shortest distance to Rivers')
ax.set_axis_off()
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb1044a90>
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb1044a90>
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb1044a90>
Out[21]:
(-35, 50)
Out[21]:
(-20, 55)
Out[21]:
Text(0.5, 1, 'Cities by shortest distance to Rivers')

Buffers

In [22]:
afr_cities_buf = afr_cities.buffer(1)
In [23]:
f, ax = plt.subplots(dpi = 200)
afr.plot(facecolor = 'None', edgecolor = 'k', linewidth = 0.1, ax = ax)
afr_cities_buf.plot(ax=ax, linewidth=0)
afr_cities.plot(ax=ax, markersize=.2, color='yellow')
ax.set_title('1 decimal degree buffer \n Major cities in Africa', fontsize = 12)
ax.set_axis_off()
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb0ec4390>
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb0ec4390>
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb0ec4390>
Out[23]:
Text(0.5, 1, '1 decimal degree buffer \n Major cities in Africa')

Raster Data

In [24]:
raster = 'data/res03_crav6190h_sihr_cer.tif'
In [25]:
# Get info on raster
NDV, xsize, ysize, GeoT, Projection, DataType = gr.get_geo_info(raster)

grow = gr.load_tiff(raster)
grow = gr.GeoRaster(grow, GeoT, projection = Projection)
In [26]:
f, ax = plt.subplots(1, figsize=(13, 11))
grow.plot(ax = ax, cmap = 'YlGn_r')
ax.set_title('GAEZ Crop Suitability Measures')
ax.set_axis_off()
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb09c6278>
Out[26]:
Text(0.5, 1.05, 'GAEZ Crop Suitability Measures')

Clipping Raster

In [27]:
brazil = countries.query('ADMIN == "Brazil"')
In [28]:
grow_clip = grow.clip(brazil)[0]
In [29]:
f, ax = plt.subplots(1, figsize=(13, 11))
grow_clip.plot(ax = ax, cmap = 'YlGn_r')
ax.set_title('GAEZ Crop Suitability Measures')
ax.set_axis_off()
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb0912e80>
Out[29]:
Text(0.5, 1.05, 'GAEZ Crop Suitability Measures')

Zonal Statistics

In [30]:
murdock = gpd.read_file('https://scholar.harvard.edu/files/nunn/files/murdock_shapefile.zip')
In [31]:
murdock_cs = gpd.GeoDataFrame.from_features((zonal_stats(murdock, raster, geojson_out = True)))
In [32]:
f, ax = plt.subplots(dpi = 300)
gplt.choropleth(
    murdock_cs, hue='mean', linewidth=.5, cmap='YlGn_r', ax=ax
)
ax.set_title('Crop Suitability by Homeland \n Murdock Atlas', fontsize = 12)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb086d828>
Out[32]:
Text(0.5, 1.0, 'Crop Suitability by Homeland \n Murdock Atlas')

Spatial Econometrics

Weight Matrices

In [33]:
%time w = lps.weights.Queen.from_dataframe(murdock_cs)
CPU times: user 545 ms, sys: 12.3 ms, total: 557 ms
Wall time: 561 ms
/home/alal/anaconda3/envs/gds/lib/python3.7/site-packages/libpysal/weights/weights.py:167: UserWarning: The weights matrix is not fully connected: 
 There are 65 disconnected components.
 There are 63 islands with ids: 8, 18, 21, 38, 56, 60, 67, 100, 103, 110, 127, 130, 134, 144, 153, 154, 179, 186, 197, 201, 249, 266, 267, 281, 286, 294, 298, 314, 335, 364, 391, 397, 462, 469, 480, 481, 549, 557, 562, 571, 619, 632, 634, 643, 653, 665, 667, 678, 682, 685, 691, 727, 749, 757, 781, 787, 789, 790, 791, 792, 793, 800, 832.
  warnings.warn(message)
In [34]:
w.n
w.mean_neighbors
Out[34]:
843
Out[34]:
5.029655990510083
In [35]:
ax = murdock_cs.plot(color='k', figsize=(9, 9))
murdock_cs.loc[w.islands, :].plot(color='red', ax=ax);
In [36]:
mur = murdock_cs.drop(w.islands)
In [37]:
%time w = lps.weights.Queen.from_dataframe(mur)
w.transform = 'r'
CPU times: user 375 ms, sys: 15.5 ms, total: 390 ms
Wall time: 386 ms
/home/alal/anaconda3/envs/gds/lib/python3.7/site-packages/libpysal/weights/weights.py:167: UserWarning: The weights matrix is not fully connected: 
 There are 2 disconnected components.
  warnings.warn(message)

Moran's I

Measure of spatial correlation

$$I = \frac{N}{W} \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x} ) ( x_j - \bar{x} ) }{ \sum_i (x_i - \bar{x})^2 }$$

where $N$ is the total number of units, $x$ is the variable of interest, $w_{ij}$ is the spatial weight between units $i$ and $j$, and $W$ is the sum of all weights $w_{ij}$

$I \in [-1, 1]$. Under null of no spatial correlation, $E(I) = \frac{-1}{N-1} \rightarrow 0$ with large $N$.

In [38]:
mur.shape
Out[38]:
(780, 9)
In [39]:
mi = esda.moran.Moran(mur['mean'], w)
mi.I
mi.p_sim
Out[39]:
0.7994262412596213
Out[39]:
0.001

Spatial Lag

In [40]:
mur['cs'] = (mur['mean'] - mur['mean'].mean()) / mur['mean'].std()
In [41]:
mur['lag_cs'] = lps.weights.lag_spatial(w, mur['cs'])
In [42]:
f, ax = plt.subplots(1, figsize=(9, 9))
# Plot values
sns.regplot(x='cs', y='lag_cs', data=mur, ci=None)
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
plt.text(1.75, 0.5, "HH", fontsize=25)
plt.text(1.5, -1.5, "HL", fontsize=25)
plt.text(-1, 1, "LH", fontsize=25)
plt.text(-1.0, -1.5, "LL", fontsize=25)
# Display
plt.show()
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8fb100ae10>
Out[42]:
<matplotlib.lines.Line2D at 0x7f8fb09397b8>
Out[42]:
<matplotlib.lines.Line2D at 0x7f8fa451a6a0>
Out[42]:
Text(1.75, 0.5, 'HH')
Out[42]:
Text(1.5, -1.5, 'HL')
Out[42]:
Text(-1, 1, 'LH')
Out[42]:
Text(-1.0, -1.5, 'LL')

More

  • Local Indicators of Spatial Autocorrelation (spatial clustering)
  • Conley SEs
  • Gaussian Random Fields