Debugging my thesis

Fri Dec 25 2020

tags: public thesis programming debugging

Introduction

Since my undergraduate thesis was quite well-received, I thought it would be a good idea to publish it. To that end, Gabe and I have been getting it ready for publication. As a replication check, both of us ran the code on our machines... both of our laptops produced the same data, but it differed from the existing data generated by my old laptop.

In my thesis I generate 100,000 hypothetical districting plans, calculate four different compactness metrics (Polsby-Popper, Reock, Convex Hull, and my own Human Compactness) and see how they correlate with a measure of district homogeneity (Stephanopoulos's "spatial diversity" metric). On our replication runs we observed the following:

  1. Spatial diversity metric unchanged
  2. Polsby-Popper, Reock, and Convex Hull changed very slightly (around a tenth of the values differed, in the 11th or 12th decimal place)
  3. Big changes in the Human Compactness metric (from 0.7 to 0.9, that sort of thing)

This was pretty bad, and it threatened to derail my research completely.

What wasn't the bug?

What was the bug?

Because I didn't freeze the environment, we had to install the newest versions of the libraries.

There is one line of code in sample_rvps:

all_rvps = all_rvps.to_crs("EPSG:4326")

which converts the initial shapefile to a new coordinate reference system (CRS).

For some reason, the to_crs function in the new setup gives very slightly different values, differing in the 11th or 12th decimal place. You can see it in the snippet below:

# gdf is a GeoDataFrame of points, loaded earlier in the script
gdf = gdf.to_crs({"init": "epsg:4326"})
print(str(gdf["geometry"][0]))  # on old machine: POINT (-73.59250979083568 41.01678126113871)
# This assert passes on my new machine but fails on my old one
assert str(gdf["geometry"][0]) == "POINT (-73.59250974534255 41.01678109722333)"
print("All asserts passed!")

This alone wouldn't cause much of a discrepancy, but it just so happens that this tiny change moves a point that was originally outside the state boundary to inside it. This causes the points sampled by sample_rvps to change completely, even with the same random seed, because every point after the new one is shifted down by one index. And because the points have all changed, all the precalculated point-kNN values are wrong (they were computed for different points), which leads to completely incorrect results.
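To see why one extra point scrambles the whole sample, here is a minimal sketch (toy lists, not the thesis data) of how a one-element shift interacts with a fixed random seed:

import random

# Toy stand-ins for the reference voter points; in the real pipeline, one
# extra point near the state boundary now survives the containment check.
points_old = ["p0", "p1", "p2", "p3", "p4", "p5"]
points_new = ["p_extra"] + points_old

random.seed(42)
sample_old = random.sample(points_old, 3)
random.seed(42)
sample_new = random.sample(points_new, 3)

# Same seed, different population: the two samples disagree, so anything
# precomputed against the old points (like the point-kNN cache) is stale.
print(sample_old)
print(sample_new)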

The other metrics all changed by a very, very tiny amount. All of them are computed as updaters on the GeographicPartition:

initial_partition = GeographicPartition(
    graph,
    assignment="New_Seed",
    updaters={
        "cut_edges": cut_edges,
        "population": Tally("population", alias="population"),
        "spatial_diversity": spatial_diversity.calc_spatial_diversity,
        "human_compactness": partial(
            hc_utils.calculate_human_compactness,
            DURATION_DICT,
            tract_dict,
            duration_matrix,
        ),
        "polsby_compactness": polsby_popper,
        "reock_compactness": partial(reock.reock, state_shp),
    },
)

We know the EPSG conversion is very slightly different in the new environment. But there is no EPSG conversion in the Reock function!

My best guess (though I haven't verified this) is that there is an upstream change in how coordinates are calculated. The Reock function calls the _generate_external_points subroutine to find all the exterior points of the Census tracts so that it can draw the circumscribing circle of the district:

def _generate_external_points(row, geoid_to_id_mapping):
    '''
    For each Census Tract, generate the exterior points of that Tract.
    '''
    exterior_points = []

    if row['geometry'].geom_type == 'MultiPolygon':
        # .geoms works on both Shapely 1.x and 2.x; iterating the
        # MultiPolygon directly was removed in Shapely 2.0
        for polygon in row['geometry'].geoms:
            exterior_points += list(polygon.exterior.coords)
    else:
        assert row['geometry'].geom_type == 'Polygon'
        polygon = row['geometry']
        exterior_points += list(polygon.exterior.coords)

    return exterior_points

I believe the difference might be in polygon.exterior.coords. Perhaps something in how coordinates are read from the shapefile has changed. If the exterior points' coordinates are very slightly different, this will affect the Reock score.
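For intuition, here is a minimal sketch of a Reock-style calculation (my own illustration, assuming Shapely >= 2.0 for minimum_bounding_circle; the thesis code is structured differently). The smallest enclosing circle is determined entirely by the boundary coordinates, so a shift in the 11th decimal place of any exterior point nudges the score by a correspondingly tiny amount:

from shapely import Polygon, minimum_bounding_circle

def reock_score(district_polygon):
    # Reock = district area divided by the area of the smallest
    # circle that circumscribes the district
    circle = minimum_bounding_circle(district_polygon)
    return district_polygon.area / circle.area

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
print(reock_score(square))  # roughly 2/pi ~ 0.64 for a square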

What causes Polsby-Popper to change?

TL;DR: I have no idea.

I dug into the Gerrychain source code to look at the Polsby-Popper function:

import math

def compute_polsby_popper(area, perimeter):
    try:
        return 4 * math.pi * area / perimeter ** 2
    except ZeroDivisionError:
        return math.nan


def polsby_popper(partition):
    """
    Computes Polsby-Popper compactness scores for each district in the partition.
    """
    return {
        part: compute_polsby_popper(
            partition["area"][part], partition["perimeter"][part]
        )
        for part in partition.parts
    }

It gets each district's area and perimeter from the Partition object, which takes in a graph, an assignment, and a dictionary of updaters.
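The formula itself is pure arithmetic, so any drift has to come from the area and perimeter inputs. Here is a back-of-the-envelope sketch (made-up numbers, not thesis data) showing that a perturbation in a late decimal place of the perimeter only moves the score by a comparably tiny relative amount:

import math

area = 2.5e9       # hypothetical district area in m^2
perimeter = 4.2e5  # hypothetical district perimeter in m

pp_old = 4 * math.pi * area / perimeter ** 2
pp_new = 4 * math.pi * area / (perimeter * (1 + 1e-12)) ** 2

# The relative change in the score is about twice the relative change in
# the perimeter: tiny in, tiny out
print(pp_old, pp_new, abs(pp_old - pp_new) / pp_old)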

In 10_Process_Tracts.py we have the following code:

state_graph = Graph.from_geodataframe(state_tracts, ignore_errors=True)
state_graph.add_data(state_tracts)
state_tracts.to_file(f"./Data/Shapefiles/Tract2000_{fips}.shp")
state_graph.to_json(f"./Data/Dual_Graphs/Tract2000_{fips}.json")

Looking at the GerryChain docs for from_geodataframe:

Creates the adjacency Graph of geometries described by dataframe. The areas of the polygons are included as node attributes (with key area). The shared perimeter of neighboring polygons are included as edge attributes (with key shared_perim). Nodes corresponding to polygons on the boundary of the union of all the geometries (e.g., the state, if your dataframe describes VTDs) have a boundary_node attribute (set to True) and a boundary_perim attribute with the length of this “exterior” boundary. By default, areas and lengths are computed in a UTM projection suitable for the geometries. This prevents the bizarro area and perimeter values that show up when you accidentally do computations in Longitude-Latitude coordinates. If the user specifies reproject=False, then the areas and lengths will be computed in the GeoDataFrame’s current coordinate reference system. This option is for users who have a preferred CRS they would like to use.

So the areas and perimeters of the polygons are calculated once, when the dual graph is built, and saved into the JSON. As far as I can tell, they are not recalculated when the graph is loaded, and there doesn't seem to be a reprojection either. Why, then, does the Polsby-Popper metric still give a different result?
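One way to narrow this down (a sketch of what I would try, assuming the dual-graph JSON written by 10_Process_Tracts.py and a fips variable in scope; the area and shared_perim keys come from the docs quoted above) is to diff the stored values directly between the two environments:

from gerrychain import Graph

graph = Graph.from_json(f"./Data/Dual_Graphs/Tract2000_{fips}.json")

# Dump the per-node areas and per-edge shared perimeters; comparing these
# dumps across the old and new machines would show whether the drift is
# already baked into the saved graph
with open("graph_measurements.txt", "w") as f:
    for node, data in graph.nodes(data=True):
        f.write(f"{node}\t{data['area']}\n")
    for u, v, data in graph.edges(data=True):
        f.write(f"{u}\t{v}\t{data['shared_perim']}\n")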

Questions I still have

Why does to_crs give a very slightly different result? I created an issue with a minimum reproducible example and posted it on the geopandas repo.

I am almost certain it is not the fault of geopandas itself but rather some upstream dependency. That would also explain why Reock and Polsby-Popper are affected.
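If I had to bet, I would look at pyproj (the PROJ bindings that geopandas delegates to_crs to) and the GEOS library underneath Shapely. A first step, just a sketch for comparing the two machines, is to record the exact versions in each environment:

import geopandas
import pyproj
import shapely

# Comparing this output across the old and new machines shows which layer
# of the stack actually changed
print("geopandas:", geopandas.__version__)
print("pyproj:   ", pyproj.__version__)
print("PROJ:     ", pyproj.proj_version_str)
print("shapely:  ", shapely.__version__)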

What's next

Things I learned

Bit-for-bit replication requires a frozen environment: something like conda list -e > requirements.txt, checked in alongside the code. Not pinning dependencies with conda list -e or pip freeze or the like is a mistake I won't make again.

Conclusion