Skip to content

Conversation

nitinmagima
Copy link
Contributor

Code for Local Shapefile Support.

@nitinmagima nitinmagima self-assigned this Aug 14, 2025
@aaron-kaplan
Copy link
Collaborator

Thanks for removing the data files. Sorry for the confusion there.

Please put back the configs for other countries that you removed from fbfmaproom-sample.yaml. Only make changes to the djibouti section for now.

Let's keep the shapefiles alongside the data files, e.g. in /data/aaron/fbf-candidate/djibouti/shapes. We may have to deal with version updates, so let's use a subdir for each version, e.g. /data/aaron/fbf-candidate/djibouti/shapes/2025-08. If you have a version number that comes from the provider, you could use that instead of 2025-08.

Then specify the shapefile directory in the config file as a relative path, relative to the data root, just as we have done for zarr files. Remove all country-specific information from retrieve_shapes_local (it currently has data/djibouti/dji_adm_gadm_2022_shp/ hard-coded, which obviously won't work for other countries).

@nitinmagima
Copy link
Contributor Author

I've updated it based on your feedback. I've also tested it on the server and it works. Let me know if you face any issues.

Copy link
Collaborator

@aaron-kaplan aaron-kaplan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. I would like to make some changes that I feel will simplify the code and make it more maintainable. I'm entering my comments here both as notes for myself, and to explain the changes to you. I'm not asking you to make these changes--I will do it.

In addition to the line-by-line comments I made below, I have this general comment: there are now two different ways to get shape data, and the knowledge about how to query each one is spread out throughout the program. I will centralize that knowledge in two classes, one for db and one for files, and have both of them implement the same interface, so that code that uses them doesn't need to know which kind of shape source it's using.

# Build path
data_root = CONFIG["data_root"]
shapes_version = config.get("shapes_version")
shapefile_path = f"{data_root}/{country_key}/shapes/{shapes_version}/{shape_config['local_file']}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

country_key isn't actually appropriate here, because we ended up having multiple country_keys for a single country, e.g. djibouti and djibouti-mam, but there's only going to be one set of shapes per country. Let's handle shapefile paths the same way we handle zarr paths, i.e. the config file specifies a path that's relative to data_root. (That will also eliminate the shapes_version field.)

from shapely import wkb
from shapely.geometry import Polygon, Point
from shapely.geometry.multipoint import MultiPoint
import geopandas as gpd
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you looked into eliminating geopandas?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just looked into it. Shapely doesn't speak shapefiles, so we do need to add a dependency.

If we're going to use geopandas for the shapefiles, I will also use it for the db shapes. Having two different interfaces depending on the source of the shapes is an unnecessary complication.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

geopandas is an insanely expensive dependency to add. Adding geopandas to the project pulls in the following other packages that we weren't previously using:

ranca, brotli, charset-normalizer, contourpy, cycler, fiona, folium, fonttools, h2, hypack, hyperframe, idna, importlib-resources, joblib, kiwisolver, lcms2, libgfortran-ng (in addition to libgfortran and libgfortran5 that we already had), libspatialindex, mapclassify, matplotlib-base, munkres, networkx, openjpeg, pillow, pyproj (surprised we didn't have that), pysocks, qhull, requests, rtree, scikit-learn, scipy, threadpoolctl, unicodedata2, urllib3, xyzservices.

fiona is the library geopandas uses for reading shapefiles. All of fiona's dependencies are already present in our environment, so it's just one package to add. I'll try using fiona without the rest of geopandas.

'the_geom': gdf.geometry # Keep as shapely geometry objects
})

return df
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally try to avoid having multiple returns in the same function, because I find that when I edit a function that has multiple returns, I often change one and forget to change the other.

Suggestion:

try:
    df = ...
except Exception as e:
    ...
    df = ...
return df

# Try local shapefile first
if 'local_file' in sc:
try:
return retrieve_shapes_local(country_key, mode)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I would prefer not to have multiple returns.

if 'local_file' in sc:
try:
return retrieve_shapes_local(country_key, mode)
except Exception as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make most sense to remove the error-handling code from retrieve_shapes_local, let it throw an exception if something goes wrong, and have a single try/except in this function that handles exceptions for both cases. Less duplication, less risk of changing one and forgetting to change the other.

try:
return retrieve_shapes_local(country_key, mode)
except Exception as e:
print(f"Local shapefile failed, trying database: {e}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with this error handling strategy. I think the config file should specify either shapefiles or db queries, but not both; and if it specifies shapefiles, and we get an error while trying to load those shapefiles, then it should fail, not fall back to something else.

if mode == "pixel":
[[y0, x0], [y1, x1]] = json.loads(geom_key)
shape = Polygon([(x0, y0), (x0, y1), (x1, y1), (x1, y0)])
return shape
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid multiple returns.

@aaron-kaplan
Copy link
Collaborator

validate_csv also needs to be changed. It's used by the CSV upload/validate button.

@aaron-kaplan
Copy link
Collaborator

Superseded by #570 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants