You are here

Merge big CSV files with Pandas and Python

I deal with large CSV at work, mostly database dumps. Google Sheets and Numbers/Excel just can't keep up with formula changes for 300k+ lines. What do we do? We use Python.

The use case here is replicating a the VLOOKUP function with a left join. We want to get the matching criteria from our referenced CSV file, but only the matching (otherwise that would be a full outer join).

Here's what our data looks like:

We want to look up the lat/long in our reference file and join on zip code. Here's what that data looks like:

Note the matching "zip" column names.

Here's the code:

#!/bin/python3


# import pandas
import pandas as pd
   
# read csv data
df1 = pd.read_csv('locations.csv',dtype={"zip": float})

# Our zipcodes
df2 = pd.read_csv('US_zips.csv')
   
Left_join = pd.merge(df1, 
                     df2, 
                     on ='zip', 
                     how ='left')


Left_join.to_csv('combined-retail.csv')

Let's take a look:

df1 = pd.read_csv('locations.csv',dtype={"zip": float})

Here we're reading from our list of locations file, the "zip" column is of the data type float

Left_join = pd.merge(df1, 
                     df2, 
                     on ='zip', 
                     how ='left')

We have several options on how we'd like to join: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’} or by default ‘inner’.

If our columns had different names we'd like to join on, we could use:

Left_join = pd.merge(df1, 
                     df2, 
                     left_on ='zip', 
                     right_on = 'zip_code',
                     how ='left')

When we run the code, this is our result:

In this use case, I was mapping locations for use with Kepler.gl and as such it requires latitude and longitude. A few resources haven't been updated in years regarding zip codes, but I was able to find this: http://download.geonames.org/export/zip/. You'll find all the zip codes with additional geographical names, great for looking up data.
I didn't know much about zip codes (since we have postal codes up here) but I found this tidbit interesting:
"
ZIP codes are a very messy kind of geography. They were created by the U.S. Postal Service as a tool to help deliver the mail more efficiently. ("ZIP" is actually an acronym for "Zone Improvement Plan", where "Zone" is a reference to the 2-digit postal zones that were used by the post office prior to implementing nationwide ZIP codes back in the early 1960's. Because it is an acronym we always use the uppercase for it.) ZIP codes have been adopted by marketing people and by all kinds of other researchers as a standard geographic area, like a city or a county. We see maps of ZIP codes in telephone books and from commercial vendors that make us think of them as spatially defined areas with precise boundaries, similar to counties. But, from the perspective of the agency that defines them, the U.S. Postal Service, ZIP codes are not and never have been such spatial entities. They are simply categories for grouping mailing addresses. As such, ZIP codes do in most cases resemble spatial areas since they are comprised of spatially clustered street ranges. But not always. In rural areas, ZIP codes can be collections of lines (rural delivery routes) that in reality do no look much like a closed spatial area. In areas where there is no mail delivery (deserts, mountains, lakes, much of Nevada and Utah) ZIP codes are not really defined. You may see maps that show ZIP code boundaries that include such areas, but these are not post-office-defined official definitions. An area will not be assigned a ZIP code until there is a reason for it, i.e. until there needs to be mail delivered there. So the actual definition of a ZIP code "boundary" is quite fuzzy at best, and a purely extrapolated guess (at what it would be if someone were to start receiving mail there) at worst. If you have an application that requires extreme geographic precision, especially in sparsely populated areas, then you need to avoid using ZIP codes.
"

Powered by Drupal