Geocoding in Stata

I had to geocode about 50,000 addresses for a project recently and found a new user-written Stata command, georoute.

The authors argue their command is easier to use than the alternatives, some of which apparently no longer work. The major issue for all of these commands appears to be the underlying data source. It turns out to be tough to process thousands of observations while geocoding street addresses (as opposed to using latitude and longitude) and getting back both distance and travel time, which is what most of us need for our research projects.

The authors use HERE (hence the hereid, herecode, and herepaid options in the code below). You can process 15,000 addresses per month for free, or pay for additional access (e.g., $49/month for 100K).

Here are some hints for using the command:

  1. Be sure to include the country; this is easy to forget when using addresses within a single country.
  2. Be careful of addresses with a comma in the street address (e.g., 123 Elm St, #1); I had to strip out the comma and everything after it to get those addresses to geocode.
  3. Reduce the dataset to the bare minimum number of variables, and avoid Stata IC. The command appears to create a huge number of temporary variables; when I ran it in my IC version by mistake, it would not run at all.
  4. You may have to partition the dataset into chunks and process them separately; see my code below. Even in SE on my souped-up desktop, I was getting a Stata memory error ("no room to add more variables because of width"; width is the number of bytes required to store a single observation, i.e., the sum of the widths of the individual variables, and the maximum allowed is 1,048,576 bytes). I managed to get it to work by processing only 500 cases at a time.
  5. Check the first few slices of the dataset to make sure the geocoding is working; otherwise you will waste a lot of time rerunning observations.
  6. It is really slow: about 3-5 minutes per 500 observations, which works out to roughly 5-8 hours for 47,000 addresses, and I have Google Fiber 100 Mbps internet. I corresponded with a tech at HERE, who said calling their database via the API should be almost instantaneous, so the bottleneck must be how the georoute command is coded in Stata. The timer option did not speed up the process, nor did speed differ between a free account and a paid account.
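Hint 1 amounts to a small data-prep step before geocoding. A minimal sketch, assuming all of the start addresses are in the United States and using the student_country variable name that appears in the georoute call below:

```stata
* add an explicit country so every start address is complete (assumes all US)
gen student_country = "United States"
```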
* georoute does not like commas in street addresses, so strip the comma and everything after it
gen comma_pos = strpos(student_street, ",")
replace student_street = substr(student_street, 1, comma_pos - 1) if comma_pos > 1
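If you would rather keep the unit number than drop everything after the comma, an alternative (a sketch I have not run against the geocoder) is to delete just the commas with subinstr():

```stata
* remove all commas but keep the rest of the street address
replace student_street = subinstr(student_street, ",", "", .)
```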

* first, divide the 47k dataset into separate datasets of 500 observations
foreach j of numlist 0(500)47000 {            // counts from 0 to 47K by 500; lower bound for keeping cases, e.g. 1000
   local k = `j' + 499                        // upper bound for keeping cases, e.g. 1499, 1999
   use temp0, clear                           // load main dataset, n=47K
   keep if _n >= `j' & _n <= `k'              // only keep a slice of (at most) 500 observations
   save "$srpdata\temp\geodata`j'", replace   // save the slice
   display `j' " " `k'                        // display the lower and upper bounds to track progress
}                                             // no need to bump `j' by hand; foreach resets it each pass

* second, read in each dataset and geocode them separately
foreach j of numlist 0(500)47000 {             // counts from 0 to 47K by 500 to match the dataset suffixes
   use "$srpdata\temp\geodata`j'", clear       // load a slice of data
   georoute, hereid(XXXX) herecode(XXXXX) ///  // use georoute for the slice
      startad(student_street student_city student_zip student_country) ///
      endad(site_address site_city site_zip country) ///
      distance(camp_dist) time(camp_time) herepaid timer replace
   save "$srpdata\temp\geodata`j'", replace    // save the slice with geocoded data
   display "file saved at " "$S_TIME"          // lets you track the time to geocode each slice
}

* third, recombine datasets
use "$srpdata\temp\geodata0", clear
foreach j of numlist 500(500)47000 {
   append using "$srpdata\temp\geodata`j'"
}
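After recombining, it is worth one last check in the spirit of hint 5. A sketch using the camp_dist and camp_time variables created by georoute above:

```stata
* count observations that failed to geocode; these need to be fixed and rerun
count if missing(camp_dist) | missing(camp_time)
```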