WARNING: This post was originally published in October 2009 and may no longer be up-to-date.

MySQL Distance Calculations

One customer recently needed their website to know the distance between people. It did not have to include exact driving directions so we based it off of zip code latitudes/longitudes. We obtained these indirectly from the US Census Bureau and used fuzzy trigonometry.

The first step was to find a way to put the zip code information into a database. The latest available (circa 2004) which was easily workable came from http://www.boutell.com/zipcodes/.

We created a table so that we could import the data.

CREATE TABLE zipcodes (
    zip varchar(5) NOT NULL default '',
    city varchar(64) default NULL,
    state varchar(2) default NULL,
    latitude float NOT NULL default '0',
    longitude float NOT NULL default '0',
    timezone int(11) default NULL,
    dst tinyint(1) NOT NULL default '0',
    PRIMARY KEY (zip)
)

This will allow for a direct import of the CSV data with one small annoyance. The .csv file had blank lines in it. You can remove those by hand in most text editors.

If you did try to import before removing those blank lines, you will want to truncate (empty) the database before the next import. MySQL should stop you automatically if you made the zip code a primary key, as we did, and you do not want to duplicate this information.

After saving, we imported the data into the database by using phpMyAdmin. You can also use the SQL command LOAD DATA LOCAL INFILE.

Once it has been imported, we needed a way to track a point in space. Even though there is little support for it in some third-party programs (such as phpMyAdmin), it turns out that MySQL does have a POINT data type which is specifically designed for this purpose. You need to alter the structure of the table using this command:

ALTER TABLE `zipcodes`
ADD `location` POINT NOT NULL

The next part gets into the realm of math. If you have ever taken a look at 3D programming, speed is squeezed out of the code by performing as many calculations as possible before they are needed. One popular trick, at least in the past, has been to create sine and cosine lookup tables. Instead of having to calculate these numbers you can instruct the program to find their value by going to the right location in an array. This allowed some 3D games to run smoothly even on old 80 Mhz computers.

Distance calculations in three dimensions (as our world requires for larger distances) could benefit from this same speed increase. It would be bad to overload the server too much with these frequent calculations. After considering the possibilities, we decided that the best option was to calculate ahead of time the distance from latitude and longitude 0.

If we save both the latitudinal and longitudinal distances, we can use Pythagoras’ theorem (a^2 = b^2 + c^3) to find distances between two of those points. Now the most processor-intensive calculation is a single square root for each distance calculated, which will speed up the routine calculations significantly. Some people who have not done this have ended up with 15 second SQL queries, which is unacceptable. There is another trick later that will give us even more speed.

We visited MeridianWorldData.com and borrowed their calculations. The first step was to boil them down because they were designed to calculate from two specific coordinates and we wanted one of them to have both an x and y coordinate of 0.

This is what we ended up with:

UPDATE `zipcodes`
SET location = GeomFromText(CONCAT('POINT(',
    IF(latitude < 0, -1, 1) * 3963.0 * ACOS(COS(latitude/57.2958) * COS(longitude/57.2958 - longitude/57.2958)), ' ',
    IF(longitude < 0, -1, 1) * 3963.0 * ACOS(SIN(latitude/57.2958) * SIN(latitude/57.2958) + COS(latitude/57.2958) * COS(latitude/57.2958) * COS(longitude/57.2958)),
')'))

Notice that the numbers can be negative in the database. The reason for this is to prevent our calculations from matching American zip codes to Asian cities if this is ever expanded.

We feel bad about using GeomFromText() because that adds processing overhead, and we are huge fans of optimization, but we were not able to get this to work without it and the command is only run when creating (or updating) the database. That will have to be optimized enough for now.

Now it is time to test this with a random zip code in New York city:

SELECT ROUND(X(location)) AS x, ROUND(Y(location)) AS y
FROM zipcodes
WHERE zip = 10285

This returns an X value of 2816 and a Y value of -3755.

That looks good so far. The comment on this blog entry provides the best reference we have found so far for running the distance calculations. If you take the X and Y values from the last SELECT statement, you can find the distance between them and every zip code in the entire database (with the nearest ones first) by using this query:

SELECT city, state, zip, ROUND(GLength(LineStringFromWKB(LineString(AsBinary(location), POINT(2816, -3755))))) As distance
FROM `zipcodes`
ORDER BY distance ASC

The query reports that Jersey City is only a couple miles away, which is mostly accurate (MapQuest gives different distances because you have to find a bridge). Since we are using zip code areas and not exact addresses anyway, a little variation is alright. You would want to use different calculations if you needed distances to be accurate to within a centimeter anyway.

You might also notice the speed on this query. For us, it required roughly 0.1 seconds which is a much faster than if we had not created the POINT data, but it is still too long. There are a couple of things we can still do to speed this up without hard-coding the distances - which would limit us to searches only from one location.

One speed enhancement would be to index the POINT data:

ALTER TABLE `zipcodes`
ADD SPATIAL INDEX (location)

That modification brought it down one or two tenths of a second per search, but we can take even more advantage of the speed by creating boundaries for our query. If we decided that we wanted the zip codes within 10 miles, we could use WHERE distance < 10 but that requires us to calculate distance for every row in the table - which we are already doing - and only saves us on the number of results returned to our code for processing. If we can avoid calculating every row then that should save us a lot more time.

What if we asked for a box? A query that asks for any zip codes within 60 miles of the latitude and 60 miles of the longitude would look like this:

SELECT city, state, zip, ROUND(GLength(LineStringFromWKB(LineString(AsBinary(location), POINT(2816, -3755))))) As distance
FROM `zipcodes`
WHERE X(location) > 2816 - 60
  AND X(location) < 2816 + 60
  AND Y(location) > -3755 - 60
  AND Y(location) < -3755 + 60
ORDER BY distance ASC

That cut the time in half for a query on my server, which puts it somewhere around 0.04 seconds each. That is not bad. The difference in speed is now minuscule if you delete the POINT index but there is still a small difference. We have chosen to keep it.

You will probably notice that some of the values returned are as much as 85 miles from our starting point. That is okay. We can silently discard any rows with a distance above 60 when displaying the data to the users.

We know general distances now. Let us find out how far it is to a specific location, such as Miami, Florida. This is all that it takes to check the distance to one zip code:

SELECT city, state, zip, ROUND(GLength(LineStringFromWKB(LineString(AsBinary(location), POINT(2817, -3754))))) As distance
FROM `zipcodes`
WHERE zip = 33125

It says 1548. Mapquest reports 1292. For Memphis, TN (37544) we get 1198 and MapQuest says 1099. Portland, OR (97232) says 1529 and MapQuest says 2895.

This is definitely not exact, but it is close enough for small distances. Feel free to adapt this to your needs. If nothing else, hopefully these SQL functions will give you a few ideas. If this project grows, we will refine these calculations further.

We would be interested in hearing what you are doing.