Calculating Distances with MySQL and Zip Codes

March 20th, 2008

I recently have been working on a major project where were setting up a system to search through candidates and find people within a certain distance from a specific zipcode and i wanted to put my ideas out there. I found a lot of help on the MySQL mailing list archives and other places across the net, and because of that i wanted to share my solution with others who may run into these questions.

Ok so here’s my problem, given a zip code of say 27615, (which I find the latitude and longitude of through Yahoo), I wanted to find everyone who is within 50 miles of that location. I wanted most of the heavy lifting to be done in MySQL since its known for being fast for doing complex math calculations so that’s where I started, I also wanted it to be accurate, but the two don’t come easy. I found the best formula was to use the haversine formula, if you don’t know what this is, I suggest you head out to Google and do a little searching. Basically it can calculate the distance between two points using latitude and longitude and also account for the fact that the planet isn’t flat and it’s also not a perfect circle.

Here is a simple function using the formula in PHP format:

function calculateDistLatLong($lat1, $lon1, $lat2, $lon2,  $units = "S") {
$a = pow( (sin(0.0174 * ($lat2 -  $lat1) / 2)) ,2) +
        cos(0.0174  * $lat1) *
        cos(0.0174  * $lat2) *
        pow(  
            (sin(0.0174 * ($lon2 - $lon1)/2)) 
     ,2);
 
$c = 2 * atan2(sqrt($a), sqrt(1-($a)));
 
//R (Earth  Radius) = 3956.0 mi = 3437.7 nm = 6367.0 km
 
switch($units)  {
    case 'S': // STATUTE MILES
    $R = 3956.0;
    break;
case  'N': // NAUTICAL
    $R  = 3437.7;
    break;
case  'K': // KILOMETERS
    $R  = 6367.0;
    break;
}
 
return ($R * $c);
}

A modified MySQL version is this:

3956 * 2 *
    atan2(
        sqrt(
            pow((
                sin(0.0174 * ($lat1 - $lat2)/2)
            ),2) +
        cos(0.0174  * $lat2) *
        cos(0.0174  * $lat1) *
        pow((
            sin(0.0174 * ($long1 - $long2) / 2)
        ),2)
    ) ,
    sqrt(
        1  - (
            pow((
                sin(0.0174  * ($lat1 - $lat2) / 2)
            ),2) +
            cos(0.0174  * $lat2) *
            cos(0.0174  * $lat1) *
            pow((
                sin(0.0174  * ($long1 - $long2)/2)
            ),2)
        )
    )
) AS geoCodeDistance

As you can see, pretty complex. But mainly because of the math function.

My initial problem was that the SQL query itself was taking around 10 seconds to run against 30,000+ records, so what I initially did was to move the code outside of the query and process the distance with PHP. This brought the query time down to around 3 seconds and the total process time down to around 6 seconds. This was still unacceptable. So I figured out there was a way to cut down the number of results I needed to process within the same query. I found another similar math function to trim my results. I found the latitude and longitude of my starting coordinates then calculated out a rough square shaped perimeter of the radius I was searching.

When I was using the PHP method previously, I did this then ran the results through the function above to narrow down the results. If it returned a value greater than my $radius I would throw it out.
So with the assumptions of $lat2 and $long2 being my source, or where I want to search from.

$lat_range = $radius / 69.172;
$lon_range = abs($radius / (cos($lat) * 69.172));
$min_lat = number_format($lat2 - $lat_range, "6",  ".", "");
$max_lat = number_format($lat2 + $lat_range, "6",  ".", "");
$min_lon = number_format($long2 - $lon_range, "6",  ".", "");
$max_lon = number_format($long2 + $lon_range, "6",  ".", "");

Then add this to the sql query

$sql .= " (geoLatitude BETWEEN '$min_lat' AND '$max_lat' AND  geoLongitude BETWEEN '$min_lon' AND '$max_lon')";

What I eventually settled on was a mixture of the two. By running the first query in the select then placing

$sql .= " HAVING geoCodeDistance <=  ".$radius."";

At the end of the query, I could limit the number of results that were looked at in the select. It now takes under 1 second to run the whole thing. Much, much better result.

So with a table such as

CREATE TABLE `MyTable` (
    `geoLongitude` FLOAT(9,6),
    `geoLatitude` FLOAT(9,6)
) TYPE=MyISAM ;

I end up with a query similar to this

SELECT *,
3956 * 2 * atan2(
            sqrt(
                pow((
                    sin(0.0174 * ($lat1 - $lat2)/2)
                ),2) +
                cos(0.0174  * $lat2) *
                cos(0.0174  * $lat1) *
                pow((
                    sin(0.0174 * ($long1 - $long2) / 2)
                ),2)
            ) ,
            sqrt(
                1  - (
                    pow((
                        sin(0.0174  * ($lat1 - $lat2) / 2)
                    ),2) +
                    cos(0.0174  * $lat2) *
                    cos(0.0174  * $lat1) *
                    pow((
                        sin(0.0174  * ($long1 - $long2)/2)
                    ),2)
                )
            )
) AS geoCodeDistance
FROM MyTable
WHERE (geoLatitude BETWEEN '$min_lat' AND '$max_lat' 
    AND geoLongitude BETWEEN '$min_lon' AND '$max_lon')
HAVING geoCodeDistance <= 50
ORDER BY geoCodeDistance

As far as finding the latitude and longitude, there are a plethora of free resources out there to help you with that. Yahoo has released their API for geocoding as well as mapping. Hopefully Google won’t be far behind with an API for geocoding. Although the Yahoo one is excellent, and allows for quite a large number of queries to their server. You can find more info at http://developer.yahoo.net/maps/rest/V1/geocode.html

I wish I had all the links to the pages where I found some of this information, but I don’t. I Hope this helps someone out there.