Friday, August 12, 2011

Find the Closest Pair of Points

Problem: Given N points on a Cartesian coordinate plane, find the two points with the shortest distance between them.

Solution: We can use an incremental approach: solve P(n) from the solution of P(n-1). First sort the points by their x-coordinates. Now assume we have solved the problem for the first n-1 points and the shortest distance found so far is d. When the nth point is considered, we do not need to naively compute its distance to all of the previous n-1 points. We can leverage the important information we have already obtained: d. If the nth point forms a closer pair, the other point of that pair must lie in a specific area: a rectangle of dimensions d by 2d (d wide in x, 2d tall in y, centered on the nth point's y-coordinate) that sits immediately to the left of the nth point. Moreover, the number of points that can lie in this area is limited: at most six (which can be proved using the pigeonhole principle). Therefore our algorithm is as follows:
  • Two important data structures are used. The first is an array X_array containing all the points sorted by their x-coordinates. The second is a BST Y_BST containing the currently active points sorted by their y-coordinates. Treat the first two points as the initial active region of X_array, insert them into Y_BST, and initialize d_min as the distance between these first two points.
  • Use a sweep-line method: a vertical line sweeps from left to right, starting at the third point in X_array.
  • When the line touches a point p, given the current d_min, first remove from the active region of X_array the points whose x-distance to p is larger than d_min, and remove the same points from Y_BST. Then add p to Y_BST and extend the active region of X_array to include p.
  • Search Y_BST for the points whose y-coordinates lie in the range [p.y-d_min, p.y+d_min]. For each point in that range, calculate its distance to p; if it is smaller than d_min, update d_min.
  • When all the points have been swept, the algorithm finishes.
  • The complexity of the whole algorithm is O(nlogn). Why? The initial sort of X_array costs O(nlogn). For each swept point, locating the new left boundary of the active region in X_array takes O(logn), and each point is inserted into and removed from Y_BST at most once, at O(logn) per operation, so the total maintenance cost is O(nlogn). Since only a bounded number of points can lie in the d*2d rectangle, the distance calculations for each swept point take constant time (after an O(logn) range lookup in Y_BST). Therefore, the complexity of the whole algorithm is O(nlogn).
#include <iostream>
#include <vector>
#include <set>
#include <algorithm>
#include <cmath>
#include <utility>
using namespace std;

typedef pair<double, double> Coordinate;

bool compX(Coordinate p1, Coordinate p2)
{
   if(p1.first != p2.first)
        return p1.first < p2.first;
   else return p1.second < p2.second;
}

bool compY(Coordinate p1, Coordinate p2)
{
  if(p1.second != p2.second)
    return p1.second < p2.second;
  else return p1.first < p2.first;
}

double cal_distance(Coordinate p, Coordinate q)
{
   double a = p.first - q.first;
   double b = p.second - q.second;

   return sqrt(a*a + b*b);
}

double find_closest(vector<Coordinate> &co)
{

   //sort based on X coordinates
   sort(co.begin(), co.end(), compX);

   if(co.size() < 2) return -1;   //need at least two points
   if(co.size() == 2) return cal_distance(co[0], co[1]);

   double d_min = cal_distance(co[0], co[1]);
   //initialize the y table (a BST ordered by y-coordinate)
   bool(*fn_pt)(Coordinate,Coordinate) = compY;
   set<Coordinate, bool(*)(Coordinate,Coordinate)> y_table(fn_pt);
   y_table.insert(co[0]);
   y_table.insert(co[1]);

   //start from the third point from the left
   int start = 2;

   vector<Coordinate>::iterator tp;
   vector<Coordinate>::iterator it_head = co.begin();
   vector<Coordinate>::iterator it_tail = it_head+1;

   for( ; start < (int)co.size(); start++)
   {
      //points whose x-distance to co[start] is at least d_min can be dropped;
      //it_tail points at the last active point, so search [it_head, it_tail+1)
      double l_bound = co[start].first - d_min;
      Coordinate lc(l_bound, 0);
      tp = lower_bound(it_head, it_tail+1, lc, compX);

      //delete the same points from y_table and shrink the active region
      vector<Coordinate>::iterator pp;
      for( pp = it_head; pp != tp; pp++)
      {
         y_table.erase(*pp);
      }
      it_head = tp;
      it_tail++;
      y_table.insert(co[start]);

      //select points from y_table
      double y_low = co[start].second - d_min;
      double y_high = co[start].second + d_min;

      //search for these two bounds
      set<Coordinate, bool(*)(Coordinate,Coordinate)>::iterator sp, sp_low, sp_high;
      Coordinate y_lc(0, y_low);
      Coordinate y_hc(0, y_high);
      //the set's own lower_bound keeps each lookup at O(logn)
      sp_low = y_table.lower_bound(y_lc);
      sp_high = y_table.lower_bound(y_hc);

      //calculate the distance and compare
      for(sp=sp_low; sp!=sp_high; sp++)
      {
        if(sp->first == co[start].first && sp->second == co[start].second)
             continue;
        double new_d = cal_distance(co[start], *sp);
        if(new_d < d_min) d_min = new_d;
      }

   }

   return d_min;
}
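
As a quick sanity check, the function above can be exercised with a small driver like the one below; the sample points are arbitrary test data, and the closest pair among them is (1,1) and (1.5,1.5), at distance sqrt(0.5), roughly 0.707.

int main()
{
   vector<Coordinate> points;
   points.push_back(make_pair(0.0, 0.0));
   points.push_back(make_pair(3.0, 4.0));
   points.push_back(make_pair(1.0, 1.0));
   points.push_back(make_pair(5.0, 2.0));
   points.push_back(make_pair(1.5, 1.5));

   //expected output: roughly 0.7071 (distance between (1,1) and (1.5,1.5))
   cout << "closest distance: " << find_closest(points) << endl;
   return 0;
}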


Alternative approach: We can also use a k-d tree, more precisely a 2-d tree. A k-d tree is basically a binary tree. It is constructed by recursively partitioning the points along a particular axis (dimension). For example, for a 2-d tree we first partition by x-coordinate (partitioning by y-coordinate first is also fine): given all the points, we find the point whose x-coordinate is the median of all the x-coordinates and take that point as the root. The remaining points are divided into two groups: a) those whose x-coordinates are smaller than root.x, and b) those whose x-coordinates are larger than root.x. We then recursively partition these two groups, but by y-coordinate; this produces the nodes on the second level (assuming the root is the first level). To find the nodes at the third level, we partition by x-coordinate again. Repeat this, alternating the axis, until no group can be partitioned any further.
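
As a rough sketch of the construction step (reusing the Coordinate typedef and the compX/compY comparators from the code above; KDNode and buildKD are illustrative names, not a library API), one possible implementation uses nth_element to place the median of the current axis at the split position:

struct KDNode
{
   Coordinate point;
   KDNode *left, *right;
   KDNode(Coordinate p) : point(p), left(NULL), right(NULL) {}
};

//build a 2-d tree over pts[lo, hi); "axis" is the splitting dimension
//at this level (0 = x, 1 = y) and alternates on each level
KDNode* buildKD(vector<Coordinate> &pts, int lo, int hi, int axis)
{
   if(lo >= hi) return NULL;

   bool (*cmp)(Coordinate, Coordinate) = compY;
   if(axis == 0) cmp = compX;

   //partially sort so the median on the current axis lands at position mid
   int mid = (lo + hi) / 2;
   nth_element(pts.begin()+lo, pts.begin()+mid, pts.begin()+hi, cmp);

   KDNode *node = new KDNode(pts[mid]);
   node->left  = buildKD(pts, lo, mid, 1-axis);
   node->right = buildKD(pts, mid+1, hi, 1-axis);
   return node;
}

The root of the tree for the whole point set would then be buildKD(points, 0, (int)points.size(), 0). Note that nth_element reorders the vector in place.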

Then we need to apply Nearest Neighbor Search (NNS) in the 2-d tree. Here I briefly explain the algorithm; a rough code sketch follows the list.

  • First, we locate the query point q. This process is similar to a binary search based on q's coordinates, and we stop when we encounter a leaf node. Calculate the distance between this leaf node and q; this is the current best (nearest neighbor).
  • However, we cannot stop here, since the current best may not be optimal. We unwind the recursion and go to the parent of the leaf node. Calculate the distance between this (parent) node and q and, if necessary, update the current best. We also need to check whether a nearer neighbor could lie in the other half-plane divided by this (parent) node. This can be done by checking whether the circle centered at q with radius equal to the current best crosses the splitting line. If it does, we need to check the other branch of this node: basically, we repeat the previous steps, searching for q in that subtree.
  • Once we are done with the (parent) node, we keep going up and check the parent of this (parent) node in the same way, until we have processed the root node.
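
Below is a rough sketch of this search, reusing the KDNode struct from the construction sketch and cal_distance from the earlier code (nearest, best_point, and best_dist are illustrative names). The leaf-first descent and the unwinding described above are folded into one recursion: we always descend into the half-plane that contains q first, and visit the other branch only when the circle around q with radius equal to the current best crosses the splitting line.

//recursive NNS; "axis" is the splitting dimension at this node (0 = x, 1 = y)
void nearest(KDNode *node, const Coordinate &q, int axis,
             Coordinate &best_point, double &best_dist)
{
   if(node == NULL) return;

   //update the current best with this node if it is closer to q
   double d = cal_distance(node->point, q);
   if(d < best_dist)
   {
      best_dist = d;
      best_point = node->point;
   }

   //signed distance from q to the splitting line through this node
   double diff = (axis == 0) ? q.first - node->point.first
                             : q.second - node->point.second;

   KDNode *near_side = (diff < 0) ? node->left  : node->right;
   KDNode *far_side  = (diff < 0) ? node->right : node->left;

   //search the half-plane containing q first
   nearest(near_side, q, 1-axis, best_point, best_dist);

   //only search the other half-plane if the current-best circle crosses the line
   if(fabs(diff) < best_dist)
      nearest(far_side, q, 1-axis, best_point, best_dist);
}

To query, initialize best_dist to a very large value (for example numeric_limits<double>::max(), from <limits>) and call nearest(root, q, 0, best_point, best_dist). When using this for the closest-pair problem, the query point itself must be skipped, for example by ignoring zero distances.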

The empirical (average-case) cost of a NNS in a 2-d tree is O(logn), so running the NNS for all n points is O(nlogn). Considering that the cost of building the 2-d tree is also O(nlogn), the total cost for this approach is O(nlogn) as well.

See Also: The closest pair problem