Building a geohash table to quickly look up venues with DynamoDB

With a recent project I have been working on, we needed a way to find venues based on a user's location. We needed to be able to zoom in and refine our search, or zoom out and get a broader picture of what's going on at a particular location, and to do all of this easily and efficiently using DynamoDB. There are great gems that already do this sort of thing, like Geocoder, but in our case we couldn't use them. We had to build a solution of our own, and what we found was that geohashes fit our needs.

A geohash is a way to take any latitude and longitude and map it to a string; you can check out more of the details on Wikipedia. The reason this technique works well for us is that DynamoDB is a key-value store, so we can use the hash buckets a geohash gives us to easily hold information about the venues in a particular bucket. When I say bucket, I mean the mapping from coordinates to a bounding box; you can see this in action here. Since I am using Ruby and the dynamoid gem, I'll dive right into an example to show you how it works.

class GeoBucket
 include Dynamoid::Document
 table :name => :geo_bucket, :key => :geohash

 # the geohash string for this bucket is also its hash key
 field :geohash, :string
 # the set of venue ids that fall inside this bucket
 field :venue_ids, :set
end

What this class does is store the hash of the bucket and all venue_ids which belong to that bucket. I should also mention that a bucket with the hash dpwxp is contained inside the bucket dpwx, which is one resolution lower. This gives us a nice way to control what resolution of buckets we want to use and keep track of. A quick sketch of the prefix relationship (using the same GeoHash.encode call that appears below; the hash values are illustrative):
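
hash = GeoHash.encode(lat, lng, 6) # => "dpwxpb" (illustrative)
hash[0..4]                         # => "dpwxp" -- the resolution-5 parent bucket
hash[0..3]                         # => "dpwx"  -- the resolution-4 parent bucket

Now for the Venue class: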

class Venue
 include Dynamoid::Document
 table :name => :venue, :key => :venue_id

 field :venue_id, :string
 field :name, :string
 field :latitude, :float
 field :longitude, :float
 field :country, :string

 after_create :index_geohash

 # after a venue is created, add its id to the bucket for each
 # geohash resolution we care about (4, 5, and 6 characters)
 def index_geohash
  hash = GeoHash.encode(self.latitude, self.longitude)
  [3, 4, 5].each do |i|
   bucket = GeoBucket.new(:geohash => hash[0..i])
   bucket.update({:return => :none}) do |t|
    t.add(:venue_ids => [self.venue_id.to_s])
   end
  end
 end
end

We first build the hash for the venue's coordinates and then slice it, so we are indexing at geohash resolutions of 4, 5, and 6. We don't want DynamoDB to return anything from our update call, so we specify :return => :none. A nice thing about DynamoDB is that when adding an element to a set it keeps the set unique, so we aren't storing redundant info. We now have an efficient way to quickly look up a bucket and see which venues lie within its boundaries.

The question still remains: given a user's current location, how can we look up the venues around them? This is done by first mapping the user's location to a bucket and then checking the neighbouring buckets to see if our search radius overlaps them (i.e. as the user zooms in and out of a map). I am using the GeoHash library straight from git, because the official gem's fixes haven't been pushed to RubyGems yet, along with the Geocoder gem, although to keep everything lightweight you could implement the bounding box calculation yourself.
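
For reference, here is a rough sketch of that calculation if you did want to drop the Geocoder dependency (a standard small-radius approximation; this helper is my own, not part of the setup used below):

# approximate bounding box around a point, for small radii;
# 111.045 is roughly the number of km per degree of latitude
def bounding_box(lat, lng, radius_km)
 lat_delta = radius_km / 111.045
 lng_delta = radius_km / (111.045 * Math.cos(lat * Math::PI / 180))
 # returns [[sw_lat, sw_lng], [ne_lat, ne_lng]], the same shape used below
 [[lat - lat_delta, lng - lng_delta], [lat + lat_delta, lng + lng_delta]]
end

Now, here is a function on the GeoBucket class which, given the user's current location and a radius, selects the proper resolution to use and finds which buckets we should be pulling from.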

class GeoBucket
 include Dynamoid::Document
 ...

 # current_location is a hash of {:lng => ..., :lat => ...}
 # and radius is specified in km
 def self.get_buckets(current_location, radius = 0)
  # pick a geohash resolution to match the size of the search radius
  resolution =
   if radius <= 1.6
    6
   elsif radius <= 9
    5
   else
    4
   end
  center_point = [current_location[:lat], current_location[:lng]]
  # Geocoder returns [sw_lat, sw_lng, ne_lat, ne_lng]; we pass
  # :units => :km since our radius is in km
  user_query = Geocoder::Calculations.bounding_box(center_point, radius, :units => :km)
  # reshape into [[sw_lat, sw_lng], [ne_lat, ne_lng]]
  user_query = [[user_query[0], user_query[1]], [user_query[2], user_query[3]]]
  hash = GeoHash.new(current_location[:lat], current_location[:lng], resolution)
  neighbors = hash.neighbors
  neighbors_bbox = neighbors.map { |h| GeoHash.new(h).to_bbox }
  # keep only the neighbouring buckets that our query actually overlaps
  buckets = neighbors.select.with_index { |bucket, i| bounding_box_overlap(user_query, neighbors_bbox[i]) }
  buckets += [hash.to_s]
  GeoBucket.find_all(buckets)
 end

 # Determines if two bounding boxes of the form
 # [[min_lat, min_lng], [max_lat, max_lng]] overlap each other
 def self.bounding_box_overlap(bbox1, bbox2)
  return false if bbox1[0][1] > bbox2[1][1] # bbox1 entirely east of bbox2
  return false if bbox1[1][1] < bbox2[0][1] # bbox1 entirely west of bbox2
  return false if bbox1[1][0] < bbox2[0][0] # bbox1 entirely south of bbox2
  return false if bbox1[0][0] > bbox2[1][0] # bbox1 entirely north of bbox2
  true
 end
end

You can see that we first select the resolution depending on how big a radius the user is searching, then check whether the user's query, which is also a bounding box, overlaps the current bucket's neighbours. We now have a way to efficiently grab all venue_ids for a given location. A quick usage sketch (the coordinates and radius here are just illustrative):
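
# find every venue id within roughly 5 km of the user
buckets = GeoBucket.get_buckets({:lat => 43.65, :lng => -79.38}, 5)
venue_ids = buckets.flat_map { |b| b.venue_ids.to_a }.uniq

Later we will need some way to know which venues are more popular, which we get by simulating a check-in whenever a user performs an action at a venue. This will be achieved with a mapreduce!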

What I’ve learned developing for DynamoDB with Ruby

I've recently switched from using Amazon's RDS to Amazon's DynamoDB. I had never worked with NoSQL before; growing up I worked with MySQL because of my father. The concept was dead simple: each table has a hash key and a range key. I won't go into detail on how this works; instead, I'm going to talk about my experience developing for it.

A little info about my environment: we are running our web application servers (Ruby on Rails) on Amazon EC2 behind an Elastic Load Balancer, using a modified version of Dynamoid.

Things I have learned

1) Split data up as much as you can. It's easier to explain this with a simple example. Let's take a User class where we store a user's info plus their settings. This is a super simple example; in reality a user could have more than one device tied to their account.

class User
 include Dynamoid::Document
 table :name => :user
 field :first_name, :string
 field :last_name, :string
 field :email, :string
 # A token used for sending push notifications
 field :device_token, :string
 # These are true false flags to check to see if a user wants to receive
 # certain notification types
 field :notify_new_post, :boolean
 field :notify_new_comment, :boolean
end

Now say we have our application servers rendering user profiles while background workers access the same table to check whether they should send the user a notification. Notice that neither application cares about the other's data: the web servers just want the first_name, last_name, and email, while the background workers just need the settings and the device token. Because we are storing all of this info in a single class, we run into a scaling issue. Yes, DynamoDB can easily handle plenty of reads per second, but read capacity is provisioned per table, so we are essentially splitting it between these two applications. Say we have provisioned 20 reads per second: the background workers and the web app need to share those 20 reads, and the problem only gets worse as more info that could be separated is attached to the User class. The idea is to split the data into its own tables, like so:

class User
 include Dynamoid::Document

 field :first_name, :string
 field :last_name, :string
 field :email, :string

end
class UserSettings
 include Dynamoid::Document

 table :name => :user_settings, :key => :user_id

 field :user_id, :string
 # A token used for sending push notifications
 field :device_token, :string
 # These are true false flags to check to see if a user wants to receive
 # certain notification types
 field :notify_new_post, :boolean
 field :notify_new_comment, :boolean
end

We now have two different tables whose hash key values are identical, giving us a mapping from one table to the other when needed. Now we can independently scale our reads and writes based on what is being used more: if the notification workers need more reads per second, we don't need to scale the user table to achieve it. A quick sketch of how each application now touches only its own table (some_user_id and send_push here are hypothetical stand-ins):
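
# web servers render a profile from the user table only
user = User.find(some_user_id)
# background workers read settings from their own table, keyed by the same id
settings = UserSettings.find(user.id)
send_push(settings.device_token) if settings.notify_new_post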

Locks

When multiple threads are trying to access the same data it is important to enforce mutual exclusion; if we didn't, we would get some really funky outcomes that we really don't want. One way to ensure that only one thread at a time can access a piece of data or a data structure is a lock. For example, consider two threads operating on a shared queue:


Thread A {
    int x = remove_from_front_of_queue(someQueue);
}

Thread B {
    int y = put_in_queue(someQueue, item);
}

By not enforcing mutual exclusion we can get some messy results: the two operations could interleave and corrupt the queue. Below is an implementation of locks in os161.

The structure of our lock: we keep track of the name and the thread that currently owns the lock.


struct lock {
    char *name;
    volatile struct thread *owner; /* thread currently owning the lock */
};

/* lock_create will create the lock and return it when finished */
struct lock *
lock_create(const char *name)
{
    struct lock *lock;

    lock = kmalloc(sizeof(struct lock));
    if (lock == NULL) {
        return NULL;
    }

    lock->name = kstrdup(name);
    if (lock->name == NULL) {
        kfree(lock);
        return NULL;
    }

    lock->owner = NULL;

    return lock;
}

/* just freeing the memory we have allocated and destroying the lock */
void
lock_destroy(struct lock *lock)
{
    assert(lock != NULL);

    kfree(lock->name);
    kfree(lock);
}

/* called when a thread wants to acquire the lock */
void
lock_acquire(struct lock *lock)
{
    int spl;

    /* we must make sure the lock exists and interrupts are turned off */
    assert(lock != NULL);
    spl = splhigh();

    /*
     * When the lock is already acquired we cannot proceed, so the thread
     * must sleep on the lock until the owner is finished. When the thread
     * is woken up it loops around and checks again whether the lock is
     * still held; multiple threads could all be trying to acquire the
     * lock, which is why we need a while loop rather than an if.
     */
    while (lock->owner != NULL) {
        /* thread_sleep keeps track of this lock as the address it sleeps on */
        thread_sleep(lock);
    }

    /* curthread is a global variable in os161 holding the currently running thread */
    lock->owner = curthread;

    /* we can now turn interrupts back on */
    splx(spl);
}

/*
 * When a thread is finished with a lock it will call lock_release.
 * The thread calling this must be the owner of the lock.
 */
void
lock_release(struct lock *lock)
{
    int spl = splhigh();

    assert(lock != NULL);
    assert(curthread == lock->owner); /* check that we own the lock */

    lock->owner = NULL;

    /*
     * thread_wakeup_single must be implemented: loop through the sleeping
     * threads, and if a thread is sleeping on this lock, wake it up.
     */
    thread_wakeup_single(lock);

    splx(spl);
}