About the data:
The data set was constructed by combining and matching data from the New York City Department of Health and Mental Hygiene (link), Yelp's API (link), and Seamless' website (link), all collected in late October 2014.
The Yelp data started with 42,513 listings that 1) are in the "restaurant", "nightlife", or "food" categories, 2) are in the NYC area (the bounding box between 40.496032, -74.257450 and 40.9102, -73.697319), and 3) have one or more reviews (a constraint of Yelp's API). The last condition means that un-reviewed restaurants may be listed on Yelp but aren't included in this data set. The entries were then filtered to remove listings that fall exclusively in non-relevant categories (e.g. "hotels" or "dance clubs"). After these steps, 35,163 Yelp entries remain.
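As a rough illustration of the three conditions and the category filter described above, here is a minimal sketch. The bounding-box coordinates come from the text; the `Listing` fields, the `NON_RELEVANT` set, and the `keep` function are hypothetical names chosen for this example, not the code actually used to build the data set.

```python
from dataclasses import dataclass, field

# Bounding box from the text: south-west corner to north-east corner.
SOUTH, WEST = 40.496032, -74.257450
NORTH, EAST = 40.9102, -73.697319

# Top-level categories named in the text.
RELEVANT_PARENTS = {"restaurant", "nightlife", "food"}
# Assumption: an illustrative subset of the "non-relevant" categories.
NON_RELEVANT = {"hotels", "dance clubs"}


@dataclass
class Listing:
    """Hypothetical record shape for one Yelp listing."""
    name: str
    lat: float
    lon: float
    review_count: int
    parents: set = field(default_factory=set)     # top-level categories
    categories: set = field(default_factory=set)  # specific categories


def keep(listing: Listing) -> bool:
    """Return True if the listing passes all filters described in the text."""
    in_category = bool(listing.parents & RELEVANT_PARENTS)
    in_box = SOUTH <= listing.lat <= NORTH and WEST <= listing.lon <= EAST
    has_reviews = listing.review_count >= 1
    # "Exclusively non-relevant": every specific category is in NON_RELEVANT.
    only_non_relevant = bool(listing.categories) and listing.categories <= NON_RELEVANT
    return in_category and in_box and has_reviews and not only_non_relevant
```

A listing tagged both "dance clubs" and "bars" would be kept under this sketch, because only listings whose categories are all non-relevant are dropped.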
The Seamless data came from the 6,850 restaurant listings in Seamless' NYC website directory. Of these, 254 were closed or otherwise unavailable, leaving 6,596 entries.
The NYC-DHMH data came from the 23,940 restaurants inspected in the last year. Of these, 622 listings were discarded because of an irregular zip code or a missing sanitation score (i.e. the score was pending or in some other state), leaving 23,318 restaurants. This data was used to determine both a restaurant's sanitation score and its borough.
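The clean-up step for the inspection data might look something like the sketch below. The text does not define what counts as an "irregular" zip code, so this example assumes it means anything other than exactly five digits; the field names `zipcode` and `score` and the function `is_usable` are likewise hypothetical.

```python
import re

# Assumption: a "regular" zip code is exactly five digits.
ZIP_PATTERN = re.compile(r"\d{5}")


def is_usable(record: dict) -> bool:
    """Keep records that have a regular zip code and a sanitation score."""
    zipcode = str(record.get("zipcode") or "").strip()
    zip_ok = ZIP_PATTERN.fullmatch(zipcode) is not None
    # Assumption: a pending or otherwise unavailable score is stored as None.
    score_ok = record.get("score") is not None
    return zip_ok and score_ok


def clean(records: list) -> list:
    """Return only the usable inspection records."""
    return [r for r in records if is_usable(r)]
```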
The Yelp and Seamless restaurant entries were then matched to the corresponding NYC-DHMH entries. Matches were determined by a scoring function that compares attributes such as restaurant name, phone number, and street address. The weights of the scoring function were tuned against a small test set to favor zero false positives, and the results were then spot-checked by hand. Against the 23,318 NYC-DHMH listings, the algorithm found matches for 17,591 Yelp listings and 5,282 Seamless listings (4,747 of which have at least one rating).
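To make the idea of a weighted scoring function concrete, here is a minimal sketch of one possible approach. It is not the matcher used for this data set: the weights, the threshold, the record field names, and the use of `difflib.SequenceMatcher` for fuzzy string comparison are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical weights and threshold; the real values were tuned on a test set.
WEIGHTS = {"name": 0.4, "phone": 0.4, "address": 0.2}
THRESHOLD = 0.8  # set high so that uncertain pairs are rejected


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so "Joe's Pizza" equals "JOES PIZZA"."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def phone_digits(phone: str) -> str:
    """Reduce a phone number to its last ten digits."""
    return re.sub(r"\D", "", phone)[-10:]


def similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(listing: dict, inspection: dict) -> float:
    """Weighted sum of per-attribute similarities."""
    a, b = phone_digits(listing["phone"]), phone_digits(inspection["phone"])
    phone = 1.0 if a and a == b else 0.0
    return (
        WEIGHTS["name"] * similarity(listing["name"], inspection["name"])
        + WEIGHTS["phone"] * phone
        + WEIGHTS["address"] * similarity(listing["address"], inspection["address"])
    )


def best_match(listing: dict, inspections: list):
    """Return the highest-scoring inspection record, or None if below threshold."""
    if not inspections:
        return None
    best = max(inspections, key=lambda rec: match_score(listing, rec))
    return best if match_score(listing, best) >= THRESHOLD else None
```

With these particular weights, a pair whose phone numbers differ can score at most 0.6 and is therefore never matched. That is one way to trade recall for fewer false positives, in the spirit of the tuning goal described above.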
You can view both the matched and unmatched restaurants in the above visualization. Use the links under the "Yelp" and "Seamless" headers to toggle matches on and off.
Thanks:
Thanks to NYC-DHMH and to Yelp & Seamless for their patience. Thanks also to Square for the fantastic crossfilter library.