Playing around with RCNN, State of the Art Object Detector

I was playing around with a state of the art Object Detector, the recently released RCNN by Ross Girshick. The method is described in detail in this arXiv paper, and soon to be a CVPR 2014 paper. It takes an ImageNet-pretrained Convolutional Network of Krizhevsky et al. (i.e. the paper that rocked computer vision last year) and fine-tunes the network on PASCAL VOC detection data (20 object classes and 1 background class). At test time RCNN uses Selective Search to extract ~2000 boxes that likely contain objects and evaluates the ConvNet on each one of them, followed by non-maximum suppression within each class. This usually takes on the order of 20 seconds per image with a Tesla K40 GPU. RCNN uses Caffe (a very nice C++ ConvNet library we use at Stanford too) to train the ConvNet models, and both are available under a BSD license on GitHub.
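To make that test-time flow concrete, here is a rough NumPy sketch of the steps just described (region proposals, per-class SVM scoring, greedy per-class non-maximum suppression). The helpers `selective_search` and `convnet_features`, and the SVM weight matrices, are hypothetical placeholders standing in for the real Selective Search and Caffe calls; this is not the actual RCNN code, just the shape of the pipeline:

```python
import numpy as np

def nms(boxes, scores, overlap_thresh=0.3):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    remaining boxes that overlap it by more than overlap_thresh IoU."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= overlap_thresh]
    return keep

def detect(image, selective_search, convnet_features, svm_W, svm_b):
    """Hypothetical test-time RCNN flow: ~2000 proposals -> ConvNet
    features -> per-class linear SVM scores -> per-class NMS."""
    boxes = selective_search(image)            # (N, 4) box proposals
    feats = convnet_features(image, boxes)     # (N, D) ConvNet features
    scores = feats.dot(svm_W) + svm_b          # (N, num_classes) SVM scores
    detections = {}
    for c in range(scores.shape[1]):
        keep = nms(boxes, scores[:, c])
        detections[c] = [(boxes[k], scores[k, c]) for k in keep]
    return detections
```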
Other CNN-based detection systems I'm aware of include Overfeat (from Pierre Sermanet et al. at NYU) and "Generic Object Detection with Dense Neural Patterns and Regionlets" (from Will Zou et al. at Stanford), but neither has nice code (for detection) available online.
Below are some example results of running RCNN on some random images from Flickr. Keep in mind that the training data in PASCAL VOC contains only 20 classes (Aeroplanes, Bicycles, Birds, Boats, Bottles, Buses, Cars, Cats, Chairs, Cows, Dining tables, Dogs, Horses, Motorbikes, People, Potted plants, Sheep, Sofas, Trains, TV/Monitors); examples of the training data can be found here. The number in brackets next to each detection is the SVM score for that class (higher = more confident), and I only display detections with score > -0.5. Also keep in mind that these are raw SVM scores, so technically they aren't exactly comparable across classes.
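Concretely, the display filtering amounts to something like the following minimal sketch (reusing the hypothetical `detect` output from the snippet above, not the actual visualization code):

```python
def detections_to_show(detections, class_names, score_thresh=-0.5):
    """Keep only detections above the raw per-class SVM score threshold.
    Scores from different class SVMs are not calibrated against each
    other, so the cutoff is just a rough display heuristic."""
    shown = []
    for c, dets in detections.items():
        for box, score in dets:
            if score > score_thresh:
                shown.append((class_names[c], box, float(score)))
    # sort by score for display; cross-class comparisons are only approximate
    return sorted(shown, key=lambda d: d[2], reverse=True)
```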