NeuralTalk: Multimodal Recurrent Neural Network for generating image captions

NeuralTalk Model Zoo

See the main GitHub page for the NeuralTalk documentation.

Below are some NeuralTalk model checkpoints. You can evaluate these with the evaluation script in NeuralTalk.

checkpoint	description	test perplexity	BLEU scores
flickr8k_cnn_lstm_v1.p	First attempt to reproduce Google's LSTM results: all settings are as described in the Google paper, except that VGGNet is used for CNN features instead of GoogLeNet. Not quite there yet: Google reports BLEU scores B-1, B-2, B-3 of [63, 41, 27] on a 0-100 scale, i.e. 0.63, 0.41, 0.27 in the units used in this table.	15.687797 (vocab size 2538)	B-1: 0.582093 B-2: 0.378414 B-3: 0.189930
coco_cnn_lstm_v2.p	An LSTM trained on COCO with 512 hidden units (as presented in the Google paper), but using VGGNet instead of GoogLeNet. Uses a beam size of 1 and a single model (no ensemble).	11.555093 (vocab size 8791)	B-1: 0.649 B-2: 0.464 B-3: 0.321
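
To help read the test perplexity column, here is a minimal sketch of how perplexity is commonly computed from per-token log-probabilities. The function name `perplexity` and the base-2 convention are assumptions made for illustration. This page does not state the exact normalization used by the NeuralTalk evaluation script (for example, whether end-of-sentence tokens are counted), so treat this as a generic definition rather than the script's implementation.

```python
import math


def perplexity(token_log2_probs):
    """Return 2 ** (-mean log2 probability) over all scored tokens.

    token_log2_probs: list of log2 p(word | context) values, one per
    predicted token (hypothetical input format, chosen for this sketch).
    """
    if not token_log2_probs:
        raise ValueError("need at least one token log-probability")
    mean_log2 = sum(token_log2_probs) / len(token_log2_probs)
    return 2.0 ** (-mean_log2)


# A model that spreads probability uniformly over a 2538-word vocabulary
# scores a perplexity of about 2538 (the vocabulary size), which is the
# baseline against which a value like 15.69 can be compared.
uniform = [math.log2(1.0 / 2538)] * 10
print(perplexity(uniform))
```

Lower is better: roughly speaking, a perplexity of about 15.7 means the model is, on average, as uncertain about the next word as if it were choosing uniformly among about 16 words.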

Web Demo

See the NeuralTalk web demo for 1,000 example predictions. The demo uses the COCO-trained LSTM network listed above.