Reid Pryzant

JESC

Japanese-English Subtitle Corpus

About
JESC aims to support the research and development of machine translation systems, information extraction, and other language processing techniques. It consists of 3.2M parallel sentences.

JESC is the product of a collaboration between Stanford University, Google Brain, and Rakuten Institute of Technology. It was created by crawling the internet for movie and TV subtitles and aligning their captions. It is one of the largest freely available EN-JA corpora, and covers the poorly represented domain of colloquial language.

You can download the scripts, tools, and crawlers used to create this dataset on GitHub.

You can read the paper here.

These data are released under a Creative Commons (CC) license.

Contents
  • A large corpus consisting of over 3.2 million sentences.
  • Translations of casual language, colloquialisms, expository writing, and narrative discourse. These domains are hard to find in existing JA-EN MT data.
  • Pre-processed data, including tokenized train/dev/test splits.
  • Code for making your own crawled datasets and tools for manipulating MT data.
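
As a rough illustration of working with the pre-processed splits, here is a minimal Python sketch for reading sentence pairs. It assumes each line of a split file holds one English sentence and one Japanese sentence separated by a single tab; that layout, the column order, and the name read_pairs are assumptions made for this example, so check them against the files you download. The sample line is invented for the demonstration and is not taken from the corpus.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_pairs(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (english, japanese) pairs from a tab-separated text stream.

    Assumed layout: one pair per line, English first, then a tab, then Japanese.
    Blank lines and lines without exactly two fields are skipped.
    """
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # malformed line: not exactly two tab-separated fields
        yield parts[0], parts[1]


# Invented sample data standing in for a real split file.
sample = io.StringIO("hello\tこんにちは\n\nline without a tab\n")
pairs = list(read_pairs(sample))
print(pairs)
```

To read a real file, the same function could be called on a handle opened with open(path, encoding="utf-8"), where path points to one of the downloaded splits.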


Split    Phrase Pairs
Raw      3,243,887
Train    3,239,888
Dev      2,000
Test     3,001

Download

Cite
@ARTICLE{pryzant_jesc_2017,
   author = {{Pryzant}, R. and {Chung}, Y. and {Jurafsky}, D. and {Britz}, D.},
    title = "{JESC: Japanese-English Subtitle Corpus}",
  journal = {ArXiv e-prints},
archivePrefix = "arXiv",
   eprint = {1710.10639},
 keywords = {Computer Science - Computation and Language},
     year = 2017,
    month = oct,
}