JParaCrawl, created by NTT, is the largest publicly available English-Japanese parallel corpus.
It was built by crawling the web at large scale and automatically aligning parallel sentences.
For more details, see our paper.

Parallel corpus
  • English-Japanese training set (v3.0, 22M sentences, deduped): Download
  • Chinese-Japanese training set (83k sentences, deduped): Download
  • Older versions
    • English-Japanese training set (v1.0, 8.7M sentences, without dedup): Download
    • English-Japanese training set (v2.0, 10.0M sentences, deduped): Download
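The releases listed above are marked as either deduped or without dedup. As a rough illustration of what deduplicating a parallel corpus can involve, the following minimal sketch drops exact duplicate sentence pairs. It is not the procedure used to build JParaCrawl, which is not described on this page; the function name `dedup_pairs` and the in-memory (source, target) tuple representation are assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical representation: one (source sentence, target sentence) tuple per pair.
Pair = Tuple[str, str]


def dedup_pairs(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Yield each whitespace-trimmed pair once, keeping the first occurrence.

    Illustrative only: this removes exact duplicates after trimming
    surrounding whitespace and does no fuzzy or near-duplicate matching.
    """
    seen = set()
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:
            continue  # this exact pair was already emitted
        seen.add(key)
        yield key


if __name__ == "__main__":
    sample = [
        ("Hello.", "こんにちは。"),
        ("Hello.", "こんにちは。 "),  # same pair with trailing whitespace
        ("Thank you.", "ありがとう。"),
    ]
    for src, tgt in dedup_pairs(sample):
        print(f"{src}\t{tgt}")
```

Keeping the first occurrence preserves the original order of the corpus. The set of seen pairs is held in memory, so a corpus with tens of millions of pairs may need hashing of the keys or an external sort instead.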
NMT Models (based on v3.0)

JParaCrawl and the trained models are distributed under the following license.
For commercial use, please contact us.

Terms of Use for Bilingual Data, Monolingual Data and Trained Models

Nippon Telegraph and Telephone Corporation (Hereinafter referred to as "our company".) will provide bilingual data, monolingual data and trained models (Hereinafter referred to as "this data.") subject to your acceptance of these Terms of Use. We assume that you have agreed to these Terms of Use when you start using this data (including downloads).

Article 1 (Use conditions)
This data can only be used for research purposes involving information analysis (Including, but not limited to, replication and distribution. Hereinafter the same in this article.). The same applies to the derived data created based on this data. However, this data is not available for commercial use, including the sale of translators trained using this data.

Article 2 (Disclaimer)
Our company does not warrant the quality, performance or any other aspects of this data. We shall not be liable for any direct or indirect damages caused by the use of this data. Our company shall not be liable for any damage to the system caused by the installation of this data.

Article 3 (Other)
This data may be changed in whole or in part, or provision of this data may be interrupted or stopped at our company’s discretion without prior notice.

If you find our work useful, please cite the following article.
JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus

    title = "{JP}ara{C}rawl: A Large Scale Web-Based {E}nglish-{J}apanese Parallel Corpus",
    author = "Morishita, Makoto  and
      Suzuki, Jun  and
      Nagata, Masaaki",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "",
    pages = "3603--3609",
    ISBN = "979-10-95546-34-4",


We used Bitextor, created by the ParaCrawl project. We gratefully acknowledge the ParaCrawl project for releasing the software and for fruitful discussions.
We would also like to thank Hisashi Itoh and Takumi Asai for their technical support.

Take down

If our data includes your copyrighted work and you want us to delete it, please contact us with the following information.

  • Your name, affiliation and e-mail address.
  • Detailed information about your copyrighted work.
  • How we can locate your work in our data, such as your domain name.

For any inquiries about JParaCrawl, please contact us by email.

NTT Communication Science Laboratories
Makoto Morishita
jparacrawl-ml -a-