Exhibition Program

Science of Communication and Computation

09

Tuning machine translation with small tuning data

Domain adaptation with JParaCrawl, a large parallel corpus

Abstract

Recent machine translation models rely heavily on parallel corpora. However, because large parallel corpora exist for only a handful of language pairs, only resource-rich pairs can benefit from them.
To address this for English-Japanese, a language pair for which publicly available parallel data is still limited, we constructed a new parallel corpus by broadly crawling the web and automatically aligning parallel sentences.
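JParaCrawl's actual pipeline (described in the reference below) performs this alignment at web scale; as a rough illustration of the automatic-alignment idea only, the minimal Python sketch below scores candidate English-Japanese sentence pairs by bilingual-dictionary overlap and matches them greedily. The dictionary, sentences, and threshold are illustrative assumptions, not the exhibit's method.

# Minimal sentence-alignment sketch (illustrative; not JParaCrawl's pipeline).

def overlap_score(en_tokens, ja_sent, dictionary):
    """Fraction of English tokens whose dictionary translation
    appears as a substring of the candidate Japanese sentence."""
    if not en_tokens:
        return 0.0
    hits = sum(
        1 for tok in en_tokens
        if any(ja in ja_sent for ja in dictionary.get(tok.lower(), []))
    )
    return hits / len(en_tokens)

def align(en_sents, ja_sents, dictionary, threshold=0.3):
    """Greedily pair each English sentence with its best-scoring,
    not-yet-used Japanese sentence, keeping pairs above threshold."""
    pairs, used = [], set()
    for en in en_sents:
        tokens = en.split()
        best_j, best_score = None, threshold
        for j, ja in enumerate(ja_sents):
            if j in used:
                continue
            score = overlap_score(tokens, ja, dictionary)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((en, ja_sents[best_j], best_score))
    return pairs

# Toy bilingual dictionary and crawled sentences (hypothetical data).
dictionary = {"cat": ["猫"], "sat": ["座った"], "dog": ["犬"], "runs": ["走る"]}
en_sents = ["The cat sat", "The dog runs"]
ja_sents = ["犬は走る", "猫は座った"]
for en, ja, score in align(en_sents, ja_sents, dictionary):
    print(f"{score:.2f}\t{en}\t{ja}")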
The resulting corpus, called JParaCrawl, contains over 10 million sentence pairs and is now freely available online for research purposes.
We show that a neural machine translation model trained on JParaCrawl serves as a strong pre-trained model for fine-tuning to specific domains, achieving good performance even when in-domain data is limited.
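To illustrate the fine-tuning step, the sketch below continues training a publicly available pre-trained translation model on a few in-domain sentence pairs with a small learning rate. It uses the Hugging Face model Helsinki-NLP/opus-mt-ja-en purely as a stand-in for a JParaCrawl-pretrained model; the model choice, data, and hyperparameters are illustrative assumptions, not the exhibit's actual recipe.

# Minimal domain-adaptation sketch (illustrative; not the exhibit's recipe).
# Requires: torch, transformers (>= 4.22 for text_target), sentencepiece.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-ja-en"  # stand-in for a JParaCrawl-pretrained model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tiny in-domain parallel data (hypothetical examples).
ja = ["この薬は食後に服用してください。"]
en = ["Take this medicine after meals."]

# Tokenize source and target; "labels" come from text_target.
batch = tokenizer(ja, text_target=en, return_tensors="pt", padding=True)

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR for fine-tuning
for step in range(3):  # a few steps on tiny data, just to show the loop
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss {loss.item():.3f}")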

References

  1. M. Morishita, J. Suzuki, and M. Nagata, “JParaCrawl: A large scale web-based English-Japanese parallel corpus,” in Proc. 12th International Conference on Language Resources and Evaluation (LREC), 2020.

Contact

Makoto Morishita / Linguistic Intelligence Research Group, Innovative Communication Laboratory