A Japanese Word Familiarity Database

Summary

Word familiarity is a measure of how familiar a word feels to people, obtained through experiments with subjects. It is expressed as a number between 1 and 7; the higher the number, the more familiar the word.
Starting in 1995, NTT surveyed nearly 80,000 words for word familiarity and other properties. In 1999, Sanseido published the compiled results as the first volume in the NTT Database Series [Nihongo-no Goitokusei: Lexical Properties of Japanese].
In 2002, NTT conducted an additional survey of about 30,000 words that were not included in Volume 1, and the results were published as Volume 9 of the same series. (Both volumes are now out of print and are hereinafter referred to as the "Heisei Edition".)
These volumes were widely used. However, time has passed since the initial surveys, word familiarity can change over time, and many words were never included in these databases. We therefore resurveyed all the words in Volume 1 and Volume 9 and surveyed additional new words. The results, covering more than 160,000 words in total, have been compiled as the "Reiwa Edition Japanese word familiarity database".

Vocabulary Size Estimation Tests

One thing that can be done with the Japanese word familiarity database is "vocabulary size estimation". With vocabulary size estimation, you can estimate the size of the subject's vocabulary simply by checking whether they know a few dozen words presented to them.
Words with high familiarity can be considered words that many people know, and words with low familiarity can be considered words that many people do not know. Vocabulary size estimation presents subjects with words at several different levels of familiarity, from high to low, and checks whether the subject knows each one. The size of the subject's vocabulary is then estimated from the familiarity levels of the words they know.
Because the estimate requires checking only a small number of words, the burden on the subject is small and the size of their vocabulary is easy to estimate.
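The idea described above can be illustrated with a minimal Python sketch. This is not the procedure used in the NTT test, which is not specified here; it is one simple scheme chosen for illustration. It assumes the database is a mapping from word to familiarity score, ranks the words from most to least familiar, splits the ranking into equal bands, tests one word per band, and counts a whole band as known if its test word is known. The function names `make_test` and `estimate_vocabulary_size`, the synthetic word list, and the simulated subject are all hypothetical.

```python
def make_test(familiarity_db, n_items):
    """Pick one probe word per familiarity band.

    familiarity_db: dict mapping word -> familiarity score (1 to 7).
    Returns a list of (probe_word, band_size) pairs, ordered from the
    most familiar band to the least familiar band.
    """
    if not 0 < n_items <= len(familiarity_db):
        raise ValueError("n_items must be between 1 and the database size")
    # Rank words from most familiar to least familiar.
    ranked = sorted(familiarity_db, key=familiarity_db.get, reverse=True)
    n = len(ranked)
    bands = []
    for i in range(n_items):
        start = i * n // n_items
        end = (i + 1) * n // n_items
        probe = ranked[(start + end) // 2]  # word near the middle of the band
        bands.append((probe, end - start))
    return bands


def estimate_vocabulary_size(bands, knows_word):
    """Sum the sizes of all bands whose probe word the subject knows.

    knows_word: callable taking a word and returning True or False.
    """
    return sum(size for probe, size in bands if knows_word(probe))


# Synthetic database (an assumption for illustration, not real data):
# 1,000 placeholder words whose familiarity falls linearly from 7 to 1.
db = {f"w{i:04d}": 7 - 6 * i / 999 for i in range(1000)}

# Simulated subject who knows exactly the words with familiarity >= 4.
def subject(word):
    return db[word] >= 4.0

test_items = make_test(db, n_items=20)   # only 20 words are shown
estimate = estimate_vocabulary_size(test_items, subject)
print(estimate)
```

In this sketch the estimate can never exceed the number of words in the database, since a subject who knows every probe word is simply credited with every band.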
NTT has released a vocabulary size estimation test based on the Heisei Edition, and this test has been used by many people. However, the upper limit on the size of vocabulary that can be estimated depends on the size of the Japanese word familiarity database, so the Heisei Edition could not be used to estimate vocabularies of more than 77,000 words.
We have now created and published a new vocabulary size estimation test based on the Reiwa Edition of the Japanese word familiarity database (available since June 4, 2020). By expanding the database on which the test is based, we have significantly raised the upper limit on the size of vocabularies that can be estimated compared with the Heisei Edition.

Trial of Vocabulary Estimation Test (Public Version)

Open the link to try the public version of this test.
(The test cannot be used in Internet Explorer (IE). Please use a browser other than IE.)

The estimated vocabulary size may differ significantly between the Heisei Edition and the Reiwa Edition. Generally, the Reiwa Edition will give a larger estimate. This is because the Japanese word familiarity database used as a basis for the test is larger in the Reiwa Edition.

References

Please see below for the Reiwa Edition of the word familiarity survey.

  • Sanae Fujita, Tetsuo Kobayashi (2020)
    "Resurvey of Word Familiarity and Comparison with Past Data", The Association for Natural Language Processing 26th Annual Meeting (NLP-2020)

Please see below for the vocabulary size survey of elementary, junior high, and senior high school students.

  • Sanae Fujita, Tetsuo Kobayashi, Takeshi Yamada, Shingo Sugawara, Niwako Arai, Noriko Arai (2020)
    "Vocabulary Survey of Elementary, Junior High, and Senior High School Students and Analysis of Relationship to Word Familiarity", The Association for Natural Language Processing 26th Annual Meeting (NLP-2020)