We develop machine learning algorithms and large-scale semantic databases for accurate language analysis. The goal is to support advanced text applications that require an understanding of language, such as information extraction, language translation, summarization, and text classification.
With the growth of the Internet, huge amounts of text, such as blogs and SNS posts, are readily available. This text is written by a wide range of people on a variety of topics and varies in quality. Demand is increasing for text analysis technology that can process such vast amounts of text data for business and service applications, including sentiment analysis and information retrieval.
Globalization has increased opportunities for ordinary citizens to access first-hand, up-to-date information written in foreign languages and to communicate with foreigners in person. This has revived the demand for machine translation.
However, the casually and often hastily written colloquial text found on the Internet and the spontaneous spoken language used in human conversation contain lexical and grammatical errors. In addition, such text often leaves out information that can be inferred from the context, which makes it much harder to analyze automatically.
We are approaching the problem by building a semantic knowledge database based on large amounts of text and devising a sophisticated machine learning algorithm to implement accurate language analysis technology for text applications requiring language understanding.
Knowing who did what to whom (part of the 5W1H: who, what, when, where, why, and how) is important for understanding the state or action expressed by a sentence. We have developed a learning method that obtains, from a large amount of annotated training data, a set of rules for determining the subject and object of a verb. Even when there is no direct dependency between a predicate and its argument, or when the argument is omitted, the method can determine the predicate-argument relation from the context. The technology can be used for sentiment analysis and information retrieval.
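As a rough illustration of what learning subject and object rules from annotated data can look like, here is a toy Python sketch. It is not the method described above. It simply counts which role each feature most often carries in the training pairs and keeps the majority role as a rule. The feature design (a case particle paired with where the candidate sits relative to the predicate), the training pairs, the romanized example sentence, and the function names `learn_rules` and `assign_arguments` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def learn_rules(examples):
    """Learn one rule per feature: the role it most often has in the data.

    examples: iterable of (feature, role) pairs from annotated text.
    """
    counts = defaultdict(Counter)
    for feature, role in examples:
        counts[feature][role] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}

def assign_arguments(candidates, rules):
    """Fill each role with the first candidate whose feature maps to it.

    candidates: list of (word, feature), ordered by preference, for example
    words in the predicate's own sentence before words from earlier context.
    """
    result = {}
    for word, feature in candidates:
        role = rules.get(feature)
        if role is not None and role not in result:
            result[role] = word
    return result

# Hypothetical annotated pairs. A feature is (case particle, location), where
# location is "dep" (direct dependency on the predicate) or "prev" (the
# candidate appears only in the preceding sentence).
TRAIN = [
    (("ga", "dep"), "subject"),
    (("wo", "dep"), "object"),
    (("wa", "dep"), "subject"),
    (("wa", "prev"), "subject"),
    (("wa", "prev"), "subject"),
    (("wa", "prev"), "object"),
    (("wo", "prev"), "object"),
]

rules = learn_rules(TRAIN)

# "Taro wa hon wo katta. Sugu yonda." (Taro bought a book. Read [it] soon.)
# The predicate "yonda" has no overt arguments, so every candidate comes
# from the preceding sentence.
candidates = [("Taro", ("wa", "prev")), ("hon", ("wo", "prev"))]
result = assign_arguments(candidates, rules)
print(result)
```

In this sketch, both arguments of the second predicate are omitted and are recovered from the preceding sentence, which mirrors the case where the relation has to be determined from the context rather than from a dependency link. A practical system would use far richer features and a stronger learner than majority counting.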
We have developed various semantic databases for advanced language analysis. The following have been published in book form: "Nihongo Goi-Taikei" (thesaurus), "Nihongo no Goi-Tokusei" (psycholinguistic database), and "Kihongo Database" (semantic database).