Basic research on machine learning

We focus on a wide variety of data (big data) and develop innovative big-data analysis technology that creates new value by extracting important latent information hidden in the data.
Some basic research topics in our laboratory are as follows:

Semi-supervised learning for maximizing partial AUC

The partial area under a receiver operating characteristic curve (pAUC) is a performance measure for binary classification problems that summarizes the true positive rate over a specific range of the false positive rate. Obtaining classifiers that achieve a high pAUC is important in a wide variety of applications, such as anomaly detection and medical diagnosis. Although many methods have been proposed for maximizing the pAUC, existing methods require a large amount of labeled data for training. We propose a semi-supervised learning method for maximizing the pAUC, which trains a classifier with a small amount of labeled data and a large amount of unlabeled data. To exploit the unlabeled data, we derive two approximations of the pAUC: the first is calculated from positive and unlabeled data, and the second is calculated from negative and unlabeled data. A classifier is trained by maximizing the weighted sum of the two approximations of the pAUC and the pAUC that is calculated from positive and negative data.
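To make the measure concrete, here is a minimal sketch (not the proposed semi-supervised method) of computing the pAUC from labeled scores: the ROC curve is integrated only up to a false-positive-rate cutoff and normalized by that cutoff. The function name and the cutoff value `alpha` are illustrative.

```python
import numpy as np

def partial_auc(y_true, scores, alpha=0.1):
    """Area under the ROC curve for FPR in [0, alpha], normalized by alpha."""
    y = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores))    # sort by descending score
    y = y[order]
    tpr = np.cumsum(y) / y.sum()               # true positive rate per threshold
    fpr = np.cumsum(1 - y) / (1 - y).sum()     # false positive rate per threshold
    fpr = np.concatenate(([0.0], fpr))
    tpr = np.concatenate(([0.0], tpr))
    keep = fpr <= alpha
    # cut the curve exactly at fpr = alpha by linear interpolation
    f = np.concatenate((fpr[keep], [alpha]))
    t = np.concatenate((tpr[keep], [np.interp(alpha, fpr, tpr)]))
    # trapezoidal integration, normalized so a perfect ranking gives 1.0
    return np.sum((f[1:] - f[:-1]) * (t[1:] + t[:-1]) / 2) / alpha

# a classifier that ranks all positives above all negatives has pAUC = 1
print(partial_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1], alpha=0.5))  # → 1.0
```

The semi-supervised method described above replaces such fully labeled estimates with approximations computed from positive-plus-unlabeled and negative-plus-unlabeled data.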
Open House 2020 exhibition 05

Tensor factorization for spatio-temporal data analysis

Analysis of spatio-temporal data is a common research topic that requires the interpolation of unobserved locations and the prediction of future observations by utilizing information about where and when the data were observed. One of the most difficult problems is to make future predictions for unobserved locations. Tensor factorization methods are popular in this field because of their capability of handling multiple types of spatio-temporal data, dealing with missing values, and providing computationally efficient parameter estimation procedures. We propose a new tensor factorization method that estimates low-rank latent factors by simultaneously learning the spatial and temporal correlations. We introduce new spatial autoregressive regularizers based on existing spatial autoregressive models and provide an efficient estimation procedure.
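As a rough illustration of the general idea (plain CP factorization with missing entries, not the proposed method), the sketch below fits low-rank factors to a location × time × feature tensor by gradient descent; a simple temporal-smoothness penalty stands in for the spatial autoregressive regularizers described above. All sizes, names, and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, F, R = 8, 20, 3, 2            # locations, time steps, features, rank

# synthetic low-rank spatio-temporal tensor with ~30% of entries missing
U, V, W = (rng.normal(size=(n, R)) for n in (L, T, F))
X = np.einsum('lr,tr,fr->ltf', U, V, W)
mask = rng.random(X.shape) > 0.3    # True where observed

A, B, C = (0.1 * rng.normal(size=(n, R)) for n in (L, T, F))
step, lam, n_obs = 0.5, 0.1, mask.sum()

def masked_mse():
    Xhat = np.einsum('lr,tr,fr->ltf', A, B, C)
    return ((mask * (Xhat - X)) ** 2).sum() / n_obs

mse0 = masked_mse()
for _ in range(2000):
    E = mask * (np.einsum('lr,tr,fr->ltf', A, B, C) - X)  # error on observed entries
    gA = np.einsum('ltf,tr,fr->lr', E, B, C) / n_obs
    gB = np.einsum('ltf,lr,fr->tr', E, A, C) / n_obs
    gC = np.einsum('ltf,lr,tr->fr', E, A, B) / n_obs
    diff = B[1:] - B[:-1]           # smoothness penalty on the time factor
    gB[1:] += lam * diff / T
    gB[:-1] -= lam * diff / T
    A, B, C = A - step * gA, B - step * gB, C - step * gC

print(masked_mse() < mse0)          # reconstruction error on observed entries drops
```

Because the factorization is low-rank, the learned factors also fill in the masked (unobserved) entries, which is what enables interpolation and prediction in the spatio-temporal setting.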
Open House 2019 exhibition 06

MOFM: low-rank regression for learning common factors

Multi-output Factorization Machines (MOFM) are an extension of Convex Factorization Machines that can learn the models of several tasks simultaneously. MOFM can find combinations of factors that are predictive across tasks. MOFM decompose the potentially very large weight matrix associated with each task using a small number of common basis vectors. Hence, MOFM are able to scale to very high-dimensional data. In addition, we propose a convex formulation for learning this decomposition with an optimality guarantee. MOFM find applications in numerous real-world problems, including medical diagnosis, recommender systems and genomic selection of plants. In future work, we plan to further study the theoretical properties of MOFM.
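The shared-basis idea can be sketched as follows: each task's weight vector is expressed as `B @ c_t` with a basis `B` common to all tasks, fitted here by plain alternating least squares on synthetic data. This is only an illustration of the decomposition; it is not the convex MOFM formulation, and all names and sizes are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_tasks, n = 30, 3, 5, 200   # features, basis size, tasks, samples per task

# synthetic tasks whose weight vectors share a common basis: w_t = B_true @ c_t
B_true = rng.normal(size=(d, k))
C_true = rng.normal(size=(k, n_tasks))
Xs = [rng.normal(size=(n, d)) for _ in range(n_tasks)]
ys = [Xs[t] @ B_true @ C_true[:, t] for t in range(n_tasks)]

B = rng.normal(size=(d, k))        # shared basis: d*k parameters, not d*n_tasks
C = np.zeros((k, n_tasks))
for _ in range(30):                # alternating least squares
    for t in range(n_tasks):       # task codes, basis fixed
        C[:, t] = np.linalg.lstsq(Xs[t] @ B, ys[t], rcond=None)[0]
    # basis, codes fixed: X_t @ B @ c_t == (c_t^T kron X_t) @ vec(B)
    M = np.vstack([np.kron(C[:, t].reshape(1, -1), Xs[t]) for t in range(n_tasks)])
    vecB = np.linalg.lstsq(M, np.concatenate(ys), rcond=None)[0]
    B = vecB.reshape(d, k, order='F')

resid = sum(np.linalg.norm(Xs[t] @ B @ C[:, t] - ys[t]) for t in range(n_tasks))
```

Only `d * k + k * n_tasks` parameters are learned instead of `d * n_tasks`, which is what allows scaling to very high-dimensional data when `k` is small.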
Open House 2018 exhibition 02

Knowledge discovery based on probabilistic latent variable models

With the rapid growth of the Internet and sensors, we can easily obtain and accumulate a huge amount of data. Automatic discovery of useful knowledge from data has therefore become an important challenge in big data analysis. In this talk, I am going to explain a generative model approach that can automatically find intrinsic latent features in the given data. Then, I will provide guidelines for modeling data by introducing specific models for some applications, such as topic extraction and object matching.
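As one concrete instance of this approach, topic extraction can be done with latent Dirichlet allocation, a standard probabilistic latent variable model; the toy corpus and topic count below are illustrative, using scikit-learn's implementation rather than any model specific to this talk.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "neural network training deep learning",
    "deep learning model training data",
    "stock market price trading finance",
    "finance market investment price trading",
]
X = CountVectorizer().fit_transform(docs)  # bag-of-words count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)                   # per-document topic mixtures
# each row of theta is a probability distribution over the 2 latent topics,
# i.e., the intrinsic latent features discovered from the raw text
```

The same generative-model recipe (latent variables plus an observation model) carries over to the other applications mentioned, such as object matching.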
Open House 2017 Research Talk

CFM: low-rank regression with global optimality guarantees

Convex Factorization Machines (CFM) is a high-accuracy regression model that can handle a large number of feature combinations. CFM is general-purpose and can be applied to a wide range of tasks: e.g., house price prediction, recommender systems and genome analysis. The proposed method can handle a large number of feature combinations by using a low-rank constraint. Moreover, it is guaranteed to obtain a global optimum. In future work, to further improve predictive accuracy, we plan to support higher-order feature combinations. Besides recommender systems, applications include predicting combinations of genes that are responsible for diseases, which would help find effective cures.
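The model form behind factorization machines can be sketched as a linear term plus a low-rank matrix `Z` of pairwise feature-combination weights. Note that the explicit factorization `Z = U @ U.T` below is the classic non-convex parameterization; CFM instead keeps the problem convex (hence the global-optimum guarantee) by constraining `Z` directly, e.g. via a nuclear-norm penalty. Names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 6, 2                  # number of features, rank of the interaction matrix
w = rng.normal(size=d)       # linear weights
U = rng.normal(size=(d, r))
Z = U @ U.T                  # rank-r: d*r parameters instead of d*d

def predict(x):
    # linear term plus every pairwise feature combination x_i * x_j * Z_ij
    return w @ x + x @ Z @ x

# sanity check: the compact form matches the explicit sum over all pairs
x = rng.normal(size=d)
brute = w @ x + sum(Z[i, j] * x[i] * x[j] for i in range(d) for j in range(d))
print(np.isclose(predict(x), brute))  # → True
```

Because `Z` has rank r, the model covers all d² feature pairs while estimating only d·r interaction parameters, which is how a large number of combinations stays tractable.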
Open House 2016 exhibition 01