Description
High-dimensional Statistical Modeling Team (https://riken-yamada.github.io/)
Speaker 1 (15min): Makoto Yamada
Title: Introduction to High-dimensional Statistical Modeling Team
Abstract: In this talk, I will introduce the research activities of our team and our recent results, including high-dimensional nonlinear feature selection, optimal transport, and their applications to medical data.
Speaker 2 (25min): Benjamin Poignard (https://sites.google.com/site/poignardbenjamin)
Title: Sparse Hilbert-Schmidt Independence Criterion Regression
Abstract: Feature selection is a fundamental problem in statistics and machine learning and has been widely studied over the past decades. However, the majority of feature selection algorithms are based on linear models; by contrast, nonlinear feature selection problems have not been well studied, in particular in the high-dimensional setting. We propose sparse Hilbert-Schmidt Independence Criterion regression, a versatile nonlinear feature selection algorithm based on the Hilbert-Schmidt Independence Criterion (HSIC). It is a continuous optimization variant of the minimum redundancy maximum relevance feature selection algorithm. The proposed method combines two components to perform feature selection: an HSIC-based loss function and a regularization term, where we consider potentially non-convex penalty functions, yielding the sparse HSIC estimator. We derive large-sample results together with explicit error bounds for this estimator, and we provide conditions under which the support recovery property is satisfied. Through synthetic and real-world experiments, we illustrate these theoretical properties and show that the proposed algorithm performs well in the high-dimensional setting.
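To make the HSIC-based loss concrete, here is a minimal Python sketch of an HSIC-Lasso-style estimator: non-negative feature weights are fit by proximal gradient descent on a squared loss between centered Gram matrices, with a plain L1 penalty standing in for the (possibly non-convex) penalties discussed in the talk. The function names, kernel bandwidth, and step size are illustrative assumptions, not the speaker's implementation.

```python
import numpy as np

def centered_gram(x, sigma=1.0):
    """Centered Gaussian Gram matrix of a 1-D sample of length n."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-d2 / (2 * sigma ** 2))
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    return H @ K @ H

def sparse_hsic_weights(X, y, lam=0.1, lr=1e-3, n_iter=500):
    """Minimize 0.5*||Lbar - sum_k beta_k * Kbar_k||_F^2 + lam*sum(beta)
    over beta >= 0; the cross term <Kbar_k, Lbar> is (up to scaling)
    the empirical HSIC between feature k and the response."""
    n, d = X.shape
    Ks = np.stack([centered_gram(X[:, k]) for k in range(d)])  # (d, n, n)
    L = centered_gram(y)
    beta = np.zeros(d)
    for _ in range(n_iter):
        R = L - np.tensordot(beta, Ks, axes=1)                 # residual (n, n)
        grad = -np.tensordot(Ks, R, axes=([1, 2], [0, 1]))     # gradient (d,)
        beta = np.maximum(beta - lr * (grad + lam), 0.0)       # proximal step
    return beta
```

Features whose weights remain nonzero are selected; `lam` trades off sparsity against fit.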
Speaker 3 (25min): Dinesh Singh (https://sites.google.com/site/dineshsinghindian/)
Title: FsNet: Feature Selection Network on High-dimensional Biological Data
Abstract: Biological data, including gene expression data, are generally high-dimensional and require efficient, generalizable, and scalable machine-learning methods to discover their complex nonlinear patterns. The recent advances in machine learning can be attributed to deep neural networks (DNNs), which excel at various tasks in computer vision and natural language processing. However, standard DNNs are not appropriate for the high-dimensional datasets generated in biology because they have many parameters, which in turn require many samples. In this work, we propose a DNN-based nonlinear feature selection method, called the feature selection network (FsNet), for high-dimensional data with a small number of samples. Specifically, FsNet comprises a selection layer that selects features and a reconstruction layer that stabilizes training. Because the large number of parameters in the selection and reconstruction layers can easily lead to overfitting under a limited number of samples, we use two tiny networks to predict the large, virtual weight matrices of these layers. Experimental results on several real-world, high-dimensional biological datasets demonstrate the efficacy of the proposed method.
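As a rough sketch of the architecture described above (not the released FsNet code), the following PyTorch snippet shows how two tiny networks can emit the large, virtual weight matrices of the selection and reconstruction layers from fixed per-feature embeddings, keeping the trainable parameter count independent of the input dimension d. The class names, softmax temperature, and random embeddings are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TinyWeightPredictor(nn.Module):
    """Small net that emits a large 'virtual' (d x out_dim) weight matrix
    from fixed low-dimensional per-feature embeddings, so the number of
    trainable parameters does not grow with d."""
    def __init__(self, emb, out_dim, hidden=32):
        super().__init__()
        self.register_buffer("emb", emb)               # (d, e), frozen
        self.net = nn.Sequential(
            nn.Linear(emb.shape[1], hidden), nn.Tanh(),
            nn.Linear(hidden, out_dim))
    def forward(self):
        return self.net(self.emb)                      # (d, out_dim)

class FsNetSketch(nn.Module):
    """Selection layer picks k soft features; reconstruction layer maps
    them back to the d inputs to stabilize training."""
    def __init__(self, d, k, n_classes, emb_dim=16, tau=0.1):
        super().__init__()
        emb = torch.randn(d, emb_dim)                  # hypothetical embeddings
        self.sel = TinyWeightPredictor(emb, k)
        self.rec = TinyWeightPredictor(emb, k)
        self.head = nn.Linear(k, n_classes)
        self.tau = tau
    def forward(self, x):                              # x: (n, d)
        W = torch.softmax(self.sel() / self.tau, dim=0)  # near one-hot columns
        z = x @ W                                      # (n, k) selected features
        x_hat = z @ self.rec().t()                     # (n, d) reconstruction
        return self.head(z), x_hat
```

A training loss would then combine a prediction term on the logits with a reconstruction term such as the mean squared error between x_hat and x.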
Speaker 4 (25min): Tam Le (https://tamle-ml.github.io/)
Title: An Introduction to Tree-(Sliced)-Wasserstein Geometry
Abstract: Optimal transport (OT) theory defines a powerful set of tools to compare probability distributions. However, OT has a few computational and statistical drawbacks, which have encouraged the proposal of several regularized variants of OT in the recent literature. There are two notable approaches to reducing the time complexity of OT. (i) The first is to use regularization, e.g., an entropic penalty, to approximate solutions of OT, which yields a problem that can be solved using Sinkhorn iterations. (ii) The second is the sliced formulation, which exploits the closed-form formula between univariate distributions by projecting high-dimensional measures onto random lines. In our work, we follow the second direction by considering a more general family of ground metrics, namely tree metrics, which also yield fast closed-form computations, are negative definite, and include the sliced Wasserstein distance as a particular case (the tree is a chain). We name this approach tree-sliced Wasserstein (TSW). Exploiting the negative definiteness of TSW, we derive a positive definite kernel under TSW geometry and apply it to neural architecture search. Furthermore, we leverage tree metrics to scale up other OT problems for large-scale applications: (i) the Wasserstein barycenter problem, which finds the probability measure closest to m given ones; (ii) a variant of Gromov-Wasserstein for probability measures in different spaces; (iii) the entropy partial transport (EPT) problem for measures with different total masses. Notably, we propose a novel regularization for the dual formulation of EPT that admits a closed-form solution. To our knowledge, the proposed regularized EPT is the first among available variants of unbalanced OT for general positive measures to yield a closed-form solution.
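The closed forms that make the sliced and tree-sliced approaches fast are short enough to state in code. The Python sketch below (illustrative, not the speaker's library) gives the 1-D Wasserstein distance between equal-size samples, a Monte-Carlo sliced Wasserstein over random projections, and the closed-form TW_1 on a rooted tree, which reduces to the 1-D case when the tree is a chain.

```python
import numpy as np

def wasserstein_1d(a, b, p=1):
    """Closed-form W_p between two equal-size 1-D empirical measures."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)) ** p) ** (1.0 / p)

def sliced_wasserstein(X, Y, n_proj=100, p=1, seed=0):
    """Average 1-D W_p of X and Y projected onto random unit directions."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(X.shape[1])
        theta /= np.linalg.norm(theta)
        total += wasserstein_1d(X @ theta, Y @ theta, p)
    return total / n_proj

def tree_wasserstein(parent, edge_w, mu, nu):
    """Closed-form TW_1 for node masses mu, nu on a rooted tree:
    sum over edges of edge weight times |net mass in the subtree below|.
    parent[i] < i is the parent of node i (root 0 has parent -1);
    edge_w[i] weights the edge (i, parent[i])."""
    sub = (mu - nu).astype(float)
    for i in range(len(parent) - 1, 0, -1):        # bottom-up accumulation
        sub[parent[i]] += sub[i]
    return sum(edge_w[i] * abs(sub[i]) for i in range(1, len(parent)))
```

Both variants thus avoid solving a linear program: each evaluation costs a sort (1-D case) or a single bottom-up pass over the tree.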
Speaker 5 (25min): Hector Climente Gonzalez (https://hclimente.eu/)
Title: Biological networks and GWAS: comparing and combining network methods to understand the genetics of familial breast cancer susceptibility
Abstract: Genome-wide association studies (GWAS) scan thousands of genomes to identify variants associated with a complex trait. Over the last 15 years, GWAS have advanced our understanding of the genetics of complex diseases, and in particular of hereditary cancers. However, they have led to an apparent paradox: the more we perform such studies, the more it seems that the entire genome is involved in every disease. The omnigenic model offers an appealing explanation: only a limited number of core genes are directly involved in the disease, but gene functions are deeply interrelated, so many other genes can alter the function of the core genes. These interrelations are often modeled as networks, and multiple algorithms have been proposed to use these networks to identify the subset of core genes involved in a specific trait. This study applies and compares six such network methods on GENESIS, a GWAS dataset for familial breast cancer in the French population. Combining these approaches allows us to identify potentially novel breast cancer susceptibility genes and provides a mechanistic explanation for their role in the development of the disease. We provide ready-to-use implementations of all the examined methods.
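Since the abstract does not name the six methods, the following Python sketch shows one representative family of such network approaches, network propagation (random walk with restart): GWAS association scores are smoothed over a gene-gene network so that genes embedded in high-scoring neighborhoods surface as candidate core genes. This is a generic illustration, not necessarily one of the six methods compared in the study, and the function and parameter names are hypothetical.

```python
import numpy as np

def network_propagation(A, scores, alpha=0.5, n_iter=100, tol=1e-8):
    """Random walk with restart: f <- alpha * P @ f + (1 - alpha) * scores,
    where P is the column-normalized adjacency matrix. The steady state
    ranks genes by association signal smoothed over the network."""
    deg = A.sum(axis=0)
    P = A / np.where(deg > 0, deg, 1.0)        # column-normalize
    f = scores.astype(float).copy()
    for _ in range(n_iter):
        f_new = alpha * P @ f + (1 - alpha) * scores
        if np.abs(f_new - f).sum() < tol:      # converged
            break
        f = f_new
    return f
```

Here `scores` could be, e.g., gene-level -log10 p-values from the GWAS, and `A` the adjacency matrix of a protein-protein interaction network.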