Title | : | Deployable AI: Solutions to Label Sparsity and Classification Model Selection |
Speaker | : | Sudarsun Santhiappan (IITM) |
Details | : | Mon, 18 Mar, 2024 10:30 AM @ SSB-233 |
Abstract | : | Deploying artificial intelligence (AI) for practical problem-solving comes with challenges that are well recognized by academic and industrial research communities worldwide. While the new generation of AI has generated excitement, laboratory successes have so far translated into applications only in narrow domains. These domains typically have modest expectations of reliability from AI systems, low costs of failure, and strong incentives for users to make AI systems succeed.
In this thesis, we provide an overview of the challenges in Deployable AI, which fall broadly into five sub-categories: Societal-centric, Organization-centric, Privacy & Trust-centric, System-centric, and Data-centric challenges. We scope our research to a subset of the data-centric challenges, namely label sparsity and model selection. We address label sparsity in two forms: (a) class imbalance, through directed sampling and boosting, and (b) label scarcity, through semi-supervised classification trees. We also develop a method for predicting the empirical classification complexity of a dataset and extend it to an automatic model selection method that maps dataset characteristics to empirical classification model fitness.
As the first problem, we address the challenge of classifying imbalanced binary datasets. Directed data sampling and data-level cost-sensitive methods use data point importance information to sample from the dataset such that the essential data points are retained and possibly oversampled. We propose a novel topic-modelling-based weighting framework that computes the importance of the data points in an imbalanced dataset from the topic-posterior probabilities estimated through topic modelling. We propose TODUS, a topics-oriented directed undersampling algorithm that follows the estimated data distribution to draw samples from the dataset, which aims to minimize the loss of important information during random undersampling. We also propose TOMBoost, a topic-modeled boosting scheme based on the weighting framework and tuned specifically for learning with class imbalance.
As the second problem, we address the label sparsity challenge with our novel semi-supervised classification tree algorithm. Data is naturally available mostly in unlabeled form, since labels are task-specific; label sparsity refers to the scarcity of labeled data. In a classification tree learning task, when the class ratio of the unlabeled part of the dataset is available, it becomes feasible to use the unlabeled data alongside the labeled data to train the tree in a semi-supervised style. This motivates us to use the abundantly available unlabeled data to help build classification trees. We propose a semi-supervised approach to growing classification trees, in which we apply maximum mean discrepancy (MMD) to estimate the class ratio at every node split.
As the third problem, we address the challenges of estimating a dataset's classification complexity and of selecting a suitable model class for building a classifier with the best empirical model fitness for a given dataset. We propose a prediction system that estimates the empirical classification complexity of a dataset for a given set of model classes by learning a discriminant function that associates the data characteristics with the classification complexity. We also propose a novel method for automated classification model selection from a set of candidate model classes, which determines the empirical model fitness for a dataset based only on its clustering indices. For each model class, we pose a regression task that maps a dataset's clustering indices to the expected classification performance. For a test dataset, we compute its clustering indices and predict the expected classification performance with the learned regressor of each candidate model class, and then recommend a suitable model class for classifying the dataset. |
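As a rough illustration of the MMD-based class-ratio estimation mentioned for the semi-supervised trees, the sketch below estimates the fraction of positive points in an unlabeled sample by matching kernel mean embeddings. It is a generic construction, not the thesis's algorithm: the RBF kernel, the `gamma` value, the synthetic two-blob data, and the names `rbf_kernel`, `mean_kernel`, `estimate_class_ratio`, and `sample_blob` are all assumptions made for this example. The idea is to pick the ratio theta in [0, 1] that minimizes the squared MMD between the mixture theta * P_pos + (1 - theta) * P_neg and the unlabeled sample, which has a closed-form solution for two classes.

```python
import math
import random


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma=0.5):
    """Average kernel value over all pairs: the inner product of the two
    empirical kernel mean embeddings."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def estimate_class_ratio(pos, neg, unlabeled, gamma=0.5):
    """Estimate the fraction of positives in `unlabeled` (hypothetical helper).

    Minimizes || theta*mu_pos + (1 - theta)*mu_neg - mu_unl ||^2 over theta.
    With a = mu_pos - mu_neg and b = mu_unl - mu_neg this is ||theta*a - b||^2,
    whose minimizer is <a, b> / <a, a>, clipped to [0, 1].
    """
    k_pp = mean_kernel(pos, pos, gamma)
    k_nn = mean_kernel(neg, neg, gamma)
    k_pn = mean_kernel(pos, neg, gamma)
    k_pu = mean_kernel(pos, unlabeled, gamma)
    k_nu = mean_kernel(neg, unlabeled, gamma)

    a_dot_b = k_pu - k_pn - k_nu + k_nn
    a_dot_a = k_pp - 2.0 * k_pn + k_nn
    if a_dot_a <= 0.0:
        # Degenerate case: the class embeddings coincide, so the ratio is
        # not identifiable; fall back to an uninformative guess.
        return 0.5
    return min(1.0, max(0.0, a_dot_b / a_dot_a))


def sample_blob(rng, center, n, sd=1.0):
    """Draw n points from an isotropic Gaussian around `center` (toy data)."""
    return [tuple(rng.gauss(c, sd) for c in center) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    pos = sample_blob(rng, (2.0, 2.0), 100)    # labeled positives
    neg = sample_blob(rng, (-2.0, -2.0), 100)  # labeled negatives
    # Unlabeled pool built with 30% positives; the labels are then discarded.
    unlabeled = sample_blob(rng, (2.0, 2.0), 60) + sample_blob(rng, (-2.0, -2.0), 140)
    print("estimated positive ratio:", estimate_class_ratio(pos, neg, unlabeled))
```

In a tree-growing setting, an estimate of this kind could in principle be computed for the unlabeled points that fall into each candidate child node, so that split quality reflects both labeled and unlabeled data; how the thesis actually integrates the estimate into the split criterion is not specified in the abstract.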