Aliro: AI-Driven Data Science

Aliro is an easy-to-use data science assistant. It allows researchers without machine learning or coding expertise to run supervised machine learning analyses through a clean web interface. It provides results visualization and reproducible scripts so that the analysis can be taken anywhere. It also has an AI assistant that can choose the analysis to run for you. Dataset profiles are generated and added to a knowledgebase as experiments are run, and the AI assistant learns from this knowledgebase to give more informed recommendations as it is used. Aliro ships with an initial knowledgebase generated from the PMLB benchmark suite.


View Resource

Deep Learning for Toxicology (DTox)

In drug development, a major reason for attrition is the lack of understanding of cellular mechanisms governing drug toxicity. The black-box nature of conventional classification models has limited their utility in identifying toxicity pathways. Here we developed DTox (Deep learning for Toxicology), an interpretation framework for knowledge-guided neural networks, which can predict compound response to toxicity assays and infer toxicity pathways of individual compounds. We demonstrate that DTox can achieve the same level of predictive performance as conventional models with a significant improvement in interpretability. Using DTox, we were able to rediscover mechanisms of transcription activation by three nuclear receptors, recapitulate cellular activities induced by aromatase inhibitors and PXR agonists, and differentiate distinctive mechanisms leading to HepG2 cytotoxicity. Virtual screening by DTox revealed that compounds with predicted cytotoxicity are at higher risk for clinical hepatic phenotypes. In summary, DTox provides a framework for deciphering cellular mechanisms of toxicity in silico.
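The core architectural idea, a network whose hidden units are wired according to a biological ontology, can be sketched briefly. The snippet below is a minimal illustration of the general knowledge-guided technique, not DTox's actual code: the MaskedLinear class, the toy feature-to-pathway mask, and all shapes are invented for the example (PyTorch assumed).

    import torch
    import torch.nn as nn

    class MaskedLinear(nn.Linear):
        """Linear layer whose connectivity is restricted by a fixed binary
        mask (e.g., feature-to-pathway annotations), so that each hidden
        unit can be read as a named biological entity."""
        def __init__(self, in_features, out_features, mask):
            super().__init__(in_features, out_features)
            self.register_buffer("mask", mask)  # shape: (out_features, in_features)

        def forward(self, x):
            return nn.functional.linear(x, self.weight * self.mask, self.bias)

    # Hypothetical toy ontology: 6 assay features feed 2 "pathway" units.
    mask = torch.tensor([[1., 1., 1., 0., 0., 0.],
                         [0., 0., 0., 1., 1., 1.]])
    model = nn.Sequential(MaskedLinear(6, 2, mask), nn.ReLU(), nn.Linear(2, 1))
    print(model(torch.randn(4, 6)).shape)  # torch.Size([4, 1])

Because each masked unit corresponds to a named pathway, its activation can be inspected directly, which is what makes this style of model more interpretable than a fully connected black box.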


View Resource

Extended Supervised Tracking and Classifying System (scikit-ExSTraCS)

The scikit-ExSTraCS package includes a scikit-learn-compatible Python implementation of ExSTraCS 2.0. ExSTraCS 2.0, or Extended Supervised Tracking and Classifying System, implements the core components of a Michigan-style learning classifier system (where the genetic algorithm operates at the rule level, evolving a population of rules, each with its own parameters) in an easy-to-understand way while remaining highly effective at solving ML problems. It allows the incorporation of expert knowledge in the form of attribute weights, and adds attribute tracking, rule compaction, and a rule specificity limit, which make it particularly adept at solving highly complex problems. In general, learning classifier systems (LCSs) are a class of rule-based machine learning algorithms that have been shown to perform well on problems involving high degrees of heterogeneity and epistasis. Well-designed LCSs are also highly human interpretable. LCS variants have been shown to adeptly handle supervised and reinforcement learning, classification and regression, and online and offline learning problems, as well as missing or unbalanced data. This versatility and interpretability give LCSs a wide range of potential applications, notably in biomedicine.
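A minimal usage sketch, assuming the scikit-learn-style interface described in the project documentation (the import path and the learning_iterations parameter follow the docs, but verify against the current release; the toy data is invented):

    import numpy as np
    from skExSTraCS import ExSTraCS  # pip install scikit-ExSTraCS

    # Toy discrete-featured binary classification problem.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(200, 10)).astype(float)
    y = (X[:, 0] > 0).astype(int)

    model = ExSTraCS(learning_iterations=5000)  # number of learning cycles
    model.fit(X, y)
    print(model.score(X, y))  # sklearn-style training accuracy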


View Resource

Genetic Architecture Model Emulator for Testing and Evaluating Software (GAMETES)

GAMETES is an algorithm for generating complex single nucleotide polymorphism (SNP) models for simulated association studies. It is designed to generate epistatic models that we refer to as pure and strict. These models constitute the worst case for detecting disease associations, since an association can only be observed if all n loci are included in the disease model. The user-friendly GAMETES software rapidly and precisely generates epistatic multi-locus models and, using these models, can also generate simulated datasets exhibiting epistasis. Version 2.2 adds the ability to generate heterogeneous datasets by applying multiple independent models to different subsets of the simulated data. Additional features include the facility to create additive datasets by applying multiple independent models to the entire dataset, as well as functionality for designing continuous endpoints. We have also added a custom model generation feature, so that users may directly specify and examine the properties of any 2- or 3-locus SNP model; simple Mendelian models may also be generated with this feature.
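To make "pure and strict" concrete: in the classic two-locus XOR-style model below, neither SNP has any marginal effect, so the association is invisible unless both loci are modeled together. This is a minimal illustration of the concept in Python, not GAMETES output; the penetrance values are arbitrary.

    import numpy as np

    # Penetrance table for a pure, strict 2-locus epistatic model.
    # Rows: genotypes AA, Aa, aa at SNP A; columns: BB, Bb, bb at SNP B.
    penetrance = np.array([[0.0, 0.1, 0.0],
                           [0.1, 0.0, 0.1],
                           [0.0, 0.1, 0.0]])

    # Hardy-Weinberg genotype frequencies for allele frequency p = 0.5.
    gf = np.array([0.25, 0.5, 0.25])

    # Marginal penetrance at SNP A, averaging over SNP B: every genotype
    # carries identical risk, so neither SNP shows a single-locus effect.
    print(penetrance @ gf)  # [0.05 0.05 0.05]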


View Resource

Multifactor Dimensionality Reduction (scikit-MDR)

scikit-MDR is a scikit-learn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction. This project is still under active development, and we encourage you to check back on this repository regularly for updates. MDR is an effective feature construction algorithm that is capable of modeling higher-order interactions and capturing complex patterns in datasets. MDR currently only works with categorical features and supports both binary classification and regression problems. We are working on expanding the algorithm to cover more problem types and to provide more convenience features.
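A short usage sketch, assuming the sklearn-style fit/transform interface shown in the project README (the import path follows the README; the XOR-style toy data is invented for the example):

    import numpy as np
    from mdr import MDR  # pip install scikit-mdr

    # Toy data: two categorical features whose combination (but neither
    # one alone) determines the class label.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(300, 2))
    y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)

    mdr = MDR()
    mdr.fit(X, y)
    X_new = mdr.transform(X)  # one constructed feature encoding the interaction
    print(X_new[:5].ravel())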


View Resource

Relief-Based Algorithm Training Environment (ReBATE)

This package includes a scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for machine learning. These Relief-based algorithms (RBAs) are designed for feature weighting and selection as part of a supervised machine learning pipeline. Presently this includes the following core RBAs: ReliefF, SURF, SURF*, MultiSURF*, and MultiSURF. Additionally, implementations of the iterative TuRF mechanism and VLSRelief are included. The package is still under active development, and we encourage you to check back on this repository regularly for updates. Unlike most currently available filter-based feature selection methods, these algorithms offer a computationally efficient way to perform feature selection that is sensitive to feature interactions as well as to simple univariate associations. The main benefit of Relief algorithms is that they identify feature interactions without having to exhaustively check every pairwise interaction, thus taking significantly less time than an exhaustive pairwise search.
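A minimal usage sketch following the scikit-learn conventions the package documents (class and parameter names are from the skrebate docs; the toy data is invented):

    import numpy as np
    from skrebate import ReliefF  # pip install skrebate

    # Toy data: 100 samples, 10 features; only feature 0 drives the label.
    rng = np.random.default_rng(0)
    X = rng.random((100, 10))
    y = (X[:, 0] > 0.5).astype(int)

    fs = ReliefF(n_features_to_select=2, n_neighbors=10)
    fs.fit(X, y)
    print(fs.feature_importances_)  # per-feature Relief scores
    X_reduced = fs.transform(X)     # keeps only the top-ranked features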


View Resource

Semi-automated Term Harmonization Pipeline

This repository includes a set of Python-based Jupyter notebooks that comprise a semi-automated term harmonization pipeline, applied here to harmonize medical history terms across 28 clinical trials of pulmonary arterial hypertension. These notebooks pair with the paper ‘A Semi-Automated Term Harmonization Pipeline Applied to Pulmonary Arterial Hypertension Clinical Trials’. The repository offers an overview of the pipeline and provides guidance for users on how to adapt the notebooks to their own target harmonization tasks.
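As a generic illustration of the semi-automated idea (not the repository's actual method; the reference vocabulary and raw terms below are hypothetical), fuzzy string matching can propose a harmonized term and flag low-confidence cases for manual review:

    from difflib import get_close_matches

    # Hypothetical reference vocabulary and raw clinical-trial terms.
    reference = ["hypertension", "diabetes mellitus", "atrial fibrillation"]
    raw_terms = ["Hypertention", "diabetes melitus", "a-fib"]

    for term in raw_terms:
        # Propose the closest vocabulary entry; anything below the
        # similarity cutoff is routed to a human reviewer.
        match = get_close_matches(term.lower(), reference, n=1, cutoff=0.8)
        print(f"{term!r} -> {match[0] if match else 'needs manual review'}")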


View Resource

Simple, Transparent, End-to-end Automated Machine Learning Pipeline (STREAMLINE)

STREAMLINE is an end-to-end automated machine learning (AutoML) pipeline that empowers anyone to easily run, interpret, and apply a rigorous and customizable analysis for data mining or predictive modeling. Notably, this tool is currently limited to supervised learning on tabular, binary classification data but will be expanded as our development continues. The development of this pipeline focused on (1) overall automation, (2) avoiding and detecting sources of bias, (3) optimizing modeling performance, (4) ensuring complete reproducibility (under certain STREAMLINE parameter settings), (5) capturing complex associations in data (e.g. feature interactions), and (6) enhancing interpretability of output. Overall, the goal of this pipeline is to provide a transparent framework to learn from data as well as identify the strengths and weaknesses of ML modeling algorithms or other AutoML algorithms.


View Resource

Tree-based Pipeline Optimization Tool (TPOT)

Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data. Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there. TPOT is built on top of scikit-learn, so all of the code it generates should look familiar… if you’re familiar with scikit-learn, anyway.
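The quick-start pattern from the TPOT documentation looks like this (the dataset choice and parameter values here are illustrative):

    from tpot import TPOTClassifier
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        *load_digits(return_X_y=True), train_size=0.75, random_state=42)

    tpot = TPOTClassifier(generations=5, population_size=20,
                          verbosity=2, random_state=42)
    tpot.fit(X_train, y_train)              # evolve pipelines on the training set
    print(tpot.score(X_test, y_test))       # held-out score of the best pipeline
    tpot.export('tpot_digits_pipeline.py')  # write the winning pipeline as Python code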


View Resource