Deep Learning for Toxicology (DTox)

In drug development, a major reason for attrition is the lack of understanding of cellular mechanisms governing drug toxicity. The black-box nature of conventional classification models has limited their utility in identifying toxicity pathways. Here we developed DTox (Deep learning for Toxicology), an interpretation framework for knowledge-guided neural networks, which can predict compound response to toxicity assays and infer toxicity pathways of individual compounds. We demonstrate that DTox can achieve the same level of predictive performance as conventional models with a significant improvement in interpretability. Using DTox, we were able to rediscover mechanisms of transcription activation by three nuclear receptors, recapitulate cellular activities induced by aromatase inhibitors and PXR agonists, and differentiate distinctive mechanisms leading to HepG2 cytotoxicity. Virtual screening by DTox revealed that compounds with predicted cytotoxicity are at higher risk for clinical hepatic phenotypes. In summary, DTox provides a framework for deciphering cellular mechanisms of toxicity in silico.

View Resource

Diverse and Generative ML benchmark (DIGEN)

DIGEN is a modern machine learning benchmark that includes 40 datasets in tabular numeric format, specially designed to differentiate the performance of some of the leading machine learning (ML) methods, and a package for reproducible benchmarking that simplifies comparing the performance of those methods. DIGEN provides comprehensive information on each dataset: the ground truth (a mathematical formula showing how the target was generated); the results of exploratory analysis, including feature correlations and a histogram showing how the binary endpoint was calculated; multiple statistics, including AUROC, AUPRC, and F1 scores; Receiver Operating Characteristic (ROC) and Precision-Recall (PRC) charts for tuned ML methods; and a boxplot of the projected performance of the leading methods after hyperparameter tuning (100 runs of each method, each started with a different random seed). Apart from providing a collection of datasets and tuned ML methods, DIGEN provides tools to tune and optimize the parameters of any novel ML method and to visualize its performance in comparison with the leading ones. DIGEN also offers tools for reproducibility.
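
For readers unfamiliar with the three scores named above, the following is a minimal pure-Python sketch of how AUROC, AUPRC (computed here as average precision), and F1 can be derived from a binary endpoint and a method's predicted scores. It does not use or reproduce the DIGEN package; the function names and the toy data are assumptions made for illustration only.

```python
def auroc(y_true, scores):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumes untied scores)."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / hits


def f1_score(y_true, scores, threshold=0.5):
    """Harmonic mean of precision and recall after thresholding the scores."""
    pred = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(pred, y_true))
    fp = sum(p == 1 and y == 0 for p, y in zip(pred, y_true))
    fn = sum(p == 0 and y == 1 for p, y in zip(pred, y_true))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy binary endpoint and hypothetical model scores (not DIGEN data).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

print("AUROC:", round(auroc(y_true, scores), 3))
print("AUPRC:", round(average_precision(y_true, scores), 3))
print("F1:   ", round(f1_score(y_true, scores), 3))
```

AUROC and AUPRC depend only on how the scores rank the samples, while F1 also depends on the chosen threshold, which is one reason a benchmark may report all three.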

View Resource

Extended Supervised Tracking and Classifying System (scikit-ExSTraCS)

The scikit-ExSTraCS package includes a sklearn-compatible Python implementation of ExSTraCS 2.0. ExSTraCS 2.0, or Extended Supervised Tracking and Classifying System, implements the core components of a Michigan-style Learning Classifier System (in which the system’s genetic algorithm operates at the rule level, evolving a population of rules, each with its own parameters) in an easy-to-understand way, while still being highly functional in solving ML problems. It supports expert knowledge in the form of attribute weights, along with attribute tracking, rule compaction, and a rule specificity limit, which together make it particularly adept at solving highly complex problems. In general, Learning Classifier Systems (LCSs) are a class of rule-based machine learning algorithms that have been shown to perform well on problems involving high amounts of heterogeneity and epistasis. Well-designed LCSs are also highly human-interpretable. LCS variants have been shown to adeptly handle supervised and reinforcement learning, classification and regression, and online and offline learning problems, as well as missing or unbalanced data. This versatility and interpretability give LCSs a wide range of potential applications, notably in biomedicine.
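
To make the idea of a rule population concrete, here is a minimal pure-Python sketch of how a Michigan-style system might represent rules and combine them into a prediction. It is not the scikit-ExSTraCS API or its actual algorithm: the `Rule` class, the `predict` function, the hand-written rules, and the simple fitness-weighted vote are all assumptions for illustration, and no genetic algorithm is run.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    condition: dict  # attribute index -> required value; unlisted attributes are "don't care"
    action: int      # class the rule predicts
    fitness: float   # one of the per-rule parameters a genetic algorithm would evolve

    def matches(self, instance):
        """A rule matches when every attribute it specifies has the required value."""
        return all(instance[i] == v for i, v in self.condition.items())


def predict(rules, instance):
    """Simplified prediction: matching rules vote for their class, weighted by fitness."""
    votes = {}
    for rule in rules:
        if rule.matches(instance):
            votes[rule.action] = votes.get(rule.action, 0.0) + rule.fitness
    return max(votes, key=votes.get) if votes else None


# Hand-written population for an XOR-like (epistatic) problem on attributes 0 and 1.
# Attribute 2 is irrelevant; the last rule is a weak, overly general rule.
population = [
    Rule({0: 1, 1: 0}, action=1, fitness=0.9),
    Rule({0: 0, 1: 1}, action=1, fitness=0.9),
    Rule({0: 1, 1: 1}, action=0, fitness=0.8),
    Rule({0: 0, 1: 0}, action=0, fitness=0.8),
    Rule({2: 1}, action=1, fitness=0.2),
]

for instance in ([1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 0]):
    print(instance, "->", predict(population, instance))
```

Because each rule lists only the attributes it cares about, the population can be read directly as a set of if-then statements, which is the source of the interpretability described above.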

View Resource

Foundations of Artificial Intelligence

Educational lectures for the course “Foundations of Artificial Intelligence,” developed by Dr. Ryan Urbanowicz in 2020 at the University of Pennsylvania’s Perelman School of Medicine. This is the first of three courses covering topics in artificial intelligence for application within the context of informatics and biomedical research. The course is divided into modules that cover (1) introductory/background materials, (2) logic, (3) other knowledge representation, (4) essentials of expert systems, (5) search, (6) uncertainty, and (7) advanced/auxiliary topics. These topics offer a global foundation for branches of AI application and research, including concepts that will later support a deeper understanding of inductive reasoning and machine learning. In a practical sense, this course focuses on how biomedical data can be organized, represented, interpreted, searched, and applied in order to derive knowledge, make decisions, and ultimately make predictions while avoiding bias. This course was assembled using content from a wide variety of textbooks, slides, and lectures by various authors and speakers on the relevant topics. Some lectures were prepared and given by guest lecturers and thus have not been posted. At the time of posting, this course is in its second year, so feedback on mistakes or suggested improvements is welcome.

View Resource

Frontotemporal Degeneration Center (FTDC)

The Penn Frontotemporal Degeneration Center brings together an energetic team of creative clinicians and researchers dedicated to the investigation and treatment of early-onset neurodegenerative conditions. The research expertise at the Penn FTD Center spans many levels of neuroscience, including detailed clinico-pathological studies, biomarker discovery, genetics, neuropsychological studies, functional and structural neuroimaging, and cognitive neuroscience investigations of language, memory, and social cognition.

View Resource

Genetic Architecture Model Emulator for Testing and Evaluating Software (GAMETES)

GAMETES is an algorithm for the generation of complex single nucleotide polymorphism (SNP) models for simulated association studies. GAMETES is designed to generate epistatic models that we refer to as pure and strict. These models constitute the worst case in terms of detecting disease associations, since such associations may only be observed if all n loci are included in the disease model. The user-friendly GAMETES software rapidly and precisely generates epistatic multi-locus models and, using these models, can also generate simulated datasets exhibiting epistasis. Version 2.2 adds the ability to generate heterogeneous datasets by applying multiple independent models to different subsets of the simulated data. Additional features include the ability to create additive datasets by applying multiple independent models to the entire dataset, as well as functionality for the design of continuous endpoints. We have also added a custom model generation feature, so that users may directly specify and examine the properties of any 2- or 3-locus SNP model. Simple Mendelian models may also be generated with this feature.
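
As an illustration of what it means for an association to be visible only when all loci are considered together, the sketch below checks a hand-written 2-locus penetrance table for single-locus (marginal) effects. It is not GAMETES code and does not reproduce its model generation algorithm; the table, the function names, and the assumption of Hardy-Weinberg genotype frequencies are illustrative choices. For two loci, the only proper subsets are the single loci, so this marginal check is taken here to cover both "pure" and "strict".

```python
# Hypothetical penetrance table: P(disease | genotype).
# Rows are locus A genotypes (AA, Aa, aa); columns are locus B genotypes (BB, Bb, bb).
PENETRANCE = [
    [0.0, 0.0, 1.0],
    [0.0, 0.5, 0.0],
    [1.0, 0.0, 0.0],
]


def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for (major homozygote, heterozygote, minor homozygote)."""
    p = 1.0 - maf
    return [p * p, 2 * p * maf, maf * maf]


def marginal_penetrances(table, maf_a, maf_b):
    """Penetrance of each single-locus genotype, averaging over the other locus."""
    fa, fb = genotype_freqs(maf_a), genotype_freqs(maf_b)
    by_a = [sum(table[i][j] * fb[j] for j in range(3)) for i in range(3)]
    by_b = [sum(table[i][j] * fa[i] for i in range(3)) for j in range(3)]
    return by_a, by_b


def is_pure(table, maf_a, maf_b, tol=1e-9):
    """True when neither locus alone changes disease risk (no marginal effects)."""
    by_a, by_b = marginal_penetrances(table, maf_a, maf_b)
    return all(max(m) - min(m) < tol for m in (by_a, by_b))


print(marginal_penetrances(PENETRANCE, 0.5, 0.5))
print("pure at MAF 0.5:", is_pure(PENETRANCE, 0.5, 0.5))
print("pure at MAF 0.3:", is_pure(PENETRANCE, 0.3, 0.3))
```

The same table has no marginal effects at a minor allele frequency of 0.5 but does at 0.3, which shows why a penetrance table has to be matched to the allele frequencies for the model to stay purely epistatic.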

View Resource

Gradient Boosting

Gradient Boost is one of the most popular Machine Learning algorithms in use. And get this, it’s not that complicated! This video is the first part in a series that walks through it one step at a time. This video focuses on the main ideas behind using Gradient Boost to predict a continuous value, like someone’s weight. We call this “using Gradient Boost for Regression”. In the next video, we’ll work through the math to prove that Gradient Boost for Regression really is this simple. In part 3, we’ll walk through how Gradient Boost classifies samples into two different categories, and in part 4, we’ll go through the math again, this time focusing on classification.
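
The main idea can be sketched in a few lines of pure Python: start from the average weight, then repeatedly fit a small tree to the residuals and add a scaled-down copy of its prediction. This is a minimal sketch and not the code from the video. It assumes squared-error loss, a single input feature, and one-split trees (stumps) where larger trees would normally be used; the function names, the toy height and weight values, and the default settings are made up for illustration.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate split halfway between neighbouring x values
        left = [y for x, y in zip(xs, targets) if x <= t]
        right = [y for x, y in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Return (initial prediction, learning rate, list of stumps fitted to residuals)."""
    base = sum(ys) / len(ys)  # the first guess is simply the average target
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: height in metres -> weight in kilograms (made-up values).
heights = [1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
weights = [50, 57, 63, 72, 80, 88]

baseline = gradient_boost(heights, weights, n_trees=0)
model = gradient_boost(heights, weights)
print("MSE with the average only:", round(mse(baseline, heights, weights), 2))
print("MSE after boosting:       ", round(mse(model, heights, weights), 2))
```

The learning rate keeps each tree's contribution small, so the prediction moves toward the observed weights in many small steps.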

View Resource