Oxidative Stress

Mapping toxicity pathways of per- and polyfluoroalkyl substances using interpretable classification-based machine learning models.

SAR and QSAR in environmental research

Abstract

Per- and polyfluoroalkyl substances (PFAS) are persistent environmental contaminants associated with adverse human health outcomes. However, experimental toxicity data remain unavailable for the vast majority of PFAS, limiting comprehensive risk assessment. In this study, machine learning (ML)-based classification quantitative structure-toxicity relationship (QSTR) models were developed to predict PFAS toxicity across three biologically relevant high-throughput screening endpoints. These included AID-1030 (ALDH1A1 inhibition associated with reproductive toxicity), AID-504444 (Nrf2 pathway inhibition responsible for vascular disruption, hepatic steatosis, lung carcinogenesis, and infertility), and AID-588855 (inhibition of TGF-β/Smad3 signalling linked to developmental toxicity and tumour progression). To address pronounced class imbalance in these datasets, five data balancing techniques (ADASYN, SMOTE, Borderline-SMOTE, SVMSMOTE, and random oversampling) were applied. Fourteen ML classifiers were trained on each balanced dataset, yielding 70 models per endpoint. Sum-of-Ranking-Differences (SRD) analysis identified the most robust models, with Gradient Boosting, Random Forest, and Support Vector Classifier models emerging as optimal for AID-1030, AID-504444, and AID-588855, respectively. SHAP and substructure analyses provided mechanistic interpretability by linking PFAS structural features to adverse outcome pathway (AOP) progression. The optimized models were further applied to an independent external dataset of 2,361 PFAS, and a Python-based tool, PERSIST, was developed to screen PFAS.
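
The abstract names Sum-of-Ranking-Differences (SRD) analysis as the model-selection step but does not give its details. The following is a minimal sketch of the basic SRD calculation, not the authors' implementation. It assumes the row-wise mean of the metric table as the reference ranking (the abstract does not state which reference was used), and the model names and metric values are hypothetical illustration data.

```python
from statistics import mean


def rank_with_ties(values):
    """Return 1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def srd_scores(table, reference=None):
    """Sum of absolute rank differences between each model and a reference.

    table: dict mapping model name -> list of metric values (same row order).
    reference: one value per row; ASSUMED to be the row-wise mean if omitted.
    """
    names = list(table)
    n_rows = len(table[names[0]])
    if any(len(table[name]) != n_rows for name in names):
        raise ValueError("all models need the same number of metric rows")
    if reference is None:
        reference = [mean(table[name][r] for name in names) for r in range(n_rows)]
    ref_ranks = rank_with_ties(reference)
    return {
        name: sum(abs(a - b) for a, b in zip(rank_with_ties(table[name]), ref_ranks))
        for name in names
    }


# Hypothetical metric rows (e.g. accuracy, sensitivity, specificity, MCC).
metrics = {
    "model_a": [0.91, 0.80, 0.86, 0.70],
    "model_b": [0.70, 0.92, 0.60, 0.81],
    "model_c": [0.89, 0.77, 0.84, 0.66],
}
srd = srd_scores(metrics)
for name, score in sorted(srd.items(), key=lambda kv: kv[1]):
    print(name, score)
```

A smaller SRD means a model's ranking of the rows lies closer to the reference ranking. The sketch omits the scaling and validation steps that a full SRD analysis would normally include.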

Key Findings

  • Developed machine learning-based QSTR models to predict PFAS toxicity across three high-throughput screening endpoints including Nrf2 pathway inhibition.
  • Applied multiple data balancing techniques and trained 14 ML classifiers per balanced dataset, identifying Gradient Boosting, Random Forest, and Support Vector Classifier as optimal models for different endpoints.
  • Achieved mechanistic interpretability via SHAP and substructure analyses linking PFAS structural features to adverse outcome pathways, and applied the optimized models to an independent external dataset of 2,361 PFAS.
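
As a rough illustration of the balancing step, the sketch below implements only the simplest of the five techniques, random oversampling, using the Python standard library. It is not the authors' pipeline: the function name, the toy descriptor values, and the labels are hypothetical. It also reproduces the model count stated in the abstract (5 balancing techniques × 14 classifiers = 70 models per endpoint).

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match
    the size of the largest class. Original rows are kept unchanged."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)  # sample with replacement from this class
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: four inactive compounds (0) and one active (1).
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))

# Model grid size as described in the abstract.
BALANCERS = ["ADASYN", "SMOTE", "Borderline-SMOTE", "SVMSMOTE", "random oversampling"]
N_CLASSIFIERS = 14
models_per_endpoint = len(BALANCERS) * N_CLASSIFIERS
print(models_per_endpoint)
```

In practice, oversampling is usually applied to the training split only, so that duplicated or synthetic samples do not leak into the evaluation data.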

Clinical Significance

This study provides interpretable ML models to predict PFAS-induced toxicity involving the Nrf2 oxidative stress pathway, enabling improved risk assessment and screening of environmental contaminants for potential human health impacts.

Citation

Sarkar S, Pore S, Roy K. Mapping toxicity pathways of per- and polyfluoroalkyl substances using interpretable classification-based machine learning models. SAR and QSAR in environmental research. 2026-Jun-29.

DOI: 10.1080/1062936X.2026.2693300