Perelman School of Medicine at the University of Pennsylvania

Long Research Group


Multiple Imputation for High-dimensional Incomplete Data

MIHD: R package for multiple imputation for high-dimensional incomplete data.


  • Y. Zhao and Q. Long, “Multiple imputation in the presence of high-dimensional data,” Statistical Methods in Medical Research, p. 962280213511027, 2013.
  • Y. Deng, C. Chang, M. S. Ido, and Q. Long, “Multiple imputation for general missing data patterns in the presence of high-dimensional data,” Scientific Reports, vol. 6, article 21689, 2016.


Manual: MIHD


Bootstrap Imputation with Variable Selection

BISS: R package for implementing bootstrap imputation with variable selection.

Reference: Q. Long and B. A. Johnson, “Variable selection in the presence of missing data: resampling and imputation,” Biostatistics, vol. 16, iss. 3, pp. 596-610, 2015.

Package: BISSpkg 1.0



Knowledge-guided Sparse PCA

fgsPCA: Matlab code to perform structured sparse PCA

Reference: Z. Li, S. Safo, and Q. Long, "Incorporating Biological Information in Sparse Principal Component Analysis with Application to Genomic Data", BMC Bioinformatics 18.1 (2017): 332.

Matlab Code: fgsPCA


Scalable Bayesian Variable Selection for Structured High-dimensional Data

EMSHS: R code implementing an EM algorithm for a Bayesian shrinkage approach that incorporates structural information

Reference: Chang, C., Kundu, S., & Long, Q. (2018). Scalable Bayesian variable selection for structured high‐dimensional data. Biometrics.

Package: EMSHS R package on CRAN


Sparse Linear Discriminant Analysis in Structured Covariates Space

sSLDA: Matlab code to perform structured sparse LDA

Reference: Safo, S.E., and Long, Q. (2019) Sparse linear discriminant analysis in structured covariates space. Statistical Analysis and Data Mining: The ASA Data Science Journal, 12(2), pp.56-69.

Matlab Code: sSLDA


Structured Sparse CCA

sSCCA: Matlab code to perform structured sparse CCA

Reference: S. Safo, S. Li and Q. Long, "Integrative analysis of transcriptomic and metabolomic data via sparse canonical correlation analysis with incorporation of biological information", Biometrics 74.1 (2018): 300-312.

Matlab Code: sSCCA_v2


Penalized Co-Inertia Analysis

pCIA: R package for implementing penalized co-inertia analysis for two datasets.

Reference: E. Min, S. Safo, and Q. Long, “Penalized Co-Inertia Analysis with Applications to –Omics Data”, Bioinformatics, 2019, 35(6):1018-25.




Distributed Learning from Multiple EHR Databases

Distributed Learning Predictor: Python library for learning from multiple databases and building predictive models based on Distributed Noise Contrastive Estimation (Distributed NCE)

Reference: Li, Z., Roberts, K.E., Jiang, X., and Long, Q. Distributed Learning from Multiple EHR Databases: Contextual Embedding Models for Medical Events. Journal of Biomedical Informatics, 2019, 92, p.103138. 

Link for the software on GitHub


Sparse Multiple Co-Inertia Analysis

pmCIA: R package to perform sparse multiple co-inertia analysis for multiple datasets

Reference: Min, E.J. and Long, Q., 2020. Sparse multiple co-Inertia analysis with application to integrative analysis of multi-Omics data. BMC Bioinformatics, 21, pp.1-12.

Package: pmCIA_0.9


Graph-guided Bayesian SVM

Graph-guided Bayesian SVM: Matlab code for the graph-guided Bayesian SVM

Reference: Wenli Sun, Changgee Chang, and Qi Long, "Graph-guided Bayesian SVM with Adaptive Structured Shrinkage Prior for high-dimensional data" 




Distributed Multiple Imputation

Distributed Multiple Imputation: R code for the simulations reported in the paper

Reference: Changgee Chang, Yi Deng, Xiaoqian Jiang, and Qi Long. (2020) "Multiple Imputation for Analysis of Incomplete Data in Distributed Health Data Networks" Nature Communications, 11(1):5467.

R Codes:



Deep Learning with Gaussian Differential Privacy

Deep Learning with Gaussian Differential Privacy: Python code

Reference: Bu, Z., Dong, J., Long, Q., and Su, W. (2020) Deep Learning with Gaussian Differential Privacy. Harvard Data Science Review, 2(3):1-48.


Python Library for TensorFlow Privacy including Gaussian DP:


Bayesian Graphical Models of Single-Cell RNA-Sequencing Data

Accounting for Technical Noise in Bayesian Graphical Models of Single-Cell RNA-Sequencing Data: Python code

Reference: Oh, J., Chang, C. and Long, Q. (2021) Accounting for Technical Noise in Bayesian Graphical Models of Single-Cell RNA-Sequencing Data. Biostatistics, in press