Introduction to Penalized Co-Inertia Analysis

Eun Jeong Min

2018-03-05

Introduction

The package ‘pCIA’ contains functions for penalized co-inertia analysis (pCIA). It provides functions that conduct sparse CIA and structured sparse CIA, and that generate CIA plots based on the packages ‘ade4’ and ‘made4’.

In this tutorial, we conduct two data analyses: one uses a synthetic dataset contained in the package pCIA, and the other uses a real dataset contained in the package made4. The synthetic dataset contains 5 Monte Carlo samples and the true parameter values used to generate them. It carries no class information: all samples are generated from a single class. The real dataset is the NCI60 cell line data, which consists of two gene expression datasets generated on different platforms. The NCI60 data come from 60 cell lines of nine cancer types, so the nine cancer types serve as class information. More details about the data can be found in the help page of the package made4.

Synthetic Data Analysis

Co-Inertia Analysis

First, load the library and the demo data.

library(pCIA)
#> Loading required package: scatterplot3d
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loaded glmnet 2.0-13
#> Loading required package: ade4
data("demoData")
attach(demoData)
X <- setX[1:n,]; nX <- sweep(X, 2, apply(X, 2, mean))
Y <- setY[1:n,]; nY <- sweep(Y, 2, apply(Y, 2, mean))

Regarding the transformation of the data, we center the datasets columnwise, since the true parameters are assumed to have the same unit.
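As a small check of what the centering step does, the sketch below applies the same sweep() pattern to a toy matrix and compares it with base R's scale(). The matrix X_toy and its values are hypothetical and are not part of demoData.

```r
# Toy matrix standing in for a data block (hypothetical values, not from demoData).
X_toy <- matrix(c(1, 2, 3, 4,
                  10, 20, 30, 40), nrow = 4)

# Subtract each column's mean, following the same sweep() pattern as the tutorial.
nX_toy <- sweep(X_toy, 2, colMeans(X_toy))

# scale() with scale = FALSE performs the same columnwise centering.
nX_alt <- scale(X_toy, center = TRUE, scale = FALSE)

print(colMeans(nX_toy))  # column means after centering
print(all.equal(nX_toy, nX_alt, check.attributes = FALSE))
```

Only the means are removed; the columns are not rescaled, which matches the assumption that the variables share a unit.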

The function oCIA conducts the original co-inertia analysis. All weight matrices used for the CIA must be supplied to oCIA. In this example, we assume that the true weight matrices are known and use them in the analysis.

result.cia <- oCIA(nX, nY, D, Qx, Qy) # conducts classic CIA
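To make the role of the weight matrices concrete, here is a minimal base R sketch of one common textbook formulation of co-inertia analysis: the first loading pair maximizes a' Qx X' D Y Qy b subject to a' Qx a = 1 and b' Qy b = 1, which reduces to a singular value decomposition. This is an illustration on simulated toy data under that assumption; oCIA may use a different normalization or algorithm internally, and all names ending in _toy are hypothetical.

```r
set.seed(1)
n_toy <- 20; p_toy <- 5; q_toy <- 4

# Column-centered toy data blocks (simulated, not from the package).
X_toy <- scale(matrix(rnorm(n_toy * p_toy), n_toy, p_toy), scale = FALSE)
Y_toy <- scale(matrix(rnorm(n_toy * q_toy), n_toy, q_toy), scale = FALSE)

# Diagonal weight matrices: uniform weights on samples and on variables.
D_toy  <- diag(1 / n_toy, n_toy)
Qx_toy <- diag(1 / p_toy, p_toy)
Qy_toy <- diag(1 / q_toy, q_toy)

# For diagonal matrices the elementwise sqrt() is also the matrix square root.
M <- sqrt(Qx_toy) %*% t(X_toy) %*% D_toy %*% Y_toy %*% sqrt(Qy_toy)
sv <- svd(M)

# Map the leading singular vectors back to loadings with a' Qx a = 1, b' Qy b = 1.
a1 <- solve(sqrt(Qx_toy), sv$u[, 1])
b1 <- solve(sqrt(Qy_toy), sv$v[, 1])

# The attained objective value equals the leading singular value.
obj <- drop(t(a1) %*% Qx_toy %*% t(X_toy) %*% D_toy %*% Y_toy %*% Qy_toy %*% b1)
print(c(objective = obj, singular_value = sv$d[1]))
```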

We can compare the true loading vectors and estimated loading vectors by plotting them on the same page.

par(mfrow=c(1,2))
plot(ta[,1], xlab="index", ylab="a", pch=1, cex=0.7)
points(result.cia$res.opta[,1], pch=2, cex=0.3, col="red")
plot(tb[,1], xlab="index", ylab="b", pch=1, cex=0.7)
points(result.cia$res.optb[,1], pch=2, cex=0.3, col="red")

Black circles are the true values that were used to generate the synthetic data, and red triangles are the loading vectors estimated by the function oCIA.

Sparse Co-Inertia Analysis

The function fitscia fits sparse co-inertia analysis for a given set of tuning parameter values, without a cross validation procedure. This function requires an initial starting point; in this example we use the result of CIA.

result.scia <- fitscia(nX, nY, result.cia$res.opta[,1], result.cia$res.optb[,1], c(6, 6))

By plotting the estimated loading vectors, we observe that the estimated vectors are sparse.

par(mfrow=c(1,2))
plot(ta[,1], xlab="index", ylab="estimated a", pch=1, cex=0.7)
points(result.scia$res.opta, pch=2, cex=0.3, col="red")
plot(tb[,1], xlab="index", ylab="estimated b", pch=1, cex=0.7)
points(result.scia$res.optb, pch=2, cex=0.3, col="red")

Black circles are the true values that were used to generate the synthetic data, and red triangles are the loading vectors estimated by the function fitscia.
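The tutorial does not spell out how the sparsity is produced. Assuming a lasso-type (L1) penalty, which is an assumption here and not a statement about the internals of fitscia, exact zeros typically arise from a soft-thresholding step. The sketch below shows that generic operator on a hypothetical loading vector; soft_threshold and the values are illustrative only.

```r
# Generic soft-thresholding operator: shrink towards zero by lambda,
# and set entries with absolute value at most lambda to exactly zero.
soft_threshold <- function(v, lambda) {
  sign(v) * pmax(abs(v) - lambda, 0)
}

# Hypothetical dense loading vector (not package output).
v <- c(-0.9, -0.2, 0.05, 0.3, 1.2)
v_sparse <- soft_threshold(v, lambda = 0.25)

print(v_sparse)
print(sum(v_sparse == 0))  # number of loadings set exactly to zero
```

Larger values of lambda zero out more entries, which is the general sense in which a tuning parameter controls sparsity.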

To conduct cross validation to select optimal tuning parameters, the function sCIA.cv can be used instead. This function can also estimate several loading pairs; the argument K tells the function how many loading vector pairs to estimate. Users can print the cross validation objective value at each grid point through the argument flag. Printing is turned off if flag is set to FALSE.

result.scia.cv <- sCIA.cv(nX, nY, D, Qx, Qy, K=2, nfold=5, ngrid=10, flag=FALSE)
#> 1th dimension is done
#> 2th dimension is done
par(mfrow=c(1,2))
plot(ta[,1], xlab="index", ylab="estimated a", pch=1, cex=0.7)
points(result.scia.cv$resa[,1], pch=2, cex=0.3, col="red")
plot(tb[,1], xlab="index", ylab="estimated b", pch=1, cex=0.7)
points(result.scia.cv$resb[,1], pch=2, cex=0.3, col="red")

Black circles are the true values that were used to generate the synthetic data, and red triangles are the loading vectors estimated by the function sCIA.cv with the optimal tuning parameters chosen by cross validation.
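For readers unfamiliar with the nfold argument, the following sketch shows how samples are commonly split into folds for cross validation. It is a generic illustration in base R; how sCIA.cv actually assigns folds and evaluates its objective is not documented in this tutorial, and the names n_toy, fold_id, and fold_sizes are hypothetical.

```r
set.seed(42)
n_toy <- 23   # hypothetical number of samples
nfold <- 5

# Assign each sample to one of nfold folds of (nearly) equal size, in random order.
fold_id <- sample(rep(seq_len(nfold), length.out = n_toy))

fold_sizes <- integer(nfold)
for (k in seq_len(nfold)) {
  test_idx  <- which(fold_id == k)   # held-out samples for fold k
  train_idx <- which(fold_id != k)   # samples used for fitting in fold k
  fold_sizes[k] <- length(test_idx)
  # A model would be fit on train_idx and scored on test_idx here.
  cat("fold", k, ": train", length(train_idx), "test", length(test_idx), "\n")
}
```

Each candidate tuning parameter value on the grid would be scored by averaging the held-out objective across the folds.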

Structured Sparse Co-Inertia Analysis

To fit the structured sparse co-inertia analysis for a given set of tuning parameter values, without a cross validation procedure, the function fitsscia is used. Like fitscia, this function requires an initial starting point, and we use the result of CIA. To incorporate the network information among variables, we need to calculate a Laplacian matrix for each dataset. The function getNetMatrix takes the data dimensions and graph information and generates the required matrices \(\tilde{L}\), which are derived from the Laplacian matrices, so that we can plug them into fitsscia as follows.

netmats <- getNetMatrix(p, q, eX, eY, Qx, Qy)
tLx <- netmats$Sx # tilde{L}_x
tLy <- netmats$Sy # tilde{L}_y
result.sscia <- fitsscia(nX, nY, result.cia$res.opta[,1], result.cia$res.optb[,1], tLx, tLy, c(6, 1, 6, 0.5))
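As background for the network matrices, the sketch below builds an unnormalized graph Laplacian L = Deg - A from a small edge list in base R. The two-column edge list format and the names edges_toy, A, and L are assumptions for illustration. getNetMatrix also takes Qx and Qy, so the matrices it returns are presumably a transformed version of such a Laplacian and not this matrix itself.

```r
p_toy <- 5  # hypothetical number of variables (nodes)

# Hypothetical undirected edge list: one row per edge, columns are node indices.
edges_toy <- rbind(c(1, 2),
                   c(2, 3),
                   c(4, 5))

# Symmetric adjacency matrix built by matrix indexing in both directions.
A <- matrix(0, p_toy, p_toy)
A[edges_toy] <- 1
A[edges_toy[, 2:1]] <- 1

# Unnormalized Laplacian: degree matrix minus adjacency matrix.
L <- diag(rowSums(A)) - A
print(L)
```

A penalty of the form a' L a is small when connected variables have similar loadings, which is the usual motivation for using a Laplacian to encode structure.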

By plotting the estimated loading vectors, we observe that the estimated vectors are sparse.

par(mfrow=c(1,2))
plot(ta[,1], xlab="index", ylab="estimated a", pch=1, cex=0.7)
points(result.sscia$res.opta, pch=2, cex=0.3, col="red")
plot(tb[,1], xlab="index", ylab="estimated b", pch=1, cex=0.7)
points(result.sscia$res.optb, pch=2, cex=0.3, col="red")

Black circles are the true values that were used to generate the synthetic data, and red triangles are the loading vectors estimated by the function fitsscia.

To conduct cross validation to select optimal tuning parameters, the function ssCIA.cv can be used instead. This function can also estimate several loading pairs through the argument K. The argument flag has default value TRUE, which prints the calculated cross validation objective value at each grid point. Printing is turned off if flag is set to FALSE.

result.sscia.cv <- ssCIA.cv(nX, nY, D, Qx, Qy, tLx, tLy, K=2, nfold=5, ngrid=c(5, 5, 5, 5), flag=FALSE)
#> 1th dimension is done
#> 2th dimension is done
par(mfrow=c(1,2))
plot(ta[,1], xlab="index", ylab="estimated a", pch=1, cex=0.7)
points(result.sscia.cv$res.opta[,1], pch=2, cex=0.3, col="red")
plot(tb[,1], xlab="index", ylab="estimated b", pch=1, cex=0.7)
points(result.sscia.cv$res.optb[,1], pch=2, cex=0.3, col="red")

Black circles are the true values that were used to generate the synthetic data, and red triangles are the loading vectors estimated by the function ssCIA.cv with the optimal tuning parameters chosen by cross validation.

NCI60 Data Analysis (Class information)

Co-Inertia Analysis

To start the analysis over with different data, clean the workspace and load the NCI60 data from the package made4.

rm(list=ls())
library(pCIA)
data(list="NCI60", package="made4")
X <- t(NCI60$Ross)
Y <- t(NCI60$Affy)

The function oCIA conducts the original co-inertia analysis. To conduct the analysis, we first calculate the weight matrices \(Q_x, Q_y\) and \(D\). In this analysis, we set equal weights on the sample space, and we use the absolute column sum divided by the absolute grand sum as the diagonal elements of \(Q_x\) and \(Q_y\). We also center each dataset columnwise so that all column means are zero.

D <- diag(dim(X)[1])
Qx <- diag(apply(abs(X), 2, sum) / sum(abs(X)))
Qy <- diag(apply(abs(Y), 2, sum) / sum(abs(Y)))
nX <- sweep(X, 2, apply(X, 2, mean))
nY <- sweep(Y, 2, apply(Y, 2, mean))
result.cia <- oCIA(nX, nY, D, Qx, Qy) # conducts classic CIA
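To see what the weight construction produces, the sketch below repeats it on a tiny matrix with hypothetical values. It uses colSums(), which gives the same result as apply(abs(X), 2, sum); the names X_toy, w, Q_toy, and D_toy are illustrative.

```r
# Toy data block with 2 samples and 3 variables (hypothetical values).
X_toy <- matrix(c(1, -2, 3, 4, 0, -6), nrow = 2)

# Variable weights: absolute column sums divided by the absolute grand sum.
w <- colSums(abs(X_toy)) / sum(abs(X_toy))
Q_toy <- diag(w)

# Equal sample weights: diag(n) is the n x n identity matrix.
D_toy <- diag(nrow(X_toy))

print(w)
print(sum(w))  # the variable weights sum to one
```

Variables with larger absolute values overall receive larger weights, while every sample is weighted equally.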

We can generate CIA plots as follows.

pinput <- genpInput(result.cia, NCI60$Annot[,2], NCI60$Annot[,4], NCI60$classes[,1], NCI60$classes[,2], nX, nY, D, Qx, Qy)
plot(pinput, nlab=5, labels = NULL)

Sparse Co-Inertia Analysis

Using the function fitscia, with the result of CIA as the initial starting point and arbitrary tuning parameters, we estimate the first pair of sparse loading vectors of sparse CIA as follows.

result.scia <- fitscia(nX, nY, result.cia$res.opta[,1], result.cia$res.optb[,1], c(6, 6))

By plotting the estimated loading vectors, we observe that the estimated vectors are sparse.

par(mfrow=c(1,2))
plot(result.scia$res.opta, xlab="index", ylab="estimated a", pch=1, cex=0.7)
plot(result.scia$res.optb, xlab="index", ylab="estimated b", pch=1, cex=0.7)

We conduct cross validation to select optimal tuning parameters for sparsity using the function sCIA.cv as follows.

result.scia.cv <- sCIA.cv(nX, nY, D, Qx, Qy, K=2, nfold=5, ngrid=10, flag=FALSE)
#> 1th dimension is done
#> 2th dimension is done
par(mfrow=c(1,2))
plot(result.scia.cv$res.opta[,1], xlab="index", ylab="estimated a", pch=1, cex=0.7)
plot(result.scia.cv$res.optb[,1], xlab="index", ylab="estimated b", pch=1, cex=0.7)

As in the plot above, we observe that the estimated vectors are sparse.

Also, several informative plots of the CIA results can be generated with genpInput as follows.

pinput <- genpInput(result.scia.cv, NCI60$Annot[,2], NCI60$Annot[,4], NCI60$classes[,1], NCI60$classes[,2], nX, nY, D, Qx, Qy)
plot(pinput)

Structured Sparse Co-Inertia Analysis

To fit the structured sparse co-inertia analysis for a given set of tuning parameter values, without a cross validation procedure, the function fitsscia is used, with the result of CIA as the starting point in this example. We use the function getNetMatrix to generate, from the graph information, the required matrices derived from the Laplacian matrices of the two datasets, as follows. Since both datasets cover the same list of genes, the pathway information is the same for both. This pathway information is included in the package pCIA.

data(NCI60path)
netmats <- getNetMatrix(dim(nX)[2], dim(nY)[2], NCI60path, NCI60path, Qx, Qy)
tLx <- netmats$Sx # tilde{L}_x
tLy <- netmats$Sy # tilde{L}_y
result.sscia <- fitsscia(nX, nY, result.cia$res.opta[,1], result.cia$res.optb[,1], tLx, tLy, c(6, 1, 6, 0.5))

By plotting the estimated loading vectors, we observe that the estimated vectors are sparse.

par(mfrow=c(1,2))
plot(result.sscia$res.opta, xlab="index", ylab="estimated a", pch=1, cex=0.7)
plot(result.sscia$res.optb, xlab="index", ylab="estimated b", pch=1, cex=0.7)

We use the function ssCIA.cv to estimate two loading pairs, with the optimal tuning parameters chosen by the cross validation procedure.

result.sscia.cv <- ssCIA.cv(nX, nY, D, Qx, Qy, tLx, tLy, K=2, nfold=5, ngrid=c(5, 5, 5, 5), flag=FALSE)
#> 1th dimension is done
#> 2th dimension is done
par(mfrow=c(1,2))
plot(result.sscia.cv$res.opta[,1], xlab="index", ylab="estimated a", pch=1, cex=0.7)
plot(result.sscia.cv$res.optb[,1], xlab="index", ylab="estimated b", pch=1, cex=0.7)

Also, several informative plots of the CIA results can be generated with genpInput as follows.

pinput <- genpInput(result.sscia.cv, NCI60$Annot[,2], NCI60$Annot[,4], NCI60$classes[,1], NCI60$classes[,2], nX, nY, D, Qx, Qy)
plot(pinput)