Integration (iRECODE)

We demonstrate RECODE integration for scRNA-seq data produced by 10X Chromium. We use scRNA-seq datasets encompassing three batches and two cell lines (HEK293T and Jurkat). The test data is directly available from GitHub page of JinmiaoChenLab.

We use scanpy to read/write 10X data. Import numpy and scanpy in addlition to screcode.

[1]:
import screcode
import scanpy as sc
import anndata
import numpy as np
import pandas as pd
import sklearn
import warnings
warnings.simplefilter('ignore')
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)

Read in the count matrix into an AnnData object.

[2]:
input_dir = 'data/dataset6'
input_filenames = ['b1_exprs','b2_exprs','b3_exprs']
input_filenames_meta = ['b1_celltype','b2_celltype','b3_celltype']

data_input = pd.concat([pd.read_csv('%s/%s.txt.gz' % (input_dir,file),index_col=0,delimiter='\t').T for file in input_filenames])
data_input_meta = pd.DataFrame()
for i in range(len(input_filenames)):
    df_ = pd.read_csv('%s/%s.txt.gz' % (input_dir,input_filenames_meta[i]),index_col=0,delimiter='\t')
    df_["batch"] = np.full(len(df_.index),input_filenames[i],dtype=object)
    data_input_meta = pd.concat([data_input_meta,df_])
adata = anndata.AnnData(X=data_input,obs=data_input_meta)
adata = adata[:,np.sum(adata.X,axis=0)>0]
adata.layers["Raw"] = adata.X
adata
[2]:
AnnData object with n_obs × n_vars = 9531 × 21474
    obs: 'CellType', 'batch'
    layers: 'Raw'

Apply iRECODE

Apply RECODE integration to the count matrix. The anndata or ndarray data formats are available. If your data format is ndarray, prepare meta_data including batch infromation by DataFrame format.

[3]:
recode = screcode.RECODE()
adata = recode.fit_transform_integration(adata)
start RECODE integration for scRNA-seq data
2024-03-28 18:47:25,553 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
2024-03-28 18:47:26,044 - harmonypy - INFO - sklearn.KMeans initialization complete.
end RECODE for scRNA-seq
log: {'seq_target': 'RNA', '#significant genes': 14462, '#non-significant genes': 7012, '#silent genes': 0, 'ell': 154, 'Elapsed time': '0h 0m 44s 503ms', 'solver': 'full'}
[4]:
## Case of ndarray format
## data: single-cell count matrix [ndarray], meta_data: meta data including batch labels [DataFrame]
# recode = screcode.RECODE()
# data_RECODE = recode.fit_transform_integration(data, meta_data=meta_data, batch_key="batch")

With anndata format, outputs of RECODE are included in anndata objects: - denoised matrix -> adata.obsm[‘RECODE’] - noise variance -> adata.var[‘noise_variance_RECODE’] - normalized variance (NVSN variance) -> adata.var[‘normalized_variance_RECODE’] - clasification of genes (significant/non-significant/silent) -> adata.var[‘significance_RECODE’]

Performance check

[5]:
recode.report()
../_images/Tutorials_Tutorial_Integration_11_0.png

Downstream analysis

log normalization

[6]:
target_sum = 1e4

adata.layers["Raw_log"] = np.log(target_sum*adata.layers["Raw"]/np.sum(adata.layers["Raw"],axis=1)[:,np.newaxis]+1)
adata.layers["RECODE_log"] = np.log(target_sum*adata.layers["RECODE"]/np.sum(adata.layers["RECODE"],axis=1)[:,np.newaxis]+1)

PCA

[7]:
n_components = 50
adata.obsm["Raw_PCA"] = sklearn.decomposition.PCA(n_components=n_components).fit_transform(adata.layers["Raw_log"])
adata.obsm["RECODE_PCA"] = sklearn.decomposition.PCA(n_components=n_components).fit_transform(adata.layers["RECODE_log"])
[8]:
import matplotlib
import matplotlib.pyplot as plt

def plot_2d(
    adata,
    data_lables = ["Raw", "RECODE"],
    data_type = "PCA",
    color_labels = ['batch','CellType'],
    fig_size = 4,
    fs_title = 16,
    fs_label = 14,
    fs_legend = 14,
    labels = ["PC1","PC2"],
    cmaps = ["tab10","tab10_r"],
    vmaxs = [10,10]
    ):
    fig,ax = plt.subplots(2,2,figsize=((len(data_lables)+0.5)*fig_size,len(color_labels)*fig_size),tight_layout=True)
    for i in range(len(color_labels)):
        colors = np.copy(adata.obs[color_labels[i]].values)
        color_set = np.unique(colors)
        colors_num = np.copy(colors)
        [np.place(colors_num,colors_num==color_set[i],i) for i in range(len(color_set))]
        for j in range(len(data_lables)):
            ax_ = ax[i,j]
            name_  = data_lables[j] + "_" + data_type
            plot_data = adata.obsm[name_]
            idx_rdm_ = np.random.permutation(plot_data.shape[0])
            ax_.scatter(plot_data[idx_rdm_,0],plot_data[idx_rdm_,1],c=colors_num[idx_rdm_],s=1,cmap=cmaps[i],vmax=vmaxs[i],zorder=100)
            ax_.tick_params(bottom=False, left=False, right=False, top=False, labelbottom=False, labelleft=False, labelright=False, labeltop=False)
            if j==len(data_lables)-1:
                for c_ in np.unique(colors):
                    idx_ = colors == c_
                    ax_.scatter(plot_data[idx_,0],plot_data[idx_,1],c=colors_num[idx_],s=1,cmap=cmaps[i],vmin=0,vmax=vmaxs[i],label=c_,zorder=0)
                    ax_.legend(bbox_to_anchor=(1.0,0.5), loc="center left", borderaxespad=0, fontsize=fs_legend,markerscale=10,ncol=1,frameon=False,handletextpad=0.,borderpad=0)
            if i == 0:
                ax_.set_title(data_lables[j],fontsize=fs_title)
    plt.gcf().text(0.435,0.0,labels[0], ha="center",va="center",fontsize=fs_label)
    plt.gcf().text(0.0,0.49,labels[1], ha="center",va="center",rotation=90, fontsize=fs_label)


plot_2d(
    adata,
    data_lables = ["Raw", "RECODE"],
    data_type = "PCA",
    color_labels = ['batch','CellType'],
    labels = ["PC1","PC2"]
)
../_images/Tutorials_Tutorial_Integration_17_0.png

UMAP

[9]:
import umap
n_components = 2
adata.obsm["Raw_UMAP"] = umap.UMAP(n_components=n_components).fit_transform(adata.obsm["Raw_PCA"])
adata.obsm["RECODE_UMAP"] = umap.UMAP(n_components=n_components).fit_transform(adata.obsm["RECODE_PCA"])
[10]:
plot_2d(
    adata,
    data_lables = ["Raw", "RECODE"],
    data_type = "UMAP",
    color_labels = ['batch','CellType'],
    labels = ["UMAP1","UMAP2"]
)
../_images/Tutorials_Tutorial_Integration_20_0.png
[11]:
def plot_gex(
    adata,
    genes,
    data_lables = ["Raw_log", "RECODE_log"],
    color_labels = ['batch','CellType'],
    fig_size = 4,
    fs_title = 16,
    fs_label = 14,
    fs_legend = 14,
    cmaps = ["tab10","tab10_r"],
    vmaxs = [10,10]
    ):
    for gene_ in genes:
        fig,ax = plt.subplots(2,len(data_lables),figsize=((len(data_lables)+0.5)*fig_size,len(color_labels)*fig_size),tight_layout=True)
        for i in range(len(color_labels)):
            colors = np.copy(adata.obs[color_labels[i]].values)
            color_set = np.unique(colors)
            colors_num = np.copy(colors)
            [np.place(colors_num,colors_num==color_set[i],i) for i in range(len(color_set))]
            for j in range(len(data_lables)):
                ax_ = ax[i,j]
                name_  = data_lables[j]
                plot_data = adata.layers[name_]
                idx_x_ = adata.var.index == gene_[0]
                idx_y_ = adata.var.index == gene_[1]
                idx_rdm_ = np.random.permutation(plot_data.shape[0])
                ax_.scatter(plot_data[idx_rdm_][:,idx_x_],plot_data[idx_rdm_][:,idx_y_],c=colors_num[idx_rdm_],s=1,cmap=cmaps[i],vmax=vmaxs[i],zorder=100)
                ax_.tick_params(bottom=False, left=False, right=False, top=False, labelbottom=False, labelleft=False, labelright=False, labeltop=False)
                if j==len(data_lables)-1:
                    for c_ in np.unique(colors):
                        idx_ = colors == c_
                        ax_.scatter(plot_data[idx_][:,idx_x_],plot_data[idx_][:,idx_y_],c=colors_num[idx_],s=1,cmap=cmaps[i],vmin=0,vmax=vmaxs[i],label=c_,zorder=0)
                        ax_.legend(bbox_to_anchor=(1.0,0.5), loc="center left", borderaxespad=0, fontsize=fs_legend,markerscale=10,ncol=1,frameon=False,handletextpad=0.,borderpad=0)
                if i == 0:
                    ax_.set_title(data_lables[j],fontsize=fs_title)
                ax_.grid(ls="--",alpha=0.5,zorder=0)
        plt.gcf().text(0.435,0.0,"$\it{%s}$" % gene_[0], ha="center", va="center", fontsize=fs_label)
        plt.gcf().text(0.0,0.490,"$\it{%s}$" % gene_[1], ha="center", va="center", fontsize=fs_label,rotation=90)

genes = [["CD3D", "CDKN2A"]]
plot_gex(adata,genes)
../_images/Tutorials_Tutorial_Integration_21_0.png

Adjust parameters of batch correction

[12]:
recode = screcode.RECODE(RECODE_key="RECODE_theta")
adata = recode.fit_transform_integration(adata,integration_method_params = {"theta":0.1})
start RECODE integration for scRNA-seq data
2024-03-28 18:49:20,795 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
2024-03-28 18:49:21,261 - harmonypy - INFO - sklearn.KMeans initialization complete.
end RECODE for scRNA-seq
log: {'seq_target': 'RNA', '#significant genes': 14462, '#non-significant genes': 7012, '#silent genes': 0, 'ell': 154, 'Elapsed time': '0h 0m 44s 714ms', 'solver': 'full'}
[13]:
adata = recode.lognormalize(adata,key="RECODE_theta")
Normalized data are stored in "RECODE_theta_norm" and "RECODE_theta_log"
[14]:
genes = [["CD3D", "CDKN2A"]]
plot_gex(adata,genes,data_lables = ["Raw_log", "RECODE_log", "RECODE_theta_log"])
../_images/Tutorials_Tutorial_Integration_25_0.png
[ ]: