Transcriptome (scRNA-seq data) + HVG selection

We show an example of scRNA-seq data produced by 10X Chromium. We are using scRNA-seq data 3k PBMCs from a Healthy Donor (2,700 cells and 32,738 genes) from 10X Genomics Datasets.

This analysis follow Preprocessing and clustering 3k PBMCs in Scanpy tutorials except for RECODE processing.

We use scanpy to read/write 10X data. Import numpy and scanpy in addlition to screcode.

[1]:
import screcode
import numpy as np
import scanpy as sc
import warnings
warnings.simplefilter('ignore')

Read in the count matrix into an AnnData object.

[2]:
adata = sc.read_10x_mtx(
    "data/pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19/",  # the directory with the `.mtx` file
    var_names="gene_symbols",  # use gene symbols for the variable names (variables-axis index)
    cache=True,  # write a cache file for faster subsequent reading
)
adata.var_names_make_unique()
adata.layers["Raw"] = np.array(adata.X.toarray(),dtype=int)
adata
[2]:
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'
    layers: 'Raw'

Preprocessing

[3]:
sc.pl.highest_expr_genes(adata, n_top=20)
../_images/Tutorials_Tutorial_RNA_HVG_7_0.png
[4]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
[5]:
# annotate the group of mitochondrial genes as "mt"
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)
[6]:
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)
../_images/Tutorials_Tutorial_RNA_HVG_10_0.png
[7]:
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")
../_images/Tutorials_Tutorial_RNA_HVG_11_0.png
../_images/Tutorials_Tutorial_RNA_HVG_11_1.png
[8]:
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()

Apply RECODE

Apply RECODE to the count matrix. The anndata or ndarray data format is available.

[9]:
recode = screcode.RECODE()#target_sum=1e4
adata = recode.fit_transform(adata)
start RECODE for scRNA-seq data
Normalized data are stored as "RECODE" in adata.layers
Normalized data are stored as "RECODE_norm" and "RECODE_log" in adata.layers
end RECODE for scRNA-seq
log: {'seq_target': 'RNA', '#significant genes': 8119, '#non-significant genes': 5594, '#silent genes': 1, 'ell': 468, 'Elapsed time': '0h 0m 9s 546ms', 'solver': 'full'}

Performance check

[10]:
recode.report()
../_images/Tutorials_Tutorial_RNA_HVG_16_0.png

Highly variable gene selection using RECODE

[ ]:
recode.highly_variable_genes(adata)
adata
Highly variable genes are stored in adata.var['RECODE_highly_variable']
AnnData object with n_obs × n_vars = 2638 × 13714
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'RECODE_noise_variance', 'RECODE_NVSN_variance', 'RECODE_significance', 'RECODE_denoised_variance', 'RECODE_means', 'RECODE_highly_variable'
    uns: 'RECODE_essential'
    layers: 'Raw', 'RECODE', 'RECODE_NVSN', 'RECODE_norm', 'RECODE_log'
[12]:
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
ranked_genes = adata.var['RECODE_denoised_variance'].sort_values(ascending=False)
for gene in marker_genes:
    if gene in ranked_genes.index:
        rank = ranked_genes.index.get_loc(gene) + 1  # 1-based index
        print(f"{gene}: {rank}")
    else:
        print(f"{gene}: no data available")
IL7R: 60
CD79A: 22
MS4A1: 91
CD8A: 585
CD8B: 278
LYZ: 9
CD14: 82
LGALS3: 27
S100A8: 3
GNLY: 5
NKG7: 4
KLRB1: 1021
FCGR3A: 28
MS4A7: 184
FCER1A: 532
CST3: 7
PPBP: 30
[13]:
adata.X = adata.layers["RECODE_log"]
adata = adata[:, adata.var.RECODE_highly_variable]
adata
[13]:
View of AnnData object with n_obs × n_vars = 2638 × 2000
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'RECODE_noise_variance', 'RECODE_NVSN_variance', 'RECODE_significance', 'RECODE_denoised_variance', 'RECODE_means', 'RECODE_highly_variable'
    uns: 'RECODE_essential'
    layers: 'Raw', 'RECODE', 'RECODE_NVSN', 'RECODE_norm', 'RECODE_log'

Downstream analysis based on scanpy

PCA

[14]:
n_comps = 1000
sc.tl.pca(adata, svd_solver='arpack',n_comps=n_comps)
[15]:
sc.pl.pca_variance_ratio(adata, log=True)
../_images/Tutorials_Tutorial_RNA_HVG_24_0.png
[16]:
n_cols = 2
for i in range(0, len(marker_genes),n_cols):
    sc.pl.pca(adata, color=marker_genes[i:i+n_cols])
../_images/Tutorials_Tutorial_RNA_HVG_25_0.png
../_images/Tutorials_Tutorial_RNA_HVG_25_1.png
../_images/Tutorials_Tutorial_RNA_HVG_25_2.png
../_images/Tutorials_Tutorial_RNA_HVG_25_3.png
../_images/Tutorials_Tutorial_RNA_HVG_25_4.png
../_images/Tutorials_Tutorial_RNA_HVG_25_5.png
../_images/Tutorials_Tutorial_RNA_HVG_25_6.png
../_images/Tutorials_Tutorial_RNA_HVG_25_7.png
../_images/Tutorials_Tutorial_RNA_HVG_25_8.png

UMAP

Note that we do not use the PCA dimentionaly reduction as a preprocessing of UMAP (n_pca=0).

[17]:
sc.pp.neighbors(adata, n_pcs=100)
sc.tl.umap(adata)
[18]:
n_cols = 2
for i in range(0, len(marker_genes),n_cols):
    sc.pl.umap(adata, color=marker_genes[i:i+n_cols])
../_images/Tutorials_Tutorial_RNA_HVG_29_0.png
../_images/Tutorials_Tutorial_RNA_HVG_29_1.png
../_images/Tutorials_Tutorial_RNA_HVG_29_2.png
../_images/Tutorials_Tutorial_RNA_HVG_29_3.png
../_images/Tutorials_Tutorial_RNA_HVG_29_4.png
../_images/Tutorials_Tutorial_RNA_HVG_29_5.png
../_images/Tutorials_Tutorial_RNA_HVG_29_6.png
../_images/Tutorials_Tutorial_RNA_HVG_29_7.png
../_images/Tutorials_Tutorial_RNA_HVG_29_8.png

Clustering

[19]:
# pip install igraph leidenalg
[20]:
sc.tl.leiden(adata,resolution=0.9)
if 'leiden_colors' in adata.uns:
    del adata.uns['leiden_colors']
sc.pl.umap(adata, color=['leiden'],legend_loc='on data')
../_images/Tutorials_Tutorial_RNA_HVG_32_0.png

Finding marker genes

[21]:
sc.tl.rank_genes_groups(adata, "leiden", method="t-test")
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
../_images/Tutorials_Tutorial_RNA_HVG_34_0.png
[22]:
adata.var_names_make_unique()
n_cols = 2
for i in range(0, len(marker_genes),n_cols):
    sc.pl.violin(adata, marker_genes[i:i+n_cols], groupby='leiden')
../_images/Tutorials_Tutorial_RNA_HVG_35_0.png
../_images/Tutorials_Tutorial_RNA_HVG_35_1.png
../_images/Tutorials_Tutorial_RNA_HVG_35_2.png
../_images/Tutorials_Tutorial_RNA_HVG_35_3.png
../_images/Tutorials_Tutorial_RNA_HVG_35_4.png
../_images/Tutorials_Tutorial_RNA_HVG_35_5.png
../_images/Tutorials_Tutorial_RNA_HVG_35_6.png
../_images/Tutorials_Tutorial_RNA_HVG_35_7.png
../_images/Tutorials_Tutorial_RNA_HVG_35_8.png
[23]:
sc.pl.dotplot(adata, marker_genes, groupby="leiden",expression_cutoff=3)
../_images/Tutorials_Tutorial_RNA_HVG_36_0.png
[24]:
cell_types = [
    "CD4 T",             # Cluster 0: high IL7R expression → CD4 T cells
    "CD4 T",             # Cluster 1: IL7R expression suggests another CD4 T cell
    "CD14+ Monocytes",   # Cluster 2: high LYZ, CD14, S100A8 → CD14+ Monocytes
    "B",                 # Cluster 3: high CD79A, MS4A1 → B cells
    "CD8 T",             # Cluster 4: high IL7R, NKG7 → CD8 T cells
    "FCGR3A+ Monocytes", # Cluster 5: expression of FCGR3A, MS4A7 (and some CD14) → FCGR3A+ monocytes
    "NK",                # Cluster 6: high GNLY, NKG7 → NK cells
    "Dendritic",         # Cluster 7: elevated FCER1A, CST3 → dendritic cells
    "Megakaryocytes",     # Cluster 8: high PPBP expression → megakaryocyte cells
]

adata.obs["leiden_name"] = [cell_types[int(i)] for i in adata.obs["leiden"].values]
if 'leiden_name_colors' in adata.uns:
    del adata.uns['leiden_name_colors']
sc.pl.umap(adata, color='leiden_name', legend_loc='on data', frameon=False)
../_images/Tutorials_Tutorial_RNA_HVG_37_0.png