Transcriptome (scRNA-seq data) + HVG selection
We show an example of scRNA-seq data produced by 10X Chromium. We are using scRNA-seq data 3k PBMCs from a Healthy Donor
(2,700 cells and 32,738 genes) from 10X Genomics Datasets.
This analysis follow Preprocessing and clustering 3k PBMCs in Scanpy tutorials except for RECODE processing.
We use scanpy to read/write 10X data. Import numpy and scanpy in addlition to screcode.
[1]:
import screcode
import numpy as np
import scanpy as sc
import warnings
warnings.simplefilter('ignore')
Read in the count matrix into an AnnData object.
[2]:
adata = sc.read_10x_mtx(
"data/pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19/", # the directory with the `.mtx` file
var_names="gene_symbols", # use gene symbols for the variable names (variables-axis index)
cache=True, # write a cache file for faster subsequent reading
)
adata.var_names_make_unique()
adata.layers["Raw"] = np.array(adata.X.toarray(),dtype=int)
adata
[2]:
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
layers: 'Raw'
Preprocessing
[3]:
sc.pl.highest_expr_genes(adata, n_top=20)

[4]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
[5]:
# annotate the group of mitochondrial genes as "mt"
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)
[6]:
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4,
multi_panel=True,
)

[7]:
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")


[8]:
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()
Apply RECODE
Apply RECODE to the count matrix. The anndata or ndarray data format is available.
[9]:
recode = screcode.RECODE()#target_sum=1e4
adata = recode.fit_transform(adata)
start RECODE for scRNA-seq data
Normalized data are stored as "RECODE" in adata.layers
Normalized data are stored as "RECODE_norm" and "RECODE_log" in adata.layers
end RECODE for scRNA-seq
log: {'seq_target': 'RNA', '#significant genes': 8119, '#non-significant genes': 5594, '#silent genes': 1, 'ell': 468, 'Elapsed time': '0h 0m 9s 546ms', 'solver': 'full'}
Performance check
[10]:
recode.report()

Highly variable gene selection using RECODE
[ ]:
recode.highly_variable_genes(adata)
adata
Highly variable genes are stored in adata.var['RECODE_highly_variable']
AnnData object with n_obs × n_vars = 2638 × 13714
obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'RECODE_noise_variance', 'RECODE_NVSN_variance', 'RECODE_significance', 'RECODE_denoised_variance', 'RECODE_means', 'RECODE_highly_variable'
uns: 'RECODE_essential'
layers: 'Raw', 'RECODE', 'RECODE_NVSN', 'RECODE_norm', 'RECODE_log'
[12]:
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
ranked_genes = adata.var['RECODE_denoised_variance'].sort_values(ascending=False)
for gene in marker_genes:
if gene in ranked_genes.index:
rank = ranked_genes.index.get_loc(gene) + 1 # 1-based index
print(f"{gene}: {rank}")
else:
print(f"{gene}: no data available")
IL7R: 60
CD79A: 22
MS4A1: 91
CD8A: 585
CD8B: 278
LYZ: 9
CD14: 82
LGALS3: 27
S100A8: 3
GNLY: 5
NKG7: 4
KLRB1: 1021
FCGR3A: 28
MS4A7: 184
FCER1A: 532
CST3: 7
PPBP: 30
[13]:
adata.X = adata.layers["RECODE_log"]
adata = adata[:, adata.var.RECODE_highly_variable]
adata
[13]:
View of AnnData object with n_obs × n_vars = 2638 × 2000
obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'RECODE_noise_variance', 'RECODE_NVSN_variance', 'RECODE_significance', 'RECODE_denoised_variance', 'RECODE_means', 'RECODE_highly_variable'
uns: 'RECODE_essential'
layers: 'Raw', 'RECODE', 'RECODE_NVSN', 'RECODE_norm', 'RECODE_log'
Downstream analysis based on scanpy
PCA
[14]:
n_comps = 1000
sc.tl.pca(adata, svd_solver='arpack',n_comps=n_comps)
[15]:
sc.pl.pca_variance_ratio(adata, log=True)

[16]:
n_cols = 2
for i in range(0, len(marker_genes),n_cols):
sc.pl.pca(adata, color=marker_genes[i:i+n_cols])









UMAP
Note that we do not use the PCA dimentionaly reduction as a preprocessing of UMAP (n_pca=0
).
[17]:
sc.pp.neighbors(adata, n_pcs=100)
sc.tl.umap(adata)
[18]:
n_cols = 2
for i in range(0, len(marker_genes),n_cols):
sc.pl.umap(adata, color=marker_genes[i:i+n_cols])









Clustering
[19]:
# pip install igraph leidenalg
[20]:
sc.tl.leiden(adata,resolution=0.9)
if 'leiden_colors' in adata.uns:
del adata.uns['leiden_colors']
sc.pl.umap(adata, color=['leiden'],legend_loc='on data')

Finding marker genes
[21]:
sc.tl.rank_genes_groups(adata, "leiden", method="t-test")
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

[22]:
adata.var_names_make_unique()
n_cols = 2
for i in range(0, len(marker_genes),n_cols):
sc.pl.violin(adata, marker_genes[i:i+n_cols], groupby='leiden')









[23]:
sc.pl.dotplot(adata, marker_genes, groupby="leiden",expression_cutoff=3)

[24]:
cell_types = [
"CD4 T", # Cluster 0: high IL7R expression → CD4 T cells
"CD4 T", # Cluster 1: IL7R expression suggests another CD4 T cell
"CD14+ Monocytes", # Cluster 2: high LYZ, CD14, S100A8 → CD14+ Monocytes
"B", # Cluster 3: high CD79A, MS4A1 → B cells
"CD8 T", # Cluster 4: high IL7R, NKG7 → CD8 T cells
"FCGR3A+ Monocytes", # Cluster 5: expression of FCGR3A, MS4A7 (and some CD14) → FCGR3A+ monocytes
"NK", # Cluster 6: high GNLY, NKG7 → NK cells
"Dendritic", # Cluster 7: elevated FCER1A, CST3 → dendritic cells
"Megakaryocytes", # Cluster 8: high PPBP expression → megakaryocyte cells
]
adata.obs["leiden_name"] = [cell_types[int(i)] for i in adata.obs["leiden"].values]
if 'leiden_name_colors' in adata.uns:
del adata.uns['leiden_name_colors']
sc.pl.umap(adata, color='leiden_name', legend_loc='on data', frameon=False)
