Resolution of curse of dimensionality (RECODE) is a noise reduction method for single-cell sequencing data. This vignette briefly demonstrates how to run RECODE on single-cell RNA-seq data and integrate the result into Seurat analysis pipeline.
This tutorial uses the screcode Python package via the reticulate R package, and requires Python, the screcode Python package, and the reticulate R package installed in advance. You can install the screcode into your preferred Python environment using recodeinstaller. You can install reticulate and recodeinstaller as follows.
install.packages("reticulate") # For interfacing with Python
remotes::install_github("yusuke-imoto-lab/recodeinstaller") # For installing a screcode-installed Python environment.
Then, the following commands install the Python version of RECODE.
library(recodeinstaller)
install_screcode()
You can choose your preferred installation type according to the guidance. After the installation is done, a directory called “recodeloader” is created in the working directory. This directory contains an R script for loading the RECODE-enabled Python.
You can also install the R packages as follows.
install.packages("Seurat") # For the downstream analysis
install.packages("Matrix") # For processing sparse matrices
install.packages("ggplot2") # For data visualization
In the following instructions, this tutorial uses Seurat v4, but it is also compatible with Seurat v3. Import Seurat and other libraries with the following commands.
library(Seurat)
library(Matrix)
library(ggplot2)
library(reticulate)
set.seed(0)
You can load the created RECODE-enabled Python by running
source("recodeloader/load_recodeenv.R")
## Load installed Python RECODE.
## Warning: Previous request to
## `use_python("C:\Users\expou\AppData\Local\r-miniconda\envs\recode2/python.exe", required = TRUE)`
## will be ignored. It is superseded by request to
## `use_python("C:\Users\expou\AppData\Local\r-miniconda\envs\recode/python.exe")
py_config() # Confirm which Python to be used
## python: C:/Users/expou/AppData/Local/r-miniconda/envs/recode/python.exe
## libpython: C:/Users/expou/AppData/Local/r-miniconda/envs/recode/python38.dll
## pythonhome: C:/Users/expou/AppData/Local/r-miniconda/envs/recode
## version: 3.8.17 | packaged by conda-forge | (default, Jun 16 2023, 07:01:59) [MSC v.1929 64 bit (AMD64)]
## Architecture: 64bit
## numpy: C:/Users/expou/AppData/Local/r-miniconda/envs/recode/Lib/site-packages/numpy
## numpy_version: 1.23.5
##
## NOTE: Python version was forced by use_python function
in the working directory. Please make sure that the above source command is run interactively (not in your R script) and the “recodeloader” is in the working directory.
Instead of using recodeloader/load_recodeenv.R
, you can
select the Python environment to work in using the
use_python
function in the reticulate
R
package.
reticulate::use_python("C:/Users/ashbi/anaconda3/envs/r", # Path to python.exe
required=TRUE)
py_config() # Confirm which Python to be used
You can also select the Python environment to work in using the
use_condaenv
function.
reticulate::use_python(envname = "recode_installed_condaenv")
py_config() # Confirm which Python to be used
theme_set(theme(text = element_text(size = 18),
panel.background = element_rect(fill = "white", color = "gray20"),
panel.grid = element_line(color = "gray92"),
legend.key = element_rect(colour = NA, fill = NA)))
Sample data: For this tutorial, we use sample 10k
Human PBMCs, 3’ v3.1, Chromium Controller (11,485 cells and 36,601
genes) in 10X
Genomics Datasets. The test data is deposited as
Feature / cell matrix HDF5 (filterd)
here
(registration required).
seurat <- Read10X_h5("10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.h5")
seurat <- CreateSeuratObject(seurat)
seurat[["percent.mt"]] <- PercentageFeatureSet(seurat, pattern = "^MT-")
We apply RECODE on a count matrix where each row is a gene and each column is a cell (gene x cell). The output is a denoised count matrix (gene x cell).
plt <- reticulate::import(module="matplotlib.pyplot")
screcode <- reticulate::import(module="screcode.screcode")
recode<-screcode$RECODE()
data<-t(as.matrix(seurat[["RNA"]]@counts))
## Warning in asMethod(object): sparse->dense coercion: allocating vector of size 3.1 GiB
data_RECODE <- recode$fit_transform(data)
rownames(data_RECODE)<-colnames(seurat)
colnames(data_RECODE)<-rownames(seurat)
recode$report()
plt$show()
We store the denoised count matrix in
seurat[["RECODE"]]@counts
.
seurat[["RECODE"]] <- CreateAssayObject(counts = Matrix(t(data_RECODE), sparse = TRUE))
DefaultAssay(seurat) <- "RECODE"
We remove low-quality cells using the following thresholds.
700 \(\leq\) No. of detected genes
\(\leq\) 6000
% Mitochondrial transcript \(\leq\)
15
1000 \(\leq\) Total UMI \(\leq\) 25000
ngene_low <- 700
ngene_high <- 6000
mito_high <- 15
numi_low <- 1000
numi_high <- 25000
meta.data <- seurat@meta.data
meta.data$quality <- rep("High", nrow(meta.data))
meta.data$quality[meta.data$nFeature_RNA<ngene_low|ngene_high<meta.data$nFeature_RNA] <- "Low"
meta.data$quality[meta.data$nCount_RNA<numi_low|numi_high<meta.data$nCount_RNA] <- "Low"
meta.data$quality[meta.data$percent.mt>mito_high] <- "Low"
meta.data$quality <- factor(meta.data$quality)
p1 <- ggplot(meta.data, aes(x = nFeature_RNA, y = percent.mt, color = quality))+
geom_point(size = 0.1, show.legend = F)+
scale_color_manual(values = c("High"="black", "Low"="red"))+
xlab("nGene")
p2 <- ggplot(meta.data, aes(x = nCount_RNA, y = nFeature_RNA, color = quality))+
geom_point(size = 0.1)+
scale_color_manual(values = c("High"="black", "Low"="red"))+
xlab("nUMI")+
ylab("nGene")+
guides(color = guide_legend(override.aes = list(size = 5)))
p1+p2
VlnPlot(seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3,
pt.size = 0)
seurat <- subset(seurat, subset = nFeature_RNA > ngene_low & nFeature_RNA < ngene_high & percent.mt < mito_high & nCount_RNA > numi_low & nCount_RNA < numi_high)
dim(seurat)
## [1] 36601 11005
Then, we perform library size normalization and run PCA on 2000 highly variable genes. To retain as much biological information as possible, We do not scale gene expression variance to 1 in this tutorial.
seurat <- NormalizeData(seurat,scale.factor=10^5)
seurat <- FindVariableFeatures(seurat, selection.method = "vst", nfeatures = 2000)
seurat <- ScaleData(seurat, features = rownames(seurat), do.scale = FALSE, do.center = TRUE)
## Centering data matrix
seurat <- RunPCA(seurat, features = VariableFeatures(seurat))
## PC_ 1
## Positive: GZMM, CD247, CTSW, CCL5, KLRB1, SAMD3, GRAP2, GZMK, GZMA, NCR3
## STMN1, C12orf75, VEGFB, NKG7, CST7, GNLY, KLRG1, MATK, HOPX, LINC01871
## PYHIN1, TRGC2, PRF1, TRDC, MAF, MYBL1, GZMH, KLRD1, TPD52, LAIR2
## Negative: LYZ, S100A9, S100A8, MNDA, CST3, FGL2, CTSS, FOS, SERPINA1, IFI30
## VCAN, AC020656.1, AIF1, LST1, S100A12, IGSF6, CD14, CD36, MARCKS, CFD
## WARS, HLA-DRA, FCER1G, CPVL, RGS2, IFITM3, PLBD1, CSF3R, MAFB, HCK
## PC_ 2
## Positive: NKG7, GNLY, CCL5, GZMA, CTSW, GZMM, CD247, CST7, PRF1, KLRD1
## KLRB1, S100A9, GZMH, SAMD3, NEAT1, KLRG1, FGFBP2, TRDC, LAIR2, TRGC2
## GZMK, MATK, LYZ, LINC01871, S100A8, IL2RB, HOPX, PLEKHF1, KLRF1, SPON2
## Negative: IGHM, IGKC, HLA-DRA, MS4A1, CD79A, HLA-DQA1, TCL1A, IGHD, HLA-DRB1, IGLC2
## BANK1, CD74, HLA-DPB1, CD79B, HLA-DPA1, LINC00926, IGLC3, BCL11A, RALGPS2, HLA-DRB5
## FCER2, AFF3, TNFRSF13C, CD22, VPREB3, LINC02397, NIBAN3, HLA-DQB1, SPIB, PAX5
## PC_ 3
## Positive: GNLY, NKG7, GZMA, CST7, PRF1, GZMB, CCL5, FCGR3A, KLRD1, GZMH
## FGFBP2, HOPX, CLIC3, SPON2, RHOC, KLRF1, FCER1G, HLA-DPA1, LAIR2, HLA-DPB1
## CCL4, KLRG1, ADGRG1, ABI3, HLA-DRB1, TRDC, PTGDR, XCL2, SLAMF7, HLA-DQA1
## Negative: SLC40A1, PADI4, TSHZ2, S100A9, VCAN, S100A12, CKAP4, RETN, F5, S100A8
## APP, CYP1B1, THBS1, MGST1, QPCT, MCEMP1, CRISPLD2, CD14, LINC00937, MARC1
## CD163, WLS, CLEC4E, CLEC4D, RBP7, MAST4, RNASE2, PGD, STEAP4, HP
## PC_ 4
## Positive: CDKN1C, FCGR3A, IFITM3, HES4, AIF1, CTSL, TCF7L2, HLA-DQA1, BATF3, RHOC
## MS4A7, CSF1R, ABI3, APOBEC3A, RRAS, LYPD2, CASP5, LST1, SIGLEC10, HLA-DPA1
## CXCL10, WARS, SMIM25, PTP4A3, FAM110A, CALML4, C1QA, NEURL1, AC064805.1, CKB
## Negative: S100A8, S100A12, VCAN, S100A9, ALOX5AP, FOS, CYP1B1, CES1, AC020916.1, ITGAM
## THBS1, PADI4, RNASE6, LYZ, CKAP4, CR1, AC020656.1, GNLY, MEGF9, MCEMP1
## MGST1, QPCT, CD36, CXXC5, RNASE2, CSF3R, CD14, LRRK2, F5, MARC1
## PC_ 5
## Positive: ITM2C, PLD4, LILRA4, PPP1R14B, CLEC4C, GZMB, TPM2, SERPINF1, CCDC50, DNASE1L3
## FCER1A, C12orf75, SCT, MZB1, IL3RA, UGCG, JCHAIN, LAMP5, CST3, PTGDS
## LYZ, TCF4, RUNX2, GAS6, FABP5, DAB2, LGMN, LRRC26, ZFAT, PTCRA
## Negative: IGHD, FCGR3A, LINC00926, MTSS1, MS4A1, CD79B, FCRL3, RGS2, MS4A7, IGHM
## CD79A, FCER2, PLPP5, BCL2A1, TNFRSF13C, IFIT3, IGLC2, FCRL1, NEAT1, TNFSF10
## PAX5, S100A8, SWAP70, FOS, FGFBP2, VPREB3, CD22, SCIMP, CDKN1C, IFIT2
seurat <- FindNeighbors(seurat, dims = 1:50)
## Computing nearest neighbor graph
## Computing SNN
seurat <- FindClusters(seurat, resolution = 1.2)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 11005
## Number of edges: 385012
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8236
## Number of communities: 17
## Elapsed time: 2 seconds
seurat <- RunUMAP(seurat, dims = 1:50, seed.use = 42)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 10:59:53 UMAP embedding parameters a = 0.9922 b = 1.112
## 10:59:53 Read 11005 rows and found 50 numeric columns
## 10:59:53 Using Annoy for neighbor search, n_neighbors = 30
## 10:59:53 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 10:59:55 Writing NN index file to temp file C:\Users\expou\AppData\Local\Temp\RtmpgJgbdr\file78e07f4e2066
## 10:59:55 Searching Annoy index using 1 thread, search_k = 3000
## 10:59:59 Annoy recall = 100%
## 10:59:59 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 11:00:01 Initializing from normalized Laplacian + noise (using irlba)
## 11:00:01 Commencing optimization for 200 epochs, with 478460 positive edges
## 11:00:16 Optimization finished
DimPlot(seurat, reduction = "umap", label = TRUE, label.size = 5)
DefaultAssay(seurat) <- "RECODE"
FeaturePlot(seurat, features = c("CD3D", "CD4", "CD8B", "CCR7", "GZMK", "PRF1", "KLRB1", "TRDC", "NKG7", "CD14", "FCGR3A", "CD19", "TCL1A", "FCER1A", "LILRA4"), ncol = 4)
use.switch <- function(x)
{
sapply(x, function(y){
switch(y,
"2" = "Naive CD8+ T cells",
"1" = "Naive CD4+ T cells",
"3" = "Memory CD4+ T cells",
"16" = "Memory CD4+ T cells",
"11" = "Central memory CD8+ T cells",
"7" = "Effector memory CD8+ T cells",
"14" = "gd T cells",
"10" = "Natural Killer cells",
"6" = "Memory B cells",
"9" = "Naive B cells",
"0" = "CD14+ monocytes",
"4" = "CD14+ monocytes",
"5" = "CD14+ monocytes",
"8" = "FCGR3A+ monocytes",
"13" = "Myeloid dendritic cells",
"15" = "Plasmacytoid dendritic cells",
"18" = "Plasmacytoid dendritic cells",
"12" = "Doublets",
"17" = "Doublets",
y
)
})
}
seurat$cluster <- as.character(seurat$seurat_clusters)
seurat$cluster <- use.switch(seurat$cluster)
seurat$cluster <- factor(seurat$cluster, levels = c("Naive CD8+ T cells", "Naive CD4+ T cells", "Memory CD4+ T cells", "Central memory CD8+ T cells", "Effector memory CD8+ T cells", "gd T cells", "Natural Killer cells", "Memory B cells", "Naive B cells", "CD14+ monocytes", "FCGR3A+ monocytes", "Myeloid dendritic cells", "Plasmacytoid dendritic cells", "Doublets"))
DimPlot(seurat, group.by = "cluster")
seurat <- NormalizeData(seurat, assay = "RNA",scale.factor=10^5)
genes <- c("CD14", "FCGR3A")
# Check gene expression value before RECODE
dat <- data.frame(A = seurat[["RNA"]]@data[genes[1], ],
B = seurat[["RNA"]]@data[genes[2], ],
cluster = seurat$cluster)
dat <- dat[(grep("monocytes", dat$cluster)), ]
p1 <- ggplot(dat, aes(x = A, y = B, color = cluster))+
geom_point(size = 0.7)+
xlab(genes[1])+
ylab(genes[2])+
guides(color = guide_legend(override.aes = list(size = 5)))+
ggtitle("Original")
# Check gene expression value after RECODE
dat <- data.frame(A = seurat[["RECODE"]]@data[genes[1], ],
B = seurat[["RECODE"]]@data[genes[2], ],
cluster = seurat$cluster)
dat <- dat[(grep("monocytes", dat$cluster)), ]
p2 <- ggplot(dat, aes(x = A, y = B, color = cluster))+
geom_point(size = 0.7)+
xlab(genes[1])+
ylab(genes[2])+
guides(color = guide_legend(override.aes = list(size = 5)))+
ggtitle("RECODE")
p1/p2
The distribution of two monocyte populations (CD14+ cells and FCGR3A+ cells) is clearly captured with RECODE.
genes <- c("CCR7", "S100A4")
# Check gene expression value before RECODE
dat <- data.frame(A = seurat[["RNA"]]@data[genes[1], ],
B = seurat[["RNA"]]@data[genes[2], ],
cluster = seurat$cluster)
dat <- dat[(grep("CD8", dat$cluster)), ]
p1 <- ggplot(dat, aes(x = A, y = B, color = cluster))+
geom_point(size = 0.7)+
xlab(genes[1])+
ylab(genes[2])+
guides(color = guide_legend(override.aes = list(size = 5)))+
ggtitle("Original")
# Check gene expression value after RECODE
dat <- data.frame(A = seurat[["RECODE"]]@data[genes[1], ],
B = seurat[["RECODE"]]@data[genes[2], ],
cluster = seurat$cluster)
dat <- dat[(grep("CD8", dat$cluster)), ]
p2 <- ggplot(dat, aes(x = A, y = B, color = cluster))+
geom_point(size = 0.7)+
xlab(genes[1])+
ylab(genes[2])+
guides(color = guide_legend(override.aes = list(size = 5)))+
ggtitle("RECODE")
p1/p2
This data includes three CD8+ T cell subsets: naive cells (CCR7+, S100A4-), central memory cells (CCR7+, S100A4+), and effector memory cells (CCR7-, S100A4+). This expression pattern is visible with RECODE.
While the original expression values are sparse and discontinuous due to random sampling effects, RECODE can reconstruct the continuous denoised data, which more accurately reflects the biological information. These two simple examples demonstrate the successful noise reduction by RECODE.
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Japanese_Japan.utf8 LC_CTYPE=Japanese_Japan.utf8 LC_MONETARY=Japanese_Japan.utf8
## [4] LC_NUMERIC=C LC_TIME=Japanese_Japan.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reticulate_1.28 ggplot2_3.4.1 Matrix_1.5-3 Seurat_4.3.0 rmarkdown_2.20
## [6] SeuratObject_4.1.3 sp_1.6-0
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.16 colorspace_2.1-0 deldir_1.0-6 ellipsis_0.3.2
## [5] ggridges_0.5.4 rprojroot_2.0.3 rstudioapi_0.14 spatstat.data_3.0-0
## [9] farver_2.1.1 leiden_0.4.3 listenv_0.9.0 bit64_4.0.5
## [13] ggrepel_0.9.3 fansi_1.0.4 codetools_0.2-18 splines_4.2.2
## [17] cachem_1.0.6 knitr_1.42 polyclip_1.10-4 jsonlite_1.8.4
## [21] ica_1.0-3 cluster_2.1.4 png_0.1-8 uwot_0.1.14
## [25] shiny_1.7.4 sctransform_0.3.5 spatstat.sparse_3.0-0 compiler_4.2.2
## [29] httr_1.4.4 fastmap_1.1.0 lazyeval_0.2.2 cli_3.6.0
## [33] later_1.3.0 htmltools_0.5.4 tools_4.2.2 igraph_1.4.0
## [37] gtable_0.3.1 glue_1.6.2 RANN_2.6.1 reshape2_1.4.4
## [41] dplyr_1.1.0 rappdirs_0.3.3 Rcpp_1.0.10 scattermore_0.8
## [45] jquerylib_0.1.4 vctrs_0.5.2 nlme_3.1-160 spatstat.explore_3.0-6
## [49] progressr_0.13.0 lmtest_0.9-40 spatstat.random_3.1-3 xfun_0.37
## [53] stringr_1.5.0 globals_0.16.2 mime_0.12 miniUI_0.1.1.1
## [57] lifecycle_1.0.3 irlba_2.3.5.1 goftest_1.2-3 future_1.31.0
## [61] MASS_7.3-58.1 zoo_1.8-11 scales_1.2.1 promises_1.2.0.1
## [65] spatstat.utils_3.0-1 parallel_4.2.2 RColorBrewer_1.1-3 yaml_2.3.7
## [69] pbapply_1.7-0 gridExtra_2.3 sass_0.4.5 stringi_1.7.12
## [73] highr_0.10 rlang_1.0.6 pkgconfig_2.0.3 matrixStats_0.63.0
## [77] evaluate_0.20 lattice_0.20-45 ROCR_1.0-11 purrr_1.0.1
## [81] tensor_1.5 labeling_0.4.2 patchwork_1.1.2 htmlwidgets_1.6.1
## [85] bit_4.0.5 cowplot_1.1.1 tidyselect_1.2.0 here_1.0.1
## [89] parallelly_1.34.0 RcppAnnoy_0.0.20 plyr_1.8.8 magrittr_2.0.3
## [93] R6_2.5.1 generics_0.1.3 pillar_1.8.1 withr_2.5.0
## [97] fitdistrplus_1.1-8 survival_3.4-0 abind_1.4-5 tibble_3.1.8
## [101] future.apply_1.10.0 crayon_1.5.2 hdf5r_1.3.8 KernSmooth_2.23-20
## [105] utf8_1.2.3 spatstat.geom_3.0-6 plotly_4.10.1 grid_4.2.2
## [109] data.table_1.14.8 digest_0.6.31 xtable_1.8-4 tidyr_1.3.0
## [113] httpuv_1.6.9 munsell_0.5.0 viridisLite_0.4.1 bslib_0.4.2