In this example workflow, we demonstrate a single cell classifier we recently developed in our preprint
For illustration, we’ve chosen a T cell dataset that we recently published to get started. The TPM expression matrix can be downloaded here.
suppressMessages(library(ggplot2))
suppressMessages(library(tidyverse))
suppressMessages(library(scibet))
suppressMessages(library(viridis))
suppressMessages(library(ggsci))
path_da <- "~/test.rds.gz"
expr <- readr::read_rds(path = path_da)
For expression matrix (TPM), rows should be cells and the last column should be "label"
.
expr[1:10, 1:10]
Based on our unified model, we developed E-test for supervised gene selection. This step is implemented with SelectGene
function. Our use of E-test involves an assumption that there is no heterogeneity within each population and hence 𝑆 could be directly calculated by feeding its corresponding 𝐸 into the 𝑆-𝐸 formula.
etest_gene <- SelectGene(expr, k = 50)
etest_gene
## [1] "CXCL13" "CCR7" "FGFBP2" "SELL" "CCL4" "GZMH"
## [7] "GZMB" "IFNG" "CCL3" "FCGR3A" "GPR15" "RGS1"
## [13] "GZMK" "CX3CR1" "KLRC2" "CXCR6" "GZMA" "RGS2"
## [19] "MAL" "TMIGD2" "LEF1" "KLRC1" "PLEK" "HAVCR2"
## [25] "KLRF1" "GNLY" "S100B" "CD160" "NR4A2" "KLRG1"
## [31] "ITGAE" "NR4A1" "FOS" "FCER1G" "LDLRAP1" "NKG7"
## [37] "HLA-DRB5" "VCAM1" "CCL5" "CST7" "PDCD1" "FCRL6"
## [43] "C1orf162" "CD82" "TNFAIP3" "GPR183" "LAT2" "CD69"
## [49] "S1PR5" "PLAC8"
To verify these genes, we can examine their expression patterns across different cell types with Marker_heatmap
.
Marker_heatmap(expr, etest_gene)
For accurate, fast, and robust single cell identification. We developed SciBet using a multinomial-distribution model. This step is implemented with SciBet
function.
1. For reference set, rows should be cells, column should be genes and the last column should be "label"
(TPM). 2. For query set, rows should be cells and column should be genes (TPM).
tibble(
ID = 1:nrow(expr),
label = expr$label
) %>%
dplyr::sample_frac(0.7) %>%
dplyr::pull(ID) -> ID
train_set <- expr[ID,] #construct reference set
test_set <- expr[-ID,] #construct query set
prd <- SciBet(train_set, test_set[,-ncol(test_set)])
We can evaluate how well our predicted cell type annotations match the reference with C_heatmap
. For this dataset, we find that there is a high agreement in cell type identification.
C_heatmap(test_set$label, prd)
Due to the incomplete nature of reference scRNA-seq data collection, cell types excluded from the reference dataset may be falsely predicted to be a known cell type. By applying a null dataset as background, SciBet controls the potential false positives while maintaining high prediction accuracy for cells with types covered by the reference dataset (positive cells).
For illustration, we’ve chosen a recent melanoma dataset with immunde cells as “positive cells” and the other cells (CAF, maligant cells and endothelial cells) as “negative cells”.
For the purposes of this example, these three datasets are used to get started.
null <- readr::read_rds('~/null.rds.gz')
reference <- readr::read_rds('~/reference.rds.gz')
query <- readr::read_rds('~/query.rds.gz')
For query set, “negative cells” account for more than 60%.
ori_label <- query$label
table(ori_label)
## ori_label
## B.cell Macrophage Neg.cell NK T.CD4 T.CD8
## 245 126 2228 28 257 528
The confidence score of each query cell is calculated with the function conf_score
.
query <- query[,-ncol(query)]
c_score <- conf_score(ref = reference, query = query, null_expr = null, gene_num = 500)
The visualization of above result could be implemented with the function N_heatmap
.
tibble(
ori = ori_label,
prd = SciBet(reference, query),
c_score = c_score
) -> res
N_heatmap(res, cutoff = 0.45)