In this example workflow, we demonstrate a single cell classifier we recently developed in our preprint

For illustration, we’ve chosen a T cell dataset that we recently published to get started. The TPM expression matrix can be downloaded here.

Library

suppressMessages(library(ggplot2))
suppressMessages(library(tidyverse))
suppressMessages(library(scibet))
suppressMessages(library(viridis))
suppressMessages(library(ggsci))

Load the data

path_da <- "~/test.rds.gz"
expr <- readr::read_rds(path = path_da) 

For expression matrix (TPM), rows should be cells and the last column should be "label".

expr[1:10, 1:10]

E(ntropy)-test for supervised gene selection

Based on our unified model, we developed E-test for supervised gene selection. This step is implemented with SelectGene function. Our use of E-test involves an assumption that there is no heterogeneity within each population and hence 𝑆 could be directly calculated by feeding its corresponding 𝐸 into the 𝑆-𝐸 formula.

etest_gene <- SelectGene(expr, k = 50)
etest_gene
##  [1] "CXCL13"   "CCR7"     "FGFBP2"   "SELL"     "CCL4"     "GZMH"    
##  [7] "GZMB"     "IFNG"     "CCL3"     "FCGR3A"   "GPR15"    "RGS1"    
## [13] "GZMK"     "CX3CR1"   "KLRC2"    "CXCR6"    "GZMA"     "RGS2"    
## [19] "MAL"      "TMIGD2"   "LEF1"     "KLRC1"    "PLEK"     "HAVCR2"  
## [25] "KLRF1"    "GNLY"     "S100B"    "CD160"    "NR4A2"    "KLRG1"   
## [31] "ITGAE"    "NR4A1"    "FOS"      "FCER1G"   "LDLRAP1"  "NKG7"    
## [37] "HLA-DRB5" "VCAM1"    "CCL5"     "CST7"     "PDCD1"    "FCRL6"   
## [43] "C1orf162" "CD82"     "TNFAIP3"  "GPR183"   "LAT2"     "CD69"    
## [49] "S1PR5"    "PLAC8"

To verify these genes, we can examine their expression patterns across different cell types with Marker_heatmap.

Marker_heatmap(expr, etest_gene)

SciBet: Single Cell Identifier Based on Entropy Test

For accurate, fast, and robust single cell identification. We developed SciBet using a multinomial-distribution model. This step is implemented with SciBet function.

1. For reference set, rows should be cells, column should be genes and the last column should be "label" (TPM). 2. For query set, rows should be cells and column should be genes (TPM).

tibble(
  ID = 1:nrow(expr),
  label = expr$label
) %>%
  dplyr::sample_frac(0.7) %>%
  dplyr::pull(ID) -> ID

train_set <- expr[ID,]      #construct reference set
test_set <- expr[-ID,]      #construct query set

prd <- SciBet(train_set, test_set[,-ncol(test_set)])

We can evaluate how well our predicted cell type annotations match the reference with C_heatmap. For this dataset, we find that there is a high agreement in cell type identification.

C_heatmap(test_set$label, prd)

False positive control

Due to the incomplete nature of reference scRNA-seq data collection, cell types excluded from the reference dataset may be falsely predicted to be a known cell type. By applying a null dataset as background, SciBet controls the potential false positives while maintaining high prediction accuracy for cells with types covered by the reference dataset (positive cells).

For illustration, we’ve chosen a recent melanoma dataset with immunde cells as “positive cells” and the other cells (CAF, maligant cells and endothelial cells) as “negative cells”.

For the purposes of this example, these three datasets are used to get started.

null <- readr::read_rds('~/null.rds.gz')
reference <- readr::read_rds('~/reference.rds.gz')
query <- readr::read_rds('~/query.rds.gz')

For query set, “negative cells” account for more than 60%.

ori_label <- query$label
table(ori_label)
## ori_label
##     B.cell Macrophage   Neg.cell         NK      T.CD4      T.CD8 
##        245        126       2228         28        257        528

The confidence score of each query cell is calculated with the function conf_score.

query <- query[,-ncol(query)]
c_score <- conf_score(ref = reference, query = query, null_expr = null, gene_num = 500)

The visualization of above result could be implemented with the function N_heatmap.

tibble(
  ori = ori_label,
  prd = SciBet(reference, query),
  c_score = c_score
) -> res

N_heatmap(res, cutoff = 0.45)