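This post collects software, principles, and literature on single-cell analysis. It is not complete; I will keep adding to it and reorganizing it.
Please note that this post is not my original work: it was carried over from GitHub. I have been collecting single-cell material on Twitter for a while but never managed to summarize it well. Then I came across this article on Twitter, thought it was excellent, and decided to adopt it as my own study notes. The intent is not to plagiarize, only to keep a study note, and I will revise, trim, and extend it. I have already removed quite a lot, because some of the papers had too steep a learning cost for me.
My blog posts are also released under CC-BY-NC 4.0, so please credit the source when reposting and state any changes you make.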
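Background
Single-cell analysis keeps getting more popular. Last week I spent several days putting together a slide deck on it, but I cannot share the content because it is part of my work at my company.
That said, everything on this blog is put together in my own spare time, so none of that applies here. I never touch anything internal to the company.
In short, single-cell sequencing is booming. Traditional transcriptomics yields one transcriptome per sample, whereas single-cell sequencing yields one transcriptome per cell. Where 100 transcriptomes once counted as a large dataset, single-cell studies routinely produce expression profiles for thousands, tens of thousands, or even millions of cells.
This has directly spawned a large number of analysis tools. Unlike traditional omics software, single-cell tools work on genuinely big, high-dimensional data, which makes this the territory of dimensionality reduction and machine learning.
For biologists, though, this material can be extremely unfriendly, and it demands continuous learning.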
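Single-cell study notes
Some resources from the 10x website (continuously updated)
What is single-cell sequencing?
A detailed single-cell analysis tutorial, very concrete, covering both the analysis code and the underlying principles. Highly recommended: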
- Current best practices in single‐cell RNA‐seq analysis: a tutorial github repo https://github.com/theislab/single-cell-tutorial
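R Bioconductor conference material on single-cell analysis. I have not looked at it yet, but it seems quite advanced:
This one is an entire book, wow:
This article reads like an overview and seems quite good; bookmarking it to read slowly later:
A gem: this article compares single-cell technologies with each other. Love it, I want to read it right now!
How to adjust the parameters when t-SNE clustering results look poor:
An article by a true expert, whose articles are all worth reading. This one is on dimensionality reduction:
Another article by the same expert, on machine learning for clustering:
Applications of single-cell analysis in medicine:
Applications in immunology. Coincidentally, I have been reading this article recently and plan to translate it:
Single-cell experimental design
These are all papers. They feel important, yet also skippable, so I will not read them for now.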
- paper: Design and Analysis of Single-Cell Sequencing Experiments
- paper: Experimental design for single-cell RNA sequencing
- paper: How to design a single-cell RNA-sequencing experiment: pitfalls, challenges and perspectives
- GT-TS: Experimental design for maximizing cell type discovery in single-cell data
- Tutorial: guidelines for the experimental design of single-cell RNA sequencing studies
Single-cell data processing
For now I simply follow Seurat and Monocle; I do not feel like learning more.
- paper: Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq
- paper: Assessment of single cell RNA-seq normalization methods
- paper: A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications
- Normalizing single-cell RNA sequencing data: challenges and opportunities Nature Methods
- SinQC: A Method and Tool to Control Single-cell RNA-seq Data Quality.
- Scone Single-Cell Overview of Normalized Expression data
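The papers above compare normalization procedures in depth. As a point of reference, here is a minimal sketch of the simplest baseline they are usually measured against: scale every cell to the same total count, then take log(1 + x). The function name, the toy matrix, and the scale factor of 10,000 are assumptions made for illustration and are not taken from any of the listed papers.

```python
import math

def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x).

    `counts` is a list of cells; each cell is a list of raw gene counts.
    `target_sum` is an assumed scale factor, not a recommendation.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # A cell with no counts cannot be rescaled; keep it as zeros.
            normalized.append([0.0 for _ in cell])
            continue
        scale = target_sum / total
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized

# Toy data (made up): two cells with the same composition but a
# tenfold difference in sequencing depth.
raw = [
    [10, 0, 30, 60],
    [1, 0, 3, 6],
]
norm = normalize_log1p(raw)
```

After normalization the two toy cells become identical, which is the whole point of library-size scaling: depth differences vanish while relative composition is kept. The methods evaluated in the papers above exist because real data violate this simple picture in many ways.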
Correcting gene read counts detected by single-cell sequencing
I do not fully understand this, and I do not need it for now.
- SAVER: gene expression recovery for single-cell RNA sequencing. An expression recovery method for unique molecule index (UMI)-based scRNA-seq data that borrows information across genes and cells to provide accurate expression estimates for all genes.
- DeepImpute: an accurate, fast and scalable deep neural network method to impute single-cell RNA-Seq data https://www.biorxiv.org/content/early/2018/06/22/353607
- MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputing missing values and restoring the structure of large biological datasets.
- bayNorm: Bayesian gene expression recovery, imputation and normalisation for single cell RNA-sequencing data github page
- Zero-preserving imputation of scRNA-seq data using low-rank approximation
Batch effect correction
Come back here whenever it is needed:
- Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data
- Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors
- Panoramic stitching of heterogeneous single-cell transcriptomic data. Here we present Scanorama, inspired by algorithms for panorama stitching, that overcomes the limitations of existing methods to enable accurate, heterogeneous scRNA-seq data set integration.
- Fast Batch Alignment of Single Cell Transcriptomes Unifies Multiple Mouse Cell Atlases into an Integrated Landscape github link
- Scalable integration of single cell RNAseq data for batch correction and meta analysis
- liger: R package for integrating and analyzing multiple single-cell datasets
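One of the papers above corrects batch effects by matching mutual nearest neighbors. The pair-finding step on its own is easy to sketch: a cell in batch A and a cell in batch B form a pair only if each is among the other's k nearest neighbors in the opposite batch. The code below is a toy illustration under assumptions of my own (plain Euclidean distance, two made-up 2-D batches, hypothetical function names); it is not the published implementation, which also derives correction vectors from the pairs.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_nearest(query, reference, k):
    """Indices of the k cells in `reference` closest to `query`."""
    order = sorted(range(len(reference)),
                   key=lambda j: squared_distance(query, reference[j]))
    return set(order[:k])

def mutual_nearest_neighbors(batch_a, batch_b, k=1):
    """Return (i, j) pairs where a[i] and b[j] are in each other's k-NN sets."""
    nn_a_in_b = [k_nearest(cell, batch_b, k) for cell in batch_a]
    nn_b_in_a = [k_nearest(cell, batch_a, k) for cell in batch_b]
    return [(i, j)
            for i, neighbors in enumerate(nn_a_in_b)
            for j in sorted(neighbors)
            if i in nn_b_in_a[j]]

# Made-up batches: batch A contains a third cell with no counterpart in B.
batch_a = [[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]]
batch_b = [[0.5, 0.2], [5.5, 5.1]]
pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
```

In this toy case the third cell of batch A picks a neighbor in batch B, but that neighbor prefers a different cell, so no pair is formed. Requiring the match to be mutual is what keeps a population present in only one batch from being forced onto the other.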
Differential gene expression
- A discriminative learning approach to differential expression analysis for single-cell RNA-seq by Lior Pachter group.
- scde bioconductor package maintained by Jean Fan in Xiaowei Zhuang’s lab at Harvard. Need to talk to her once I get a chance.
- Presto: Fast Wilcoxon and auROC for single cell RNAseq and scATACseq data. Take a look!
- How to compare clusters with multiple samples? https://twitter.com/RoryKirchner/status/1082752967806210048 . Work in progress: https://github.com/HelenaLC/muscat by Helena from Mark Robinson lab. bioc2019 workshop http://biocworkshops2019.bioconductor.org.s3-website-us-east-1.amazonaws.com/page/muscWorkshop__vignette/ and a blog post by Valentine Svensson from Lior Pachter’s group http://www.nxn.se/valent/2019/2/15/handling-confounded-samples-for-differential-expression-in-scrna-seq-experiments
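The Presto entry above pairs the Wilcoxon test with auROC, and the two are closely related: the area under the ROC curve for "cells in the cluster versus all other cells" equals the Mann-Whitney U statistic divided by the number of cell pairs. A minimal sketch of that relationship follows; the function names and expression values are made up, and this is not how Presto itself is implemented.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def auroc_from_ranks(group, rest):
    """AUROC of `group` versus `rest` via the Mann-Whitney U statistic."""
    ranks = average_ranks(list(group) + list(rest))
    n1, n2 = len(group), len(rest)
    rank_sum = sum(ranks[:n1])
    u = rank_sum - n1 * (n1 + 1) / 2
    return u / (n1 * n2)

# Made-up expression of one gene: cells in a cluster vs. all other cells.
cluster_expr = [3.0, 5.0, 4.0, 0.0]
other_expr = [0.0, 1.0, 0.0, 2.0]
auc = auroc_from_ranks(cluster_expr, other_expr)
```

An AUROC near 1 means the gene is almost always higher inside the cluster, 0.5 means it carries no information, and values near 0 flag a gene that is depleted in the cluster. Averaging tied ranks matters for single-cell data, where many cells share a count of zero.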
Single-cell analysis software
- paper: Design and computational analysis of single-cell RNA-sequencing experiments
- paper: Power Analysis of Single Cell RNA‐Sequencing Experiments
- paper: The contribution of cell cycle to heterogeneity in single-cell RNA-seq data
- paper: Batch effects and the effective design of single-cell gene expression studies
- review: Single-cell genome sequencing: current state of the science
- Ginkgo A web tool for analyzing single-cell sequencing data.
- SingleCellExperiment bioc package Defines a S4 class for storing data from single-cell experiments. This includes specialized methods to store and retrieve spike-in information, dimensionality reduction coordinates and size factors for each cell, along with the usual metadata for genes and libraries.
- ASAP: a Web-based platform for the analysis and inter-active visualization of single-cell RNA-seq data
- Seurat is an R package designed for the analysis and visualization of single cell RNA-seq data. It contains easy-to-use implementations of commonly used analytical techniques, including the identification of highly variable genes, dimensionality reduction (PCA, ICA, t-SNE), standard unsupervised clustering algorithms (density clustering, hierarchical clustering, k-means), and the discovery of differentially expressed genes and markers.
- R package for the statistical assessment of cell state hierarchies from single-cell RNA-seq data
- Monocle Differential expression and time-series analysis for single-cell RNA-Seq and qPCR experiments.
- Single Cell Differential Expression: bioconductor package scde
- Sincera: A Computational Pipeline for Single Cell RNA-Seq Profiling Analysis. Bioconductor package will be available soon.
- MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data
- scDD: A statistical approach for identifying differential distributions in single-cell RNA-seq experiments
- Fast and accurate single-cell RNA-Seq analysis by clustering of transcript-compatibility counts by Lior Pachter et al.
- cellity: Classification of low quality cells in scRNA-seq data using R.
- bioconductor: using scran to perform basic analyses of single-cell RNA-seq data
- scater: single-cell analysis toolkit for expression with R
- Monovar: single-nucleotide variant detection in single cells
- Single-cell mRNA quantification and differential analysis with Census
- CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data
- CellView: Interactive Exploration Of High Dimensional Single Cell RNA-Seq Data
- Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing and simulation of gene regulatory networks. The Python-based implementation efficiently deals with datasets of more than one million cells.
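Cell subtype identification
The preprint attached to this page compares different technologies. Very useful.
This R package seems able to train on existing cell clusters and then annotate cells that have not yet been defined.
- MatchScore2 paper: Benchmarking Single-Cell RNA Sequencing Protocols for Cell Atlas Projects
- scMatch: a single-cell gene expression profile annotation tool using reference datasets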
Clustering analysis
- Single Cell Clustering Comparison A blog post.
- A systematic performance evaluation of clustering methods for single-cell RNA-seq data F1000 paper by Mark Robinson. tl;dr version: “SC3 and Seurat show the most favorable results”.
- Geometry of the Gene Expression Space of Individual Cells
- pcaReduce: Hierarchical Clustering of Single Cell Transcriptional Profiles.
- CountClust: Clustering and Visualizing RNA-Seq Expression Data using Grade of Membership Models. Fits grade of membership models (GoM, also known as admixture models) to cluster RNA-seq gene expression count data, identifies characteristic genes driving cluster memberships, and provides a visual summary of the cluster memberships
- FastProject: A Tool for Low-Dimensional Analysis of Single-Cell RNA-Seq Data
- SNN-Cliq Identification of cell types from single-cell transcriptomes using a novel clustering method
- Compare clusterings for single-cell sequencing bioconductor package. The goal of this package is to encourage the user to try many different clustering algorithms in one package structure. We give tools for running many different clusterings and choices of parameters. We also provide visualization to compare many different clusterings and algorithm tools to find common shared clustering patterns.
- CIDR: Ultrafast and accurate clustering through imputation for single cell RNA-Seq data
- SC3: consensus clustering of single-cell RNA-Seq data. SC3 achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach. Tests on twelve published datasets show that SC3 outperforms five existing methods while remaining scalable, as shown by the analysis of a large dataset containing 44,808 cells. Moreover, an interactive graphical implementation makes SC3 accessible to a wide audience of users, and SC3 aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells.
- GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection
- FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data
- matchSCore: Matching Single-Cell Phenotypes Across Tools and Experiments In this work we introduce matchSCore (https://github.com/elimereu/matchSCore), an approach to match cell populations fast across tools, experiments and technologies. We compared 14 computational methods and evaluated their accuracy in clustering and gene marker identification in simulated data sets.
- Cluster Headache: Comparing Clustering Tools for 10X Single Cell Sequencing Data
- The celaref (cell labelling by reference) package aims to streamline the cell-type identification step, by suggesting cluster labels on the basis of similarity to an already-characterised reference dataset, whether that’s from a similar experiment performed previously in the same lab, or from a public dataset from a similar sample.
- souporcell: Robust clustering of single cell RNAseq by genotype and ambient RNA inference without reference genotypes souporcell, a robust method to cluster cells by their genetic variants without a genotype reference and show that it outperforms existing methods on clustering accuracy, doublet detection, and genotyping across a wide range of challenging scenarios while accurately estimating the amount of ambient RNA in the sample
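Several entries above compare clustering tools against each other. One common way to put a number on how well two clusterings of the same cells agree is the adjusted Rand index (ARI). Choosing ARI here is my own assumption; the benchmarks listed above may rely on other metrics. A minimal sketch with made-up labels:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same cells")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two cells")
    # Count pairs of cells grouped together in both, in A, and in B.
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    max_index = (together_a + together_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (together_both - expected) / (max_index - expected)

# Made-up cluster labels for six cells from three hypothetical tools.
tool_one = [0, 0, 0, 1, 1, 1]
tool_two = ["b", "b", "b", "a", "a", "a"]   # same partition, renamed labels
tool_three = [0, 0, 1, 1, 2, 2]             # a different partition

ari_same = adjusted_rand_index(tool_one, tool_two)
ari_partial = adjusted_rand_index(tool_one, tool_three)
```

ARI ignores label names, so the first two tools score a perfect 1 even though one uses numbers and the other letters. Random agreement is corrected to about 0, which makes the score comparable across tools that return different numbers of clusters.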
Dimensionality reduction and clustering
- Principal Component Analysis Explained Visually
- PCA, MDS, k-means, Hierarchical clustering and heatmap. I wrote it.
- horseshoe effect from PCA Spurious structures in latent space decomposition and low-dimensional embedding methods
- also read chapter 9 of http://web.stanford.edu/class/bios221/book/Chap-MultivaHetero.html
- A tale of two heatmaps. I wrote it.
- Heatmap demystified. I wrote it.
- Cluster Analysis in R - Unsupervised machine learning very practical intro on STHDA website.
- I wrote on PCA, and heatmaps on Rpub
- A must read for clustering analysis of high-dimensional biological data: Avoiding common pitfalls when clustering biological data
- How does gene expression clustering work? A must read for clustering.
- How to read PCA plots for scRNAseq by Valentine Svensson.
paper: Outlier Preservation by Dimensionality Reduction Techniques
“MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters”
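PCA comes up repeatedly in this section, both for preserving variance and, in a later entry, for projecting new data with the PCA loadings. Below is a minimal sketch of both ideas using NumPy's SVD. The random matrices, the sizes, and the variable names are all made up for illustration, and "loadings" here means the principal axes (unit eigenvectors), which is only one of several conventions in use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 50 reference cells x 5 genes, plus 3 new cells.
reference = rng.normal(size=(50, 5))
new_cells = rng.normal(size=(3, 5))

# Center each gene using the reference means only.
gene_means = reference.mean(axis=0)
centered = reference - gene_means

# SVD of the centered matrix; the rows of vt are the principal axes.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
n_components = 2
loadings = vt[:n_components].T             # shape: genes x components

reference_scores = centered @ loadings     # PCA coordinates of reference cells
explained = (s ** 2) / np.sum(s ** 2)      # fraction of variance per component

# Project the new cells: same centering, same loadings, no refitting.
new_scores = (new_cells - gene_means) @ loadings
```

The important detail is that the new cells are centered with the reference means and multiplied by the reference loadings, so they land in the coordinate system already learned. This is what sets PCA apart from standard t-SNE, which has no such mapping for unseen cells.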
- Rtsne R package for T-SNE
- rtsne An R package for t-SNE (t-Distributed Stochastic Neighbor Embedding). A bug was in rtsne: https://gist.github.com/mikelove/74bbf5c41010ae1dc94281cface90d32
- t-SNE-Heatmaps Beta version of 1D t-SNE heatmaps to visualize expression patterns of hundreds of genes simultaneously in scRNA-seq.
- How to tune hyperparameters of tSNE For single-cell RNAseq: The optimal perplexity can be calculated from the number of cells according to the simple power law Perplexity ~ N^(1/2). Finally, the optimal number of iterations should provide the largest distance between the data points of ~100 units.
- when to use PCA instead of t-SNE https://stats.stackexchange.com/questions/238538/are-there-cases-where-pca-is-more-suitable-than-t-sne/249520#249520
- projection to new data https://twitter.com/EduEyras/status/1032215352623747072
- Interpretable dimensionality reduction of single cell transcriptome data with deep generative models
- PCA loadings can be used to project new data, e.g. from this paper Multi-stage Differentiation Defines Melanoma Subtypes with Differential Vulnerability to Drug-Induced Iron-Dependent Oxidative Stress Fig 1D.
- Sleepwalk: Walk through your embedding So, can you be sure that the visualisation you get by using t-SNE, UMAP, MDS or the like really give you a faithful representation of your data? Are the points that lie almost on top of each other really all similar? Does the large distance on your 2D representation always mean lots of dissimilarities? Our sleepwalk package for the R statistical programming environment can help you answer these questions.
- Generalizable and Scalable Visualization of Single-Cell Data Using Neural Networks. Standard methods, such as t-stochastic neighbor embedding (t-SNE), are not scalable to datasets with millions of cells and the resulting visualizations cannot be generalized to analyze new datasets. Here we introduce net-SNE, a generalizable visualization approach that trains a neural network to learn a mapping function from high-dimensional single-cell gene-expression profiles to a low-dimensional visualization.
- PHATE dimensionality reduction method paper: http://biorxiv.org/content/early/2017/03/24/120378 PHATE also uncovers and emphasizes progression and transitions (when they exist) in the data, which are often missed in other visualization-capable methods. Such patterns are especially important in biological data that contain, for example, single-cell phenotypes at different phases of differentiation, patients at different stages of disease progression, and gut microbial compositions that vary gradually between individuals, even of the same enterotype.
- Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data. Run from R: https://gist.github.com/crazyhottommy/caa5a4a4b07ee7f08f7d0649780832ef
- umapr UMAP dimensionality reduction in R
- uwot An R package implementing the UMAP dimensionality reduction method. UMAP multi-threaded.
- Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE) The FIt-SNE implementation is generally faster than UMAP when you have more than 3,000 cells. In the realm of 10,000’s of cells FIt-SNE scales at the same rate as UMAP. However, note that this is a log-log scale. Even if FIt-SNE starts scaling at the rate of UMAP, it is still consistently about 4 times faster. In other words, a dataset that takes an hour for UMAP will take 15 minutes for FIt-SNE. See the benchmark here https://nbviewer.jupyter.org/gist/vals/a138b6b13ae566403687a241712e693b by Valentine Svensson.
- Parallel opt-SNE implementation with Python wrapper. Preprint: Automated optimal parameters for T-distributed stochastic neighbor embedding improve visualization and allow analysis of large datasets
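The t-SNE tuning entry above quotes a simple power law, Perplexity ~ N^(1/2). Here is a small sketch of that rule of thumb. The function name and the upper cap are my own additions: the cap assumes a limit of 3 * perplexity <= N - 1, which is how I understand the check in Rtsne, and other implementations may enforce something different.

```python
import math

def suggested_perplexity(n_cells):
    """Rule of thumb: perplexity ~ sqrt(N), capped for very small inputs."""
    if n_cells < 5:
        raise ValueError("too few cells for a meaningful t-SNE")
    rule_of_thumb = math.sqrt(n_cells)
    # Assumed Rtsne-style limit: 3 * perplexity must not exceed N - 1.
    upper_bound = (n_cells - 1) / 3
    return min(rule_of_thumb, upper_bound)

# Print the suggestion for a few dataset sizes.
for n in (100, 10_000, 1_000_000):
    print(n, round(suggested_perplexity(n)))
```

The cap only kicks in for a handful of cells. For realistic single-cell datasets the square-root rule is what determines the value, and it is a starting point to adjust from rather than a guarantee of a good embedding.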
Regulatory networks
- Scribe: Towards inferring causal regulations with single cell dynamics-coupled measurements
- single cell gene regulatory network analysis https://github.com/aertslab/SCENIC
- Single-Cell Transcriptomics Unveils Gene Regulatory Network Plasticity
Alternative polyadenylation analysis
- We have written a Python+R pipeline called “polyApipe” for identifying alternative polyadenylation (APA) sites in 10X Genomics scRNA-seq, based on the presence of polyadenylated reads. Once sites are identified, UMIs are counted for each site and the APA state of genes in cells can be determined. Given the sparse and noisy nature of this data, we have developed an R package “weitrix” to identify principal components of variation in APA based on measurements of varying accuracy and with many missing values. We then use varimax rotation to obtain independently interpretable components. In an embryonic mouse brain dataset, we identify 8 distinct components of APA variation, and assign biological meaning to each component in terms of the genes, cell type, and cell phase.
Commonly used single-cell databases
- CellMarker: a manually curated resource of cell markers in human and mouse
- scRNAseq bioc package Gene-level counts for a collection of public scRNA-seq datasets, provided as SingleCellExperiment objects with cell- and gene-level metadata.
- human cell atlas database
- EMBL-EBI atlas
- PanglaoDB (https://panglaodb.se/) is a database for the scientific community interested in exploration of single cell RNA sequencing experiments from mouse and human. We collect and integrate data from multiple studies and present them through a unified framework.
- scRNASeqDB database, which contains 36 human single cell gene expression data sets collected from Gene Expression Omnibus (GEO)
- JingleBell: A repository of standardized single cell RNA-Seq datasets for analysis and visualization at the single cell level.
- Broad single cell portal
- The conquer (consistent quantification of external rna-seq data) repository is developed by Charlotte Soneson and Mark D Robinson at the University of Zurich, Switzerland. It is implemented in shiny and provides access to consistently processed public single-cell RNA-seq data sets.
Papers worth reading
- A single-cell molecular map of mouse gastrulation and early organogenesis
- The single-cell transcriptional landscape of mammalian organogenesis
- Comparative analysis of droplet-based ultra-high-throughput single-cell RNA-seq systems
- scRNA-seq mixology: towards better benchmarking of single cell RNA-seq protocols and analysis methods github repo
- A Single-Cell Transcriptome Atlas of the Aging Drosophila Brain
- Bias, robustness and scalability in single-cell differential expression analysis by Mark Robinson
- Cell type transcriptome atlas for the planarian Schmidtea mediterranea
- Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics
- Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis
- The dynamics of gene expression in vertebrate embryogenesis at single-cell resolution
- Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo
- Decomposing cell identity for transfer learning across cellular measurements, platforms, tissues, and species
- The contribution of cell cycle to heterogeneity in single-cell RNA-seq data
Visualization
- cellxgene An interactive explorer for single-cell transcriptomics data. Leveraging modern web development techniques to enable fast visualizations of at least 1 million cells, we hope to enable biologists and computational researchers to explore their data.
- scSVA from Aviv Regev lab: an interactive tool for big data visualization and exploration in single-cell omics. scSVA is memory efficient for more than hundreds of millions of cells, can be run locally or in a cloud, and generates high-quality figures.
- ASAP: a web-based platform for the analysis and interactive visualization of single-cell RNA-seq data
- iSEE Provides functions for creating an interactive Shiny-based graphical user interface for exploring data stored in SummarizedExperiment objects, including row- and column-level metadata. Particular attention is given to single-cell data in a SingleCellExperiment object with visualization of dimensionality reduction results.
- VISION A high-throughput and unbiased module for interpreting scRNA-seq data.
Splitting and merging data
- scMerge
- Seurat V3
- Conos: Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Single-cell RNA sequencing is often applied in study designs that include multiple individuals, conditions or tissues. To identify recurrent cell subpopulations in such heterogeneous collections, we developed Conos, an approach that relies on multiple plausible inter-sample mappings to construct a global graph connecting all measured cells. The graph enables identification of recurrent cell clusters and propagation of information between datasets in multi-sample or atlas-scale collections. Published in Nature Methods.
- scAlign Bioconductor package. a tool for alignment, integration, and rare cell identification from scRNA-seq data
Single-cell CNV analysis
- Linking transcriptional and genetic tumor heterogeneity through allele analysis of single-cell RNA-seq data. tool HoneyBADGER
Constantly evolving single-cell sequencing technologies
- Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. No isolation of single cells needed!
- Dynamics and Spatial Genomics of the Nascent Transcriptome by Intron seqFISH
- Highly Multiplexed Single-Cell RNA-seq for Defining Cell Population and Transcriptional Spaces. Blog post by Lior Pachter: The benefits of multiplexing. Need to re-read carefully.
- Three-dimensional intact-tissue sequencing of single-cell transcriptional states
Single-cell multi-omics
- Multi-omic profiling of transcriptome and DNA methylome in single nuclei with molecular partitioning
- Linking transcriptome and chromatin accessibility in nanoliter droplets for single-cell sequencing
- Simultaneous quantification of protein-DNA contacts and transcriptomes in single cells scDamID&T.
- Self-reporting transposons enable simultaneous readout of gene expression and transcription factor binding in single cells piggyBac transposase.
Single-cell allele-specific expression
- scBASE A set of tools for quantitation of allele-specific expression from scRNA-Seq data
- paper: Genomic encoding of transcriptional burst kinetics
Principles of pseudotime analysis
- Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data
- all different algorithms https://github.com/agitter/single-cell-pseudotime
- Genomic trajectories with heterogeneous genetic and environmental backgrounds
- A descriptive marker gene approach to single-cell pseudotime inference
- A collection of 50 trajectory inference methods within a common interface. Take a look at this!
- velocyto RNA abundance is a powerful indicator of the state of individual cells, but does not directly reveal dynamic processes such as cellular differentiation. Here we show that RNA velocity - the time derivative of RNA abundance - can be estimated by distinguishing unspliced and spliced mRNAs in standard single-cell RNA sequencing protocols. paper comment
- STREAM is an interactive computational pipeline for reconstructing complex cellular developmental trajectories from sc-qPCR, scRNA-seq or scATAC-seq data from Luca Pinello Lab.
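The velocyto entry above describes RNA velocity as the time derivative of RNA abundance, estimated from unspliced and spliced counts. A heavily simplified sketch of that idea for a single gene: assume the splicing rate is fixed to 1, so that ds/dt = u - gamma * s, and estimate gamma as the least-squares slope of unspliced against spliced counts. The function names and the numbers are made up, and a real pipeline does considerably more than this (for example smoothing across neighboring cells and choosing which cells enter the fit), so treat it as a toy under those assumptions.

```python
def fit_gamma(spliced, unspliced):
    """Least-squares slope through the origin for unspliced = gamma * spliced."""
    denom = sum(s * s for s in spliced)
    if denom == 0:
        raise ValueError("gene has no spliced counts")
    return sum(u * s for u, s in zip(unspliced, spliced)) / denom

def velocity(spliced, unspliced, gamma):
    """Per-cell ds/dt = u - gamma * s (splicing rate assumed to be 1)."""
    return [u - gamma * s for u, s in zip(unspliced, spliced)]

# Made-up counts for one gene across five cells.
spliced = [0.0, 2.0, 4.0, 6.0, 8.0]
unspliced = [0.0, 1.0, 2.0, 3.0, 6.0]

gamma = fit_gamma(spliced, unspliced)
v = velocity(spliced, unspliced, gamma)
```

A positive value means the cell holds more unspliced RNA than the fitted ratio predicts, which suggests the gene is being switched on; a negative value suggests it is winding down. In the toy data only the last cell stands out with a clearly positive velocity.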
Large-scale single-cell data analysis
I probably will not need this either.
- bigSCale: an analytical framework for big-scale single-cell data. github link for millions of cells (starts with a count matrix) bigScale2
- Alevin: An integrated method for dscRNA-seq quantification based on Salmon.
- How to Use Alevin with Seurat Alevin-Seurat Connection blog post
- Kallisto BUStools paper https://www.biorxiv.org/content/10.1101/673285v1
- SCope: Visualization of large-scale and high dimensional single cell data
- Scumi Summarizing single-cell RNA-sequencing data with unified molecular identifiers. scumi is a flexible Python package to process fastq files generated from different single-cell RNA-sequencing (scRNA-seq) protocols to produce a gene-cell sparse expression matrix for downstream analyses. Supported protocols: CEL-Seq2, 10x Chromium, Drop-seq, Seq-Well, inDrops, and SPLiT-seq
Keeping up with the single-cell field
The scRNA-tools website, and the paper published with it: Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database