CALISTA USER MANUAL
CALISTA: User Manual
Version 1.2.0 (1 December 2018)
Authors: Nan Papili Gao and Rudiyanto Gunawan
Institute for Chemical and Bioengineering, ETH Zurich
Contact e-mail: nanp@ethz.ch and rgunawan@buffalo.edu

Table of contents
1 Overview
2 System requirements
3 CALISTA package
4 Examples
  4.1 Example 1. iPSC differentiation into mesodermal and endodermal cells
    4.1.1 Data Import and Preprocessing
    4.1.2 Single-cell clustering
    4.1.3 Reconstruction of lineage progression
    4.1.4 Determination of transition genes
    4.1.5 Pseudotemporal ordering of cells
    4.1.6 Path analysis
  4.2 Example 2. Hematopoietic stem cell differentiation
    4.2.1 Data Import and Preprocessing
    4.2.2 Single-cell clustering
    4.2.3 Reconstruction of lineage progression
    4.2.4 Determination of transition genes
    4.2.5 Pseudotemporal ordering of cells
    4.2.6 Path analysis
  4.3 Example 3. Mouse embryonic fibroblast differentiation into neurons (Manual data import)
    4.3.1 Data Import and Preprocessing
    4.3.2 Single-cell clustering
    4.3.3 Reconstruction of lineage progression
    4.3.4 Determination of transition genes
    4.3.5 Pseudotemporal ordering of cells
  4.4 Example 4. Human embryonic stem cell differentiation into endodermal cells
    4.4.1 Data Import and Preprocessing
    4.4.2 Single-cell clustering
    4.4.3 Reconstruction of lineage progression
    4.4.4 Determination of transition genes
    4.4.5 Pseudotemporal ordering of cells
  4.5 Example 5. Running CALISTA without time or cell stage information
    4.5.1 Data Import and Preprocessing
    4.5.2 Single-cell clustering
    4.5.3 Reconstruction of lineage progression and pseudotemporal ordering of cells
  4.6 Example 6. Removing undesired clusters
    4.6.1 Data Import and Preprocessing
    4.6.2 Single-cell clustering
    4.6.3 Single-cell clustering after removing undesired clusters
    4.6.4 Reconstruction of lineage progression
    4.6.5 Determination of transition genes
    4.6.6 Pseudotemporal ordering of cells
  4.7 Running CALISTA GUI
    4.7.1 Data Import and Preprocessing
    4.7.2 Single-cell clustering
    4.7.3 Reconstruction of lineage progression
    4.7.4 Determination of transition genes
    4.7.5 Pseudotemporal ordering of cells
  4.8 Example 8. Reconstruction of developmental trajectories during zebrafish embryogenesis
    4.8.1 Data Import and Preprocessing
    4.8.2 Single-cell clustering
    4.8.3 Reconstruction of lineage progression
    4.8.4 Determination of transition genes
    4.8.5 Pseudotemporal ordering of cells
    4.8.6 Path analysis
  4.9 Example 9. Identification of mouse spinal cord neurons activity during behavior
    4.9.1 Data Import and Preprocessing
    4.9.2 Single-cell clustering
  4.10 Example 10. Analysis of peripheral blood mononuclear cells (PBMCs)
    4.10.1 Data Import and Preprocessing
    4.10.2 Single-cell clustering
5 Questions and comments

1 Overview

This user manual is for the MATLAB distribution of CALISTA (Clustering And Lineage Inference in Single Cell Transcriptional Analysis). CALISTA provides a user-friendly toolbox for the analysis of single-cell expression data. CALISTA accomplishes three major tasks:
(1) Identification of cell clusters in a cell population based on single-cell gene expression data;
(2) Reconstruction of lineage progression and identification of transition genes;
(3) Pseudotemporal ordering of cells along any given developmental path in the lineage progression.

For detailed information about CALISTA, please refer to the following manuscript:
Papili Gao N., Hartmann T., Fang T., and Gunawan R., CALISTA: Clustering and lineage inference in single-cell transcriptional analysis, bioRxiv, 2018. https://doi.org/10.1101/257550

2 System requirements

This distribution of CALISTA is written for and developed in MATLAB (http://www.mathworks.com). CALISTA has been successfully tested on MATLAB 2016b, 2017a, 2018a and 2018b.

3 CALISTA package

The CALISTA package contains the following files and folders:
1. CALISTA_USER_MANUAL.doc: this file
2. License.txt: modified BSD license for CALISTA
3. MAIN.m: CALISTA main script (use this script to run CALISTA on your own dataset)
4. MAIN_GUI.m: GUI version of CALISTA (use this script to run CALISTA_GUI on your own dataset)
5. Example scripts on how to use the CALISTA subroutines and the GUI version of CALISTA
6. Save_to_matlab.R: R script describing how to convert a dataset (especially large text files) into MATLAB files
7. The folder Two-state model parameters, containing:
   a. Parameters.mat: steady-state distribution functions of mRNA level
8. The folder subfunctions, containing the following main subroutines (and other subroutines):
   a. import_data.m: upload single-cell expression data and perform preprocessing.
   b. CALISTA_clustering_main.m: single-cell clustering in CALISTA.
   c. CALISTA_transition_main.m: infer lineage progression among cell clusters.
   d. CALISTA_transition_genes_main.m: identify the key genes in lineage progression.
   e. CALISTA_ordering_main.m: perform pseudotemporal ordering of cells.
   f. CALISTA_landscape_plotting_main.m: landscape plots of single cells in the dataset based on cell-likelihood values.
   g. CALISTA_path_main.m: perform post-analysis along developmental path(s).
9. The folder EXAMPLES, containing the single-cell expression datasets used in the examples below.
10. The folder GUI, containing the subroutines used in MAIN_GUI.m.
11. The folder SUPPLEMENTARY EXAMPLES, containing additional analyses.

For further information on running the main subroutines in CALISTA, use the MATLAB ‘help’ command followed by the function name (for example, ‘help import_data’).

4 Examples

In the following, we describe the main steps of CALISTA applied to publicly available single-cell gene expression data. For each dataset, ONLY the most important results are reported. Please refer to the file MAIN.m for an example MATLAB script of a CALISTA implementation.

4.1 Example 1. iPSC differentiation into mesodermal and endodermal cells

Analysis of RT-qPCR data of Bargaje et al. (Bargaje et al., Cell population structure prior to bifurcation predicts efficiency of directed differentiation in human induced pluripotent cells. Proc. Natl. Acad. Sci. U. S. A. 114, 2271-2276 (2017)).

4.1.1 Data Import and Preprocessing

We begin by changing the current directory in MATLAB to the CALISTA folder. Then, we run the Example_1_BARGAJE_scRT_qPCR.m script in the main folder of CALISTA and import the Bargaje dataset (available in the subfolder EXAMPLES/BARGAJE). The following are screenshots from running CALISTA in MATLAB.

4.1.2 Single-cell clustering

In this case, the number of clusters is determined using the eigengap plot. According to the eigengap plot below, we set the number of clusters to 5.
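The idea behind reading a cluster number off an eigengap plot can be illustrated with a minimal sketch. Python is used here for illustration only (CALISTA itself is MATLAB), and this is not CALISTA's implementation: the function name and the eigenvalues are hypothetical, and the sketch assumes that a list of eigenvalues has already been computed by some spectral method. It simply picks the number of clusters at the largest gap between consecutive sorted eigenvalues.

```python
def eigengap_num_clusters(eigenvalues):
    """Suggest a number of clusters from the largest eigengap.

    `eigenvalues` is assumed to be a precomputed list (hypothetical input);
    it is sorted in ascending order before the gaps are taken.
    Returns (k, gaps), where gaps[i] = vals[i + 1] - vals[i] and
    k is the 1-based position of the largest gap.
    """
    vals = sorted(eigenvalues)
    if len(vals) < 2:
        raise ValueError("need at least two eigenvalues")
    gaps = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
    # Index of the largest gap; +1 converts it to a cluster count.
    k = max(range(len(gaps)), key=gaps.__getitem__) + 1
    return k, gaps


# Hypothetical eigenvalues: five small values, then a clear jump.
example = [0.00, 0.02, 0.03, 0.05, 0.06, 0.61, 0.70, 0.74]
k, gaps = eigengap_num_clusters(example)
print("suggested number of clusters:", k)
```

With these made-up values the largest gap follows the fifth eigenvalue, so the sketch suggests five clusters. In practice the plot should still be inspected by eye, since several gaps can be of similar size.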
The following are screenshots from CALISTA single-cell clustering analysis.

[Figure: two PCA scatter plots (PC1 vs. PC2). Left panel, "Original time/cell stage info": cells colored by Time/Stage 0, 24, 36, 48, 60, 72, 96 and 120. Right panel, "Cell Clustering": cells colored by Cluster 1 to 5.]

If desired, users can remove cells from specific clusters from further analysis. In this example, we do not want to remove any clusters. Hence, we enter 0 (no cluster removal) and then 1 to proceed with lineage inference.

4.1.3 Reconstruction of lineage progression

During the lineage inference step, CALISTA automatically generates and displays a lineage graph, obtained by adding edges between pairs of clusters in order of increasing cluster distance until every cluster is connected to at least one other cluster. Subsequently, users can manually add or remove one edge at a time based on the cluster distances.

ATTENTION: to add an edge (press “p”), remove an edge (press “m”) or finalize the lineage progression graph (press “enter”), the MATLAB figure of the graph must be in the foreground and unmodified (e.g., no zooming or rotation). Note that edges are added in increasing order, and removed in decreasing order, of cluster distance.

ATTENTION: the final graph must be connected (i.e. there exists a path from any node/cluster to any other node/cluster in the graph); otherwise a warning will be returned.

[Figure: "Lineage Progression". Lineage graph plotted against COMP1, COMP2 and cluster pseudotime, with cluster distances shown on the edges. Cluster pseudotimes: cluster 1: 0.00; cluster 2: 0.50; cluster 3: 0.75; cluster 4: 1.00; cluster 5: 1.00.]

Since the transition from cluster 1 to cluster 5 is inconsistent with the capture time info (i.e.
cluster pseudotime values for clusters 1 and 5 are 0 and 1, respectively), we remove the spurious edge between clusters 1 and 5 by entering 1 and then [1 5] at the following query.

[Figure: "Lineage Progression". Lineage graph plotted against COMP1, COMP2 and cluster pseudotime. Cluster pseudotimes: cluster 1: 0.00; cluster 2: 0.50; cluster 3: 0.75; cluster 4: 1.00; cluster 5: 1.00.]

The final inferred lineage relationships are displayed below.

[Figure: "Lineage Progression". Final lineage graph plotted against PC1, PC2 and cluster pseudotime, with the same cluster pseudotimes as above.]

In addition, CALISTA provides the following.
- Cell clustering plot based on the cluster pseudotime

[Figure: four PCA scatter plots (PC1 vs. PC2), one per panel: "CALISTA cluster pseudotime 0", "CALISTA cluster pseudotime 0.5", "CALISTA cluster pseudotime 0.75" and "CALISTA cluster pseudotime 1".]

- Boxplot, mean, median entropy values calculated for each cluster

[Figure: left panel, "Boxplot for the entropy" (entropy vs. cluster); right panel, "Entropy Mean and Median" (MeanEntropy and MedianEntropy vs. cluster).]

- Plot of mean expression values for each gene based on cell cluster expression level

[Figure: grid of panels, one per gene (e.g. ACVR1B, ACVR2A, ACVRL1, ALCAM, ANF), showing the mean expression value in each cluster.]

4.1.4 Determination of transition genes

After reconstructing the lineage progression, we identify the key transition genes for any two connected clusters in the graph, based on the gene-wise likelihood difference between having the cells separately as two clusters and together as a single cluster. Larger differences in the gene-wise likelihood point to more informative genes. The transition genes are selected as those whose gene-wise likelihood differences make up more than a given percentage of the cumulative sum of the likelihood differences of all genes (the percentage is set by INPUTS.thr_transition_genes).

[Figure: bar plots of the gene-wise likelihood differences (v gjk) of the transition genes for each edge. Panels: 1-2 (7 transition genes), 2-3 (11 transition genes), 2-5 (11 transition genes) and 3-4 (8 transition genes).]

4.1.5 Pseudotemporal ordering of cells

For pseudotemporal ordering of cells, CALISTA performs maximum likelihood optimization for each cell using a linear interpolation of the cell likelihoods between any two connected clusters.
The pseudotimes of the cells are computed by linear interpolation of the cluster pseudotimes, and correspond to the maximum point of the likelihood optimization above. Cells are subsequently assigned to the edges in the lineage progression graph. The following screenshot gives the results of this cell-to-edge assignment.

[Figure: "Cell Ordering". Cells plotted against PC1, PC2 and cell ordering (pseudotime).]

4.1.6 Path analysis

To perform path-specific analysis, users can enter 1 when queried. In the following, we input two developmental paths of interest: [1 2 3 4] (mesodermal fate) and [1 2 5] (endodermal fate).

[Figure: "Lineage Progression". Final lineage graph plotted against PC1, PC2 and cluster pseudotime. Cluster pseudotimes: cluster 1: 0.00; cluster 2: 0.50; cluster 3: 0.75; cluster 4: 1.00; cluster 5: 1.00.]

For each path, the post-analysis in CALISTA generates clustergrams, moving-averaged gene expression profiles and co-expression networks for the transition genes detected previously, based on the cell orderings.

[Figure: two heat maps, "Path num 1" and "Path num 2", with source gene i on one axis and target gene j on the other, on a color scale from -1 to 1. Genes shown: BMP4, DKK1, EOMES, EVX1, FGF12, FGF8, FZD1, GATA4, GATA6, GSC, HAND1, HNFA4, KDR, KIT, LEFTY1, MESP1, MESP2, MIXL1, MYL4, NUMB, SOX17, T, TGFB2, TNNT2, WNT4.]

4.2 Example 2. Hematopoietic stem cell differentiation

Analysis of RT-qPCR data in Moignard et al., Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene expression analysis, Nat. Cell Biol. 15, 363-72 (2013).

4.2.1 Data Import and Preprocessing

We start by changing the current directory in MATLAB to the CALISTA folder. We run the Example_2_MOIGNARD_scRT_qPCR.m script in the main folder of CALISTA and load the Moignard dataset (in the subfolder EXAMPLES/MOIGNARD).

4.2.2 Single-cell clustering

Following the original publication, we set the number of clusters equal to 5. CALISTA single-cell clustering results are as follows.

[Figure: two PCA scatter plots (PC1 vs. PC2). Left panel, "Original time/cell stage info": cells colored by time/stage. Right panel, "Cell Clustering": cells colored by Cluster 1 to 5.]

In this case, we do not need to remove any clusters (we press 0 when queried). Then we proceed with further analysis (by pressing 1 when queried).

4.2.3 Reconstruction of lineage progression

During the lineage inference step, CALISTA automatically generates and displays a lineage graph, obtained by adding edges between pairs of clusters in order of increasing cluster distance until every cluster is connected to at least one other cluster. Subsequently, users can manually add or remove one edge at a time based on the cluster distances.

ATTENTION: to add an edge (press “p”), remove an edge (press “m”) or finalize the lineage progression graph (press “enter”), the MATLAB figure of the graph must be in the foreground and unmodified (e.g., no zooming or rotation). Note that edges are added in increasing order, and removed in decreasing order, of cluster distance.

ATTENTION: the final lineage progression graph must be connected (i.e. there is a path from any node/cluster to any other node/cluster in the graph); otherwise a warning will be returned.
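The graph-building rule described above can be sketched in a few lines. This is a Python illustration under stated assumptions, not CALISTA's MATLAB code: the function names and the cluster distances are hypothetical. Edges are added in order of increasing distance until every cluster has at least one edge, and a breadth-first search then checks whether the resulting graph is connected.

```python
from collections import deque


def initial_lineage_edges(clusters, distances):
    """Add edges by increasing distance until every cluster has an edge.

    `distances` maps a cluster pair (a, b) to a distance (hypothetical values).
    """
    edges, touched = [], set()
    for (a, b), _ in sorted(distances.items(), key=lambda kv: kv[1]):
        if touched >= set(clusters):  # every cluster already has an edge
            break
        edges.append((a, b))
        touched.update((a, b))
    return edges


def is_connected(clusters, edges):
    """Return True if a path exists between every pair of clusters."""
    neighbours = {c: set() for c in clusters}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, queue = {clusters[0]}, deque([clusters[0]])
    while queue:
        for nxt in neighbours[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == len(clusters)


# Hypothetical distances between five clusters (not taken from the figures).
clusters = [1, 2, 3, 4, 5]
distances = {(1, 2): 0.40, (2, 3): 0.45, (3, 4): 0.50, (2, 5): 0.55,
             (1, 5): 0.60, (1, 3): 0.90, (1, 4): 1.10, (2, 4): 0.95,
             (3, 5): 0.80, (4, 5): 1.20}
edges = initial_lineage_edges(clusters, distances)
print(edges, is_connected(clusters, edges))
```

Note that the stopping rule (every cluster has at least one edge) does not by itself guarantee a connected graph, which is why a separate connectivity check is useful after edges are added or removed by hand.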
[Figure: "Lineage Progression". Lineage graph plotted against COMP1, COMP2 and cluster pseudotime, with cluster distances shown on the edges. Cluster pseudotimes: cluster 1: 0.00; cluster 2: 0.50; cluster 3: 0.50; cluster 4: 1.00; cluster 5: 1.00.]

Here, we do not need to remove any spurious edges, and hence we enter 0 when queried. In addition, CALISTA gives (not shown):
- Cell clustering plot based on the cluster pseudotime
- Boxplot, mean, median entropy values calculated for each cluster
- Plot of mean expression values for each gene based on cell cluster expression level

4.2.4 Determination of transition genes

After reconstructing the lineage progression, we identify the key transition genes for any two connected clusters in the graph, based on the gene-wise likelihood difference between having the cells separately as two clusters and together as a single cluster. Larger differences in the gene-wise likelihood point to more informative genes. The transition genes are selected as those whose gene-wise likelihood differences make up more than a given percentage of the cumulative sum of the likelihood differences of all genes (the percentage is set by INPUTS.thr_transition_genes).

[Figure: bar plots of the gene-wise likelihood differences (v gjk) of the transition genes for each edge. Panels: 1-2 (5 transition genes), 1-3 (4 transition genes), 2-4 (4 transition genes) and 2-5 (3 transition genes).]

4.2.5 Pseudotemporal ordering of cells

For pseudotemporal ordering of cells, CALISTA performs maximum likelihood optimization for each cell using a linear interpolation of the cell likelihoods between any two connected clusters. The pseudotimes of the cells are computed by linear interpolation of the cluster pseudotimes, and correspond to the maximum point of the likelihood optimization above.
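One way to picture the ordering step is the following Python sketch. It is an interpretation, not CALISTA's implementation: it assumes that, for a single cell, the probability of the observed expression of each gene is known under the two clusters j and k of an edge, that these per-gene probabilities are interpolated linearly with a weight alpha between 0 and 1, and that the cell's log-likelihood is the sum of the logs over genes. The names, the grid search and the numbers are all hypothetical.

```python
import math


def order_cell_on_edge(p_j, p_k, t_j, t_k, steps=100):
    """Place one cell on the edge j -> k by maximum likelihood.

    p_j, p_k: per-gene probabilities of the cell's observed expression under
              cluster j and cluster k (assumed inputs, all strictly positive).
    t_j, t_k: cluster pseudotimes of the two clusters.
    Returns (alpha, pseudotime, log_likelihood) at the best grid point.
    """
    best_alpha, best_ll = 0.0, -math.inf
    for i in range(steps + 1):
        alpha = i / steps
        # Linear interpolation of the per-gene probabilities (assumption).
        ll = sum(math.log((1 - alpha) * a + alpha * b)
                 for a, b in zip(p_j, p_k))
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    # Cell pseudotime: the same interpolation applied to cluster pseudotimes.
    pseudotime = t_j + best_alpha * (t_k - t_j)
    return best_alpha, pseudotime, best_ll


# Hypothetical cell that looks like cluster j for gene 1 and cluster k for gene 2.
alpha, t_cell, ll = order_cell_on_edge([0.6, 0.1], [0.1, 0.6], t_j=0.5, t_k=1.0)
print(alpha, t_cell)
```

In this made-up example the cell sits halfway along the edge, so its pseudotime is the midpoint of the two cluster pseudotimes. Repeating the search over every edge and keeping the edge with the highest likelihood would give a cell-to-edge assignment.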
Cells are subsequently assigned to the edges in the lineage progression graph. The following screenshot gives the results of this cell-to-edge assignment.

[Figure: "Cell Ordering". Cells plotted against PC1, PC2 and cell ordering (pseudotime).]

4.2.6 Path analysis

Finally, we perform post-analysis by entering 1 when queried. Here, we input three developmental paths: [1 3], [1 2 5], and [1 2 4]. For each path, the post-analysis in CALISTA generates clustergrams, moving-averaged gene expression profiles and co-expression networks for the transition genes detected previously, based on the cell orderings (not shown).

4.3 Example 3. Mouse embryonic fibroblast differentiation into neurons (Manual data import)

Analysis of RNA-seq data in Treutlein et al., Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq, Nature 534, 391-395 (2016).

**Please unzip the file “3-TREUTLEIN_data_type_3_format_data_5_clusters_4.txt.zip” in EXAMPLES/TREUTLEIN/ before running CALISTA**

4.3.1 Data Import and Preprocessing

Again, we change the current directory in MATLAB to the CALISTA folder. Here, we run the Example_3_TREUTLEIN_scRNA_seq.m script in the main folder of CALISTA and load the Treutlein dataset (in the subfolder EXAMPLES/TREUTLEIN).

The text file containing the original dataset can be summarized as follows (preview of the first 25 rows and 12 columns):

[Screenshot: preview of the original text file]

CALISTA imports the dataset by splitting the text data from the expression (numeric) data. In particular, we define the “imported text data” as:

[Screenshot: preview of the imported text data]

and the “imported expression data” as:

[Screenshot: preview of the imported expression data]

CALISTA provides a preview and the dimensions of both the imported text data and the imported expression data. Based on the expression data preview, we set the starting and ending rows and columns for the expression values as [1 405] and [2 22525], respectively, when queried. We exclude the capture time info in the first column. We press 1 since columns refer to genes and rows refer to cells.
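The row and column ranges entered above are 1-based and inclusive, as is usual in MATLAB. The following Python sketch shows what such a selection does on a toy table; it is an illustration only, with a hypothetical helper name and made-up values, and does not reproduce the import_data routine.

```python
def select_block(table, rows, cols):
    """Select a block using 1-based, inclusive [start end] ranges."""
    (r0, r1), (c0, c1) = rows, cols
    return [row[c0 - 1:c1] for row in table[r0 - 1:r1]]


# Toy expression matrix: 3 cells (rows); column 1 holds the capture time,
# columns 2-4 hold three genes. All values are made up.
data = [
    [0, 1.2, 0.0, 3.4],
    [2, 0.8, 2.1, 0.0],
    [5, 0.0, 4.5, 1.1],
]

expression = select_block(data, rows=(1, 3), cols=(2, 4))  # skip time column
capture_time = [row[0] for row in data]                    # column 1
print(expression)
print(capture_time)
```

In the example of this section, the analogous choice is rows [1 405] and columns [2 22525] for the expression values, with the capture time read from column 1.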
We define the gene names using the text data preview, entering [6 22529] as the starting and ending columns.

We load the capture time/cell stage by pressing 1 (i.e. the time/cell stage information is in the expression data matrix) and selecting column 1 in the data matrix.

4.3.2 Single-cell clustering

In this case, the number of clusters is determined using the eigengap plot. According to the eigengap plot below, we set the number of clusters to 4. CALISTA single-cell clustering results are as follows.

[Figure: two PCA scatter plots (PC1 vs. PC2). Left panel, "Original time/cell stage info": cells colored by Time/Stage 0, 2, 5, 20 and 22. Right panel, "Cell Clustering": cells colored by Cluster 1 to 4.]

We do not need to remove any cluster (we enter 0 when queried), and continue with further analysis (by entering 1 when queried).

4.3.3 Reconstruction of lineage progression

During the lineage inference step, CALISTA automatically generates and displays a lineage graph, obtained by adding edges between pairs of clusters in order of increasing cluster distance until every cluster is connected to at least one other cluster. Subsequently, users can manually add or remove one edge at a time based on the cluster distances.

ATTENTION: to add an edge (press “p”), remove an edge (press “m”) or finalize the lineage progression graph (press “enter”), the MATLAB figure of the graph must be in the foreground and unmodified (e.g., no zooming or rotation). Note that edges are added in increasing order, and removed in decreasing order, of cluster distance.

[Figure: "Lineage Progression". Lineage graph plotted against COMP1, COMP2 and cluster pseudotime, with cluster distances shown on the edges. Cluster pseudotimes: cluster 1: 0.00; cluster 2: 0.09; cluster 3: 1.00; cluster 4: 1.00.]

ATTENTION: the final graph must be connected (i.e.
there is a path from any node/cluster to any other node/cluster in the graph); otherwise a warning will be returned.

Here, we do not need to remove any spurious edges, and hence we enter 0 when queried. In addition, CALISTA returns (not shown):
- Cell clustering plot based on the cluster pseudotime
- Boxplot, mean, median entropy values calculated for each cluster
- Plot of mean expression values for each gene based on cell cluster expression level

4.3.4 Determination of transition genes

After reconstructing the lineage progression, we identify the key transition genes for any two connected clusters in the graph (results not shown here), based on the gene-wise likelihood difference between having the cells separately as two clusters and together as a single cluster. Larger differences in the gene-wise likelihood point to more informative genes. The transition genes are selected as those whose gene-wise likelihood differences make up more than a given percentage of the cumulative sum of the likelihood differences of all genes (the percentage is set by INPUTS.thr_transition_genes).

4.3.5 Pseudotemporal ordering of cells

For pseudotemporal ordering of cells, CALISTA performs maximum likelihood optimization for each cell using a linear interpolation of the cell likelihoods between any two connected clusters. The pseudotimes of the cells are computed by linear interpolation of the cluster pseudotimes, and correspond to the maximum point of the likelihood optimization above. Cells are subsequently assigned to the edges in the lineage progression graph. The following screenshot gives the results of this cell-to-edge assignment.

[Figure: "Cell Ordering". Cells plotted against PC1, PC2 and cell ordering (pseudotime).]

4.4 Example 4. Human embryonic stem cell differentiation into endodermal cells

Analysis of RNA-seq data in Chu et al., Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm, Genome Biol. 17, 173 (2016).
**Please unzip the file “4-CHU__data_type_4_format_data_2_clusters_4.csv.zip” in EXAMPLES/CHU/ before running CALISTA**

4.4.1 Data Import and Preprocessing

We first change the current directory in MATLAB to the CALISTA folder. We edit the Example_4_CHU_scRNA_seq.m script in the main folder of CALISTA and import the Chu dataset (in the subfolder EXAMPLES/CHU):

4.4.2 Single-cell clustering

In this case, the number of clusters is determined using the eigengap plot. According to the eigengap plot below, we set the number of clusters to 4.

NOTE: CALISTA automatically returns the optimal number of clusters based on the MAXIMUM eigengap value. However, the user might choose the number of clusters to adopt based on the FIRST eigengap.

The CALISTA single-cell clustering result is shown below.

[Figure: two 3D PCA scatter plots (PC1, PC2, PC3). Left panel, "Original time/cell stage info": cells colored by Time/Stage 0, 12, 24, 36, 72 and 96. Right panel, "Cell Clustering": cells colored by Cluster 1 to 4.]

We do not need to remove any clusters (we enter 0 when queried), and continue with further analysis (by entering 1 when queried).

4.4.3 Reconstruction of lineage progression

During the lineage inference step, CALISTA automatically generates and displays a lineage graph, obtained by adding edges between pairs of clusters in order of increasing cluster distance until every cluster is connected to at least one other cluster. Subsequently, users can manually add or remove one edge at a time based on the cluster distances.

ATTENTION: to add an edge (press “p”), remove an edge (press “m”) or finalize the lineage progression graph (press “enter”), the MATLAB figure of the graph must be in the foreground and unmodified (e.g., no zooming or rotation). Note that edges are added in increasing order, and removed in decreasing order, of cluster distance.
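The ordering rule for manual edits can be sketched as follows. This Python illustration reflects one reading of the note above and is not CALISTA's code: pressing “p” is modelled as adding the shortest edge that is not yet in the graph, and pressing “m” as removing the longest edge currently in the graph. The function names and the distances are hypothetical.

```python
def add_edge(current, ranked):
    """Model of pressing "p": add the shortest edge not yet in the graph."""
    for edge in ranked:                      # ranked by increasing distance
        if edge not in current:
            return current + [edge]
    return list(current)                     # nothing left to add


def remove_edge(current, ranked):
    """Model of pressing "m": remove the longest edge in the graph."""
    for edge in reversed(ranked):            # decreasing distance
        if edge in current:
            return [e for e in current if e != edge]
    return list(current)                     # nothing to remove


# Hypothetical distances between four clusters (not taken from the figures).
distances = {(1, 2): 0.3, (2, 3): 0.5, (3, 4): 0.9,
             (2, 4): 1.2, (1, 3): 1.5, (1, 4): 2.0}
ranked = sorted(distances, key=distances.get)

graph = ranked[:3]                  # automatically generated graph
graph = add_edge(graph, ranked)     # adds (2, 4), the next shortest edge
print(graph)
graph = remove_edge(graph, ranked)  # removes (2, 4), the longest edge present
print(graph)
```

Removing a specific edge by naming its two clusters, as done with [1 5] in Example 1, is a separate operation and is not modelled here.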
[Figure: Lineage Progression graph plotted on COMP1, COMP2 and cluster pseudotime. Cluster pseudotimes: cluster 1, 0.00; cluster 2, 0.12; cluster 3, 0.38; cluster 4, 1.00.]

ATTENTION: the final graph must be connected (i.e. there is a path from any node/cluster to any other node/cluster in the graph), otherwise a warning will be returned.

Since the transition from cluster 4 to cluster 3 is inconsistent with the capture time info (i.e. the cluster pseudotime values for clusters 4 and 3 are 1 and 0.38, respectively), we add one further edge by pressing “p” and then “enter”, which produces the following lineage progression graph:

[Figure: Lineage Progression graph after adding one edge, plotted on COMP1, COMP2 and cluster pseudotime, with the same cluster pseudotimes as above.]

Based on our definition of branching point, we consider the previously inferred lineage graph to be still linear, since there is only one final cell cluster (cluster 4). Moreover, since the transition from cluster 2 to cluster 4 bypasses a cluster with intermediate pseudotime (i.e. cluster 3), we remove the spurious edge between clusters 2 and 4 by entering 1 and then [2 4] when prompted.

[Figure: Lineage Progression graph after removing the edge between clusters 2 and 4, plotted on COMP1, COMP2 and cluster pseudotime.]

The final inferred lineage relationships are shown below.
[Figure: final Lineage Progression graph plotted on PC1, PC2 and cluster pseudotime. Cluster pseudotimes: cluster 1, 0.00; cluster 2, 0.12; cluster 3, 0.38; cluster 4, 1.00.]

In addition, CALISTA returns (not shown):
- Cell clustering plot based on the cluster pseudotime
- Boxplot, mean and median entropy values calculated for each cluster
- Plot of mean expression values for each gene based on cell cluster expression level

4.4.4 Determination of transition genes

After reconstructing the lineage progression, we identify the key transition genes for any two connected clusters in the graph (results not shown here), based on the gene-wise likelihood difference between having the cells separately as two clusters and together as a single cluster. Larger differences in the gene-wise likelihood point to more informative genes. The transition genes are selected as those whose gene-wise likelihood differences make up more than a certain percentage of the cumulative sum of the likelihood differences of all genes; this percentage is set by INPUTS.thr_transition_genes.

4.4.5 Pseudotemporal ordering of cells

For pseudotemporal ordering of cells, CALISTA performs maximum likelihood optimization for each cell using a linear interpolation of the cell likelihoods between any two connected clusters. The pseudotimes of the cells are computed by linear interpolation of the cluster pseudotimes, and correspond to the maximum point of the likelihood optimization above. Cells are subsequently assigned to the edges in the lineage progression graph. The following screenshot gives the results of this cell-to-edge assignment.

[Figure: Cell Ordering, plotted on PC1 and PC2 against the cell ordering value.]

4.5 Example 5. Running CALISTA without time or cell stage information

Analysis of RT-qPCR data in Moignard et al., “Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene expression analysis”, Nat. Cell Biol. 15, 363–72 (2013).

Here, we report only the main steps of the analysis. For the complete analysis, please see Example 4.2.

4.5.1 Data Import and Preprocessing

We change the current directory in MATLAB to the CALISTA folder. We then edit the Example_5_MOIGNARD_scRT_qPCR_NO_TIME_INFO.m script in the main folder of CALISTA and load the Moignard dataset (in subfolder EXAMPLES/MOIGNARD).

4.5.2 Single-cell clustering

Following the original publication, we set the number of clusters equal to 5.

4.5.3 Reconstruction of lineage progression and pseudotemporal ordering of cells

We follow the steps outlined in the examples above to infer the lineage progression and carry out pseudotemporal ordering of single cells. Without the time or cell stage info, CALISTA is still able to recover the cluster progression based on either of the following:

a. The specification of the starting cell (e.g. cell 1).

The final inferred lineage relationships are as follows.

[Figure: Plot after cluster relabelling, on COMP1, COMP2 and cluster pseudotime. Cluster pseudotimes: cluster 1, 0.00; cluster 2, 0.50; cluster 3, 0.57; cluster 4, 0.91; cluster 5, 1.00.]

CALISTA pseudotemporal ordering gives the following outcome.

[Figure: Cell Ordering, plotted on PC1 and PC2 against the cell ordering value.]

b. The specification of a marker gene (e.g. ‘Erg’) which is downregulated (press 2).

The final inferred lineage relationships are as follows.

[Figure: Plot after cluster relabelling, on COMP1, COMP2 and cluster pseudotime. Cluster pseudotimes: cluster 1, 0.00; cluster 2, 0.50; cluster 3, 0.57; cluster 4, 0.91; cluster 5, 1.00.]

CALISTA pseudotemporal ordering of cells gives the following result.

[Figure: Cell Ordering, plotted on PC1 and PC2 against the cell ordering value.]

Without any information on time, cell stage, starting cell or marker genes, CALISTA is still able to find the topology of the lineage graph, but the edges are undirected.

The CALISTA single-cell clustering result is as follows.

[Figure: two panels, "Original time/cell stage info" (a single time/stage, 0) and "Cell Clustering" (clusters 1 to 5), plotted on PC1 and PC2.]

The final inferred lineage relationships are shown below.

[Figure: Plot after cluster relabelling, showing pseudo-stages K = 1 to 5 on COMP1 against cluster progression.]

CALISTA then performs the pseudotemporal ordering of cells as follows:

[Figure: Cell Ordering, plotted on PC1 against the cell ordering value.]

4.6 Example 6. Removing undesired clusters

Analysis of RT-qPCR data in Moignard et al., Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene expression analysis, Nat. Cell Biol. 15, 363–72 (2013).

4.6.1 Data Import and Preprocessing

We change the current directory in MATLAB to the CALISTA folder.
We run the Example_1_MOiGNARD_scRT_qPCR_GUI.m script in the main folder of CALISTA and load the Moignard dataset (in subfolder EXAMPLES/RT-qPCR).

4.6.2 Single-cell clustering

Following the original publication, we set the number of clusters equal to 5. The CALISTA single-cell clustering result is shown below.

[Figure: two panels, "Original time/cell stage info" (time/stage 1 to 3) and "Cell Clustering" (pseudo-stages K = 1 to 5), plotted on PC1/PC2 and COMP1/COMP2.]

We proceed to remove clusters 3 and 5 by entering 1 and then typing [5 3] when prompted. The indices of the cells to remove are saved in a csv file. We then edit the MAIN.m script again, setting the INPUTS as described previously except for:

INPUTS.cells_2_cut=0; % Manual removal of cells

We run MAIN.m once more from the workspace and import the Moignard dataset (in subfolder EXAMPLES/RTqPCR). We also upload the csv file containing the indices of the cells to remove.

4.6.3 Single-cell clustering after removing undesired clusters

We now set the number of clusters equal to 3 and obtain the following clustering results.

[Figure: two panels, "Original time/cell stage info" (time/stage 1 to 3) and "Cell Clustering" (pseudo-stages K = 1 to 3), plotted on PC1/PC2 and COMP1/COMP2.]

4.6.4 Reconstruction of lineage progression

We continue with the lineage inference step. During this step, CALISTA provides the minimal connected graph (with nodes = cell clusters and edges = state transitions) as the starting prediction for the developmental hierarchy.
In addition, the user can manually add or remove one edge at a time based on the cluster distance values.

ATTENTION: to add an edge (press “p”), remove an edge (press “m”) or finalize the lineage progression graph (press “enter”), the MATLAB figure of the graph must be in the foreground and must not have been modified (e.g., by zooming or rotation). Note that edges are added in increasing order of cluster distance and removed in decreasing order of cluster distance.

ATTENTION: the final graph must be connected (i.e. there exists a path from any node/cluster to any other node/cluster in the graph), otherwise a warning will be returned.

We do not need to remove spurious edges (entering 0 when prompted). The final inferred lineage relationship is shown below.

[Figure: Lineage Progression graph plotted on COMP1, COMP2 and cluster pseudotime. Cluster pseudotimes: cluster 1, 0.00; cluster 2, 0.50; cluster 3, 1.00.]

In addition, CALISTA returns (not shown):
- Cell clustering plot based on the cluster pseudotime
- Boxplot, mean and median entropy values calculated for each cluster
- Plot of mean expression values for each gene based on cell cluster expression level

4.6.5 Determination of transition genes

After reconstructing the lineage progression, we identify the key transition genes for any two connected clusters in the graph (results not shown here), based on the gene-wise likelihood difference between having the cells separately as two clusters and together as a single cluster. Larger differences in the gene-wise likelihood point to more informative genes. The transition genes are selected as those whose gene-wise likelihood differences make up more than a certain percentage of the cumulative sum of the likelihood differences of all genes; this percentage is set by INPUTS.thr_transition_genes.
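One way to read the selection rule above is sketched below. This is not CALISTA's implementation: it is a small Python illustration, the function name `select_transition_genes` and the gene names and values are hypothetical, and it assumes that genes are ranked by decreasing likelihood difference and kept until their cumulative share of the total reaches the threshold percentage (the role played by INPUTS.thr_transition_genes).

```python
def select_transition_genes(likelihood_diffs, threshold_pct):
    """Return the top-ranked genes whose cumulative likelihood difference
    reaches threshold_pct percent of the total over all genes.

    likelihood_diffs: dict mapping gene name -> non-negative likelihood difference
    threshold_pct:    percentage in (0, 100]
    """
    total = sum(likelihood_diffs.values())
    target = threshold_pct / 100.0 * total
    # Most informative genes (largest difference) first.
    ranked = sorted(likelihood_diffs.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for gene, diff in ranked:
        if running >= target:  # threshold already reached by earlier genes
            break
        selected.append(gene)
        running += diff
    return selected


# Hypothetical gene-wise likelihood differences for one pair of clusters.
diffs = {"geneA": 40.0, "geneB": 30.0, "geneC": 20.0, "geneD": 10.0}
top_half = select_transition_genes(diffs, 50)
all_genes = select_transition_genes(diffs, 100)
print(top_half, all_genes)
```

With a 50% threshold, the two genes with the largest differences already account for 70 of the total 100, so only those two are returned. A threshold of 100% returns every gene.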
4.6.6 Pseudotemporal ordering of cells

For pseudotemporal ordering of cells, CALISTA performs maximum likelihood optimization for each cell using a linear interpolation of the cell likelihoods between any two connected clusters. The pseudotimes of the cells are computed by linear interpolation of the cluster pseudotimes, and correspond to the maximum point of the likelihood optimization above. Cells are subsequently assigned to the edges in the lineage progression graph. The following screenshot gives the results of this cell-to-edge assignment.

[Figure: Cell Ordering, plotted on PC1 and PC2 against the cell ordering value.]

4.7 Running CALISTA GUI

Analysis of RT-qPCR data of Bargaje et al. (Bargaje et al., Cell population structure prior to bifurcation predicts efficiency of directed differentiation in human induced pluripotent cells. Proc. Natl. Acad. Sci. U. S. A. 114, 2271–2276 (2017)).

Here, we report only the main steps of the analysis. For the complete analysis, please see Example 4.1.

4.7.1 Data Import and Preprocessing

We begin by changing the current directory in MATLAB to the CALISTA folder. Then, we edit the Example_6_BARGAJE_scRT_qPCR_GUI.m script in the main folder of CALISTA and import the Bargaje dataset (available in the subfolder EXAMPLES/MOIGNARD). The following are screenshots from running CALISTA on MATLAB.

4.7.2 Single-cell clustering

In this case, the number of clusters is determined using the eigengap plot. According to the eigengap plot, we set the number of clusters to 5. The following are screenshots from CALISTA single-cell clustering analysis.

[Figure: two panels, "Original time/cell stage info" (time/stage 0, 24, 36, 48, 60, 72, 96 and 120) and "Cell Clustering" (clusters 1 to 5), each plotted on PC1 and PC2.]

If desired, users can remove cells from specific clusters from further analysis.
In this example, we do not want to remove any clusters. Hence, we enter 0 (no cluster removal) and then 1 to proceed with lineage inference.

4.7.3 Reconstruction of lineage progression

During the lineage inference step, CALISTA automatically generates and displays a lineage graph, obtained by adding edges between pairs of clusters in order of increasing cluster distance, until every cluster is connected to at least one other cluster. Subsequently, users can manually add or remove one edge at a time based on the cluster distances.

ATTENTION: select (or unselect) checkboxes to add (or remove) specific edges. To select (or unselect) all edges, use the “Select all” checkbox.

ATTENTION: to finalize the lineage progression (by pressing the “OK” button), the final graph must be connected (i.e. there exists a path from any node/cluster to any other node/cluster in the graph).

Since the transition from cluster 1 to cluster 5 is inconsistent with the capture time info (i.e. the cluster pseudotime values for clusters 1 and 5 are 0 and 1, respectively), we remove the spurious edge between clusters 1 and 5 by unselecting the second checkbox.

We press the “OK” button to confirm, and the final inferred lineage relationships are displayed below.

[Figure: Lineage Progression graph plotted on COMP1, COMP2 and cluster pseudotime. Cluster pseudotimes: cluster 1, 0.00; cluster 2, 0.50; cluster 3, 0.75; cluster 4, 1.00; cluster 5, 1.00.]

In addition, CALISTA returns (not shown):
- Cell clustering plot based on the cluster pseudotime
- Boxplot, mean and median entropy values calculated for each cluster
- Plot of mean expression values for each gene based on cell cluster expression level

4.7.4 Determination of transition genes

After reconstructing the lineage progression, we identify the key transition genes for any two connected clusters in the graph (results not shown here), based on the gene-wise likelihood difference between having the cells separately as two clusters and together as a single cluster. Larger differences in the gene-wise likelihood point to more informative genes. The transition genes are selected as those whose gene-wise likelihood differences make up more than a certain percentage of the cumulative sum of the likelihood differences of all genes; this percentage is set by INPUTS.thr_transition_genes.

4.7.5 Pseudotemporal ordering of cells

For pseudotemporal ordering of cells, CALISTA performs maximum likelihood optimization for each cell using a linear interpolation of the cell likelihoods between any two connected clusters. The pseudotimes of the cells are computed by linear interpolation of the cluster pseudotimes, and correspond to the maximum point of the likelihood optimization above. Cells are subsequently assigned to the edges in the lineage progression graph. The following screenshot gives the results of this cell-to-edge assignment.

4.8 Example 8. Reconstruction of developmental trajectories during zebrafish embryogenesis

Analysis of Drop-seq data of Farrell et al. (Farrell, J. A. et al. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science 360, eaar3131 (2018)).

4.8.1 Data Import and Preprocessing

We begin by changing the current directory in MATLAB to the CALISTA folder.
Then, we run the Example_7_FARRELL_scDrop_seq.m script in the main folder of CALISTA and import the Farrell dataset (available upon request due to the large file size; alternatively, run save_to_matlab.R in R to convert the original data into a MATLAB file). The following are screenshots from running CALISTA on MATLAB. CALISTA processes the data from each time point separately.

4.8.2 Single-cell clustering

We run CALISTA clustering for each time point as follows. First, the number of clusters at the final time point is determined using the eigengap plot. According to the eigengap plot, we set the number of clusters to 23. The following are screenshots from CALISTA single-cell clustering analysis. CALISTA then automatically detects the optimal number of clusters for the remaining time points.

If desired, users can remove cells from specific clusters from further analysis. In this example, we do not want to remove any clusters. Hence, we enter 0 (no cluster removal) and then 1 to proceed with lineage inference.

4.8.3 Reconstruction of lineage progression

During the lineage inference step for time series Drop-seq data, CALISTA automatically generates and displays a lineage graph, obtained by calculating the shortest path between each cluster at the final cluster progression time and the starting cluster. The inferred lineage is represented by a tree-based graph.

Here, we do not need to remove any spurious edges, and hence we enter 0 when prompted.

4.8.4 Determination of transition genes

After reconstructing the lineage progression, we identify the key transition genes for any two connected clusters in the graph, based on the gene-wise likelihood difference between having the cells separately as two clusters and together as a single cluster. Larger differences in the gene-wise likelihood point to more informative genes.
The transition genes are selected as those whose gene-wise likelihood differences make up more than a certain percentage of the cumulative sum of the likelihood differences of all genes; this percentage is set by INPUTS.thr_transition_genes.

4.8.5 Pseudotemporal ordering of cells

For pseudotemporal ordering of cells, CALISTA performs maximum likelihood optimization for each cell using a linear interpolation of the cell likelihoods between any two connected clusters. The pseudotimes of the cells are computed by linear interpolation of the cluster pseudotimes, and correspond to the maximum point of the likelihood optimization above. Cells are subsequently assigned to the edges in the lineage progression graph. The following screenshot gives the results of this cell-to-edge assignment.

4.8.6 Path analysis

To plot the expression of marker genes along each path, users can enter 1 when prompted and load the Excel file containing the list of genes of interest. In this case, the file is in EXAMPLES/FARRELL.

4.9 Example 9. Identification of mouse spinal cord neurons activity during behavior

Analysis of snRNA-seq data of Sathyamurthy et al. (Sathyamurthy, A. et al. Massively Parallel Single Nucleus Transcriptional Profiling Defines Spinal Cord Neurons and Their Activity during Behavior. Cell Rep. 22, 2216–2225 (2018)).

4.9.1 Data Import and Preprocessing

We begin by changing the current directory in MATLAB to the CALISTA folder. Then, we run the Example_8_SATHYAMURTHY_DropNc_seq.m script in the main folder of CALISTA and import the Sathyamurthy dataset (available upon request due to the large file size). The following are screenshots from running CALISTA on MATLAB.

4.9.2 Single-cell clustering

The number of clusters is determined using the eigengap plot. According to the eigengap plot below, we set the number of clusters to 9. The following are screenshots from CALISTA single-cell clustering analysis.
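For readers unfamiliar with the eigengap idea, the following is a generic Python sketch of the maximum-eigengap heuristic. It is not CALISTA's code, and this manual does not specify which matrix CALISTA takes the eigenvalues from or how it ranks the first, second and third eigengaps. The function names and eigenvalues below are hypothetical. The sketch sorts the eigenvalues in decreasing order, computes the gap between each consecutive pair, and proposes as the number of clusters the count of eigenvalues that precede the largest gap.

```python
def eigengaps(eigenvalues):
    """Gaps between consecutive eigenvalues, after sorting in decreasing order.

    Entry k (0-based) is the gap between the (k+1)-th and (k+2)-th eigenvalue.
    """
    values = sorted(eigenvalues, reverse=True)
    return [values[k] - values[k + 1] for k in range(len(values) - 1)]


def clusters_by_max_eigengap(eigenvalues):
    """Number of eigenvalues that come before the largest gap."""
    gaps = eigengaps(eigenvalues)
    largest = max(range(len(gaps)), key=gaps.__getitem__)
    return largest + 1


# Hypothetical eigenvalues: three large values, then a clear drop.
eigenvalues = [1.00, 0.97, 0.95, 0.60, 0.55, 0.52]
k = clusters_by_max_eigengap(eigenvalues)
print(k)
```

Here the largest drop occurs after the third eigenvalue, so the heuristic suggests three clusters. As the manual notes for Example 4, the user may still prefer a different candidate than the one with the maximum eigengap.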
[Figure: Eigengap values plotted against the number of clusters (0 to 20). First eigengap: 12; second eigengap: 14; third eigengap: 9.]

If desired, users can remove cells from specific clusters from further analysis. In this example, we do not want to remove any clusters. Hence, we enter 0 (no cluster removal) and then 1 to proceed with lineage inference.

We can load the list of marker genes and visualize the mean expression of each predicted cluster:

[Figure: mean marker gene expression for each predicted cluster (Neuron 1, Neuron 2, Oligo, Schwann, Meningeal, Astrocyte, Vascular, OPC and Microglia).]

4.10 Example 10. Analysis of peripheral blood mononuclear cells (PBMCs)

Analysis of Drop-seq data of Zheng et al. (Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017)).

4.10.1 Data Import and Preprocessing

We begin by changing the current directory in MATLAB to the CALISTA folder. Then, we run the Example_8_SATHYAMURTHY_DropNc_seq.m script in the main folder of CALISTA and import the Zheng dataset (available upon request due to the large file size). The following are screenshots from running CALISTA on MATLAB.

4.10.2 Single-cell clustering

Following the clustering analysis of the original publication, we set the number of clusters to 10. The following are screenshots from CALISTA single-cell clustering analysis.

If desired, users can remove cells from specific clusters from further analysis. In this example, we do not want to remove any clusters. Hence, we enter 0 (no cluster removal) and then 1 to proceed with lineage inference.

5 Questions and comments

Please address any problem or comment to: nanp@ethz.ch or rudiyant@buffalo.edu.