Kevin Shen

Oct 16, 2023

Bioinformatics Plan

This lab note describes the bioinformatics portion of our project and how it supports our goals. We are flexible about the configurations, software, and tools we use and will most likely experiment to see what works best, so the steps below should be treated as an initial plan that is subject to change. We draw a lot of inspiration from the bioinformatics work done by Tobias Messmer in his single-cell analysis of bovine cells (https://doi.org/10.3389/fnut.2023.1212196). Most of the tasks are quite general, however, and carry over to other single-cell analysis projects.

Our goals:

  1. Identify and classify the different cell types and subpopulations => useful for learning about cell composition and their relationship with phenotypic changes.

  2. Identify cell surface markers for each cell type => useful for FACS and eventual antibody development.

  3. Gain insight into how changes in gene expression and chromatin accessibility affect the proliferation, differentiation, and activity of cells => useful for optimizing growth conditions and the eventual development of serum-free medium.

  4. Create a killifish single-cell foundation model using our data => useful for extracting additional bioinformatics insights via machine learning and helping future researchers speed up their data analysis.

  5. [SECONDARY GOAL] Create a cultivated meat single-cell data atlas platform => useful for promoting the sharing of data among researchers in cultivated meat development.

Step 1: Preprocessing and Cleaning

Demultiplexing

Use Cell Ranger’s mkfastq pipeline to demultiplex the raw base call (BCL) files into FASTQ files.

Align to reference genome

Reads will be aligned to the Fundulus heteroclitus (mummichog/killifish) genome: https://useast.ensembl.org/Fundulus_heteroclitus/Info/Annotation

Quality control 

All cells must pass quality control criteria that retain only cells within 3 median absolute deviations (MADs) of the median for three metrics: number of expressed genes, total counts, and percentage of mitochondrial gene counts.

Explanation:

  • What is “Within X Median Absolute Deviations (MAD) of the Median”: The MAD is a robust measure of how spread out a set of values is: it is the median of the absolute differences between each value and the overall median. In this context, a cell is kept only if its number of expressed genes, its total counts, and its percentage of mitochondrial gene counts each fall within X MADs of the median value across all cells.

  • Why Expressed Genes: This refers to the number of unique genes detected as expressed in a single cell. Very low numbers usually indicate low data quality (for example, empty droplets or damaged cells), while unusually high numbers can indicate that two cells were captured together.

  • Why Total Counts: This represents the total number of sequencing reads or fragments mapped to genes in a single cell. It indicates how much data was collected for each cell.

  • Why Percentage of Mitochondrial Genes: A high percentage of counts coming from mitochondrial genes can indicate poor cell quality, as damaged or dying cells tend to retain proportionally more mitochondrial transcripts.
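The MAD criterion above can be sketched in a few lines of Python. This is a minimal illustration on made-up numbers, not our final pipeline: the metric names (n_genes, total_counts, pct_mito) and the toy cells are assumptions, and the MAD is used unscaled (some tools multiply it by 1.4826 to make it comparable to a standard deviation).

```python
from statistics import median

def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) = median -/+ n_mads * MAD for a list of numbers."""
    med = median(values)
    mad = median([abs(v - med) for v in values])  # unscaled MAD (assumption)
    return med - n_mads * mad, med + n_mads * mad

def passes_qc(cells, metrics=("n_genes", "total_counts", "pct_mito"), n_mads=3.0):
    """Keep only cells that fall inside the MAD bounds for every metric."""
    bounds = {m: mad_bounds([c[m] for c in cells], n_mads) for m in metrics}
    return [
        c for c in cells
        if all(bounds[m][0] <= c[m] <= bounds[m][1] for m in metrics)
    ]

# Toy data (hypothetical): five typical cells and one likely damaged cell
# with few detected genes and a high mitochondrial percentage.
cells = [
    {"id": "c1", "n_genes": 2000, "total_counts": 6000, "pct_mito": 4.0},
    {"id": "c2", "n_genes": 2100, "total_counts": 6300, "pct_mito": 5.0},
    {"id": "c3", "n_genes": 1900, "total_counts": 5800, "pct_mito": 3.5},
    {"id": "c4", "n_genes": 2050, "total_counts": 6100, "pct_mito": 4.5},
    {"id": "c5", "n_genes": 1950, "total_counts": 5900, "pct_mito": 5.5},
    {"id": "c6", "n_genes": 300, "total_counts": 700, "pct_mito": 35.0},
]
kept = passes_qc(cells)
print([c["id"] for c in kept])  # c6 falls outside the bounds and is dropped
```

Because the bounds come from the median and the MAD rather than the mean and standard deviation, a handful of extreme cells barely shifts the thresholds, which is the main reason for using this criterion.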


Normalization

The raw gene expression counts (the number of times a specific gene was sequenced in each cell) will be normalized to account for factors such as the percentage of mitochondrial gene counts, library size, number of genes, and cell cycle effects. This makes the data comparable across timepoints. Seurat’s SCTransform function (built on the sctransform package) can be used for this. We will start with its default parameters and adjust as needed.

Explanation:

  • Why Percentage of Mitochondrial Genes: A high percentage of mitochondrial gene counts can indicate poor cell quality. "Regressing out" a factor means statistically adjusting the data for its effect, which removes it as a source of unwanted variation.

  • Why Library Size: The total number of sequencing reads or fragments mapped to genes in each cell can vary. This is referred to as library size. Regressing out library size helps to correct for differences in sequencing depth between cells.

  • Why Number of Genes: Some cells might express a higher number of genes than others. This can be due to differences in cell type, activation state, or other biological factors. Regressing out the number of genes helps to account for this variation.

  • Why Cell Cycle Effects: The cell cycle is a natural process in which cells grow, divide, and replicate their DNA. Cell cycle effects can introduce variation in gene expression. By regressing out cell cycle effects, the data is adjusted to minimize the impact of cell cycle-related variations.
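To make "regressing out" concrete, the sketch below removes the linear effect of one covariate from one gene using ordinary least squares and keeps the residuals. This is a deliberately simplified stand-in: sctransform itself fits a regularized negative binomial model rather than a plain linear fit, and the toy values and names (pct_mito, log_expr) are assumptions for illustration only.

```python
from statistics import mean

def regress_out(y, x):
    """Return the residuals of y after an ordinary least-squares fit on x.

    The residuals are the part of y that the covariate x cannot explain
    linearly, which is what "regressing out x" keeps.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0  # constant covariate: nothing to remove
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Toy example (hypothetical values): log-expression of one gene that
# rises with the mitochondrial percentage of each cell.
pct_mito = [2.0, 4.0, 6.0, 8.0, 10.0]
log_expr = [1.1, 1.9, 3.2, 3.8, 5.0]
residuals = regress_out(log_expr, pct_mito)
print([round(r, 3) for r in residuals])
```

After this step the residuals no longer trend with the covariate, so downstream clustering is less likely to separate cells by mitochondrial content instead of by biology.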

Step 2: Manual data analysis

Image sources: All example images were taken from Figure 2 in  https://doi.org/10.3389/fnut.2023.1212196

Differential gene analysis

scRNA UMAP + clustering: Run UMAP dimensionality reduction at each timepoint and then cluster the cells. This will reveal the number of cell types along with their distinct gene expression profiles.

scATAC UMAP + clustering:  Run UMAP dimensionality reduction at each timepoint and then cluster the cells. This will reveal the number of cell types along with their distinct chromatin accessibility profile.

Combined scRNA+scATAC UMAP and clustering: Run the steps above, except that each cell's data vector will be the concatenation of its gene expression and chromatin accessibility data. The scATAC data then adds dimensions to the existing scRNA data, and both are factored into the UMAP dimensionality reduction.
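A minimal sketch of the concatenation idea is shown below. One practical concern is that the two modalities have very different numbers of features and value ranges, so the larger one can dominate distances in the combined space. Here each modality is L2-normalized per cell before concatenation; that scaling choice and the optional weights are our assumptions for illustration, not a settled part of the plan, and dedicated multimodal integration methods may work better.

```python
from math import sqrt

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

def combine_modalities(rna, atac, rna_weight=1.0, atac_weight=1.0):
    """Concatenate one cell's scRNA and scATAC vectors after per-modality scaling.

    The weights are hypothetical knobs for how much each modality contributes
    to distances in the combined space.
    """
    rna_part = [rna_weight * v for v in l2_normalize(rna)]
    atac_part = [atac_weight * v for v in l2_normalize(atac)]
    return rna_part + atac_part

# Toy cell: 2 gene expression values and 4 peak accessibility values.
combined = combine_modalities([3.0, 4.0], [0.0, 2.0, 0.0, 0.0])
print(combined)
```

The combined vectors, one per cell, would then be passed to the same UMAP and clustering steps used for the single-modality analyses.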

This step will give us insights into the following:

  1. How does the composition of cell types change over time? Which cell types persist in culture until timepoint 3?

  2. Do the gene expression and chromatin accessibility profiles produce different clusterings? This may indicate that certain cell types map to multiple profiles, such as one gene expression profile but two or more chromatin accessibility profiles. How does this change when gene expression and chromatin accessibility data are combined?


Regulatory element identification

We will annotate the genomic regions in the scATAC data with their regulatory elements. Combined with the scRNA data, this lets us identify genes in close proximity to accessible regions, which helps link regulatory elements to the genes they are likely to regulate.

This step will give us insights into the following:

  1. What are the regulatory elements that control gene expression? Can we identify the regulatory elements that control each gene?
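The proximity idea can be sketched as follows: each accessible region (peak) is paired with every gene whose transcription start site (TSS) lies within a fixed window on the same chromosome. The 50 kb window, the record layout, and all peak and gene names are hypothetical and chosen only for illustration. Proximity alone suggests candidate links; it does not prove regulation.

```python
def link_peaks_to_genes(peaks, genes, window=50_000):
    """Pair each peak with genes whose TSS is within `window` bp of the peak midpoint.

    Returns a list of (peak_name, gene_name, distance_in_bp) tuples.
    """
    links = []
    for peak in peaks:
        midpoint = (peak["start"] + peak["end"]) // 2
        for gene in genes:
            if gene["chrom"] != peak["chrom"]:
                continue  # only link features on the same chromosome
            distance = abs(gene["tss"] - midpoint)
            if distance <= window:
                links.append((peak["name"], gene["name"], distance))
    return links

# Toy annotation (hypothetical coordinates and names).
peaks = [
    {"name": "peak1", "chrom": "chr1", "start": 10_000, "end": 10_500},
    {"name": "peak2", "chrom": "chr1", "start": 500_000, "end": 500_400},
    {"name": "peak3", "chrom": "chr2", "start": 2_000, "end": 2_600},
]
genes = [
    {"name": "geneA", "chrom": "chr1", "tss": 12_000},
    {"name": "geneB", "chrom": "chr1", "tss": 480_000},
    {"name": "geneC", "chrom": "chr2", "tss": 900_000},
]
links = link_peaks_to_genes(peaks, genes)
print(links)
```

Candidate links found this way can then be checked against the scRNA data, for example by asking whether a peak's accessibility and the linked gene's expression change together across cells or timepoints.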


Functional enrichment analysis 

Retrieve a list of the most significantly enriched GO terms between timepoints. Enrichr, a widely used web-based tool that is also available in R as the enrichR package, can be used for this. Functional enrichment analysis takes a list of differentially expressed genes (up-regulated and down-regulated) and finds the terms/processes that are most significantly associated with the up-regulation and down-regulation (the top 5 terms in the example image).

This step will give us insights into the following:

  1. What biological processes or activities are overrepresented or underrepresented at each timepoint? 

  2. Do different cell types have different lists of significantly enriched GO terms?

  3. Do changes in the GO terms correspond to phenotypical changes of the cells in culture?
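Over-representation analysis of this kind typically rests on a hypergeometric (one-sided Fisher's exact) test: given how many genes in the background carry a GO term, how surprising is the overlap with our differentially expressed gene list? The sketch below computes that p-value with the standard library. It illustrates the statistic only; it does not reproduce Enrichr's own scoring or multiple-testing correction, and the numbers are made up.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_term, n_de, n_overlap):
    """One-sided p-value P(overlap >= n_overlap).

    n_background: genes in the background set
    n_term:       background genes annotated with the GO term
    n_de:         differentially expressed genes drawn from the background
    n_overlap:    differentially expressed genes annotated with the term
    """
    max_overlap = min(n_term, n_de)
    total = comb(n_background, n_de)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_de - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Toy numbers (hypothetical): 20 background genes, 5 carry the term,
# 4 genes are differentially expressed and 3 of those carry the term.
p_value = hypergeom_enrichment_p(n_background=20, n_term=5, n_de=4, n_overlap=3)
print(round(p_value, 4))
```

In a real analysis this test is repeated for every GO term, so the resulting p-values need to be corrected for multiple testing before terms are ranked.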


Surface Receptor Analysis

Identify the expression of surface receptors in each different cell type at each timepoint. Surface receptors can be identified by filtering for protein-coding genes that are located in the plasma membrane. This can be visualized using gene expression plots or violin plots. Finding the receptors with the highest expression levels for each cell type will be useful for eventual physical cell identification and physical sorting.
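A minimal sketch of this filtering and ranking step is shown below. It assumes we already have a set of genes annotated as plasma membrane proteins (for example from a GO cellular component annotation) and a cell type label per cell from the clustering step. All gene names, cell types, and expression values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def top_surface_markers(expression, cell_types, surface_genes, top_n=2):
    """Rank surface genes by mean expression within each cell type.

    expression:    {cell_id: {gene: expression value}}
    cell_types:    {cell_id: cell type label}
    surface_genes: genes annotated as plasma membrane proteins (assumed given)
    """
    per_type = defaultdict(lambda: defaultdict(list))
    for cell_id, profile in expression.items():
        for gene in surface_genes:
            # Genes missing from a cell's profile count as not detected (0.0).
            per_type[cell_types[cell_id]][gene].append(profile.get(gene, 0.0))
    ranked = {}
    for cell_type, genes in per_type.items():
        means = {gene: mean(values) for gene, values in genes.items()}
        # Highest mean first; ties broken alphabetically for reproducibility.
        ranked[cell_type] = sorted(means, key=lambda g: (-means[g], g))[:top_n]
    return ranked

# Toy data (hypothetical): "tf1" is not a surface gene and is ignored.
expression = {
    "c1": {"recA": 5.0, "recB": 1.0, "tf1": 9.0},
    "c2": {"recA": 3.0, "recB": 2.0},
    "c3": {"recA": 0.5, "recB": 4.0},
    "c4": {"recB": 6.0},
}
cell_types = {"c1": "fibroblastic", "c2": "fibroblastic",
              "c3": "myoblastic", "c4": "myoblastic"}
surface_genes = {"recA", "recB"}
ranked = top_surface_markers(expression, cell_types, surface_genes, top_n=1)
print(ranked)
```

For sorting purposes, a receptor that is high in one cell type and low in all others is more useful than one that is simply high everywhere, so a practical version would likely also score specificity rather than mean expression alone.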

 

A common method for isolating cells is fluorescence-activated cell sorting (FACS). In a FACS panel, fluorescently labeled antibodies bind to specific cell surface receptors. Engineering these antibodies first requires knowing which cell surface receptors exist on different fish cells, knowledge that will remain missing until more data are generated and analyzed.

This step will give us insights into the following:

  1. Which receptors can we use for flow cytometry staining and a subsequent fluorescence-activated cell sorting (FACS) panel?


Killifish-specific Analysis 

At this point, we will have most likely identified numerous cell types. We also know that the KFE-5 killifish cell line consists of two core phenotypes: mononucleated fibroblastic cells, and elongated myoblastic cells with the capability to differentiate into myocytic cells. We essentially would like to run an analysis to identify the relationship between the different cell types and the two core phenotypes. 

This step will give us insights into the following:

  1. Are certain cell types associated with a certain phenotype? Or perhaps there is no relationship?

  2. How do cells differentiate into the different phenotypes and can any of the information we gathered in the previous analysis steps help us here?

Step 3: Building a foundational single-cell omics model

One of the goals of generating single-cell data is the development of a foundational single-cell omics model for killifish and eventually for other species we may work with. We intend to start with a model that uses the same architecture and training procedure as scGPT (https://www.biorxiv.org/content/10.1101/2023.04.30.538439v2.full.pdf), a human single-cell foundation model trained on over 33 million cells. Because we won’t have this many cells in our killifish datasets, we’ll explore two options: training a model from scratch on our smaller dataset, or using transfer learning to fine-tune a human scGPT model on killifish data.

scGPT achieves great performance on many downstream tasks such as cell-type annotation, multi-batch integration, multi-omic integration, genetic perturbation prediction, and gene network inference. The following will especially be of interest to us:

  1. Cell type classification: UMAP visualization of scGPT’s cell embeddings to reveal its classification of cell types. We can then compare its results to our step 2 manual analysis pipeline.

  2. Multi-Omic Integration: scGPT can learn joint embeddings for the data from both scRNA/transcriptomic and scATAC/epigenomic data that can be used for downstream tasks.

  3. Gene regulatory network: Create a unified network of gene-gene interactions, regulatory elements interactions, perturbations, pathways, functionally related genes, and gene activity across different states/timepoints.
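For item 1 above, one way to put a number on the comparison between the model's cell groupings and our Step 2 clustering is the adjusted Rand index (ARI), which is 1.0 for identical partitions and close to 0 for agreement at chance level. Choosing ARI is our assumption here, not something prescribed by scGPT, and the labels below are made up. The sketch uses only the standard library.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_a)
    if n < 2:
        return 1.0  # fewer than two cells: the partitions cannot disagree
    # Count pairs of cells grouped together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every cell in a single cluster
    return (together_both - expected) / (maximum - expected)

# Toy labels (hypothetical): manual Step 2 annotation vs. model-derived clusters.
manual = ["fibro", "fibro", "fibro", "myo", "myo", "myo"]
model_same = [0, 0, 0, 1, 1, 1]      # same grouping, different label names
model_shifted = [0, 0, 1, 1, 1, 1]   # one cell assigned to the other group
print(adjusted_rand_index(manual, model_same))
print(round(adjusted_rand_index(manual, model_shifted), 4))
```

ARI ignores the label names themselves, so it works even when the model's clusters are numbered and the manual annotation uses cell type names.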

We have access to the SHARCNET high performance computing cluster which gives us the ability to use GPUs to train our machine learning model.

We hope that having a foundational single-cell model for killifish will help us in automating the data analysis of new data. It will also give us the technical knowledge to create future similar models for other fish species. By open-sourcing our model along with the datasets, we hope to create a repository of these models that researchers can use to understand fish biology. Having this model available will speed up their data analysis pipeline and help them gain more insights from their own single-cell multiomic data which will ultimately speed up scientific development in cultivated seafood.

Step 4: Creation of a cultivated meat single-cell RNA atlas platform

Given the lack of publicly available data in the space, there currently is no centralized ‘cultivated meat atlas’ resource, where researchers can easily upload and analyze their data, along with all other existing data in the field. The idea of creating such a platform is one that has been met with very strong enthusiasm from our mentors/collaborators, especially Tobias Messmer of Mosa Meat. Tobias’s research has already produced many single-cell RNA transcriptomic datasets that he’s willing to share contingent on the creation of such a platform, which would pair nicely with the fish-related datasets we will create. Many people in the cultivated meat space have also expressed informal interest in a data-sharing initiative through an organized well-maintained platform that could benefit alternative protein research with data that otherwise would remain private. If created, we plan to upload all our datasets to this platform.


About This Project

Cultivated seafood research lags behind in large part due to limited data, especially on a single-cell level. Our project aims to fill this gap by generating transcriptomic & epigenomic data using the KFE-5 (killifish) cell line. We'll evaluate variability in long-term proliferation/differentiation potential, and use this data and subsequent bioinformatics analysis to lay the groundwork for optimizing cell line performance, serum free media development, and antibody development.
