CCM Colloquium: Robert Patro

3rd Floor Classroom/3-Flatiron Institute (162 5th Avenue)

Description

Presenter: Rob Patro (University of Maryland)

Title: Counting isn’t easy: Computational successes and inferential challenges in the processing of single-cell RNA-seq data

Abstract: Single-cell RNA sequencing technologies make it possible to quantify gene expression
profiles at the resolution of individual cells, and across thousands to millions of cells per
experiment. The rapid growth of high-throughput single-cell and single-nucleus RNA sequencing
technologies has produced a wealth of such data over the past few years, and the experiments
conducted using these technologies continue to increase in both number and scale. This scale,
and the distinctive characteristics of these data, necessitate the development of new
computational methods to accurately and efficiently quantify single-cell and single-nucleus
RNA-seq data into count matrices that constitute the input to downstream analyses.
I will describe alevin-fry, a fast, accurate, and memory-frugal framework for quantifying single-
cell and single-nucleus RNA-seq data that we have developed, as well as simpleaf, an
“augmented execution context” that simplifies running analyses with alevin-fry and that aids in
data provenance tracking and reproducibility. I will discuss computational aspects of alevin-fry,
as well as some of our prior work that enables its efficiency. I will show how alevin-fry can be
effectively used to quantify single-cell and single-nucleus RNA-seq data, and how the spliced
and unspliced molecule quantification required as input for RNA velocity analyses can be
extracted from the same preprocessed data used to generate regular gene expression count
matrices.
I will also touch upon some computational and data processing challenges that arise in the
accurate processing of single-cell data, and our ongoing work to improve these methods and
tools. I will discuss how sequenced fragments, and their associated UMIs, can appear as
ambiguous with respect to the splicing status of the molecule from which they are drawn, and
describe our initial work on the modeling, inference, and resolution of this splicing
status, as well as how this inferred information can help to recover the origin of gene-
ambiguous reads.
Finally, I’ll discuss some “downstream” challenges and observations that arise in common
processing pipelines for single-cell data. Specifically, I will describe our investigation into the
potential origin of relatively common “off-target” reads, how they may be explained, and how
they may be turned to productive use in downstream processing. These observations and
results suggest that, owing to several unique technological and experimental characteristics of
single-cell technologies, further exploration and consideration are still needed by the community
to determine best practices for single-cell data processing. We suggest some directions for such
exploration and development. This is joint work with Dongze He, Yuan Gao, Natalia Quintana-
Parrilla, Skylar Spencer Chan, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson and Avi
Srivastava.
