Hello everyone,
Bonneau lab group meeting will take place on
Monday, May 3, 2021
10:00am ET
Virtual or in-person, 162, 7th Floor Classroom
Presenter:
Daniel Berenberg, New York University
Language modeling for DNA sequences
Next-generation sequencing efforts have resulted in massive deposits of reliable whole-genome data, namely coding and non-coding nucleotide sequences. Ultra-deep language models, approaching (and exceeding!) a billion parameters, have shown unprecedented performance on a variety of input domains, including natural-language text and protein sequences. Critically, the representations learned after training have, in both domains, been regarded as general-purpose featurizations, extending to state-of-the-art performance on property-prediction tasks such as protein function classification, structural alignment, and binding affinity. In this work, we intend to leverage the abundance of genomic sequence data and the power of large language models to develop meaningful feature extractors for nucleotide sequences. If fruitful, this effort will yield a so-called 'neural metagenomics pipeline', allowing biologists to analyze genomic samples and obtain valuable information quickly and automatically from an entirely data-driven perspective. In this talk, I will describe our progress, current challenges, and future work.
This is a joint project with Tymor Hamamsy, advised by Professors Rich Bonneau and Kyunghyun Cho (NYU).