Speaker: Gene Myers, Emeritus Director, Max Planck Institute for Molecular Cell Biology and Genetics, Dresden
Adjunct Faculty, Okinawa Institute of Science and Technology
Honorary Faculty, Welcome Sanger Institute
Topic: Sorting for Speed in Bioinformatics
With modern multi-layered cache architectures, the most significant digit (MSD) radix sort has proven in the last decade to be very efficient in practice as it is (1) cache-coherent, (2) in-place, and (3) progressively more memory localized. Using virtual memory mapping it can even efficiently externally sort data sets too large to fit in a main memory. We briefly review this method and then show that many bioinformatics tasks can be formulated as a series of MSD radix sorts and that by so doing, one outperforms solutions involving hash tables or indexes that are inherently not cache coherent. Specifically, we look at the problems of counting k-mers, aligning two genomes, and building an A-Bruijn graph of an assembly data set.
Choose timezone
Your profile timezone: