Burrows-wheeler transform in secondary memory


Autoria(s): Pereira, Sérgio Miguel Cachucho
Contribuinte(s)

Russo, Luís

Data(s)

02/03/2011

02/03/2011

2010

Resumo

Master’s Thesis in Computer Engineering

A suffix array is an index, a data structure that allows searching for sequences of characters. Such structures are of key importance for a large set of problems related to sequences of characters. An especially important use of suffix arrays is to compute the Burrows-Wheeler Transform, which can be used for compressing text. This procedure is the base of the UNIX utility bzip2. The Burrows-Wheeler transform is a key step in the construction of more sophisticated indexes. For large sequences of characters, such as DNA sequences of about 10 GB, it is not possible to calculate the Burrows-Wheeler transform in an average computer without using secondary memory. In this dissertation we will study the state-of-the-art algorithms to construct the Burrows-Wheeler transform in secondary memory. Based on this research we propose an algorithm and compare it against the previous ones to determine its relative performance. Our algorithm is based on the classical external Heapsort. The novelty lies in a heap that is especially designed for suffix arrays, which we call String Heap. This algorithm aims to be space-conscious, while trying to handle the disk access dominance over main memory access. We divide our solution in two parts, splitting and merging suffix arrays, the latter is the main application of the String Heap. The merging part produces the BWT, as a side effect of merging a set of partial suffix arrays of a text. We also compare its performance against the other algorithms. We also study a second version of the algorithm that accesses secondary memory in blocks.

Identificador

http://hdl.handle.net/10362/5308

Idioma(s)

eng

Publicador

Faculdade de Ciências e Tecnologia

Direitos

openAccess

Palavras-Chave #Suffix arrays #External sorting #Heap #Pattern matching #Indexes
Tipo

masterThesis