Blast on DB

Questions

What is it?

The Blast on DB project aims to run the BLAST algorithm inside a DBMS.

Can you explain it better? What is the problem that you want to solve?

Bio-molecular sequences, like DNA, RNA and proteins, are usually stored in raw files, in formats like FASTA or BLAST's own format. The FASTA format doesn't provide any extra information about the sequences, such as indexes for searching or semantic information, so a collection of FASTA files is more a data bank than a database.
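To make this limitation concrete, here is a minimal Python sketch of reading FASTA records. The names `parse_fasta` and `find_by_id` and the sample records are illustrative assumptions, not part of the project. A record is just a `>` header line followed by sequence lines, so finding one sequence means scanning the records in order.

```python
from typing import Iterator, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_by_id(text: str, seq_id: str) -> Optional[str]:
    """Linear scan: plain FASTA has no index, so a lookup reads record by record."""
    for header, sequence in parse_fasta(text):
        parts = header.split()
        # By convention the first token of the header is the identifier.
        if parts and parts[0] == seq_id:
            return sequence
    return None


# Hypothetical sample data, used only for illustration.
SAMPLE = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2 another hypothetical record
GGGTTTAA
"""

if __name__ == "__main__":
    print(find_by_id(SAMPLE, "seq2"))
```

Anything beyond the identifier and the free-text description (organism, annotations, relations between sequences) has to live somewhere else, which is what a database schema provides.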

A [good] approach is to use a good database schema, like GUS, to store the sequences and their information, but a problem occurs when a similarity search is needed. The sequences must be dumped into a temporary file, this temporary file must be converted to the BLAST format and, only then, the search can be performed. After this process, the results must be saved back into the database. This process wastes time [and money].
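The round trip described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `sqlite3` stands in for the real DBMS, the table `sequences` and its columns are hypothetical, and the external tools are assumed to be the legacy NCBI programs `formatdb` and `blastall` (the text above does not name them). The commands are only built, not executed.

```python
import os
import sqlite3
import tempfile


def dump_to_fasta(conn, path):
    """Step 1: dump every stored sequence into a temporary FASTA file."""
    count = 0
    with open(path, "w") as out:
        rows = conn.execute("SELECT id, residues FROM sequences ORDER BY id")
        for seq_id, residues in rows:
            out.write(">%s\n%s\n" % (seq_id, residues))
            count += 1
    return count


def external_commands(fasta_path, query_path, result_path):
    """Steps 2 and 3: command lines to format the dump and to search it.

    Assumption: the legacy NCBI tools formatdb and blastall are installed.
    The commands are returned as lists and are not run here.
    """
    format_cmd = ["formatdb", "-i", fasta_path, "-p", "F"]  # F = nucleotide data
    search_cmd = ["blastall", "-p", "blastn", "-d", fasta_path,
                  "-i", query_path, "-o", result_path]
    return format_cmd, search_cmd


if __name__ == "__main__":
    # sqlite3 is a stand-in for the real DBMS; the schema is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequences (id TEXT PRIMARY KEY, residues TEXT)")
    conn.executemany("INSERT INTO sequences VALUES (?, ?)",
                     [("seq1", "ACGTACGT"), ("seq2", "GGGTTTAA")])
    fasta_path = os.path.join(tempfile.mkdtemp(), "dump.fasta")
    print(dump_to_fasta(conn, fasta_path), "sequences dumped")
    for cmd in external_commands(fasta_path, "query.fasta", "hits.txt"):
        print(" ".join(cmd))
    # Step 4 (not shown): parse the result file and INSERT the hits back.
```

Every search pays for the dump, the formatting and the reload of results, and that is the overhead this project wants to remove.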

Which DBMS do you want to use?

At present, for me, the best choice is the PostgreSQL DBMS. This is because:

Is it a definitive conclusion?

No. At the moment PostgreSQL seems to be the best option, but I may switch or, preferably, add support for other DBMSs, like MySQL.

Do you want to hack PostgreSQL internals?

No. For the purpose of this project, it is better to write an extension for PostgreSQL than to mess with its core.

This is because the main functionality of this project targets a very specific user group, and the source code can be packaged as a module. This approach is much better for coding, debugging and maintenance.

Do you want to implement the Blast algorithm from scratch?

No. This project wants to reuse an existing BLAST implementation. The two main BLAST implementations are NCBI BLAST and WU BLAST.

WU BLAST has a licensing problem. Its license allows neither free commercial use nor changes to the source code: "WU BLAST 2.0 is copyrighted and may not be sold, redistributed or modified in any form or by any means, without prior express written consent from the Office of Technology Management at Washington University in St. Louis."

NCBI BLAST is apparently open source, but I don't know the details of its license. The source is good, very well organized and tested, but it is hard to understand well, and it is hard to cut out the "important" part to create a plugin.

A good choice is FSA-BLAST. This BLAST implementation uses the BSD license [CHEER!], its authors say that it is faster than NCBI BLAST, and its source is small and concise. I have already tried it and the results are really good! This implementation is my major candidate!

So, the short answer is: no, I want to use the FSA-BLAST source.

Do you know of other similar projects?

Yes. I know of BioPostgres, which has a similar objective. In its own words: "BioPostgres is a collection of modules that extend PostgreSQL for Computational Biology. It implements new datatypes (graph, range, location, etc) with query operators, index and related tools for large-scale analysis."

The main difference between our project and BioPostgres is that we want to include BLAST in the DBMS, while they want to build modules for computational biology in PostgreSQL. The two projects have different objectives, so they can [or must] work together.

What have you already done? (Speak less and code more!)

Nothing. Actually, some test sources, but nothing too cool. I am reading the PostgreSQL manuals, studying source examples and choosing a good BLAST implementation to use.

This all looks cool! How can I contact you?

I don't have a better way to communicate, so contact me at my personal email, felipe.albrecht(at)gmail.com, or through the project page at SourceForge.


Links and sources

PostgreSQL Hacking

Documentation of how to extend SQL in PostgreSQL

Database Internals Presentation

A Tour of PostgreSQL Internals

Introduction to Hacking PostgreSQL

Similar projects

BioPostgres

BLAST information

FSA-BLAST

BLAST Algorithm


Felipe Fernandes Albrecht - felipe.albrecht(@)gmail.com

Last update: 06/16/2007


