How to Use the NCBI BLAST Program on the Command Line




The goal of this tutorial is to run you through a demonstration of the command line, which you may not have seen or used much before.
First, we need some data and an R script to visualize the BLAST results later. Let’s grab the mouse and zebrafish RefSeq protein data sets from NCBI and put them in our home directory. We’ll use curl, a Unix command for downloading files. The files originally came from the NCBI FTP site: ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot.
curl -o mouse.1.protein.faa.gz -L https://osf.io/v6j9x/download
curl -o mouse.2.protein.faa.gz -L https://osf.io/j2qxk/download
curl -o zebrafish.1.protein.faa.gz -L https://osf.io/68mgf/download
curl -o blastviz.R -L https://osf.io/e548g/download
To look at the files in your current directory:
ls -l
All three of the sequence files are FASTA protein files (that’s what the .faa suggests) that are compressed with gzip (that’s what the .gz means).
Uncompress them:
gunzip *.faa.gz
The gunzip command expands files that were compressed with gzip (gzip itself does the compressing). It recognizes files with extensions such as .gz, .z and _z, among others, and replaces each compressed file with its expanded version.
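To see what gunzip does without touching the downloaded files, here is a small round-trip sketch on a throwaway file. The file name demo.faa, its contents, and the scratch directory are made up for illustration.

```shell
# Use a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# A made-up one-record FASTA file (hypothetical content).
printf '>demo_protein\nMKTAYIAKQR\n' > "$workdir/demo.faa"

gzip "$workdir/demo.faa"        # replaces demo.faa with demo.faa.gz
gzip -t "$workdir/demo.faa.gz"  # checks integrity; prints nothing on success
gunzip "$workdir/demo.faa.gz"   # restores demo.faa and removes the .gz

restored=$(cat "$workdir/demo.faa")
echo "$restored"
```

If you would rather keep the compressed copy, gzip -dc file.gz > file writes the expanded data to a new file and leaves the .gz in place.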
And let’s look at the first few sequences in the file:
head mouse.1.protein.faa
These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s a text file, containing records; each record starts with a line beginning with a ‘>’ and then contains one or more lines of sequence text.
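The record structure described above can be sketched with a tiny, made-up FASTA file. The identifiers and sequences below are hypothetical; the awk program simply adds up the sequence lines that follow each ‘>’ header, which shows that one record may span several lines.

```shell
# A made-up FASTA file with two records (IDs and sequences are hypothetical).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 example protein one
MKTAYIAKQRQISFVKSHFSRQ
LEERLGLIEVQ
>seq2 example protein two
MSDNELKQ
EOF

# For each record, print its ID (first word after '>') and total length.
lengths=$(awk '
    /^>/ { if (id != "") print id, len; id = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (id != "") print id, len }
' "$fasta")
echo "$lengths"
```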
Let’s take those first two sequences and save them to a file. We’ll do this using output redirection with ‘>’, which says “take all the output and put it into this file here.”
head -11 mouse.1.protein.faa > mm-first.fa
head is a program on Unix and Unix-like operating systems used to display the beginning of a text file or piped data. Here, -11 asks for the first 11 lines, which cover the first two records.
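The blastp commands in the rest of this tutorial pass -db zebrafish.1.protein.faa, which only works once a BLAST database has been built from that FASTA file. That step does not appear above, so the following is a sketch of what it would presumably look like, assuming the BLAST+ tools are installed and the tutorial files are in the current directory. The output name mm-first.x.zebrafish.txt is an assumption, chosen to match the naming used later.

```shell
# Assumes BLAST+ is on PATH and the tutorial files are in the current
# directory; otherwise the step is skipped with a message.
db_fasta=zebrafish.1.protein.faa
query=mm-first.fa

if command -v makeblastdb >/dev/null 2>&1 && [ -f "$db_fasta" ] && [ -f "$query" ]; then
    # Index the zebrafish proteins as a protein database.
    makeblastdb -in "$db_fasta" -dbtype prot
    # Search the first mouse sequences against it and save the report.
    blastp -query "$query" -db "$db_fasta" -out mm-first.x.zebrafish.txt
    blast_status=done
else
    echo "BLAST+ or input files not found; skipping" >&2
    blast_status=skipped
fi
```

The report can then be paged with less mm-first.x.zebrafish.txt; type ‘q’ to leave the pager.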
Let’s do some more sequences (this one will take a little longer to run):
head -498 mouse.1.protein.faa > mm-second.fa
blastp -query mm-second.fa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.txt
This will compare the first 96 sequences. You can look at the output file with:
less mm-second.x.zebrafish.txt
The less command is a Linux terminal pager that shows a file’s contents one screen at a time. It is useful when dealing with a large text file because it doesn’t load the entire file but accesses it page by page, resulting in fast loading speeds.
(and again, type ‘q’ to get out of paging mode.)
Note:
- you can copy/paste multiple commands at a time, and they will execute in order;
- why did it take longer to BLAST mm-second.fa than mm-first.fa?
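One way to explore the second question is to count how many sequences each query file holds, since BLAST searches every query sequence against the database. A minimal sketch using two made-up files (small.fa and large.fa are hypothetical stand-ins for mm-first.fa and mm-second.fa):

```shell
workdir=$(mktemp -d)

# Two made-up query files: one with 2 records, one with 5.
printf '>a\nMKT\n>b\nMSD\n' > "$workdir/small.fa"
printf '>a\nMKT\n>b\nMSD\n>c\nMLV\n>d\nMQE\n>e\nMRA\n' > "$workdir/large.fa"

# Each record has exactly one header line starting with '>'.
n_small=$(grep -c '^>' "$workdir/small.fa")
n_large=$(grep -c '^>' "$workdir/large.fa")
echo "small.fa: $n_small sequences, large.fa: $n_large sequences"
```

Running grep -c '^>' on mm-first.fa and mm-second.fa gives the real counts for the tutorial files.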
Last, but not least, let’s generate a more machine-readable, tab-separated version of that last file:
blastp -query mm-second.fa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.tsv -outfmt 6
See this link for a description of the possible BLAST output table formats.
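With -outfmt 6 and no extra column specifiers, each line is one hit with twelve tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. The sketch below works on made-up rows in that layout (the IDs and numbers are hypothetical, and the 1e-5 cutoff is only an illustrative choice) to show how such a table can be filtered and summarized with standard tools.

```shell
# Made-up results in the default 12-column -outfmt 6 layout.
tsv=$(mktemp)
printf 'q1\ts1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n'  > "$tsv"
printf 'q1\ts2\t40.0\t80\t48\t2\t5\t84\t3\t82\t0.5\t30.1\n'     >> "$tsv"
printf 'q2\ts3\t75.5\t120\t29\t1\t1\t120\t4\t123\t2e-30\t150\n' >> "$tsv"

# Keep hits whose e-value (column 11) is below 1e-5.
significant=$(awk -F '\t' '$11 + 0 < 1e-5 { print $1, $2, $11 }' "$tsv")
echo "$significant"

# Best hit per query: sort by query, then by bit score (column 12)
# descending, and keep the first line seen for each query.
tab=$(printf '\t')
best=$(sort -t "$tab" -k1,1 -k12,12nr "$tsv" | awk -F '\t' '!seen[$1]++ { print $1, $2 }')
echo "$best"
```

The same commands should work on mm-second.x.zebrafish.tsv by replacing "$tsv" with that file name.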
Now we’ll run an R script to visualize the BLAST results:
Rscript blastviz.R
A PDF will be generated with the results. We can view it by clicking the Folder icon at the left of the screen and then double-clicking the file at the top to open the PDF.