Guide on the Side: NCBI BLAST (Part A): Identifying Sequences

Introduction: BLAST

NCBI BLAST allows you to input a DNA, RNA, or protein (amino acid) sequence and find sequences in the database that are identical or similar to it.

To get to BLAST from the NCBI home page, click BLAST from the Popular Resources menu bar on the right of the page.

You can also get to BLAST directly by going to http://blast.ncbi.nlm.nih.gov/

For this simple exercise we will give you a nucleotide sequence to identify. Click Nucleotide BLAST on the left of the page.

Identifying Sequences with BLAST

There are many options on the Standard Nucleotide BLAST page. For example, you can select different databases to search; you can exclude certain data sources; and you can select a specific algorithm by which to search.

For your first BLAST search, we will keep things very basic. We will enter a sequence string, leave most options at their defaults, and use BLAST to identify the organism the sequence came from and see what else we can learn about it.
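The same choices (program, database, query) can also be expressed as request parameters, which may help show what the form is collecting. The sketch below only builds the query string and does not contact NCBI. The parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`) follow the NCBI BLAST URL API as we understand it, and the database name `nt` and the helper name `build_submit_query` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the BLAST service (the same host as the web form).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_query(sequence, program="blastn", database="nt"):
    """Return a URL-encoded query string for submitting a search.

    Hypothetical helper: the defaults mirror a basic nucleotide search,
    and "nt" is an assumed database name.
    """
    params = {
        "CMD": "Put",          # "Put" submits a new search
        "PROGRAM": program,    # blastn = nucleotide query vs nucleotide database
        "DATABASE": database,
        "QUERY": sequence,
    }
    return urlencode(params)


# First line of the tutorial's query sequence, used here as a short stand-in.
fragment = "ATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTA"
print(BLAST_URL + "?" + build_submit_query(fragment))
```

For this exercise the web form is all you need; the sketch is only meant to make the default options visible.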

Copy and paste the entire string of nucleotide symbols, below, into the box under Enter Query Sequence.

Copy this:

ATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTA
CTTCCCCTATCATAGAAGAGCTTATCACCTTTCATGATC
ACGCCCTCATAATCATTTTCCTTATCTGCTTCCTAGTCC
TGTATGCCCTTTTCCTAACACTCACAACAAAACTAACTA
ATACTAACATCTCAGACGCTCAGGAAATAGAAACCGTC
TGAACTATCCTGCCCGCCATCATCCTAGTCCTCATCGC
CCTCCCATCCCTACGCATCCTTTACATAACAGACGAGG
TCAACGATCCCTCCCTTACCATCAAATCAATTGGCCAC
CAATGGTACTGAACCTACGAGTACACCGACTACGGCG
GACTAATCTTCAACTCCTACATACTTCCCCCATTATTC
CTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACA
ATCGAGTAGTACTCCCGATTGAAGCCCCCATTCGTATA
ATAATTACATCACAAGACGTCTTGCACTCATGAGCTGT
CCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGAC
GTCTAAACCAAACCACTTTCACCGCTACACGACCGGGG
GTATACTACGGTCAATGCTCTGAAATCTGTGGAGCAAA
CCACAGTTTCATGCCCATCGTCCTAGAATTAATTCCCCT
AAAAATCTTTGAAATAGGGCCCGTATTTACCCTATAG

Paste the sequence into the Enter Query Sequence box.

If the box labeled "Align two or more sequences" is checked, uncheck it.

Then scroll down and click the BLAST button.
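Because the query is shown on several lines, it can be useful to confirm that what you copied is one unbroken run of nucleotide symbols. The following is a minimal sketch of that check. The helper name `clean_sequence` is hypothetical, and the allowed alphabet (A, C, G, T, N) is a deliberately strict assumption, narrower than what BLAST itself accepts.

```python
import re


def clean_sequence(pasted):
    """Remove whitespace, upper-case the text, and check the symbols.

    Hypothetical helper. Allowing only A, C, G, T and N is an assumption
    made for this exercise.
    """
    seq = re.sub(r"\s+", "", pasted).upper()
    unexpected = set(seq) - set("ACGTN")
    if unexpected:
        raise ValueError("unexpected symbols: " + "".join(sorted(unexpected)))
    return seq


# The first two lines of the tutorial's sequence, pasted with a blank
# line between them, as they might arrive from the clipboard.
pasted = """ATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTA

CTTCCCCTATCATAGAAGAGCTTATCACCTTTCATGATC
"""

query = clean_sequence(pasted)
print(len(query), query[:10])
```

Line breaks and spaces in the pasted text are harmless here because they are removed before the check.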

You may need to be patient: BLAST is crunching a huge amount of data.

You will see a processing screen for a while before your results appear.
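Behind the processing screen, the browser is repeatedly asking whether the search has finished. As a rough illustration, the sketch below reads a status out of a reply in the style the BLAST URL API uses (a `Status=` line inside a `QBlastInfoBegin` block). The sample reply is hand-written for this example, not captured from NCBI, and the helper names are hypothetical.

```python
import re


def parse_search_status(info_text):
    """Return the word after 'Status=' in a reply, or None if absent."""
    match = re.search(r"Status=(\w+)", info_text)
    return match.group(1) if match else None


def is_ready(info_text):
    """True once the reply reports that results can be fetched."""
    return parse_search_status(info_text) == "READY"


# Hand-written sample in the assumed reply format (not real server output).
sample_waiting = """<!--
QBlastInfoBegin
    Status=WAITING
QBlastInfoEnd
-->"""

sample_ready = sample_waiting.replace("WAITING", "READY")

print(parse_search_status(sample_waiting), is_ready(sample_ready))
```

A script that polls for status should wait between requests rather than asking continuously, which is also why the web page refreshes only every so often.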

Reading your BLAST Results

Once your results are displayed, you will see a header followed by the results of your search. The results can be displayed in several views: a list of sequence "Descriptions," a "Graphic Summary," and a more detailed "Alignments" view.

Click the Graphic Summary tab to see a graphic summary of the top 100 results.

Each bar in this graph represents a match with another sequence in the database. The color of each bar represents the extent to which the sequence in the database aligns with the sequence you input (the "Query" sequence), as given in the color key on the results page.
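The color key is essentially a lookup from alignment score to color. The sketch below shows that idea. The thresholds and color names are our recollection of the key BLAST has displayed for alignment scores and should be treated as assumptions; check them against the key on your own results page.

```python
def score_color(alignment_score):
    """Map an alignment score to a color-key bin.

    The cut-offs (40, 50, 80, 200) and color names are assumptions
    based on the key shown in the Graphic Summary, not values taken
    from this tutorial.
    """
    if alignment_score < 40:
        return "black"
    if alignment_score < 50:
        return "blue"
    if alignment_score < 80:
        return "green"
    if alignment_score < 200:
        return "magenta"
    return "red"


# Hypothetical scores, one per bin.
for score in (35, 45, 60, 150, 1200):
    print(score, score_color(score))
```

Under this assumed key, a bar in the highest bin means the database sequence aligns very well with your query.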

Of the top 100 results for this BLAST, how many sequences in the database align very well with yours?

100
50
0

What are these highly aligned sequences? Where did they come from?

One way to find out is to click on one of the bars in the graphic summary. Try that now.

What species is your query sequence from?

Xenopus laevis
Homo sapiens
Bos taurus
Oryza sativa

Exploring your BLAST Results

You should be viewing your BLAST results in your other browser window.

Click on the Descriptions tab to learn more about each of the sequences that aligned with yours.

Click on the description of the sequence to see the alignment.

For this exercise, select one of the sequences labeled "Homo sapiens...mitochondrion".

Clicking on a sequence will bring you to the Alignments view.

You can now see all the nucleotide base matches between your sequence (the "query" sequence) and the sequence from the database (the "subject" sequence).

This particular alignment isn't very interesting to look at because the two sequences match perfectly. In the next example we'll look at two sequences that do not perfectly align so that you can look at differences.

Our goal right now is simply to identify the sequence and explore the results.
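To make the Query/Sbjct comparison concrete, here is a small sketch of what the Alignments view does for an ungapped alignment: it marks each position where the two bases match and reports the fraction that match. The function names are hypothetical, and the subject string is the first ten bases of the tutorial's query with one base changed by hand, so it is not a real database sequence.

```python
def match_line(query, subject):
    """Return a string with '|' under matching bases and ' ' elsewhere."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "|" if q == s and q != "-" else " "
        for q, s in zip(query, subject)
    )


def percent_identity(query, subject):
    """Percentage of aligned positions where the bases are identical."""
    line = match_line(query, subject)
    if not line:
        raise ValueError("empty alignment")
    return 100.0 * line.count("|") / len(line)


query = "ATGGCACATG"      # first ten bases of the tutorial's query
subject = "ATGGCTCATG"    # hypothetical: sixth base changed from A to T

print("Query  " + query)
print("       " + match_line(query, subject))
print("Sbjct  " + subject)
print(percent_identity(query, subject))
```

For the perfect match in this exercise, every position would carry a match mark and the identity would be 100 percent.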

What chromosome is the subject ("Sbjct") sequence (the one in the database that matched your query) from?

14
6
17
That is a trick question! This is mitochondrial DNA.

The first base in your query ("Query") sequence aligns with approximately which base in the Subject ("Sbjct") sequence?

1
9300
680
7590

Several links in the alignment lead to the subject sequence in the Nucleotide database.

The first two, (1) the link labeled GenBank in the header next to Download and (2) the link on the Sequence ID, take you to the record for the full sequence as it was submitted (or created). Remember that our match starts around base 7590. The third link (3), adjacent to the range and also labeled GenBank, takes you to a record displaying just the range of interest (around 7590 to around 8270).
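The relationship between query positions and subject positions is simple arithmetic when the alignment has no gaps and both sequences run in the same direction. The sketch below assumes exactly that, and it uses the tutorial's approximate start of 7590 rather than the exact coordinate shown in your alignment. The function names are hypothetical.

```python
def subject_position(query_pos, query_start=1, subject_start=7590):
    """Subject coordinate aligned to a given query coordinate.

    Assumes an ungapped, same-strand alignment. The default
    subject_start of 7590 is the tutorial's approximate value.
    """
    return subject_start + (query_pos - query_start)


def subject_range(subject_start, aligned_length):
    """First and last subject coordinates covered by the alignment."""
    return subject_start, subject_start + aligned_length - 1


print(subject_position(1))        # where the first query base lands
print(subject_range(7590, 681))   # an illustrative aligned length
```

With gaps or a reverse-strand match the offsets would no longer be constant, so read the coordinates from the alignment itself in those cases.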

Either record might be useful, but let's look at the record for the entire sequence that was submitted, and look at our query sequence in that context.

Follow the link to the GenBank record in the Nucleotide database from your Sequence ID (OK266950.1 in this example):

[If this page insists on opening in a new browser tab, you can use this link instead to go to OK266950.1]
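The difference between the full record and the range-only record can also be seen in how each would be requested programmatically. The sketch below builds (but does not fetch) addresses for NCBI's E-utilities EFetch service. The parameter names (`db`, `id`, `rettype`, `retmode`, `seq_start`, `seq_stop`) are given as we understand that service, the coordinates are the tutorial's approximate ones, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession, start=None, stop=None):
    """Build an EFetch address for a GenBank flat-file record.

    Hypothetical helper. When start and stop are both given, only that
    range of the sequence is requested.
    """
    params = {
        "db": "nuccore",     # the Nucleotide database
        "id": accession,
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",
    }
    if start is not None and stop is not None:
        params["seq_start"] = start
        params["seq_stop"] = stop
    return EFETCH + "?" + urlencode(params)


print(genbank_record_url("OK266950.1"))              # full record
print(genbank_record_url("OK266950.1", 7590, 8270))  # approximate range
```

The first address corresponds to links (1) and (2), and the second to link (3).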

You should now be in the NCBI Nucleotide database, looking at a record labeled something like, "Homo sapiens haplogroup H3i mitochondrion, complete genome." (You may be looking at a different record.)

What is a haplogroup?

Many of the records we look at in this course are Reference Sequences or "RefSeq" records, which are curated by NCBI. But this is an "original" sequence record submitted by a GenBank participant.

Approximately how many bases does this record include?

16,000
332,000
1,400

In what section of the record can you find the name or affiliation of the researcher or organization that submitted the record?

DEFINITION
SOURCE
JOURNAL
FEATURES

An interesting part of a Nucleotide record is the section labeled "FEATURES." Called the "feature table," this is the part that reflects scientists' annotations -- notes on what biological features of interest are known about a sequence.

Scroll down the feature table of this mitochondrial DNA record. Definitions of some of the feature labels can be found in the GenBank Sample Record.

Two features of major interest include:

CDS = a coding sequence, or region of nucleotides that corresponds with amino acids in a protein.
gene = a region identified as a gene. A gene may include multiple sections of coding sequences, so the same nucleotide sequence (shown as a number range) may be labeled as both CDS and gene.
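To see how these labels and number ranges are laid out, here is a minimal sketch that reads feature labels and their ranges from a feature table in the GenBank flat-file layout (label indented five spaces, qualifiers indented further). The excerpt is invented for illustration and is not copied from the record in this exercise. The function name is hypothetical.

```python
import re
from collections import Counter

# A feature line: exactly five leading spaces, a label, then a location.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+(\S+)\s*$")


def parse_features(feature_table):
    """Return (label, location) pairs, skipping qualifier lines."""
    features = []
    for line in feature_table.splitlines():
        match = FEATURE_LINE.match(line)
        if match:
            features.append((match.group(1), match.group(2)))
    return features


# Invented excerpt: labels, ranges and gene names are hypothetical.
excerpt = (
    "     source          1..600\n"
    '                     /organism="Example organism"\n'
    "     gene            100..250\n"
    '                     /gene="EX1"\n'
    "     CDS             100..250\n"
    '                     /gene="EX1"\n'
    "     gene            300..420\n"
    '                     /gene="EX2"\n'
)

features = parse_features(excerpt)
counts = Counter(label for label, _ in features)
print(counts["gene"], counts["CDS"])

# Ranges that carry both labels, as described above.
gene_ranges = {loc for label, loc in features if label == "gene"}
cds_ranges = {loc for label, loc in features if label == "CDS"}
print(sorted(gene_ranges & cds_ranges))
```

In the invented excerpt, the range 100..250 appears under both labels, which is the situation the definition of "gene" describes.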

In the feature table, each labeled feature is hyperlinked to the sequence itself, which is at the bottom of the record. Click on the first instance of a "gene" label in this feature table.

You can now see the sequence for the gene highlighted in the context of the rest of the sequence:

The tools that appear at the bottom provide a useful way to learn and navigate your way around the features.

For example, since you clicked on a gene, you can now toggle through all the genes in this record using the tool in the lower left.

How many genes have been labeled in this human mitochondrial DNA record?

4261
2
13

Click around this feature table for a few minutes to get more accustomed to looking at this data.

When you're ready, move on to NCBI BLAST (Part B): Compare Sequences to explore these mitochondrial sequences in an interesting way using BLAST.