Introduction

This tutorial shows how to answer the question:

What are the Reference genomic, transcript and protein sequences for a gene?

Example scenario

"One of my collaborators sent me this DNA sequence, which I’ve already blasted. I think they gave me the wrong sequence. How can I get the right sequence for the MC1R gene?"

Step 1: Search for the gene and organism

Reference Sequences (or "RefSeqs") are standard sequences, curated by NCBI. When someone asks for the "right" sequence, they mean a standard or reference sequence.

One way you can find a standard genomic (DNA), transcript (mRNA) or protein sequence is to use the NCBI Gene Database.

1. In the Gene database search box, enter:

human[orgn] AND mc1r[sym] 

This will take you directly to the one matching record for this gene.
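The same query can also be expressed as a request to NCBI's E-utilities search service. The following is a minimal sketch, assuming the public `esearch` endpoint and Python's standard library; the helper name `build_gene_search_url` is hypothetical, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumed address of the E-utilities search service; check NCBI's
# E-utilities documentation for current usage guidelines.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(organism: str, symbol: str) -> str:
    """Return an esearch URL for one gene symbol in one organism (hypothetical helper)."""
    # Same fielded query as typed into the Gene search box above.
    term = f"{organism}[orgn] AND {symbol}[sym]"
    return ESEARCH_URL + "?" + urlencode({"db": "gene", "term": term})


url = build_gene_search_url("human", "mc1r")
print(url)
```

Opening the printed URL in a browser should return the matching Gene identifier as XML, which is one way to script the lookup instead of typing it into the search box.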

Step 2: Find the RefSeqs

You are now looking at a Gene record for the "MC1R melanocortin 1 receptor" gene in humans. This record organizes information about the gene in one place.

On the right side of the page is the Table of Contents. Click on the "NCBI Reference Sequences (RefSeq)" link.

Looking at the section of the Gene record for NCBI Reference Sequences (RefSeqs), note that there are two types of RefSeqs:

  • RefSeqs maintained independently of Annotated Genomes: Here, you can view the gene, mRNA or protein sequence as isolated "snippets" of nucleotides or amino acids.
  • RefSeqs of Annotated Genomes: Here, you can view the gene sequence as part of an assembly of the entire chromosome. 
 
We won't often be looking at records for entire chromosomes in this class. We'll generally be looking at those smaller records, where we can view the annotations right from our web browser.

Recall our patron's request: the "right sequence for the MC1R gene."

It’s not clear from the person’s question if they want the genomic (DNA), mRNA transcript or protein sequence.

The DNA sequence is accessible from the GenBank link, which opens the NG_012026.1 record.

The mRNA transcript sequence can be obtained by following the NM_002386.4 link.

The protein sequence can be obtained by clicking on the NP_002377.4 link.
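If you prefer to retrieve all three sequences in FASTA format without clicking through each link, the accessions above can be passed to NCBI's E-utilities. This is a sketch under stated assumptions: it uses the public `efetch` endpoint, assumes the nucleotide records live in the `nuccore` database and the protein record in `protein`, and uses a hypothetical helper name, `build_fasta_url`. It only builds the URLs.

```python
from urllib.parse import urlencode

# Assumed address of the E-utilities fetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accessions taken from the Gene record, paired with the Entrez database
# assumed to hold each one.
MC1R_REFSEQS = {
    "genomic": ("nuccore", "NG_012026.1"),
    "mRNA": ("nuccore", "NM_002386.4"),
    "protein": ("protein", "NP_002377.4"),
}


def build_fasta_url(db: str, accession: str) -> str:
    """Return an efetch URL that requests one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


for kind, (db, accession) in MC1R_REFSEQS.items():
    print(kind, build_fasta_url(db, accession))
```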

Let's explore the RefSeqGene NG_012026.1 record.

Click the GenBank link to view the reference sequence for NG_012026.1.

Strings of nucleotides don't come numbered in nature, and each genome assembly might number the sequences differently. So we need some kind of reference for where a sequence "begins" or "ends."

Regardless of the specific chromosomal location according to the current assembly, the RefSeqGene record is going to start in the same place in the sequence relative to the "gene." The RefSeqGene record contains sequences for the annotated portion of the gene as well as 5,000 bases upstream and 2,000 bases downstream. 

The RefSeqGene record always starts at position "1." In our example for MC1R, the gene feature starts at position "5984."
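To make the numbering concrete, here is a small sketch of the arithmetic for moving between record coordinates and positions counted from the start of the gene feature. The function names are hypothetical, and the simple offset used here is an illustration only; it is not a standard variant-naming scheme.

```python
# Start of the MC1R gene feature within NG_012026.1, as stated in the tutorial.
GENE_START_IN_RECORD = 5984


def record_to_gene_position(record_pos: int, gene_start: int = GENE_START_IN_RECORD) -> int:
    """Convert a 1-based record coordinate to a position counted from the gene start.

    The first base of the gene feature maps to 1; bases upstream of the gene
    map to zero or negative numbers in this simplified scheme.
    """
    return record_pos - gene_start + 1


def upstream_length(gene_start: int = GENE_START_IN_RECORD) -> int:
    """Number of bases in the record that precede the gene feature."""
    return gene_start - 1


print(record_to_gene_position(5984))  # first base of the gene feature
print(upstream_length())              # bases before the gene in this record
```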

Note the section on the right of the screen labeled "Change region shown."

You can see that you are viewing only a portion of the RefSeqGene record.

The extra flanking sequence in the full record provides access to potential regulatory regions and allows room for expansion of the gene boundaries.

Sometimes parts of other genes are included in the RefSeqGene record (if they lie within the 5,000 bases upstream or the 2,000 bases downstream).

Now return to the Gene record by using the "More about the MC1R gene" link in the right column.

Return to the Reference sequence links by using the same link in the Table of Contents, "NCBI Reference Sequences (RefSeq)."

We've now looked at the RefSeqGene record for the human MC1R gene (NG_012026.1).

Another option is to look at the genomic sequence in the context of the entire assembly.

On the Gene record, you'll find this option after the mRNA transcript and protein accession number links. These sequences come from the Reference and Alternate genome assemblies.

The Reference Primary Assembly is generated and controlled by the Genome Reference Consortium (GRC).

GRCh38 is the current assembly descended from the original, publicly funded Human Genome Project sequence (1990-2005). The "h" in the assembly name stands for human, and "38" is the build number.
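The naming pattern described above can be taken apart mechanically. The sketch below is illustrative only: the function name `parse_grc_assembly` is hypothetical, and the optional ".p" patch suffix is an assumption based on names such as GRCh38.p13.

```python
import re

# "GRC", a lowercase species code, a build number, and an optional patch
# suffix such as ".p13" (the patch part is an assumption about the naming).
ASSEMBLY_NAME = re.compile(r"^GRC(?P<species>[a-z]+)(?P<build>\d+)(?:\.p(?P<patch>\d+))?$")


def parse_grc_assembly(name: str) -> dict:
    """Split a GRC assembly name into species code, build number, and patch."""
    match = ASSEMBLY_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized GRC assembly name: {name!r}")
    patch = match.group("patch")
    return {
        "species": match.group("species"),
        "build": int(match.group("build")),
        "patch": int(patch) if patch is not None else None,
    }


print(parse_grc_assembly("GRCh38"))
```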

You can check the Assembly database to confirm the latest GRCh assembly. Once a new genome assembly is released, it can take some time for NCBI resources to be updated with it. As of August 2019, the newest assembly is GRCh38.p13; however, this particular RefSeq chromosome record did not change between GRCh38.p10 and GRCh38.p13 (the version number is still "10," as in NC_000016.10).

Notice that the RefSeq accessions for the genome assembly begin with NC_, which identifies the record as being for a chromosome. The RefSeqGene accession begins with NG_, which identifies the record as being for a genomic (but not chromosomal) sequence.
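As a quick illustration of reading an accession, the sketch below splits an identifier into its accession and version number and looks up the record type from the prefix. The table covers only the four prefixes seen in this tutorial, and the function name `describe_accession` is hypothetical.

```python
# Prefixes encountered in this tutorial; RefSeq uses others that are not listed here.
REFSEQ_PREFIXES = {
    "NC_": "chromosome",
    "NG_": "genomic (not chromosomal)",
    "NM_": "mRNA transcript",
    "NP_": "protein",
}


def describe_accession(accession_version: str):
    """Return (accession, version, record type) for an identifier like NC_000016.10."""
    # Everything after the first "." is the version number, if one is present.
    accession, _, version = accession_version.partition(".")
    record_type = REFSEQ_PREFIXES.get(accession[:3], "unknown")
    return accession, int(version) if version else None, record_type


for identifier in ["NC_000016.10", "NG_012026.1", "NM_002386.4", "NP_002377.4"]:
    print(describe_accession(identifier))
```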

Summary

You have reached the end of the tutorial for the question: What are the Reference genomic, transcript and protein sequences for a gene?

NCBI Gene Help Resources

Continue to Chapter 7. What variations are present in the gene and are they associated with disease?

Powered by Guide on the Side from the University of Arizona Libraries
Developed resources reported in this site are supported by the National Library of Medicine (NLM), National Institutes of Health (NIH) under cooperative agreement number UG4LM012344 with the University of Utah Spencer S. Eccles Health Sciences Library. The content is solely the responsibility of the authors and does not necessarily represent the official views of NIH.