BIOL 499 - Bioinformatics
The ENTREZ TUTORIAL
Entrez is a molecular biology database and retrieval system developed by the National Center for Biotechnology Information (NCBI). NCBI's Entrez integrates genomic sequencing and mapping data, DNA and protein sequences, 3-D structures, taxonomic relationships, and the PubMed database of bibliographic references. Entrez contains molecular data from several databases worldwide, including GenBank at NCBI, the European Molecular Biology Laboratory (EMBL), the DNA Data Bank of Japan (DDBJ), SWISS-PROT, the Protein Information Resource (PIR), the Brookhaven Protein Data Bank (PDB), the Protein Research Foundation (PRF) database, and the Genome Sequence DataBase (GSDB). It also provides hypertext links to the OMIM online catalog of genetic disorders and to some full-text electronic journals.
This tutorial is designed to give the user hands-on experience with the
various resources that Entrez offers. The main focus is on using the central
component of the Entrez system: the nucleotide record. The links from the
nucleotide record to PubMed, to the protein database, and finally to the
available 3D protein structures will all be explored. The tutorial also shows
new users how to manipulate information within the Entrez interface and how to
move easily from one resource to another. To use the tutorial, follow the
instructions in the left exercise panel. Enter terms and make selections with
the Entrez browser appearing on the right.
Introduction
Open the Entrez Browser (http://www.ncbi.nlm.nih.gov/Entrez/).
Click Entrez Help or The Entrez Databases for more information.
Return to Entrez.
Nucleotide Search
Search Question: What information is available about the NF-kappaB
RelA/p65 gene and the protein encoded by it?
Select the Nucleotides database and perform the following search:
· Enter the term "p65" into the search field.
· Click on the Limits term under the search window.
· Check all of the boxes to exclude ESTs, STSs, working drafts and patents.
· Presume that you know the original cloning occurred within the last ten years. Set the year term of the date field to: FROM 1990 TO 2000.
· Toggle the setting to the left of the date field to "publication date".
· Click GO to start the search.
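For readers who prefer to script the same query, the search above can be approximated as a request to NCBI's E-utilities esearch service. This is only a sketch under assumptions: the E-utilities interface is not part of this tutorial, the database identifier "nuccore" is assumed to be the present-day name of the Nucleotide database, the helper name build_esearch_url is hypothetical, and the limit boxes for ESTs, STSs, working drafts and patents are not reproduced. The code only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the esearch service (an assumption; not named in the tutorial).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, first_year, last_year, db="nuccore"):
    """Return an esearch URL for `term`, limited to a publication-date range."""
    params = {
        "db": db,                    # assumed identifier for the Nucleotide database
        "term": term,
        "datetype": "pdat",          # filter on publication date
        "mindate": str(first_year),
        "maxdate": str(last_year),
    }
    return ESEARCH + "?" + urlencode(params)


# Mirrors the exercise: the term "p65", publication date FROM 1990 TO 2000.
url = build_esearch_url("p65", 1990, 2000)
print(url)
```

Keeping the date range in separate parameters, rather than inside the search term, mirrors the way the Limits page keeps the date fields apart from the search box.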
The search returns all the available sequences for p65 from DDBJ, EMBL and GenBank whose publication dates fall within the chosen ten-year range. However, since p65 is named for its molecular weight, the search returns a broader class of sequences than is useful. Refine the search by clicking on the Preview/Index term adjacent to Limits.
· Within the index search field, type "kappa" and press the View Index button.
· Select the top term with the most entries listed in the parentheses to the right.
· Once the "kappa" term is selected, click on the AND button on the right hand side.
· Press PREVIEW at the top of the page to see the size of the search results.
· This may require that you scroll to the right side of the screen.
· Press GO to initiate the search.
Suppose that only the human sequences for the p65 gene are of interest. The search can be restricted to human sequences through the Preview/Index window.
· Click on the Preview/Index term again.
· Type "human" into the search line and scroll down to Organism in the Field window. Hit the View Index button.
· Select the broadest "human" term, which will be the one with the most entries.
· Press the AND button to add this term to the search criteria.
· Initiate the search by once again pressing GO.
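The two refinements above amount to joining terms with the Boolean AND operator, with the last term restricted to the Organism field. A minimal sketch of how such a query string could be assembled follows. The helpers tag and and_terms are hypothetical, and the bracketed [Organism] qualifier is assumed to correspond to the Organism entry of the Field window; the index terms actually selected in Preview/Index may differ in form.

```python
def tag(term, field=None):
    """Attach an optional field qualifier, e.g. human[Organism]."""
    return f"{term}[{field}]" if field else term


def and_terms(*terms):
    """Join search terms with AND, skipping empty ones."""
    return " AND ".join(t for t in terms if t)


# The original term, the "kappa" index term, and the human organism term.
query = and_terms("p65", "kappa", tag("human", "Organism"))
print(query)  # p65 AND kappa AND human[Organism]
```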
Once the list of sequences has been winnowed down to the most relevant few, it is up to the researcher to pick the specific sequences of interest.
· Scroll down the list and find X61499 and M62399.
· Select these sequences by placing a check in the associated box.
· Go to the top of the screen and in the selection window associated with the Display button, scroll down to GenBank format.
· Click Display.
The GenBank record contains useful annotations for understanding the source
and type of sequence. Along the top of the sequence record are the Accession number,
the gi number and the title. The live links to the right of the gi number are
very useful:
PubMed is a direct link to the primary publication associated with the sequence.
Protein links the nucleotide sequence to the protein entry or the translation product of the sequence.
Related Sequences calls up a number of close homologues within the gene family from various species.
Link Out links the nucleotide sequence to its associated LocusLink entry.
The next line, bannered across the top of the record, also contains useful
information about the sequence file.
LOCUS: the first field in the header, which gives the gene locus name. The locus name for M62399 is HUMP65NFKB.
Sequence length: shows the length of the entire sequence submission.
Molecule type: designates the type of molecule represented by the record. For example, the human p65 clone in M62399 is a cDNA obtained from mRNA, so its molecule type is 'mRNA'. Consequently, it is easy to tell that the record is not a genomic sequence entry.
Sequence category: gives the GenBank division, a broad grouping that is usually taxonomic, in which the sequence is filed. The M62399 record was derived from human tissue, so it is placed in the primate division, abbreviated PRI.
Date: shows the most recent date on which the sequence entry was modified.
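The header fields described above can also be pulled out of the LOCUS line with a few lines of code. The sketch below assumes a whitespace-separated LOCUS line. The locus name, molecule type and division come from the M62399 description above, while the length and date in the sample line are placeholders, not the real values for this record. The function name parse_locus is hypothetical.

```python
def parse_locus(line):
    """Split a GenBank LOCUS line into its header fields."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a recognizable LOCUS line")
    return {
        "locus": tokens[1],
        "length": int(tokens[2]),
        "molecule": tokens[4],
        "division": tokens[-2],   # second-to-last token on the line
        "date": tokens[-1],       # last token on the line
    }


# Length (1000) and date are PLACEHOLDERS, not taken from the real record.
sample = "LOCUS       HUMP65NFKB   1000 bp    mRNA    PRI    01-JAN-2000"
fields = parse_locus(sample)
print(fields["locus"], fields["molecule"], fields["division"])
```

Reading the division and date from the end of the line, rather than from fixed positions, keeps the sketch tolerant of an extra token between the molecule type and the division.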
Along the side of the record are a number of other useful links, identifiers
and informational resources. Note that the Accession number, the NID and the
Version numbers are all ways of defining the sequence record. The root of the
Accession number remains the same, irrespective of whether modifications are
made to the sequence. Another way to track a specific sequence is with the NID
number which represents a single unique sequence. If the sequence is changed or
modified, the new sequence receives a new NID. The NID is also listed as the gi
number within the version field.
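The relationship between an accession number and its version can be illustrated with a small sketch. It assumes identifiers of the form root.version, as in the protein identifier AAA36408.1 that appears later in this tutorial. The helper name split_version and the ".2" identifier are hypothetical, used only to show that the root stays the same when a sequence is revised.

```python
def split_version(identifier):
    """Split 'ROOT.N' into (root, version number)."""
    root, sep, version = identifier.rpartition(".")
    if not sep or not root or not version.isdigit():
        raise ValueError(f"no version suffix in {identifier!r}")
    return root, int(version)


root1, v1 = split_version("AAA36408.1")
root2, v2 = split_version("AAA36408.2")   # hypothetical later revision
print(root1 == root2, v1, v2)  # True 1 2
```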
In addition to sequence identifiers, there are also useful links listed in the record.
· Scroll down to the ORGANISM field and note the link to taxonomic information.
· CDS is a useful link giving the same record with only the coding sequence in the nucleotide record.
· Click on it and notice the first codon is ATG.
· Click on the MEDLINE link listed above CDS.
· The reference that appears is in Abstract form. Note that there is a list of terms to the right including related articles that link the reference to other similar articles.
· In order to import the reference into a citation manager it generally needs to be in Medline format. To reformat the citation click on the display type window in which "Abstract" is listed and scroll down to Medline.
· Click Display.
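Citation managers can read Medline format because every field is introduced by a short tag. The sketch below assumes the usual layout: a tag of up to four characters, then a hyphen and a space, then the value, with wrapped values continued on indented lines. The record shown is an invented placeholder, not the reference attached to M62399, and parse_medline is a hypothetical helper.

```python
def parse_medline(text):
    """Parse one Medline-format record into {tag: [values]}."""
    record = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[4:6] == "- ":
            # New field: the first four columns hold the tag.
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None:
            # Indented continuation of the previous field's value.
            record[tag][-1] += " " + line.strip()
    return record


# Invented placeholder record, for illustration only.
sample = """PMID- 0000000
TI  - A placeholder title that is long enough to wrap
      onto a second line.
AU  - Doe J
AU  - Roe R
"""
rec = parse_medline(sample)
print(rec["TI"][0])
print(rec["AU"])
```

Values are stored in lists because some tags, such as the author tag in this placeholder, can occur more than once in a record.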
The Entrez interface maintains a couple more useful functions. In order to return to the nucleotide records:
· Click on the Nucleotide database in the black banner at the top of the window.
· Click on History from the sub-banner and note that the previous searches are still held in memory.
· Click on the live link to the top (most recent) search. This should return you to the two chosen genetic records.
· Check the two boxes to the left of the records and press the "Add to Clipboard" button above the record list.
· Press the Clipboard button and notice that the two records have been posted to the Entrez clipboard associated with this set of searches.
Look on the left-hand side for Related Resources and select LocusLink.
· Type p65 in the query window and hit GO.
· Notice that the abbreviated files have several colored icons to the right and the LocusLink number is a link to the primary record.
· Click on the record 5970.
· Read through all the links and information provided by LocusLink.
· Follow the OMIM link and read the elaborate description. At the bottom of the page, there is a bibliography of some of the key papers in the field.
· Return to the Entrez Nucleotide database.
· Click on History to return to the previous searches.
· Click on the top search.
· Select the M62399 record.
· Scroll down the page to the protein links nested within the sequence annotation.
· Click on the protein id=AAA36408.1 link, and note that this is the protein entry corresponding to the nucleotide file.
Many applications require nucleotide and protein sequences to be in FASTA
format.
Notice that the default setting is the GenPept format.
· Change the format in the Display type box to FASTA
· Click Display.
· Note the > character initiating the file. This is how FASTA format is designated.
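A FASTA record is simple enough to read with a few lines of code: a header line beginning with > followed by one or more lines of sequence. The sketch below is an illustration under assumptions: read_fasta is a hypothetical helper, and the header and residues are a made-up placeholder rather than the actual AAA36408.1 sequence.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Placeholder record; the residues are invented for illustration.
sample = """>example_protein placeholder description
MKTAYIAKQR
QISFVKSHFS
"""
records = list(read_fasta(sample))
print(records)
```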
To finish the examination of p65, explore the available 3D protein structures:
· Go to the Structure database by clicking on structure at the top of the window.
· Type kappa in the search window and start the search.
· Select 1NFI.
· Go to the bottom of the screen and click on the view button.
· Note: if a structure viewer is not installed on your computer, you will need to download one, such as Cn3D 2.5, before viewing.
· Try changing the settings under Structure. Switch the model to spacefill.
· Grab the protein with the mouse and move it around to see how it can be rotated in 3D space.
· To zoom in on the protein, hold down the control key and draw a box around the region of interest with the mouse.
· To return the magnification to default values, go to View and select "reset".
· Highlight some of the residues in the sequence box and watch what happens to the protein.