Blocks Tutorial
Using the Blocks Database to
Recognize Functional Domains
Jorja G. Henikoff
Fred Hutchinson Cancer
Research Center
Telephone: 206-667-4509
Fax: 206-667-5889
Email: jorja@fhcrc.org
Elizabeth A. Greene
Fred Hutchinson Cancer
Research Center
Telephone: 206-667-6576
Fax: 206-667-6497
Email: eagreene@fhcrc.org
Nick Taylor
Fred Hutchinson Cancer
Research Center
Telephone: 206-667-6576
Fax: 206-667-6497
Email: ntaylor@fhcrc.org
Shmuel Pietrokovski
Weizmann Institute of Science
Telephone: +972 (8) 934 2747
Fax: +972 (8) 934 4180
Email: pietro@bioinfo.weizmann.ac.il
Steven Henikoff
Howard Hughes Medical Institute
Fred Hutchinson Cancer
Research Center
Telephone: 206-667-4515
Fax: 206-667-5889
Email: steveh@fhcrc.org

Key terms: protein motif,
amino acid sequence conservation, multiple sequence alignment, protein homology
searching, PCR primer design

Abstract
Blocks are ungapped multiple
alignments of segments of related protein sequences that correspond to the most
conserved regions of proteins. The Blocks Database is a collection of blocks
representing known protein families that can be used to compare a protein or
DNA sequence with documented families of proteins. Procedures in this unit
describe the analysis of proteins and families using Blocks-based tools,
including searching, exploring relationships with trees, making blocks and
designing PCR primers with blocks for isolating homologous sequences.

Using the Blocks Database to Recognize Functional Domains
Blocks are ungapped multiple
alignments of segments of related protein sequences that correspond to the most
conserved regions of proteins. The Blocks Database is a collection of blocks
representing known protein families that can be used to compare a protein or
DNA sequence with documented families of proteins (Henikoff and Henikoff,
1991). The current Blocks+ Database, generated by the automated PROTOMAT
system, includes protein families documented in InterPro (Apweiler et al.,
2000) and Prints (Attwood et al., 2000). Part 1 describes
retrieval of a Blocks Database entry and numerous options for displaying and
analyzing conserved sequence information. Appendix 1 describes
searching other databases with block queries (Pietrokovski et al., 1998).
Parts 2 and 3 describe procedures for analyzing a sequence of interest
using Blocks-based tools. Part 4 introduces the ProWeb Tree Viewer, a
graphical tool that facilitates the exploration of relationships between
protein family members. Part 5 illustrates how a user can create
blocks from a set of related sequences using Block Maker (Henikoff et al.,
1995). Part 6 describes the use of blocks in designing optimal PCR
primers by applying the CODEHOP strategy (Rose et al., 1998). These procedures
are illustrated with a running example, the C-5 cytosine-specific DNA methylases.

Part 1: EXPLORING PROTEIN FAMILIES USING THE BLOCKS DATABASE
The blocks for each protein
family entry in the Blocks Database can be retrieved and displayed, and can be
used as queries in searches of other databases. There are three ways to access
information in the Blocks Database:
Web interface. The best way to access the Blocks Database is through
the Web at http://blocks.fhcrc.org/.
E-mail. You can send a message to blocks@blocks.fhcrc.org. Instructions
for using the E-mail system will be returned if the word "help"
appears in the subject line.
Download. The Blocks Database is available as a text file from ftp://ftp.ncbi.nlm.nih.gov/repository/blocks/.
Necessary Resources
Hardware. A workstation, personal computer or terminal
connected to the Internet.
Software. An E-mail program for the E-mail interface, and any
type of Web browser for the Web interface. Either the Chime or the Rasmol helper
application to view protein structures using a browser. A file transfer program
to download the data files.
Data Files. The Blocks Database is distributed as an ASCII text
file.
1. Get a Blocks Database entry.
We use the Blocks Database
entry for the C-5 cytosine-specific DNA methylases as an example. Open the
Blocks Web site in a Web browser: http://blocks.fhcrc.org/.
The first window to appear is shown in Figure 1.
a. Click on "Get Blocks by key word".
b. Enter "cytosine and methylase" and press Enter. One item is returned, the entry
IPB001525. This is the Blocks Database accession number for blocks made from
the InterPro family with accession number IPR001525 (Apweiler et al., 2000).
c. Click on the link to
IPB001525. The entire Blocks Database entry for IPB001525 is shown in text
format. The first page is reproduced in Figure 2.
There are six blocks for this family labeled IPB001525A to IPB001525F. Links at
the top of the page lead directly to the blocks. The first part of IPB001525B
is shown in Figure 3.
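For offline work, a block in this text format can be read with a short script. The following is a minimal sketch, assuming the layout described in this step: ID, AC, DE and BL header lines, then one line per sequence segment giving the identifier, the start position in parentheses, the residues and the weight, with `//` closing the block. The sample block is invented for illustration and is not taken from IPB001525; the function and field names are hypothetical, and real entries may contain details this sketch does not handle.

```python
import re

# Invented example in Blocks format; NOT real IPB001525 data.
SAMPLE_BLOCK = """\
ID   EXAMPLE_FAMILY; BLOCK
AC   IPB000000B; distance from previous block=(3,25)
DE   Hypothetical example family
BL   PCQ motif; width=8; seqs=3; 99.5%=500; strength=1000
SEQ1_ORGA|P00001 (  70) GGPPCQGF  52

SEQ2_ORGB|P00002 (  81) AGFPCQPF 100
SEQ3_ORGC|P00003 (  64) SGSPCQAF  77
//
"""

# identifier, (start position), residues, weight
SEGMENT_LINE = re.compile(r"^(\S+)\s+\(\s*(\d+)\)\s+(\S+)\s+(\d+)\s*$")


def parse_block(text):
    """Parse one block into a dict of header fields and segments."""
    header = {}
    segments = []
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-block marker
            break
        if not line.strip():  # blank lines only separate clumps
            continue
        # Header lines: two-letter tag followed by three spaces.
        if line[:2] in ("ID", "AC", "DE", "BL") and not line[2:5].strip():
            header[line[:2]] = line[5:].strip()
            continue
        match = SEGMENT_LINE.match(line)
        if match:
            name, start, residues, weight = match.groups()
            segments.append({"name": name, "start": int(start),
                             "residues": residues, "weight": int(weight)})
    distances = re.search(r"=\((\d+),(\d+)\)", header["AC"])
    return {
        "id": header["ID"].split(";")[0],
        "accession": header["AC"].split(";")[0],
        "min_distance": int(distances.group(1)),
        "max_distance": int(distances.group(2)),
        "description": header["DE"],
        "width": int(re.search(r"width=(\d+)", header["BL"]).group(1)),
        "n_seqs": int(re.search(r"seqs=(\d+)", header["BL"]).group(1)),
        "segments": segments,
    }


block = parse_block(SAMPLE_BLOCK)
print(block["accession"], block["width"], len(block["segments"]))
```

Keeping each segment as a small dictionary makes it easy to filter or re-weight sequences before pasting blocks into other tools.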
Each block starts with ID, AC and DE lines adapted from InterPro. They list, respectively,
the InterPro short identifier, the Blocks accession number, and the InterPro
description of the family. The AC line also includes the minimum and maximum
distance from the end of the previous block to this block across all sequences.
For the A block, these numbers are the distances from the beginning of the
sequences. The BL line following the DE line in each block contains information
from PROTOMAT, including a three-character motif, the width of the block and
the number of sequence segments in it. Additional numerical calibration points
(99.5% and strength) are used by the BLIMPS searching program described in
Parts 2 and 3. The aligned sequence segments
follow the BL line in each block. The sequence identifier from
Swiss-Prot/TrEMBL (Bairoch and Apweiler, 2000) is followed by the position of
the first residue in the segment. Clicking on the sequence identifier link
brings up the Swiss-Prot/TrEMBL entry for the sequence. Sequence segments are
clumped and separated by blank lines if at least 80% of the aligned residues
match between any pair of segments. Numerical sequence segment weights are
shown to the right of each segment (Henikoff and Henikoff, 1994). The higher
this weight, the more dissimilar the segment is from other segments in the
block, with the segment most dissimilar from all others having a weight of 100.
Each block in a Blocks
Database entry contains segments from the same sequences, but the order is
different since the segments clump differently in each block. The six IPB001525
blocks each contain segments from the same 158 sequences.
At the top of the Blocks
Database entry page are several links that provide additional information and
views.
2. Display blocks graphically.
a. Map. Click on "Block Map". The locations of all six blocks in all 158 sequences are displayed.
b. Logos. Under the
"Logos" bullet, select "GIF" display format. The six blocks
are shown as sequence logos (Schneider and Stephens, 1990) reproduced in
Figure 4. A
sequence logo is a graphical representation of aligned sequences where at each
position the size of each residue is proportional to its frequency in that
position, and the total height of all the residues in the position is
proportional to the conservation of the position. Highly conserved motifs, such
as the "PCQ" in IPB001525B and "ENV" in IPB001525C, stand
out more clearly in logos than in the text format. Logos may also be displayed
in other formats.
c. Phylogenetic tree. Under
the "Tree from blocks alignment" bullet, select "ProWeb
TreeViewer". It takes a few minutes to build and display a phylogenetic
tree computed from the sequence segments in the blocks (Chapter 6). The tree is
displayed in a separate browser window. The ProWeb TreeViewer is discussed in
Part 4.
d. Protein structures. If any
of the sequences in the blocks for a family has a structure in the Protein Data
Bank (http://www.rcsb.org/pdb/), then
the blocks can be displayed on the structure. Select "PDB entries".
Two sequences in the blocks, MTH1_HAEHA and MTH3_HAEAE, have known structures
that overlap the block regions, 6MHT and 1DCT respectively. Click on
"6MHT" under the "3D Blocks" column. A thumbnail sketch of
the structure with the six blocks marked in different colors is displayed,
along with links to start Web browser helper applications for the Chime or
Rasmol structure viewers.
3. Other links.
a. Design polymerase chain
reaction (PCR) primers from blocks. The COnsensus-DEgenerate Hybrid
Oligonucleotide Primers (CODEHOP; Rose et al., 1998) tool designs PCR primers
from protein multiple alignments. It is described in Part 6.
b. Predict amino acid
substitutions in blocks. The Sorting Intolerant from Tolerant (SIFT; Ng and
Henikoff, 2001) program predicts which amino acid substitutions in each block
position are likely to affect protein function. Clicking on the SIFT link
brings up the SIFT entry form with the IPB001525 blocks inserted.
c. Additional links. For some
families in the Blocks Database, links are provided to other Web sites with
related information. For IPB001525, there are links to CYRCA (Kunin et al.,
2001) and MetaFam (Silverstein et al., 2001).

Appendix 1: SEARCH BLOCKS VERSUS OTHER DATABASES
Representations of the six
IPB001525 blocks can be used to search other databases for additional C-5
cytosine-specific DNA methylases. This approach is more powerful than searching
with a single protein sequence (Henikoff and Henikoff, 1997).
1. COBBLER sequence. Select "COBBLER
sequence" under the "Search Blocks vs other databases" bullet.
COBBLER stands for COnsensus Biasing By Locally Embedding Residues. A single
sequence is selected from the set of blocks and enriched by replacing the
conserved regions delineated by the blocks with consensus residues derived from
the blocks. Embedding consensus residues improves performance with readily
available single sequence query searching programs, such as BLAST [(Altschul
et al., 1990); Unit 3A.4] and FASTA [(Pearson, 1990); Unit 3A.5]. The
IPB001525 blocks are embedded in the portion of MTF1_FUSNU spanned by the
blocks. The blocks are shown in upper case and the intervening sequence in
lower case. Click on "Gap-Blast Search" and a search of the COBBLER
sequence against the non-redundant protein database is automatically started at
NCBI's Blast Web site in a separate browser window. Other BLAST searching
options are also provided. The COBBLER sequence may also be copied and pasted
into other sequence searching Web pages.
2. MAST search. Select "MAST
search" under the "Search Blocks versus other databases" bullet
and a MAST searching form will appear in a separate browser window. MAST is a
searching tool at the San Diego Supercomputer Center [(Bailey and Gribskov,
1998); Unit 2.5]. The six IPB001525 blocks are converted into numerical
position-specific scoring matrices (PSSMs; Henikoff and Henikoff, 1996), each holding
20 scores per position, one for each amino acid, that reflect how likely that residue is to occur there. MAST
scans all six of these PSSMs against one of several amino acid or nucleotide
sequence databases and returns the results by E-mail. Enter an E-mail address
in the MAST form and select a sequence database to search. Consult the MAST
help files by clicking on the links for the other options. For our example,
select the Drosophila database and accept the defaults for the other options.
MAST will search for C-5 cytosine-specific DNA methylases among Drosophila
proteins. The list of MAST hits is shown in Figure 5.
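The PSSM conversion mentioned in this step can be illustrated in a few lines. The sketch below is a simplified stand-in and not the scoring used by MAST or BLIMPS: it builds log-odds scores from unweighted residue counts, with a uniform background and a fixed pseudocount (both assumptions made for brevity), then slides the matrix along a query to find the best-scoring ungapped window. The segments and the query are invented, and the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(segments, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per block column."""
    width = len(segments[0])
    if any(len(seg) != width for seg in segments):
        raise ValueError("all segments in a block must have the same width")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(width):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seg in segments:
            counts[seg[col]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def best_hit(pssm, query):
    """Score every ungapped window; return (0-based start, score) of the best."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        # Residues outside the 20-letter alphabet contribute a score of 0.
        score = sum(pssm[i].get(query[start + i], 0.0) for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Invented block segments and query, for illustration only.
segments = ["GGPPCQGF", "AGFPCQPF", "SGSPCQAF"]
pssm = build_pssm(segments)
start, score = best_hit(pssm, "MKTLLAVSGFPCQAFSLAG")
print(start, round(score, 2))
```

Real searching programs add sequence weighting, better pseudocounts and calibrated statistics on top of this basic idea, which is why their E-values cannot be reproduced from a sketch like this.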
The top hit, AAF53163.1, is an unequivocal DNA methyltransferase homolog with an
E-value of 4.7 x 10^-26.
3. LAMA search. Select "LAMA
search" under the "Search Blocks vs other databases" bullet and
a LAMA searching form will appear in a separate browser window with the
IPB001525 blocks inserted in the query field. LAMA (Local Alignment of Multiple
Alignments) is a program for comparing protein multiple sequence alignments
with each other (Pietrokovski, 1996). The program can search databases of
multiple alignments in the Blocks Database format. The search is for sequence
similarities between conserved regions of protein families. The method is
sensitive, detecting weak sequence relationships between protein families and
sequence similarities beyond the range of conventional sequence database
searches. Under the "Select database to search" heading on the LAMA
form, select "Prints Database" and click the "Perform
Search" button. The Prints Database [(Attwood et al., 2000); Unit 2.8] is
another collection of ungapped conserved regions of protein families similar in
philosophy to the Blocks Database. Four hits are reported by LAMA to two
different Prints entries. IPB001525A, C and D are aligned with PR00105A, B and C. PR00105
is the Prints entry for cytosine-specific DNA methyltransferases. Click on the
"Logo" icon at the right of each LAMA hit to see the blocks aligned
as logos. IPB001525C has a weaker alignment with PR00115E, the fifth of six
blocks representing the fructose-1,6-bisphosphatases in the Prints Database.
The aligned logos for these two blocks show that both blocks have highly conserved
P, F and E residues in the same relative positions.

Part 2: ANALYZING PROTEIN SEQUENCES WITH THE BLOCK SEARCHER
The primary use of the Blocks
Database is to classify a query sequence as belonging to one or more known
protein families based on sharing conserved regions. This part discusses
classifying a protein query and Part 3 discusses classifying a DNA
sequence query.
Web interface. The best way to compare a query sequence with the
Blocks Database is through the Web at http://blocks.fhcrc.org/.
Three different searching programs are available.
E-mail. Users can send a message containing the sequence to
be searched to blocks@blocks.fhcrc.org.
UNIX programs. Programs to search the Blocks Database and analyze
results are available for UNIX systems from ftp://ftp.ncbi.nlm.nih.gov/repository/blocks/blimps.
Data Files. Query sequences are accepted in FASTA or GenBank
format.
1. Select a searching option. Open the Blocks Web site in a
Web browser: http://blocks.fhcrc.org/ (Figure 1).
Three searching options are provided: Block Searcher (Henikoff and Henikoff,
1991), Reverse PSI-BLAST Searcher and IMPALA Searcher (Schaffer et al., 1999).
Block Searcher uses the original BLIMPS (Henikoff et al., 1995) program.
Reverse PSI-BLAST and IMPALA are searching programs from the NCBI group and use
the BLAST searching algorithms and statistics (Schaffer et al., 2001). All
three of these programs convert blocks to position-specific scoring matrices
(PSSMs) for searching. Of the three, reverse PSI-BLAST is the fastest way to
search the Blocks Database, requiring less than a minute for the average
protein query on our Web server. eMotif at Stanford University (Huang and
Brutlag, 2001) is an even faster, although less sensitive, way to search the
Blocks Database, requiring perhaps a second to search the Blocks+ and Prints
databases. eMotif attains high speed by searching amino acid strings rather
than PSSMs. Whereas Reverse PSI-BLAST,
IMPALA and eMotif are limited to protein query sequences, Block Searcher
accepts both protein and nucleotide query sequences, translating DNA sequences
on-the-fly. Click on the "Block Searcher" link and the form shown in
Figure 6
appears. Links to PSI-BLAST,
IMPALA, eMotif and other protein family searching sites are included on the
Block Searcher page.
2. Submit a protein query sequence to the Block Searcher.
For our example, we are
interested in cytosine methyltransferases in Drosophila. Using step
2 of Appendix 1 for this protein family, a MAST search of the IPB001525
blocks against Drosophila proteins returns GenBank sequence AAF53163.1 as the
top hit (Figure 5).
Follow the 'E' link for AAF53163.1 to the amino acid sequence entry, display in
FASTA format, and copy and paste it into the sequence box of the Block Searcher
form. Accept the default values for the rest of the options on the form and
click "Perform Search". The search takes a few minutes and the
results can optionally be returned to an E-mail address. By default the
"Blocks+" Database is searched. This database represents InterPro
families (Apweiler et al., 2000), plus additional families from Prints
(Attwood et al., 2000). Blocks for the InterPro families are made by PROTOMAT,
but Prints blocks are taken directly from the Prints Database. Optionally the
"Blocks+ Database without compositionally biased blocks" may be
searched. This is a subset of Blocks+ with highly biased blocks removed to
reduce false positive hits to compositionally biased queries. A description of
the current release of Blocks+ is at http://blocks.fhcrc.org/blocks_release.html.
The entire Prints Database may also be searched. The default cutoff expected
value is 1; at this cutoff, an average protein query is expected to hit about one protein family by chance.
There are several output options. However, all but "Summary with
alignments" and "Summary", which omits the alignments, are
specialized and not generally recommended. The BLIMPS searching program will
examine the query sequence to determine whether it is amino acid or DNA, but
the sequence type may also be specified explicitly. These are the only options for a protein query.
3. Examine results returned by the Block Searcher.
Block Searcher results are
prefaced by a description of the version of the Blocks Database searched and a
brief description of the output format. The query title and length are then
listed, followed by the number of blocks compared and the number of query-block
alignments scored. In Figure 7,
the top hit for AAF53163.1 is IPB001525 with a combined E-value of 6.9 x 10^-27
for five of the six IPB001525 blocks, which are aligned with the query in the
correct order and with distances between them compatible with those observed in
sequences in the blocks. A second hit to PR01035 with an expected value of 0.67
is probably spurious because only one of twelve PR01035 blocks was aligned.
Because it is often unclear whether
twilight-zone hits with marginal E-values are real,
you should ask whether a suspected match is detectable using a different
searching program. Reverse PSI-BLAST (or IMPALA) and eMotif differ very
substantially from Block Searcher and each other in the way they align and
score matches, and so they are unlikely to detect the same chance similarities.
Verify your search of AAF53163.1 using Reverse PSI-BLAST (http://blocks.fhcrc.org/blocks/rpsblast.html).
Reverse PSI-BLAST reports a portion of the CXXC zinc finger family (IPB002857)
with E=0.37. Investigation of IPB002857 reveals that it was annotated by
InterPro as a domain that is usually found upstream of cytosine methylases.
Because the Blocks Database is generated automatically, it occasionally
includes a conserved region adjacent to the annotated domain if that region is
found in a large fraction of the sequences. Thus, IPB002857 includes regions
from known cytosine methyltransferases that are found in IPB001525. Because Reverse
PSI-BLAST allows gaps, it tends to extend alignments, increasing sensitivity at
the expense of selectivity relative to Block Searcher.
Following the hit summary are
alignment details for each hit. Each block in the hit and its location in the
query sequence is listed with individual expected values. A schematic map is
shown to compare the block alignments with the range of alignments in sequences
in the blocks. Finally, the query segment is aligned with a single segment most
like it from each block. AAF53163.1 is aligned with IPB001525B-F, but not with
IPB001525A. IPB001525B is aligned starting at position 70, IPB001525C at
position 95, etc. The query is most similar to TrEMBL sequence O35212 in
IPB001525B, to O43669 in IPB001525D, and to PMT1_SCHPO|P40999 in IPB001525C, E
and F.
Because the alignments of
AAF53163.1 with IPB001525B-F are so clear, it is curious that Block IPB001525A
is missing. The GenBank annotation (Unit 1.2) documents AAF53163.1 as a
predicted protein from a large sequencing project, and it thus may not have
been adequately scrutinized. It was translated from AE003635.1:7013..8050, so a
DNA query can be extracted from AE003635 including more upstream sequence where
the A block may lie. One such query is shown in Figure 8.
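Building such a query amounts to widening the annotated coordinates before cutting out the nucleotides. The following is a minimal sketch, assuming a forward-strand feature written in the GenBank `start..end` style with 1-based inclusive coordinates; the 500-nucleotide flank and the function names are illustrative choices, not values taken from this tutorial.

```python
import re


def parse_span(location):
    """Parse a simple 'start..end' location into 1-based inclusive integers."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if not match:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError("start must not exceed end")
    return start, end


def widen_upstream(location, flank):
    """Extend a forward-strand span upstream by `flank` nucleotides."""
    start, end = parse_span(location)
    return max(1, start - flank), end  # never run past the first base


def extract(sequence, start, end):
    """Return the 1-based inclusive slice of a nucleotide string."""
    return sequence[start - 1:end]


# The annotated span from the text, widened by a hypothetical 500 nt.
new_start, new_end = widen_upstream("7013..8050", 500)
print(new_start, new_end)
```

The returned coordinates can then be used with `extract` on the full AE003635 sequence, or entered in whatever sequence retrieval tool you prefer.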

Part 3: ANALYZING DNA SEQUENCES WITH THE BLOCK SEARCHER
If you have a DNA query, the
Block Searcher will translate it into protein in all frames on one or both
strands. Each block in a family is aligned with the translated query sequence
independently and then hits are assembled on each strand. Therefore, all blocks
in a hit are on the same strand, but not necessarily in the same frame.
Web interface. The best way to compare a DNA query sequence with
the Blocks Database is through the Web at http://blocks.fhcrc.org/blocks_search.html.
E-mail. Users can send a message containing the sequence to
be searched to blocks@blocks.fhcrc.org.
Data Files. Query sequences are accepted in FASTA or GenBank
format.
1. Select a searching option. Open the Blocks Web site in a
Web browser: http://blocks.fhcrc.org
(Figure 1). Only the Block Searcher (Henikoff and Henikoff, 1991) will
handle a DNA query sequence, translating it on-the-fly. Click on the
"Block Searcher" link and the form shown in Figure 6 appears.
2. Submit a DNA query sequence to the Block Searcher.
Copy and paste the DNA
sequence (Figure 8)
into the Block Searcher form. Because DNA queries are translated before
comparing them with the Blocks Database, three (one strand) or six (both
strands) times as many comparisons are made as for a protein query. Therefore,
this type of search takes longer and may result in higher background levels of
false positive hits. To reduce the background, select the "Blocks+
database without compositionally biased blocks". Hits are pieced together
from blocks on the same strand, although they may be in different frames. To
reduce search time, select "Forward Strand" under "Additional
optional search parameters for a DNA query" at the "Strands to search"
bullet. You may also want to select
"DNA" under "Optionally force query sequence type". If
extra line feeds are introduced during copy-and-paste so that a long FASTA
title line becomes two lines, BLIMPS may decide the query is protein. It is a
good idea to check the title line after the paste operation as different
workstations and browsers produce different results. Click the "Perform
Search" button and wait for the results (usually a few minutes).
3. Examine results returned by the Block Searcher.
This time, the top hit
includes all six IPB001525 blocks with a combined E-value of 2.9 x 10^-31
(Figure 9).
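Because the Block Searcher translates a DNA query in every frame, a single hit can combine blocks found in different frames of the same strand. The sketch below is a generic six-frame translator using the standard genetic code; it is not the BLIMPS implementation, and the frame numbering (1 to 3 on the given strand, -1 to -3 on the reverse complement) is an assumption chosen to resemble the frame numbers reported in the results. The example DNA is invented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate from `offset`; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(offset, len(dna) - 2, 3))


def translate_frames(dna, both_strands=True):
    """Return {frame number: protein} for three or six reading frames."""
    frames = {f + 1: translate(dna, f) for f in range(3)}
    if both_strands:
        rc = reverse_complement(dna)
        frames.update({-(f + 1): translate(rc, f) for f in range(3)})
    return frames


# Invented example: frame 1 reads M-P-C-Q-stop.
frames = translate_frames("ATGCCATGTCAATAG")
for number in sorted(frames):
    print(number, frames[number])
```

Translating only the forward strand (`both_strands=False`) halves the number of translated frames, which mirrors the effect of the "Forward Strand" option described in step 2.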
The A block is in a different frame (1) than the other five blocks (2) and is
located upstream of the region of AE003635 translated for AAF53163.1. Further
analysis (not shown here) reveals a 49-nucleotide intron between the A and B
blocks missed by the gene prediction programs. The corrected protein, now dubbed
"Dnmt2", is shown in Figure 10.
The Block Searcher results
with the DNA query also report a hit to the single block representing IPB001529
with an E-value of 0.7. Because this RNA polymerase M/15 Kd subunit family is
represented by only one block in the Blocks Database, the hit cannot be dismissed simply
because other blocks of the family are missing. However, the alignment reveals a stop codon within the block region
of the query, which would be unlikely in a genuine match. The corrected protein can be searched against
the Blocks Database to see if this hit turns up again (it does not, nor does
the hit to PR01035 (Figure 7)
show up with the corrected query).
In order to explore how Dnmt2
relates to the other C-5 cytosine-specific DNA methylases, click on one of the
IPB001525 links on the Block Searcher results to get the Part 1 page
for this group.

Part 4: VIEWING TREES BASED ON BLOCKS
A phylogenetic tree is made
for each protein family in the Blocks Database using the multiple alignments in
the block regions only. The neighbor-joining algorithm [(Saitou and Nei,
1987); Unit 6.4] is applied using Clustal W [(Thompson et al., 1994); Unit 2.4].
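Neighbor-joining works from a matrix of pairwise distances. As a minimal sketch of how one entry of such a matrix might be obtained, the code below computes the fraction of differing residues between two aligned, ungapped block segments and applies Kimura's correction for multiple substitutions in the form d = -ln(1 - p - p^2/5). We understand this to be the form Clustal W uses for proteins, but treat that attribution, the invented segments and the function names as assumptions rather than a description of the exact settings used to build the Blocks trees.

```python
import math


def fraction_different(seg_a, seg_b):
    """Fraction of aligned positions at which two ungapped segments differ."""
    if len(seg_a) != len(seg_b) or not seg_a:
        raise ValueError("segments must be non-empty and of equal length")
    return sum(a != b for a, b in zip(seg_a, seg_b)) / len(seg_a)


def kimura_distance(p):
    """Kimura-corrected protein distance for an observed difference p."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        # The formula is undefined for very divergent pairs (p above ~0.85).
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


# Invented segments differing at 2 of 8 positions.
p = fraction_different("GGPPCQGF", "AGFPCQGF")
d = kimura_distance(p)
print(p, round(d, 4))
```

The corrected distance is always at least as large as the observed fraction, reflecting substitutions that are hidden when the same position changes more than once.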
The Kimura correction for multiple substitutions is applied. If there are not
too many sequences, 100 bootstrap values are calculated. The output from
Clustal W is a tree file in a format which can be read by most tree display
programs (Unit 6.2).
The ProWeb TreeViewer allows
you to interactively explore trees made from blocks, zooming in on sections of
interest, and to view additional information associated with the sequences used
to create the tree. It also facilitates making new blocks from subtrees. This
type of analysis is valuable when your sequence belongs to a clade from a large
family which may have somewhat different properties than the entire family.
Web interface. The ProWeb TreeViewer is available through the Web at
http://www.proweb.org/treeviewer/info.html
1. Start the ProWeb TreeViewer.
From the Part 1
page for IPB001525, click on the "ProWeb TreeViewer" link near the
top of the page. Alternatively, enter "IPB001525" in the form at http://www.proweb.org/treeviewer.
A phylogenetic tree appears.
2. Select a subtree.
In the Block Searcher output
from Part 3, step 3 (Figure 9),
Dnmt2 is most like PMT1_SCHPO, O35212 and O43669. Near the top of the page, you
should see a subtree that contains sequences PMT1_SCHPO, O43669, O14717, O35212
and O55055 (Figure 11).
Prune the tree to include only this subtree by clicking on the small solid blue
box at the junction between PMT1_SCHPO and the other four sequences. The pruned
tree is shown in a new browser window (Figure 12).
Below the pruned tree are several links. "View FASTA files of these
sequences" shows the full-length sequences included in the tree.
"View extracted subclade Blocks" shows the sequence segments from the
IPB001525 blocks for the five sequences following a MAST form which uses these
pruned blocks as a query. There is also a link to the CODEHOP page described in
Part 6 to design PCR primers from these pruned blocks.
3. Link to Block Maker. Because the IPB001525 blocks
represent conserved regions in all the 158 sequences in the group, they may not
capture the conserved regions in this subtree particularly well. Click on
"Run BLOCK MAKER on these sequences" to make new blocks from just
these five sequences. The Block Maker input form will appear in a new browser
window with the five sequences already inserted (Figure 13).

Part 5: USING BLOCK MAKER
Block Maker finds blocks in a
group of related protein sequences. Block Maker uses the PROTOMAT algorithm
(Henikoff and Henikoff, 1991), a two-step procedure. First, candidate motifs
are found using a motif-finder. Then a best set of motifs is assembled along
the length of most of the sequences.
Block Maker runs PROTOMAT
twice, first using MOTIF (Smith et al., 1990) and second using a Gibbs sampler
[(Neuwald et al., 1995); Unit 2.13] as motif-finding algorithms. It
returns both sets of blocks. While the system attempts to align all sequences
provided, it will sometimes exclude sequences that have diverged too far from the
majority of sequences for the similarity to be detected.
Web interface. The best way to use Block Maker is through the Web at
http://blocks.fhcrc.org/blockmkr/make_blocks.html.
E-mail. Users can send a message containing a group of
related sequences to blockmaker@blocks.fhcrc.org.
Instructions are returned when the word "help" appears in the subject
line of the E-mail message.
UNIX programs. Programs to make blocks are available for UNIX
systems from ftp://ftp.ncbi.nlm.nih.gov/repository/blocks/blimps/.
Data Files. Query sequences are accepted in FASTA or GenBank
format.
1. Submit sequences to Block Maker.
The input to Block Maker is a
set of related sequences in FASTA or GenBank format. In Figure 13,
sequences have been preinserted for the subtree closest to Dnmt2 (Figure 12)
by the ProWeb TreeViewer (Part 4). Sequences can be edited within the
form. For our example, copy and paste the corrected Dnmt2 sequence
(Figure 10)
into the form after the five preinserted sequences in order to make blocks from
it and the other sequences in the subtree. As an alternative to copy and paste,
the name of a file on your workstation containing the sequences can be entered
in the "Enter the name of a file containing your protein sequences"
field.
A minimum of three sequences
is required to make blocks, and Block Maker will accept up to 250 sequences
depending on their combined lengths. Block Maker requires considerable computer
resources, and so sequence sets with a combined length of more than 15,000 amino
acids must be submitted to the E-mail server by entering an E-mail address on
the Web form (Figure 13)
or by mailing them to blockmaker@blocks.fhcrc.org.
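The size rules in this step can be checked before submitting. The sketch below encodes only the thresholds stated in this step: at least three sequences, at most 250, the E-mail route above 15,000 combined residues, and the 100,000-residue ceiling noted next. The function name and return strings are hypothetical, and the servers may apply further limits that are not modeled here.

```python
def block_maker_route(sequences):
    """Suggest how to run Block Maker on a list of protein sequences."""
    count = len(sequences)
    combined = sum(len(seq) for seq in sequences)
    if count < 3:
        return "too few sequences"      # a minimum of three is required
    if count > 250:
        return "too many sequences"     # the servers accept up to 250
    if combined > 100_000:
        return "install locally"        # beyond what the servers process
    if combined > 15_000:
        return "e-mail"                 # results must be returned by E-mail
    return "web"


# Six invented sequences of 300 residues each.
print(block_maker_route(["A" * 300] * 6))
```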
Sets of sequences with a combined length of more than 100,000 amino acids cannot be
processed by the Block Maker servers, and the programs must be installed
locally.
Enter "Dnmt2" in
the "Enter a short description of your group of sequences" field and
click the "Make Blocks" button.
2. Examine Block Maker results.
The Block Maker results
resemble the Get Blocks display (Part 1) for an entry in the Blocks
Database, except that two sets of blocks are displayed. Following an
introduction briefly describing the result format is a "Block Maps"
link which, when clicked, compares the locations of the MOTIF and Gibbs blocks.
Both algorithms found five blocks, but they differ. The A and E blocks
correspond, but the Gibbs B block lies between the MOTIF A and B blocks, and
the MOTIF B and C blocks correspond to the Gibbs C and D blocks. The MOTIF D
block lies between the Gibbs D and E blocks. The MOTIF B and C blocks are
contiguous in all six sequences as are the Gibbs C and D blocks. Both sets of
blocks are wider than the blocks in IPB001525 because the six sequences used to
make them are more similar to one another than are the 158 in IPB001525.
3. Blocks from Motif. Click on "Blocks from
Motif" (Figure 14).
The "Logos", "Tree" and "Search" links are
described in Part 1 and Appendix 1. It is instructive to do a
LAMA search of these blocks against the Blocks Database to see how the blocks
made from this subtree correspond to those from IPB001525 (Appendix 1,
step 3). Click on the "LAMA" link to start the search. MOTIF misses
the IPB001525B region containing the catalytic PCQ motif because PMT1_SCHPO has
SCQ in this position. PMT1_SCHPO is a cryptic pseudogene in the unmethylated Schizosaccharomyces
pombe genome, and replacement of SCQ
by PCQ turns it into a DNA methyltransferase that is active in vitro (Pinarbasi et al., 1996).
4. Blocks from Gibbs. Click on "Blocks from
Gibbs" (Figure 15).
It is again instructive to do a LAMA search of these blocks against the Blocks
Database. In contrast to the MOTIF blocks, the Gibbs B block contains the PCQ
motif corresponding to IPB001525B and correctly aligns PMT1_SCHPO in it. The
Gibbs motif-finder uses a statistical approach that does not depend as heavily
on sequence identity as does MOTIF, which looks for a few common residues in
most sequences (Neuwald et al., 1995).
Click on the
"CODEHOP" link to design primers from Gibbs blocks for polymerase
chain reactions (PCR).

Part 6: DESIGNING PRIMERS FROM BLOCKS
The CODEHOP
(Consensus-Degenerate Hybrid Oligonucleotide primers) program designs DNA
primers that you can use to amplify distantly related homologs of a gene of
interest (Rose et al., 1998). A CODEHOP primer has a degenerate 3'
"core", with a length of 11-12 bp across four codons of highly
conserved amino acids, and a non-degenerate 5' consensus
"clamp" region, with a length that depends on its desired annealing
temperature, typically between 20 and 30 bp.
Web interface. CODEHOP is accessible at http://blocks.fhcrc.org/codehop.html.
UNIX programs. The CODEHOP program is available for UNIX systems
from ftp://ftp.ncbi.nlm.nih.gov/repository/blocks/blimps/.
Data Files. Input is in Blocks format as described at http://blocks.fhcrc.org/block_format.html. Utilities are available at http://blocks.fhcrc.org/process_blocks.html to
convert common multiple alignment formats to Blocks format.
1. Submitting blocks to CODEHOP.
Blocks are inserted into the
CODEHOP Web form by Get Blocks (Part 1) and by Block Maker (Part
5, steps 3 and 4, Figure 16).
Near the top of the form is a link to the "Blocks multiple alignment processor",
which carves out blocks from multiple alignments and then inserts them into the
form. Alternatively, blocks can be copied and pasted into the form. The blocks can be edited
within the form. For instance, you may want to adjust the sequence segment weights
to emphasize some sequences over others. Setting a sequence weight to zero
removes the contribution of that sequence to the block.

2. CODEHOP parameters. Usually it is only necessary
to select an appropriate codon usage table for back-translation of amino acids
and use the default values for the other parameters. If primers are not found
when the defaults are used, then read the "Getting started" guide,
which describes how to adjust the parameters systematically to obtain a
satisfactory set. There are several parameters that can be set, including clamp
annealing temperature, degeneracy and "strictness", which are
described in detail in the "Full Help file".

3. CODEHOP results. Starting with the Gibbs
blocks from Part 5, step 4 (Figure 16),
select the Drosophila melanogaster
codon usage table by scrolling through the list of tables next to "Codon
usage table", then click "Look for primers". You will see a
large number of suggested primers, with the degenerate core in lower case,
using the standard degenerate alphabet, and the consensus clamp in upper case.
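
As a rough illustration of how the degeneracy reported for a primer relates to the degenerate alphabet, the following Python sketch splits a primer into its lower-case core and upper-case clamp and multiplies the number of bases that each core symbol stands for. The function names are illustrative and are not part of CODEHOP; the three primers are the ones quoted for IPB001525A and IPB001525C in this example.

```python
# Number of bases represented by each IUPAC nucleotide code.
IUPAC_FOLD = {
    "a": 1, "c": 1, "g": 1, "t": 1,
    "r": 2, "y": 2, "s": 2, "w": 2, "k": 2, "m": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}


def split_primer(primer):
    """Return (core, clamp): the lower-case and upper-case parts of a primer."""
    core = "".join(ch for ch in primer if ch.islower())
    clamp = "".join(ch for ch in primer if ch.isupper())
    return core, clamp


def degeneracy(core):
    """Number of distinct sequences encoded by a degenerate core."""
    fold = 1
    for ch in core.lower():
        fold *= IUPAC_FOLD[ch]
    return fold


# Primers copied from the worked example in this unit.
primers = [
    "tacgtrrtrcgGAACTTGCTCCGGG",
    "ctyttrcanktCCCGAAGCTCCACAGGT",
    "ttrcanktyccGAAGCTCCACAGGTTCT",
]

for primer in primers:
    core, clamp = split_primer(primer)
    print(core, clamp, degeneracy(core))
```

Under these assumptions the three cores come out as 8-fold, 32-fold and 32-fold degenerate, which agrees with the values given for the example primer pair.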
Some primers have the comment "CLAMP NEEDS EXTENSION". Because CODEHOP
primers cannot extend beyond the limits of the blocks, this comment
indicates that the melting temperature is lower than desired. You
can copy and paste the primer into the oligo temperature calculation site
linked from the comment and add residues to the 5' end until the desired
temperature is reached. The most reliable primers are the least degenerate;
reducing the maximum degeneracy from the default of 128 to 32 yields a
smaller set of primers (Figure 17).
These are mapped along a consensus sequence representing each block and
summarized at the bottom of the page, providing all oligo sequences from 5' to
3' for ordering from a supplier. In the example, the best primer pair consists
of an 8-fold degenerate primer to IPB001525A (tacgtrrtrcgGAACTTGCTCCGGG) and
either of two overlapping 32-fold degenerate primers to the complement of
IPB001525C (ctyttrcanktCCCGAAGCTCCACAGGT or ttrcanktyccGAAGCTCCACAGGTTCT).

DATA INTERPRETATION

Different protein family
search engines can produce different results, especially in the twilight zone.
A search of the Blocks Database does not guarantee correct or complete results.
An expected value is provided for each hit by Block Searcher based on
statistics developed for the MAST system (Bailey and Gribskov, 1998), but the
value can be skewed by compositional bias and repeated domains. Single block
hits require careful evaluation, and it is important to verify uncertain hits
using other searching methods, such as Reverse PSI-BLAST. Phylogenetic trees are
becoming increasingly useful for discerning subfamily relationships; however,
they are no better than the alignments that they are based on. Although block
alignments are limited to conserved regions, and so are likely to be correct,
slight misalignments can occur within a block where it spans a short variable
region. Other uncertainties in the reliability of trees stem from differences
in rates of evolution between positions and from compromises made in
constructing trees, in this case using neighbor-joining. Branch lengths
indicate the degree of divergence of sequences; however, uncertainties in
evolutionary rates add an unknown degree of uncertainty. Although Tree Viewer
indicates which nodes are judged to be reliable by coloring those with 75%
bootstrap support, this is meant as only a rough guide to reliability.

Block Maker includes two
different motif-finding algorithms, MOTIF and Gibbs sampling, that use
different scoring systems. As a result, it is unlikely that both will detect
the same block alignment unless it is real. Block Maker always returns a set
of blocks, even when these are from randomly chosen sequences. You may be
surprised at how real such block alignments can appear, sometimes rivaling
alignments that we are accustomed to seeing in molecular biology publications
(Henikoff, 1991). An interesting exercise for students is to randomly select 10
sequences of >300 amino acids and run Block Maker on them. Using these blocks
in a MAST search will invariably detect each of the sequences that went into
them, despite the fact that the alignments have no meaning! This illustrates
why you should never use the mere ability to obtain a plausible alignment
between two sequences as evidence that they are related. Database searching is
well-suited to validating similarity, as the E-values that are returned can be
interpreted in the context of a comparison against a large set of truly
unrelated proteins or families, without depending on subjective judgments.

COMMENTARY

Background Information

The utility of blocks

Blocks, or motifs, correspond
to minimal units of protein function. They are typically short amino acid
segments that are conserved in sequence and in length. Motifs form protein
active sites, substrate and cofactor-binding sites, and structural features
crucial for function. Although individual amino acids comprise smaller units
than blocks, they are not sufficiently specific to define a unique function.
For example, a position with either Asp or Glu residues can be part of a metal
binding site, a protein binding site, etc. Larger units, made up of multiple
motifs, comprise protein domains that most often correspond to structural folds.
Some distinct domains nevertheless share common motifs, for example, HTH DNA
binding motifs, P‑loop ATP‑binding motifs and Rossmann fold‑like
phosphate/sulphate binding loops. Unlike 3D structural folds, motifs do not
generally assume a stable structure by themselves and depend on the presence of
other (less sequence conserved) protein segments to support and position them.
The alignment-based searching methods that comprise the Blocks system can be
used for detection and analysis of protein functional building blocks in
different contexts.

Block-based alignment methods
differ from those based on global multiple sequence alignment. Both perform
better than single sequence analyses in identifying the functionally critical
sequence regions from a group of related sequences. Block-based methods are explicitly
designed to identify conserved regions, whereas more global multiple sequence
alignment usually includes alignment of both conserved and non-conserved
regions. Global multiple sequence alignment may also be unable to align short
conserved regions that are found in different contexts. Multiple blocks can be
joined to achieve a global alignment, a strategy used by Gapped-BLAST and
PSI-BLAST (Altschul et al., 1997), but the converse is not always true, because
in global alignment the boundary between conserved and non-conserved is often
unclear.

Global multiple alignment
methods have been widely used to identify complete domains, which typically
consist of multiple blocks and adjacent regions. These methods have become
standard for automatic annotation of genomic sequence, because they tend to
identify complete domains. Blocks-based methods are more suitable for analyzing
critical regions and residues within domains, and so the two classes of methods
are complementary.

Making blocks

Blocks are produced by the
automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust
motif-finder to a set of related protein sequences. Resulting candidate motifs
are assembled into a best set along the lengths of the sequences to give a
multiple alignment consisting of ungapped conserved regions separated by
unaligned regions of variable size. The Blocks Database consists of blocks
constructed from protein families cataloged in the InterPro (Apweiler et al.,
2000) collection of protein families.

MOTIF looks for spaced
triples in most of the sequences and aligns them around these triples (Smith et
al., 1990). A spaced triple is a set of three amino acids separated by two
distances. For Block Maker, all spaced triples with all combinations of two distances
ranging from 0 to 17 amino acids each are tallied. PROTOMAT also has been
modified to utilize a Gibbs sampler as motif-finder (Neuwald et al., 1995).
GIBBS uses a statistical sampling algorithm to find motifs and does not rely on
finding amino acid identities in the sequences.

Searching with blocks

Block alignments are
converted into position-specific scoring matrices (PSSMs) for searching. Each
PSSM column corresponds to a block position and includes 20 numerical scores
representing the odds for each amino acid occurring in that position.
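
The idea of a PSSM can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the Block Searcher calculation: it assumes unweighted sequences, a uniform background frequency and add-one pseudocounts, and the toy block and query are hypothetical. Each column holds 20 log-odds scores, and a query is scored by sliding the matrix along it and summing the column scores at every offset.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per block column."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("all block segments must have the same width")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(width):
        # Start every residue at the pseudocount so no score is -infinity.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for segment in block:
            counts[segment[col]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def best_hit(pssm, query):
    """Score every ungapped placement; return (best score, 0-based offset)."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        score = sum(pssm[i][query[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Hypothetical four-sequence block and query, for illustration only.
block = ["PCQGFS", "PCQAFS", "PCQGYS", "SCQGFS"]
pssm = build_pssm(block)
query = "MKTAPCQGFSLLE"
score, offset = best_hit(pssm, query)
print(offset, round(score, 2))
```

In this toy case the best placement starts at offset 4, where the query matches the most common residue in every column of the block.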
Calculation of the Block Searcher PSSMs uses sophisticated methods of sequence
weighting and pseudo-count estimation shown to be effective in comprehensive
tests (Henikoff and Henikoff, 1996). A theoretical score distribution is
computed for each PSSM (Tatusov et al., 1994). A query sequence is compared
with each PSSM in the Blocks Database by aligning it with the block at every
possible position and adding the log-odds scores in each PSSM column. The highest-scoring
alignment is saved, and the probability of its score is looked up in the
theoretical distribution. For families with multiple blocks, each block is
aligned and scored individually with the query, and the probabilities of all
the blocks are combined to give the overall expected value for the alignment of
the query with the blocks for the family (Bailey and Gribskov, 1998). Multiple
blocks are only combined in a hit if they occur in order and within reasonable
distances of one another within the query sequence. Reasonable distances are
determined by looking at the distances between blocks in the known members of a
family.

Block searches against
sequences can be improved upon by searching blocks against blocks. In such
cases both query and target are devoid of non-conserved sequence regions, and
both are defined by amino acid distribution in each position (Pietrokovski,
1996). Since the block-to-block alignment is ungapped and over relatively short
regions, it is possible to automatically identify consistent alignments of
several blocks (Kunin et al., 2001).

Because blocks are inherently
local, they can accommodate partial sequences, such as those that are available
from EST projects. The Block Searcher facilitates this task by accepting DNA
queries, which it translates in 3 or 6 frames, piecing together multiple block
hits in different frames on a DNA strand. This feature is also useful for
identifying missing exons caused by alternative splicing or gene mis‑prediction,
as illustrated in the Dnmt2 example.

Using blocks for tree construction

Multiple sequence alignments
and phylogenetic trees constructed from them are well suited to reconstructing
relationships between the component sequences. However, regions that are
wrongly aligned will confound this analysis. Because blocks correspond to more
confidently aligned segments, trees built from them may be more reliable than reconstructions
based on global alignment. The Tree Viewer tool describes relationships between
sequences that are derived from the best aligned regions of proteins, and this
reduces the concern that divergence is an artifact of misalignment.

Using CODEHOP to isolate orthologs in related organisms

CODEHOP primers overcome
problems of both degenerate and consensus methods for primer design. Hybrid primers
consist of a relatively short 3' degenerate core and a 5' non-degenerate clamp.
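
A back-of-the-envelope count shows why the length of the degenerate region matters. The Python sketch below is an illustration under simplifying assumptions, not CODEHOP's own calculation: it considers a single hypothetical peptide rather than the residues observed in a block, uses the standard genetic code, counts every codon position, and treats the primer as a forward primer whose 3' core covers the last four residues.

```python
# Number of codons per amino acid in the standard genetic code.
CODONS_PER_AA = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def pool_size(peptide):
    """Number of distinct DNA sequences that can encode the peptide."""
    size = 1
    for aa in peptide:
        size *= CODONS_PER_AA[aa]
    return size


def hybrid_pool_size(peptide, core_codons=4):
    """Pool size when only the last `core_codons` residues are degenerate."""
    return pool_size(peptide[-core_codons:])


# Hypothetical ten-residue conserved segment, for illustration only.
peptide = "GLSPRTACQG"
print("fully degenerate primer pool:", pool_size(peptide))
print("hybrid primer pool (4-codon core):", hybrid_pool_size(peptide))
```

Restricting the degeneracy to a four-codon core shrinks the pool by several orders of magnitude for this peptide, while the remaining positions are covered by a single consensus clamp sequence.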
Reducing the length of the 3' core to a minimum decreases the total number of
individual primers in the degenerate primer pool. During the initial PCR
cycles, hybridization of the 3' degenerate core with the target template is
stabilized by pairing of the primer's 5' consensus clamp with the target sequence.

Even in the postgenomic era,
sequencing has hardly begun on the vast majority of genomes on earth, and so
methods are still needed for isolating homologs that are not present in
sequence databases. The CODEHOP primer designer can aid in this task, by
implementing a strategy that permits high stringency annealing to avoid
mispriming by chance. PCR primer design takes advantage of the accumulation of
sequence data, which facilitates the task of obtaining homologous sequences
from organisms of interest. As illustrated by the Dnmt2 example, using just a
subfamily of cytosine methylases allows primers to be designed specifically for
members of this subtree, which should succeed in most organisms that have Dnmt2
orthologs. The cytosine methylase family is typical in that it is so diverse
that the design of PCR primers to specifically amplify them all is infeasible.
Fortunately, the diversity of most protein families is mostly evident in
paralogous relationships, and so limiting oneself to probable orthologs is
likely to be a sound general strategy. As orthologs are expected to share
function, the primer design strategy illustrated for Dnmt2 allows a user to
focus on shared function despite the possible occurrence of paralogs that may
be functionally dissimilar.

Critical Parameters and Troubleshooting

Blocks Database retrieval

Usually a keyword or sequence
name is sufficient to retrieve a family using Get Blocks. However, homology
searching is a more reliable way to determine if a protein belongs to one or
more families, and Reverse PSI-BLAST is fast and sensitive. Because block
alignments may differ from those used for the corresponding InterPro entry,
occasional significant hits may not correspond to their InterPro annotations,
and an example of this is found in Part 2, section 3.

Blocks are not made for every
InterPro entry. In particular, they are not made for entries that are subsets
of other entries. This reduces overlap between families in the Blocks Database.

Avoiding spurious hits in searches

The expected (E) value is the
most critical parameter, where E=1 means that a single hit is expected to occur
by chance, and so higher cutoff values result in more hits being reported. View
significant E-values with caution when there is compositional bias, and use
filtering on such queries (Wootton and Federhen, 1993). Alternatively, search
the Blocks+ database with compositionally biased blocks removed. Compositional
bias can be especially severe when non-coding short repetitive sequences are
present in DNA queries. In addition to searching the Blocks+ database with
compositionally biased blocks removed, you can perform a search using only the
coding strand of the query to reduce background. Block Searcher does not
penalize gaps, and so it is possible that very long DNA queries will report
successive blocks that are implausibly far apart on the same strand. One of
these may be spurious, especially if there is compositional bias in either the
query or the database entry. If a family is represented by only a single block,
then the hit's quality is more difficult to judge. In this case perform another
search using the Reverse PSI-BLAST or IMPALA Searcher to confirm the hit, as
these programs use different alignment algorithms and statistics than Block
Searcher.

Block Maker features

Block Maker constructs blocks
using two very different motif-finders, MOTIF and Gibbs, requiring no
externally provided parameters other than the set of protein sequences
submitted to it. Non-overlapping blocks are found and a "best set" of
blocks is reported, sometimes discarding individual sequences that do not
sufficiently conform with the others. This can occur if a sequence lacks some of the
strongest motifs found in other sequences, or if its motifs are out of order or
overlap.

The complementary strengths
and weaknesses of MOTIF and Gibbs mean that you can compare their results
as a "reality check". PROTOMAT will always report blocks, even if
random sequences are provided. If sequences truly have motifs in common, then
both methods yield similar, and sometimes identical, sets of blocks. However, if
sequences have nothing in common, the two motif‑finding algorithms tend to
pick up completely different meaningless blocks.

Repeated domains are not
handled by Block Maker. Rather, only a single repeat member is aligned within a
block. MEME (Bailey and Elkan, 1994), which is available from http://meme.sdsc.edu/meme/website/,
is designed to align all of the repeat members within a block. MEME uses a
statistical approach that is comparable to Gibbs sampling.

Using CODEHOP interactively

There are ways of reducing
the stringency if you do not get predictions using the default parameters, or
if you don't like what you get. Raising the strictness of the core region, for
example from 0.0 to 0.1 or even to 0.25, will discriminate against the less
probable codons. If one or more of the sequences is expected to be closer to
the desired target gene, then raising its weight relative to the others can
reduce the size of the target primer pool without requiring that you raise the
degeneracy or strictness. You do this by editing the sequence segment weight
in the last column of the block in the Web form. The maximum sequence weight in a block from
the Blocks Database or Block Maker is 100, so you might upweight your favorite
sequence to 200 or 400. You can also ignore the contribution of individual
sequences to the block by down‑weighting them to 0 if they are too
divergent or misaligned and so prevent finding a solution.

Clamp residues can be
selected as the most common codons of the consensus amino acids. Otherwise, the
clamp residues are the ones with maximum weight in the DNA PSSM, which may
result in artificial codons. These do not affect the primers chosen, but the
output may be disturbing.

Suggestions for Further Analysis

Conserved regions of proteins
are those that are most likely to suffer deleterious effects when mutated (Ng
and Henikoff, 2001). SIFT (Sorting Intolerant from Tolerant, http://blocks.fhcrc.org/~pauline/SIFT.html)
is a Web tool for predicting which changes are likely to affect protein
function based on conservation. Given a multiple alignment such as a set of
blocks, SIFT predicts which changes can be expected to damage the protein. If
SIFT is given a sequence, it uses PSI-BLAST to obtain homologous sequences from
sequence databanks for multiple alignment. When applied to human polymorphism
data, SIFT identifies disease loci with about 70% accuracy (Ng and Henikoff,
2002). CODDLE (http://www.proweb.org/coddle)
and PARSESNP (http://www.proweb.org/parsesnp)
are general Web tools for polymorphism and mutation assessment that take
sequence input from a variety of sources, display gene models, and use Blocks
Database alignments to aid in identifying regions most suitable for targeted
mutagenesis.

Literature Cited

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Holo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N.J., Oinn, T.M., Pagni, M., Servant, F., Sigrist, C.J., and Zdobnov, E.M. 2000. InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16: 1145–1150.
Attwood, T.K., Croning, M.D.R., Flower, D.R., Lewis, A.P., Mabey, J.E., Scordia, P., Selley, J.N., and Wright, W. 2000. PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28: 225–227.
Bailey, T. and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28–36. AAAI Press, Menlo Park, CA.
Bailey, T.L. and Gribskov, M. 1998. Combining evidence using p-values: Application to sequence homology searches. Bioinform. 14: 48–54.
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45–48.
Henikoff, J.G. and Henikoff, S. 1996. Using substitution probabilities to improve position-specific scoring matrices. Comput. Appl. Biosci. 12: 135–143.
Henikoff, S. 1991. Playing with blocks: Some pitfalls of forcing multiple alignments. New Biol. 3: 1148–1154.
Henikoff, S. and Henikoff, J.G. 1991. Automated assembly of protein blocks for database searching. Nucleic Acids Res. 19: 6565–6572.
Henikoff, S. and Henikoff, J.G. 1994. Position-based sequence weights. J. Mol. Biol. 243: 574–578.
Henikoff, S. and Henikoff, J.G. 1997. Embedding strategies for effective use of multiple sequence alignment information. Prot. Sci. 6: 698–705.
Henikoff, S., Henikoff, J.G., Alford, W.J., and Pietrokovski, S. 1995. Automated construction and graphical presentation of protein blocks from unaligned sequences. Gene 163: GC17–GC26.
Huang, J.Y. and Brutlag, D.L. 2001. The EMOTIF database. Nucleic Acids Res. 29: 202–204.
Kunin, V., Chan, B., Sitbon, E., Lithwick, G., and Pietrokovski, S. 2001. Consistency analysis of similarity between multiple alignments: prediction of protein function and fold structure from analysis of local sequence motifs. J. Mol. Biol. 307: 939–949.
Neuwald, A.F., Liu, J.S., and Lawrence, C.E. 1995. Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Prot. Sci. 4: 1618–1632.
Ng, P.C. and Henikoff, S. 2001. Predicting deleterious amino acid substitutions. Genome Res. 11: 863–874.
Ng, P.C. and Henikoff, S. 2002. Accounting for human polymorphisms predicted to affect protein function. Genome Res. (In press)
Pearson, W.R. 1990. Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol. 183: 63–98.
Pietrokovski, S. 1996. Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res. 24: 3836–3845.
Pietrokovski, S., Henikoff, J.G., and Henikoff, S. 1998. Exploring protein homology with the Blocks server. Trends Genet. 14: 162–163.
Pinarbasi, E., Elliott, J., and Hornby, D.P. 1996. Activation of a yeast pseudo DNA methyltransferase by deletion of a single amino acid. J. Mol. Biol. 257: 804–813.
Rose, T.M., Schultz, E.R., Henikoff, J.G., Pietrokovski, S., McCallum, C.M., and Henikoff, S. 1998. Consensus-degenerate hybrid oligonucleotide primers for amplification of distantly related sequences. Nucleic Acids Res. 26: 1628–1635.
Saitou, N. and Nei, M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406–425.
Schaffer, A.A., Wolf, Y.I., Ponting, C.P., Koonin, E.V., Aravind, L., and Altschul, S.F. 1999. Software to match a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinform. (In press)
Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V., and Altschul, S.F. 2001. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29: 2994–3005.
Schneider, T.D. and Stephens, R.M. 1990. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18: 6097–6100.
Silverstein, K.A., Shoop, E., Johnson, J.E., and Retzel, E.F. 2001. MetaFam: a unified classification of protein families. I. Overview and statistics. Bioinformatics 17: 249–261.
Smith, H.O., Annau, T.M., and Chandrasegaran, S. 1990. Finding sequence motifs in groups of functionally related proteins. Proc. Natl. Acad. Sci. USA 87: 826–830.
Tatusov, R.L., Altschul, S.F., and Koonin, E.V. 1994. Detection of conserved segments in proteins: Iterative scanning of sequence databases with alignment blocks. Proc. Natl. Acad. Sci. USA 91: 12091–12095.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673–4680.
Wootton, J.C. and Federhen, S. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17: 149–163.

Key References

Henikoff, S. and Henikoff, J.G. 1991. Automated assembly of protein blocks for database searching. Nucleic Acids Res. 19: 6565–6572.
Introduces the Blocks Database, how it is constructed using PROTOMAT and how it is searched using Block Searcher.

Pietrokovski, S. 1996. Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res. 24: 3836–3845.
Introduces LAMA for searching blocks versus a database of blocks as an example of searching multiple alignments against one another for sensitive detection of motifs.

Rose, T.M., Schultz, E.R., Henikoff, J.G., Pietrokovski, S., McCallum, C.M., and Henikoff, S. 1998. Consensus-degenerate hybrid oligonucleotide primers for amplification of distantly related sequences. Nucleic Acids Res. 26: 1628–1635.
Describes the CODEHOP strategy for detecting distant homologs using PCR and the Web-based implementation for designing optimal CODEHOP primers.

Internet Resources

http://blocks.fhcrc.org
Blocks home page.

ProWeb home page.