==============================================================================
                            BLOCK MAKER SERVER
        blockmaker@blocks.fhcrc.org        http://blocks.fhcrc.org/
==============================================================================

The BLOCK MAKER SERVER finds blocks in a group of related protein sequences.
Blocks are short multiply aligned ungapped segments corresponding to the most
highly conserved regions of proteins. Typically, a group of proteins has more
than one region in common and their relationship is represented as a series of
blocks separated by unaligned regions.

Internet address: blockmaker@blocks.fhcrc.org

Message options are:

  Send current help file:     help

  Find blocks in sequences:   >My sequence1
                              MCKTASE....
                              >My sequence2
                              MCKTESE....
                              >My sequence3
                              MCKTEASE....

World Wide Web: http://blocks.fhcrc.org/

Citation: Steven Henikoff, Jorja G. Henikoff, William J. Alford,
& Shmuel Pietrokovski, "Automated construction and
graphical presentation of protein blocks from unaligned
sequences", Gene-COMBIS, Gene 163 (1995), GC 17-26.
Getting help
The current version of this file is returned when the single word HELP appears
on the subject line or in the body of an otherwise blank e-mail message to the
following Internet address:
BLOCKMAKER@BLOCKS.FHCRC.ORG
A database of blocks has been constructed by successive application of the
automated PROTOMAT system (1) to individual entries in the PROSITE catalog of
protein groups (2) keyed to the SWISS-PROT protein sequence databank (3). You
can obtain the complete BLOCKS database and PROSITE catalog by anonymous ftp
from the repository of the National Center for Biotechnology Information:
'ftp ncbi.nlm.nih.gov', log in as 'anonymous', give your e-mail address as the
password, then 'cd repository/blocks' or 'cd repository/prosite'. PROTOMAT
software and documentation for DOS and UNIX machines are also available from
the repository. Ftp instructions are found in the README file in
repository/blocks.
If human help is required or if you find a bug, please contact us (see link at
end of this document). Since we do not save any queries sent to the server, nor
any results sent out, please include both in your message.
Sending sequences
You can send a minimum of 2 and a maximum of 250 protein sequences to
Block Maker. Do not send DNA sequences, as they will be interpreted using the
protein alphabet. PROTOMAT is especially effective for large numbers of
sequences that are difficult to align by standard multi-sequence alignment
methods, but is not a very good method for aligning just 2 sequences or
multiple sequences that are very similar to one another.
Sequences must be in a single format (e.g., "FASTA"), one after the other in
the body of your message. Other acceptable formats are GenBank, EMBL,
Swiss-Prot, GCG, Genepro and PIR. Blank lines are ignored; however, you must
not mix formats or insert extra spaces in front of titles, as either will
confuse the system. In addition, the first 10 characters after the ">" in
FASTA format must be unique for each sequence and must not include characters
that are special to UNIX, such as "~" or "$", because the system uses these
characters to make temporary filenames. Here is an example of an acceptable
message. Note that optional information can be provided in the subject line:

To: blockmaker@blocks.fhcrc.org
Subject: Lipocalins; Five tough ones from Lawrence et al, Science 262:208-214

>BBP_PIEBR BILIN-BINDING PROTEIN (BBP)
NVYHDGACPE VKPVDNFDWS NYHGKWWEVA KYPNSVEKYG KCGWAEYTPE GKSVKVSNYH
VIHGKEYFIE GTAYPVGDSK IGKIYHKLTY GGVTKENVFN VLSTDNKNYI IGYYCKYDED
KKGHQDFVWV LSRSKVLTGE AKTAVENYLI GSPVVDSQKL VYSDFSEAAC KVN
>ICYA_MANSE INSECTICYANIN A FORM (BLUE BILIPROTEIN)
GDIFYPGYCP DVKPVNDFDL SAFAGAWHEI AKLPLENENQ GKCTIAEYKY DGKKASVYNS
FVSNGVKEYM EGDLEIAPDA KYTKQGKYVM TFKFGQRVVN LVPWVLATDY KNYAINYNCD
YHPDKKAHSI HAWILSKSKV LEGNTKEVVD NVLKTFSHLI DASKFISNDF SEAACQYSTT
YSLTGPDRH
>LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG)
MKCLLLALAL TCGAQALIVT QTMKGLDIQK VAGTWYSLAM AASDISLLDA QSAPLRVYVE
ELKPTPEGDL EILLQKWENG ECAQKKIIAE KTKIPAVFKI DALNENKVLV LDTDYKKYLL
FCMENSAEPE QSLACQCLVR TPEVDDEALE KFDKALKALP MHIRLSFNPT QLEEQCHI
>MUP2_MOUSE MAJOR URINARY PROTEIN 2 PRECURSOR (MUP 2)
MKMLLLLCLG LTLVCVHAEE ASSTGRNFNV EKINGEWHTI ILASDKREKI EDNGNFRLFL
EQIHVLEKSL VLKFHTVRDE ECSELSMVAD KTEKAGEYSV TYDGFNTFTI PKTDYDNFLM
AHLINEKDGE TFQLMGLYGR EPDLSSDIKE RFAKLCEEHG ILRENIIDLS NANRCLQARE
>RETB_BOVIN PLASMA RETINOL-BINDING PROTEIN (PRBP)
ERDCRVSSFR VKENFDKARF AGTWYAMAKK DPEGLFLQDN IVAEFSVDEN GHMSATAKGR
VRLLNNWDVC ADMVGTFTDT EDPAKFKMKY WGVASFLQKG NDDHWIIDTD YETFAVQYSC
RLLNLDGTCA DSYSFVFARD PSGFSPEVQK IVRQRQEELC LARQYRLIPH NGYCDGKSER
NIL

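
The submission rules above can be checked before a message is sent. The
following is only a minimal sketch, not the server's own parser: it assumes
FASTA input, reads the "first 10 characters" rule literally (spaces count),
and treats only "~" and "$" as forbidden because those are the only two
characters named here. The function names parse_fasta and check_submission
are hypothetical.

```python
FORBIDDEN = "~$"  # only the two characters named in the text; full set unknown


def parse_fasta(text):
    """Split FASTA-style text into (title, sequence) pairs."""
    records, title, chunks = [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are ignored
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None and not line.lstrip().startswith(">"):
            # drop the spaces that separate groups of residues
            chunks.append("".join(line.split()))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def check_submission(text, min_seqs=2, max_seqs=250, key_len=10):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for line in text.splitlines():
        if line != line.lstrip() and line.lstrip().startswith(">"):
            problems.append("title has leading spaces: " + line.strip())
    records = parse_fasta(text)
    if not (min_seqs <= len(records) <= max_seqs):
        problems.append("expected %d-%d sequences, found %d"
                        % (min_seqs, max_seqs, len(records)))
    seen = set()
    for title, _ in records:
        key = title[:key_len]
        if any(ch in key for ch in FORBIDDEN):
            problems.append("forbidden character in title: " + title)
        if key in seen:
            problems.append("first %d characters not unique: %s"
                            % (key_len, title))
        seen.add(key)
    return problems
```

Under this literal reading, titles such as "My sequence1" and "My sequence2"
would collide, since both begin with "My sequenc"; whether the server counts
spaces in this way is not stated in this file.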
Receiving results
Block Maker will run PROTOMAT twice, first using Smith's MOTIF (4) and second
using a modification of Lawrence's Gibbs sampler (5, 6) as motif-finding
algorithms, and then it will return both sets of blocks to you. While the
system attempts to align all sequences provided, it will sometimes exclude
sequences that have diverged too far from the majority for the similarity to
be detected. Optional information from the subject line of the
input message is used to fill out the ID and AC lines of the output, whereas
the title of the first sequence is used for the DE line. Here is some
sample output:
==============================================================================
**BLOCKS from MOTIF**
>Lipocal Five tough ones from Lawrence et al, Science 262:208-214...
5 sequences are included in 2 blocks

                 LipocalA, width = 15           LipocalB, width = 11
BBP_PIEBR     16 NFDWSNYHGKWWEVA    (  70)  101 VLSTDNKNYII
ICYA_MANSE    17 DFDLSAFAGAWHEIA    (  73)  105 VLATDYKNYAI
LACB_BOVIN    25 GLDIQKVAGTWYSLA    (  70)  110 VLDTDYKKYLL
MUP2_MOUSE    27 NFNVEKINGEWHTII    ( 101)  143 DLSSDIKERFA
RETB_BOVIN    14 NFDKARFAGTWYAMA    (  77)  106 IIDTDYETFAV

**BLOCKS from GIBBS**
>Lipocal Five tough ones from Lawrence et al, Science 262:208-214...
5 sequences are included in 2 blocks

                 LipocalA, width = 15           LipocalB, width = 11
BBP_PIEBR     16 NFDWSNYHGKWWEVA    (  70)  101 VLSTDNKNYII
ICYA_MANSE    17 DFDLSAFAGAWHEIA    (  73)  105 VLATDYKNYAI
LACB_BOVIN    25 GLDIQKVAGTWYSLA    (  70)  110 VLDTDYKKYLL
MUP2_MOUSE    27 NFNVEKINGEWHTII    (  68)  110 IPKTDYDNFLM
RETB_BOVIN    14 NFDKARFAGTWYAMA    (  77)  106 IIDTDYETFAV

Since approximately the same two blocks are reported using both
MOTIF and GIBBS and include all 5 sequences submitted, it is very likely that
these blocks represent correct alignments. Indeed, Lawrence et al (5) indicate
that these alignments are identical to those determined from analysis of the
3-dimensional structures of these proteins, and that these 2 regions are the
only ones in common for the group. Notice, however, that MOTIF apparently
aligned MUP2_MOUSE incorrectly in the B block.
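
The comparison described above can also be done mechanically. The sketch below
copies the positions from the sample output and assumes, as the numbers
suggest but this file does not state, that the value in parentheses is the
number of residues between the end of block A and the start of block B. The
helper names are hypothetical.

```python
WIDTH_A = 15  # width of block LipocalA in the sample output

# name: (start of block A, residues between A and B, start of block B)
MOTIF = {
    "BBP_PIEBR": (16, 70, 101),
    "ICYA_MANSE": (17, 73, 105),
    "LACB_BOVIN": (25, 70, 110),
    "MUP2_MOUSE": (27, 101, 143),
    "RETB_BOVIN": (14, 77, 106),
}
GIBBS = {
    "BBP_PIEBR": (16, 70, 101),
    "ICYA_MANSE": (17, 73, 105),
    "LACB_BOVIN": (25, 70, 110),
    "MUP2_MOUSE": (27, 68, 110),
    "RETB_BOVIN": (14, 77, 106),
}


def offsets_consistent(rows, width_a=WIDTH_A):
    """True if start_A + width_A + gap equals start_B on every row."""
    return all(a + width_a + gap == b for a, gap, b in rows.values())


def differing_sequences(first, second):
    """Names of sequences whose block positions differ between two runs."""
    return sorted(name for name in first if first[name] != second.get(name))


print(offsets_consistent(MOTIF), offsets_consistent(GIBBS))
print(differing_sequences(MOTIF, GIBBS))
```

With the sample data, both runs are internally consistent and MUP2_MOUSE is
the only sequence on which they disagree.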
ftp ncbi.nlm.nih.gov
login: anonymous
cd repository/blocks/unix/blimps
How blocks are made
PROTOMAT (1) is based on a 2-step system for finding a best set of blocks
representing a group of related proteins. The first step finds candidate
alignments and the second step extends the alignments, then sorts them in such
a way that a best set ("best path") is chosen. Since 1991, the first step has
employed a modified version of Hamilton Smith's MOTIF program (4). MOTIF
exhaustively examines all spaced triplets out to a maximum distance for their
presence in at least a subset of sequences. For example, one spaced triplet is
Ala-Ala-Ala, another is Ala-x-Ala-Ala and another is Val-x(16)-Ala-x(7)-Cys
where x represents any amino acid. A spaced triplet found in enough sequences
provides an alignment against which the sequences lacking the triplet can be
aligned to maximize a block score, which is determined using an amino acid
substitution matrix (currently BLOSUM 62). To maximize the sensitivity of
MOTIF, we allow the subset of sequences to be so small that some spaced
triplets would be found even for shuffled sequences. The best alignments are
passed on to the second step (MOTOMAT) which 1) merges overlapping triplets,
2) extends alignments to provide the highest-scoring blocks that still contain
the triplet, and 3) determines the best set of blocks, where the blocks are
all in the right order without overlapping for the largest subset of sequences
in the group. MOTOMAT does not realign sequences that fail to conform, but
rather discards them. We have found MOTIF-MOTOMAT to be very effective in
finding motifs for even the most distantly related groups, and this automated
system is the basis for the current Blocks Database, which can be searched to
detect distant relationships.
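
To make the idea of a spaced triplet concrete, here is a small sketch that
enumerates spaced triplets and counts how many sequences contain each one. It
is an illustration only and not Smith's MOTIF program: it does no scoring or
alignment, and it assumes that "distance" means the offset between
neighbouring residues of the triplet. A triplet is written as
(aa1, d1, aa2, d2, aa3), so Ala-Ala-Ala is ("A", 1, "A", 1, "A") and
Val-x(16)-Ala-x(7)-Cys is ("V", 17, "A", 8, "C"). All function names are
hypothetical.

```python
from collections import defaultdict


def spaced_triplets(seq, max_dist):
    """Yield every spaced triplet (aa1, d1, aa2, d2, aa3) present in seq."""
    n = len(seq)
    for i in range(n):
        for d1 in range(1, max_dist + 1):
            j = i + d1
            if j >= n:
                break
            for d2 in range(1, max_dist + 1):
                k = j + d2
                if k >= n:
                    break
                yield (seq[i], d1, seq[j], d2, seq[k])


def triplet_support(seqs, max_dist):
    """Map each triplet to the set of indices of sequences containing it."""
    support = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for key in spaced_triplets(seq, max_dist):
            support[key].add(idx)
    return support


def frequent_triplets(seqs, max_dist, min_seqs):
    """Triplets found in at least min_seqs sequences, with their counts."""
    return {key: len(ids)
            for key, ids in triplet_support(seqs, max_dist).items()
            if len(ids) >= min_seqs}


seqs = ["MCKTASE", "MCKTESE", "MCKTEASE"]
shared = frequent_triplets(seqs, max_dist=3, min_seqs=3)
```

For these three short sequences, the triplet Met-Cys-Lys occurs in all of
them, whereas Lys-Thr-x-Ser occurs only in the first two.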
In 1993, a "Gibbs sampling strategy" was applied to the problem of
finding motifs in groups of related sequences (5,6). This strategy picks random
positions along all but one of the sequences and then tries to align the
remaining sequence for best fit with the others. This procedure is reiterated
a large number of times until the score is maximized, based on information
content. False starts will fail to improve, whereas detection of a true
pattern in even a small subset of sequences leads to rapid improvement in
score. We have confirmed that this method works very well in practice for even
very large groups, providing very similar alignments to what is obtained using
MOTIF. However, there are limitations to this strategy, in that the number
of blocks representing a group and the minimum width of each of the blocks must
be specified in advance for each run. For typical families, which might have
several blocks of different widths, it is extremely impractical to try all
possible models of number of blocks (N) and minimum widths (W). In spite of
this limitation, the sampling procedure is attractive for obtaining
essentially optimal blocks which could provide a basis for accurate multiple
alignment. Therefore, we have investigated its use as a motif finder to
provide candidate blocks for MOTOMAT to extend, score and sort. We have
developed an effective heuristic strategy for doing this, requiring only a
small number of runs of the Gibbs sampling program and which can be carried
out in a reasonable amount of time. This strategy inevitably makes compromises
not necessary with MOTIF (which is more exhaustive with respect to block width
and number) and sometimes misses blocks that MOTIF finds. Furthermore, GIBBS
is much slower than MOTIF. However, we find that the resulting blocks are less
likely to have errors, and that fewer sequences are discarded by MOTOMAT.
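
The sampling procedure is described above as maximizing a score based on
information content. As a rough illustration of such a score, the sketch below
sums, over the columns of an ungapped block, the relative entropy of the
observed residue frequencies against a background. It assumes a uniform
background over 20 amino acids and no pseudocounts; neither choice is taken
from the Gibbs sampler itself, and the function name is hypothetical.

```python
import math
from collections import Counter


def information_content(block, background=None):
    """Sum over columns of sum_a p_a * log2(p_a / q_a), in bits.

    block is a list of equal-length aligned segments; background maps each
    residue to its frequency q_a (uniform 1/20 is assumed when omitted).
    """
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("all segments in a block must have the same width")
    total = 0.0
    for col in range(width):
        counts = Counter(segment[col] for segment in block)
        for residue, count in counts.items():
            p = count / len(block)
            q = background[residue] if background else 1.0 / 20
            total += p * math.log2(p / q)
    return total


# Block B from the sample output, as reported by each method
GIBBS_B = ["VLSTDNKNYII", "VLATDYKNYAI", "VLDTDYKKYLL",
           "IPKTDYDNFLM", "IIDTDYETFAV"]
MOTIF_B = ["VLSTDNKNYII", "VLATDYKNYAI", "VLDTDYKKYLL",
           "DLSSDIKERFA", "IIDTDYETFAV"]
```

Under these assumptions the GIBBS version of block B from the sample output
scores higher than the MOTIF version, which agrees with the observation that
MOTIF appears to have misplaced MUP2_MOUSE in that block.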
The complementary strengths and weaknesses of MOTIF-based and GIBBS-based
motif-finding methods suggest that they can be compared to provide a "reality
check". PROTOMAT will *ALWAYS* report blocks, even if random sequences are
provided, so it's "garbage in, garbage out". We find that if sequences truly
have motifs in common, then both runs yield similar, and sometimes identical
sets of blocks. However, if sequences have nothing in common, we find that the
two motif-finding algorithms pick up completely different garbage blocks.
It is important to realize that while blocks can be extremely useful for
multi-sequence alignments, the PROTOMAT system was not specifically designed
for this purpose, but rather for database searching applications. So, occasional
misalignments can be tolerated so long as the motifs are correctly identified
for the large majority of sequences; in such cases the contributions from
misaligned segments will be diluted out, and so searching performance will be
affected only slightly. However, because of these occasional errors, we have
refrained from recommending that the blocks be used directly for multi-
sequence alignments. The ability of PROTOMAT to find blocks should not
ordinarily be interpreted as evidence for homology, although the blocks can
aid in the detection of motifs and in the determination of family relationships.
References
If you find Block Maker useful, please cite:
0. Henikoff, S., Henikoff, J.G., Alford, W.J. and Pietrokovski, S. (1995) Automated construction and graphical presentation of protein blocks from unaligned sequences. Gene 163:GC17-26.
1. Henikoff, S. and Henikoff, J.G. (1991) Automated assembly of protein blocks for database searching. Nucleic Acids Research 19:6565-6572.
2. Bairoch, A. (1992) PROSITE: A dictionary of sites and patterns in proteins. Nucleic Acids Research 20:2013-2018.
3. Bairoch, A. and Boeckmann, B. (1992) The SWISS-PROT protein sequence data bank. Nucleic Acids Research 20:2019-2022.
4. Smith, H.O., Annau, T.M. and Chandrasegaran, S. (1990) Finding sequence motifs in groups of functionally related proteins. Proc. Natl. Acad. Sci. USA 87:826-830.
5. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262:208-214.
6. Neuwald, A.F., Liu, J.S. and Lawrence, C.E. (1995) Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Science 4:1618-1621.
7. Henikoff, S. and Henikoff, J.G. (1996) Embedding strategies for effective use of multiple alignment information. Submitted for publication.
8. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22:4673-4680.
9. Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.

Jan 1996: New Gibbs sampler
Some bugs were fixed in Motif and the programs were made more tolerant of
aberrant sequence formats.
The Gibbs sampler program was updated to the new version described in
(6) and the heuristic described in (0) was modified for it. We execute
Gibbs using the "site sampler" and "fragmentation" options. The
fragmentation option allows us to specify just the minimum block width,
instead of specifying the exact block width for each block in the model.
Gibbs will then explore to find an optimal block width. Our revised
heuristic always uses a minimum block width of 8, and chooses the number
of blocks based on the length of the shortest sequence as follows:

                  Gibbs Model

  Minimum sequence      Number of blocks
  length                (minimum length = 8)

  < 36                  1
   36 -  85             2
   86 - 135             3
  136 - 185             4
  186 - 235             5
  236 - 285             6
  etc.
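
The rows listed above follow a regular pattern: one block below length 36,
then one more block for every further 50 residues. A minimal sketch of that
rule follows; the function name is hypothetical, and extending the pattern
beyond the listed rows (the "etc.") is an assumption.

```python
def gibbs_block_count(min_seq_len, first_cutoff=36, step=50):
    """Number of blocks for the Gibbs model, given the shortest sequence length.

    Reproduces the listed rows: < 36 gives 1, 36-85 gives 2, 86-135 gives 3,
    and so on. Values beyond 285 extrapolate the same 50-residue step.
    """
    if min_seq_len < first_cutoff:
        return 1
    return 2 + (min_seq_len - first_cutoff) // step


for length in (35, 36, 85, 86, 185, 236, 285):
    print(length, gibbs_block_count(length))
```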