Block Maker Help


        ___________               ___________               ___________ 
       |\ __________\            |___________|            /__________ /|
       | |           |           |           |           |           | |
       | | **********|           |***********|           |********** | |
       | | *  BLOCK  |           |   MAKER   |           |  SERVER * | |
       | | **********|           |***********|           |********** | |
        \|___________|___________|___________|___________|___________|/
                     |\ __________\         /__________ /|
                     | |blockmaker@|       |  http://  | |
                     | |   blocks. |       |    blocks.| |
                     | |   fhcrc.  |       |    fhcrc. | |
                     | |   org     |       |    org/   | |
                      \|___________|       |___________|/

The BLOCK MAKER SERVER finds blocks in a group of related protein sequences. Blocks are short multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions.

A quick summary


Internet address:       blockmaker@blocks.fhcrc.org

Message options are

      Send current help file:       help

      Find blocks in sequences:     >Sequence1
                                    MCKTASE.... 
                                    >Sequence2
                                    MCKTESE....
                                    >Sequence3
                                    MCKTEASE....

World Wide Web:		http://blocks.fhcrc.org/

Citation:	Steven Henikoff, Jorja G. Henikoff, William J. Alford,
		& Shmuel Pietrokovski, "Automated construction and
		graphical presentation of protein blocks from unaligned
		sequences", Gene-COMBIS, Gene 163 (1995), GC 17-26.


Getting help

The current version of this file is returned when the single word HELP appears on the subject line or in the body of an otherwise blank e-mail message to the following Internet address:

BLOCKMAKER@BLOCKS.FHCRC.ORG

A database of blocks has been constructed by successive application of the automated PROTOMAT system (1) to individual entries in the PROSITE catalog of protein groups (2) keyed to the SWISS-PROT protein sequence databank (3). You can obtain the complete BLOCKS database and PROSITE catalog from the repository of the National Center for Biotechnology Information by ftp ('ftp ncbi.nlm.nih.gov', log in as 'anonymous', give your e-mail address as the password, then 'cd repository/blocks' or 'cd repository/prosite'). PROTOMAT software and documentation for DOS and UNIX machines are also available from the repository. Ftp instructions are found in the README file in repository/blocks. If human help is required or if you find a bug, please contact us (see link at end of this document). Since we do not save any queries sent to the server, nor any results sent out, please include these in your message.

Sending sequences

You can send a minimum of 2 and a maximum of 250 protein sequences to Block Maker. Do not send DNA sequences, as they will be interpreted using the protein alphabet. PROTOMAT is especially effective for large numbers of sequences that are difficult to align by standard multi-sequence alignment methods, but is not a very good method for aligning just 2 sequences or multiple sequences that are very similar to one another.
Sequences must all be in a single format (e.g., "FASTA"), one after the other in the body of your message. Other acceptable formats are GenBank, EMBL, Swiss-Prot, GCG, Genepro and PIR. Blank lines are ignored; however, you must not mix formats or insert extra spaces in front of titles, either of which will fool the system. In addition, the first 10 characters after the ">" in FASTA format must be unique for each sequence, and must not include characters that have a special meaning under UNIX, such as "~" or "$", because the system uses these characters to make temporary filenames. Here is an example of an acceptable message. Note that optional information can be provided in the subject line:
To: blockmaker@blocks.fhcrc.org
Subject: Lipocalins; Five tough ones from Lawrence et al, Science 262:208-214

>BBP_PIEBR BILIN-BINDING PROTEIN (BBP) 
NVYHDGACPE VKPVDNFDWS NYHGKWWEVA KYPNSVEKYG KCGWAEYTPE GKSVKVSNYH
VIHGKEYFIE GTAYPVGDSK IGKIYHKLTY GGVTKENVFN VLSTDNKNYI IGYYCKYDED
KKGHQDFVWV LSRSKVLTGE AKTAVENYLI GSPVVDSQKL VYSDFSEAAC KVN

>ICYA_MANSE INSECTICYANIN A FORM (BLUE BILIPROTEIN) 
GDIFYPGYCP DVKPVNDFDL SAFAGAWHEI AKLPLENENQ GKCTIAEYKY DGKKASVYNS
FVSNGVKEYM EGDLEIAPDA KYTKQGKYVM TFKFGQRVVN LVPWVLATDY KNYAINYNCD
YHPDKKAHSI HAWILSKSKV LEGNTKEVVD NVLKTFSHLI DASKFISNDF SEAACQYSTT
YSLTGPDRH

>LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG)
MKCLLLALAL TCGAQALIVT QTMKGLDIQK VAGTWYSLAM AASDISLLDA QSAPLRVYVE
ELKPTPEGDL EILLQKWENG ECAQKKIIAE KTKIPAVFKI DALNENKVLV LDTDYKKYLL
FCMENSAEPE QSLACQCLVR TPEVDDEALE KFDKALKALP MHIRLSFNPT QLEEQCHI

>MUP2_MOUSE MAJOR URINARY PROTEIN 2 PRECURSOR (MUP 2) 
MKMLLLLCLG LTLVCVHAEE ASSTGRNFNV EKINGEWHTI ILASDKREKI EDNGNFRLFL
EQIHVLEKSL VLKFHTVRDE ECSELSMVAD KTEKAGEYSV TYDGFNTFTI PKTDYDNFLM
AHLINEKDGE TFQLMGLYGR EPDLSSDIKE RFAKLCEEHG ILRENIIDLS NANRCLQARE
>RETB_BOVIN PLASMA RETINOL-BINDING PROTEIN (PRBP) 
ERDCRVSSFR VKENFDKARF AGTWYAMAKK DPEGLFLQDN IVAEFSVDEN GHMSATAKGR
VRLLNNWDVC ADMVGTFTDT EDPAKFKMKY WGVASFLQKG NDDHWIIDTD YETFAVQYSC
RLLNLDGTCA DSYSFVFARD PSGFSPEVQK IVRQRQEELC LARQYRLIPH NGYCDGKSER
NIL
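
The submission rules above can be checked before a message is sent. The following is a minimal sketch in Python, not part of Block Maker itself: the function names parse_fasta and check_submission are hypothetical, and only the rules stated in this section are tested (2 to 250 sequences, no spaces in front of titles, unique first 10 characters, no "~" or "$").

```python
import re

MIN_SEQS, MAX_SEQS = 2, 250   # limits stated in this help file
FORBIDDEN = set("~$")         # characters the help file warns against


def parse_fasta(body):
    """Return a list of (title, sequence) pairs from a FASTA message body."""
    records = []
    title, chunks = None, []
    for line in body.splitlines():
        if not line.strip():
            continue                      # blank lines are ignored
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].rstrip(), []
        elif title is not None:
            # Remove the spaces that separate groups of 10 residues.
            chunks.append(re.sub(r"\s+", "", line))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def check_submission(body):
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    for line in body.splitlines():
        if line.lstrip().startswith(">") and not line.startswith(">"):
            problems.append("title line has leading spaces: %r" % line)
    records = parse_fasta(body)
    if not MIN_SEQS <= len(records) <= MAX_SEQS:
        problems.append("expected 2 to 250 sequences, found %d" % len(records))
    seen = set()
    for title, _ in records:
        key = title[:10]                  # only the first 10 characters count
        if FORBIDDEN & set(key):
            problems.append("forbidden character in name: %r" % key)
        if key in seen:
            problems.append("first 10 characters not unique: %r" % key)
        seen.add(key)
    return problems
```

An empty list from check_submission means only that none of these particular rules was violated; it does not check the other accepted formats or whether the residues are valid protein letters.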

Receiving results

Block Maker will run PROTOMAT twice, first using Smith's MOTIF (4) and second using a modification of Lawrence's Gibbs sampler (5, 6) as motif-finding algorithms, and then it will return both sets of blocks to you. While the system attempts to align all sequences provided, it will sometimes exclude sequences that are too diverged from the majority of sequences for the similarity to be detected. Optional information from the subject line of the input message is used to fill out the ID and AC lines of the output, whereas the title of the first sequence is used for the DE line. Here is some sample output:
==============================================================================
              **BLOCKS from MOTIF**
 
>Lipocal Five tough ones from Lawrence et al, Science 262:208-214...
5 sequences are included in 2 blocks

            LipocalA, width = 15 LipocalB, width = 11     
 BBP_PIEBR    16 NFDWSNYHGKWWEVA (  70)   101 VLSTDNKNYII
ICYA_MANSE    17 DFDLSAFAGAWHEIA (  73)   105 VLATDYKNYAI
LACB_BOVIN    25 GLDIQKVAGTWYSLA (  70)   110 VLDTDYKKYLL
MUP2_MOUSE    27 NFNVEKINGEWHTII ( 101)   143 DLSSDIKERFA
RETB_BOVIN    14 NFDKARFAGTWYAMA (  77)   106 IIDTDYETFAV
 
 
              **BLOCKS from GIBBS**
 
>Lipocal Five tough ones from Lawrence et al, Science 262:208-214...
5 sequences are included in 2 blocks

            LipocalA, width = 15 LipocalB, width = 11     
 BBP_PIEBR    16 NFDWSNYHGKWWEVA (  70)   101 VLSTDNKNYII
ICYA_MANSE    17 DFDLSAFAGAWHEIA (  73)   105 VLATDYKNYAI
LACB_BOVIN    25 GLDIQKVAGTWYSLA (  70)   110 VLDTDYKKYLL
MUP2_MOUSE    27 NFNVEKINGEWHTII (  68)   110 IPKTDYDNFLM
RETB_BOVIN    14 NFDKARFAGTWYAMA (  77)   106 IIDTDYETFAV
Since approximately the same two blocks are reported using both MOTIF and GIBBS and include all 5 sequences submitted, it is very likely that these blocks represent correct alignments. Indeed, Lawrence et al (5) indicate that these alignments are identical to those determined from analysis of the 3-dimensional structures of these proteins, and that these 2 regions are the only ones in common for the group. Notice, however, that MOTIF apparently aligned MUP2_MOUSE incorrectly in the B block.
Please note that, because the GIBBS algorithm is stochastic (it uses randomly determined starting points), the GIBBS results may differ when essentially the same sequences are submitted repeatedly. Although Block Maker always uses the same seed for the random number generator and always sorts the sequences alphabetically by name, a small change to the sequences (even changing a sequence name, and thus its order in the input presented to GIBBS) can change the results.
Blocks are also returned in a "Searchable" format for the BLIMPS searching program, which will search a block against a sequence database. BLIMPS is available by anonymous ftp from the NCBI repository:
	ftp ncbi.nlm.nih.gov
	login: anonymous
	cd repository/blocks/unix/blimps

How blocks are made

PROTOMAT (1) is based on a 2-step system for finding a best set of blocks representing a group of related proteins. The first step finds candidate alignments and the second step extends the alignments, then sorts them in such a way that a best set ("best path") is chosen. Since 1991, the first step has employed a modified version of Hamilton Smith's MOTIF program (4). MOTIF exhaustively examines all spaced triplets out to a maximum distance for their presence in at least a subset of sequences. For example, one spaced triplet is Ala-Ala-Ala, another is Ala-x-Ala-Ala and another is Val-x(16)-Ala-x(7)-Cys, where x represents any amino acid. A spaced triplet found in enough sequences provides an alignment against which the sequences lacking the triplet can be aligned to maximize a block score, which is determined using an amino acid substitution matrix (currently BLOSUM 62). To maximize the sensitivity of MOTIF, we allow the subset of sequences to be so small that some spaced triplets would be found even for shuffled sequences. The best alignments are passed on to the second step (MOTOMAT), which 1) merges overlapping triplets, 2) extends alignments to provide the highest-scoring blocks that still contain the triplet, and 3) determines the best set of blocks, where the blocks are all in the right order without overlapping for the largest subset of sequences in the group. MOTOMAT does not realign sequences that fail to conform, but rather discards them. We have found MOTIF-MOTOMAT to be very effective in finding motifs for even the most distantly related groups, and this automated system is the basis for the current Blocks Database, which can be searched to detect distant relationships.
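
The idea of a spaced triplet can be illustrated with a short Python sketch. This is not the MOTIF program: the function names, the brute-force enumeration and the parameters max_gap and min_seqs are assumptions made for illustration, and no block scoring or alignment of the remaining sequences is attempted.

```python
from collections import defaultdict


def spaced_triplets(seq, max_gap):
    """Yield (a, gap1, b, gap2, c) for every spaced triplet in seq.

    gap1 and gap2 count the 'x' positions between the residues, so
    ('A', 0, 'A', 0, 'A') is Ala-Ala-Ala and ('A', 1, 'A', 0, 'A')
    is Ala-x-Ala-Ala.
    """
    n = len(seq)
    for i in range(n):
        for gap1 in range(max_gap + 1):
            j = i + gap1 + 1
            if j >= n:
                break
            for gap2 in range(max_gap + 1):
                k = j + gap2 + 1
                if k >= n:
                    break
                yield (seq[i], gap1, seq[j], gap2, seq[k])


def common_triplets(seqs, max_gap, min_seqs):
    """Return {triplet: number of sequences containing it}.

    Only triplets present in at least min_seqs sequences are kept.
    """
    support = defaultdict(int)
    for seq in seqs:
        # set() so a triplet is counted once per sequence
        for triplet in set(spaced_triplets(seq, max_gap)):
            support[triplet] += 1
    return {t: n for t, n in support.items() if n >= min_seqs}
```

For example, common_triplets(["MAKAAG", "TAQAAL", "AWAAV"], 2, 3) includes the triplet ('A', 1, 'A', 0, 'A'), that is Ala-x-Ala-Ala, because it occurs in all three sequences.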
In 1993, a "Gibbs sampling strategy" was applied to the problem of finding motifs in groups of related sequences (5,6). This strategy picks random positions along all but one of the sequences and then tries to align the remaining sequence for best fit with the others. This procedure is repeated a large number of times until the score, which is based on information content, is maximized. False starts will fail to improve, whereas detection of a true pattern in even a small subset of sequences leads to rapid improvement in score. We have confirmed that this method works very well in practice for even very large groups, providing alignments very similar to those obtained using MOTIF. However, there are limitations to this strategy, in that the number of blocks representing a group and the minimum width of each of the blocks must be specified in advance for each run. For typical families, which might have several blocks of different widths, it is extremely impractical to try all possible models of number of blocks (N) and minimum widths (W). In spite of this limitation, the sampling procedure is attractive for obtaining essentially optimal blocks which could provide a basis for accurate multiple alignment. Therefore, we have investigated its use as a motif finder to provide candidate blocks for MOTOMAT to extend, score and sort. We have developed an effective heuristic strategy for doing this, one that requires only a small number of runs of the Gibbs sampling program and can be carried out in a reasonable amount of time. This strategy inevitably makes compromises not necessary with MOTIF (which is more exhaustive with respect to block width and number) and sometimes misses blocks that MOTIF finds. Furthermore, GIBBS is much slower than MOTIF. However, we find that the resulting blocks are less likely to have errors, and that fewer sequences are discarded by MOTOMAT.
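
The hold-one-out procedure described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the Gibbs program used by Block Maker: it assumes a single block of fixed width, pseudocounts of 0.5, a fixed number of sweeps instead of a convergence test, and no fragmentation option. All names and default values are hypothetical.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def gibbs_site_sampler(seqs, width, iterations=200, seed=0, pseudo=0.5):
    """Return one 0-based start position per sequence for a single block."""
    rng = random.Random(seed)          # fixed seed, so runs are repeatable
    # Start from random positions in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out, seq in enumerate(seqs):
            # Column counts built from all sequences except the held-out one.
            counts = [dict.fromkeys(ALPHABET, pseudo) for _ in range(width)]
            for idx, other in enumerate(seqs):
                if idx == held_out:
                    continue
                for col in range(width):
                    res = other[starts[idx] + col]
                    if res in counts[col]:
                        counts[col][res] += 1
            totals = [sum(c.values()) for c in counts]
            # Weight every candidate start in the held-out sequence by how
            # well its segment fits the current column frequencies.
            candidates = range(len(seq) - width + 1)
            weights = []
            for pos in candidates:
                weight = 1.0
                for col in range(width):
                    res = seq[pos + col]
                    weight *= counts[col].get(res, pseudo) / totals[col]
                weights.append(weight)
            # Sample the new position in proportion to its weight.
            starts[held_out] = rng.choices(candidates, weights=weights)[0]
    return starts
```

Because the random number generator is seeded, calling the function twice with the same sequences in the same order gives the same positions, while reordering the input can change them.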
The complementary strengths and weaknesses of MOTIF-based and GIBBS-based motif-finding methods suggest that they can be compared to provide a "reality check". PROTOMAT will *ALWAYS* report blocks, even if random sequences are provided, so it's "garbage in, garbage out". We find that if sequences truly have motifs in common, then both runs yield similar, and sometimes identical, sets of blocks. However, if sequences have nothing in common, we find that the two motif-finding algorithms pick up completely different garbage blocks.
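
One simple way to quantify this "reality check" is to compare the block start positions reported by the two runs. The Python sketch below is an illustration only; the function name block_agreement and the pairing of blocks by their order are assumptions, and it is meaningful only when both runs report the same number of blocks. The positions are taken from the sample output under "Receiving results".

```python
def block_agreement(run_a, run_b, tolerance=0):
    """Fraction of (sequence, block) start positions on which two runs agree.

    run_a and run_b map a sequence name to a list of block start positions.
    Blocks are paired by order, so both runs should have equal block counts.
    """
    shared = sorted(set(run_a) & set(run_b))
    agree = total = 0
    for name in shared:
        for start_a, start_b in zip(run_a[name], run_b[name]):
            total += 1
            if abs(start_a - start_b) <= tolerance:
                agree += 1
    return agree / total if total else 0.0


# Start positions from the sample MOTIF and GIBBS output.
motif = {"BBP_PIEBR": [16, 101], "ICYA_MANSE": [17, 105],
         "LACB_BOVIN": [25, 110], "MUP2_MOUSE": [27, 143],
         "RETB_BOVIN": [14, 106]}
gibbs = {"BBP_PIEBR": [16, 101], "ICYA_MANSE": [17, 105],
         "LACB_BOVIN": [25, 110], "MUP2_MOUSE": [27, 110],
         "RETB_BOVIN": [14, 106]}

print(block_agreement(motif, gibbs))   # 9 of 10 positions agree
```

The single disagreement is the B block of MUP2_MOUSE. A value near zero would suggest that the two runs found unrelated blocks.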
It is important to realize that while blocks can be extremely useful for multi-sequence alignments, the PROTOMAT system was not specifically designed for this purpose, but rather for database searching applications. So, occasional misalignments can be tolerated so long as the motifs are correctly identified for the large majority of sequences; in such cases the contributions from misaligned segments will be diluted out, and so searching performance will be affected only slightly. However, because of these occasional errors, we have refrained from recommending that the blocks be used directly for multi-sequence alignments. The ability of PROTOMAT to find blocks should not ordinarily be interpreted as evidence for homology, although the blocks can aid in the detection of motifs and in the determination of family relationships.

References

If you find Block Maker useful, please cite:

0. Henikoff, S., Henikoff, J.G., Alford, W.J. and Pietrokovski, S. (1995) Automated construction and graphical presentation of protein blocks from unaligned sequences. Gene 163:GC17-26.

1. Henikoff, S. and Henikoff, J.G. (1991) Automated assembly of protein blocks for database searching. Nucleic Acids Research 19:6565-6572.

2. Bairoch, A. (1992) PROSITE: A dictionary of sites and patterns in proteins. Nucleic Acids Research 20:2013-2018.

3. Bairoch, A. and Boeckmann, B. (1992) The SWISS-PROT protein sequence data bank. Nucleic Acids Research 20:2019-2022.

4. Smith, H.O., Annau, T.M. and Chandrasegaran, S. (1990) Finding sequence motifs in groups of functionally related proteins. Proc. Natl. Acad. Sci. USA 87:826-830.

5. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262:208-214.

6. Neuwald, A.F., Liu, J.S. and Lawrence, C.E. (1995) Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Science 4:1618-1621.

7. Henikoff, S. and Henikoff, J.G. (1996) Embedding strategies for effective use of multiple alignment information. Submitted for publication.

8. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22:4673-4680.

9. Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.


Jan 1996: New Gibbs sampler

Some bugs were fixed in Motif and the programs were made more tolerant of aberrant sequence formats.
The Gibbs sampler program was updated to the new version described in (6) and the heuristic described in (0) was modified for it. We execute Gibbs using the "site sampler" and "fragmentation" options. The fragmentation option allows us to specify just the minimum block width, instead of specifying the exact block width for each block in the model. Gibbs will then explore to find an optimal block width. Our revised heuristic always uses a minimum block width of 8, and chooses the number of blocks based on the length of the shortest sequence as follows:
                        Gibbs Model
        Minimum sequence        Number of blocks
        length                  (minimum width = 8)
            <  36               1
         36 -  85               2
         86 - 135               3
        136 - 185               4
        186 - 235               5
        236 - 285               6
                                etc.
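
The table above follows a regular pattern: one block for sequences shorter than 36 residues, then one more block for every further 50 residues. A small Python sketch of that rule follows; the function name gibbs_block_count is hypothetical, and extending the pattern beyond 285 residues is an assumption based on the "etc." in the table.

```python
def gibbs_block_count(min_seq_length):
    """Number of blocks in the Gibbs model for a given shortest sequence length."""
    if min_seq_length < 36:
        return 1
    # 36-85 gives 2, 86-135 gives 3, and so on in steps of 50 residues.
    return (min_seq_length - 36) // 50 + 2
```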

June 1996: COBBLER sequence

A "COBBLER sequence" (7) is now returned with each set of blocks (one from Motif, another from Gibbs). This is one of the submitted sequences that has been embedded with consensus residues computed from the blocks to bias it towards the conserved regions. The consensus residues appear in upper case. This sequence can be used as the query in a standard homology search (e.g. using blastp or fasta). Caution should be exercised in interpreting the results of such a search, however. If the submitted sequences are not actually related, then the resulting blocks may be "garbage", and the COBBLER sequence will be biased towards the same garbage.
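
The general idea of embedding can be sketched in Python as follows. This is an illustration only and not the COBBLER procedure of reference (7): it assumes a simple majority-vote consensus for each block column, and the function names are hypothetical.

```python
from collections import Counter


def consensus(segments):
    """Most common residue in each column of an ungapped block."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*segments))


def embed_consensus(sequence, blocks):
    """Lower-case the sequence, then overwrite each block region with the
    upper-case consensus of that block.

    blocks is a list of (start, segments): start is the 1-based position of
    the block in `sequence`, and segments are the block rows of all sequences.
    """
    out = list(sequence.lower())
    for start, segments in blocks:
        cons = consensus(segments).upper()
        out[start - 1:start - 1 + len(cons)] = cons
    return "".join(out)


# The block starts at position 3; its consensus TAY replaces the original TAF.
print(embed_consensus("MKTAFIAKQR", [(3, ["TAF", "TAY", "SAY"])]))
# prints mkTAYiakqr
```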

Sept 1996: BLAST link

The COBBLER sequence may now be automatically sent to the BLAST server to search the non-redundant protein database with default parameters.

Oct 1996: Neighbor-joining tree

A tree made from the Block Maker block alignments by the CLUSTALW program (8) using the neighbor-joining method (9) is available in XBitmap format. It is based on a matrix of distances between all pairs of sequences.
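
A matrix of pairwise distances can be illustrated with the short Python sketch below. It assumes the simplest possible measure, the fraction of differing positions between the concatenated block segments of two sequences; the distance actually computed by CLUSTALW may differ, and the function name is hypothetical.

```python
def pairwise_distances(aligned):
    """Return {(name_a, name_b): fraction of differing positions}.

    aligned maps each sequence name to its concatenated block segments,
    which must all have the same length because blocks are ungapped.
    """
    names = sorted(aligned)
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            row_a, row_b = aligned[a], aligned[b]
            diffs = sum(x != y for x, y in zip(row_a, row_b))
            dist[(a, b)] = diffs / len(row_a)
    return dist
```

A neighbor-joining implementation would take such a matrix as its input and repeatedly join the closest pair of nodes.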

Oct 1996: MAST link

Position-specific scoring matrices are made from blocks returned by Block Maker in a format suitable for the MAST Searching Service.

Mar 1998: Protomat 10

These are the program versions used to create version 10.0 of the blocks database based on PROSITE 14.0. Although the basic algorithms for the motif finder (motifj program) and motif assembler (motomat program) remain the same, major adjustments were made to the PROTOMAT programs for this release with the intention of improving the results, first by saving fewer motifs spread along more of the sequence lengths (motifj), and second by reducing competition between overlapping motifs during assembly (motomat). We also now run a new program (addseqs) to add related sequences excluded from the blocks by PROTOMAT back into the blocks; these sequences are added only if they can be aligned with all of the blocks in the correct order using the BLIMPS searching procedure.


Contact us