MR using keyword input
Let us look at a very simple PHASER script (AUTO_toxd1.com in the distributed tutorial files):
phaser << eof
MODE MR_AUTO
HKLIN toxd.mtz
LABIN F = FTOXD3 SIGF = SIGFTOXD3 #optional from Phaser-2.7.12
ENSEMBLE toxd PDBFILE 1D0D_B.pdb IDENTITY 0.364
COMPOSITION PROTEIN SEQUENCE toxd.seq NUM 1
SEARCH ENSEMBLE toxd NUM 1
ROOT AUTO_toxd1
eof
The words in bold are Phaser keywords. Only the first 4 characters are significant, so it does not matter whether you write ENSEMBLE, ENSEMB or ENSE.
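The four-character rule can be mimicked in the shell. This is only an illustration of the rule as stated above, not Phaser's own parser, and the helper name keyword_stem is made up for this example.

```shell
# Hypothetical helper: keep only the first four characters of a word,
# which is all that is significant according to the rule above.
keyword_stem() {
  printf '%s\n' "$1" | cut -c1-4
}

# All three spellings reduce to the same four-character stem.
for word in ENSEMBLE ENSEMB ENSE; do
  keyword_stem "$word"
done
```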
Let us examine the contents of this script line by line. The first line:
phaser << eof
tells us what to run. phaser is the Phaser executable, which must be in your PATH for this to work. If you run phaser on its own, it does not do anything straight away but waits for user input (try it). So we need to feed it some information before we get anything out of it, and this is what the script is for. The last part of the first line says: feed in the subsequent lines until you hit eof (at the end of the script).
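To see the "feed in lines until eof" mechanism on its own, the sketch below swaps phaser for cat, which simply copies its standard input to its output. Nothing here is specific to Phaser; the two keyword lines are borrowed from the script above, and the variable name received is made up for the example.

```shell
# cat stands in for phaser: it passes on whatever arrives on standard input.
# Everything between "<< eof" and the closing "eof" line is fed to the command.
received=$(cat << eof
MODE MR_AUTO
ROOT AUTO_toxd1
eof
)

# Show what the command was given: the two keyword lines, without the eof marker.
printf '%s\n' "$received"
```

The word eof is not special; any marker works as long as the same word appears after << and on the closing line.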
Let us move on to the next line:
MODE MR_AUTO
MODE is a Phaser keyword and tells it what kind of job we want to run. Phaser can be used for many different jobs, so it needs to know what it is being used for. In this case MR_AUTO stands for Molecular Replacement - AUTOmatic. Other possibilities are MR_FRF or MR_FTF for example. Most molecular replacement problems can be solved with the AUTO mode, but very specialized jobs can be run by using the other keywords individually.
The next line in the script:
HKLIN toxd.mtz
specifies where our data are coming from. The HKL data are found in an MTZ file located in the directory you run the script from. MTZ files can store a lot of data so we need to tell Phaser which part of the file we need. The next line in the script:
LABIN F = FTOXD3 SIGF = SIGFTOXD3
specifies which columns the structure factor amplitudes and their standard deviations come from. Note that, when intensities and their standard deviations are available (which is not the case for this structure), it is preferable to use intensities instead of amplitudes. These are provided through a LABIN command as well, but with I and SIGI instead of F and SIGF. From Phaser-2.7.12 the LABIN command is optional: if only one I column is present in the MTZ file, that column is used; otherwise, if only one F column is present, that column is used.
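For a data set that does carry intensities, the data-input part of a script might look like the sketch below. The file name mydata.mtz and the column labels IOBS and SIGIOBS are hypothetical; substitute the labels actually present in your own MTZ file. The keywords are written to a file here only so that they can be inspected; this is a fragment, not a complete job.

```shell
# Hypothetical file name and column labels; only the keywords come from the tutorial.
cat > intensity_input.txt << 'eof'
MODE MR_AUTO
HKLIN mydata.mtz
LABIN I = IOBS SIGI = SIGIOBS
eof

# Print the fragment to check it before adding the remaining keywords.
cat intensity_input.txt
```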
Next we need to specify a model that we will use for the molecular replacement. The line:
ENSEMBLE toxd PDBFILE 1D0D_B.pdb IDENTITY 0.364
tells us the PDB file for the model and the sequence identity between this model and the corresponding protein in our crystal, given as a fraction (0.364 means 36.4%). The sequence identity is needed so that Phaser can estimate the RMS error in the model. Ignore the keyword ENSEMBLE for the time being; it will be discussed in more detail later. For now, just think of it as a handle to the model.
The next line:
COMPOSITION PROTEIN SEQUENCE toxd.seq NUM 1
tells us the composition of the asymmetric unit. In our case it is protein with a sequence given in the file toxd.seq, and there is one molecule in the asymmetric unit. We could have specified the composition in a number of other ways, for instance by saying that the molecular weight of the protein is 7139 Da.
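As a sketch of the molecular-weight alternative, the COMPOSITION line could be written with an MW subkey in place of SEQUENCE. The subkey spelling is an assumption based on the Phaser keyword documentation, so check it against the version you are running.

```shell
# Same composition as in the script, given by molecular weight instead of sequence.
# Assumption: MW is the molecular-weight subkey in your Phaser version.
cat > composition_mw.txt << 'eof'
COMPOSITION PROTEIN MW 7139 NUM 1
eof

cat composition_mw.txt
```

This line would replace the COMPOSITION PROTEIN SEQUENCE line in the script; the two forms describe the same single molecule.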
Now all we need to do is to tell Phaser which model to search with. In our case we only have one model so it is trivial:
SEARCH ENSEMBLE toxd NUM 1
The subkey "NUM 1" is the default, but we could have asked Phaser to search for more than one copy. Finally we tell Phaser where to put all the output files. The line:
ROOT AUTO_toxd1
tells it to put everything in the tutorial directory and that the start of the output file names will be AUTO_toxd1, so you will get a bunch of files called AUTO_toxd1.log, AUTO_toxd1.sum, etc.
Ok, we are done. Let us run it! If you now type
in the tutorial directory, it should all run. After it has finished you will notice a number of files that have been created. The best file to look at to get an overview of the job is the summary file, AUTO_toxd1.sum in this case. In that file you will see the progress of all the steps in an automatic Phaser run: correction for anisotropy, cell content analysis, rotation search, rescoring rotation peaks, translation searches, rescoring translations, packing, refinement, and generation of output PDB and MTZ files.
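A quick way to see what a job has written is to list every file whose name starts with the root. The function name list_outputs is made up for this sketch; it relies only on the naming rule described above (the root followed by an extension).

```shell
# Hypothetical helper: print each existing file named "<root>.<something>".
list_outputs() {
  for f in "$1".*; do
    # If nothing matches, the pattern is left as typed, so check the file exists.
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}

# Run in the directory where the job was started.
list_outputs AUTO_toxd1
```

If the list is empty, the job either has not run or was started from a different directory.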
Adding to the Script
Let us modify our script file a little. Use your favourite editor and, after the ENSEMBLE command, add the line:
ENSEMBLE toxd PDBFILE 1BIK_on_1D0D.pdb IDENTITY 0.377
To avoid overwriting your old files, change the file root to AUTO_toxd2. Your script job will look like this (AUTO_toxd2.com in the tutorial directory):
phaser << eof
MODE MR_AUTO
HKLIN toxd.mtz
LABIN F = FTOXD3 SIGF = SIGFTOXD3
ENSEMBLE toxd PDBFILE 1D0D_B.pdb IDENTITY 0.364
ENSEMBLE toxd PDBFILE 1BIK_on_1D0D.pdb IDENTITY 0.377
COMPOSITION PROTEIN SEQUENCE toxd.seq NUM 1
SEARCH ENSEMBLE toxd NUM 1
ROOT AUTO_toxd2
eof
What have we done? We have now specified a second PDB file to be added to our model. But why do we give it the same handle? Remember I promised to talk in more detail about the ENSEMBLE keyword? Well, here we go: what ENSEMBLE really does is take all the PDB files given under one handle and merge them into an averaged model. Neither 1D0D_B nor 1BIK is very good on its own, but we can hope that, if we put the two together into a single model, their identical features will reinforce each other and their dissimilar parts will be weighted down. This is exactly what the ENSEMBLE is for. It takes a number of PDB files and combines them, using their sequence identity (or RMS error) to compute weighting factors.
Of course, for this to make any sense the models in the PDB files have to be in the same orientation and position, that is, superimposed on each other. In the distributed files, you will find a file 1BIK.pdb, but not the file 1BIK_on_1D0D.pdb. You will have to generate this by using another program to superimpose 1BIK on 1D0D. Probably the most convenient is to use the SSM superpose option in Coot. It is VERY important to view your PDB files together in a graphics program (like O, PyMOL, xfit or Coot) before you attempt to use them in this way.
An alternative to this manual approach is to use the Ensembler program, which will superimpose the models and place them into a single merged PDB file.
In our previous example we only had one single PDB file so the ENSEMBLE keyword didn't really mean a lot.
Searching for more than one molecule
The following job illustrates a more difficult molecular replacement problem, searching for the two components of a complex between beta-lactamase and the beta-lactamase inhibitor protein (BLIP) (AUTO_beta_blip.com in the tutorial directory).
phaser << eof
MODE MR_AUTO
HKLIN beta_blip_P3221.mtz
LABIN F = Fobs SIGF = Sigma
ENSEMBLE beta PDBFILE beta.pdb IDENTITY 1.0
ENSEMBLE blip PDBFILE blip.pdb IDENTITY 1.0
COMPOSITION PROTEIN SEQUENCE beta.seq NUM 1
COMPOSITION PROTEIN SEQUENCE blip.seq NUM 1
SEARCH ENSEMBLE beta NUM 1
SEARCH ENSEMBLE blip NUM 1
ROOT AUTO_beta_blip
eof
Here we define a separate ENSEMBLE for each separate rigid body that we will be looking for, and we give separate SEARCH commands for each one we wish to look for in this job. Regardless of the order in which we specify the SEARCHes, by default Phaser will search first for the component expected to be easier to find. In this case, this is the bigger molecule, beta. Once beta has been placed, the information from the fixed beta component will be used in looking for the smaller (and harder to locate) BLIP component.
Note the NUM subkey of the SEARCH command. If we were looking for more than one copy, we could give a NUM greater than 1. Note also that, for convenience, more than one COMPOSITION command can be given. Phaser will just add up the compositions given for the separate components.
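As a sketch of the NUM subkey, suppose (hypothetically) that the asymmetric unit held two copies of the beta component rather than one. Only the NUM values for beta would change, in both the COMPOSITION and the SEARCH lines. This illustrates the syntax and does not describe the actual beta/BLIP crystal.

```shell
# Hypothetical variant of the beta/BLIP job: two copies of beta, one of blip.
# Only the lines that would differ from AUTO_beta_blip.com are shown.
cat > two_copies_fragment.txt << 'eof'
COMPOSITION PROTEIN SEQUENCE beta.seq NUM 2
COMPOSITION PROTEIN SEQUENCE blip.seq NUM 1
SEARCH ENSEMBLE beta NUM 2
SEARCH ENSEMBLE blip NUM 1
eof

cat two_copies_fragment.txt
```

Keeping the COMPOSITION and SEARCH numbers in step means the stated cell content matches what is being searched for.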