Molecular Replacement

From Phaserwiki
Revision as of 14:57, 24 October 2012 by Airlie (talk | contribs) (Building an Ensemble from Coordinates)

Phaser should be able to solve most structures with the Automated Molecular Replacement mode, and this is the first mode that you should try. Give Phaser your data (How to Define Data) and your models (How to Define Models), tell Phaser what to search for, and a list of possible spacegroups (in the same point group).

If this doesn't work (see Has Phaser Solved It?), you can try selecting peaks of lower significance in the rotation function in case the real orientation was not within the selection criteria. By default peaks above 75% of the top peak are selected (see How to Select Peaks). See What to do in Difficult Cases for more hints and tips. If the automated molecular replacement mode doesn't work even with non-default input you need to run the modes of Phaser separately. The possibilities are endless - you can even try exhaustive searches (translations of all orientations) if you want - but experience has shown that most structures that can be solved by Phaser can be solved by relatively simple strategies.

Automated Molecular Replacement

Automated Molecular Replacement combines the anisotropy correction, likelihood enhanced fast rotation function, likelihood enhanced fast translation function, packing and refinement modes for multiple search models and a set of possible spacegroups to automatically solve a structure by molecular replacement. Top solutions are output to the files FILEROOT.sol, FILEROOT.#.mtz and FILEROOT.#.pdb (where "#" refers to the sorted solution number, 1 being the best, and only 1 is output by default). Many structures can be solved by running an automated molecular replacement search with defaults, giving the ensembles that you expect to be easiest to find first.

Flow Diagram for Automated MR

Should Phaser Solve It?

The difficulty of a molecular replacement problem depends primarily on two major factors: how well the model will be able to explain the diffraction data (which depends both on the accuracy of the model and on its completeness), and how many reflections can be explained, at least in part. Each reflection provides a piece of information that helps to identify correct MR solutions.

It is possible to make a reasonable prediction of whether or not a solution will be found. If the quality of the model (its accuracy and completeness) can be estimated, then the expected contribution of each reflection to the total LLG can also be estimated. From a large battery of tests, we know that an LLG of 40 or greater usually indicates a correct solution (at least in the absence of complicating factors such as translational non-crystallographic symmetry, tNCS). Building on this understanding, if it is estimated that the LLG will be 60 or less, then Phaser will assume that the problem is a difficult one, and will implement search procedures optimised for difficult problems.

What Resolution of Data Should be Used?

The signal for a molecular replacement solution should be very clear if the expected value of the LLG is much higher than the minimum required to be fairly certain of a solution. Currently Phaser aims for a minimum LLG of 120 and, if it is possible to achieve an even higher value, given the quality of the model and the quantity of diffraction data, then the resolution for the initial search is limited to the value required to achieve an expected LLG of 120. Data to the full resolution are still used for a final rigid-body refinement, or in a second pass if a clear solution is not found in the first attempt.

However, if the model is expected to have a large RMS error (based usually on the correlation between sequence identity and RMS error), then data to high resolution will not contribute any significant signal. Regardless of the expected LLG at the highest resolution limit, the resolution used is limited to 1.8 times the estimated RMS error of the model, because this resolution limit gives about 99% of the LLG that could be achieved.

Because Phaser implements strategies designed to solve structures with as much confidence as possible, as efficiently as possible, it is best to leave the choice of resolution to Phaser, at least in the first instance.

Has Phaser Solved It?

TF Z-score Have I solved it?
less than 5 no
5 - 6 unlikely
6 - 7 possibly
7 - 8 probably
more than 8* definitely
*6 for 1st model in monoclinic space groups

Ideally, only the number of solutions you are expecting should be found. However if the signal-to-noise of your search is low, there will be noise peaks in the final selection also. Signal-to-noise is judged using the Z-score, which is computed by comparing the LLG values from the rotation or translation search with LLG values for a set of random rotations or translations. The mean and the RMS deviation from the mean are computed from the random set, then the Z-score for a search peak is defined as its LLG minus the mean, all divided by the RMS deviation, i.e. the number of standard deviations above (or below) the mean.

For a rotation function, the correct solution may be in the list with a Z-score under 4, and will not be found until a translation function is performed and picks out the correct solution.

For a translation function the correct solution will generally have a Z-score (number of standard deviations above the mean value) over 5 and be well separated from the rest of the solutions. Of course, there will always be exceptions! The table gives a very rough guide to interpreting TF Z-scores. This table will be updated, as we learn more from systematic molecular replacement trials.

You should always at least glance through the summary of the logfile. One thing to look for, in particular, is whether any translation solutions with a high Z-score have been rejected by the packing step. By default up to 10 clashes are allowed. Such a solution may be correct, and the clashes may arise only because of differences in small surface loops. If this happens, repeat the run allowing a suitable number of clashes. Note that, unless there is specific evidence in the logfile that a high TF-function Z-score solution is being rejected with a few clashes, it is much better to edit the model to remove the loops than to increase the number of allowed clashes. Packing criteria are a very powerful constraint on the translation function, and increasing the number of allowed clashes beyond the default will increase the search time enormously without the possibility of generating any correct solutions that would not have otherwise been found.


Annotation

A highly compact summary of the history of a solution is given in the annotation of a solution in the .sol file. This is a good place to start your analysis of the output. The annotation gives the Z-score of the solution at each rotation and translation function, the number of clashes in the packing, and the refined LLG.

Annotation Meaning
RFZ= Rotation Function Z-score
TFZ= Translation Function Z-score
PAK= Number of packing clashes
LLG= LLG after refinement (resolution may change)
TFZ== Translation Function Z-score equivalent
RF++ Rotation angle from previous Rotation Function has been used in the addition of next solution
RF*0 Rotation angle 000 identified by low R-factor of input model
TFZ=* First molecule in P1 (arbitrary origin, no Translation Function required)
TF*0 Translation vector 000 identified by low R-factor of input model
(& ... & ...) Set of TFZ PAK and LLG values for placements that were amalgamated (more than one placement from a single Translation Function)
LLG+=(... & ...)  Set of LLG values calculated during amalgamation, which will always be increasing in value
+TNCS Components added by Translational NCS relation

Two versions of TFZ (the translation function Z-score) now appear for each component. The first ("TFZ=") is the Z-score from the actual translation search, which depends on the accuracy of the orientation used for that search. The second ("TFZ==") is the TFZ-equivalent, which indicates what the TFZ score would have been with the correct (refined) orientation. You should see the TFZ-equivalent is high at least for the final components of the solution, and that the LLG (log-likelihood gain) increases as each component of the solution is added. For example, in the case of beta-blip the annotation for the single solution output in the .sol file shows these features

 SOLU SET RFZ=10.7 TFZ=24.3 PAK=0 LLG=472 TFZ==24.7 RFZ=6.4 TFZ=24.4 PAK=0 LLG=1006 TFZ==29.7 LLG=1006
 SOLU 6DIM ENSE beta EULER 200.849 41.269 183.909 FRAC -0.49604 -0.15830 -0.28092 BFAC 0.00000
 SOLU 6DIM ENSE blip EULER 43.749 80.793 117.292 FRAC -0.12289 0.29435 -0.09266 BFAC 0.00000

What to do in Difficult Cases

Not every structure can be solved by molecular replacement, but the right strategy can push the limits. What to do when the default jobs fail depends on why your structure is difficult.

  • Flexible Structure
    The relative orientations of the domains may be different in your crystal than in the model. If that may be the case, break the model into separate PDB files containing rigid-body units, enter these as separate ensembles, and search for them separately. If you find a convincing solution for one domain, but fail to find a solution for the next domain, you can take advantage of the knowledge that its orientation is likely to be similar to that of the first domain. The ROTAte AROUnd option of the brute rotation search can be used to restrict the search to orientations within, say, 30 degrees of that of the known domain. Allow for close approach of the domains by increasing the allowed clashes with the PACK keyword by, say, 1 for each domain break that you introduce. Alternatively, you could try generating a series of models perturbed by normal modes, with the NMAPdb keyword. One of these may duplicate the hinge motion and provide a good single model.
  • Poor or Incomplete Model
    Signal-to-noise is reduced by coordinate errors or incompleteness of the model. Since the rotation search has lower signal to begin with than the translation search, it is usually more severely affected. For this reason, it can be very useful to use the subsequent translation search as a way to choose among many (say 1000) orientations. THe MR_AUTO FAST search mode automatically reduces the cutoff for accepting peaks from the fast rotation function if the decault pass does not find a solution with a high z-score, but you can manually reduce this further with the PEAKS and PURGE keywords. You can also try turning off the clustering of fast rotation function peaks because the correct orientation may sit on the shoulder of a peak in the rotation function.
    As shown convincingly by Schwarzenbacher et al. (Schwarzenbacher, Godzik, Grzechnik & Jaroszewski, Acta Cryst. D60, 1229-1236, 2004), judicious editing can make a significant difference in the quality of a distant model. In a number of tests with their data on models below 30% sequence identity, we have found that Phaser works best with a "mixed model" (non-identical sidechains longer than Ser replaced by Ser). In agreement with their results, the best models are generally derived using more sophisticated alignment protocols, such as their FFAS protocol. Use phenix.sculptor to edit your model.
  • High Degree of Non-crystallographic Symmetry
    If there are clear peaks in the self-rotation function, you can expect orientations to be related by this known NCS. Methods to automatically use such information will be implemented in a future version of Phaser. In the meantime, you can work out for yourself the orientations that would be consistent with NCS and use the ROTAte AROUnd option to sample similar orientations. Alternatively, you may have an oligomeric model and expect similar NCS in the crystal. First search with the oligomeric model; if this fails, search with a monomer. If that succeeds, you can again use the ROTAte AROUnd option to force a subsequent monomer to adopt an orientation similar to the one you expect.
  • What not to do
    The automated mode of Phaser is fast when Phaser finds a high Z-score solution to your problem. When Phaser cannot find a solution with a significant Z-score, it "thrashes", meaning it maintains a list of 100-1000's of low Z-score potential solutions and tries to improve them. This can lead to exceptionally long Phaser runs (over a week of CPU time). Such runs are possible because the highly automated script allows many consecutive MR jobs to be run without you having to manually set 100-1000's of jobs running and keep track of the results. "Thrashing" generally does not produce a solution: solutions generally appear relatively quickly or not at all. It is more useful to go back and analyse your models and your data to see where improvements can be made. Your system manager will appreciate you terminating these jobs.
    It is also not a good idea to effectively remove the packing test. Unless there is specific evidence in the logfile that a high TF-function Z-score solution is being rejected with a few clashes, it is much better to edit the model to remove the loops than to increase the number of allowed clashes. Packing criteria are a very powerful constraint on the translation function, and increasing the number of allowed clashes beyond a few (e.g. 1-5) will increase the search time enormously without the possibility of generating any correct solutions that would not have otherwise been found.
  • Other suggestions
    Phaser has powerful input, output and scripting facilities that allow a large number of possibilities for altering default behaviour and forcing Phaser to do what you think it should. However, you will need to read the information in the manual below to take advantage of these facilities!

How to Define Data

You need to tell Phaser the name of the mtz file containing your data and the columns in the mtz file to be used using the HKLIn and LABIn keywords. Additional keywords (BINS CELL OUTLier RESOlution SPACegroup) define how the data are used.

How to Define Models

Phaser must be given the models that it will use for molecular replacement. A model in Phaser is referred to as an "ensemble", even when it is described by a single file. This is because it is possible to provide a set of aligned structures as an ensemble, from which a statistically-weighted averaged model is calculated. A molecular replacement model is provided either as one or more aligned pdb files, or as an electron density map, entered as structure factors in an mtz file. Each ensemble is treated as a separate type of rigid body to be placed in the molecular replacement solution. An ensemble should only be defined once, even if there are several copies of the molecule in the asymmetric unit.

Fundamental to the way in which Phaser uses MR models (either from coordinates or maps) is to estimate how the accuracy of the model falls off as a function of resolution, represented by the Sigma(A) curve. To generate the Sigma(A) curve, Phaser needs to know the RMS coordinate error expected for the model and the fraction of the scattering power in the asymmetric unit that this model contributes.

A Babinet-style correction is used to account for the effects of disordered solvent on the completeness of the model at low resolution.

Molecular replacement models are defined with the ENSEmble keyword and the COMPosition keyword. The ENSEmble keyword gives (amongst other things) the RMS deviation for the Sigma(A) curve. The COMPosition keyword is used to deduce the fraction of the scattering power in the asymmetric unit that each ensemble contributes. The composition of the asymmetric unit is defined either by entering the molecular weights or sequences of the components in the asymmetric unit, and giving the number of copies of each. Expert users can also enter the fraction of the scattering of each component directly, although the composition must still be entered for the absolute scale calculation. Please note that the composition supplied to Phaser has to include everything in the asymmetric unit, not just what is being looked for in the current search!

Building an Ensemble from Coordinates

The RMS deviation is determined directly from RMS or indirectly from IDENtity in the ENSEmble keyword using a formula that depends on the sequence identity and the number of residues

Initial estimate of RMS deviation
Number of residues versus sequence identity
#50 #100 #200 #300 #400 #600 #850 #1000 #1500 #2000
ID=0% 1.600 1.673 1.821 1.968 2.116 2.411 2.780 3.001 3.739 4.476
ID=10% 1.374 1.437 1.564 1.691 1.817 2.071 2.387 2.577 3.211 3.844
ID=20% 1.180 1.234 1.343 1.452 1.561 1.778 2.050 2.214 2.758 3.302
ID=30% 1.013 1.060 1.153 1.247 1.340 1.527 1.761 1.901 2.368 2.835
ID=40% 0.870 0.910 0.991 1.071 1.151 1.312 1.512 1.633 2.034 2.435
ID=50% 0.750 0.782 0.851 0.920 0.989 1.126 1.299 1.402 1.747 2.091
ID=60% 0.750 0.750 0.750 0.790 0.849 0.967 1.115 1.204 1.500 1.796
ID=70% 0.750 0.750 0.750 0.750 0.750 0.831 0.958 1.034 1.288 1.543
ID=80% 0.750 0.750 0.750 0.750 0.750 0.750 0.823 0.888 1.106 1.325
ID=90% 0.750 0.750 0.750 0.750 0.750 0.750 0.750 0.763 0.950 1.138
ID=100% 0.750 0.750 0.750 0.750 0.750 0.750 0.750 0.750 0.816 0.977

The RMS deviation estimated from ID may be an underestimate of the true value if there is a slight conformational change between the model and target structures. To find a solution in these cases it may be necessary to increase the RMS from the default value generated from the ID, by say 0.5 Ångstroms. On the other hand, when Phaser succeeds in solving a structure from a model with sequence identity much below 30%, it is often found that the fold is preserved better than the average for that level of sequence identity. So it may be worth submitting a run in which the RMS error is set at, say, 1.5, even if the sequence identity is low. The table below can be used as a guide as to the default RMS value corresponding to ID.

If you construct a model by homology modelling, remember that the RMS error you expect is essentially the error you expect from the template structure (if not worse!). So specify the sequence identity of the template, not of the homology model.

Only the model with the highest sequence identity is reported in the output pdb file. Also, HETATM cards in the input pdb file are ignored in the calculation of the structure factors for the ensemble, but are carried through to the output pdb file. Thus, the phases on the output mtz file (which come from the structure factors of the ensemble) do not correspond to those that would be calculated from the output pdb file, when there is more than one pdb file in an ensemble and/or the pdbfile(s) have HETATM records.

Coordinate Editing

HETATM/LIGANDS

Phaser ignores the scattering from HETATM records. The HETATM records are carried though to output with occupancy set to zero. Ligands will therefore not contribute to the scattering used for molecular replacement. The exceptions to this rule are the HETATM records for MSE (seleno-methionine) MSO (seleno-methionine selenoxide) CSE (seleno-cysteine) CSO (seleno-cysteine selenoxide) ALY (acetyllysine) MLY (n-dimethyl-lysine) and MLZ (n-methyl-lysine) which are used in the scattering and carried through to output with their original occupancy. If you wish to include any HETATM records in the scattering the record name use the keyword ENSE modlid HETATOM ON

WATER

Water molecules (identified by the residue name OW WAT HOH H2O OH2 MOH WTR or TIP) are deleted from the pdb file on input, are not used in the scattering and are not carried through to file output. If you want to retain water molecules you will need to change the residue name to something other than this (e.g. WWW) so that the atoms are not identified as water. To include the water molecules in the scattering, the HETATM records will also have to be changed to ATOM records as described above.

Building an Ensemble from Electron Density

When using density as a model, it is necessary to specify both the extent (x,y,z limits) of the cut-out region of density, and the centre of this region. With coordinates, Phaser can work this out by itself. This information is needed, for instance, to decide how large rotational steps can be in the rotation search and to carry out the molecular transform interpolation correctly. In the case of electron density, the RMS value does not have the same physical meaning that it has when the model is specified by atomic coordinates, but it is used to judge how the accuracy of the calculated structure factors drops off with resolution. A suitable value for RMS can be obtained, in the case of density from an experimentally-phased map, by choosing a value that makes the SigmaA curve fall off with resolution similarly to the mean figures-of-merit. In the case of density from an EM image reconstruction, the RMS value should make the SigmaA curve fall off similarly to a Fourier correlation curve used to judge the resolution of the EM image.

For detailed information, including a tutorial with example scripts, see Using density as a model

How to Define Composition

The composition defines the total amount of protein and nucleic acid that you have in the asymmetric unit not the fraction of the asymmetric unit that you are searching for.

Default Composition

For convenience, the composition defaults to 50% protein scattering by volume (the average for protein crystals). It is better to enter it explicitly, even if only to check that you have correctly deduced the probable content of your crystal. If your crystal has higher or lower solvent content than this, or contains nucleic acid, then the composition should be entered explicitly.

Composition by Solvent Content

Scattering is determined from the solvent content of the crystal, assuming that the crystal contains protein only, and the average distribution of amino acids in protein. If your crystal contains nucleic acid or your protein has an unusual amino acid distribution then the composition should be entered explicitly using the MW or sequence options.

Composition by Number of Residues in ASU

Scattering is determined from the number of residues in the asymmetric unit, assuming that the crystal contains protein only or nucleic acid only, and assuming an average distribution of residues for either. If your crystal contains a mixture then the composition should be entered explicitly using the MW or sequence options. If your crystal has an unusual residue distribution then the composition should be entered explicitly using the sequence options.

Composition by Molecular Weight

The composition is calculated from the molecular weight of the protein and nucleic acid assuming the protein and nucleic acid have the average distribution of amino acids and bases. If your protein or nucleic acid has an unusual amino acid or base distribution the composition should be entered by sequence. You can mix compositions entered by molecular weight with those entered by sequence.

Composition by Sequence

The composition is calculated from the amino acid sequence of the protein and the base sequence of the nucleic acid in fasta format. You can mix compositions entered by molecular weight with those entered by sequence. Individual atoms can be added to the composition with the COMPOSITION ATOM keyword. This allows the explicit addition of heavy atoms in the structure e.g. Fe atoms.

Composition by Percentage Scattering

The fraction scattering of each ensemble can be entered directly. The fraction scattering of each ensemble is normally automatically worked out from the average scattering from each ensemble (calculated from the pdb files if entered as coordinates, or from the protein and nucleic acid molecular weights if entered as a map) divided by the total scattering given by the composition, but entering the fraction scattering directly overrides this calculation. This option is for use when the pdb files of the models in the ensemble are unusual e.g. consist only of C-alpha atoms, or only of hydrogen atoms (as in the CLOUDS method for NMR).

How to Define Solutions

Phaser writes out files ending in ".sol" and ".rlist" that contain the solution information from the job. The root of the files is given by the ROOT keyword. By default, the root filename is PHASER. These files can be read back into subsequent runs of Phaser to build up solutions containing more than one molecule in the asymmetric unit.

"PHASER.sol" files are generated by all modes (rotation function modes with VERBOSE output), and contain the current idea of potential molecular replacement solutions.

"PHASER.rlist" files are generated by the rotation function modes, and are used as input for performing translation functions. (They are also produced by degenerate (2D) translation functions, for performing a translation function to find the third dimension)

For simple MR cases you don't really need to know how to define molecular replacement solutions. However, for difficult cases you might need to edit the files "PHASER.sol" and "PHASER.rlist" files manually

"sol" Files

At different stages of molecular replacement, an Ensemble will be oriented but not positioned (after the rotation search), or oriented and positioned (after the translation search), or, rarely, oriented and the position in 2 of 3 dimensions known. These three states correspond to solutions defined by the keywords SOLUtion 3DIM, SOLUtion 6DIM, and SOLUtion 5DIM. Each Ensemble in the asymmetric unit has its own SOLUtion keyword. Solutions of the type 3DIM are given by the rotation function, solutions of the type 6DIM are given by the translation function, and solutions of the type 5DIM are given by the degenerate translation function.

When more than one (potential) molecular replacement solution is present, the solutions are separated with the SOLUTION SET keywords.

"rlist" Files

These files define a rotation function list. The peak list is given with a series of SOLUtion TRIAl keywords.

If a partial solution is already known, then the information for the currently "known" parts of the asymmetric unit is given in the form used for the PHASER.sol file, followed by the list of trial orientations for which a translation function is to be performed.

If a degenerate translation function is performed, then a SOLUtion TRIAl line is produced with the degenerate translation information present, ready for performing the translation function on the third dimension.

Fixed partial structure

If you have the coordinates of a partial solution with the pdb coordinates of the known structure in the correct orientation and position, then you can force Phaser to use these coordinates. Use the SOLUTION keyword to fix a rotation of 0 0 0 and a position of 0 0 0 for these coordinates.

How to Select Peaks

The selection of peaks saved for output in the rotation and translation functions can be done in four different ways.

  • Select by Percentage
    Percentage of the top peak, where the value of the top peak is defined as 100% and the value of the mean is defined as 0%.
    Default, cutoff=75%. This criteria has the advantange that at least one peak (the top peak) always survives the selection. If the top solution is clear, then only the one solution will be output, but if the distribution of peaks is rather flat, then many peaks will be output for testing in the next part of the MR procedure (e.g. many peaks selected from the rotation function for testing with a translation function).
  • Select by Z-score
    Number of standard deviations (sigmas) over the mean (the Z-score).
    Absolute significance test. Not all searches will produce output if the cutoff value is too high (e.g. 5 sigma).
  • Select by Number
    Number of top peaks to select.
    If the distribution is very flat then it might be better to select a fixed large number (e.g. 1000) of top rotation peaks for testing in the translation function.
  • No selection
    All peaks are selected.
    Enables full 6 dimensional searches, where all the solutions from the rotation function are output for testing in the translation function. This should never be necessary; it would be much faster and probably just as likely to work if the top 1000 peaks were used in this way.
Selection criteria

Peaks can also be clustered or not clustered prior to selection in steps 1 and 2.

  • Clustering Off
All high peaks on the search grid are selected
  • Clustering On
Points on the search grid with higher neighbouring points are removed from the selection


Clustering

How to Control Output

The output of Phaser can be controlled with optional keywords.

The ROOT keyword is not compulsory (the default root filename is "PHASER"), but should always be given, so that your jobs have separate and meaningful output filenames.

The TOPFiles keyword controls the number of potential MR solutions for which PDB and (in the appropriate modes) MTZ files are produced.

For the MR_AUTO, MR_RNP and MR_LLG modes, unless HKLOut OFF is given as an optional keyword, Phaser produces an MTZ file with "SigmaA" type weighted Fourier map coefficients for producing electron density maps for rebuilding.

MTZ Column Labels Description
FWT/PHWT Amplitude and phase for 2m|Fobs|-D|Fcalc| exp(iαcalc) map
DELFWT/PHDELWT Amplitude and phase for m|Fobs|-D|Fcalc| exp(iαcalc) map
FOM m, analogous to the "Sim" weight, to estimate the reliability of αcalc
HLA/HLB/HLC/HLD Hendrickson-Lattman coefficients encoding the phase probability distribution