ctprotein module¶
This is where the ctprotein shinanigans occurs. Maybe we reference [BiDB79] yeah
CTProtein is the main protein trajectory class in CAMPARITRAJ

class
camparitraj.ctprotein.
CTProtein
(traj, residue_offset)[source]¶ CTProtein objects are initialized with a trajectory subset that contains only the atoms a specific, single protein. This means that a CTProtein object allows operations to performed on a single protein. A single trajectory may have multiple proteins in it. indexing with a protein object assumes that the protein is indexed from 0 to n, n is the number of residues.
This is an important idea to emphasize  it means that you (the user) will need to determine the correct residue index for a region of interest being examined. The region will NOT (necessarily) correspond to the residue index in the PDB file used.
To make this easier the function CTProtein.print_residues() will print the mapping of residue index to residue name and residue number.
To reiterate:
Residue number is the number of the residue in the PDB file.
Residue index is the index value associated with a residue in a specific protein and will always begin from 0  note this will include the peptide caps (ACE/NME) if present.

__init__
(traj, residue_offset)[source]¶ Initialize a CTProtein object instance using trajectory information, and information about offsets.
 Parameters
traj (cttrajectory.CTTrajectory) – An instance of a system’s trajectory populated via cttrajectory.CTTrajectory.
residue_offset (int) –
Similar to atom_offset, this is the residue index which will serve as the marker for residue index 0.
The residue offset operation is performed by the __get_offset_residue fuction. For ANY function where a specific residue is supplied there must also be the option to perform this offset or not (i.e. each public facing function should be able to stand alone and not rely on an offset being performed by another function) BUT the option to perform the offseting provided so it in functions that call other functions which can perform the offset it will only need to be performed once.

__check_contains_CA
(R1, R1_org=None, has_correctOffset=True)¶ Function which checks if residue R1 (which is the residue AFTER an offset correction has been applied) contains a Calpha atom.
Returns None or raises a CTException.
 Parameters
R1 (int) – The zeroindexed residue index (int) whose index is checked and validated.
R1_org (int or None {None}) –
 Raises
CTException – When the CTProtein has an uncorrected offset and residue R1 lacks a Calpha atom. Or, when the offset residue R1_org which maps to R1 lacks a Calpha. And, when the offset converted residue lacks a Calpha atom.

__check_single_residue
(R1)¶ Internal function that checks that a single residue provided makes sense in the context of this protein. NOTE this checks BEFORE an offset is applied (i.e. we assume once a residue offset has been made that the resid’s validity has been established).
Returns None or raises a CTException.
 Parameters
R1 (int) – The zeroindexed residue index (int) whose index is checked and validated.
 Raises
CTException – When the residue ID is greater than the chain length, or when the distances explored are greater than the chain size.

__check_stride
(stride)¶ Checks that a passed stride value doesn’t break everything. Returns None or raises a CTException.
 Parameters
stride (int) – The nonzero number of steps to perform while iterating across a trajectory.
 Raises
CTException – When the stride is larger than the number of available frames in the trajectory, or less than 1.

__check_weights
(weights, stride=None)¶ Function that checks a passed weightsarray is usable and matches the number of frames (avoids a lot of heartache when something breaks deep inside the code).
NOTE: This also typecasts weights to a numpy.array which allows them to be indexed directly using a list of values.
 Parameters
weights (array_like) – An numpy.array object that corresponds to the number of frames within an input trajectory.
stride ({None}, optional) – The stepsize used when iterating across the frames. A value of None will set the step size to 0.
 Returns
An np.array object containing trajectory frames selected per stride number of frames.
 Return type
numpy.array

__get_first_and_last
(R1, R2, withCA=False)¶ Function which returns first and last residue for a range, correcting for the residue offset problem. The returned tup
 Parameters
R1 (int or False/None {False}) – First residue in range  can be an integer (assumes first residue in chain indexed at 0). If False assume we start at 0.
R2 (int or False/None {False}) – Last residue in range  can be an integer (assumes first residue in chain indexed at 0). If False assumes we’re using the whole chain.
withCA (bool {False}) – Flag which, if True and R1 or R2 are False, selects R1/R2 values that contain a CA, which basically means caps are dealt with here if present.
 Returns
Returns a tuple with three positions:
[0] = R1 with relevant offsets applid (int).
[1] = R2 with relevant offsets applid (int).
[2] = String that can be passed directly to topology select to extract the atoms associated with these positions.
 Return type
tuple

__get_resid_with_CA
()¶ Internal function which should only be needed during initialization. Defines the list of residues where CA atoms are present, and the list of zeroindexed residue indices which contain CA atoms.
In the case that the first resid in the self.topology object is 0 then these two lists are the same (which for MDTraj 1.9.5 or higher should always be the same) but for systems with multiple protein chains in MDTraj 1.9.4 or lower the residuesWithCA and idxWithCA willn differe for the second chain and onwards.
This list is then assigned to the property variable self.resid_with_CA and self.idx_with_CA.
Note this is quite slow for large trajectories
 Returns
A 2tuple that is comprised of lists:
[0] := The list of residue indices which contain Calpha atoms selected from the topology.
[1] := The list of zeroindexed residue indices which contain Calpha atoms.
 Return type
tuple
See also

__get_selection_atoms
(region=None, backbone=True, heavy=False, correctOffset=True)¶ Function which returns a list of atoms associated with the residues defined by the region keyword. If no region is supplied this returns the entire region (NME/ACE caps included).
 Parameters
region (np.array, list, or tuple {None}) – An array_like object of size 2 which defines the first and last residue (INCLUSIVE) for a region to be examined.
backbone (bool {True}) – Boolean flag to determine if only the backbone atoms should be returned, or if all the full chain’s atoms should be included (i.e. including sidechain).
heavy (bool {False}) – Boolean flag to determine if we should only select heavy atoms or not (i.e. not H).
correctOffset (bool {True}) – Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again
 Returns
A numpy.array comprised of atom indices corresponding to the residues in a given region.
 Return type
selectionatoms
 Raises
CTException – When the input region is larger than 2.

__get_subtrajectory
(traj, stride)¶ Internal function which returns a subtrajectory. Expects traj to be an mdtraj trajectory object and stride to be an int.
 Parameters
traj (mdtraj.Trajectory) – An instance of an mdtraj.Trajectory which is nonempty  i.e. contains at least 1 frame.
stride (int) – The nonzero number of steps to perform while iterating across the input trajectory, traj.
 Returns
A sliced trajectory which contains the frames selected every stride step from the input trajectory, traj.
 Return type
mdtraj.Trajectory

__phi_psi_bbseg
(phi_vector, psi_vector)¶ Internal function that takes two equallymatched phi and psi angle vectors and based on the pairwise combination classified each pair of elements using the BBSEG2 definition. Definition was generated from the BBSEG2 file distributed with CAMPARI, and is encoded and stored in the _internal_data module.
NOTE that because this is an internal function we do not double check that the phi_vector and psi_vectors are of the same length, but this is critical, so if this function is being called make sure this is true!
 Parameters
phi_vector (iterable (list or numpy vector)) – ordered list of phi angles for a specific residue
psi_vector (iterable (list or numpy vector)) – ordered list of psu angles for a specific residue
 Returns
classes – A list of length equal to phi_vector and psi_vector that classifies each pair of phi/psi angles using the BBSEG2 definition.
 Return type
list

__residue_atom_lookup
(resid, atomname=None)¶ Memoisation function to lookup the atomic index of a specific residues atom. Originally I’d assumed the underlying MDTraj topology.select() operation was basically a lookup, BUT it turns out it’s actually really expensive, so this method converts atom/residue lookup information into a dynamic O(1) operation, greatly improving the performance of a number of different methods in the processes.
 Parameters
resid (int) – The residue index to lookup. This index must correspond to the range derived from the corrected offset. If the residue has not been cached, it will be added to the lookup table for later reuse.
atomname (str or None {None}) – The name of the atom to lookup which will return the corresponding residue ID. Like the previous parameter, if that residue does not exist in the lookup table it will be added for later reuse.
 Returns
A list containing all the atoms corresponding to a given residue id that match the input residue id (resid) or, the residue corresponding to the atom name (atomname).
 Return type
list

calculate_all_CA_distances
(residueIndex, mode='CA', onlyCterminalResidues=True, correctOffset=True, stride=1)[source]¶ Calculate the full set of distances between Calpha atoms. Note that by default this explicitly works in a way to avoid computing redundancy where we ONLY compute distances between residue i and residues greater than i up to the final residue. This behaviour is defined by the onlyCterminalResidues flag.
Distance is returned in Angstroms.
Can be fed a mode ‘COM’ keyword to calcule center of mass distances instead of CA distances.
 Parameters
residueIndex (int) – Defines the residue index to select the CA from.
mode (str {'CA'}) – String, must be one of either ‘CA’ or ‘COM’.  ‘CA’ = alpha carbon.  ‘COM’ = center of mass (associated withe the residue).
onlyCterminalResidues (bool {True}) – This variable means that only residues Cterminal of the residueIndex value will be considered. This is useful when performing an ALL vs. ALL matrix as it ensures that only the upper triangle is calculated if we iterate over all residues, but may not be deseriable in other contexts.
correctOffset (bool {True}) – Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
stride (int {1}) – Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame.
 Returns
Array containing the endtoend distance measures based on the input mode.
 Return type
numpy.array
 Raises
CTException – If the input mode is nether ‘CA’ or ‘COM’.

property
ccap
¶ Flag that returns if a Cterminal capping residue is present (or not).
 Returns
True if Cteminal cap is present, False if not
 Return type
bool

get_CA_index
(residueIndex, correctOffset=True)[source]¶ Get the CA atom index for the residue defined by residueIndex. Again does this via memoization  i.e. the first time a specific residue is requested the function looks up the information and then stores it locally in case its needed again.
Defensivly checks for errors.
 Parameters
residueIndex (int) – Defines the residue index to select the CA from.
correctOffset (bool {True}) – Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
 Returns
A list of size 1 containing the CA atom index for the residue index, residueIndex.
 Return type
list
 Raises
CTException – When the number of CA atoms do not equal 1.

get_D_vector
(stride=20, verbose=True)[source]¶ Function to calculate the vector of D values used to calculate the Phi parameter from Lyle et al[1].
The stride parameter defines the spacing between frames which are analyzed. This is just for practical purposes.
The Phi calulation computes a D value for each frame vs. frame comparison  for a 2000 frame simulation this would be 4 Million D values if every value was calculated which is a bit much, so the stride let’s you define how many frames you should skip.
Importantly, the DVector calculated here measures dij (see part III A of the paper) as the CACA distance and NOT the average interatomic distance. This has two effects:
Heterogeneity is now, formally, a measure over backbone heterogeneity and not full protein heterogeneity  this may be desirable (arguably it’s a more interpratable measure of conformational change) but if the interatomic version is required this could be implemented.
It is much more efficient than the original version
For a 2000 frame trajectory of a 80 residue protein with a stride=20 allows the calculation to take about 5 seconds. However, as protein size increases the computational cost of this process grows rapidly.
 Parameters
stride (int {20}) – Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame
verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!
 Returns
Returns a numpy array of D values (i.e. the D_vector)
 Return type
np.ndarray
References
[1] Lyle, N., Das, R. K., & Pappu, R. V. (2013). A quantitative measure for protein conformational heterogeneity. The Journal of Chemical Physics, 139(12), 121907.

get_Q
(protein_average=True, region=None, beta_const=50.0, lambda_const=1.8, native_contact_threshold=4.5, correctOffset=True, stride=1, weights=False)[source]¶ Function which will calculate the fraction of native contacts in each frame of the trajectory, where the ‘native’ state is defined as a specific frame (1st frame by default  note this means the native state frame = 0 as we index from 0!). In earlier versions the ‘native state frame’ was a variable, but this ends up being extremely messy when weights are considered, so assume the native state frame is always frame 0.
Native contacts are defined using the definition from Best, Hummer, and Eaton, “Native contacts determine protein folding mechanisms in atomistic simulations” PNAS (2013) 10.1073/pnas.1311599110. The implementation is according to the code helpfully provided at http://mdtraj.org/latest/examples/nativecontact.html
 Parameters
protein_average (bool {True}) – This is the default mode, and means that the return vector is the AVERAGE fraction of native contacts over the protein surface for each frame (e.g. each value refers to a single frame). If set to false the simulationaverage value at nativecontact resolution is returned, instead returning $NATIVE_CONTACT number of values and an additional list of native contact pairs.
region (list/tuple of length 2 {None}) – Defines the first and last residue (INCLUSIVE) for a region to be examined. By default is set to None which means the entire protein is used
beta_const (float {50}) – Constant used for computing Q in reciprocal nanometers. Default is 50 and probably should not be changed without good reason.
lambda_const (float {1.8}) – Constant value is 1.8 for allatom simulations. Probably should not be changed without good reason
native_contact_threshold (float {4.5}) – Threshold in Angstroms typically used for allatom simulations and again probably should not be changed without good reason
correctOffset (bool {True}) – Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again
stride (int {1}) – Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame.
weights (list or array of floats {False}) – Defines the framespecific weights if reweighted analysis is required. This can be useful if an ensemble has been reweighted to better match experimental data, or in the case of analysing replica exchange data that is recombined using TWHAM.
 Returns
If protein_average=True a single vector is returned with the overall protein average fraction of
native contacts associated with each frame for each residue. If protein_average is set to False a
4position tuple is returned, where each of the four positions has the following identity
idx 
0  The fraction of the time a contact is native for all each native contact (vector of length – $N_NATIVE CONTACTS)
1  The native contact definition (same length as 1) where each element is a pair of atoms which – are considered native
2  The residuebyresidue dictionarynative contacts dictionary. Keys are residue namenumber – and each keyassociated value is the fractional native contacts for atoms associated with that residue. To get the residuespecific fraction of native contacts take the mean of the element values
3  The ordered list of keys from 2 for easy plotting in a residueresidue manner
4  A nres x nres array showing a 2D contact map defining interresidue specific Q values

get_RMSD
(frame1, frame2= 1, region=None, backbone=True, correctOffset=True, stride=1)[source]¶ Function which will calculate the aligned RMSD between two frames, or between one frame and all frames in the trajectory. This can be done over the entire protein but we can also specificy a local region to perform this analysis over.
Units are Angstroms.
 frame1int
Defines the frame to be used as a reference
 frame2int {1}
Defines the frame to be used as a comparison, OR if left blank or set to 1 means the entire trajectory
 regionlist/tuple of length 2 {None}
Defines the first and last residue (INCLUSIVE) for a region to be examined. By default is set to None which means the entire protein is used
 backbonebool {True}
Boolean flag for using either the full chain or just backbone. Generally backbone alone is fine so default to True.
 correctOffsetbool {True}
Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again
 strideint {1}
Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame
 Returns
Returns a numpy array of either 1 (if two frames are passed) or nframes length which corresponds to the RMSD difference either between two frames or between one frame and ALL other frames
 Return type
np.ndarray

get_all_SASA
(probe_radius=0.14, mode='residue', stride=20)[source]¶ Returns the Solvent Accessible Surface Area (SASA) for each residue from every strideth frame. SASA is determined using shrake_rupley algorithm.
SASA is returned in Angstroms squared, BUT PROBE RADIUS is in nanometers!
probe_radius [float] {0.14} Radius of the solvent probe used in nm. Uses the GoldenSpiral algorithm. 0.14 nm is pretty standard. NOTE  the probe radius must be in nanometers
mode [string] {‘residue’,’atom’,’sidechain’,’backbone’, ‘all’} Defines the mode used to compute the SASA. Must be one of ‘residue’ or ‘atom’. For atom mode, extracted areas are resolved peratom. For ‘residue’, this is computed instead on the per residue basis.
stride [int] {20} Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame

get_amino_acid_sequence
(oneletter=False, numbered=True)[source]¶ Returns the protein’s amino acid sequence.
 Parameters
oneletter (bool {False}) – If True returns a single sequence of one letter amino acid codes. If False get a list of 3 letter codes with residue number separated by a ‘‘ character.
numbered (bool {True}) – If True the return value is a list of RESNAMERESID strings, if False return value is a list of RESNAME in the correct order.
 Returns
A list comprised of the 1letter or 3letter names of the amino acid sequence.
 Return type
list

get_angle_decay
(atom1='C', atom2='N', return_full_matrix=False)[source]¶ Returns the a 4 by n numpy array in which column 1 gives residue number, column 2 is local helicity,
No checking of atom1 and atom2…
 Parameters
atom1 (str {C}) – The first atom to use when calculating the angle decay.
atom2 (str {N}) – The second atom to use when calculating the angle decay.
return_full_matrix (bool {False}) – Whether or not to return the full matrix along with the angle decay calculation.
 Returns
If array_like, the matrix returned is comprised of only the angle decay. If a 2tuple, both the angle decay matrix (index 0) and the full matrix is returned (index 1).
 Return type
array_like, or 2tuple

get_asphericity
(R1=None, R2=None, correctOffset=True, verbose=True)[source]¶ Returns the asphericity associated with the region defined by the intervening stretch of residues between R1 and R2. This can be a somewhat slow operation, so a status message is printed for the impatient biophysicisit.
Asphericity is defined in many places  for my personal favourite explanation and definition see Page 65 of Andreas Vitalis’ thesis (Probing the Early Stages of Polyglutamine Aggregation with Computational Methods, 2009, Washington University in St. Louis).
R1 [int] {None} Index value for first residue in the region of interest. If not provided (False) then first residue is used.
R1 [int] {None} Index value for last residue in the region of interest. If not provided (False) then last residue is used.
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
 verbosebool
Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!

get_clusters
(region=None, n_clusters=10, backbone=True, correctOffset=True, stride=20)[source]¶ Function to determine the structural clusters associated with a trajectory. This can be useful for identifying the most populated clusters. This approach uses Ward’s hiearchical clustering, which means we must define the number of clusters we want apriori. Clustering is done using RMSD  BUT the approach taken here would be easy to reimplement in another function where you ‘simiarity’ metric was something else.
Returns a 4place tuple with the following subelements:
[0]  cluster_members: A list of length n_clusters where each element corresponds to the number of frames in each of the 1n_clusters cluster. i.e. if I had defined n_clusters=3 this would be a list of length 3
[1]  cluster_trajectories: A list of n_cluster mdtraj trajectory objects of the conformations in the cluster. This is particularly useful because it means you can perform any arbitrary analaysis on the cluster members
[2]  cluster distance_matrices: A list of n_clusters where each member is a square matrix that defines the structural distance between each of the members of each cluster. in other words, this quantifies how similar (in terms of RMSD, in units Angstroms).
[3]  cluster_centroids A list of n_clusters where each element is the index associated with each cluster trajectory that defines the cluster centroid (i.e. a ‘representative’ conformation). As an example  if we had 3 clusters this might look like [4,1,6], which means the 4th, 1st, and 6th frame from each of the respective mdtraj trajectories in the cluster_trajectories list would correspond to the centroid.
[4]  cluster frames: List of lists, where each sublist contains the frame indices associated with that cluster. Useful if clustering on a single chain and want to use that information over an entire trajectory
region [list/tuple of length 2] {[]} Defines the first and last residue (INCLUSIVE) for a region to be examined
n_clusters [int] {10} Number of clusters to be returned through Ward’s clustering algorithm.
backbone [bool] {True} Flag to determine if backbone atoms or full chain should be used
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
stride [int] {20} Defines the spacing betwen frames to compare with  i.e. take every $strideth frame. Setting stride=1 would mean every frame is used, which would mean you’re doing an all vs. all comparions, which would be ideal BUT may be slow.

get_contact_map
(distance_thresh=5.0, mode='closestheavy', stride=1, weights=False)[source]¶ get_contact_map() returns 2position tuple with the contact map (N x N matrix) and a contact order vector (N x 1) that describe the contacts (heavy atom  heavy atom interactions) made by each of the residues over the simulation.
Each element is normalized such that it is between 0 and 1. i+1 and i+2 elements are excluded, and the minimum distances is the distance between the closest heavy atoms on each residue (backbone or sidechain).
 Parameters
distance_thresh (float {5.0}) – Distance threshold used to define a ‘contact’ in Angstroms. Contacts are taken as frames in which the atoms defined by the scheme are within $distance_thresh angstroms of one another
mode (string {'closestheavy'}) –
Mode allows the user to define differnet modes for computing contacts. The default value is ‘closestheavy’. Other options are detailed below and are identical to those offered by mdtraj in compute_contacts
’ca’  same as setting ‘atom’ and A1=’CA’ and A2=’CA’, this uses the Calpha atoms
 ’closest’  closest atom associated with each of the residues, i.e. the is the point
of closest approach between the two residues
’closestheavy’  same as closest, except only nonhydrogen atoms are considered
 ’sidechain’  closest atom where that atom is in the sidechain. Note this requires
mdtraj version 1.8.0 or higher.
 ’sidechainheavy’  closest atom where that atom is in the sidechain and is heavy.
Note this requires mdtraj version 1.8.0 or higher.
stride (int {1}) – Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary.
[list or array of floats] {False} (weights) – Defines the framespecific weights if reweighted analysis is required. This can be useful if an ensemble has been reweighted to better match experimental data, or in the case of analysing replica exchange data that is recombined using TWHAM.
Returns –
 –
of size 2 (tuple) – Returns a tuple where: 0  contact map 1  contact order

get_dihedral_mutual_information
(angle_name='psi', bwidth=0.6283185307179586, stride=1, weights=False)[source]¶ Generate the full mutual information matrix for a specific diehdral type. The resulting matrix describes the mutual information between each dihedral angle as defined by the variable angle_name. A weights parameter can be passed if frames are to be reweighted, but this requires that the (number of frames) / stride = the number of weights.
The mutual information for a pair of angles is determined by generating a histogram of each dihedral induvidually (p(phi1), p(phi2)) and the joint probability histogram (p(phi1,phi2)), and then computing the Shannon entropy associated with the single and joing probability histograms (H_phi1, H_phi2, H_phi1_phi2). The mutual information is then returned as
H_phi1 + H_phi2  (H_phi1 * H_phi2 )
The easiest way to interpret these results is to normalize the inferred matrix using an equivalent matrix generated using a limiting polymer model (e.g. an EV or FRC simulation).
Return: Mutual information matrix ( n x n) where n is the number of that type of bonds in the protein.
angle_name String, must be one of the following options ‘chi1’,’phi, ‘psi’, ‘omega’.
bwidth [np.array} {np.pi/5.0) The width of the bins that will stretch from pi to pi. np.pi/5.0 is probablty the smallest binsize you want  even np.pi/2 should work well. You may want to experiment with this parameter…
stride [int] {20} Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame.
weights [list or array of floats] {False} Defines the framespecific weights if reweighted analysis is required. This can be useful if an ensemble has been reweighted to better match experimental data, or in the case of analysing replica exchange data that is recombined using TWHAM.

get_distance_map
(mode='CA', RMS=False, stride=1, weights=False, verbose=True)[source]¶ Function to calculate the CA defined distance map for a protein of interest. Note this function doesn’t take any arguments and instead will just calculate the complete distance map.
NOTE that distance maps are calculated between all CACA distances and NOT center of mass positions. This also means ACE/NME caps are EXCLUDED from this anlysis.
Distance is described in Angstroms.
 Parameters
mode (str {'CA'}) – String, must be one of either ‘CA’ or ‘COM’.  ‘CA’ = alpha carbon.  ‘COM’ = center of mass (associated withe the residue).
RMS (bool {False}) – If set to False, scaling map reports ensemble average distances (this is the standard and default behaviour). If True, then the distance reported is the root mean squared (RMS) = i.e. SQRT(<r_ij^2>), which is the formal order parameter that should be used for polymeric distance properties.
stride (int {1}) – Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame.
weights (list or array of floats) – Defines the framespecific weights if reweighted analysis is required. This can be useful if an ensemble has been reweighted to better match experimental data, or in the case of analysing replica exchange data that is recombined using TWHAM.
verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!
 Returns
A 2tuple containing:  [0] := The distance map derived from the measurements between CA atoms.  [1] := The standard deviation corresponding to the distance map.
 Return type
tuple

get_end_to_end_distance
(mode='COM')[source]¶ Returns the vector assocaiated with the endtoend distance for the simulation.
The vector of Endtoend distances is returned in angstroms
 Parameters
[string] {'CA'} (mode) –
String –
be one of either 'CA' or 'COM'. (must) –
= alpha carbon ('CA') –
= center of mass (associated withe the residue) ('COM') –

get_end_to_end_vs_rg_correlation
(mode='COM')[source]¶ Computes the correlation between Rg^2 and endtoend^2.
 Parameters
mode (str {'CA'}) – String, must be one of either ‘CA’ or ‘COM’.  ‘CA’ = alpha carbon.  ‘COM’ = center of mass (associated withe the residue).
 Returns
A single float describing the correlation (as calculated by np.corrcoef).
 Return type
float

get_gyration_tensor
(R1=None, R2=None, correctOffset=True, verbose=True)[source]¶ Returns the instantaneous gyration tensor associated with each frame.
 Parameters
R1 (int {None}) – Index value for first residue in the region of interest. If not provided (False) then first residue is used.
 R1int {None}
Index value for last residue in the region of interest. If not provided (False) then last residue is used.
 correctOffset: bool {True}
Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again
 verbosebool
Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!
 Returns
Returns a numpy array where each position is the framespecific gyration tensor value
 Return type
np.ndarray

get_hydrodynamic_radius
(R1=None, R2=None, alpha1=0.216, alpha2=4.06, alpha3=0.821, correctOffset=True)[source]¶ Returns the apparent hydrodynamic radius as calculated based on the approximation derived by Nygaard et al. [1]. Returns a hydrodynamic radius in Angstroms.
Parameters (alpha1/2/3 should not be altered to recapitulate behaviour defined by Nygaard et al.
[1] Nygaard M, Kragelund BB, Papaleo E, LindorffLarsen K. An Efficient Method for Estimating the Hydrodynamic Radius of Disordered Protein Conformations. Biophys J. 2017;113: 550–557.
Radius of gyration is returned in Angstroms.
 Parameters
R1 (int {False}) – Index value for first residue in the region of interest. If not provided (False) then first residue is used.
R2 (int {False}) – Index value for last residue in the region of interest. If not provided (False) then last residue is used.
alpha1 (float {0.216}) – First parameter in equation (7) from Nygaard et al.
alpha2 (float {4.06}) – Second parameter in equation (7) from Nygaard et al.
alpha3 (float {0.821}) – Third parameter in equation (7) from Nygaard et al.
correctOffset (bool {True}) – Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
 Returns
Returns a numpy array with perframe instantaneous hydrodynamic radii
 Return type
np.ndarray

get_inter_residue_COM_distance
(R1, R2, correctOffset=True, stride=1)[source]¶ Function which calculates the complete set of distances between two residues’ centers of mass (COM) and returns a vector of the distances.
Distance is returned in Angstroms.
R1 [int] Residue index of first residue
R2 [int] Residue index of second residue
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
stride [int] {1} Defines the spacing betwen frames to compare with  i.e. take every $strideth frame. Setting stride=1 would mean every frame is used, which would mean you’re doing an all vs. all comparions, which would be ideal BUT may be slow.

get_inter_residue_COM_vector
(R1, R2, correctOffset=True, stride=1)[source]¶ Function which calculates the complete set of distances between two residues’ centers of mass (COM) and returns the interresidue distance vector.
NOTE: This gives a VECTOR and not the distance between the two centers of mass (which is calculated by get_inter_residue_COM_distance)
R1 [int] Residue index of first residue
R2 [int] Residue index of second residue
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
stride [int] {20} Defines the spacing betwen frames to compare with  i.e. take every $strideth frame. Setting stride=1 would mean every frame is used, which would mean you’re doing an all vs. all comparions, which would be ideal BUT may be slow.

get_inter_residue_atomic_distance
(R1, R2, A1='CA', A2='CA', mode='atom', correctOffset=True, stride=1)[source]¶ Function which returns the distance between two specific atoms on two residues. The atoms selected are based on the ‘name’ field from the topology selection language. This defines a specific atom as defined by the PDB file. By default A1 and A2 are CA (Calpha) but one can define any residue of interest.
We do not perform any sanity checking on the atom name  this gets really hard  so have an explicit try/except block which will warn you that you’ve probably selected an illegal atom name from the residues.
Distance is returned in Angstroms.
R1 [int] Residue index of first residue
R2 [int] Residue index of second residue
A1 [string] {CA} Atom name of the atom in R1 we’re looking at
A2 [string {CA} Atom name of the atom in R2 we’re looking at
mode [string] {‘atom’} Mode allows the user to define differnet modes for computing atomic distance. The default is ‘atom’ whereby a pair of atoms (A1 and A2) are provided. Other options are detailed below and are identical to those offered by mdtraj in compute_contacts
‘ca’  same as setting ‘atom’ and A1=’CA’ and A2=’CA’, this uses the Calpha atoms
 ‘closest’  closest atom associated with each of the residues, i.e. the is the point
of closest approach between the two residues
‘closestheavy’  same as closest, except only nonhydrogen atoms are considered
 ‘sidechain’  closest atom where that atom is in the sidechain. Note this requires
mdtraj version 1.8.0 or higher.
 ‘sidechainheavy’  closest atom where that atom is in the sidechain and is heavy.
Note this requires mdtraj version 1.8.0 or higher.
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
stride [int] {1} Defines the spacing betwen frames to compare with  i.e. take every $strideth frame. Setting stride=1 would mean every frame is used, which would mean you’re doing an all vs. all comparions, which would be ideal BUT may be slow.

get_internal_scaling
(R1=None, R2=None, mode='COM', mean_vals=False, correctOffset=True, stride=1, weights=False, verbose=True)[source]¶ Calculates the raw internal scaling info for the protein in the simulation. R1 and R2 define a subregion to operate over if subregional analysis is required. When residues are not provided the full protein’s internal scaling (EXCLUDING* the caps, if present) is calculated.
Distance is measured in Angstroms.
Returns two lists of the same length
List of arrays, where each array is the simulation average set of interresidue distances for the primary sequence separation defined by the equivalent position in the second array. Each array in this list will be a different length as there are many more i to i+1 pairs of residues than i to i+10 (e.g. in a 15 residue sequence).
The sequence separation associated with each set of measurements (i.e. a single list which will normally be 0,1,2,3,…n where n = number of residues in sequence.
The internal scaling profile is a plot of sequence separation vs. mean throughspace distance for all pairs of residues at a given sequence separation. What this means is that if we had a 6 residue peptide the internal scaling profile would be calculated as follows.
sequence separation = 0 average distance(average distance of 1to1, 2to2, 3to3, etc.)
sequence separation = 1 average distance(average distance of 1to2, 2to3, 3to4, etc.)
sequence separation = 2 average distance(average distance of 1to3, 2to4, 3to5, etc.)
sequence separation = 3 average distance(average distance of 1to4, 2to5, 3to6.)
sequence separation = 4 average distance(average distance of 1to5, 2to6)
sequence separation = 5 average distance(average distance of 1to6)
The residues considered for internal scaling analysis DO NOT include the ACE/NME peptide caps if present. This differs from CAMPARI, which DOES include the peptide caps.
For more information on this and other ideas for how polymerphysics can be a useful way of thinking about proteins, take a look at
Mao, A.H., Lyle, N., and Pappu, R.V. (2013). Describing sequenceensemble relationships for intrinsically disordered proteins. Biochem. J 449, 307318.
and
Pappu, R.V., Wang, X., Vitalis, A., and Crick, S.L. (2008). A polymer physics perspective on driving forces and mechanisms for protein aggregation  Highlight Issue: Protein Folding. Arch. Biochem. Biophys. 469, 132141.
The exclusion of the caps is different to how this is calculated in CAMPARI,
which includes the caps, if present.
R1 [int] {False} Index value for first residue in the region of interest. If not provided (False) then first residue is used.
R1 [int] {False} Index value for last residue in the region of interest. If not provided (False) then last residue is used.
mode [‘CA’ or ‘COM’] {‘CA’} Defines the mode used to define the residue position, either the residue center or mass or the residue CA atom. The provided mode must be one of these two options.
mean_vals [Book] {False} This is False by default, but if true the mean IS is returned instead of the explicit values. In reality the nondefault behaviour is probably preferable…
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
 strideint {1}
Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary.
weights [list or array of floats] {False} Defines the framespecific weights if reweighted analysis is required. This can be useful if an ensemble has been reweighted to better match experimental data, or in the case of analysing replica exchange data that is recombined using TWHAM.
 verbosebool
Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!

get_internal_scaling_RMS
(R1=None, R2=None, mode='COM', stride=1, correctOffset=True, weights=False, verbose=True)[source]¶ If \(r_{i,j} = \langle \langle \sum \sigma_{1}\) equals \(\sigma_{2}\) then etc, etc.
Calculates the averaged internal scaling info for the protein in the simulation in terms of root mean square (i.e. sqrt(<Rij^2>) vs  i  j .
R1 and R2 define a subregion to operate over if subregional analysis is required. When residues are not provided the full protein’s internal scaling (EXCLUDING* the caps, if present) is calculated.
Distance is measured in Angstroms.
Returns two lists of the same length:
sequence separation (i  j)
mean sqrt(<Rij^2>)
The internal scaling profile is a plot of sequence separation vs. mean throughspace distance for all pairs of residues at a given sequence separation. What this means is that if we had a 6 residue peptide the internal scaling profile would be calculated as follows.
sequence separation = 0 average distance(average distance of 1to1, 2to2, 3to3, etc.)
sequence separation = 1 average distance(average distance of 1to2, 2to3, 3to4, etc.)
sequence separation = 2 average distance(average distance of 1to3, 2to4, 3to5, etc.)
sequence separation = 3 average distance(average distance of 1to4, 2to5, 3to6.)
sequence separation = 4 average distance(average distance of 1to5, 2to6)
sequence separation = 5 average distance(average distance of 1to6)
The residues considered for internal scaling analysis DO NOT include the ACE/NME peptide caps if present. This differs from CAMPARI, which DOES include the peptide caps.
For more information on this and other ideas for how polymerphysics can be a useful way of thinking about proteins, take a look at
Mao, A.H., Lyle, N., and Pappu, R.V. (2013). Describing sequenceensemble relationships for intrinsically disordered proteins. Biochem. J 449, 307318.
and
Pappu, R.V., Wang, X., Vitalis, A., and Crick, S.L. (2008). A polymer physics perspective on driving forces and mechanisms for protein aggregation  Highlight Issue: Protein Folding. Arch. Biochem. Biophys. 469, 132141.
The exclusion of the caps is different to CAMPARI, which includes the caps, if present.
R1 [int] {None} Index value for first residue in the region of interest. If not provided (False) then first residue is used.
R1 [int] {None} Index value for last residue in the region of interest. If not provided (False) then last residue is used.
mode [‘CA’ or ‘COM’] {‘CA’} Defines the mode used to define the residue position, either the residue center or mass or the residue CA atom. The provided mode must be one of these two options.
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
weights [list or array of floats] {False} Defines the framespecific weights if reweighted analysis is required. This can be useful if an ensemble has been reweighted to better match experimental data, or in the case of analysing replica exchange data that is recombined using TWHAM.
 verbosebool
Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!

get_local_collapse
(window_size=10, bins=None, verbose=True)[source]¶ local collapse calculates a vectorial representation of the radius of gyration along a polypeptide chain. This makes it very easy to determine where you see local collapse vs. local expansion.
 Parameters
window_size (int, default=10) – Size of the window over which conformations are examined. Default is 10.
bins (np.arange or list) – A range of values (np.arange or list) spanning histogram bins. Default is np.arange(0, 10, 0.1).
verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!
 Returns
 return[0]list of floats of len n, where each float reports on the mean local collapse a specific
position along the sequence as defined by the fragment_size
 return[1]list of floats of len n , where each float reports on the standard deviation of the
local collapse at a specific position along the sequence as defined by the fragment_size
 return[2]List of np.ndarrays of len n, where each subarray reports on the histogram values associated
with the local collapse at a given position along the sequence
return[3] : np.ndarray which corresponds to bin values for each of the histograms in return[2]
 Return type
tuple (len = 4)

get_local_heterogeneity
(fragment_size=10, bins=None, stride=20, verbose=True)[source]¶ Function to calculate the vector of D values used to calculate the Phi parameter from Lyle et al[1]. The stride defines the spacing between frames which are analyzed. This is just for practical purposes. The Phi calulation computes a D value for each frame vs. frame comparison  for a 2000 frame simulation this would be 4 Million D values if every value was calculated which is a bit much, so the stride let’s you define how many frames you should skip.
For a 2000 frame trajectory of a 80 residue protein with a stride=20 allows the calculation to take about 5 seconds. However, as protein size increases the computational cost of this process grows rapidly.
 Parameters
fragment_size (int {10}) – Size of local region that is considered to be a single unit over which structural heterogeneity is examined. Should be between 2 and the length of the sequence.
bins (np.ndarray {np.arange(0,1,0.01)}) – Bins used to capture the heterogeneity at each position. If default a set of bins from 0 to 1 with an interval of 0.01 is used
stride (int {20}) – Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame.
verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!
 Returns
 return[0]list of floats of len n, where each float reports on the mean RMSDdeviation at a specific
position along the sequence as defined by the fragment_size
 return[1]list of floats of len n , where each float reports on the standard deviation of the
RMSDdeviation at a specific position along the sequence as defined by the fragment_size
 return[2]List of np.ndarrays of len n, where each subarray reports on the histogram values associated
with the full RMSD distribution at a given position along the sequence
return[3] : np.ndarray which corresponds to bin values for each of the histograms in return[2]
 Return type
tuple (len = 4)
 Raises
CTException –

get_local_to_global_correlation
(mode='COM', n_cycles=100, max_num_pairs=10, stride=20, weights=False, verbose=True)[source]¶ Method to analyze how well ensemble average distances taken from a finite number of interresidue distances correlate with global dimensions as measured by the radius of gyration. This is a new analysis, that is best explained through a formal write up.
 Parameters
mode (str {'COM'}) –
Must be one of either ‘CA’ or ‘COM’.
’CA’ = alpha carbon.
’COM’ = center of mass (associated withe the residue).
n_cycles (int {100}) – Number of times, for each number of pairs, we reselect a different set of paris to use. This depends (weakly) on the number of residues, but we do not recommend a value < 50. For larger proteins this number should be increased. Again, it’s worth examining how your results chain as a function of n_cycles to determine the optimal tradeoff between speed and precision.
max_num_pairs (int {10}) – The maximum number of pairs to be consider for correlation analysis. In general we’ve found above 1020 the average correlation tends to plateau towards 1 (note it will NEVER be 1) so 120 is a reasonable set to use.
stride (int {20}) – Defines the spacing between frames for calculating the ensemble average. As stride gets larger this analysis gets slower. It is worth experimenting with to see how the results change as a function of stride, but in theory the accuracy should remain fixed but precision improved as stride is reduced.
weights (bool {False}) – Flag that indicates if frame weights should be used or not.
verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!
 Returns
Returns a fourplace tuple.
[0] = This is an 2 by (n_cycles*max_num_pairs) array, where the first column is the number of pairs and the second column is the Rglambda correlation for a specific set of pairs (the pairs in question are not included). This can be thought of as the ‘raw’ data, and may only be useful if distributions of the correlation are of interest (e.g. for generating 2D histograms).
[1] = Array with the number of pairs used (e.g. if max_num_pairs = 10 then this would be [1,2,3,4,5,6,7,8,9]).
[2] = Array with the mean correlation associated with the numbers of pairs in position 2 and the radius of gyration.
[3] = Array with the standard deviation of the correlation associated with the number of pairs in position 2 and the radius of gyration. Note the inclusion of the standard deviation makes the assumption that the distribution is Gaussian which may or may not be true.
 Return type
tuple
 Raises
CTException – Raised when the mode is not ‘CA’ or ‘COM’; or, when the lengths of the Rgstride derived calculations did not match the lengths of the internal distances; or, when the installed numpy version doesn’t support the aweights keyword for the numpy.cov function.

get_multiple_CA_index
(resID_list=None, correctOffset=True)[source]¶ Returns the atom indices associated with the Calpha (CA) atom for the residues defined in the resID_list OR for all residues, if no list is provided.
 Parameters
resID_list (list of int {None}) – Defines a list of residues for which the Calpha atom index will be retrieved. If no list is provided we simply select the list of residues with Calphas, whose indices have been corrected.
correctOffset (bool {True}) – Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
 Returns
CAlist – The list (int) of CAlpha indices of the input list of residue IDs.
 Return type
list

get_offset_residue
(R1)[source]¶ Returns the true residue index (TRI) for this protein by taking into account the residue offset. Checks the residue first to ensure a valid residue has been passed.
NOTE: As of MDTraj 1.9.5 this is ACTUALLY overkill as when new trajectories are created the resid numbering resets to starting at 0 (as opposed to retaining the old trajectory numbering). However, for 1.9.4 or lower this did not happen, such that usin the get_offset_residue() function provides longlasting backwards compatibility.
 Parameters
R1 (int) – The zeroindexed input residue offset (int). This value is examined by __check_single_residue().
 Returns
TRI – The updated index which reflects the input offset to determine the true starting residue index.
 Return type
int
See also

get_overlap_concentration
()[source]¶ Returns the overlap concentration for the chain.
The overlap concentration reflects the concentration at which a flexible polymer begins to ‘collide’ in trans with other polymers  i.e. the concentration at which the chains begin to overlap.
 Returns
Molar concentration for the overlap concentration.
 Return type
float

get_polymer_scaled_distance_map
(nu=None, A0=None, min_separation=10, mode='fractionalchange', stride=1, weights=False, verbose=True)[source]¶ Function that allows for a global assesment of how well all ij distances conform to standard polymer scaling behaviour (i.e. $r_ij = A0*ij^{nu}$).
Essentially, this generates a distance map (2D matrix of i vs. j distances) where that distance is either normalized by the expected distance for a provided homopolymer model, or quantifies the fractional deviation from a homopolymer model fit to the data. These two modes are explained in more detail below.
In this standard scaling relationship:
r_ij : Average interresidue distance of residue i and j
 A0Scaling prefactor. Note this is NOT the same numerical value as the R0 prefactor that
defines the relationship Rg = R0*N^{nu}.
ij : Sequence separation between residues i and j
nu : The intrinsic polymer scaling exponent
This is the scaling behaviour expected for a standard homopolymer. This function then assess how well this relationship holds for ALL possible interresidue distances.
This function returns a four position tuple. Position 1 is an n x n numpy matrix (where n = sequence length), where the element is either the default value OR quantifes the deviation from a polymer model in one of two ways. Positions two and three are the nu and A0 values used, respectively. Finally, position 4 will be the reduced chisquare fitting to the polymer model for the internal scaling profile (i.e. how A0 and nu are originally calculated). NOTE that the reduced chisquared value will be 1 if nu and A0 are provided manually.
If no options are provided, the function calculates the best fit to a homopolymer mode using the default parameters associated with the get_scaling_exponent() function, and then uses this model to determine pairwise deviations.
 Parameters
nu (float {None}) – Scaling exponent used (if provided). Note for a provided nu to be used, both nu and A0 must be provided.
A0 (float {None}) – Scaling prefactor used (if provided). Note for a provided A0 to be used, both A0 and nu must be provided.
min_separation (int {10}) – Minimum distance for which deviations are calculated. At close distances, we expect local steric effects to cause deviations from a polymer model, so this value defines the threshold minimum distance to be used.
mode (str {'fractionalchange'}) –
Defines the mode in which deviation from a homopolymer model is calculated. Options are: ‘fractionalchange’, ‘signedfractionalchange’, ‘signedabsolutechange’, ‘scaled’.
fractionalchange: Each interresidue deviation is calculated as
d_ij = abs(r_ij  polymer_ij)/polymer_ij
Where r_ij is the mean distance from the simulation for residues i and j, and polymer_ij is the expected distance for any pair of residues that are separated by ij distance in the polymer model.
signedfractionalchange: Each interresidue deviation is calculated as:
d_ij = (r_ij  polymer_ij)/polymer_ij
i.e. the same as the fractionalchange, except a sign is now also included. Positive values mean there is expansion with respect to the homopolymer behaviour, while negative values mean there is contraction with respect to the homopolymer model.
signedabsolutechange: Each interresidue deviation is calculated as:
d_ij = (r_ij  polymer_ij)
i.e. the same as the signedfractionalchange, except now it is no longer fraction but in absolute distance units. This can be useful for getting a sense of by howmuch the real behaviour deviates from the model in terms of Angstroms.
scaled: Each interresidue deviation is calculated as:
d_ij = r_ij/polymer_ij
Where r_ij is and polymer_ij are defined as above.
weights (list or array of floats {None}) – Defines the framespecific weights if reweighted analysis is required. This can be useful if an ensemble has been reweighted to better match experimental data, or in the case of analysing replica exchange data that is recombined using TWHAM.
verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!
 Returns
 return[0]n x n numpy matrix (where n = sequence length), where the element is either the default value
OR quantifes the deviation from a polymer model in one of two ways.
 Return type
tuple (len = 4)
return[1] : float defining the scaling exponent nu
return[2] : float defining the A0 prefactor
return[3] : reduced chisquared fitting to the polymer model (goodness of fit)
 Raises
CTException –

get_radius_of_gyration
(R1=None, R2=None, correctOffset=True)[source]¶ Returns the radius of gyration associated with the region defined by the intervening stretch of residues between R1 and R2. When residues are not provided the full protein’s radius of gyration (INCLUDING the caps, if present) is calculated.
Radius of gyration is returned in Angstroms.
 Parameters
R1 (int {None}) – Index value for first residue in the region of interest. If not provided (None) then first residue is used.
R2 (int {None}) – Index value for last residue in the region of interest. If not provided (False) then last residue is used.
correctOffset (bool {True}) – Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
 Returns
Returns a numpy array with perframe instantaneous radius of gyration
 Return type
np.ndarray

get_regional_SASA
(R1, R2, probe_radius=0.14, correctOffset=True, stride=20)[source]¶ Returns the Solvent Accessible Surface Area (SASA) for a local region in every strideth frame. SASA is determined using shrake_rupley algorithm.
SASA is returned in Angstroms squared, BUT PROBE RADIUS is in nanometers!
R1 [int] Index value for the first residue in the region
R2 [int] Index value for the last residue in the region
probe_radius [float] {0.14} Radius of the solvent probe used in nm. Uses the GoldenSpiral algorithm. 0.14 nm is pretty standard. NOTE  the probe radius must be in nanometers
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
stride [int] {20} Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame

get_residue_COM
(R1, correctOffset=True)[source]¶ Returns the COM vector for the residue across the trajectory.
R1 [int] Index value for the residue of interest
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.

get_residue_mass
(R1, correctOffset=True)[source]¶ Returns the mass associated with a specific residue.
R1 [int] Residue index of residue to examine
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.

get_scaling_exponent
(inter_residue_min=15, end_effect=5, correctOffset=True, subdivision_batch_size=20, mode='COM', num_fitting_points=40, fraction_of_points=0.5, fraction_override=False, stride=1, weights=False, verbose=True)[source]¶ Estimation for the A0 and nu exponents for the standard polymer relationship
sqrt(<Rij^2>) = A0ij^(nu)
Nu reports on the solvent quality, while the prefactor (A0) reports on the average chain persistence length. For polymers that are above a 0.5 scaling exponent this works, but below this they deviate from fractal behaviour, so formally this relationship stops working. In practice, the best possible fit line does still track with relative compactness.
Returns a 9 position tuple with the following associated values: 0  best nu
1  best A0
2  minimum nu identified in bootstrap fitting
3  maximum nu identified in bootstrap fitting
4  minimum A0 identified in bootstrap fitting
5  maximum A0 identified in bootstrap fitting
6  reduced chi^2 for the fit region
7  reduced chi^2 for ALL points
 8  2column array, where col 1 is the sequence separation and col 2
is the real spatila separation for the ACTUAL data used to fit to the polymer model (i.e. these points are uniformly spaced from one another on a loglog plot). Reduced chi^2 for the fit region is calculated using this dataset.
 9  3column array, where col 1 is the sequence separation, col 2 is
the real spatial separation observed and col 3 is the best fit curve, for ALL ij distances. Reduced chi^2 for all points is calculated using this dataset.
 NOTE: Despite their precision nu and A0 should be treated as qualitative metrics, and are subject to finite
chain effects. The idea of a polymer scaling behaviour is only necessarily useful in the case of a homopolymer, whereas heterpolymers engender massive averaging that can mask underlying conformational complexity. We _strongly_ caution against over interpretation of the scaling exponent. For a better assement of how your chain actually deviates from homopolymer behaviour, see the function get_polymer_scaled_distance_map()
inter_residue_min [int] {15} Minimum distances used when selecting pairs of residues. This 25 threshold was determined previously, and essentially avoids scenarios where the two residues (i and j) are close to each other. The goal of this limit is to avoid finite chain size strongly influencing the scaling exponent limit.
end_effect [int] {5} Avoid pairs where one of the residues is $end_effect residues from the end. Helps mitigate endeffects. 5 chosen as it’s around above the bloblength in a polypeptide. Note that for homopolymers this is much less of an issue.
correctOffset [Bool] {True} Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
mode [string, either ‘COM’ or ‘CA’] {‘COM’} Defines the mode in which the internal scaling profile is calculated, can use either COM (center of mass) of each residue or the CA carbon of each residue. COM is more appropriate as CA will inherently give a larger profile.
num_fitting_points [int] {40} Number of evenly spaced points to used to fit the scaling exponent in loglog space. 40 seems to be a decent number that scales well
fraction_of_points [float between 0 and 1] {0.5} This is only used if fraction_override is set to True OR the sequence has less than the num_of_fitting_points residues. Means that instead of using a an absolute number of points (e.g. 40) to fit the loglog data, we use this fraction of residues. i.e. if the protein had 20 residues and fraction_of_points = 0.5 we’d use 10 points
fraction_override [bool] {False} If set to False then fraction_of_points ONLY used if the length of the sequence is less than the num_fitting points. If true then we explicitly use fraction_of_points and ignore num_fitting_points.
 strideint {1}
Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary.
weights [list or array of floats] {False} Defines the framespecific weights if reweighted analysis is required. This can be useful if an ensemble has been reweighted to better match experimental data, or in the case of analysing replica exchange data that is recombined using TWHAM.
 verbosebool
Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!

get_secondary_structure_BBSEG
(R1=None, R2=None, correctOffset=True)[source]¶ Returns a dictionary where eack keyvalue pair is keyed by a BBSEG classification type (09) and each value is a vector showing the fraction of time each residue is in that particular BBSEG type.
BBSEG classification types are listed below
0  unclassified 1  beta (turn/sheet) 2  PII (polyproline type II helix) 3  Unusual region 4  Right handed alpha helix 5  Inverse C7 Equatorial (gammaprime turn) 6  Classic C7 Equatorial (gamma turn) 7  Helix with 7 Residues per Turn 8  Left handed alpha helix
Parameters R1 and R2 are optional and allow a local subregion to be defined.
The return dictionary provides a classification vector for each of the 9 possible types of classification (note 0 = unclassified).
 Parameters
R1 (int) – Default value is False. Defines the value for first residue in the region of interest. If not provided (False) then first residue is used.
 R2int
Default value is False. Defines the value for last residue in the region of interest. If not provided (False) then last residue is used.
 correctOffsetBool
Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again. If you’re calling this function you can probably ignore this variable.
 Returns
return_bbseg – Dictionary of 9 keyvalue pairs where keys are integers 08 and values are numpy arrays showing the fractional occupancy of each of the distinct types of defined secondary structure. Note the three classifications will sum to 1 (within numerical precision).
 Return type
dict

get_secondary_structure_DSSP
(R1=None, R2=None, correctOffset=True)[source]¶ Returns the a 4 by n numpy array inwhich column 1 gives residue number, column 2 is local helicity, column 3 is local ‘extended’ (beta strand/sheet) and column 4 is local coil on a perresidue basis.
Parameter R1 and R2 define a local region if only a subregion is required.
Return vector provides normalized secondary structure (between 0 and 1) which reflects the fraction of the simulation each residue is in that particular secondary structure type.
 Parameters
R1 (int {None}) – Default value is None. Defines the value for first residue in the region of interest. If not provided (False) then first residue is used.
 R2int {None}
Default value is None. Defines the value for last residue in the region of interest. If not provided (False) then last residue is used.
 correctOffsetBool
Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again. If you’re calling this function you can probably ignore this variable.
 Returns
ddsp_vector – A 4xn numpy array (where n is the number of residues) in which column 1 defines the residue index and column 24 defines the fractional occupancy of helical (H), extended (E) (betalike) and coil (C) states. Note the three classifications will sum to 1 (within numerical precision).
 Return type
np.array

get_sidechain_alignment_angle
(R1, R2, sidechain_atom_1='default', sidechain_atom_2='default', correctOffset=True)[source]¶ Function that computes the angle alignment between two residue sidechains. Sidechain vectors are defined as the unit vector between the CA of the residue and a designated ‘sidechain’ atom on the sidechain. The default sidechain atoms are listed below, but custom atom names can also be provided using the sidechain_atom_1/2 variables.

get_site_accessibility
(input_list, probe_radius=0.14, mode='residue_type', stride=20)[source]¶ Function to compute site/residue type accessibility. This can be done using one of two modes. Under ‘residue_type’ mode, the input_list should be a list of canonical 3letter amino acid names. Under ‘resid’ mode, the input list should be a list of residue id positions (recall that resid ALWAYS starts from 0 and will include the ACE cap if present).
Returns a dictionary with the key = residue “typenumber” string and the value equal the average and standard devaition associated with the SASA. Useful for looking at comparative accessibility of the same residue accross the sequence.
input_list [list] List of either residue names (e.g. [‘TRP’,’TYR’,’GLN’] or resid values ([1,2,3,4]) which will be taken and the SASA calculated for.
probe_radius [float] {0.14} Radius of the solvent probe used in nm. Uses the GoldenSpiral algorithm. 0.14 nm is pretty standard. NOTE  the probe radius must be in nanometers
mode [string] {‘residue_type’} Mode used to examine sites. MUST be one of ‘residue_type’ or ‘resid’
stride [int] {20} Defines the spacing between frames to compare  i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every strideth frame

get_t
(R1=None, R2=None, correctOffset=True)[source]¶ Returns the <t>, a dimensionless parameter which describes the size of the ensemble.
 Parameters
R1 (int {None}) – Index value for first residue in the region of interest. If not provided (False) then first residue is used.
R2 (int {None}) – Index value for last residue in the region of interest. If not provided (False) then last residue is used.
correctOffset (bool {True}) – Defines if we perform local protein offset correction or not. By default we do, but some internal functions may have already performed the correction and so don’t need to perform it again.
 Returns
Returns a numpy array with perframe instantaneous tvalues
 Return type
np.ndarray

property
idx_with_CA
¶ Return a list of zeroindexed IDs that have CA atoms. Note these indices start at zero regardless of what the internal .topology resid numbering is. Recall that for for mdtraj 1.9.5 or greater this should be the same as the values returned by resid_with_CA.
The values associated with this property are built using the internal function __get_resid_with_CA()
 Returns
 Return type
list of integers
See also

property
n_frames
¶ Returns the number of frames in the trajectory
 Returns
Returns the number of frames in the simulation trajectory
 Return type
int

property
n_residues
¶ Returns the number of residues in the protein (including caps)
 Returns
Returns the number of frames in the protein
 Return type
int

property
ncap
¶ Flag that returns if an Nterminal capping residue is present (or not).
 Returns
True if Nteminal cap is present, False if not
 Return type
bool

print_residues
(verbose=True)[source]¶ Function to help determine the mapping of residue ID to PDB residue value. Prints the mapping between resid and PDB residue, and returns this information in a list.
Returns a list of lists, where each list element is itself a list of two elements, index position and the resnameresid from the PDB file.
 Parameters
verbose (bool {True}) – If set to True, print_residues() will print out to screen and also return a list. If set to False, means nothing is printed to the screen.
 Returns
List containing a mapping of the zeroindexed residues and their names.
 Return type
return_list

property
resid_with_CA
¶ Return a list of resids that have CA atoms. Note these residues have already have their offset correction applied, and so are valid to be used for selection via the underlying mdtraj topology object (self.topology).
The values associated with this property are built using the internal function __get_resid_with_CA()
 Returns
 Return type
list of integers
See also

property
residue_index_list
¶ Returns the list of residue index values for this protein. i.e. this is the correctly offset residue list and will include NME/ACE caps if present.
For backend implementation details, this returns the res.index values from all res in topology.residues (mdtraj.mdtrajectory)

References
 BiDB79
Biskup, J.; Dayal, U.; Bernstein, P.A..: Synthesizing independent database schemas. In: ACM SIGMOD 1979 Int. Conf. On Management of Data Proceedings, S. 143151.