Preprocessing Utilities#
This module contains utility functions for preprocessing tasks.
Clustering Utilities#
- class atomworks.ml.preprocessing.utils.clustering.ClusterMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
IntEnum
MMSeqs2 cluster modes. For details, see github.com/soedinglab/mmseqs2/wiki#clustering-modes
- CONNECTED_COMPONENT = 1#
- GREEDY_INCREMENTAL = 2#
- GREEDY_SET_COVER = 0#
- class atomworks.ml.preprocessing.utils.clustering.CoverageMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
IntEnum
MMSeqs2 coverage modes for clustering. For details, see github.com/soedinglab/mmseqs2/wiki#how-to-set-the-right-alignment-coverage-to-cluster
- BIDIRECTIONAL = 0#
- QUERY = 2#
- QUERY_IN_TARGET = 4#
- SHORTER_IN_OTHER = 5#
- TARGET = 1#
- TARGET_IN_QUERY = 3#
- class atomworks.ml.preprocessing.utils.clustering.MMSeqs2Config(cluster_identity: float = 0.4, coverage: float = 0.8, sensitivity: float = 8.0, cluster_mode: ClusterMode = ClusterMode.GREEDY_SET_COVER, coverage_mode: CoverageMode = CoverageMode.BIDIRECTIONAL)[source]#
Bases:
object
Input configuration for MMseqs2 clustering.
- cluster_identity: float = 0.4#
- cluster_mode: ClusterMode = 0#
- coverage: float = 0.8#
- coverage_mode: CoverageMode = 0#
- sensitivity: float = 8.0#
- atomworks.ml.preprocessing.utils.clustering.cluster_all_sequences(pn_units_df: ~os.PathLike | str | ~pandas.core.frame.DataFrame, sequence_col_name: str, sequence_hash_col_name: str, output_col_prefix: str = 'q_pn_unit_cluster', set_to_cluster_col: bool = False, output_path: str | None = None, clustering_configs: list[~atomworks.ml.preprocessing.utils.clustering.MMSeqs2Config] = [MMSeqs2Config(cluster_identity=0.4, coverage=0.8, sensitivity=8.0, cluster_mode=<ClusterMode.GREEDY_SET_COVER: 0>, coverage_mode=<CoverageMode.BIDIRECTIONAL: 0>), MMSeqs2Config(cluster_identity=1.0, coverage=0.8, sensitivity=8.0, cluster_mode=<ClusterMode.GREEDY_SET_COVER: 0>, coverage_mode=<CoverageMode.BIDIRECTIONAL: 0>)]) DataFrame [source]#
Clusters input sequences from a DataFrame and merges the cluster information back into the DataFrame.
This function performs the following steps: 1. Creates (deduplicated) FASTA files for sequences from the input DataFrame. 2. Runs MMseqs2 clustering on the generated FASTA files for each specified parameter configuration. 3. Merges the clustering results back into the original DataFrame. 4. Saves the updated DataFrame with clustering information to a specified output path.
- Parameters:
pn_units_df_path (PathLike | str | pd.DataFrame) – Path to the input DataFrame stored as a Parquet file, or the DataFrame directly.
sequence_col_name (str) – The name of the column containing the canonical sequences to be clustered.
sequence_hash_col_name (str) – The name of the column containing the sequence hash for the sequences to be clustered.
output_col_prefix (str) – Prefix for the name of the output column containing the representative sequence hash for each cluster.
set_to_cluster_col (bool) – If True, sets the ‘cluster’ column of the output dataframe equal to last of the computed clusters.
output_path (str | None) – Path to save the output DataFrame with clustering information. If None, returns the dataframe without saving.
clustering_configs (list[MMSeqs2Config]) – List of MMSeqs2Config objects specifying the clustering configurations to run.
- Columns added to DataFrame:
{output_col_prefix}_{cluster_mode_str}_rep_seq_hash: Representative sequence hash for protein clusters with the given configuration.
<optional> cluster: If set_to_cluster_col is True, this column is set equal to the above column.
- Returns:
None
- atomworks.ml.preprocessing.utils.clustering.run_mmseqs2_clustering(input_fasta: str | ~os.PathLike, clustering_config: ~atomworks.ml.preprocessing.utils.clustering.MMSeqs2Config = MMSeqs2Config(cluster_identity=0.4, coverage=0.8, sensitivity=8.0, cluster_mode=<ClusterMode.GREEDY_SET_COVER: 0>, coverage_mode=<CoverageMode.BIDIRECTIONAL: 0>), temp_dir: ~os.PathLike | str | None = None) DataFrame [source]#
Runs MMseqs2 clustering on the input FASTA file.
- Parameters:
input_fasta (str) – Path to the input FASTA file. The headers for the sequences should be unique, as they are used to identify the sequences in the output.
clustering_config (MMSeqs2Config) – Configuration for MMseqs2 clustering. See MMSeqs2Config for details.
tmp_dir (PathLike | str) – Path to the temporary directory where MMseqs2 will write intermediate files. Default is None.
- Returns:
DataFrame containing the clustering results with columns [“cluster_rep_seq_hash”, “seq_hash”].
- Return type:
pd.DataFrame
Example
cluster_rep_seq_hash, seq_hash afe56282ba3, afe56282ba3 afe56282ba3, ee1f80a23f3 afe56282ba3, 4a2caa18797 afe56282ba3, 19f7ce1eed1
References
PDB clustering approach: https://www.rcsb.org/docs/grouping-structures/sequence-based-clustering
MMseqs2 documentation: soedinglab/mmseqs2
CLI documentation for the easy-cluster command: mmseqs easy-cluster -h
FASTA Utilities#
- atomworks.ml.preprocessing.utils.fasta.create_fasta_file_from_df(pn_units_df: PathLike | str | DataFrame, sequence_col_name: str, output_path: PathLike | str) None [source]#
Create a FASTA file from sequences stored as a dataframe in a Parquet file.
- Parameters:
pn_units_df (pd.DataFrame | PathLike | str) – Dataframe, as a path Parquet or object directly, containing a column with the sequences to be clustered.
sequence_col_name (str) – The name of the column containing the canonical sequences to be clustered.
output_path (PathLike | str) – Path to where the fasta file will be saved. Must end in .fasta extension.
- atomworks.ml.preprocessing.utils.fasta.wrap_sequence(sequence: str, line_length: int = 80) str [source]#
Wrap a sequence string to a specified line length.
- Parameters:
sequence (str) – The sequence string to wrap.
line_length (int) – The maximum line length. Default is 80.
- Returns:
The wrapped sequence string.
- Return type:
str
Structure Utilities#
Utilities to create the dataset to be loaded by RF2AA. See the main script for term Glossary.
- atomworks.ml.preprocessing.utils.structure_utils.calculate_molecule_diameter(full_molecule_atoms: AtomArray) float [source]#
Calculates the molecular diameter, defined as the maximum number of bonds between any two vertices in the molecule, using a maximum spanning tree.
- Parameters:
full_molecule_atoms (AtomArray) – The array of atoms representing the molecule. Must include all atoms, including those with zero occupancy.
- Returns:
The molecular diameter. If the diameter cannot be computed, returns 0.0.
- Return type:
float
- atomworks.ml.preprocessing.utils.structure_utils.get_atom_mask_from_cell_list(coord: array, cell_list: CellList, cell_list_size: int, cutoff: float, chunk_size: int = 2000000000) ndarray [source]#
Builds a mask indicating which atoms clash with the query PN unit. If the number of comparisons is too large, the computation is split into manageable chunks along the rows of coord.
- Parameters:
coord (ndarray) – The coordinates of the query PN unit. Shape is (n, 3).
cell_list (CellList) – A CellList object that allows efficient vicinity searches.
cell_list_size (int) – The number of atoms in the cell list.
clash_distance (float) – The distance threshold below which atoms are considered to be clashing.
chunk_size (int) – The maximum number of comparisons allowed in a single chunk.
- Returns:
Mask indicating which atoms in cell_list clash with the atoms in coord. Shape is (n, cell_list_size), dtype is bool.
- Return type:
ndarray
- atomworks.ml.preprocessing.utils.structure_utils.get_bonded_polymer_pn_units(query_pn_unit_iid: str, filtered_atom_array: AtomArray) set[str] [source]#
Returns a set of polymer PN units that are covalently bonded to a given PN unit. For example, useful to detect oligosaccharides that are covalently bonded to a protein.
- Parameters:
query_pn_unit_iid (str) – The full ID of the non-polymer PN unit to check for bonds.
filtered_atom_array (AtomArray) – AtomArray with non-zero occupancy
- Returns:
A set of full IDs of polymer PN units that are covalently bonded to the query PN unit.
- Return type:
set[str]
- atomworks.ml.preprocessing.utils.structure_utils.get_clashing_pn_units(pn_unit_iids_to_consider: np.Array, atom_array: AtomArray, cell_list: CellList, clash_distance: float) tuple[set[str], dict[str, set[str]]] [source]#
Wrapper function to find clashing PN units within an atom array.
- atomworks.ml.preprocessing.utils.structure_utils.get_contacting_pn_units(query_pn_unit: AtomArray, filtered_atom_array: AtomArray, cell_list: CellList, contact_distance: float = 4.5, min_contacts_required: float = 1, mask: ndarray = None, calculate_min_distance: bool = False) list[dict[str, str]] [source]#
Finds PN units (proteins, nucleic acids, or small molecules) with a minimum number of atoms within a given distance of the query PN unit.
- Parameters:
query_pn_unit – AtomArray containing the query PN unit that we want to check for contacts (could be a single chain or multiple covalently bonded chains)
filtered_atom_array – AtomArray containing the set of atoms to consider when looking for contacts (must correspond to the cell_list)
cell_list – CellList of atom_array for rapid distance computations; must correspond to the filtered_atom_array
contact_distance – Distance threshold for contacting atoms
min_contacts_required – Minimum number of atoms within the cutoff distance to consider a PN unit as a potential partner
mask – Mask of PN units to consider as potential partners within the atom_array. If None, all PN units (except the query) are considered.
calculate_min_distance – Whether to calculate the minimum distance between the query PN unit and each PN unit within the cutoff distance
- Returns:
- A list of dictionaries, each containing:
’pn_unit_iid’ (str): A string representing a unique partner PN unit in contact with the query PN unit.
’num_atoms’ (int): Number of non-zero occupancy atoms in the partner PN unit.
’num_contacts’ (int): Number of atoms in the partner PN unit within the contact_distance of the query PN unit.
’min_distance’ (float): Minimum distance between the query PN unit and the partner PN unit.
- Return type:
list
- atomworks.ml.preprocessing.utils.structure_utils.get_inter_pn_unit_bond_mask(atom_array: AtomArray) ndarray [source]#
Returns a mask indicating which bonds in atom_array.bonds are between two distinct PN units. Because we are operating at the PN unit-level, such bonds cannot be bonds between non-polymers.
WARNING: Must be applied before reassigning PN unit IIDs (e.g., as is done for covalent modifications).
- Parameters:
atom_array (AtomArray) – The full atom array. Must have PN unit-level annotations.
- Returns:
A boolean mask indicating which bonds are between two PN units.
- Return type:
numpy.ndarray
- atomworks.ml.preprocessing.utils.structure_utils.get_intra_pn_unit_bonds(pn_unit_iid: str, full_atom_array: AtomArray) ndarray [source]#
Retrieve all bonds within the PN unit. Includes inter-chain bonds if the PN unit is composed of multiple chains. Does NOT include bonds between the PN unit and any other PN units (e.g., protein-ligand bonds).
- Parameters:
pn_unit_iid (str) – The polymer/non-polymer unit instance ID (e.g., ‘A1’, ‘C2,B3’, etc.)
full_atom_array (AtomArray) – The full structure AtomArray. Must include all atoms, including those with zero occupancy.
- Returns:
Array of Biotite bond objects (atom_a_index, atom_b_index, bond_type) within the specified PN unit.
- Return type:
numpy.ndarray
- atomworks.ml.preprocessing.utils.structure_utils.get_ligand_validity_scores_from_pdb_id(pdb_id: str) list[dict[str, str | int | float | None]] [source]#
Query the RCSB PDB for ligand validity scores for a given PDB ID.
- Parameters:
pdb_id (str) – The PDB ID to query.
- Returns:
- (list[dict[str, str | int | float | None]]): A list of dictionaries, each containing
the ligand validity scores for a ligand (e.g. RSCC, RSR) as well identifiers such as the residue name, chain ID, and entity ID. Can easily be converted to a pandas DataFrame for easier handling via pd.DataFrame(records).
- Return type:
records
References: - https://www.rcsb.org/docs/general-help/ligand-structure-quality-in-pdb-structures
- atomworks.ml.preprocessing.utils.structure_utils.get_pn_units_clashing_with_pn_unit(query_pn_unit: AtomArray, filtered_atom_array: AtomArray, cell_list: CellList, clash_distance: float) set[str] [source]#
Finds clashes between a query PN unit and the rest of the structure. A clash is defined as any pair of atoms from the query PN unit and the rest of the structure that are closer than clash_distance.
- Parameters:
query_pn_unit – AtomArray containing the query PN unit that we want to check for clashes
filtered_atom_array – AtomArray containing atoms with non-zero occupancy
cell_list – CellList of structure for rapid distance computations
clash_distance – Distance threshold for clashing atoms
- Returns:
Set of polymer/non-polymer unit instance IDs that are clashing with the query PN unit
- atomworks.ml.preprocessing.utils.structure_utils.get_pn_units_with_non_biological_bonds(atom_array: AtomArray, bond_mask: ndarray) ndarray [source]#
Checks for non-biological bonds between PN units within an assembly. Note that “inter-PN-unit bonds” in this instance do not include bonds between non-polymers, which are treated as the same PN unit (by definition).
Specifically, this function looks for inter-molecular: - Oxygen-oxygen bonds - Fluorine-fluorine bonds - Bonds involving free oxygen or hydroxyl groups (e.g. HOH, OH, O)
- Parameters:
atom_array – AtomArray containing the relevant structure
bond_mask – Mask of inter-molecular bonds
- Returns:
Array of polymer/non-polymer unit instance IDs that contain non-biological bonds
- Return type:
numpy.ndarray
- atomworks.ml.preprocessing.utils.structure_utils.get_soi_ligands_from_pdb_id(pdb_id: str) set[str] [source]#
This function takes a PDB ID and returns a set of ligand names that are annotated as subject of investigation (SOI) in the PDB. Such ligands are often considered biologically meaningful.
NOTE: This function is kept from the old pipeline for convenience and testing, but not used in the current processing pipeline, where the LOI is extracted directly from the cif file instead of via an FTP query.
- Parameters:
pdb_id (str) – The PDB ID to query.
- Returns:
A set of ligand names that are annotated as SOI in the PDB.
- Return type:
Set[str]
- atomworks.ml.preprocessing.utils.structure_utils.handle_clashing_pn_units(clashing_pn_units_set: set, clashing_pn_units_dict: dict, atom_array: AtomArray, pn_unit_iid_map: dict) AtomArray [source]#
Resolves clashing PN units according to the following process: 1. Sort clashing PN units by the number of atoms within the PN unit 2. Iterate through the sorted list keeping (1) the larger PN unit (2) the lower transformation number until all clashes are resolved
- Parameters:
clashing_pn_units_set – Set of polymer/non-polymer unit instance IDs that contain clashing atoms
clashing_pn_units_dict – Dictionary mapping polymer/non-polymer unit instance IDs to a list of clashing polymer/non-polymer unit instance IDs
atom_array – AtomArray containing atoms with non-zero occupancy
pn_unit_iid_map – Dictionary mapping integer representations of polymer/non-polymer unit instance IDs to the polymer/non-polymer unit instance IDs themselves
- Returns:
AtomArray with clashing atoms removed ClashSeverity: Enum representing the severity of the clash
- Return type:
AtomArray