Preprocessing
This module contains utilities for preprocessing molecular structures and data.
Core Preprocessing Functions
Pre-process mmCIF files and return a dataframe containing a record for each PN unit in the structure.
See the atomworks README for a glossary of terms.
- class atomworks.ml.preprocessing.get_pn_unit_data_from_structure.DataPreprocessor(from_rcsb: bool = True, close_distance: float = 30.0, contact_distance: float = 5, clash_distance: float = 1.0, ignore_residues: list[str] = <factory>, polymer_pn_unit_limit: int = 1000, add_missing_atoms: bool = True, remove_waters: bool = True, remove_ccds: list = <factory>, fix_ligands_at_symmetry_centers: bool = True, build_assembly: str = 'all', fix_arginines: bool = True, convert_mse_to_met: bool = True, hydrogen_policy: Literal['remove', 'infer', 'keep'] = 'remove')
Bases: object
- add_missing_atoms: bool = True
- build_assembly: str = 'all'
- clash_distance: float = 1.0
- close_distance: float = 30.0
- contact_distance: float = 5
- convert_mse_to_met: bool = True
- fix_arginines: bool = True
- fix_ligands_at_symmetry_centers: bool = True
- from_rcsb: bool = True
- get_rows(path_to_structure: PathLike, ligand_scores: list[str] = ['RSCC', 'RSR', 'completeness', 'intermolecular_clashes', 'is_best_instance', 'ranking_model_fit', 'ranking_model_geometry']) → list[dict[str, Any]]
Processes a structure file, applies filters, and generates a list of records to be loaded at train-time.
We create a record for each PN unit (protein, nucleic acid, or non-polymer) in the structure. Each record contains information about a query PN unit and its partner (contacting) PN units in the structure.
- Parameters:
path_to_structure (PathLike) – The path to the structure file to process. Must be readable by CIFUtils.
- Returns:
A list of dictionaries. Each dictionary contains information about a query PN unit and its partner PN units.
- Return type:
list[dict[str, Any]]
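The following is a minimal sketch of how `get_rows` might be driven over several files, not the library's own pipeline. `collect_rows`, `rows_to_csv`, and `preprocess` are hypothetical helpers introduced for illustration. Only the `DataPreprocessor` import path, its keyword arguments, and the `get_rows` signature come from the documentation above. The import is deferred so that the two generic helpers run without atomworks installed.

```python
from __future__ import annotations

import csv
import io
from typing import Any, Callable, Iterable


def collect_rows(
    paths: Iterable[str],
    get_rows: Callable[[str], list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Call get_rows on each path and concatenate the returned records."""
    records: list[dict[str, Any]] = []
    for path in paths:
        records.extend(get_rows(path))
    return records


def rows_to_csv(records: list[dict[str, Any]]) -> str:
    """Serialise records to CSV text; the header is the union of all keys."""
    fieldnames: list[str] = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def preprocess(paths: Iterable[str]) -> list[dict[str, Any]]:
    """Hypothetical wrapper: one DataPreprocessor reused for every file."""
    # Deferred import: requires atomworks to be installed when called.
    from atomworks.ml.preprocessing.get_pn_unit_data_from_structure import (
        DataPreprocessor,
    )

    preprocessor = DataPreprocessor(remove_waters=True, hydrogen_policy="remove")
    return collect_rows(paths, preprocessor.get_rows)
```

Because each record is a plain dictionary, the combined list can be written to disk or handed to a dataframe constructor without further conversion. The keys present in each record are not listed in this section, so the sketch makes no assumption about them.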
- hydrogen_policy: Literal['remove', 'infer', 'keep'] = 'remove'
- ignore_residues: list[str]
- polymer_pn_unit_limit: int = 1000
- remove_ccds: list
- remove_waters: bool = True
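The `<factory>` defaults in the signature suggest that the class is a dataclass whose list fields are built per instance. The sketch below illustrates that pattern with a hypothetical stand-in, `PreprocessorConfig`, which copies a subset of the documented fields and defaults. It is not the atomworks class, and the empty-list factory is an assumption, since the real factory's contents are not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace
from typing import Literal


@dataclass
class PreprocessorConfig:
    """Hypothetical stand-in mirroring some documented fields; NOT atomworks."""

    from_rcsb: bool = True
    close_distance: float = 30.0
    contact_distance: float = 5
    clash_distance: float = 1.0
    # Assumption: an empty list. The real factory default is not documented here.
    ignore_residues: list[str] = field(default_factory=list)
    polymer_pn_unit_limit: int = 1000
    remove_waters: bool = True
    hydrogen_policy: Literal["remove", "infer", "keep"] = "remove"


default = PreprocessorConfig()

# dataclasses.replace copies the instance and overrides the selected fields.
keep_hydrogens = replace(default, hydrogen_policy="keep", remove_waters=False)

# A factory default gives every instance its own list, so mutating one
# instance's list does not leak into another.
other = PreprocessorConfig()
other.ignore_residues.append("HOH")
```

With a factory default, `default.ignore_residues` stays empty after `other.ignore_residues` is modified, which is the reason mutable defaults are declared this way.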
Constants
- class atomworks.ml.preprocessing.constants.ClashSeverity(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: Enum
Enum representing the severity of clashes in a PDB file.
- MILD = 'mild'
- MODERATE = 'moderate'
- NO_CLASH = 'no-clash'
- SEVERE = 'severe'
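As a brief illustration of working with these members, the sketch below redeclares the enum locally with the documented values so that it runs without atomworks installed. In real code the class would be imported from `atomworks.ml.preprocessing.constants`. `parse_severity` and `has_clash` are hypothetical helpers, not part of the library.

```python
from enum import Enum


class ClashSeverity(Enum):
    """Local mirror of the documented members, for illustration only."""

    NO_CLASH = "no-clash"
    MILD = "mild"
    MODERATE = "moderate"
    SEVERE = "severe"


def parse_severity(label: str) -> ClashSeverity:
    """Look up a member by its string value; raises ValueError if unknown."""
    return ClashSeverity(label.strip().lower())


def has_clash(severity: ClashSeverity) -> bool:
    """True for every member except NO_CLASH."""
    return severity is not ClashSeverity.NO_CLASH
```

Lookup by value (`ClashSeverity("mild")`) is standard `Enum` behaviour and is convenient when severities are stored as strings in a table of records.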