Preprocessing#

This module contains utilities for preprocessing molecular structures and data.

Core Preprocessing Functions#

Preprocess mmCIF files and return a dataframe with one record for each PN unit in the structure.

See the atomworks README for a glossary of terms.

class atomworks.ml.preprocessing.get_pn_unit_data_from_structure.DataPreprocessor(from_rcsb: 'bool' = True, close_distance: 'float' = 30.0, contact_distance: 'float' = 5, clash_distance: 'float' = 1.0, ignore_residues: 'list[str]' = <factory>, polymer_pn_unit_limit: 'int' = 1000, add_missing_atoms: 'bool' = True, remove_waters: 'bool' = True, remove_ccds: 'list' = <factory>, fix_ligands_at_symmetry_centers: 'bool' = True, build_assembly: 'str' = 'all', fix_arginines: 'bool' = True, convert_mse_to_met: 'bool' = True, hydrogen_policy: "Literal['remove', 'infer', 'keep']" = 'remove')[source]#

Bases: object

add_missing_atoms: bool = True#
build_assembly: str = 'all'#
clash_distance: float = 1.0#
close_distance: float = 30.0#
contact_distance: float = 5#
convert_mse_to_met: bool = True#
fix_arginines: bool = True#
fix_ligands_at_symmetry_centers: bool = True#
from_rcsb: bool = True#
get_rows(path_to_structure: PathLike, ligand_scores: list[str] = ['RSCC', 'RSR', 'completeness', 'intermolecular_clashes', 'is_best_instance', 'ranking_model_fit', 'ranking_model_geometry']) → list[dict[str, Any]][source]#

Processes a structure file, applies filters, and generates a list of records to be loaded at train-time.

We create a record for each PN unit (protein, nucleic acid, or non-polymer) in the structure. Each record contains information about a query PN unit and its partner (contacting) PN units in the structure.

Parameters:

path_to_structure (PathLike) – The path to the structure file to process. Must be readable by CIFUtils.

Returns:

A list of dictionaries. Each dictionary contains information about a query PN unit and its partner PN units.

Return type:

list[dict[str, Any]]

hydrogen_policy: Literal['remove', 'infer', 'keep'] = 'remove'#
ignore_residues: list[str]#
polymer_pn_unit_limit: int = 1000#
remove_ccds: list#
remove_waters: bool = True#
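
As a rough illustration of how get_rows might be used across several structure files, the sketch below gathers the per-structure records into one flat list. It relies only on the documented behaviour that get_rows takes a path and returns a list of dictionaries. The names RowProducer and collect_rows are hypothetical helpers introduced here for illustration and are not part of atomworks.

```python
from __future__ import annotations

from collections.abc import Iterable
from os import PathLike
from typing import Any, Protocol


class RowProducer(Protocol):
    """Hypothetical protocol: any object exposing get_rows as documented above."""

    def get_rows(self, path_to_structure: str | PathLike) -> list[dict[str, Any]]: ...


def collect_rows(
    preprocessor: RowProducer, paths: Iterable[str | PathLike]
) -> list[dict[str, Any]]:
    """Concatenate the records returned for each structure file into one list."""
    rows: list[dict[str, Any]] = []
    for path in paths:
        # Each call yields one record per PN unit in that structure.
        rows.extend(preprocessor.get_rows(path))
    return rows


# Assumed usage with atomworks installed (left commented so the sketch runs without it):
# from atomworks.ml.preprocessing.get_pn_unit_data_from_structure import DataPreprocessor
# preprocessor = DataPreprocessor(remove_waters=True, hydrogen_policy="remove")
# rows = collect_rows(preprocessor, ["path/to/structure.cif"])
```

Because the result is a plain list of dictionaries, it can be handed directly to a dataframe constructor if a tabular view is wanted.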

Constants#

class atomworks.ml.preprocessing.constants.ClashSeverity(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

Enum representing the severity of clashes in a PDB file.

MILD = 'mild'#
MODERATE = 'moderate'#
NO_CLASH = 'no-clash'#
SEVERE = 'severe'#
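
The members above are ordinary Enum values, so a stored string label can be mapped back to a member by value. The sketch below uses a local stand-in that mirrors the documented members so it runs without atomworks; parse_severity and is_clashing are hypothetical helpers, not library functions.

```python
from enum import Enum


class ClashSeverity(Enum):
    """Local stand-in mirroring the documented members (not imported from atomworks)."""

    MILD = "mild"
    MODERATE = "moderate"
    NO_CLASH = "no-clash"
    SEVERE = "severe"


def parse_severity(label: str) -> ClashSeverity:
    """Look up a member by its string value; raises ValueError for unknown labels."""
    return ClashSeverity(label)


def is_clashing(severity: ClashSeverity) -> bool:
    """Treat every member except NO_CLASH as indicating a clash."""
    return severity is not ClashSeverity.NO_CLASH
```

Lookup by value keeps serialized labels such as 'no-clash' round-trippable, which attribute access cannot do because the hyphen is not valid in a Python identifier.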

Utilities#