Dataset Parsers

This module contains parsers for different types of dataset metadata and structures.

Base Parser

atomworks.ml.datasets.parsers.base.DEFAULT_CIF_PARSER_ARGS = {'add_bond_types_from_struct_conn': ['covale'], 'add_id_and_entity_annotations': True, 'add_missing_atoms': True, 'convert_mse_to_met': True, 'fix_arginines': True, 'fix_ligands_at_symmetry_centers': True, 'hydrogen_policy': 'remove', 'model': None, 'remove_ccds': ['SO4', 'GOL', 'EDO', 'PO4', 'ACT', 'PEG', 'DMS', 'TRS', 'PGE', 'PG4', 'FMT', 'EPE', 'MPD', 'MES', 'CD', 'IOD'], 'remove_waters': True}

Default CIF parser arguments for atomworks.io.parse, exposed as a dictionary so that the defaults can be imported conveniently.
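A typical use of such a dictionary is to start from the defaults and override a few keys for one dataset. The sketch below copies the documented defaults into a local dictionary so that it runs without atomworks installed; in real code the dictionary would be imported from atomworks.ml.datasets.parsers.base instead. The helper name with_overrides is hypothetical and not part of the library.

```python
import copy

# Local copy of the documented defaults so that the sketch runs standalone.
# In real code, import DEFAULT_CIF_PARSER_ARGS from
# atomworks.ml.datasets.parsers.base instead of redefining it.
DEFAULT_CIF_PARSER_ARGS = {
    "add_bond_types_from_struct_conn": ["covale"],
    "add_id_and_entity_annotations": True,
    "add_missing_atoms": True,
    "convert_mse_to_met": True,
    "fix_arginines": True,
    "fix_ligands_at_symmetry_centers": True,
    "hydrogen_policy": "remove",
    "model": None,
    "remove_ccds": ["SO4", "GOL", "EDO", "PO4", "ACT", "PEG", "DMS", "TRS",
                    "PGE", "PG4", "FMT", "EPE", "MPD", "MES", "CD", "IOD"],
    "remove_waters": True,
}


def with_overrides(defaults: dict, **overrides) -> dict:
    """Return a deep copy of `defaults` with `overrides` applied (hypothetical helper)."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Reject misspelled keys early instead of silently passing them on.
        raise KeyError(f"Unknown CIF parser argument(s): {sorted(unknown)}")
    merged = copy.deepcopy(defaults)  # deep copy so the list values are not shared
    merged.update(overrides)
    return merged


# Keep waters and all CCD components for this (hypothetical) dataset.
custom_args = with_overrides(DEFAULT_CIF_PARSER_ARGS, remove_waters=False, remove_ccds=[])
```

Copying before updating leaves the shared defaults untouched, which matters when several datasets in one training run use different parser settings.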

class atomworks.ml.datasets.parsers.base.MetadataRowParser

Bases: ABC

Abstract base class for MetadataRowParsers.

A MetadataRowParser is a class that parses a row from a DataFrame on disk into a format digestible by the load_example_from_metadata_row function.

In the common case that a model is trained on multiple datasets, each with its own DataFrame and base data format, we must ensure that the data pipeline receives a consistent input format. For example, when training an AF-3-style model, we might have a “PDB Chains” dataset of mmCIF files, a “PDB Interfaces” dataset of mmCIF files, a distillation dataset of computationally generated PDB files, and many others.

We enforce the following common schema for all datasets:
  • “example_id”: A unique identifier for the example within the dataset.

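  • “path”: The path to the data file (which we will load with CIFUtils).

  • “extra_info”: A dictionary of additional per-example information (listed as a dict in required_schema).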

WARNING: For many transforms, additional keys are required. For example:
  • For cropping, the query_pn_unit_iids field is used to center the crop on the interface or pn_unit. If not provided, the AF-3-style crop transforms will crop randomly.

  • For loading templates, the “pdb_id” is required to load the correct template from disk (at least with the legacy code).

parse(row: Series) → dict[str, Any]

Wrapper to parse and validate a DataFrame row.

required_schema: ClassVar[dict[str, type]] = {'example_id': <class 'str'>, 'extra_info': <class 'dict'>, 'path': <class 'pathlib.Path'>}

validate_output(output: dict[str, Any]) → None

Validate the output dictionary for required keys and their types.
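The interplay between parse, required_schema and validate_output can be illustrated with a standalone stand-in. This is a minimal sketch and not the atomworks implementation: the class MetadataRowParserSketch, the abstract hook name _parse, the subclass SimpleParser, and the relative_path column are all assumptions made for illustration, and a plain dict stands in for the pandas Series.

```python
from abc import ABC, abstractmethod
from collections.abc import Mapping
from pathlib import Path
from typing import Any, ClassVar


class MetadataRowParserSketch(ABC):
    """Stand-in mirroring the documented interface; not the atomworks class."""

    required_schema: ClassVar[dict[str, type]] = {
        "example_id": str,
        "extra_info": dict,
        "path": Path,
    }

    def parse(self, row: Mapping[str, Any]) -> dict[str, Any]:
        """Parse a row, then validate the result against required_schema."""
        output = self._parse(row)
        self.validate_output(output)
        return output

    @abstractmethod
    def _parse(self, row: Mapping[str, Any]) -> dict[str, Any]:
        """Dataset-specific parsing (hook name is an assumption)."""

    def validate_output(self, output: dict[str, Any]) -> None:
        """Check that every required key is present and has the expected type."""
        for key, expected_type in self.required_schema.items():
            if key not in output:
                raise ValueError(f"Missing required key: {key!r}")
            if not isinstance(output[key], expected_type):
                raise TypeError(
                    f"Key {key!r} should be {expected_type.__name__}, "
                    f"got {type(output[key]).__name__}"
                )


class SimpleParser(MetadataRowParserSketch):
    """Hypothetical parser for rows with example_id and relative_path columns."""

    def __init__(self, base_dir: str):
        self.base_dir = Path(base_dir)

    def _parse(self, row: Mapping[str, Any]) -> dict[str, Any]:
        return {
            "example_id": str(row["example_id"]),
            "path": self.base_dir / row["relative_path"],
            "extra_info": {},
        }


# A plain dict stands in for a pandas Series; both support row["column"].
parsed = SimpleParser("/data/structures").parse(
    {"example_id": "ex_001", "relative_path": "ex_001.cif"}
)
```

Keeping validation in the wrapper means that each dataset-specific parser only has to implement the row-to-dictionary mapping, while schema violations surface at parse time rather than deep inside the transform pipeline.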

atomworks.ml.datasets.parsers.base.load_example_from_metadata_row(metadata_row: Series, metadata_row_parser: MetadataRowParser, *, cif_parser_args: dict | None = None) → dict

Load a training/validation example from a DataFrame row into a common format, using the given metadata row parser and CIF parser arguments.

Performs the following steps:
  1. Parse the metadata row into a common dictionary format using the provided metadata row parser.

  2. Load the CIF file from the information in the common dictionary format (i.e., the “path” key).

  3. Combine the parsed row data and the loaded CIF data into a single dictionary.

Parameters:
  • metadata_row (pd.Series) – The DataFrame row to parse.

  • metadata_row_parser (MetadataRowParser) – The parser to use for converting the row into a dictionary format.

  • cif_parser_args (dict, optional) – Additional arguments for the CIF parser. Defaults to None.

Returns:

A dictionary containing the parsed row data and additional loaded CIF data.

Return type:

dict
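The three steps listed above (parse, load, combine) can be sketched with stand-in callables. This is an illustration under stated assumptions, not the library's code: load_example_sketch, stub_parse_row and stub_load_cif are hypothetical names, the stub loader does not read any file, and the choice that parsed-row values win on a key clash is an assumption.

```python
from pathlib import Path
from typing import Callable, Optional


def load_example_sketch(
    metadata_row: dict,
    parse_row: Callable[[dict], dict],
    load_cif: Callable[..., dict],
    cif_parser_args: Optional[dict] = None,
) -> dict:
    """Illustrative parse -> load -> combine flow (not the atomworks function)."""
    # Step 1: parse the row into the common dictionary format.
    parsed_row = parse_row(metadata_row)

    # Step 2: load the structure file referenced by the "path" key.
    cif_data = load_cif(parsed_row["path"], **(cif_parser_args or {}))

    # Step 3: combine both dictionaries into one.
    # Assumption: on a key clash, values from the parsed row take precedence.
    return {**cif_data, **parsed_row}


def stub_parse_row(row: dict) -> dict:
    """Hypothetical stand-in for MetadataRowParser.parse."""
    return {
        "example_id": row["example_id"],
        "path": Path(row["path"]),
        "extra_info": {},
    }


def stub_load_cif(path: Path, **parser_args) -> dict:
    """Hypothetical stand-in for the CIF parser; reads nothing from disk."""
    return {"structure": f"<parsed {path.name}>", "parser_args": parser_args}


example = load_example_sketch(
    {"example_id": "ex_001", "path": "/data/ex_001.cif"},
    stub_parse_row,
    stub_load_cif,
    cif_parser_args={"remove_waters": True},
)
```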

Custom Metadata Parsers

Row parsers for non-standard metadata DataFrames.

class atomworks.ml.datasets.parsers.custom_metadata_row_parsers.AF2FB_DistillationParser(base_dir: str, file_extension: str = '.cif')

Bases: MetadataRowParser

DEPRECATION WARNING: This parser is deprecated and will be removed in a future release. Use GenericDFParser instead, providing path and example_id columns.

Parser for AF2FB distillation metadata.

The AF2FB distillation dataset is provided courtesy of Meta/Facebook. It contains ~7.6 million AF2-predicted structures from UniRef50.

Metadata (for example, the sequences, the cluster identities at 30% sequence identity, whether a sequence has an MSA and a template, and the sequence hash) are stored in the af2_distillation_facebook.parquet DataFrame.

The parquet has the following columns:
  • example_id

  • n_atoms

  • n_res

  • mean_plddt

  • min_plddt

  • median_plddt

  • sequence_hash

  • has_msa

  • msa_depth

  • has_template

  • cluster_id

  • seq (WARNING: this is a relatively data-heavy column)
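Because seq is data-heavy, it can be worth reading only the lighter columns when the sequences themselves are not needed. The sketch below relies on the columns parameter of pandas.read_parquet. The names AF2FB_COLUMNS, LIGHT_COLUMNS and read_af2fb_metadata are hypothetical, and pandas is imported lazily so that the column logic can be checked without pandas or a parquet engine installed.

```python
from typing import Optional

# Columns of the parquet file as documented above.
AF2FB_COLUMNS = [
    "example_id", "n_atoms", "n_res", "mean_plddt", "min_plddt",
    "median_plddt", "sequence_hash", "has_msa", "msa_depth",
    "has_template", "cluster_id", "seq",
]

# Everything except the data-heavy sequence column.
LIGHT_COLUMNS = [column for column in AF2FB_COLUMNS if column != "seq"]


def read_af2fb_metadata(parquet_path: str, columns: Optional[list] = None):
    """Read the metadata parquet, skipping `seq` unless columns are given explicitly."""
    import pandas as pd  # lazy import: needs pandas plus a parquet engine such as pyarrow

    selected = LIGHT_COLUMNS if columns is None else columns
    return pd.read_parquet(parquet_path, columns=selected)
```

Calling read_af2fb_metadata with the path to af2_distillation_facebook.parquet would then return a DataFrame with the eleven lighter columns.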

class atomworks.ml.datasets.parsers.custom_metadata_row_parsers.ValidationDFParserLikeAF3(base_dir: Path = None, file_extension: str = '.cif.gz')

Bases: MetadataRowParser

Parser for AF-3-style validation DataFrame rows.

As output, we give:
  • pdb_id: The PDB ID of the structure.

  • assembly_id: The assembly ID of the structure, required to load the correct assembly from the CIF file.

  • path: The path to the CIF file.

  • example_id: An identifier that combines the pdb_id and assembly_id.

  • ground_truth: A dictionary containing non-feature information for loss and validation. For validation, we initialize with the following:
    • interfaces_to_score: A list of tuples like (pn_unit_iid_1, pn_unit_iid_2, interface_type), which represent low-homology interfaces to score.

    • pn_units_to_score: A list of tuples like (pn_unit_iid, pn_unit_type), which represent low-homology pn_units to score.
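To make the shape of this output concrete, the following sketch builds one such dictionary by hand and checks the tuple arities described above. Every value is an illustrative placeholder: the PDB ID, the file path, the example_id format, the pn_unit_iid strings and the type labels are assumptions, not values produced by the parser, and summarize_ground_truth is a hypothetical helper.

```python
from pathlib import Path

# Hand-written placeholder output; none of these values come from the parser.
example_output = {
    "pdb_id": "1abc",
    "assembly_id": "1",
    "path": Path("/data/1abc.cif.gz"),  # hypothetical location
    "example_id": "1abc_1",             # hypothetical pdb_id + assembly_id format
    "ground_truth": {
        # (pn_unit_iid_1, pn_unit_iid_2, interface_type)
        "interfaces_to_score": [("A_1", "B_1", "protein-protein")],
        # (pn_unit_iid, pn_unit_type)
        "pn_units_to_score": [("A_1", "protein")],
    },
}


def summarize_ground_truth(ground_truth: dict) -> dict:
    """Check the documented tuple shapes and count the entries to score."""
    interfaces = ground_truth["interfaces_to_score"]
    pn_units = ground_truth["pn_units_to_score"]
    if any(len(item) != 3 for item in interfaces):
        raise ValueError("Each interface entry must be a 3-tuple.")
    if any(len(item) != 2 for item in pn_units):
        raise ValueError("Each pn_unit entry must be a 2-tuple.")
    return {"n_interfaces": len(interfaces), "n_pn_units": len(pn_units)}


summary = summarize_ground_truth(example_output["ground_truth"])
```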

Default Metadata Parsers