Category Transforms

Transforms operating on Biotite’s CIFBlock and CIFCategory objects.

These transforms are used to extract information from the CIFBlock and return a dictionary containing processed information.

atomworks.io.transforms.categories.category_to_df(cif_block: CIFBlock, category: str) → DataFrame | None

Convert a category of a CIF block to a pandas DataFrame.

atomworks.io.transforms.categories.category_to_dict(cif_block: CIFBlock, category: str) → dict[str, ndarray]

Convert a category of a CIF block to a dictionary that maps each field name to an array of values.
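To make the two output shapes concrete, here is a minimal sketch that uses a plain dictionary of lists as a stand-in for a CIF category. It does not call atomworks or Biotite; the helper name `columns_to_rows` and the sample values are hypothetical. The column-oriented mapping resembles the dictionary form (with lists in place of arrays), and the row records resemble the rows of the DataFrame form.

```python
def columns_to_rows(columns):
    """Pivot a column-oriented mapping into a list of row dictionaries."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns of a category must have the same length")
    names = list(columns)
    # zip(*columns) walks the columns in lockstep, yielding one tuple per row.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Hypothetical stand-in for the 'entity' category of a CIF block.
entity_columns = {
    "id": ["1", "2"],
    "type": ["polymer", "non-polymer"],
}
rows = columns_to_rows(entity_columns)
```

Each entry of `rows` holds one record of the category, keyed by field name.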

atomworks.io.transforms.categories.extract_crystallization_details(crystal_dict: dict) → dict[str, list[float] | None]

Extracts crystallization details from the crystallization dictionary.

Parameters:

crystal_dict – Dictionary for the exptl_crystal_grow CIF category.

Returns:

A dictionary with crystallization details. Currently includes:

  • “pH”: A list of two floats [min_pH, max_pH], or None if unavailable.

Return type:

dict[str, list[float] | None]
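The following is a minimal sketch of how a [min_pH, max_pH] pair could be derived from an exptl_crystal_grow dictionary. It is not the library's implementation. The function name `ph_range_sketch` is hypothetical, and the sketch assumes the pH values live under the mmCIF items `pH` and `pdbx_pH_range`, stored as strings, with `?` and `.` marking missing values.

```python
import re

# CIF placeholders for unknown ("?") and inapplicable (".") values.
MISSING = {"?", ".", ""}


def ph_range_sketch(crystal_dict):
    """Return {"pH": [min_pH, max_pH]} or {"pH": None} (hypothetical helper)."""
    values = []
    for key in ("pH", "pdbx_pH_range"):
        for raw in crystal_dict.get(key, []):
            text = str(raw).strip()
            if text in MISSING:
                continue
            # Pull every decimal number out of entries such as "7.5" or "6.5-8.0".
            values.extend(float(m) for m in re.findall(r"\d+(?:\.\d+)?", text))
    if not values:
        return {"pH": None}
    return {"pH": [min(values), max(values)]}


details = ph_range_sketch({"pH": ["7.5"], "pdbx_pH_range": ["6.5-8.0"]})
```

A single reported pH yields a degenerate range in which both entries are equal.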

atomworks.io.transforms.categories.get_ligand_of_interest_info(cif_block: CIFBlock) → dict

Extract ligand of interest information from a CIF block.

Reference:

atomworks.io.transforms.categories.get_metadata_from_category(cif_block: CIFBlock, fallback_id: str | None = None) → dict

Extract metadata from the CIF block. If the entry.id field is not present in the CIF block, the fallback_id is used instead (e.g., the filename of the CIF).

From RCSB CIF files, this function extracts:
  • ID (e.g., PDB ID)

  • Method (e.g., X-ray, NMR, etc.)

  • Deposition date (initial)

  • Release date (smallest revision date)

  • Resolution (e.g., 5.0, 3.0, etc.)

For custom CIF files (e.g., distillation), this function extracts:
  • Extra metadata (all other categories)

Parameters:
  • cif_block (CIFBlock) – The CIF block to extract metadata from.

  • fallback_id (str | None) – A fallback ID to use if the entry.id field is not present in the CIF block.
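The ID fallback and the "smallest revision date" rule can be sketched with plain dictionaries standing in for the CIF block. This is an illustration under assumptions, not the library's code: the function name `id_and_release_date_sketch` is hypothetical, and the sketch assumes the revision dates are ISO-formatted strings in the `pdbx_audit_revision_history` category.

```python
from datetime import date


def id_and_release_date_sketch(categories, fallback_id=None):
    """Pick the entry ID (or the fallback) and the earliest revision date."""
    entry_ids = categories.get("entry", {}).get("id", [])
    # Use entry.id when present, otherwise fall back (e.g., to the file name).
    entry_id = entry_ids[0] if len(entry_ids) > 0 else fallback_id

    history = categories.get("pdbx_audit_revision_history", {})
    revisions = [date.fromisoformat(d) for d in history.get("revision_date", [])]
    # The release date is taken as the smallest (earliest) revision date.
    release_date = min(revisions).isoformat() if revisions else None
    return {"id": entry_id, "release_date": release_date}


rcsb_like = {
    "entry": {"id": ["1ABC"]},
    "pdbx_audit_revision_history": {"revision_date": ["2011-07-13", "2009-03-10"]},
}
meta = id_and_release_date_sketch(rcsb_like, fallback_id="my_file")
custom = id_and_release_date_sketch({}, fallback_id="my_file")
```

In the second call neither category is present, so the fallback ID is used and no release date is reported.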

atomworks.io.transforms.categories.initialize_chain_info_from_category(cif_block: CIFBlock, atom_array: AtomArray) → dict

Extracts chain entity-level information from the CIF block.

Requires the categories ‘entity’ and ‘entity_poly’ to be present in the CIF block.

In particular, this function adds the following information to the chain_info_dict:
  • The RCSB entity ID for each chain (e.g., 1, 2, 3, etc.)

  • The chain type as an IntEnum (e.g., polypeptide(L), non-polymer, etc.)

  • The unprocessed one-letter entity canonical and non-canonical sequences.

  • A boolean flag indicating whether the chain is a polymer.

  • The EC numbers for the chain.

Note that three-letter sequence information is added to the chain_info_dict in a later step.

Parameters:
  • cif_block (CIFBlock) – Parsed CIF block.

  • atom_array (AtomArray) – Atom array containing the chain information.

Returns:

Dictionary containing the sequence details of each chain.

Return type:

dict
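As a rough illustration of the per-chain record described above, the sketch below joins ‘entity’ and ‘entity_poly’ data onto chains. Everything in it is an assumption made for illustration: the enum `ChainTypeSketch` and its values, the output key names, and the `chain_to_entity` mapping, which stands in for the chain-to-entity relationship that the real function reads from the AtomArray. EC numbers are omitted.

```python
from enum import IntEnum


class ChainTypeSketch(IntEnum):
    """Hypothetical enum; the library's IntEnum may differ in members and values."""
    NON_POLYMER = 0
    POLYPEPTIDE_L = 1
    POLYDEOXYRIBONUCLEOTIDE = 2
    POLYRIBONUCLEOTIDE = 3


# Type strings as they appear in entity.type / entity_poly.type.
# Types missing from this small table (e.g., "water") raise KeyError below.
TYPE_TO_ENUM = {
    "non-polymer": ChainTypeSketch.NON_POLYMER,
    "polypeptide(L)": ChainTypeSketch.POLYPEPTIDE_L,
    "polydeoxyribonucleotide": ChainTypeSketch.POLYDEOXYRIBONUCLEOTIDE,
    "polyribonucleotide": ChainTypeSketch.POLYRIBONUCLEOTIDE,
}


def chain_info_sketch(entity, entity_poly, chain_to_entity):
    """Build a per-chain dictionary from column-oriented category data."""
    entity_type = dict(zip(entity["id"], entity["type"]))
    poly = {
        eid: (ptype, seq, seq_can)
        for eid, ptype, seq, seq_can in zip(
            entity_poly["entity_id"],
            entity_poly["type"],
            entity_poly["pdbx_seq_one_letter_code"],
            entity_poly["pdbx_seq_one_letter_code_can"],
        )
    }
    chain_info = {}
    for chain_id, eid in chain_to_entity.items():
        # A chain is a polymer if its entity has a row in entity_poly.
        is_polymer = eid in poly
        type_name = poly[eid][0] if is_polymer else entity_type[eid]
        info = {
            "rcsb_entity": eid,
            "chain_type": TYPE_TO_ENUM[type_name],
            "is_polymer": is_polymer,
        }
        if is_polymer:
            info["unprocessed_entity_non_canonical_sequence"] = poly[eid][1]
            info["unprocessed_entity_canonical_sequence"] = poly[eid][2]
        chain_info[chain_id] = info
    return chain_info


entity = {"id": ["1", "2"], "type": ["polymer", "non-polymer"]}
entity_poly = {
    "entity_id": ["1"],
    "type": ["polypeptide(L)"],
    "pdbx_seq_one_letter_code": ["M(MSE)K"],
    "pdbx_seq_one_letter_code_can": ["MMK"],
}
info = chain_info_sketch(entity, entity_poly, {"A": "1", "B": "2"})
```

Because the chain type is an IntEnum, it can be compared with plain integers or stored in integer arrays while keeping a readable name.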

atomworks.io.transforms.categories.load_monomer_sequence_information_from_category(cif_block: CIFBlock, chain_info_dict: dict, atom_array: AtomArray, ccd_mirror_path: PathLike = None) → dict

Load monomer sequence information into a chain_info_dict.

Uses:
  1. The CIFCategory ‘entity_poly_seq’ as the sequence ground-truth for polymers.

  2. The AtomArray as the ground-truth for non-polymers.

We must rely on the CIFCategory ‘entity_poly_seq’ for polymers, as the AtomArray may not contain the full sequence information (e.g., unresolved residues). For non-polymers, there is no standard equivalent to ‘entity_poly_seq’, so we must use the AtomArray to get the sequence information.

When loading both polymer and non-polymer sequences, we also filter out unknown or otherwise ignored residues.

Parameters:
  • cif_block (CIFBlock) – The CIF block containing the monomer sequence information.

  • chain_info_dict (dict) – The dictionary where the monomer sequence information will be stored.

  • atom_array (AtomArray) – The atom array used to get the sequence for non-polymers.

Returns:

The updated chain_info_dict with monomer sequence information. Adds the following keys:

  • ‘res_name’: The CCD residue names for each chain.

  • ‘res_id’: The residue IDs for each chain (does not perform re-indexing).

  • ‘processed_entity_non_canonical_sequence’: The processed non-canonical sequence for each chain.

  • ‘processed_entity_canonical_sequence’: The processed canonical sequence for each chain.

  • ‘has_sequence_heterogeneity’: A boolean flag indicating whether the chain has sequence heterogeneity.

Return type:

dict
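The two sequence sources can be sketched as follows, with plain Python structures in place of the CIFBlock and AtomArray. This is an illustration, not the library's implementation. The function names, the ignore list `IGNORED_SKETCH`, and the choice to keep the first residue at a heterogeneous position are all assumptions. Polymers are read from an ‘entity_poly_seq’-like mapping, so unresolved residues are still included; non-polymers are read from per-atom records.

```python
# Hypothetical ignore list; the residues the library actually filters may differ.
IGNORED_SKETCH = frozenset({"UNL"})


def polymer_sequence_sketch(entity_poly_seq, entity_id, ignored=IGNORED_SKETCH):
    """Read a polymer sequence from column-oriented entity_poly_seq data."""
    res_name, res_id, seen = [], [], set()
    heterogeneous = False
    for eid, num, mon_id in zip(
        entity_poly_seq["entity_id"],
        entity_poly_seq["num"],
        entity_poly_seq["mon_id"],
    ):
        if eid != entity_id:
            continue
        position = int(num)
        if position in seen:
            # Same position listed twice: flag heterogeneity, keep the first residue.
            heterogeneous = True
            continue
        seen.add(position)
        if mon_id in ignored:
            continue
        res_name.append(mon_id)
        res_id.append(position)
    return {
        "res_name": res_name,
        "res_id": res_id,
        "has_sequence_heterogeneity": heterogeneous,
    }


def non_polymer_sequence_sketch(atoms, chain_id, ignored=IGNORED_SKETCH):
    """Read a non-polymer sequence from (chain_id, res_id, res_name) atom records."""
    res_name, res_id, last = [], [], None
    for atom_chain, atom_res_id, atom_res_name in atoms:
        if atom_chain != chain_id:
            continue
        key = (atom_res_id, atom_res_name)
        if key == last:
            continue  # another atom of the residue that was just recorded
        last = key
        if atom_res_name in ignored:
            continue
        res_name.append(atom_res_name)
        res_id.append(atom_res_id)
    return {"res_name": res_name, "res_id": res_id, "has_sequence_heterogeneity": False}


entity_poly_seq = {
    "entity_id": ["1", "1", "1", "1"],
    "num": ["1", "2", "2", "3"],
    "mon_id": ["MET", "ALA", "SER", "LYS"],
}
atoms = [("A", 1, "MET"), ("B", 101, "HEM"), ("B", 101, "HEM"), ("B", 102, "UNL")]
polymer = polymer_sequence_sketch(entity_poly_seq, "1")
ligand = non_polymer_sequence_sketch(atoms, "B")
```

Residue IDs are passed through unchanged in both paths, which matches the note that no re-indexing is performed.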