Category Transforms

Transforms operating on Biotite’s CIFBlock and CIFCategory objects.

These transforms extract information from a CIFBlock and return it in processed form, in most cases as a dictionary.

atomworks.io.transforms.categories.category_to_df(cif_block: CIFBlock, category: str) → DataFrame | None

Convert the named category of a CIF block to a pandas DataFrame.

atomworks.io.transforms.categories.category_to_dict(cif_block: CIFBlock, category: str) → dict[str, ndarray]

Convert the named category of a CIF block to a dictionary that maps each field name to an array of values.
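To give a feel for the column-oriented shape implied by the dict[str, ndarray] return type, the sketch below uses a plain Python dict as a stand-in for such a result and turns it into row records, which is roughly the view a DataFrame of the same category presents. The sample data and the columns_to_rows helper are hypothetical and are not part of atomworks; the real functions take a Biotite CIFBlock.

```python
# Hypothetical stand-in for the output of category_to_dict(cif_block, "entity"):
# one entry per CIF field, each holding one value per row of the category.
entity_columns = {
    "id": ["1", "2"],
    "type": ["polymer", "non-polymer"],
}


def columns_to_rows(columns: dict[str, list[str]]) -> list[dict[str, str]]:
    """Turn a column-oriented category into a list of row dictionaries."""
    if not columns:
        return []
    n_rows = len(next(iter(columns.values())))
    # Every field of a CIF category holds the same number of rows.
    if any(len(values) != n_rows for values in columns.values()):
        raise ValueError("All columns of a category must have the same length")
    return [
        {name: values[i] for name, values in columns.items()}
        for i in range(n_rows)
    ]


rows = columns_to_rows(entity_columns)
```

Each element of rows then corresponds to one row of the category, keyed by field name.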

atomworks.io.transforms.categories.extract_crystallization_details(crystal_dict: dict) → dict[str, list[float] | None]

Extracts crystallization details from the crystallization dictionary.

Parameters:

crystal_dict – Dictionary for the exptl_crystal_grow CIF category.

Returns:

A dictionary with crystallization details. Currently includes:

“pH”: A list of two floats [min_pH, max_pH], or None if unavailable.

Return type:

dict[str, list[float] | None]
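As an illustration of how a [min_pH, max_pH] pair could be derived, here is a minimal sketch. It is not the atomworks implementation. It assumes that crystal_dict maps field names to sequences of strings, and that the relevant mmCIF items are exptl_crystal_grow.pH and exptl_crystal_grow.pdbx_pH_range; the function name sketch_ph_range is hypothetical.

```python
import re
from typing import Optional

# Matches non-negative decimal numbers such as "7", "7.5" or ".5".
_NUMBER = re.compile(r"\d+(?:\.\d+)?|\.\d+")


def sketch_ph_range(crystal_dict: dict) -> dict[str, Optional[list[float]]]:
    """Collect every pH value found and report the smallest and largest."""
    values: list[float] = []
    # Assumption: these two fields are the ones worth inspecting.
    for field in ("pH", "pdbx_pH_range"):
        for raw in crystal_dict.get(field, []):
            # "?" and "." are the CIF placeholders for unknown / inapplicable.
            if raw in ("?", "."):
                continue
            values.extend(float(token) for token in _NUMBER.findall(str(raw)))
    if not values:
        return {"pH": None}
    return {"pH": [min(values), max(values)]}
```

With this sketch, a single reported pH yields a pair with identical bounds, while a range string such as "6.5-7.5" yields its two endpoints.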

atomworks.io.transforms.categories.get_ligand_of_interest_info(cif_block: CIFBlock) → dict

Extract ligand of interest information from a CIF block.

Reference:

https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/small-molecule-ligands

atomworks.io.transforms.categories.get_metadata_from_category(cif_block: CIFBlock, fallback_id: str | None = None) → dict

Extract metadata from the CIF block. If the entry.id field is not present in the CIF block, the fallback_id is used instead (e.g., the filename of the CIF).

From RCSB CIF files, this function extracts:

ID (e.g., PDB ID)
Method (e.g., X-ray, NMR, etc.)
Deposition date (initial)
Release date (earliest revision date)
Resolution (e.g., 5.0, 3.0, etc.)

For custom CIF files (e.g., distillation), this function extracts:

Extra metadata (all other categories)

Parameters:

cif_block (CIFBlock) – The CIF block to extract metadata from.
fallback_id (str | None) – A fallback ID to use if the entry.id field is not present in the CIF block. Defaults to None.
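The ID fallback and the "earliest revision date" rule can be sketched as follows. This is a simplified illustration rather than the library's code: the CIF block is represented as a nested dict of category name to field name to list of strings, the revision dates are assumed to come from pdbx_audit_revision_history.revision_date, and the function name and output keys are hypothetical.

```python
from datetime import date
from typing import Optional


def sketch_id_and_release_date(
    categories: dict[str, dict[str, list[str]]],
    fallback_id: Optional[str] = None,
) -> dict:
    """Pick the entry ID (or the fallback) and the earliest revision date."""
    # Prefer entry.id; otherwise fall back to the caller-supplied ID.
    entry_ids = categories.get("entry", {}).get("id", [])
    entry_id = entry_ids[0] if entry_ids else fallback_id

    # The release date is taken as the earliest revision date on record.
    revision_dates = categories.get("pdbx_audit_revision_history", {}).get(
        "revision_date", []
    )
    parsed = [date.fromisoformat(d) for d in revision_dates]
    release_date = min(parsed).isoformat() if parsed else None

    return {"id": entry_id, "release_date": release_date}


# Made-up example data; the ID and dates are illustrative only.
rcsb_like = {
    "entry": {"id": ["1abc"]},
    "pdbx_audit_revision_history": {
        "revision_date": ["2011-07-13", "2008-03-04", "2017-10-25"]
    },
}
metadata = sketch_id_and_release_date(rcsb_like)
custom = sketch_id_and_release_date({}, fallback_id="my_model")
```

Here custom falls back to the supplied ID because the stand-in block has no entry category, which mirrors the filename-as-ID use case described above.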

atomworks.io.transforms.categories.initialize_chain_info_from_category(cif_block: CIFBlock, atom_array: AtomArray) → dict

Extracts chain entity-level information from the CIF block.

Requires the categories ‘entity’ and ‘entity_poly’ to be present in the CIF block.

In particular, this function adds the following information to the chain_info_dict:

The RCSB entity ID for each chain (e.g., 1, 2, 3, etc.)
The chain type as an IntEnum (e.g., polypeptide(L), non-polymer, etc.)
The unprocessed one-letter entity canonical and non-canonical sequences.
A boolean flag indicating whether the chain is a polymer.
The EC numbers for the chain.

Note that three-letter sequence information is added to the chain_info_dict in a later step.

Parameters:

cif_block (CIFBlock) – Parsed CIF block.
atom_array (AtomArray) – Atom array containing the chain information.

Returns:

Dictionary containing the sequence details of each chain.

Return type:

dict
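The kind of per-chain record described above might look like the following sketch. Everything here is an assumption made for illustration: the ChainTypeSketch enum, the dictionary keys, and the row-keyed inputs (entity ID to fields) are hypothetical stand-ins, and the use of entity.pdbx_ec as the source of EC numbers is a guess rather than something the documentation states.

```python
from enum import IntEnum


class ChainTypeSketch(IntEnum):
    """Hypothetical stand-in for the library's chain-type enum."""

    POLYPEPTIDE_L = 1
    NON_POLYMER = 2
    WATER = 3


_TYPE_BY_NAME = {
    "polypeptide(L)": ChainTypeSketch.POLYPEPTIDE_L,
    "non-polymer": ChainTypeSketch.NON_POLYMER,
    "water": ChainTypeSketch.WATER,
}


def sketch_chain_info(
    chain_to_entity: dict[str, str],
    entity: dict[str, dict],
    entity_poly: dict[str, dict],
) -> dict[str, dict]:
    """Build one record per chain from entity-level information."""
    chain_info = {}
    for chain_id, entity_id in chain_to_entity.items():
        # Assumption: an entity is a polymer iff it appears in entity_poly.
        is_polymer = entity_id in entity_poly
        # Polymers take their type from entity_poly, everything else from entity.
        source = entity_poly if is_polymer else entity
        type_name = source[entity_id]["type"]
        ec_raw = entity[entity_id].get("pdbx_ec", "?")
        ec_numbers = (
            [] if ec_raw in ("?", ".") else [ec.strip() for ec in ec_raw.split(",")]
        )
        chain_info[chain_id] = {
            "entity_id": entity_id,
            "chain_type": _TYPE_BY_NAME[type_name],
            "is_polymer": is_polymer,
            "ec_numbers": ec_numbers,
        }
    return chain_info


info = sketch_chain_info(
    chain_to_entity={"A": "1", "B": "2"},
    entity={
        "1": {"type": "polymer", "pdbx_ec": "3.4.21.4, 3.4.21.5"},
        "2": {"type": "non-polymer"},
    },
    entity_poly={"1": {"type": "polypeptide(L)"}},
)
```

In the real function the chain-to-entity mapping would come from the AtomArray and the entity data from the CIF block; the plain dicts above only stand in for those objects.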

atomworks.io.transforms.categories.load_monomer_sequence_information_from_category(cif_block: CIFBlock, chain_info_dict: dict, atom_array: AtomArray, ccd_mirror_path: PathLike = None) → dict

Load monomer sequence information into a chain_info_dict.

Uses:

The CIFCategory ‘entity_poly_seq’ as the sequence ground-truth for polymers.
The AtomArray as the ground-truth for non-polymers.

We must rely on the CIFCategory ‘entity_poly_seq’ for polymers, as the AtomArray may not contain the full sequence information (e.g., unresolved residues). For non-polymers, there’s no standard equivalent to ‘entity_poly_seq’, so we must use the AtomArray to get the sequence information.

When loading both polymer and non-polymer sequences, we also filter out unknown or otherwise ignored residues.

Parameters:

cif_block (CIFBlock) – The CIF block containing the monomer sequence information.
chain_info_dict (dict) – The dictionary where the monomer sequence information will be stored.
atom_array (AtomArray) – The atom array used to get the sequence for non-polymers.

Returns:

The updated chain_info_dict with monomer sequence information. Adds the following keys:

‘res_name’: The CCD residue names for each chain.
‘res_id’: The residue IDs for each chain (does not perform re-indexing).
‘processed_entity_non_canonical_sequence’: The processed non-canonical sequence for each chain.
‘processed_entity_canonical_sequence’: The processed canonical sequence for each chain.
‘has_sequence_heterogeneity’: A boolean flag indicating whether the chain has sequence heterogeneity.

Return type:

dict
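For the polymer branch, the idea of reading the sequence from ‘entity_poly_seq’, dropping ignored residues, and flagging heterogeneity can be sketched as below. This is an illustration under stated assumptions, not the library's logic: the category is represented as a plain dict of columns (entity_id, num, mon_id), "UNK" is assumed to be the only ignored residue name, heterogeneity is taken to mean a sequence position listed with more than one monomer, and keeping the first listed monomer at such a position is an arbitrary choice.

```python
# Assumption: "UNK" marks an unknown residue that should be dropped.
IGNORED_RESIDUES = {"UNK"}


def sketch_polymer_sequence(
    entity_poly_seq: dict[str, list[str]], entity_id: str
) -> dict:
    """Extract residue names and IDs for one entity from entity_poly_seq."""
    res_names: list[str] = []
    res_ids: list[int] = []
    seen_positions: set[int] = set()
    has_heterogeneity = False

    rows = zip(
        entity_poly_seq["entity_id"],
        entity_poly_seq["num"],
        entity_poly_seq["mon_id"],
    )
    for row_entity, num, mon_id in rows:
        if row_entity != entity_id:
            continue
        position = int(num)
        if position in seen_positions:
            # The same position is listed again with another monomer.
            has_heterogeneity = True
            continue
        seen_positions.add(position)
        if mon_id in IGNORED_RESIDUES:
            continue
        res_names.append(mon_id)
        # Residue IDs are kept as given, without re-indexing.
        res_ids.append(position)

    return {
        "res_name": res_names,
        "res_id": res_ids,
        "has_sequence_heterogeneity": has_heterogeneity,
    }


# Made-up example: position 2 of entity "1" is listed as both ALA and SER.
example = {
    "entity_id": ["1", "1", "1", "1", "2"],
    "num": ["1", "2", "2", "3", "1"],
    "mon_id": ["MET", "ALA", "SER", "UNK", "GLY"],
}
entity_1 = sketch_polymer_sequence(example, "1")
entity_2 = sketch_polymer_sequence(example, "2")
```

Non-polymers have no such category to read from, which is why the documented function uses the AtomArray for them instead.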
