DNA Transforms

This module contains transformations specific to DNA processing.

Transforms for augmentation of nucleic acids

class atomworks.ml.transforms.dna.pad_dna.PadDNA(x3dna_path: PathLike | None = None, p_skip: float = 0, max_overhang: int = 2, max_pad: int = 100, max_pad_tot: int = 100, min_pad: int = 0, pad_type_weights: dict = {'none': 0, 'pdb': 0, 'uniform': 1}, pad_nt_weights: dict = {'A': 1, 'C': 1, 'G': 1, 'T': 1}, align_len_weights: dict = {1: 1}, no_hbond_dist_cut: float = 4.0)

Bases: Transform

Structurally pads DNA duplexes by extending them with randomly sampled DNA in B-form conformation.

This transform identifies DNA duplexes in the structure, completes any overhanging single-stranded regions with complementary bases, and optionally extends the duplex with additional base pairs. The padding is applied at both the sequence level and the structural level, ensuring proper base pairing and B-form DNA geometry. The original sequence is left unmodified and is placed at a random position in the padded sequence.
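The sequence-level part of this padding can be illustrated with a minimal sketch. It is not the PadDNA implementation: `pad_sequence` is a hypothetical helper, the flanking bases are drawn uniformly from A/C/G/T as an assumption, and the structural modeling with X3DNA is omitted entirely. It only shows how an unmodified sequence can end up at a random offset inside a longer padded one.

```python
import random


def pad_sequence(seq: str, n_pad: int, rng: random.Random) -> tuple[str, int]:
    """Hypothetical helper: add n_pad random bases around seq.

    Returns the padded sequence and the offset at which the original
    sequence starts. This is an illustration, not the PadDNA code.
    """
    if n_pad < 0:
        raise ValueError("n_pad must be non-negative")
    # Split the padding budget randomly between the 5' and 3' flanks.
    n_left = rng.randint(0, n_pad)
    n_right = n_pad - n_left
    left = "".join(rng.choices("ACGT", k=n_left))
    right = "".join(rng.choices("ACGT", k=n_right))
    return left + seq + right, n_left


rng = random.Random(0)
padded, offset = pad_sequence("GATTACA", n_pad=10, rng=rng)
# The original sequence is recovered unchanged at the reported offset.
assert padded[offset:offset + len("GATTACA")] == "GATTACA"
assert len(padded) == len("GATTACA") + 10
```

Returning the offset alongside the padded string makes it straightforward to map positions in the original sequence onto the padded one.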

Parameters:
  • x3dna_path (PathLike | None) – Path to the X3DNA installation directory or executable. If None, the executable path is inferred from the ‘X3DNA’ environment variable, which must be set.

  • p_skip (float) – Probability of skipping the transform. Must be between 0 and 1. Defaults to 0.

  • max_overhang (int) – Maximum allowed length of single-stranded overhangs. Defaults to 2. If the overhang is longer than this, the transform will skip the DNA chain.

  • max_pad (int) – Maximum number of base pairs to add in a single padding event. Defaults to 100. If the total length of the padded sequence is longer than this, the transform will skip the DNA chain.

  • max_pad_tot (int) – Maximum total length of padded DNA duplex. Defaults to 100.

  • min_pad (int) – Minimum number of base pairs to add when padding. Defaults to 0.

  • pad_type_weights (dict) – Weights for different padding strategies. Keys are ‘none’, ‘pdb’, ‘uniform’. Defaults to {“none”: 0, “pdb”: 0, “uniform”: 1}.

  • pad_nt_weights (dict) – Weights for nucleotide selection during padding. Keys are ‘A’, ‘T’, ‘C’, ‘G’. Defaults to {“A”: 1, “C”: 1, “G”: 1, “T”: 1}.

  • align_len_weights (dict) – Weights for selecting alignment lengths. Keys are integers. Defaults to {1: 1}.

Raises:
  • AssertionError – If p_skip is not between 0 and 1.

  • X3DNAExecutableError – If X3DNA executable validation fails.
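
The weight dictionaries in the parameter list suggest weighted random selection. The sketch below shows one common way to draw from such a dictionary with the standard library. `sample_key` is a hypothetical helper, and whether PadDNA validates or samples its weights this way is an assumption; only the default values are taken from the signature.

```python
import random


def sample_key(weights: dict, rng: random.Random):
    """Hypothetical helper: draw one key with probability proportional to its weight."""
    keys = list(weights)
    values = [weights[k] for k in keys]
    if any(v < 0 for v in values) or sum(values) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    # random.choices accepts unnormalized weights.
    return rng.choices(keys, weights=values, k=1)[0]


rng = random.Random(0)

# Default values copied from the PadDNA signature.
pad_type_weights = {"none": 0, "pdb": 0, "uniform": 1}
pad_nt_weights = {"A": 1, "C": 1, "G": 1, "T": 1}

# With the defaults, only "uniform" has a non-zero weight.
pad_type = sample_key(pad_type_weights, rng)
# Ten nucleotides, each drawn with equal probability.
flank = "".join(sample_key(pad_nt_weights, rng) for _ in range(10))
```

Because the weights are relative, {“A”: 1, “C”: 1, “G”: 1, “T”: 1} and {“A”: 5, “C”: 5, “G”: 5, “T”: 5} describe the same distribution under this scheme.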

check_input(data: dict) → None

Check if the input dictionary is valid for the transform. Raises an error if the input is invalid.

forward(data: dict[str, Any]) → dict[str, Any]

Apply a transformation to the input dictionary and return the transformed dictionary.

Parameters:

data (dict) – The input dictionary to transform.

Returns:

The transformed dictionary.

Return type:

dict

pdb_dna_lengths: ClassVar[dict[int, int]] = {0: 0, 1: 5, 2: 53, 3: 60, 4: 342, 5: 616, 6: 764, 7: 817, 8: 797, 9: 675, 10: 1258, 11: 971, 12: 1713, 13: 985, 14: 795, 15: 669, 16: 1224, 17: 464, 18: 697, 19: 363, 20: 435, 21: 858, 22: 336, 23: 218, 24: 313, 25: 280, 26: 317, 27: 351, 28: 318, 29: 161, 30: 217, 31: 154, 32: 218, 33: 99, 34: 109, 35: 231, 36: 155, 37: 84, 38: 157, 39: 98, 40: 320, 41: 46, 42: 285, 43: 45, 44: 84, 45: 93, 46: 61, 47: 69, 48: 267, 49: 124, 50: 276, 51: 40, 52: 52, 53: 46, 54: 94, 55: 41, 56: 58, 57: 35, 58: 29, 59: 31, 60: 90, 61: 39, 62: 17, 63: 38, 64: 30, 65: 18, 66: 15, 67: 13, 68: 14, 69: 8, 70: 61, 71: 22, 72: 18, 73: 4, 74: 10, 75: 14, 76: 3, 77: 9, 78: 13, 79: 18, 80: 26, 81: 14, 82: 0, 83: 5, 84: 32, 85: 45, 86: 3, 87: 1, 88: 4, 89: 1, 90: 28, 91: 2, 92: 3, 93: 3, 94: 6, 95: 3, 96: 10, 97: 0, 98: 2, 99: 35, 100: 15, 101: 0, 102: 0, 103: 0, 104: 1, 105: 11, 106: 42, 107: 0, 108: 4, 109: 2, 110: 0, 111: 0, 112: 1, 113: 0, 114: 1, 115: 2, 116: 4, 117: 0, 118: 5, 119: 2, 120: 8, 121: 2, 122: 4, 123: 5, 124: 2, 125: 4, 126: 0, 127: 3, 128: 1, 129: 0, 130: 0, 131: 0, 132: 0, 133: 4, 134: 0, 135: 0, 136: 4, 137: 2, 138: 2, 139: 12, 140: 0, 141: 2, 142: 0, 143: 4, 144: 9, 145: 203, 146: 154, 147: 317, 148: 3, 149: 670}

Distribution of DNA lengths in the PDB, as a mapping from length to count.
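
Because the table pairs each length with a count, it can be normalized into an empirical distribution and sampled from. The sketch below does this for a three-entry excerpt (lengths 10 to 12, with counts copied from pdb_dna_lengths). `sample_length` is a hypothetical helper, and how PadDNA consumes this table internally is not stated here, so treat the sampling step as an assumption.

```python
import random

# Excerpt of pdb_dna_lengths (length -> count), copied from the class attribute.
length_counts = {10: 1258, 11: 971, 12: 1713}

total = sum(length_counts.values())
# Normalize counts into probabilities.
probs = {length: count / total for length, count in length_counts.items()}


def sample_length(counts: dict[int, int], rng: random.Random) -> int:
    """Hypothetical helper: draw a length with probability proportional to its count."""
    lengths = list(counts)
    return rng.choices(lengths, weights=[counts[n] for n in lengths], k=1)[0]


rng = random.Random(0)
draws = [sample_length(length_counts, rng) for _ in range(1000)]
```

Lengths with a count of 0 in the full table would never be drawn under this scheme.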

atomworks.ml.transforms.dna.pad_dna.base_pairs(atom_array: AtomArray, min_atoms_per_base: int = 3, unique: bool = True, no_hbond_dist_cut: float = 4.0) → ndarray

atomworks.ml.transforms.dna.pad_dna.generate_bform_dna(seq: str) → AtomArray

Uses X3DNA’s ‘fiber’ executable to generate ideal B-form DNA with the given sequence, then returns the structure parsed into an AtomArray.

atomworks.ml.transforms.dna.pad_dna.to_reverse_complement(seq: str) → str

Get the Watson-Crick reverse complement of a nucleic acid sequence (assuming one-letter codes).
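
For reference, a reverse complement over one-letter DNA codes can be computed as in the sketch below. This is not the library's implementation: `reverse_complement` is a hypothetical name, only upper-case A/C/G/T are handled, and how to_reverse_complement treats other symbols is not documented here.

```python
# Watson-Crick pairing for DNA one-letter codes: A<->T, C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Hypothetical helper: complement each base, then reverse the strand."""
    unknown = set(seq) - set("ACGT")
    if unknown:
        raise ValueError(f"unsupported symbols: {sorted(unknown)}")
    return seq.translate(_COMPLEMENT)[::-1]


assert reverse_complement("ATGC") == "GCAT"
# Applying the operation twice returns the original sequence.
assert reverse_complement(reverse_complement("GATTACA")) == "GATTACA"
```

The reversal matters because the two strands of a duplex run antiparallel, so the partner strand read 5' to 3' is the complement read backwards.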