Encoding Definitions#
This module defines the standard encoding schemes used in the atomworks.ml package.
- atomworks.ml.encoding_definitions.AF2_ATOM14_ENCODING = Encoding(n_tokens=21, n_atoms_per_token=14)#
  Token    | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | 13
  --------------------------------------------------------------------------------------------
   0 : ALA | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |
   1 : ARG | N   | CA  | C   | O   | CB  | CG  | CD  | NE  | CZ  | NH1 | NH2 |     |     |
   2 : ASN | N   | CA  | C   | O   | CB  | CG  | OD1 | ND2 |     |     |     |     |     |
   3 : ASP | N   | CA  | C   | O   | CB  | CG  | OD1 | OD2 |     |     |     |     |     |
   4 : CYS | N   | CA  | C   | O   | CB  | SG  |     |     |     |     |     |     |     |
   5 : GLN | N   | CA  | C   | O   | CB  | CG  | CD  | OE1 | NE2 |     |     |     |     |
   6 : GLU | N   | CA  | C   | O   | CB  | CG  | CD  | OE1 | OE2 |     |     |     |     |
   7 : GLY | N   | CA  | C   | O   |     |     |     |     |     |     |     |     |     |
   8 : HIS | N   | CA  | C   | O   | CB  | CG  | ND1 | CD2 | CE1 | NE2 |     |     |     |
   9 : ILE | N   | CA  | C   | O   | CB  | CG1 | CG2 | CD1 |     |     |     |     |     |
  10 : LEU | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 |     |     |     |     |     |
  11 : LYS | N   | CA  | C   | O   | CB  | CG  | CD  | CE  | NZ  |     |     |     |     |
  12 : MET | N   | CA  | C   | O   | CB  | CG  | SD  | CE  |     |     |     |     |     |
  13 : PHE | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE1 | CE2 | CZ  |     |     |
  14 : PRO | N   | CA  | C   | O   | CB  | CG  | CD  |     |     |     |     |     |     |
  15 : SER | N   | CA  | C   | O   | CB  | OG  |     |     |     |     |     |     |     |
  16 : THR | N   | CA  | C   | O   | CB  | OG1 | CG2 |     |     |     |     |     |     |
  17 : TRP | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | NE1 | CE2 | CE3 | CZ2 | CZ3 | CH2
  18 : TYR | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE1 | CE2 | CZ  | OH  |     |
  19 : VAL | N   | CA  | C   | O   | CB  | CG1 | CG2 |     |     |     |     |     |     |
  20 : UNK |     |     |     |     |     |     |     |     |     |     |     |     |     |
AF2’s atom14 encoding.
- Reference:
- atomworks.ml.encoding_definitions.AF2_ATOM37_ENCODING = Encoding(n_tokens=21, n_atoms_per_token=37) Token | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0 : ALA | N | CA | C | CB | O | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OXT 1 : ARG | N | CA | C | CB | O | CG | | | | | | CD | | | | | | | | | | | | NE | | | | | | NH1 | NH2 | | CZ | | | | OXT 2 : ASN | N | CA | C | CB | O | CG | | | | | | | | | | ND2 | OD1 | | | | | | | | | | | | | | | | | | | | OXT 3 : ASP | N | CA | C | CB | O | CG | | | | | | | | | | | OD1 | OD2 | | | | | | | | | | | | | | | | | | | OXT 4 : CYS | N | CA | C | CB | O | | | | | | SG | | | | | | | | | | | | | | | | | | | | | | | | | | OXT 5 : GLN | N | CA | C | CB | O | CG | | | | | | CD | | | | | | | | | | | | | | NE2 | OE1 | | | | | | | | | | OXT 6 : GLU | N | CA | C | CB | O | CG | | | | | | CD | | | | | | | | | | | | | | | OE1 | OE2 | | | | | | | | | OXT 7 : GLY | N | CA | C | | O | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OXT 8 : HIS | N | CA | C | CB | O | CG | | | | | | | | CD2 | ND1 | | | | | | CE1 | | | | | NE2 | | | | | | | | | | | OXT 9 : ILE | N | CA | C | CB | O | | CG1 | CG2 | | | | | CD1 | | | | | | | | | | | | | | | | | | | | | | | | OXT 10 : LEU | N | CA | C | CB | O | CG | | | | | | | CD1 | CD2 | | | | | | | | | | | | | | | | | | | | | | | OXT 11 : LYS | N | CA | C | CB | O | CG | | | | | | CD | | | | | | | | CE | | | | | | | | | | | | | | | | NZ | OXT 12 : MET | N | CA | C | CB | O | CG | | | | | | | | | | | | | SD | CE | | | | | | | | | | | | | | | | | OXT 13 : PHE | N | CA | C | CB | O | CG | | | | | | | CD1 | CD2 | | | | | | | 
CE1 | CE2 | | | | | | | | | | | CZ | | | | OXT 14 : PRO | N | CA | C | CB | O | CG | | | | | | CD | | | | | | | | | | | | | | | | | | | | | | | | | OXT 15 : SER | N | CA | C | CB | O | | | | OG | | | | | | | | | | | | | | | | | | | | | | | | | | | | OXT 16 : THR | N | CA | C | CB | O | | | CG2 | | OG1 | | | | | | | | | | | | | | | | | | | | | | | | | | | OXT 17 : TRP | N | CA | C | CB | O | CG | | | | | | | CD1 | CD2 | | | | | | | | CE2 | CE3 | | NE1 | | | | CH2 | | | | | CZ2 | CZ3 | | OXT 18 : TYR | N | CA | C | CB | O | CG | | | | | | | CD1 | CD2 | | | | | | | CE1 | CE2 | | | | | | | | | | OH | CZ | | | | OXT 19 : VAL | N | CA | C | CB | O | | CG1 | CG2 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OXT 20 : UNK | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | #
AF2’s atom37 encoding
- Reference:
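(extracted via:
```python
import numpy as np

# Assumption: these lookup tables are AlphaFold 2's residue constants.
from alphafold.common.residue_constants import (
    atom_order, restype_1to3, restype_name_to_atom14_names, restype_order,
)

atom37 = {}
for res1 in restype_order:
    res3 = restype_1to3[res1]
    arr = np.array([""] * 37, dtype="<U3")  # one slot per atom37 position
    for atom in restype_name_to_atom14_names[res3]:
        if atom != "":  # skip empty atom14 padding slots
            arr[atom_order[atom]] = f"{atom:<3}"
    arr[-1] = "OXT"  # the terminal oxygen occupies the last slot
    atom37[res3] = arr
```
)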
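The two AF2 layouts differ in how slots are assigned: atom14 packs each residue's atoms densely in a per-residue order, while atom37 gives every atom name a fixed slot shared by all residues. The following is a minimal sketch of scattering atom14 coordinates into atom37 slots. The atom37 slot order is written out by hand and the atom14 rows are copied from the table above; `atom14_to_atom37` is a hypothetical helper, not part of atomworks.

```python
import numpy as np

# atom37 slot order, written out by hand (assumption: agrees with the atom37 table above).
ATOM37_NAMES = [
    "N", "CA", "C", "CB", "O", "CG", "CG1", "CG2", "OG", "OG1", "SG", "CD",
    "CD1", "CD2", "ND1", "ND2", "OD1", "OD2", "SD", "CE", "CE1", "CE2", "CE3",
    "NE", "NE1", "NE2", "OE1", "OE2", "CH2", "NH1", "NH2", "OH", "CZ", "CZ2",
    "CZ3", "NZ", "OXT",
]
ATOM37_SLOT = {name: i for i, name in enumerate(ATOM37_NAMES)}

# A few atom14 rows copied from the AF2 atom14 table (empty slots omitted).
ATOM14_NAMES = {
    "SER": ["N", "CA", "C", "O", "CB", "OG"],
    "CYS": ["N", "CA", "C", "O", "CB", "SG"],
    "THR": ["N", "CA", "C", "O", "CB", "OG1", "CG2"],
}


def atom14_to_atom37(res_name, coords14):
    """Scatter (14, 3) atom14 coordinates into a (37, 3) array plus an occupancy mask."""
    coords37 = np.zeros((37, 3), dtype=coords14.dtype)
    mask37 = np.zeros(37, dtype=bool)
    for i14, atom in enumerate(ATOM14_NAMES[res_name]):
        i37 = ATOM37_SLOT[atom]
        coords37[i37] = coords14[i14]
        mask37[i37] = True
    return coords37, mask37


# Dummy coordinates: row i of coords14 belongs to atom14 slot i.
coords14 = np.arange(14 * 3, dtype=float).reshape(14, 3)
coords37, mask37 = atom14_to_atom37("SER", coords14)
```

For serine, the side-chain oxygen moves from atom14 slot 5 to atom37 slot 8, and the backbone O and CB swap relative order between the two layouts.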
- class atomworks.ml.encoding_definitions.AF3SequenceEncoding[source]#
Bases:
object
Encodes and decodes sequence tokens for AlphaFold 3.
This class provides functionality to convert between residue names and their corresponding integer encodings as used in AlphaFold 3. It handles standard amino acids, RNA, DNA, and unknown residues.
- tokens()#
Property that returns the list of AF3 tokens.
- n_tokens()#
Property that returns the number of AF3 tokens.
- property idx_to_token: ndarray#
- property n_tokens: int#
- property token_to_idx: dict[str, int]#
- property tokens: list[str]#
- atomworks.ml.encoding_definitions.AF3_TOKENS = ('ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL', 'UNK', 'A', 'C', 'G', 'U', 'N', 'DA', 'DC', 'DG', 'DT', 'DN', '<G>')#
Sequence tokens in AF3
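As an illustration of how a token tuple like this maps to integer encodings, the sketch below builds the two lookup directions from a copy of `AF3_TOKENS`. The `encode` helper is hypothetical, and sending every unrecognized name to `UNK` is a simplifying assumption; the actual class may route unknown RNA and DNA residues to `N` and `DN` instead.

```python
import numpy as np

# Copied from the AF3_TOKENS value shown above.
AF3_TOKENS = (
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "UNK", "A", "C", "G", "U", "N", "DA", "DC", "DG", "DT", "DN", "<G>",
)

# The position in the tuple is the integer encoding.
token_to_idx = {tok: i for i, tok in enumerate(AF3_TOKENS)}
idx_to_token = np.array(AF3_TOKENS)


def encode(res_names, unknown="UNK"):
    """Hypothetical helper: map residue names to indices, with one shared fallback."""
    fallback = token_to_idx[unknown]
    return np.array([token_to_idx.get(name, fallback) for name in res_names])


encoded = encode(["MET", "GLY", "DA", "XYZ"])
decoded = idx_to_token[encoded]  # numpy fancy indexing decodes all positions at once
```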
- atomworks.ml.encoding_definitions.RF2AA_ATOM36_ENCODING = Encoding(n_tokens=80, n_atoms_per_token=36) Token | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0 : ALA | N | CA | C | O | CB | | | | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 3HB | | | | | | | | 1 : ARG | N | CA | C | O | CB | CG | CD | NE | CZ | NH1 | NH2 | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HG | 2HG | 1HD | 2HD | HE | 1HH1 | 2HH1 | 1HH2 | 2HH2 2 : ASN | N | CA | C | O | CB | CG | OD1 | ND2 | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HD2 | 2HD2 | | | | | | | 3 : ASP | N | CA | C | O | CB | CG | OD1 | OD2 | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | | | | | | | | | 4 : CYS | N | CA | C | O | CB | SG | | | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | HG | | | | | | | | 5 : GLN | N | CA | C | O | CB | CG | CD | OE1 | NE2 | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HG | 2HG | 1HE2 | 2HE2 | | | | | 6 : GLU | N | CA | C | O | CB | CG | CD | OE1 | OE2 | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HG | 2HG | | | | | | | 7 : GLY | N | CA | C | O | | | | | | | | | | | | | | | | | | | | H | 1HA | 2HA | | | | | | | | | | 8 : HIS | N | CA | C | O | CB | CG | ND1 | CD2 | CE1 | NE2 | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 2HD | 1HE | 2HE | | | | | | 9 : ILE | N | CA | C | O | CB | CG1 | CG2 | CD1 | | | | | | | | | | | | | | | | H | HA | HB | 1HG2 | 2HG2 | 3HG2 | 1HG1 | 2HG1 | 1HD1 | 2HD1 | 3HD1 | | 10 : LEU | N | CA | C | O | CB | CG | CD1 | CD2 | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | HG | 1HD1 | 2HD1 | 3HD1 | 1HD2 | 2HD2 | 3HD2 | | 11 : LYS | N | CA | C | O | 
CB | CG | CD | CE | NZ | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HG | 2HG | 1HD | 2HD | 1HE | 2HE | 1HZ | 2HZ | 3HZ 12 : MET | N | CA | C | O | CB | CG | SD | CE | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HG | 2HG | 1HE | 2HE | 3HE | | | | 13 : PHE | N | CA | C | O | CB | CG | CD1 | CD2 | CE1 | CE2 | CZ | | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HD | 2HD | 1HE | 2HE | HZ | | | | 14 : PRO | N | CA | C | O | CB | CG | CD | | | | | | | | | | | | | | | | | HA | 1HB | 2HB | 1HG | 2HG | 1HD | 2HD | | | | | | 15 : SER | N | CA | C | O | CB | OG | | | | | | | | | | | | | | | | | | H | HG | HA | 1HB | 2HB | | | | | | | | 16 : THR | N | CA | C | O | CB | OG1 | CG2 | | | | | | | | | | | | | | | | | H | HG1 | HA | HB | 1HG2 | 2HG2 | 3HG2 | | | | | | 17 : TRP | N | CA | C | O | CB | CG | CD1 | CD2 | CE2 | CE3 | NE1 | CZ2 | CZ3 | CH2 | | | | | | | | | | H | HA | 1HB | 2HB | 1HD | 1HE | HZ2 | HH2 | HZ3 | HE3 | | | 18 : TYR | N | CA | C | O | CB | CG | CD1 | CD2 | CE1 | CE2 | CZ | OH | | | | | | | | | | | | H | HA | 1HB | 2HB | 1HD | 1HE | 2HE | 2HD | HH | | | | 19 : VAL | N | CA | C | O | CB | CG1 | CG2 | | | | | | | | | | | | | | | | | H | HA | HB | 1HG1 | 2HG1 | 3HG1 | 1HG2 | 2HG2 | 3HG2 | | | | 20 : UNK | N | CA | C | O | CB | | | | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 3HB | | | | | | | | 21 : <M> | N | CA | C | O | CB | | | | | | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 3HB | | | | | | | | 22 : DA | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N9 | C4 | N3 | C2 | N1 | C6 | C5 | N7 | C8 | N6 | | | H5'' | H5' | H4' | H3' | H2'' | H2' | H1' | H2 | H61 | H62 | H8 | | 23 : DC | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N1 | C2 | O2 | N3 | C4 | N4 | C5 | C6 | | | | | H5'' | H5' | H4' | H3' | H2'' | H2' | H1' | H42 | H41 | H5 | H6 | | 24 : DG | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N9 | C4 | N3 | C2 | N1 | C6 | C5 | N7 | C8 | N2 | O6 | | H5'' | H5' | H4' | H3' | H2'' | 
H2' | H1' | H1 | H22 | H21 | H8 | | 25 : DT | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N1 | C2 | O2 | N3 | C4 | O4 | C5 | C7 | C6 | | | | H5'' | H5' | H4' | H3' | H2'' | H2' | H1' | H3 | H71 | H72 | H73 | H6 | 26 : DN | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | | | | | | | | | | | | | H5'' | H5' | H4' | H3' | H2'' | H2' | H1' | | | | | | 27 : A | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | N3 | C4 | C5 | C6 | N6 | N7 | C8 | N9 | | H5' | H5'' | H4' | H3' | H2' | HO2' | H1' | H2 | H61 | H62 | H8 | | 28 : C | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | O2 | N3 | C4 | N4 | C5 | C6 | | | | H5' | H5'' | H4' | H3' | H2' | HO2' | H1' | H42 | H41 | H5 | H6 | | 29 : G | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | N2 | N3 | C4 | C5 | C6 | O6 | N7 | C8 | N9 | H5' | H5'' | H4' | H3' | H2' | HO2' | H1' | H1 | H22 | H21 | H8 | | 30 : U | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | O2 | N3 | C4 | O4 | C5 | C6 | | | | H5' | H5'' | H4' | H3' | H2' | HO2' | H1' | H3 | H5 | H6 | | | 31 : N | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | | | | | | | | | | | | H5' | H5'' | H4' | H3' | H2' | HO2' | H1' | | | | | | 32 : HIS_D | N | CA | C | O | CB | CG | NE2 | CD2 | CE1 | ND1 | | | | | | | | | | | | | | H | HA | 1HB | 2HB | 2HD | 1HE | 1HD | | | | | | 33 : 13 | | 13 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 34 : 33 | | 33 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 35 : 79 | | 79 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 36 : 5 | | 5 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 37 : 4 | | 4 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 38 : 35 | | 35 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 39 : 6 | | 6 | | | | | | | | | | | | | | | | 
| | | | | | | | | | | | | | | | | | 40 : 20 | | 20 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 41 : 17 | | 17 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 42 : 27 | | 27 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 43 : 24 | | 24 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 44 : 29 | | 29 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 45 : 9 | | 9 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 46 : 26 | | 26 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 47 : 80 | | 80 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 48 : 53 | | 53 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 49 : 77 | | 77 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 50 : 19 | | 19 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 51 : 3 | | 3 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 52 : 12 | | 12 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 53 : 25 | | 25 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 54 : 42 | | 42 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 55 : 7 | | 7 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 56 : 28 | | 28 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 57 : 8 | | 8 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 58 : 76 | | 76 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 59 : 15 | | 15 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 60 : 82 | | 82 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 61 : 46 | | 46 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 62 : 59 | | 59 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 63 : 78 | | 78 | | | | | | | | | | | | | | | | | | | | | | | | 
| | | | | | | | | | 64 : 75 | | 75 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 65 : 45 | | 45 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 66 : 44 | | 44 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 67 : 16 | | 16 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 68 : 51 | | 51 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 69 : 34 | | 34 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 70 : 14 | | 14 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 71 : 50 | | 50 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 72 : 65 | | 65 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 73 : 52 | | 52 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 74 : 92 | | 92 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 75 : 74 | | 74 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 76 : 23 | | 23 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 77 : 39 | | 39 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 78 : 30 | | 30 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 79 : 0 | | 0 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | #
RF2AA all-atom encoding for proteins, nucleic acids, and various other elements.
- Encodes heavy atoms and hydrogens (max 36 in total)
- Includes 3 unknown tokens: UNK for proteins, DN for DNA, N for RNA
- Covers:
20 amino acids (+ unknown, + mask),
4 DNA bases (+ unknown),
4 RNA bases (+ unknown),
1 outdated histidine token HIS_D,
46 atom tokens (+ unknown)
- atomworks.ml.encoding_definitions.RF2AA_STANDARDIZED_TOKENS = ['ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL', 'UNK', '<M>', 'DA', 'DC', 'DG', 'DT', 'DN', 'A', 'C', 'G', 'U', 'N', 'HIS_D', 13, 33, 79, 5, 4, 35, 6, 20, 17, 27, 24, 29, 9, 26, 80, 53, 77, 19, 3, 12, 25, 42, 7, 28, 8, 76, 15, 82, 46, 59, 78, 75, 45, 44, 16, 51, 34, 14, 50, 65, 52, 92, 74, 23, 39, 30, 0]#
List of standardized tokens in RF2AA.
- atomworks.ml.encoding_definitions.RF2AA_TOKEN_TO_STANDARD_TOKEN = {' DA': 'DA', ' DC': 'DC', ' DG': 'DG', ' DT': 'DT', ' DX': 'DN', ' RA': 'A', ' RC': 'C', ' RG': 'G', ' RU': 'U', ' RX': 'N', 'ALA': 'ALA', 'ARG': 'ARG', 'ASN': 'ASN', 'ASP': 'ASP', 'ATM': 0, 'Al': 13, 'As': 33, 'Au': 79, 'B': 5, 'Be': 4, 'Br': 35, 'C': 6, 'CYS': 'CYS', 'Ca': 20, 'Cl': 17, 'Co': 27, 'Cr': 24, 'Cu': 29, 'F': 9, 'Fe': 26, 'GLN': 'GLN', 'GLU': 'GLU', 'GLY': 'GLY', 'HIS': 'HIS', 'HIS_D': 'HIS_D', 'Hg': 80, 'I': 53, 'ILE': 'ILE', 'Ir': 77, 'K': 19, 'LEU': 'LEU', 'LYS': 'LYS', 'Li': 3, 'MAS': '<M>', 'MET': 'MET', 'Mg': 12, 'Mn': 25, 'Mo': 42, 'N': 7, 'Ni': 28, 'O': 8, 'Os': 76, 'P': 15, 'PHE': 'PHE', 'PRO': 'PRO', 'Pb': 82, 'Pd': 46, 'Pr': 59, 'Pt': 78, 'Re': 75, 'Rh': 45, 'Ru': 44, 'S': 16, 'SER': 'SER', 'Sb': 51, 'Se': 34, 'Si': 14, 'Sn': 50, 'THR': 'THR', 'TRP': 'TRP', 'TYR': 'TYR', 'Tb': 65, 'Te': 52, 'U': 92, 'UNK': 'UNK', 'V': 23, 'VAL': 'VAL', 'W': 74, 'Y': 39, 'Zn': 30}#
Dictionary to interconvert between RF2AA token names and standardized token names.
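A short sketch of how this mapping might be applied: RF2AA names nucleic acid residues with a leading space and an R/D prefix, and names elements by symbol, while the standardized tokens use CCD-style residue names and atomic numbers. Only a handful of entries are copied from the dictionary above; the excerpt variable name is made up for this example.

```python
# Excerpt copied from RF2AA_TOKEN_TO_STANDARD_TOKEN (hypothetical variable name).
RF2AA_TO_STANDARD_EXCERPT = {
    "ALA": "ALA",
    " DA": "DA",   # DNA adenine
    " DX": "DN",   # unknown DNA residue
    " RG": "G",    # RNA guanine
    " RX": "N",    # unknown RNA residue
    "MAS": "<M>",  # mask token
    "Fe": 26,      # elements map to their atomic number
    "Zn": 30,
    "ATM": 0,      # unknown element
}

raw_tokens = ["ALA", " RG", "MAS", "Fe", "ATM"]
standardized = [RF2AA_TO_STANDARD_EXCERPT[tok] for tok in raw_tokens]
```

Residue tokens stay strings after standardization, whereas element tokens become integers, which is why `RF2AA_STANDARDIZED_TOKENS` mixes both types.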
- atomworks.ml.encoding_definitions.RF2_ATOM14_ENCODING = Encoding(n_tokens=22, n_atoms_per_token=14)#
  Token    | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | 13
  --------------------------------------------------------------------------------------------
   0 : ALA | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |
   1 : ARG | N   | CA  | C   | O   | CB  | CG  | CD  | NE  | CZ  | NH1 | NH2 |     |     |
   2 : ASN | N   | CA  | C   | O   | CB  | CG  | OD1 | ND2 |     |     |     |     |     |
   3 : ASP | N   | CA  | C   | O   | CB  | CG  | OD1 | OD2 |     |     |     |     |     |
   4 : CYS | N   | CA  | C   | O   | CB  | SG  |     |     |     |     |     |     |     |
   5 : GLN | N   | CA  | C   | O   | CB  | CG  | CD  | OE1 | NE2 |     |     |     |     |
   6 : GLU | N   | CA  | C   | O   | CB  | CG  | CD  | OE1 | OE2 |     |     |     |     |
   7 : GLY | N   | CA  | C   | O   |     |     |     |     |     |     |     |     |     |
   8 : HIS | N   | CA  | C   | O   | CB  | CG  | ND1 | CD2 | CE1 | NE2 |     |     |     |
   9 : ILE | N   | CA  | C   | O   | CB  | CG1 | CG2 | CD1 |     |     |     |     |     |
  10 : LEU | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 |     |     |     |     |     |
  11 : LYS | N   | CA  | C   | O   | CB  | CG  | CD  | CE  | NZ  |     |     |     |     |
  12 : MET | N   | CA  | C   | O   | CB  | CG  | SD  | CE  |     |     |     |     |     |
  13 : PHE | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE1 | CE2 | CZ  |     |     |
  14 : PRO | N   | CA  | C   | O   | CB  | CG  | CD  |     |     |     |     |     |     |
  15 : SER | N   | CA  | C   | O   | CB  | OG  |     |     |     |     |     |     |     |
  16 : THR | N   | CA  | C   | O   | CB  | OG1 | CG2 |     |     |     |     |     |     |
  17 : TRP | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE2 | CE3 | NE1 | CZ2 | CZ3 | CH2
  18 : TYR | N   | CA  | C   | O   | CB  | CG  | CD1 | CD2 | CE1 | CE2 | CZ  | OH  |     |
  19 : VAL | N   | CA  | C   | O   | CB  | CG1 | CG2 |     |     |     |     |     |     |
  20 : UNK | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |
  21 : <M> | N   | CA  | C   | O   | CB  |     |     |     |     |     |     |     |     |
- RF2 atom14 encoding for proteins.
Encodes only the heavy atoms (max 14, for TRP)
Includes 1 unknown token: UNK
Print it out to see a visual representation of the encoding.
- atomworks.ml.encoding_definitions.RF2_ATOM23_ENCODING = Encoding(n_tokens=32, n_atoms_per_token=23) Token | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 --------------------------------------------------------------------------------------------------------------------------------------------------- 0 : ALA | N | CA | C | O | CB | | | | | | | | | | | | | | | | | | 1 : ARG | N | CA | C | O | CB | CG | CD | NE | CZ | NH1 | NH2 | | | | | | | | | | | | 2 : ASN | N | CA | C | O | CB | CG | OD1 | ND2 | | | | | | | | | | | | | | | 3 : ASP | N | CA | C | O | CB | CG | OD1 | OD2 | | | | | | | | | | | | | | | 4 : CYS | N | CA | C | O | CB | SG | | | | | | | | | | | | | | | | | 5 : GLN | N | CA | C | O | CB | CG | CD | OE1 | NE2 | | | | | | | | | | | | | | 6 : GLU | N | CA | C | O | CB | CG | CD | OE1 | OE2 | | | | | | | | | | | | | | 7 : GLY | N | CA | C | O | | | | | | | | | | | | | | | | | | | 8 : HIS | N | CA | C | O | CB | CG | ND1 | CD2 | CE1 | NE2 | | | | | | | | | | | | | 9 : ILE | N | CA | C | O | CB | CG1 | CG2 | CD1 | | | | | | | | | | | | | | | 10 : LEU | N | CA | C | O | CB | CG | CD1 | CD2 | | | | | | | | | | | | | | | 11 : LYS | N | CA | C | O | CB | CG | CD | CE | NZ | | | | | | | | | | | | | | 12 : MET | N | CA | C | O | CB | CG | SD | CE | | | | | | | | | | | | | | | 13 : PHE | N | CA | C | O | CB | CG | CD1 | CD2 | CE1 | CE2 | CZ | | | | | | | | | | | | 14 : PRO | N | CA | C | O | CB | CG | CD | | | | | | | | | | | | | | | | 15 : SER | N | CA | C | O | CB | OG | | | | | | | | | | | | | | | | | 16 : THR | N | CA | C | O | CB | OG1 | CG2 | | | | | | | | | | | | | | | | 17 : TRP | N | CA | C | O | CB | CG | CD1 | CD2 | CE2 | CE3 | NE1 | CZ2 | CZ3 | CH2 | | | | | | | | | 18 : TYR | N | CA | C | O | CB | CG | CD1 | CD2 | CE1 | CE2 | CZ | OH | | | | | | | | | | | 19 : VAL | N | CA | C | O | CB | CG1 | CG2 | | | | | | | | | | | | | | | | 20 : UNK | N | CA | C | O | CB | | | | | | | | | | | | | | | | | | 
21 : <M> | N | CA | C | O | CB | | | | | | | | | | | | | | | | | | 22 : DA | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N9 | C4 | N3 | C2 | N1 | C6 | C5 | N7 | C8 | N6 | | 23 : DC | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N1 | C2 | O2 | N3 | C4 | N4 | C5 | C6 | | | | 24 : DG | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N9 | C4 | N3 | C2 | N1 | C6 | C5 | N7 | C8 | N2 | O6 | 25 : DT | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | N1 | C2 | O2 | N3 | C4 | O4 | C5 | C7 | C6 | | | 26 : DN | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C2' | C1' | | | | | | | | | | | | 27 : A | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | N3 | C4 | C5 | C6 | N6 | N7 | C8 | N9 | 28 : C | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | O2 | N3 | C4 | N4 | C5 | C6 | | | 29 : G | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | N2 | N3 | C4 | C5 | C6 | O6 | N7 | C8 | N9 30 : U | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | N1 | C2 | O2 | N3 | C4 | O4 | C5 | C6 | | | 31 : N | OP1 | P | OP2 | O5' | C5' | C4' | O4' | C3' | O3' | C1' | C2' | O2' | | | | | | | | | | | #
- RF2 atom23 encoding for proteins and nucleic acids.
Encodes only the heavy atoms (max 23, for G, named RG in RF2AA)
Includes 3 unknown tokens: UNK for proteins, DN for DNA, N for RNA
Print it out to see a visual representation of the encoding.
- class atomworks.ml.encoding_definitions.TokenEncoding(token_atoms: dict[str | int, ndarray], chemcomp_type_to_unknown: dict[str, str] = None)[source]#
Bases:
object
A class to represent a fixed-length token encoding.
- Parameters:
token_atoms (dict[str | int, np.ndarray]) – A dictionary mapping token names to arrays of atom names. The order of the tokens in the dictionary determines the integer encoding of each token. The order of the atom names in each array determines the integer encoding of the atom name within the token.
chemcomp_type_to_unknown (dict[str, str]) – A dictionary mapping chemical component types to unknown token names. This is used to map unknown residues to the respective unknown token. Different chemical component types may map to different unknown token names. Defaults to {}, meaning that no unknown tokens are defined, leading to a KeyError if an unknown residue is encountered.
NOTE: We follow these conventions for tokens to make them compatible with the CCD for robust and easy tokenization. If you want to use the Transforms written for automatically tokenizing and encoding, you need to follow these conventions.
- When encoding a residue, we use the standardized (up to) 3-letter residue name from the CCD, e.g. ‘ALA’ for Alanine, ‘DA’ for Deoxyadenosine, or ‘U’ for Uracil.
- When encoding unknown tokens, we may define different unknown tokens for different chemical components (e.g. a different unknown for proteins vs. DNA, …). The unknown tokens can take on any arbitrary 3-letter code that we want to map to, but they should not clash with existing residue names in the CCD.
- When encoding an atom, we use the atomic number of the element as a string as the token name, e.g. ‘1’ for Hydrogen, ‘6’ for Carbon, ‘9’ for Fluorine, … For unknown atoms, we use ‘0’ as the token name. TODO: Deal with ligand names such as 100, which is also an atomic number.
- To denote masked tokens, we use a ‘<…>’ syntax, e.g. ‘<M>’ for a generic mask token, or ‘<MP>’ for a mask token for proteins. The … can be any arbitrary string. We use the angle brackets to avoid clashes with existing residue names in the CCD.
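The conventions above can be illustrated with a small standalone sketch that does not import atomworks. It builds a `token_atoms` dictionary whose arrays all have the same length and resolves unseen residue names through a `chemcomp_type_to_unknown` mapping. The `resolve_token` helper is hypothetical, and the chemical component type strings used as keys ("L-PEPTIDE LINKING", "DNA LINKING") are an assumption about how such types would be spelled.

```python
import numpy as np

# Every token gets an array of the same length; unused slots are empty strings.
token_atoms = {
    "ALA": np.array(["N", "CA", "C", "O", "CB"]),
    "GLY": np.array(["N", "CA", "C", "O", ""]),
    "UNK": np.array(["N", "CA", "C", "O", "CB"]),        # unknown protein residue
    "DN": np.array(["OP1", "P", "OP2", "O5'", "C5'"]),   # unknown DNA residue
    "<M>": np.array(["", "", "", "", ""]),               # mask token in <...> syntax
}

# Assumed type strings; different component types map to different unknown tokens.
chemcomp_type_to_unknown = {
    "L-PEPTIDE LINKING": "UNK",
    "DNA LINKING": "DN",
}

# Dictionary order determines the integer encoding of each token.
token_to_idx = {tok: i for i, tok in enumerate(token_atoms)}


def resolve_token(res_name, chemcomp_type):
    """Hypothetical helper: return the token for a residue, or its type's unknown token."""
    if res_name in token_to_idx:
        return res_name
    # Raises KeyError when no unknown token is defined for this component type.
    return chemcomp_type_to_unknown[chemcomp_type]


known = resolve_token("ALA", "L-PEPTIDE LINKING")
fallback = resolve_token("MSE", "L-PEPTIDE LINKING")
```

Resolving a residue whose component type has no entry (for example an RNA residue here) raises a `KeyError`, which mirrors the behavior described for an empty `chemcomp_type_to_unknown`.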
- property atom_to_idx: dict[tuple[str | int, str], int]#
For encoding atoms (token, atom) to atom indices. (token, atom) -> atom_idx
- chemcomp_type_to_unknown: dict[str, str] = None#
- property idx_to_atom: ndarray#
For rapid decoding of token & atom indices to atom names via numpy indexing.
- property idx_to_element: ndarray#
For rapid decoding of token & atom indices to elements via numpy indexing.
- property idx_to_token: ndarray#
For rapid decoding of token indices to token names via numpy indexing.
- property n_atoms_per_token: int#
- property n_tokens: int#
- token_atoms: dict[str | int, ndarray]#
- property token_to_idx: dict[str, int]#
For encoding token names to token indices. (token) -> token_idx
- property tokens: ndarray#
- property unknown_tokens: ndarray#
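To make the lookup properties above concrete, here is a minimal standalone sketch of how such tables could be derived from a `token_atoms` dictionary. It reuses the property names for readability but does not call atomworks, so the exact shapes and dtypes of the real properties are an assumption.

```python
import numpy as np

# Toy encoding with 3 tokens and 6 atom slots per token.
token_atoms = {
    "ALA": np.array(["N", "CA", "C", "O", "CB", ""]),
    "SER": np.array(["N", "CA", "C", "O", "CB", "OG"]),
    "GLY": np.array(["N", "CA", "C", "O", "", ""]),
}

# Decoding tables: plain arrays, so whole index arrays decode in one indexing call.
idx_to_token = np.array(list(token_atoms))          # shape (n_tokens,)
idx_to_atom = np.stack(list(token_atoms.values()))  # shape (n_tokens, n_atoms_per_token)

# Encoding tables: dictionaries keyed by name.
token_to_idx = {tok: i for i, tok in enumerate(token_atoms)}
atom_to_idx = {
    (tok, str(atom)): i
    for tok, atoms in token_atoms.items()
    for i, atom in enumerate(atoms)
    if atom != ""  # empty padding slots are not addressable
}

# Decode three (token index, atom index) pairs at once.
token_idx = np.array([1, 1, 2])
atom_idx = np.array([0, 5, 1])
tokens = idx_to_token[token_idx]
atoms = idx_to_atom[token_idx, atom_idx]
```

Keeping the decoding side as arrays and the encoding side as dictionaries matches the property types listed above (`ndarray` for `idx_to_*`, `dict` for `*_to_idx`).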
- atomworks.ml.encoding_definitions.UNKNOWN_ELEMENT_TOKEN = 0#
The token to use for an unknown element.