Vocabulary
minnt.Vocabulary
A class for managing the mapping between strings and indices.
The vocabulary is initialized with a list of strings and can additionally contain two special tokens:
- a padding token minnt.Vocabulary.PAD_TOKEN, which, if present, is always at index minnt.Vocabulary.PAD=0;
- an unknown token minnt.Vocabulary.UNK_TOKEN, which, if present, is at index minnt.Vocabulary.UNK, equal to 0 or 1 (depending on whether the padding token is present); the index of this token is returned when looking up a string not present in the vocabulary.
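The index layout of the two special tokens can be illustrated without the library itself. The following is a minimal sketch, not minnt code: `special_token_layout` is a hypothetical helper that only mirrors the rule stated above, and the token strings are the documented values of the class attributes.

```python
PAD_TOKEN = "[PAD]"  # documented value of minnt.Vocabulary.PAD_TOKEN
UNK_TOKEN = "[UNK]"  # documented value of minnt.Vocabulary.UNK_TOKEN


def special_token_layout(add_pad: bool, add_unk: bool) -> dict[str, int]:
    """Hypothetical helper: indices the special tokens would occupy."""
    layout: dict[str, int] = {}
    if add_pad:
        layout[PAD_TOKEN] = 0  # the padding token, if present, is always first
    if add_unk:
        layout[UNK_TOKEN] = len(layout)  # 0 without padding, 1 with padding
    return layout


print(special_token_layout(add_pad=True, add_unk=True))   # {'[PAD]': 0, '[UNK]': 1}
print(special_token_layout(add_pad=False, add_unk=True))  # {'[UNK]': 0}
```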
Info
A Vocabulary instance can be pickled and unpickled efficiently as a list of strings;
the required string-to-index mapping is reconstructed upon unpickling.
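One common way to obtain this kind of pickling behaviour in Python is to serialize only the list of strings and rebuild the dictionary on load. The sketch below shows that pattern on a hypothetical stand-in class `TinyVocab`; it is an assumption about the general technique, not the actual minnt implementation.

```python
import pickle


class TinyVocab:
    """Hypothetical stand-in, not the minnt implementation."""

    def __init__(self, strings):
        self._strings = list(strings)
        self._index = {s: i for i, s in enumerate(self._strings)}

    def __getstate__(self):
        # Only the list of strings is serialized.
        return self._strings

    def __setstate__(self, state):
        # The string-to-index mapping is rebuilt from the list on load.
        self._strings = state
        self._index = {s: i for i, s in enumerate(state)}


vocab = TinyVocab(["[PAD]", "[UNK]", "cat", "dog"])
restored = pickle.loads(pickle.dumps(vocab))
print(restored._index["dog"])  # 3
```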
Source code in minnt/vocabulary.py
PAD_TOKEN
class-attribute
instance-attribute
PAD_TOKEN: str = '[PAD]'
The string representing the padding token.
UNK_TOKEN
class-attribute
instance-attribute
UNK_TOKEN: str = '[UNK]'
The string representing the unknown token.
__init__
Initialize the vocabulary with the given list of strings.
The strings may be preceded by the special padding and unknown tokens (in that order),
depending on the values of add_pad and add_unk.
Note
If the given strings already contain the special tokens at the expected indices, they are recognized
correctly and no duplicates are added even if add_pad and/or add_unk are True.
Parameters:
-
strings(Iterable[str]) –An iterable of strings to include in the vocabulary.
-
add_pad(bool, default:False) –Whether to add a padding token minnt.Vocabulary.PAD_TOKEN at index 0 and set minnt.Vocabulary.PAD=0.
-
add_unk(bool, default:False) –Whether to add an unknown token minnt.Vocabulary.UNK_TOKEN at index 0 or 1 (depending on whether the padding token is added) and set minnt.Vocabulary.UNK accordingly.
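The prepending rule and the no-duplicates note can be sketched as a small standalone function. This is an illustration under stated assumptions, not minnt code: `build_strings` is a hypothetical helper, and how the real constructor treats edge cases (for example a padding token already present while add_pad is False) is assumed here rather than confirmed.

```python
from collections.abc import Iterable

PAD_TOKEN, UNK_TOKEN = "[PAD]", "[UNK]"  # documented token strings


def build_strings(strings: Iterable[str], add_pad: bool = False,
                  add_unk: bool = False) -> list[str]:
    """Hypothetical helper: prepend special tokens unless already in place."""
    result = list(strings)
    # Add the padding token only if index 0 does not already hold it.
    if add_pad and result[:1] != [PAD_TOKEN]:
        result.insert(0, PAD_TOKEN)
    # Assumption: the unknown token belongs at 1 when padding sits at 0, else at 0.
    unk_index = 1 if result[:1] == [PAD_TOKEN] else 0
    if add_unk and result[unk_index:unk_index + 1] != [UNK_TOKEN]:
        result.insert(unk_index, UNK_TOKEN)
    return result


print(build_strings(["a", "b"], add_pad=True, add_unk=True))
# ['[PAD]', '[UNK]', 'a', 'b']
print(build_strings(["[PAD]", "[UNK]", "a"], add_pad=True, add_unk=True))
# ['[PAD]', '[UNK]', 'a']
```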
__len__
__len__() -> int
The number of strings in the vocabulary.
Returns:
-
int–The size of the vocabulary.
__iter__
Return an iterator over strings in the vocabulary.
Returns:
add
If not already present, add the given string to the end of the vocabulary.
Parameters:
-
string(str) –The string to add.
Returns:
-
int–The index of the newly added string (or the index of the existing string if it was already present).
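The add-if-absent behaviour described here can be sketched with a list and a dictionary kept in sync. The function `add` below is a hypothetical free-standing helper written for illustration; the real method operates on the vocabulary instance and may differ in its internals.

```python
def add(strings: list[str], index_of: dict[str, int], string: str) -> int:
    """Hypothetical helper mirroring the documented behaviour of add."""
    if string not in index_of:
        # New strings always go to the end, so existing indices stay stable.
        index_of[string] = len(strings)
        strings.append(string)
    return index_of[string]


strings = ["[PAD]", "cat"]
index_of = {s: i for i, s in enumerate(strings)}

print(add(strings, index_of, "dog"))  # 2
print(add(strings, index_of, "dog"))  # 2
print(add(strings, index_of, "cat"))  # 1
```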
string
Convert vocabulary index to string.
Parameters:
-
index(int) –The vocabulary index.
Returns:
-
str–The string corresponding to the given index.
strings
Convert a sequence of vocabulary indices to strings.
Parameters:
Returns:
index
Convert string to vocabulary index.
Parameters:
-
string(str) –The string to convert.
-
add_missing(bool, default:False) –Whether to add the string to the vocabulary if not present.
Returns:
-
int | None–The index corresponding to the given string. If the string is not found in the vocabulary, then
- if add_missing is True, the string is added to the end of the vocabulary and its index is returned;
- if the minnt.Vocabulary.UNK_TOKEN was added to the vocabulary, its index is returned;
- otherwise, None is returned.
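The three-step fallback for a missing string can be sketched as follows. This is not minnt code: `lookup` is a hypothetical helper, and its `unk` parameter stands in for minnt.Vocabulary.UNK (None when no unknown token is present).

```python
from typing import Optional


def lookup(strings: list[str], index_of: dict[str, int], string: str,
           add_missing: bool = False, unk: Optional[int] = None) -> Optional[int]:
    """Hypothetical helper following the documented fallback order of index."""
    if string in index_of:
        return index_of[string]
    if add_missing:
        # Missing string is appended and its new index returned.
        index_of[string] = len(strings)
        strings.append(string)
        return index_of[string]
    # Index of the unknown token if there is one, otherwise None.
    return unk


strings = ["[PAD]", "[UNK]", "cat"]
index_of = {s: i for i, s in enumerate(strings)}

print(lookup(strings, index_of, "cat"))                    # 2
print(lookup(strings, index_of, "dog", unk=1))             # 1
print(lookup(strings, index_of, "dog"))                    # None
print(lookup(strings, index_of, "dog", add_missing=True))  # 3
```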
indices
Convert a sequence of strings to vocabulary indices.
Parameters:
-
strings(Iterable[str]) –An iterable of strings to convert.
-
add_missing(bool) –Whether to add strings not present in the vocabulary.
Returns:
-
list[int | None]–A list of indices corresponding to the given strings. For each string not found in the vocabulary:
- if add_missing is True, the string is added to the end of the vocabulary and its index is returned;
- if the minnt.Vocabulary.UNK_TOKEN was added to the vocabulary, its index is returned;
- otherwise, None is returned.
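A round trip through a fixed vocabulary shows what the unknown-token fallback means in practice: strings that were never added come back as the unknown token. The sketch below uses plain Python containers; `to_indices` and `to_strings` are hypothetical stand-ins for the batch conversions, not the minnt methods themselves.

```python
UNK_TOKEN = "[UNK]"  # documented value of minnt.Vocabulary.UNK_TOKEN

vocab_strings = ["[PAD]", UNK_TOKEN, "the", "cat"]
index_of = {s: i for i, s in enumerate(vocab_strings)}
unk = index_of[UNK_TOKEN]


def to_indices(strings):
    """Hypothetical helper: map strings to indices, falling back to the unknown index."""
    return [index_of.get(s, unk) for s in strings]


def to_strings(indices):
    """Hypothetical helper: map indices back to strings."""
    return [vocab_strings[i] for i in indices]


encoded = to_indices(["the", "dog"])
decoded = to_strings(encoded)
print(encoded)  # [2, 1]
print(decoded)  # ['the', '[UNK]']
```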