How decision trees are used in Kaldi

Introduction

This page gives an overview of how phonetic decision trees are built and used in Kaldi and how this interacts with training and graph-building. For a description of the internals of the tree-building code, see Decision tree internals; for more details on our approach to building decoding graphs, see Decoding graph construction in Kaldi.

The basic algorithm being implemented is top-down greedy splitting: we have a number of ways we can split the data, by asking about, say, the left phone, the right phone, the central phone, the HMM-state we're in, and so on. The algorithm we implement is similar to the standard one; see, for example, the paper "Tree-based State Tying for High Accuracy Acoustic Modeling" by Young, Odell and Woodland. In this algorithm, we split the data by asking the locally optimal question, i.e. the one that gives the largest likelihood increase, supposing we model the data on each side of the split by a single Gaussian. Differences from standard implementations include added flexibility about how to configure the tree roots; the ability to ask questions about the HMM-state and the central phone; and the fact that, by default in the Kaldi scripts, the questions are automatically generated by a top-down binary clustering of the data, which means that it is not necessary to supply hand-generated questions.

Regarding the configuration of the tree roots: it is possible to have all the statistics from all the phones in a single shared group that gets split (including questions about the central phone and the HMM-states); to have individual phones, or HMM-states of phones, be the tree roots that get split; or to have groups of phones be the tree roots. For how to configure this using the standard scripts, see Data preparation. In practice we generally let each tree root correspond to a "real phone", meaning that we group together all word-position-dependent, tone-dependent or stress-dependent versions of each phone into one group that becomes a tree root.
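
To make the splitting criterion concrete, here is a minimal sketch (not the Kaldi implementation, which works in terms of Clusterable objects such as GaussClusterable) of how the likelihood gain of a candidate question could be evaluated from sufficient statistics under the single-Gaussian assumption; the names GaussStats, Objf and SplitGain are made up for illustration:

#include <cmath>
#include <cstddef>
#include <vector>

// Sufficient statistics for a set of frames, per feature dimension.
struct GaussStats {
  double count = 0.0;              // number of frames
  std::vector<double> sum;         // sum of x, per dimension
  std::vector<double> sumsq;       // sum of x^2, per dimension
};

// Log-likelihood of the frames under a diagonal Gaussian fit by maximum
// likelihood to these statistics (a variance floor would be needed in practice).
double Objf(const GaussStats &s) {
  if (s.count == 0.0) return 0.0;
  const double kLog2Pi = 1.8378770664093454;  // log(2 * pi)
  double ans = 0.0;
  for (std::size_t d = 0; d < s.sum.size(); d++) {
    double mean = s.sum[d] / s.count,
        var = s.sumsq[d] / s.count - mean * mean;
    ans += -0.5 * s.count * (kLog2Pi + std::log(var) + 1.0);
  }
  return ans;
}

// Gain in log-likelihood from splitting "all" into "yes" plus "no"; at each
// node the question giving the largest such gain is chosen greedily.
double SplitGain(const GaussStats &all, const GaussStats &yes,
                 const GaussStats &no) {
  return Objf(yes) + Objf(no) - Objf(all);
}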

The rest of this page mostly gives details at the code level of what is happening.

Phonetic context windows

Here we explain how we describe phonetic context in our code. A particular tree will have two integer values that describe the width and "central position" of the context window. The table below summarizes these values:

Name in code   Name in command-line arguments   Value (triphone)   Value (monophone)
N              --context-width=?                3                  1
P              --central-position=?             1                  0

N is the width of the context window and P is the identity of the designated "central phone". Normally P is exactly in the middle of the window (hence the name "central-position"); for example, with N=3, we would normally have P=1, but you are free to choose any value from 0 to N-1; for instance, P=2 and N=3 means two phones of left context and no right context at all. In the code, when we talk about the "central phone" we always mean the P'th phone which may or may not actually be the central phone of the context window.

A vector of integers representing a typical triphone context window might be:

// C++11-style initialization; assumes Kaldi's int32 type.
vector<int32> ctx_window = { 12, 15, 21 };

Assuming N=3 and P=1, this would represent phone 15 with a right context of 21 and a left context of 12. We handle end effects using zero, which is not a valid phone because it is reserved in OpenFst for epsilon, meaning "no symbol"; so for instance:

vector<int32> ctx_window = { 12, 15, 0 };

means phone 15 with a left-context of 12 and no right-context because it is the end of the utterance. At the end of an utterance in particular, the use of zero this way may be a little unexpected, because the last "phone" is actually the subsequential symbol "$" (see Making the context transducer); but for the convenience of the decision-tree code we put zero in these context windows rather than the subsequential symbol. Note that if we had N=3 and P=2, the above context window would be invalid because its P'th element would be zero, which is not a real phone; also, of course, if we had a tree with N=1, neither of the windows above would be valid because they are the wrong size. In the monophone case, we would have a window like:

vector<int32> ctx_window = { 15 };

so monophone systems are just treated as a special case of context-dependent systems, with a window size N of 1 and a tree that doesn't do anything very interesting.
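
As a small illustration of these conventions, the following hypothetical helper (not part of Kaldi) checks that a window is consistent with a given N and P: it must have exactly N entries and its P'th entry, the "central phone", must be a real (nonzero) phone; zeros may appear elsewhere to represent missing context at utterance boundaries.

#include <cstdint>
#include <vector>

typedef std::int32_t int32;  // stands in for Kaldi's int32 typedef.

bool ContextWindowIsValid(const std::vector<int32> &ctx_window,
                          int32 N, int32 P) {
  if (static_cast<int32>(ctx_window.size()) != N) return false;  // wrong width.
  if (P < 0 || P >= N) return false;                             // P out of range.
  return ctx_window[P] > 0;  // the central phone must be a real phone, not zero.
}

For example, with N=3 and P=1 the window { 12, 15, 0 } above would pass this check, but with P=2 it would not.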

The tree building process

In this section we give an overview of the tree-building process in Kaldi.

Even a monophone system has a decision tree, but a trivial one; see the functions MonophoneContextDependency() and MonophoneContextDependencyShared() which return such trivial trees. These are called by the command-line program gmm-init-mono; its main input is the HmmTopology object and it outputs the tree, normally written as an object of type ContextDependency to a file called "tree", and also the model file (the model file contains a TransitionModel object and an AmDiagGmm object). If the program gmm-init-mono receives an option called --shared-phones, it will share the pdfs between specified sets of phones; otherwise it makes all the phones separate.

After training a monophone system starting from a flat start, we take the monophone alignments and use the function AccumulateTreeStats() (called from acc-tree-stats) to accumulate statistics for training the tree. This program is not limited to reading in monophone alignments; it works from context-dependent alignments too, so we can build trees based on e.g. triphone alignments. The statistics for tree building are written to disk as the type BuildTreeStatsType (see Statistics for building the tree). The function AccumulateTreeStats() takes the values N and P, as explained in the previous section; the command-line programs will set these by default to 3 and 1 respectively, but this can be overridden using the --context-width and --central-position options. The program acc-tree-stats takes a list of context-independent phones (e.g. silence), but this is not required even if there are context-independent phones; it is just a mechanism to reduce the size of the statistics. For context-independent phones, the program will accumulate the corresponding statistics without the keys corresponding to the left and right phones defined (cf. Event maps).
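
The layout of these statistics is easiest to see from the typedefs involved; they are roughly as follows (as in the Kaldi headers event-map.h and build-tree-utils.h; here int32 is Kaldi's 32-bit integer type and Clusterable is Kaldi's base-class for accumulatable statistics):

#include <utility>
#include <vector>

typedef int32 EventKeyType;    // a "key", e.g. -1 for the pdf-class, 0 ... N-1 for phone positions.
typedef int32 EventValueType;  // a "value", e.g. a phone-id or a pdf-class.

// An "event" is a sorted list of (key, value) pairs, e.g.
// { (-1, pdf-class), (0, left-phone), (1, central-phone), (2, right-phone) }.
typedef std::vector<std::pair<EventKeyType, EventValueType> > EventType;

// The tree-building statistics pair each event with accumulated
// single-Gaussian statistics, stored as a Clusterable object.
typedef std::vector<std::pair<EventType, Clusterable*> > BuildTreeStatsType;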

When the statistics have been accumulated we use the program build-tree to build the tree. The program build-tree requires three inputs:

  • The statistics (of type BuildTreeStatsType)
  • The questions config (of type Questions)
  • The roots file (see below)

The statistics would typically come from the program acc-tree-stats; the questions configuration class would be output by the compile-questions program, which takes in a topology and a list of phonetic questions (in our scripts, these are automatically obtained from the tree-building statistics by the program cluster-phones). The roots file specifies sets of phones that are going to have shared roots in the decision-tree clustering process, and says for each phone set the following two things:

  • "shared" or "not-shared" says whether or not there should be separate roots for each of the pdf-classes (i.e. HMM-states, in the typical case), or if the roots should be shared. If it says "shared" there will be a single tree-root for all HMM states (e.g. all three states, in a normal topology) ; if "not-shared" there would be (e.g.) three tree-roots, one for each pdf-class.
  • "split" or "not-split" says whether or not the decision tree splitting should actually be done for the roots in question (for silence, we typically don't split). If the line says "split" (the normal case) then we do the decision tree splitting. If it says "not-split" then no splitting is done and the roots are left un-split.

The following will clarify some aspects of how this works:

  • If we say "shared split", then even though there is one root node for all three HMM-states, the different HMM states can still get different leaves because the tree can ask questions about the pdf-class as well as about phonetic context.
  • We always share the roots of all the phones that appear on a single line of the roots file. This is not configurable via these strings, because if you don't want to share the phones' roots you can just put them on separate lines of the roots file.

Below is an example of a roots file; this assumes that phone 1 is silence and all the other phones have separate roots.

not-shared not-split 1
shared split 2
shared split 3
...
shared split 28

Having multiple phones on the same line is most useful when we have things like position and stress-dependent phones; in this case each "real" phone would correspond to a set of integer phone ids. In that case we share the roots for all versions of a particular underlying phone. Below is an example of a roots file for Wall Street Journal, from the egs/wsj/s5 scripts (this is in text, not integer form; it would have to be converted to integer form before being read by Kaldi):

not-shared not-split SIL SIL_B SIL_E SIL_I SIL_S SPN SPN_B SPN_E SPN_I SPN_S NSN NSN_B NSN_E NSN_I NSN_S
shared split AA_B AA_E AA_I AA_S AA0_B AA0_E AA0_I AA0_S AA1_B AA1_E AA1_I AA1_S AA2_B AA2_E AA2_I AA2_S
shared split AE_B AE_E AE_I AE_S AE0_B AE0_E AE0_I AE0_S AE1_B AE1_E AE1_I AE1_S AE2_B AE2_E AE2_I AE2_S
shared split AH_B AH_E AH_I AH_S AH0_B AH0_E AH0_I AH0_S AH1_B AH1_E AH1_I AH1_S AH2_B AH2_E AH2_I AH2_S
shared split AO_B AO_E AO_I AO_S AO0_B AO0_E AO0_I AO0_S AO1_B AO1_E AO1_I AO1_S AO2_B AO2_E AO2_I AO2_S
shared split AW_B AW_E AW_I AW_S AW0_B AW0_E AW0_I AW0_S AW1_B AW1_E AW1_I AW1_S AW2_B AW2_E AW2_I AW2_S
shared split AY_B AY_E AY_I AY_S AY0_B AY0_E AY0_I AY0_S AY1_B AY1_E AY1_I AY1_S AY2_B AY2_E AY2_I AY2_S
shared split B_B B_E B_I B_S
shared split CH_B CH_E CH_I CH_S
shared split D_B D_E D_I D_S

When creating the roots file, you should ensure that at least one phone on each line is seen in the training data. For instance, in this example we would be OK as long as the phone AY was seen in at least some combination of stress and word-position.

In this example, we have various word-position-dependent variants of silence and so on. They will all share their pdfs because they are on the same line and are "not-split", but they may have different transition parameters. In fact, most of these variants of silence would never be used, as silence never appears inside words; this is just future-proofing the setup in case someone does something unusual in the future.

We do the initial phases of Gaussian mixing up using the alignments from the previous (e.g. monophone) build; the alignments are converted from one tree to another using the program convert-ali.

PDF identifiers

The PDF identifier (pdf-id) is a number, starting from zero, that is used as an index for the probability distribution function (p.d.f.). Each p.d.f. in the system has its own pdf-id, and these are contiguous (typically there are several thousand of these in an LVCSR system). They are originally assigned when the tree is first built. Depending how the tree is built, it may or may not be possible to say, for each pdf-id, which phone it corresponds to.

Context dependency objects

The ContextDependencyInterface object is a virtual base-class for the tree that specifies how it interacts with the graph-building code. This interface contains only four functions:

  • ContextWidth() returns the value of N (context-width) that the tree requires.
  • CentralPosition() returns the value of P (central-position) that the tree requires.
  • NumPdfs() returns the number of pdfs defined by the tree; these are numbered from zero to NumPdfs()-1.
  • Compute() is the function that computes the pdf-id for a particular context (and pdf-class).

The ContextDependencyInterface::Compute() function is declared as follows:

class ContextDependencyInterface {
  ...
  virtual bool Compute(const std::vector<int32> &phoneseq, int32 pdf_class,
                       int32 *pdf_id) const;
};

It returns true if it was able to compute the pdf-id for this context and pdf-class. A return value of false will indicate some kind of error or mismatch. An example of using this function is:

ContextDependencyInterface *ctx_dep = ... ;
vector<int32> ctx_window = { 12, 15, 21 };  // C++11-style initialization.
int32 pdf_class = 1;  // probably the central state of a 3-state HMM.
int32 pdf_id;
if (!ctx_dep->Compute(ctx_window, pdf_class, &pdf_id))
  KALDI_ERR << "Something went wrong!";
else
  KALDI_LOG << "Got pdf-id, it is " << pdf_id;

The only class that currently inherits from ContextDependencyInterface is the class ContextDependency, which has a marginally richer interface; the only important addition is the function GetPdfInfo, which is used by the TransitionModel class to work out which phones a particular pdf can possibly correspond to (this function could be emulated given only the interface of ContextDependencyInterface, by enumerating all contexts).
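
As a sketch of what such an emulation might look like (hypothetical code, not the actual GetPdfInfo implementation), one could enumerate all phone contexts and pdf-classes, call Compute() on each, and invert the mapping; this assumes a triphone tree with N=3 and P=1 and ignores boundary (zero) contexts for simplicity:

#include <set>
#include <vector>

// Returns, for each pdf-id, the set of central phones it can correspond to,
// using only the ContextDependencyInterface.  Assumes N = 3, P = 1.
std::vector<std::set<int32> > PdfToCentralPhones(
    const ContextDependencyInterface &ctx_dep,
    int32 num_phones,         // phones are numbered 1 ... num_phones.
    int32 num_pdf_classes) {  // e.g. 3 for a typical topology.
  std::vector<std::set<int32> > pdf_to_phones(ctx_dep.NumPdfs());
  for (int32 left = 1; left <= num_phones; left++)
    for (int32 central = 1; central <= num_phones; central++)
      for (int32 right = 1; right <= num_phones; right++)
        for (int32 pdf_class = 0; pdf_class < num_pdf_classes; pdf_class++) {
          std::vector<int32> window = { left, central, right };
          int32 pdf_id;
          if (ctx_dep.Compute(window, pdf_class, &pdf_id))
            pdf_to_phones[pdf_id].insert(central);
        }
  return pdf_to_phones;
}

This is cubic in the number of phones, which is one reason the ContextDependency class provides GetPdfInfo directly.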

The ContextDependency object is actually a fairly thin wrapper for the EventMap object; see Decision tree internals. We wanted to hide the actual implementation of the tree as much as possible to make it easy to refactor the code later if needed.

An example of a decision tree

The decision-tree file format was not created with human readability as the first priority, but due to popular demand we will try to explain how to interpret this file. Look at the example below, which is a triphone tree from the Wall Street Journal recipe. It starts with ContextDependency, which is the name of the object; then N (the context-width), which is 3; then P (the "central position" of the context window), which is 1, i.e. the middle of the three context positions, since we use zero-based numbering.

The rest of the file contains a single EventMap object. EventMap is a polymorphic type which may contain pointers to other EventMap objects. See Event maps for more details; it is a representation of a decision tree, or set of decision trees, that maps from a set of key-value pairs (e.g. left-phone=5, central-phone=10, right-phone=11, pdf-class=2) to a pdf-id (e.g. 158). Briefly, it comes in three types: a SplitEventMap (like a split in a decision tree), a ConstantEventMap (like a leaf of a decision tree, containing just a number representing a pdf-id), and a TableEventMap (which is like a table lookup containing other EventMaps). The SplitEventMap and TableEventMap both have a "key" that they query, which in this case would be 0, 1 or 2, corresponding to left, central or right context, or -1, corresponding to the identity of the "pdf-class". Normally the value of the pdf-class is the same as the index of the HMM state, i.e. 0, 1 or 2. Try not to get confused by this: the key is -1, but the value is 0, 1 or 2, and this has no connection to the 0, 1 or 2 that are the keys for the phones in the context window. The SplitEventMap has a set of values that will trigger the "yes" branch of the tree. Below is a kind of quasi-BNF notation that explains the tree-file format.

         EventMap := ConstantEventMap | SplitEventMap | TableEventMap | "NULL"
 ConstantEventMap := "CE" <numeric pdf-id>
    SplitEventMap := "SE" <key-to-split-on> "[" yes-value-list "]" "{" EventMap EventMap "}"
    TableEventMap := "TE" <key-to-split-on> <table-size> "(" EventMapList ")"

In the example below, the top-level EventMap of the tree is a SplitEventMap (SE) that splits on key 1, which is the central phone. In square brackets is a contiguous range of phone-ids. As it happens, these don't represent a question, but are just a way of splitting on the phones so we can get to the "real" decision trees, which are per phone. The issue is that this tree was built with "shared roots", so there are various phone-ids, corresponding to different word-position-and-stress-marked versions of the same phone, that share the root. We can't use a TableEventMap (TE) at the top level of the tree, or we'd have to repeat each decision tree several times (since the EventMap is a pure tree, not a general graph, it has no mechanism for pointers to be "shared"). The next few instances of the "SE" label are also part of this "quasi-tree" that is initially splitting on the central phone (as we go down the file we are going deeper into the tree; notice that the braces "{" are opening but not yet closing).

Then we have the string "TE -1 5 ( CE 0 CE 1 CE 2 CE 3 CE 4 )", which represents a split with a TableEventMap on the pdf-class "-1" (effectively, the HMM-position), returning values 0 through 4. These values are the five pdf-ids for the silence and noise phones SIL, NSN and SPN; in our setup, the pdfs are shared between these three non-speech phones (only the transition matrix is specific to each non-speech phone). Note: we have a 5-state rather than 3-state HMM for these phones, hence 5 different pdf-ids.

Next is "SE -1 [ 0 ]", which can be considered the first "real" question in the tree. We can see from the SE questions above it that it applies when the central phone takes values 5 through 19, which are various versions of the phone AA; and the question is asking whether the pdf-class (key -1) has value 0 (i.e. the leftmost HMM-state). Assuming the answer is "yes", the next question is "SE 2 [ 220 221 222 223 ]", which asks whether the phone to the right is one of various forms of the phone "M" (a rather unintuitive question to ask, since we're in the leftmost HMM-state); if yes, we ask "SE 0 [ 104 105 106 107... 286 287 ]", which is a question about the phone to the left; if yes, the pdf-id is 5 ("CE 5"), and if no, 696 ("CE 696").

s3# copy-tree --binary=false exp/tri1/tree - 2>/dev/null | head -100
ContextDependency 3 1 ToPdf SE 1 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 \
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59\
 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 9\
3 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 1\
20 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 14\
5 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170\
 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 \
196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 ]
{ SE 1 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34\
 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 6\
8 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 10\
1 102 103 104 105 106 107 108 109 110 111 ]
{ SE 1 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34\
 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 ]
{ SE 1 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ]
{ SE 1 [ 1 2 3 ]
{ TE -1 5 ( CE 0 CE 1 CE 2 CE 3 CE 4 )
SE -1 [ 0 ]
{ SE 2 [ 220 221 222 223 ]
{ SE 0 [ 104 105 106 107 112 113 114 115 172 173 174 175 208 209 210 211 212 213 214 215 264 265 266 \
267 280 281 282 283 284 285 286 287 ]
{ CE 5 CE 696 }
SE 2 [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 132 \
133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 248 249 250 251 252 253 254 255 256 257 2\
58 259 260 261 262 263 268 269 270 271 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 30\
3 ]

Below is a simpler example: the monophone tree from the Resource Management recipe. The top-level EventMap is a TableEventMap ("TE 0 49 ..."). The key "0" is the phone-position of zero which represents the central (and only) phone since the context width (N) is 1. The number of entries in the table is 49 (in this case, the number of phones plus one). The first EventMap in the table (index zero) is NULL, because there is no phone with index zero. The next one is a TableEventMap with three elements, corresponding to the three HMM-states (technically, pdf-classes) of the first phone: "TE -1 3 ( CE 0 CE 1 CE 2 )".

s3# copy-tree --binary=false exp/mono/tree - 2>/dev/null| head -5
ContextDependency 1 0 ToPdf TE 0 49 ( NULL TE -1 3 ( CE 0 CE 1 CE 2 )
TE -1 3 ( CE 3 CE 4 CE 5 )
TE -1 3 ( CE 6 CE 7 CE 8 )
TE -1 3 ( CE 9 CE 10 CE 11 )
TE -1 3 ( CE 12 CE 13 CE 14 )

The ilabel_info object

The CLG graph (see Decoding graph construction in Kaldi) has symbols on its input side that represent context-dependent phones (as well as disambiguation symbols and possibly epsilon symbols). In the graph, as always, these are represented by integer labels. We use an object that, in code and in filenames, is generally called ilabel_info. The ilabel_info object has a strong connection to the ContextFst objects; see The ContextFst object. As with many other Kaldi types, ilabel_info is a generic (STL) type, but we use a consistent variable name to make it identifiable. It is of the following type:

std::vector<std::vector<int32> > ilabel_info;

It is a vector, indexed by the FST input label, that gives for each input label the corresponding phonetic context window (see above, Phonetic context windows). For example, if symbol 1500 were phone 30 with a right-context of 12 and a left-context of 4, we would have:

// not valid C++
ilabel_info[1500] == { 4, 30, 12 };

In the monophone case, we would have things like:

ilabel_info[30] == { 28 };

There are special cases to deal with disambiguation symbols (see Disambiguation symbols or the Springer Handbook paper referenced above for an explanation of what these are). If an ilabel_info entry corresponds to a disambiguation symbol, we store in it the negative of the symbol-table entry of the disambiguation symbol (note that this is not the same as the number in the printed form of the disambiguation symbol, as in #0, #1, #2 etc.; it is the number corresponding to it in a symbol-table file, which in our current scripts is called phones_disambig.txt). For example,

ilabel_info[5] == { -42 };

would mean that symbol number 5 on the input of CLG corresponds to the disambiguation symbol whose integer id is 42. We negate these for scripting convenience, so the programs that interpret the ilabel_info object don't need to be given a list of disambiguation symbols in order to distinguish them from real phones in the monophone case. There are two additional special cases: we have

ilabel_info[0] == { }; // epsilon
ilabel_info[1] == { 0 }; // disambig symbol #-1;
// we use symbol 1, but don't consider this hardwired.

The first of these is the normal epsilon symbol, and we give it an empty vector as its ilabel_info entry. This symbol would not normally appear on the left of CLG. The second is a special disambiguation symbol that in printed form is called "#-1". We use it where epsilons are used on the input of the C transducer in the normal (Springer Handbook) recipe; it is needed to ensure determinizability of CLG in the presence of words with empty phonetic representations.
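
The following hypothetical helper (not part of Kaldi) summarizes how an entry of ilabel_info is interpreted under the conventions just described; int32 is Kaldi's 32-bit integer type:

#include <vector>

// What kind of symbol a given input label of CLG represents.
enum IlabelKind { kEpsilon, kDisambigHashMinusOne, kDisambig, kPhoneInContext };

IlabelKind ClassifyIlabel(const std::vector<std::vector<int32> > &ilabel_info,
                          int32 label) {
  const std::vector<int32> &entry = ilabel_info[label];
  if (entry.empty()) return kEpsilon;                                   // e.g. label 0.
  if (entry.size() == 1 && entry[0] == 0) return kDisambigHashMinusOne; // the "#-1" symbol.
  if (entry.size() == 1 && entry[0] < 0) return kDisambig;              // its id is -entry[0].
  return kPhoneInContext;  // a phone context window of size N (size 1 in the monophone case).
}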

The program fstmakecontextsyms is able to create a symbol-table that corresponds to the printed form of the ilabel_info object; this is mainly of use for debugging and diagnostics.

As you can see, the way the ilabel_info object handles disambiguation symbols is not very pleasant, but there are not many parts of the code that interact closely with it: only the fst::ContextFst class (and related things; see The ContextFst object), the program fstmakecontextsyms.cc, and some functions listed in Classes and functions for creating FSTs from HMMs. The ContextDependency object, in particular, only sees valid sequences of length N that represent phone context windows.