Just my 2 cents on what might be going on internally with your Pattern and the Tokenizer.
The entire architecture of SIML has been designed and tested for LTR (left-to-right) languages, and the default SIML Tokenizer is tailored to them. Hebrew, however, is an RTL (right-to-left) language.
Let me take your Pattern as an example.
כמה משתמשים במערכת *
In the above, the fragment כמה משתמשים במערכת is RTL, whereas the symbol * is LTR. Despite being LTR, the symbol is expected to capture an RTL text fragment as per your Pattern. This is where the problem with the current Tokenizer comes in.
After the tokenization is completed the list of tokens would be arranged in the following order (ascending index):
- 0. כמה
- 1. משתמשים
- 2. במערכת
- 3. * (you would expect this symbol to be first in the list)
The above order is actually correct, given that the words are tokenized in their logical order and not their display order.
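To make the logical-order point concrete, here is a small Python sketch. It is not SIML's Tokenizer; it assumes a plain whitespace split as a stand-in, and it uses Python's standard `unicodedata` module to show each token's Unicode bidi class. One nuance worth knowing: Unicode classifies `*` as a neutral character (`ON`) rather than a strongly LTR one, so where it appears on screen depends on the surrounding text, while its position in the stored string does not change.

```python
import unicodedata

# The Pattern as it is stored in memory (logical order).
pattern = "כמה משתמשים במערכת *"

# Assumption: a plain whitespace split stands in for the real Tokenizer.
tokens = pattern.split()

# Bidi class of each token's first character:
# 'R' = strong right-to-left, 'ON' = other neutral.
bidi_classes = [unicodedata.bidirectional(tok[0]) for tok in tokens]

for index, (tok, cls) in enumerate(zip(tokens, bidi_classes)):
    print(index, tok, cls)
```

Whatever the display looks like in your editor, the wildcard ends up at the highest index because it is the last thing in the stored string.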
As per the generated decision tree nodes, the tokenized pattern now expects to capture a wildcard entry at the end of your Hebrew sentence instead of the beginning. This is why a combination of RTL + LTR tokens may generate shuffled decision tree nodes, giving unexpected results.
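Here is a rough Python sketch of why the wildcard position matters when matching. The `match` function is a hypothetical toy matcher, not SIML's decision tree, and it assumes that `*` captures one or more tokens. The extra word היום in the sample inputs is made up for illustration.

```python
def match(pattern, tokens):
    """Return the list of wildcard captures, or None if there is no match.

    Toy matcher for illustration only; assumes '*' captures one or more tokens.
    """
    if not pattern:
        return [] if not tokens else None
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        # Try every possible capture length, shortest first.
        for n in range(1, len(tokens) + 1):
            tail = match(rest, tokens[n:])
            if tail is not None:
                return [tokens[:n]] + tail
        return None
    if tokens and tokens[0] == head:
        return match(rest, tokens[1:])
    return None


# Token order produced from the Pattern: wildcard at the last index.
wildcard_last = ["כמה", "משתמשים", "במערכת", "*"]
# Token order you may have intended: wildcard at the first index.
wildcard_first = ["*", "כמה", "משתמשים", "במערכת"]

extra_at_end = "כמה משתמשים במערכת היום".split()
extra_at_start = "היום כמה משתמשים במערכת".split()

print(match(wildcard_last, extra_at_end))     # captures the trailing word
print(match(wildcard_last, extra_at_start))   # no match
print(match(wildcard_first, extra_at_start))  # captures the leading word
```

With the wildcard stored last, only input that has its extra words at the logical end of the sentence gets captured, which is consistent with the behaviour described above.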