Description
In the current code, quotation marks are removed from the sentence.
Actually, they are converted to whitespace. This means they serve as word separators.
For example:
This"is a test"
is converted to:
This is a test
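For illustration, here is a minimal sketch of what the described current behavior amounts to (this is not the actual library code, just an assumption of the effect): every double quote is overwritten with a blank before whitespace-based tokenization.

```c
#include <stdio.h>

/* Illustrative only: mimic the described current behavior by
 * overwriting every double quote with a blank, so it ends up
 * acting as a word separator for whitespace-based tokenization. */
static void quotes_to_blanks(char *s)
{
    for (; *s != '\0'; s++)
    {
        if (*s == '"') *s = ' ';
    }
}

int main(void)
{
    char sent[] = "This\"is a test\"";
    quotes_to_blanks(sent);
    printf("[%s]\n", sent);   /* prints: [This is a test ] */
    return 0;
}
```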
In addition, a word just after a quotation mark is considered to be in a capitalizable position.
This is true even for a closing quotation mark... including when it is a "right mark".
In my new tokenization code, quotes are tokenized. They are defined in RPUNC and LPUNC, so they get stripped off words from their LHS and RHS. However, this does not preserve the "separator" behavior that exists in the current code.
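To make the difference concrete, here is a hedged sketch of pure edge-stripping (a hypothetical helper, not the project's tokenizer; the LPUNC/RPUNC character lists below are assumptions). A quote embedded in the middle of a whitespace token stays attached to the word:

```c
#include <stdio.h>
#include <string.h>

/* Assumed strippable characters, standing in for the LPUNC/RPUNC
 * affix classes; the real lists live in the language's affix file. */
static const char *lpunc_chars = "\"(";
static const char *rpunc_chars = "\"),.";

/* Peel strippable characters off the two ends of a whitespace token.
 * Anything in the middle, such as the quote in qwerty"yuiop", is left
 * attached to the word, so the old "separator" effect is lost. */
static void strip_edges(const char *tok)
{
    size_t start = 0, end = strlen(tok);

    while (start < end && strchr(lpunc_chars, tok[start]))
        start++;                      /* would be emitted as LPUNC tokens */
    while (end > start && strchr(rpunc_chars, tok[end - 1]))
        end--;                        /* would be emitted as RPUNC tokens */

    printf("word: %.*s\n", (int)(end - start), tok + start);
}

int main(void)
{
    strip_edges("qwerty\"yuiop\"");   /* word: qwerty"yuiop */
    return 0;
}
```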
In English (and I guess some other languages, maybe many) this seems to be a desired behavior,
because if we see qwerty"yuiop" we may guess it is actually qwerty "yuiop".
However, doing this in general for Hebrew would be wrong, as the double quote U+0022 is a de facto replacement for the Hebrew character "gershayim", which can be an integral part of Hebrew words (since
gershayim is not found on the Hebrew keyboard). It is also a de facto replacement for Hebrew quotation marks (for the same reason). So general tokenization code cannot blindly use it as a word separator.
In order to solve this, I would like to introduce an affix class WORDSEP, which will be a list of characters to be used as word separators, with blank remaining the default. Characters listed there will still be able to appear in other affix classes and thus also serve as tokens.
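As a rough sketch of the proposal (the WORDSEP class and this helper are hypothetical; nothing like it exists in the code yet, and a real implementation would need to handle multibyte characters), the tokenizer's separator test could consult a per-language character list in addition to blank:

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical: extra separators as they might be listed in a WORDSEP
 * affix class.  An English affix file could list the double quote here,
 * while a Hebrew one would leave it out so that U+0022 can remain part
 * of words (gershayim) or serve only as a strippable LPUNC/RPUNC token. */
static const char *wordsep_chars = "\"";

static bool is_word_separator(char c)
{
    if (c == '\0') return false;
    if (isspace((unsigned char)c)) return true;   /* blank is the default */
    return strchr(wordsep_chars, c) != NULL;      /* WORDSEP additions */
}
```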
Is this solution sensible?
Another option is just not to use the quote as a word separator, at least in the first version of "quotation mark as token" (this is what my current code does).
Regarding the capitalizable position after a closing quote: for now I will mostly preserve this behavior in the hard-coded capitalization handling, since we are going to try to implement capitalization using the dict anyway.