
Handling quotation marks #42

Open
@ampli


In the current code, quotation marks are removed from the sentence.
More precisely, they are converted to whitespace, which means they act as a word separator.
For example:
This"is a test"
is converted to:
This is a test
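
A minimal sketch of the behavior described above, not the library's actual code. The function name and the set of quote characters in `QUOTES` are assumptions made for illustration.

```python
# Hypothetical illustration of "quotes become whitespace"; the real code may differ.
QUOTES = '"\u201c\u201d'  # assumed set: ASCII double quote plus curly double quotes


def quotes_to_whitespace(sentence: str) -> list[str]:
    """Replace every quotation mark with a blank, then split on whitespace."""
    table = str.maketrans({q: " " for q in QUOTES})
    return sentence.translate(table).split()


print(quotes_to_whitespace('This"is a test"'))
# -> ['This', 'is', 'a', 'test']
```

Because the quote is turned into a blank before splitting, `This"is` ends up as two words, which is the separator effect in question.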

In addition, a word immediately after a quotation mark is considered to be in a capitalizable position.
This is true even for a closing quotation mark, including when it is a "right mark".
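
The capitalization rule can be sketched roughly as follows. This is only an illustration: the function name is hypothetical, the quote set is assumed, and treating the sentence-initial word as capitalizable is my assumption rather than something stated here.

```python
import re

# Hypothetical illustration; the quote set is an assumption.
QUOTES = '"\u201c\u201d'


def capitalizable_positions(sentence: str) -> list[tuple[str, bool]]:
    """Return (word, capitalizable) pairs.

    A word is flagged when it starts the sentence (assumed) or directly
    follows any quotation mark, whether opening or closing.
    """
    quote_class = re.escape(QUOTES)
    # Match either a single quote character or a run of non-blank, non-quote characters.
    pattern = "[" + quote_class + "]|[^\\s" + quote_class + "]+"
    result = []
    capitalizable = True  # assumed: the first word is capitalizable
    for piece in re.findall(pattern, sentence):
        if piece in QUOTES:
            capitalizable = True  # the next word follows a quote
        else:
            result.append((piece, capitalizable))
            capitalizable = False
    return result


print(capitalizable_positions('He said "hello" there'))
# -> [('He', True), ('said', False), ('hello', True), ('there', True)]
```

Note that `there` is flagged even though it follows a closing quote, which is the behavior described above.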

In my new tokenization code, quotes are tokenized. They are defined in RPUNC and LPUNC, so they get stripped off words from their LHS and RHS. However, this doesn't preserve the "separator" behavior that exists in the current code.
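
To make the difference concrete, here is a rough sketch of affix stripping in the style of LPUNC/RPUNC. It is not the new tokenizer itself: the function names and the sample contents of the two lists are assumptions.

```python
# Hypothetical sample affix lists; the real LPUNC/RPUNC contents may differ.
LPUNC = ['"', "(", "\u201c"]
RPUNC = ['"', ")", ",", ".", "\u201d"]


def strip_punc(word: str) -> list[str]:
    """Peel LPUNC characters off the left and RPUNC characters off the right."""
    left, right = [], []
    while word and word[0] in LPUNC:
        left.append(word[0])
        word = word[1:]
    while word and word[-1] in RPUNC:
        right.insert(0, word[-1])
        word = word[:-1]
    return left + ([word] if word else []) + right


def tokenize(sentence: str) -> list[str]:
    """Split on whitespace only, then strip affixes from each word."""
    tokens = []
    for word in sentence.split():
        tokens.extend(strip_punc(word))
    return tokens


print(tokenize('This"is a test"'))
# -> ['This"is', 'a', 'test', '"']
```

The trailing quote becomes a token, but the quote inside `This"is` is neither stripped nor used as a separator, so the word stays glued together.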

In English (and I guess in some other languages, maybe many) the separator behavior seems to be desired,
because if we see qwerty"yuiop" we may guess it is actually qwerty "yuiop".

However, doing this in general would be wrong for Hebrew, because the double quote U+0022 is a de facto replacement for the Hebrew character "gershayim", which can be an integral part of Hebrew words (gershayim is not found on the Hebrew keyboard). It is also a de facto replacement for Hebrew quotation marks (for the same reason). So general tokenization code cannot blindly use it as a word separator.

To solve this, I would like to introduce an affix class WORDSEP, which will be a list of characters to be used as word separators, with blank as the default. Characters listed there can still be listed in other affix classes and thus serve as tokens.
Is this solution sensible?
Another option is just not to use it as a word separator, at least in the first version of "quotation mark as token" (this is what my current code does).
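
A possible shape for the WORDSEP idea, as a sketch only. Passing the separator list as a function argument stands in for a per-language affix class; the function name and that interface are assumptions, not an existing API.

```python
import re


def split_on_wordsep(sentence: str, wordsep: list[str]) -> list[str]:
    """Split on whitespace, then on any non-blank WORDSEP character.

    Non-blank separators are kept as separate pieces, so a later
    affix-stripping pass could still treat them as tokens.
    """
    seps = [c for c in wordsep if not c.isspace()]
    chunks = sentence.split()
    if not seps:
        return chunks  # default: blank is the only separator
    # The capturing group makes re.split keep the separator characters.
    pattern = "([" + re.escape("".join(seps)) + "])"
    pieces = []
    for chunk in chunks:
        pieces.extend(p for p in re.split(pattern, chunk) if p)
    return pieces


# Hypothetical English setting: the double quote also separates words.
print(split_on_wordsep('qwerty"yuiop"', [" ", '"']))
# -> ['qwerty', '"', 'yuiop', '"']

# Hypothetical Hebrew setting: only blank separates, so a word written
# with U+0022 standing in for gershayim stays intact.
print(split_on_wordsep('צה"ל', [" "]))
# -> ['צה"ל']
```

With this shape, the alternative of not using the quote as a separator at all corresponds to leaving WORDSEP at its default.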

Regarding the capitalizable position after a closing quote: for now I will mostly preserve this behavior in the hard-coded capitalization handling, because we are going to try to implement capitalization using the dict.
