Let and be tokens of length N and M respectively. Each token in and in have length and respectively.
The alignment of to is such that $ \forall j \in AL_{AB,i} => B_j \cap A_i $. ( means t partially matches s.)
For example, .
The goal of this algorithm is to find such and
- Normalize tokens in the unicode normalization form "NFKD", then lowercase all characters.
- Concatenate all tokens and to generate and respectively
- Calculate shortest path on edit graph of and
- Get character mapping and from the edit graph
- Get and from the character alignments and
Details:
- Normalize tokens in the unicode normalization form "NFKD"
To compare the token positions, we must compare each characters in tokens. Because the two tokenizations may be partially different, we normalize them in "NFKD" and lowercase them first.
- Concatenate all tokens and to generate and respectively
Before calculating the edit graph, we combine tokens into text. For example, if we have tokens ["Foo", "bar"]
, we concatenate them into one text Foobar
.
- Calculate shortest path on edit graph from and
We calculate the shortest path on edit graph from texts and to get character map between them. The path can be calculated, for example, by Myers' algorighm
- Get character alignments and from the edit graph
Let and be the i-th and j-th character in the text and , respectively. is a mapping from to such that . For example, then .
We can calculate and from the shortest path on the edit graph. If there exists diagonal edge in the path, and . If there doesn't exist any diagonal edge to then .
- Get and from the character alignments and
Now we can calculate the desired and from the previous calculated and .