dgw (Member) commented May 9, 2021

Description

I noticed that fetching etymology from Wiktionary would fail sometimes, even though the section was there. Tracked down the exact code path with the help of pdb. Two related fixes:

  • Don't check for extraneous id attributes in the HTML when in etymology mode (in wikt() function)
  • Allow for multiple citations (<sup>/</sup> pairs) in one line when stripping HTML (in text() function)
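A minimal sketch of the first fix, assuming a line-by-line filter. The `keep_line` helper, the mode names, and the sample HTML are hypothetical illustrations, not the plugin's actual code:

```python
def keep_line(line: str, mode: str) -> bool:
    """Hypothetical filter: decide whether a line of HTML is kept for parsing."""
    # In etymology mode the whole paragraph is wanted, id attributes included.
    if mode == "etymology":
        return True
    # In definition modes, markup carrying an id attribute is skipped.
    return 'id="' not in line


# Made-up sample lines, for illustration only.
lines = [
    '<p>From Middle English <span id="cite_ref-1">word</span>.</p>',
    "<li>A plain definition.</li>",
]

etymology_kept = [line for line in lines if keep_line(line, "etymology")]
definition_kept = [line for line in lines if keep_line(line, "noun")]
```

With this filter, the etymology paragraph survives even though it contains `id="`, while definition modes still drop such lines.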

Checklist

  • I have read CONTRIBUTING.md
  • I can and do license this contribution under the EFLv2
  • No issues are reported by make qa (runs make quality and make test)
  • I have tested the functionality of the things this change touches

Notes

I really want to convert this to a proper parser, but don't want to further delay 7.1. Onto the idea pile it goes.

dgw added 2 commits May 7, 2021 15:23

  • Excluding markup with 'id="' in it is only for definition modes. When
    looking for etymology, we always want the whole paragraph.
  • Long etymology might have multiple citations. Greedy matching selects
    everything between the first and last citation, which we do not want;
    such a match can remove a large portion of desired text.
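The greedy-matching problem described in the second commit can be reproduced with a short sketch. The two patterns and the sample HTML below are assumptions chosen for illustration; the plugin's actual expressions may differ:

```python
import re

# Greedy: ".+" runs from the first <sup ...> to the LAST </sup> on the line.
GREEDY_SUP = re.compile(r"<sup[^>]*>.+</sup>")
# Non-greedy: ".+?" stops at the first </sup>, so each citation matches separately.
LAZY_SUP = re.compile(r"<sup[^>]*>.+?</sup>")

# Made-up etymology line with two citations.
html = (
    'From Latin <i>aqua</i><sup class="reference">[1]</sup>, '
    'akin to Gothic <i>ahwa</i><sup class="reference">[2]</sup>.'
)

greedy_result = GREEDY_SUP.sub("", html)
lazy_result = LAZY_SUP.sub("", html)

print(greedy_result)  # From Latin <i>aqua</i>.
print(lazy_result)    # From Latin <i>aqua</i>, akin to Gothic <i>ahwa</i>.
```

The greedy pattern deletes everything between the two citations, including the wanted text about the Gothic cognate; the non-greedy pattern removes only the citation markers.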
dgw added the Medium Priority and Bugfix labels May 9, 2021
dgw added this to the 7.1.0 milestone May 9, 2021
dgw requested a review from a team May 9, 2021 05:46
dgw merged commit 93f1e76 into master May 13, 2021
dgw deleted the wiktionary-etymology-citations branch May 13, 2021 17:14

3 participants