Prototype: Scrape E_0003 from docx#2

Merged
hf-kklein merged 19 commits into main from
first_test
Dec 13, 2022
Conversation

@hf-kklein
Contributor

No description provided.

@hf-kklein hf-kklein self-assigned this Dec 13, 2022
Comment thread src/ebddocx2table/__init__.py Outdated
Comment thread src/ebddocx2table/docxtableconverter.py Outdated
yield _Cell(table_column, docx_table_row.table)


_subsequent_step_pattern = re.compile(r"^(?P<bool>(?:ja)|(?:nein))\s*(?P<subsequent_step_number>(?:\d)|ende)?")
Contributor

@lord-haffi lord-haffi Dec 13, 2022

Shouldn't it rather be

Suggested change
_subsequent_step_pattern = re.compile(r"^(?P<bool>(?:ja)|(?:nein))\s*(?P<subsequent_step_number>(?:\d)|ende)?")
_subsequent_step_pattern = re.compile(r"^(?P<bool>(?:ja)|(?:nein))\s*(?P<subsequent_step_number>(?:\d+\*?)|Ende)?")

? At least you specified (?:\d+\*?)|(Ende) as the regex validator for the subsequent step number in the EbdTable fields.
Edit: Did it even match "ende" before? The regex is case-sensitive, if I'm not mistaken, and in the tables "Ende" is always capitalized.

Contributor Author

Did it even match "ende" before?

It matches against cell.text.lower() further down, which is why it works.
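The case-sensitivity point can be checked with a small sketch. It assumes the pattern with the widened step number (`\d+\*?`) but a lowercase `ende`, and a made-up cell text; the PR's actual cell objects are not used here.

```python
import re

# Assumption: the pattern keeps a lowercase "ende" because it is only
# ever applied to lowercased cell text.
subsequent_step_pattern = re.compile(
    r"^(?P<bool>(?:ja)|(?:nein))\s*(?P<subsequent_step_number>(?:\d+\*?)|ende)?"
)

cell_text = "Ja\nEnde"  # hypothetical cell content

# Matching the raw text fails because the pattern is case-sensitive.
raw_match = subsequent_step_pattern.match(cell_text)

# Matching the lowercased text succeeds.
lower_match = subsequent_step_pattern.match(cell_text.lower())
```

Compiling the pattern with `re.IGNORECASE` would be an alternative to lowercasing the input, at the cost of having to normalize the captured groups afterwards.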

Contributor Author

<-- now documented

Comment thread src/ebddocx2table/docxtableconverter.py
Comment on lines +42 to +43
if subsequent_step_number == "ende":
subsequent_step_number = "Ende"
Contributor

If I see it correctly, "Ende" is always capitalized in the tables. I adjusted the regex above, so these lines should be obsolete, right?

Suggested change
if subsequent_step_number == "ende":
subsequent_step_number = "Ende"

Contributor Author

The regex uses lowercase because I can't rule out that people sometimes write "Ja" and sometimes "ja". But I see that an explanatory comment would be good.

Contributor Author

Explanatory comment: 74f1988
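A minimal sketch of why the two lines under review are still needed once the input is lowercased. `parse_check_result` is a hypothetical stand-in for the PR's `_cell_to_bool` that works on plain strings instead of docx cells, and the error handling is an assumption.

```python
import re
from typing import Optional, Tuple

_pattern = re.compile(
    r"^(?P<bool>(?:ja)|(?:nein))\s*(?P<subsequent_step_number>(?:\d+\*?)|ende)?"
)


def parse_check_result(cell_text: str) -> Tuple[bool, Optional[str]]:
    """Hypothetical stand-in for _cell_to_bool, working on plain strings."""
    # Lowercase first so that "Ja", "ja" and "JA" are all accepted.
    match = _pattern.match(cell_text.lower())
    if match is None:
        raise ValueError(f"Cannot parse check result from {cell_text!r}")
    outcome = match.group("bool") == "ja"
    subsequent_step_number = match.group("subsequent_step_number")
    if subsequent_step_number == "ende":
        # Lowercasing the input also lowercased "Ende";
        # restore the spelling used in the tables.
        subsequent_step_number = "Ende"
    return outcome, subsequent_step_number
```

With this shape, `parse_check_result("Ja\nEnde")` yields `(True, "Ende")` and a cell without a follow-up step yields `None` as the second element.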

Comment thread src/ebddocx2table/docxtableconverter.py Outdated
self._column_index_result_code: int
self._column_index_note: int
self._row_index_last_header: Literal[0, 1] # either 0 or 1
for row_index in range(0, 2): # just check the first two rows in the constructor
Contributor

A comment here would be nice, saying that these are the header rows of the table.

break # because the prüfende rolle is always a full row with identical column cells
if table_cell.text == "Nr.":
self._column_index_step_number = column_index
self._row_index_last_header = row_index # type:ignore[assignment]
Contributor

In which case can _row_index_last_header be 0?

Contributor Author

When the first row of the table does not contain the Prüfende Rolle (the checking role): c606c61
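The two cases can be illustrated with plain lists of strings instead of docx rows. The function name, the `startswith("Prüfende Rolle")` check and the sample role "LF" are assumptions made for this sketch; only the "Nr." and "Hinweis" header texts come from the diff.

```python
from typing import List


def find_last_header_row_index(rows: List[List[str]]) -> int:
    """Return the index (0 or 1) of the row that holds column headers such as 'Nr.'."""
    # Only the first two rows can be header rows.
    for row_index, row in enumerate(rows[:2]):
        for cell_text in row:
            if row_index == 0 and cell_text.startswith("Prüfende Rolle"):
                # The role spans the whole first row; the headers follow in the next row.
                break
            if cell_text == "Nr.":
                return row_index
    raise ValueError("No header row containing 'Nr.' found in the first two rows")


# Table that starts with a full-width "Prüfende Rolle" row.
with_role = [["Prüfende Rolle: LF", "Prüfende Rolle: LF"], ["Nr.", "Hinweis"]]
# Table that starts directly with the column headers.
without_role = [["Nr.", "Hinweis"], ["1", ""]]
```

Here `with_role` gives index 1 and `without_role` gives index 0.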

Comment thread unittests/test_highlevel.py Outdated
A class for tests of the entire package/library
"""

@pytest.mark.datafiles("./test_data/ebd20221128.docx")
Contributor

@lord-haffi lord-haffi Dec 13, 2022

Suggested change
@pytest.mark.datafiles("./test_data/ebd20221128.docx")
@pytest.mark.datafiles("./test_data")

If I understood your get_document method correctly, you would always have to omit the file name in the first argument, right?

Contributor Author

But the datafiles marker needs the full file name. The files listed there are then copied into a temporary directory, which the code can access via datafiles/<filename>.
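The behaviour described here can be mimicked with the standard library alone. The sketch below does not use pytest-datafiles; `copy_to_temp_dir` is a hypothetical helper that only imitates the "copy the named file, then address it by file name" idea, and the file content is a dummy.

```python
import shutil
from pathlib import Path
from tempfile import TemporaryDirectory


def copy_to_temp_dir(source_file: Path, temp_dir: Path) -> Path:
    """Copy a single file into temp_dir and return the path of the copy."""
    target = temp_dir / source_file.name
    shutil.copy(source_file, target)
    return target


with TemporaryDirectory() as source_dir, TemporaryDirectory() as datafiles_dir:
    # Stand-in for ./test_data/ebd20221128.docx
    original = Path(source_dir) / "ebd20221128.docx"
    original.write_bytes(b"dummy content")
    copy_to_temp_dir(original, Path(datafiles_dir))
    # The test code only needs the temporary directory plus the file name.
    copied = Path(datafiles_dir) / "ebd20221128.docx"
    copy_exists = copied.exists()
    copy_content = copied.read_bytes()
```

This is why the marker is given the full path to the file while the test body works with the directory and the file name separately.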

Comment on lines +20 to +23
with open(docx_file_path, "rb") as docx_file:
source_stream = BytesIO(docx_file.read())
# switched from StringIO to BytesIO because of:
# UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 605: character maps to <undefined>
I assume you specified the encoding in open, right?

open('readme.txt', encoding="utf-8")
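For context on the BytesIO choice: a .docx file is a ZIP package, so it is binary data rather than text, and Python's `open` does not accept an `encoding` argument in binary mode. The sketch below uses an ordinary zip archive as a stand-in for a real .docx; `read_binary_stream` is a hypothetical helper name.

```python
import zipfile
from io import BytesIO
from pathlib import Path
from tempfile import TemporaryDirectory


def read_binary_stream(file_path: Path) -> BytesIO:
    """Read the file in binary mode ("rb") so that no text decoding takes place."""
    with open(file_path, "rb") as binary_file:
        return BytesIO(binary_file.read())


with TemporaryDirectory() as tmp_dir:
    # Stand-in for a .docx: any zip archive will do for this demonstration.
    fake_docx_path = Path(tmp_dir) / "example.docx"
    with zipfile.ZipFile(fake_docx_path, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")
    stream = read_binary_stream(fake_docx_path)
    is_zip = zipfile.is_zipfile(stream)
```

Reading such a file in text mode is what leads to decoding errors like the one quoted in the code comment, whatever encoding is chosen.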

opens and returns the document specified in the docx_file_path using python-docx
"""
# https://python-docx.readthedocs.io/en/latest/user/documents.html#opening-a-file-like-document
with open(docx_file_path, "rb") as docx_file:
Should we also handle the case where the file doesn't exist?

from pathlib import Path

p = Path.home()
print(p)
print(p.exists())

Contributor Author

I don't need to handle that case. It's fine if it dies with a FileNotFoundError. Hiding the error doesn't help anyone.

Well, one could let it exit with a nicer error message. But yes, fine by me.
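One way to get the "nicer error message" without hiding the failure is to re-raise the same exception type with more context. This is only a sketch of that option, not what the PR does; `read_docx_bytes` and the message wording are hypothetical.

```python
from io import BytesIO
from pathlib import Path


def read_docx_bytes(docx_file_path: Path) -> BytesIO:
    """Hypothetical helper: read the file or fail with a more descriptive FileNotFoundError."""
    try:
        with open(docx_file_path, "rb") as docx_file:
            return BytesIO(docx_file.read())
    except FileNotFoundError as error:
        # The error is not swallowed, only re-raised with more context.
        raise FileNotFoundError(f"The .docx file '{docx_file_path}' does not exist") from error
```

Callers still see a FileNotFoundError, so the "let it crash" behaviour preferred by the author is preserved.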

@@ -1,3 +1,67 @@
"""
src contains all your business logic
Contains high level functions to process .docx files
Are the functions sensibly placed here in __init__.py?
I'm never quite sure myself what best goes into an init file.

Contributor Author

Neither am I :D

Comment on lines +63 to +64
next_table_is_requested_table = paragraph.text.startswith(ebd_key)
if isinstance(table_or_paragraph, Table) and next_table_is_requested_table:
smart move

I think I'll adopt this for kohlrahbi.
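The idea praised in this thread, remembering whether the preceding paragraph announced the wanted EBD and then taking the next table, can be sketched with stand-in classes. `Paragraph`, `Table` and `find_table_after_heading` are hypothetical and not the python-docx types; resetting the flag on every paragraph is a simplification of this sketch, and the heading texts are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Union


@dataclass
class Paragraph:  # hypothetical stand-in for a docx paragraph
    text: str


@dataclass
class Table:  # hypothetical stand-in for a docx table
    name: str


def find_table_after_heading(
    elements: Iterable[Union[Paragraph, Table]], ebd_key: str
) -> Optional[Table]:
    """Return the first table that directly follows a paragraph starting with ebd_key."""
    next_table_is_requested_table = False
    for element in elements:
        if isinstance(element, Paragraph):
            # Remember whether the most recent paragraph announced the requested EBD.
            next_table_is_requested_table = element.text.startswith(ebd_key)
        elif isinstance(element, Table) and next_table_is_requested_table:
            return element
    return None


document_body = [
    Paragraph("E_0001 first heading"),
    Table("first"),
    Paragraph("E_0003 second heading"),
    Table("second"),
]
```

Looking up "E_0003" in `document_body` returns the table named "second"; an unknown key returns `None`.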

Comment thread src/ebddocx2table/docxtableconverter.py Outdated
@@ -0,0 +1,128 @@
"""
This a docstring for the module.
Oh, really? I wouldn't have expected that ;P

Comment thread src/ebddocx2table/docxtableconverter.py Outdated

class _EbdSubRowPosition(Enum):
"""
describes the position of a subrow in the Docx Table
Can you add an example here?

if row_index == 0 and _is_pruefende_rolle_cell(table_cell):
role = table_cell.text.split(":")[1].strip()
break # because the prüfende rolle is always a full row with identical column cells
if table_cell.text == "Nr.":
Do you deliberately want to be this strict here, or would a startswith be fine too?

Contributor Author

I would stay strict until it fails.

converts docx tables to EbdTables
"""

def __init__(self, docx_table: Table, ebd_key: str, chapter: str, sub_chapter: str):
Maybe it would make sense to also use a classmethod here to create an instance of DocxTableConverter that is a bit more independent of the data model.
Compare https://github.com/Hochfrequenz/kohlrahbi/blob/53cf14c10cf1b965281dd705ce52126d9a3f3f50/src/kohlrahbi/helper/elixir.py#L48-L52

Contributor Author

@hf-krechan can you open a separate PR for that? I don't see the benefit yet.

Comment thread src/ebddocx2table/docxtableconverter.py Outdated
sub_rows = []
step_number = row_cells[self._column_index_step_number].text.strip()
description = row_cells[self._column_index_description].text.strip()
boolean_outcome, subsequent_step_number = _cell_to_bool(row_cells[self._column_index_check_result])
If you have a better name for the boolean above, it would be appropriate here too ;)

Contributor Author

Do you still find boolean_outcome bad? Or should I rather call it "ja"/"nein"? I can't think of a better name.

@@ -0,0 +1,10 @@
# Test Data (.docx)
It is of course very handy that you only pull specific EBDs from the docx.
Hmm, I'll think about whether I can manage that in Kohlrahbi too 👍

@hf-kklein hf-kklein requested a review from lord-haffi December 13, 2022 15:22
self._column_index_result_code = column_index
elif table_cell.text == "Hinweis":
self._column_index_note = column_index
self._metadata = EbdTableMetaData(ebd_code=ebd_key, sub_chapter=sub_chapter, chapter=chapter, role=role)
Contributor

@lord-haffi lord-haffi Dec 13, 2022

When the first row of the table does not contain the Prüfende Rolle: c606c61

In that case the variable role is not set, though, right? It should actually crash here then; do you test that case?

Contributor Author

Yes, it crashes then. I'd find that OK though; then one has to see what the reason is.

I see your point: the code gives the impression of being flexible and prepared for all eventualities, but in fact it can only handle the one case that is currently tested.

Contributor Author

In that case the variable role is not set, though, right?

#9 there it is, that case :)
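The crash discussed in this thread is Python's behaviour for a local variable that is only assigned inside a branch that never ran. The sketch below reproduces it on plain strings and shows one possible explicit default; both function names and the sample role "LF" are hypothetical, and the `None` default is not necessarily how #9 resolved it.

```python
from typing import List, Optional


def extract_role_unsafe(first_row: List[str]) -> str:
    """Mirrors the pattern under review: role is only assigned if a cell matches."""
    for cell_text in first_row:
        if cell_text.startswith("Prüfende Rolle"):
            role = cell_text.split(":")[1].strip()
            break
    return role  # raises UnboundLocalError if no cell matched


def extract_role(first_row: List[str]) -> Optional[str]:
    """Variant with an explicit default, so a missing role is visible as None."""
    role: Optional[str] = None
    for cell_text in first_row:
        if cell_text.startswith("Prüfende Rolle"):
            role = cell_text.split(":")[1].strip()
            break
    return role
```

Whether a missing role should crash loudly or be modelled as an optional value is a design decision; the first variant matches the "it crashes, and that is OK" stance taken above.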

@lord-haffi
Contributor

The rest LGTM

@hf-kklein hf-kklein merged commit 24d4ef1 into main Dec 13, 2022
@hf-kklein hf-kklein deleted the first_test branch December 13, 2022 15:55
3 participants