✨Support tables spanning over multiple pages by hf-kklein · Pull Request #7 · Hochfrequenz/ebdamame

hf-kklein · 2022-12-18T18:19:35Z

fun fun fun

hf-kklein · 2022-12-18T18:31:47Z

    next_table_is_requested_table: bool = False
-    for table_or_paragraph in _get_tables_and_paragaphs(document):
+    tables: List[Table] = []
+    tables_and_paragraphs = _get_tables_and_paragaphs(document)


I pulled this out of the loop so that we can keep iterating over it further down (at the point where we have found the first matching table).
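A minimal sketch of the idea, not the code from this PR: the iterator is created once, so a nested loop can continue exactly where the outer loop stopped. Plain strings stand in for docx tables and paragraphs, and `collect_tables` as well as the `paragraph:`/`table:` prefixes are hypothetical names used only for illustration.

```python
from typing import Iterable, Iterator, List


def collect_tables(elements: Iterable[str], key: str) -> List[str]:
    """Collect the tables that directly follow the paragraph for `key`."""
    iterator: Iterator[str] = iter(elements)  # created once, outside the loop
    tables: List[str] = []
    for element in iterator:
        if element != f"paragraph:{key}":
            continue
        # Continue on the *same* iterator, right after the matching paragraph.
        for following in iterator:
            if following.startswith("table:"):
                tables.append(following)
            elif following == "paragraph:":
                continue  # blank paragraph between two parts of the same table
            else:
                break  # any other paragraph ends the tables for this key
        break  # the inner loop is done, so the outer loop is done as well
    return tables
```

Because both loops share one iterator, the inner loop never re-reads elements the outer loop has already passed.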

hf-kklein · 2022-12-18T18:33:03Z


-    def convert_docx_table_to_ebd_table(self) -> EbdTable:
+    def _handle_single_table(
+        self, table: Table, row_offset: int, rows: List[EbdTableRow], sub_rows: List[EbdTableSubRow]
+    ) -> None:
        """
-        Converts the raw docx table of an EBD to an EbdTable.
-        The latter contains the same data but in an easily accessible format that can be used to e.g. plot real graphs.
+        Handles a single table (out of possible multiple tables for 1 EBD).
+        The results are written into rows and sub_rows. Those will be modified.
        """
-        rows: List[EbdTableRow] = []
-        sub_rows: List[EbdTableSubRow] = []
        for table_row, sub_row_position in zip(
-            self._docx_table.rows[self._row_index_last_header + 1 :],
+            table.rows[row_offset:],
            cycle([_EbdSubRowPosition.UPPER, _EbdSubRowPosition.LOWER]),
        ):


The function that previously handled the one and only table is now a private method that appends its results to two lists passed in for rows and sub_rows.
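The pattern can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the method from the diff: rows are plain strings, the `"UPPER"`/`"LOWER"` strings stand in for `_EbdSubRowPosition`, and the way two sub rows are paired into one row is assumed.

```python
from itertools import cycle
from typing import List, Tuple


def handle_single_table(
    table_rows: List[str], row_offset: int, rows: List[Tuple[str, str]], sub_rows: List[str]
) -> None:
    """Append the results for one table to the caller's `rows` and `sub_rows` lists."""
    for raw_row, position in zip(table_rows[row_offset:], cycle(["UPPER", "LOWER"])):
        sub_rows.append(raw_row)
        if position == "LOWER":
            # Assumption: an upper and a lower sub row together form one row.
            rows.append((sub_rows[-2], sub_rows[-1]))


rows: List[Tuple[str, str]] = []
sub_rows: List[str] = []
handle_single_table(["header", "1 yes", "1 no"], 1, rows, sub_rows)  # first table
handle_single_table(["2 yes", "2 no"], 0, rows, sub_rows)  # continuation table
```

Returning `None` and mutating the passed lists lets the caller accumulate results across several tables without merging return values.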

hf-kklein · 2022-12-18T18:33:40Z

-            result_code = row_cells[self._column_index_result_code].text.strip()
-            note = row_cells[self._column_index_note].text.strip()
            sub_row = EbdTableSubRow(
                check_result=EbdCheckResult(subsequent_step_number=subsequent_step_number, result=boolean_outcome),
-                result_code=result_code or None,
-                note=note or None,
+                result_code=row_cells[self._column_index_result_code].text.strip() or None,
+                note=row_cells[self._column_index_note].text.strip() or None,


pylint complained: too-many-locals (i.e. too many local variables). This is a classic case of: the linter is happy, but readability may suffer a bit.

hf-kklein · 2022-12-18T18:34:21Z

+            offset: int = 0
+            if table_index == 0:
+                offset = self._row_index_last_header + 1
+            self._handle_single_table(table, offset, rows, sub_rows)


The formerly sole conversion function is now called here.
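As a hedged illustration of the offset logic in this hunk: only the first table is assumed to carry header rows, so only it is sliced past the last header row, while continuation tables are read from row 0. `data_rows` is a hypothetical helper and tables are modelled as lists of strings.

```python
from typing import List


def data_rows(tables: List[List[str]], row_index_last_header: int) -> List[str]:
    """Concatenate the data rows of all tables that belong to one EBD."""
    result: List[str] = []
    for table_index, table in enumerate(tables):
        # Skip the header rows in the first table only.
        offset = row_index_last_header + 1 if table_index == 0 else 0
        result.extend(table[offset:])
    return result
```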

lord-haffi

Overall I'd say lgtm. I only noticed one thing in get_ebd_docx_tables:
In my opinion, you will run into a problem if a table is directly followed by a paragraph with an EBD key (and, of course, its own table after that). I skimmed the PDF briefly. In most places an "unnecessary" paragraph seems to follow, where it goes unnoticed if it is skipped. If that is always the case, please add a comment. If not, you will probably need to rework this; in that case a corresponding test case would certainly make sense, too.

lord-haffi · 2022-12-19T11:37:57Z

+            tables.append(table)
+            # Now we have to check if the EBD table spans multiple pages and _maybe_ we have to collect more tables.
+            # The funny thing is: Sometimes the authors create multiple tables split over multiple lines which belong
+            # together, sometimes they create 1 proper table that spans multiple pages. This we won't notice here.


What is not being "noticed" here? The difference between the two cases you presented? Or is one of the cases not handled?

Edit: Looking at the code below, I assume you mean that both cases are handled, but the difference between them is not stored in the result?

f8d9306 <-- hopefully clearer this way

lord-haffi · 2022-12-19T11:48:26Z

+                    # sometimes the authors add blank lines before they continue with the next table
+                    continue
+                else:
+                    break  # because if no other table follows, we're done collecting the tables for this EBD key


tables_and_paragraphs is a generator. That means the list is generated "on the fly": for each iteration, the generator runs the corresponding function (in this case _get_tables_and_paragaphs) up to the next yield.
If a table is followed by a paragraph that contains something (e.g. a new EBD key), the break here is executed, but that paragraph is then skipped in the outer loop. Or am I missing something?

We no longer care about the outer loop once we have been in the inner loop.
91297fe
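The concern can be reproduced with a small, hypothetical example (the `elements` generator merely stands in for `_get_tables_and_paragaphs`): the element on which the inner loop breaks has already been consumed, so the outer loop never sees it unless it stops as well.

```python
from typing import Iterator, List


def elements() -> Iterator[str]:
    """Hypothetical stand-in for the generator of tables and paragraphs."""
    yield from ["table 1a", "table 1b", "paragraph E_0002", "table 2"]


def seen_by_outer_loop(break_outer: bool) -> List[str]:
    """Return the elements that the outer loop gets to see."""
    generator = elements()
    seen: List[str] = []
    for element in generator:
        seen.append(element)
        for following in generator:  # shares its state with the outer loop
            if not following.startswith("table"):
                break  # "paragraph E_0002" has already been consumed here
        if break_outer:
            break
    return seen
```

With `break_outer=False` the outer loop resumes at `"table 2"` and silently misses the paragraph; with `break_outer=True` it stops after the first match, which is sufficient when only one EBD key is looked up per call.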

lord-haffi · 2022-12-19T11:50:44Z

    Opens the file specified in docx_file_path and returns the table that relates to the given ebd_key.
    Raises an ValueError if the table was not found.


You still need to update the docstring.

Co-authored-by: Leon Haffmans <49658102+lord-haffi@users.noreply.github.com>

lord-haffi

Right, I completely forgot that the method get_ebd_docx_tables is only supposed to read a single EBD table. I guess I was a bit confused ^^
Works for me 👍

✨Support tables over multiple pages

cbfb77a

fun fun fun

hf-kklein self-assigned this Dec 18, 2022

hf-kklein changed the title ~~✨Support tables over multiple pages~~ ✨Support tables spanning over multiple pages Dec 18, 2022

hf-kklein added 2 commits December 18, 2022 19:20

ignore mypy

c9a5625

yes, pylint is right

33645af

hf-kklein commented Dec 18, 2022

Update src/ebddocx2table/docxtableconverter.py

4059267

hf-kklein requested review from hf-krechan and lord-haffi December 18, 2022 18:35

hf-kklein added 3 commits December 18, 2022 19:37

comment

a8a4012

Merge remote-tracking branch 'origin/extract_more_tables' into extrac…

6e4b749

…t_more_tables

fix outdated type hints in unittest helper methods

c2d49d5

lord-haffi reviewed Dec 19, 2022

hf-kklein and others added 5 commits December 19, 2022 13:24

Update unittests/__init__.py

86b2488

Co-authored-by: Leon Haffmans <49658102+lord-haffi@users.noreply.github.com>

fix outdated type hints in unittest helper methods

8257a7b

update docstring

2f484ba

be more explicit

f8d9306

break outer loop, too

91297fe

hf-kklein requested a review from lord-haffi December 19, 2022 12:34

lord-haffi approved these changes Dec 19, 2022

hf-kklein merged commit 3c6f710 into main Dec 19, 2022

hf-kklein deleted the extract_more_tables branch December 19, 2022 13:53
