issue with diff method #501

Open
@valeriocos

I'm using gitpython to collect diff information between a commit and its parent.
Generally, the following code works fine when the number of diffs to retrieve is small:
diffs = c.parents[0].diff(c, create_patch=True)
Conversely, when the number of diffs is huge (https://git.eclipse.org/c/papyrus/org.eclipse.papyrus.git/commit/?id=f5f817279baa2008450aa32b18e576c2fcda02bb), that call still has not produced any output after at least 24 hours.
Is there another way to retrieve the diff information between two commits?

Below you can find the code to replicate this behaviour:

from git import *

REPO_PATH = "C:/Users/.../org.eclipse.papyrus"  # you can clone it from here: https://git.eclipse.org/c/papyrus/org.eclipse.papyrus.git/

BRANCH = "2.0.0"

def main():
    repo = Repo(REPO_PATH, odbt=GitCmdObjectDB)
    reference = [r for r in repo.references if r.name == BRANCH][0]
    for c in repo.iter_commits(rev=reference):
        if c.hexsha == 'f5f817279baa2008450aa32b18e576c2fcda02bb':
            diffs = c.parents[0].diff(c, create_patch=True)
            print(len(diffs))
            break

if __name__ == "__main__":
    main()

Byron commented on Aug 21, 2016

Member

Unfortunately, I cannot reproduce the issue despite the fabulous reproduction script. This is what I did:

  • git clone http://git.eclipse.org/gitroot/papyrus/org.eclipse.papyrus.git
  • time python reproduce.py

The latter produced this output:

➜  GitPython git:(master) ✗ time python reproduce.py
7241
python reproduce.py  4.97s user 0.66s system 99% cpu 5.670 total

It appears there is something else going on. Maybe you are not using the latest version? Maybe it is something specific to Windows. In any case, we will have to dig deeper to find a solution for this one.

The actual script I ended up using is below.

from git import *

REPO_PATH = "./org.eclipse.papyrus"

BRANCH = "2.0.0"

def main():
    repo = Repo(REPO_PATH, odbt=GitCmdObjectDB)
    reference = [r for r in repo.references if r.name == BRANCH][0]
    for c in repo.iter_commits(rev=reference):
        if c.hexsha == 'f5f817279baa2008450aa32b18e576c2fcda02bb':
            diffs = c.parents[0].diff(c, create_patch=True)
            print(len(diffs))
            break

if __name__ == "__main__":
    main()

For completeness, here is the memory usage when trying to show the diff in the WEB-GUI - it took a long time to load as well.
[Screenshot taken 2016-08-21 at 20:27:57]

valeriocos commented on Aug 22, 2016

Author

I've updated gitpython to the latest version (2.0.8); however, the problem is still there. As you said, it may depend on Windows-related stuff.
I found a workaround that seems to work fine; the code is below.

from git import *

REPO_PATH = "./org.eclipse.papyrus"

BRANCH = "2.0.0"

def main():
    diffs = []
    repo = Repo(REPO_PATH, odbt=GitCmdObjectDB)
    reference = [r for r in repo.references if r.name == BRANCH][0]
    for c in repo.iter_commits(rev=reference):
        if c.hexsha == 'f5f817279baa2008450aa32b18e576c2fcda02bb':
            files = repo.git.execute(["git", "diff", "--name-only", c.parents[0].hexsha, c.hexsha]).split('\n')
            for f in files:
                diff = c.parents[0].diff(c, paths=f, create_patch=True)
                diffs = diffs + diff

if __name__ == "__main__":
    main()
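
A small note on the per-file approach above: splitting the `--name-only` output on '\n' returns [''] when no files changed, which would then be passed on as an empty path. Below is a minimal sketch of a more defensive split. The helper name `parse_name_only` and its `nul_separated` parameter are hypothetical, not part of GitPython; the NUL-separated mode assumes the file list was produced with git's `-z` flag.

```python
def parse_name_only(output, nul_separated=False):
    """Split the text produced by `git diff --name-only` into a list of paths.

    Blank entries are dropped, so empty output yields [] instead of [''].
    Set nul_separated=True when the list was produced with `-z`.
    """
    separator = "\0" if nul_separated else "\n"
    return [path for path in output.split(separator) if path]


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only.
    sample = "plugins/a/File.java\nplugins/b/Other.java"
    for path in parse_name_only(sample):
        print(path)
```

Each returned path can then be passed to `diff(c, paths=f, create_patch=True)` exactly as in the loop above.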

Byron commented on Aug 23, 2016

Member

Thanks for the feedback, and for posting the workaround!
Given that the project is no longer tested on Windows and supports Windows only on a 'best-effort' basis, I believe there is nothing that can be done here to fix this particular case.
Thus I am closing this issue. If you disagree or would like to contribute some sort of fix, please let me know in the comments.

ankostis commented on Oct 11, 2016

Contributor

I can definitely reproduce this. The git.diff code was retrofitted in #519 to use threads when reading the stream, but I have STILL seen a case where it blocked on particularly big streams.
Maybe additionally using queues might solve the problem for good. See http://eyalarubas.com/python-subproc-nonblock.html and http://stackoverflow.com/a/4896288/548792
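
For illustration, here is a minimal standard-library sketch of the thread-plus-queue idea from those links. It is not GitPython's actual implementation; `_pump`, `_drain` and `run_and_collect` are hypothetical names. Each pipe gets its own reader thread that pushes chunks onto a queue, so the child process should never stall on a full stdout or stderr pipe while the parent is busy with the other one.

```python
import subprocess
import sys
import threading
from queue import Queue


def _pump(stream, q, chunk_size=64 * 1024):
    """Copy chunks from a pipe into a queue; None marks end of stream."""
    try:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            q.put(chunk)
    finally:
        stream.close()
        q.put(None)


def _drain(q):
    """Collect chunks from a queue until the None sentinel arrives."""
    chunks = []
    while True:
        chunk = q.get()
        if chunk is None:
            return b"".join(chunks)
        chunks.append(chunk)


def run_and_collect(cmd):
    """Run cmd and return (returncode, stdout_bytes, stderr_bytes)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out_q, err_q = Queue(), Queue()
    # One reader thread per pipe, so both are emptied concurrently.
    for stream, q in ((proc.stdout, out_q), (proc.stderr, err_q)):
        threading.Thread(target=_pump, args=(stream, q), daemon=True).start()
    out = _drain(out_q)
    err = _drain(err_q)
    return proc.wait(), out, err


if __name__ == "__main__":
    # Stand-in child process that writes a lot to both pipes.
    child = "import sys; sys.stdout.write('x' * 1000000); sys.stderr.write('e' * 100000)"
    code, out, err = run_and_collect([sys.executable, "-c", child])
    print(code, len(out), len(err))
```

The queues here are unbounded, so memory grows with the size of the output; a bounded `Queue(maxsize=...)` combined with incremental processing of the chunks would be one way to cap that.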

          issue with diff method · Issue #501 · gitpython-developers/GitPython