I'd like to extract the test results and do some calculations later. My current approach is to set `report=true` and then parse the resulting xml file. Is there any other better approach?

More specifically, can we return `testitems` instead of `nothing` here?
ReTestItems.jl/src/ReTestItems.jl, line 396 (commit 60d93f1)
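(A minimal sketch of that approach, where "MyPackage" is just a placeholder for the package or test path passed to `runtests`:)

```julia
# Run the test items and also write a JUnit-style XML report that can be
# parsed afterwards ("MyPackage" is a placeholder path/package name).
using ReTestItems

runtests("MyPackage"; report=true)
```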
nickrobinson251 commented on Jan 16, 2024
Thanks for the issue! Can you say more about what the use-case is?
Returning `testitems` there is not something we'd want to do by default (since in the REPL it would then print out that object), but it could perhaps be made an option... although that's a `ReTestItems.TestItems` object (which is currently a private internal structure which I'd be reluctant to make public as is).

Also the whole `runtests()` function will throw (specifically the `Test.finish` call will throw) if there were any non-passing tests, so returning anything from `runtests` would only be possible in cases where all tests succeed... unless `runtests()` is itself called from within another testset (which is basically what we do in the tests of this package itself: we call `runtests` inside an `EncasedTestSet` and inspect the results).

So what's possible/desirable to do here depends on what exactly you're trying to achieve. Can you share an example?

Also, for context: I'm a little wary of committing to any kind of support for interacting with test results without a bit of a plan for what would be in or out of scope, and without taking a look at prior art. E.g. a natural next step would be to use the returned test results when re-running tests, e.g. to only run the failures/errors, or to run those first -- but maybe we can add something that helps you without that, if I can understand the use-case better.
findmyway commented on Jan 16, 2024
First, thanks for your prompt reply!
I understand your concern and your explanation makes sense to me.
Sure! So I'm working on a project similar to openai/human-eval and evalplus, but in JuliaLang. Generally it will ask different LLMs to generate code based on a given prompt and then execute the code to calculate the `p@k` (a metric based on the test case pass rate, used to measure the performance of code LLMs). I think this package already provides many interesting features, like parallel execution, timeouts, and many important metrics. All these features combined help me a lot when analyzing the performance of existing LLMs.

I think all I need is already in the `report.xml` file. It would be great if we could somehow make it easier to extract these details (instead of parsing the raw xml file).

FYI: you can view the test cases here: https://github.com/oolong-dev/OolongEval.jl/tree/add_human_eval/benchmarks/HumanEval/src/tasks
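(For reference, a minimal sketch of the standard unbiased `pass@k` estimator from the HumanEval paper, assuming `n` generated samples per task of which `c` passed the test cases:)

```julia
# Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k), computed as a product
# for numerical stability. `n` samples per task, `c` of them passing, sample size `k`.
function pass_at_k(n::Integer, c::Integer, k::Integer)
    n - c < k && return 1.0  # too few failing samples: every size-k sample contains a pass
    c == 0 && return 0.0
    return 1.0 - prod(1.0 - k / i for i in (n - c + 1):n)
end

pass_at_k(10, 3, 1)  # ≈ 0.3 when 3 of 10 samples pass
```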
nickrobinson251 commented on Jan 16, 2024
I see. Yeah, I'm not sure I see an easy way to add this to ReTestItems due to how Test.jl works (at least not one I'd be comfortable committing to 😅).

Maybe right now using the `report.xml` is actually the easiest way (daft as it might sound). The `report.xml` is a somewhat standard format ("JUnit XML"). I don't think there are any Julia parsers for it (just generic XML parsers), but Python has `junitparser` (if you're open to using Python) and there are probably others, or writing your own parser in Julia based on an XML parser shouldn't be too arduous.
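(For reference, a minimal sketch of the "write your own parser" route, assuming the EzXML.jl package and the usual JUnit attribute names; the "report.xml" path is a placeholder:)

```julia
# Parse a JUnit-style XML report and print a per-suite pass count.
# Assumes EzXML.jl and the standard JUnit attributes ("tests", "failures", "errors").
using EzXML

doc = readxml("report.xml")  # placeholder path to the generated report
for suite in findall("//testsuite", root(doc))
    tests    = parse(Int, suite["tests"])
    failures = parse(Int, suite["failures"])
    errors   = parse(Int, suite["errors"])
    println(suite["name"], ": ", tests - failures - errors, "/", tests, " passed")
end
```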
But an alternative to parsing the `report.xml` could maybe be to use MetaTesting? It's designed for "testing your test functions", i.e. testing functions that themselves run tests... but that's pretty close to inspecting what happened when tests ran, so it could be adapted to that use-case, I think... something like:
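(A sketch of that pattern, assuming MetaTesting exports an `encased_testset(f)` helper that runs `f` inside an `EncasedTestSet` -- check MetaTesting's docs for the exact API; "MyPackage" is a placeholder:)

```julia
using MetaTesting, ReTestItems, Test

# Run the test items inside an encased testset so failures are recorded
# instead of `runtests` throwing at the end.
ts = encased_testset() do
    runtests("MyPackage")  # placeholder package/path
end

# `ts` is a Test.DefaultTestSet: inspect e.g. `ts.n_passed` and `ts.results`.
```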
This gives you a `Test.DefaultTestSet`, which only stores the failures/errors and the number of passes (no other data about the passes), but maybe that's enough for your use-case?

findmyway commented on Jan 17, 2024
Great, thanks!
I wasn't aware of MetaTesting before. I'll try both approaches and see which one works better.
findmyway commented on Feb 22, 2024
I used `results.xml` in the end.

FYI: https://github.com/01-ai/HumanEval.jl