This issue has been moved to a discussion.

✨[Feature] Print out Python file name, line number, and/or op for warnings, compilation failures #949


Description

@chaoz-dev
Contributor

Is your feature request related to a problem? Please describe.
By default, when running Torch-TensorRT from Python, most errors encountered during compilation will print a stack trace originating from either the Python compilation command (eg. printing something like torch_tensorrt_pypi/torch_tensorrt/ts/_compiler.py", line 119, in compile compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec)). RuntimeError: ....) or the CPP file backing the implementation.

Warnings during runtime may be printed without any identifying information.

The current stack traces and output do not, by themselves, provide enough information to diagnose or pinpoint which PyTorch op is causing the offending behavior. This is especially painful when applying Torch-TensorRT to large and complex models, where conversion failures are currently difficult to identify without unwinding the model and stepping through it op by op to locate the root cause.

Describe the solution you'd like
Ideally, when an error or warning occurs, we should print output identifying the offending op (at minimum), as well as the file and line number at which the op occurs, in addition to other stack traces, errors, and warnings.

Describe alternatives you've considered
Currently, I've tried log printing at the graph level to better identify errors during the conversion process. However, this approach has issues:

  1. The error may be raised before the next node or subgraph (where the error actually occurs) is printed;
  2. The printed output often looks correct and does not identify the offending op anyway.

Additional context
I think this ask is potentially complicated but feasible... On the one hand, Torch-TensorRT converts the JIT output, which means obtaining metadata could be more complex than reading the model directly. On the other hand, looking at the Interpreting Graphs section of the documentation, it appears that file and line number information should be accessible somewhere...

Activity

narendasan

narendasan commented on Mar 31, 2022

@narendasan
Collaborator

We already print out the trace to the location of unsupported operators. If you require full compilation, these are printed as errors; if you are using partial compilation, you will see them as Debug/Info messages. I guess you are likely talking about issues like TORCHTRT_CHECK failures, which are likely issues with Torch-TensorRT itself, so IMO the source location is not especially relevant there. Really, the key information is the operator, at least in the UX model we envision, where we want to avoid model modification as much as possible, since the first thing we would expect users to try is disabling that operator. This information is provided in the debug-level messaging, with messages like INFO: [TRTorch Conversion Context] - Adding Layer %2 : Tensor = aten::gt(%0, %1) (ctx.AddLayer) before entering any particular layer.

We could try to formalize this a bit so we can summarize the context in the converter, maybe something like:

Context: 
Converting: %2 : Tensor = aten::gt(%0, %1) 
Arguments: [%0: ..., %1: ...] 
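That formalized context block could be rendered by a small helper along these lines. This is a hypothetical plain-Python sketch for illustration only: the function name, signature, and argument shapes are invented here, and the real converter context lives in Torch-TensorRT's C++ core.

```python
# Hypothetical sketch: render a structured "Context" block for the op
# currently being converted. All names here are invented for illustration;
# this is not the Torch-TensorRT implementation.
def format_conversion_context(node_str, args):
    """Build the context message for a JIT node being converted.

    node_str: string form of the node, e.g. '%2 : Tensor = aten::gt(%0, %1)'
    args:     mapping of input value names to short descriptions
    """
    arg_list = ", ".join(f"{name}: {desc}" for name, desc in args.items())
    return (
        "Context:\n"
        f"Converting: {node_str}\n"
        f"Arguments: [{arg_list}]"
    )

msg = format_conversion_context(
    "%2 : Tensor = aten::gt(%0, %1)",
    {"%0": "Tensor (shape [1, 3])", "%1": "Tensor (scalar)"},
)
print(msg)
```

Emitting such a block immediately before a failure surfaces would hand the user the failing op without requiring debug-level verbosity.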
chaoz-dev

chaoz-dev commented on Apr 1, 2022

@chaoz-dev
ContributorAuthor

Yeah, when fallback occurs due to unsupported operators, this already prints out the unsupported operator, as you mentioned, with the file and line number (e.g. #803). For this request, I was thinking more along the lines of the failures seen in something like #922:

Traceback (most recent call last):
  File "/scripts/tril.py", line 39, in <module>
    model_trt = torchtrt.compile(
  File "/opt/conda/lib/python3.8/site-packages/torch_tensorrt/_compile.py", line 97, in compile
    return torch_tensorrt.ts.compile(ts_mod, inputs=inputs, enabled_precisions=enabled_precisions, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch_tensorrt/ts/_compiler.py", line 119, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: [Error thrown at core/partitioning/shape_analysis.cpp:66] Expected ivalues_maps.count(input) to be true but got false
Could not find torch::jit::Value* 47 produced from %47 : int = prim::dtype(%52) in lowering graph for mini graph input.

Having better printouts here would be helpful in pinpointing the offending op in large models, if only for error isolation and bug reporting. I believe this is in line with your statement (Really here the key information is the operator at least in the UX model we envision where we want to avoid model modification as much as possible, since the first thing we would expect users to try is disabling that operator.); currently, addressing any issue that does not print enough file/line metadata requires heavy modification of the original graph to isolate and understand the root cause, which necessarily modifies the lowered graph and may hide the original failure. The above message gives no indication of which op (or Python construct) is causing the failure, so it is impossible to disable ops (or report bugs) based on this output alone.

Regarding debug-level messaging, I am opposed to using it as the default solution for addressing issues in general: large and complex graphs (where Torch-TensorRT will frequently be used) output massive amounts of information that is irrelevant to the root issue, and having to parse it adds friction to the debugging process when the user is frequently looking for just one offending op. It is also easy to change the lowered graph when searching for a problematic op, or combination of ops, especially when attempting to isolate errors, which may hide certain failures; being able to print out a failing op adds an additional layer of sanity checking. Additionally, the output here frequently looks successful and correct, so it's not apparent at a glance (if at all) where the issue lies. That said, I would still like to see your suggestion of formalizing the printout realized, as it may indeed already be sufficient (and I'm just not reading the debug output correctly or well enough).

In summary, I'm in favor of giving these failures the same printout treatment we currently have for unsupported ops, since the end result is that the user must either 1. explicitly tell Torch-TensorRT to ignore the op, 2. modify the graph to remove the op, or 3. report the op conversion failure as a bug or feature request. I would frame this as raising the most basic information (i.e. the op, file, and line number that may already be present) from the lowered (debug) output to regular output in all situations where a warning or error appears, because this is usually the best place to start debugging.

narendasan

narendasan commented on Apr 1, 2022

@narendasan
Collaborator

Yeah, in general I am not against raising the severity of messages and providing more context, but I do want to be purposeful about what we report to the user in these error messages and ideally make our errors actionable, i.e. here is an issue we detected, and here is a potential solution. What I want to avoid is dumping a ton of information on people in cases where it's misleading or there is no action to take. This would include your Could not find torch::jit::Value* 47 produced from %47 : int = prim::dtype(%52). This is likely an issue with the partitioning system itself, but at least for me, dumping a backtrace implies the issue is with the user's code and not Torch-TensorRT, and may cause people to spend a lot of time doing blackbox debugging on their own code base. As an aside, the partitioning system in general just needs a lot of work on better logging.

Totally agree that in cases like this https://github.com/NVIDIA/Torch-TensorRT/blob/ba9f730fe049ffd31cd1d30e7f99a74c82086342/core/conversion/evaluators/aten.cpp#L220 a backtrace would be super useful.

In cases like this https://github.com/NVIDIA/Torch-TensorRT/blob/ba9f730fe049ffd31cd1d30e7f99a74c82086342/core/partitioning/shape_analysis.cpp#L67, the error does not seem actionable by the user; if there is a failure here, I'm not sure what options a user has. Perhaps dumping context at a lower level may help with reporting, but just throwing out the trace here obfuscates the issue.

Philosophically, I would prefer logging to be as close as possible to how people expect compiler logging to work. Warnings imply a potential risk, and Errors imply some incompatibility between the source code and the expectations of the compiler; in both cases the action is on the user to accept the risk or modify their code. The issue is that, unlike GCC, Torch-TRT is still fairly new, so hitting bugs in the compiler happens more often, and then the action is really on us; the user's code is fine.

One thing that might help here is to better distinguish between RuntimeErrors, which come from TORCHTRT_CHECK and the Torch-TRT exception system, and log-level "Errors", which are detected incompatibilities in the user's source code.

> therefore it is impossible to disable ops.

I would say the expectation here is just to throw the offending op in torch_executed_ops.
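As a sketch of that workflow (a fragment rather than a full script: model and the input shape below are placeholders for the user's actual module and shapes, and the op name is whatever the error output reported):

```python
# Fragment: force the op reported in the error to fall back to PyTorch
# rather than be converted by TensorRT. `model` and the input shape are
# placeholders; this relies on partial compilation being allowed.
trt_mod = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    torch_executed_ops=["aten::gt"],  # op name taken from the log/error output
)
```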

TL;DR: totally agree that we can provide more information to make it easier to narrow down where failures are happening, but we want to be careful not to send people off on wild bug hunts for things they are not expected to fix. I think it's worth taking a full pass over the logging system and adding more information in general.

chaoz-dev

chaoz-dev commented on Apr 5, 2022

@chaoz-dev
ContributorAuthor

Hmm, yeah, that's fair; I'm in agreement with what's stated here, especially with regard to not sending users off on a wild goose chase. That said, would you say that just logging an op and file location (not the full stack trace) is still too much here? Even if it's an underlying issue with the conversion process (i.e. the user's code is fine), this still seems like a good (and easier) way for the user to verify that their op usage is correct (because we're saying this op is failing, but on inspection its usage is correct) and that it's the compilation that's failing, which also makes it easier to report the bug.

github-actions

github-actions commented on Jul 5, 2022

@github-actions

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

ncomly-nvidia

ncomly-nvidia commented on Jul 26, 2022

@ncomly-nvidia
Contributor
narendasan

narendasan commented on Sep 2, 2022

@narendasan
Collaborator

Converting this to a discussion as it is not a specific request but a general request for improvement

locked and limited conversation to collaborators on Sep 2, 2022
converted this issue into a discussion #1331 on Sep 2, 2022


        ✨[Feature] Print out Python file name, line number, and/or op for warnings, compilation failures · Issue #949 · pytorch/TensorRT