Suppress C++ stacktrace on XLA_CHECK*() calls. #9448

Open · wants to merge 2 commits into master

Conversation

ysiraichi (Collaborator)

This PR improves error messages in PyTorch/XLA by suppressing the C++ stack trace printed when an XLA_CHECK*() call fails, making the output more user-friendly. Currently, a failing XLA_CHECK*() produces a lengthy and verbose C++ stack trace. While that trace can be useful for deep-dive debugging by developers, it mostly adds noise for end users.

Key Changes:

Before:

Traceback (most recent call last):
  File "dot.py", line 6, in <module>
    torch.dot(a, b)
RuntimeError: torch_xla/csrc/aten_xla_bridge.cpp:110 : Check failed: xtensor
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        torch_xla::bridge::GetXlaTensor(at::Tensor const&)
        torch_xla::XLANativeFunctions::dot(at::Tensor const&, at::Tensor const&)

        c10::BoxedKernel::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
        c10::KernelFunction::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
        c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const



        c10::BoxedKernel::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const


        at::_ops::dot::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)





        at::_ops::dot::call(at::Tensor const&, at::Tensor const&)
        at::Tensor::dot(at::Tensor const&) const



        _PyObject_MakeTpCall
        _PyEval_EvalFrameDefault

        PyEval_EvalCode



        _PyRun_SimpleFileObject
        _PyRun_AnyFileObject
        Py_RunMain
        Py_BytesMain
        __libc_start_main
        _start
*** End stack trace ***
Input tensor is not an XLA tensor: torch.FloatTensor

After:

Traceback (most recent call last):
  File "dot.py", line 6, in <module>
    torch.dot(a, b)
RuntimeError: Check failed: xtensor: Input tensor is not an XLA tensor: torch.FloatTensor

@zhanyong-wan (Collaborator) left a comment:

Thanks!

ess << file_ << ":" << line_ << " : " << sink_str;
ess << sink.str();

if (ShouldShowCppErrorContext()) {
Review comment:
Can we include the C++ stack trace when this is true?

//
// More specifically, whether the `XLA_SHOW_CPP_ERROR_CONTEXT` environment
// variable is set or not.
bool ShouldShowCppErrorContext();
Review comment:
Style: [[nodiscard]]

@@ -7,14 +7,14 @@
#include "tsl/platform/statusor.h"

#define XLA_ERROR() TF_ERROR_STREAM()
Review comment:
Document how these macros react to XLA_SHOW_CPP_ERROR_CONTEXT?

ess << file_ << ":" << line_ << " : " << sink_str;
ess << sink.str();

if (ShouldShowCppErrorContext()) {
Review comment:
Add tests to verify the new behavior?
