Description
In NCHW tensor notation, the last dimension is the contiguous one (C/row-major order). In column-major matrix notation, the first dimension is the contiguous one. We haven't had to think much about this because our data samples are usually 1D or 3D, but transformers require batched matrix multiplication, where tensor and matrix conventions meet. We should settle this question and commit to a consistent scheme to avoid confusion.
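To make the difference concrete, here is a minimal pure-Python sketch of how the two conventions assign strides to the same shape. The helper names `element_strides` and `flat_offset` are hypothetical and only for illustration; they are not LBANN or DiHydrogen APIs.

```python
def element_strides(shape, order="row"):
    """Return per-dimension strides (in elements) for a dense array."""
    strides = [0] * len(shape)
    step = 1
    # Row-major: visit dimensions last to first, so the last one has stride 1.
    # Column-major: visit first to last, so the first one has stride 1.
    if order == "row":
        dims = reversed(range(len(shape)))
    else:
        dims = range(len(shape))
    for d in dims:
        strides[d] = step
        step *= shape[d]
    return tuple(strides)


def flat_offset(index, strides):
    """Map a multi-dimensional index to an offset in the flat buffer."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 4, 5)  # read as (N, C, H, W)
row_major = element_strides(shape, "row")        # (60, 20, 5, 1)
col_major = element_strides(shape, "column")     # (1, 2, 6, 24)
```

With the same shape, stepping along W moves one element in the row-major layout but 24 elements in the column-major layout, which is why a batched matrix multiply needs everyone to agree on which dimension is contiguous.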
As much as it pains me coming from an applied math background, I think we should switch to C/row-major/NCHW notation. It matches PyTorch, TensorFlow, and NumPy and seems to be more natural for practitioners.
Whatever we decide, DiHydrogen should use the same scheme as LBANN. Pinging @benson31, @naoyam, @ndryden.