Integrating PyTorch in Alpaka heterogeneous core #47984
cms-bot internal usage |
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47984/44654
Code check has found code style and quality issues which could be resolved by applying following patch(s)
|
46072a6 to c01d07f
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47984/44655
|
A new Pull Request was created by @lukaszmichalskii for master. It involves the following packages:
The following packages do not have a category, yet:
DataFormats/PyTorchTest
@cmsbuild, @valsdav, @y19y19 can you please review it and eventually sign? Thanks.
cms-bot commands are listed here |
enable gpu |
please test |
-1
Failed Tests: Build ClangBuild
Build
I found compilation error when building:
>> Compiling src/PhysicsTools/PyTorch/test/testModel.cc
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/c++ -c -DCMS_MICRO_ARCH='x86-64-v3' -DGNU_GCC -D_GNU_SOURCE -DCMSSW_GIT_HASH='CMSSW_15_1_X_2025-05-06-2300' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_15_1_X_2025-05-06-2300' -Isrc -Ipoison -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-05-06-2300/src -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include/torch/csrc/api/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cppunit/1.15.x-25a760f1303b0fca73df75b14e1358bc/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cuda/12.8.1-f1c01abd08373a07ceeffab8d5f1930a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/protobuf/3.21.9-1126508a53768c90e66f6bf1821ac03a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/zlib/1.2.13-d217cdbdd8d586e845e05946de2796be/include -O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++20 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses
-Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DBOOST_DISABLE_ASSERTS -flto=auto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -fPIC -MMD -MF tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testModel.cc.d src/PhysicsTools/PyTorch/test/testModel.cc -o tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testModel.cc.o >> Compiling src/PhysicsTools/PyTorch/test/testRunner.cc /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/c++ -c -DCMS_MICRO_ARCH='x86-64-v3' -DGNU_GCC -D_GNU_SOURCE -DCMSSW_GIT_HASH='CMSSW_15_1_X_2025-05-06-2300' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_15_1_X_2025-05-06-2300' -Isrc -Ipoison -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-05-06-2300/src -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/pytorch/2.6.0-a6d0e4413a9e766b40a2b79f83b4b176/include/torch/csrc/api/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cppunit/1.15.x-25a760f1303b0fca73df75b14e1358bc/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/cuda/12.8.1-f1c01abd08373a07ceeffab8d5f1930a/include -I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/protobuf/3.21.9-1126508a53768c90e66f6bf1821ac03a/include 
-I/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02888/el8_amd64_gcc12/external/zlib/1.2.13-d217cdbdd8d586e845e05946de2796be/include -O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++20 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DBOOST_DISABLE_ASSERTS -flto=auto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -fPIC -MMD -MF tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testRunner.cc.d src/PhysicsTools/PyTorch/test/testRunner.cc -o tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testRunner.cc.o
In file included from src/PhysicsTools/PyTorch/test/testModel.cc:5:
src/PhysicsTools/PyTorch/test/testUtilities.h:4:10: fatal error: boost/filesystem.hpp: No such file or directory
    4 | #include <boost/filesystem.hpp>
      |          ^~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
gmake: *** [tmp/el8_amd64_gcc12/src/PhysicsTools/PyTorch/test/testModel/testModel.cc.o] Error 1
>> Building binary testModel
Clang Build
I found compilation error while trying to compile with clang. Command used:
>> Local Products Rules ..... done
>> Creating project symlinks
>> Entering Package PhysicsTools/PyTorch
>> Entering Package DataFormats/PyTorchTest
>> Compile sequence completed for CMSSW CMSSW_15_1_X_2025-05-06-2300
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 1
Command exited with non-zero status 1
Command being timed: "scram build -k -j 32 COMPILER=llvm compile BUILD_LOG=yes"
User time (seconds): 907.23
System time (seconds): 91.81
Percent of CPU this job got: 655% |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47984/45085 |
please test |
please abort |
please test |
+1
Size: This PR adds an extra 16KB to repository
Comparison Summary
Summary:
CUDA Comparison Summary
Summary:
ROCM Comparison Summary
Summary:
|
+ml
Signing for ML, but we would like to get some feedback about this from @cms-sw/core-team. Thanks in advance! |
I'll take a look in the coming days. In the future please use @cms-sw/core-l2 team. |
FYI @cms-sw/heterogeneous-l2 |
Pull request has been put on hold by @fwyzard |
I would like to give feedback from the @cms-sw/heterogeneous-l2 and alpaka point of view before this proceeds. In fact, I would suggest presenting the work at one of the upcoming GPU development meetings. |
assign heterogeneous |
From first read through.
The files in this new package do not seem to technically depend on PyTorch. The dependencies are thus the same as in `DataFormats/PortableTestObjects` (except that `PortableTestObjects` also depends on `eigen`).
I'd suggest considering moving the test data types to `DataFormats/PortableTestObjects`. (I'm not against a separate package, but I'd want the rationale for a separate package to be documented in the PR discussion.)
@@ -0,0 +1,20 @@
#ifndef DATA_FORMATS__PYTORCH_TEST__INTERFACE__DEVICE_H_
Preferred format would be
-#ifndef DATA_FORMATS__PYTORCH_TEST__INTERFACE__DEVICE_H_
+#ifndef DataFormats_PyTorchTest_interface_Device_h
(see 4.1 in https://cms-sw.github.io/cms_coding_rules.html#4--technical-coding-rules-1)
(also in other files)
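To make the suggested guard format concrete, here is a minimal sketch of how a header might look with it. Only the guard macro name comes from the suggestion above; the `#define` and `#endif` lines and the placeholder constant `kExampleLayoutVersion` are assumptions added for illustration and are not the PR's actual content.

```cpp
// Sketch of DataFormats/PyTorchTest/interface/Device.h with the suggested guard name.
#ifndef DataFormats_PyTorchTest_interface_Device_h
#define DataFormats_PyTorchTest_interface_Device_h

namespace torchportable {
  // Hypothetical placeholder; the real header declares the device collections.
  inline constexpr int kExampleLayoutVersion = 1;
}  // namespace torchportable

#endif  // DataFormats_PyTorchTest_interface_Device_h
```

The guard follows the package, subdirectory and file name, so each header in the package gets a distinct macro.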
#include "DataFormats/Portable/interface/PortableDeviceCollection.h"
#include "DataFormats/PyTorchTest/interface/Layout.h"

namespace torchportable {
I'd suggest including `test` in the namespace name.
I find the file name `Device.h` confusing, because the file does not define or declare any device. I'd suggest renaming it, e.g. along the lines of `PyTorchTestDeviceCollections.h`. Same for `Host.h`, `Layout.h`, and `alpaka/Collections.h`.
See also `DataFormats/PortableTestObjects/interface` (and `DataFormats/TestObjects/interface`) for established practice.
using ::torchportable::ClassificationCollectionDevice;
using ::torchportable::ClassificationCollectionHost;
using ::torchportable::ParticleCollectionDevice;
using ::torchportable::ParticleCollectionHost;
using ::torchportable::RegressionCollectionDevice;
using ::torchportable::RegressionCollectionHost;
These shouldn't be needed
-using ::torchportable::ClassificationCollectionDevice;
-using ::torchportable::ClassificationCollectionHost;
-using ::torchportable::ParticleCollectionDevice;
-using ::torchportable::ParticleCollectionHost;
-using ::torchportable::RegressionCollectionDevice;
-using ::torchportable::RegressionCollectionHost;
process.schedule = cms.Schedule(
    process.path
)
This is not necessary. In the absence of `process.schedule` the framework will run all Paths.
-process.schedule = cms.Schedule(
-    process.path
-)
function die { echo Failed $1: status $2 ; exit $2 ; }

SCRIPT="PhysicsTools/PyTorch/test/testPipeline.py"
Is there some other reason to have this `testPipelineStandalone.sh` than avoiding the `${LOCALTOP}/src` part in the `testPipeline.sh`?
<bin name="testTorch" file="testTorch.cc">
<iftool name="cuda-gcc-support">
<bin name="testTensorStride" file="testRunner.cc,testTensorStride.cu">
<use name="catch2"/>
`catch2` doesn't seem to be used
(same for the other test executables below)
<use name="FWCore/ParameterSet"/>
<use name="FWCore/ParameterSetReader"/>
<use name="FWCore/PluginManager"/>
<use name="FWCore/ServiceRegistry"/>
<use name="FWCore/Utilities"/>
<use name="HeterogeneousCore/CUDAServices"/>
<use name="HeterogeneousCore/CUDAUtilities"/>
Of these, only `CUDAUtilities` seems to be used
-<use name="FWCore/ParameterSet"/>
-<use name="FWCore/ParameterSetReader"/>
-<use name="FWCore/PluginManager"/>
-<use name="FWCore/ServiceRegistry"/>
-<use name="FWCore/Utilities"/>
-<use name="HeterogeneousCore/CUDAServices"/>
-<use name="HeterogeneousCore/CUDAUtilities"/>
+<use name="HeterogeneousCore/CUDAUtilities"/>
(same for the other test executables below)
<test name="TestPipelineCpu" command="testPipeline.sh cpu"/>
<test name="TestPipelineCuda" command="testPipeline.sh cuda">
I'd suggest having more specific names for these tests, e.g.
-<test name="TestPipelineCpu" command="testPipeline.sh cpu"/>
-<test name="TestPipelineCuda" command="testPipeline.sh cuda">
+<test name="TestPyTorchPipelineCpu" command="testPipeline.sh cpu"/>
+<test name="TestPyTorchPipelineCuda" command="testPipeline.sh cuda">
PR description:
This PR enables seamless integration between PyTorch and the Alpaka-based heterogeneous computing backend, supporting inference workflows that use the `pytorch` library with `PortableCollection` objects. It provides: `Guard` objects specialized for each supported backend.
This implementation was presented and discussed at the Core Software Meeting: https://indico.cern.ch/event/1538634/
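The idea of a `Guard` specialized per backend can be pictured with a small RAII sketch. This is not the PR's implementation, whose interface is not shown in this conversation: the backend tags `CpuBackend` and `CudaLikeBackend`, the helper `runGuarded`, and the `guardLog()` recorder are all hypothetical names, and the guards only log what a real guard might do (for example selecting a device or stream before inference and restoring state afterwards).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical backend tags standing in for the alpaka backends.
struct CpuBackend {};
struct CudaLikeBackend {};

// Records guard activity so the sketch is observable without any hardware.
// Not thread-safe; illustration only.
inline std::vector<std::string>& guardLog() {
  static std::vector<std::string> log;
  return log;
}

// Primary template is left undefined: every backend must provide a specialization.
template <typename TBackend>
class Guard;

template <>
class Guard<CpuBackend> {
public:
  Guard() { guardLog().push_back("cpu: enter"); }
  ~Guard() { guardLog().push_back("cpu: exit"); }
  Guard(const Guard&) = delete;
  Guard& operator=(const Guard&) = delete;
};

template <>
class Guard<CudaLikeBackend> {
public:
  explicit Guard(int deviceId) : deviceId_(deviceId) {
    guardLog().push_back("cuda-like: select device " + std::to_string(deviceId_));
  }
  ~Guard() { guardLog().push_back("cuda-like: restore previous device"); }
  Guard(const Guard&) = delete;
  Guard& operator=(const Guard&) = delete;

private:
  int deviceId_;
};

// Runs a stand-in "inference" step while the backend guard is alive.
template <typename TBackend, typename... TArgs>
void runGuarded(TArgs... args) {
  Guard<TBackend> guard{args...};  // braces avoid the most vexing parse for an empty pack
  guardLog().push_back("inference");
}
```

Calling `runGuarded<CpuBackend>()` or `runGuarded<CudaLikeBackend>(1)` brackets the inference step with the backend-specific setup and teardown, which is the property a scope-based guard is meant to give.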
PR validation:
Included is demonstration code showing interoperability of `SoA` constructs with the `PyTorch` C++ API and the CMSSW environment, in the `plugins` and `test` packages.
PyTorch Ahead-of-time compilation
This pull request also investigates an AOT compilation strategy, but that part is a beta-stage proof of concept and not yet ready for production usage.
GPU support
The CUDA backend is supported and tested; ROCm is not yet supported: cms-sw/cmsdist#9786
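As an illustration of the `SoA`-to-tensor interoperability mentioned under PR validation, the sketch below shows how a column-wise buffer can be described by sizes and strides without copying. It is a standalone approximation under stated assumptions: `TensorView`, `viewColumns`, `at` and the padded-column layout are hypothetical and do not reproduce the PR's code or the actual `PortableCollection` memory layout. With libtorch available, the same sizes and strides are the kind of information one would pass to `torch::from_blob`, which accepts explicit strides.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical non-owning 2D view: element (r, c) lives at data[r * rowStride + c * colStride].
struct TensorView {
  const float* data;
  std::int64_t rows;
  std::int64_t cols;
  std::int64_t rowStride;
  std::int64_t colStride;
};

// Assumed SoA layout: each column is stored contiguously and padded to `paddedRows` elements,
// so consecutive rows are 1 element apart and consecutive columns are `paddedRows` apart.
inline TensorView viewColumns(const float* base, std::int64_t rows, std::int64_t cols, std::int64_t paddedRows) {
  return TensorView{base, rows, cols, 1, paddedRows};
}

// Strided element access, equivalent to indexing a tensor built with the same strides.
inline float at(const TensorView& view, std::int64_t row, std::int64_t col) {
  return view.data[row * view.rowStride + col * view.colStride];
}
```

The point of the design is that the tensor shares the collection's memory: only the stride description differs from a plain row-major array, so no transposition or copy is needed before inference.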
FYI @valsdav @ericcano @felicepantaleo @chrisizeh @leobeltra