Ecal Phase2 Weights-based Reconstruction, Alpaka integration #47124

Jakub-Gajownik
Contributor

@Jakub-Gajownik Jakub-Gajownik commented Jan 17, 2025

PR description:

  • Migration of the ECAL weights amplitude reconstruction from CUDA to the Alpaka framework.
  • Time (jitter) reconstruction has been added to match the existing CPU-based reconstruction.
  • The .612 workflow has been updated to use the Alpaka modifier instead of the gpu modifier.

PR validation:

  • The existing 24034.612 workflow has been run to validate the results of the reconstruction against the existing CPU and CUDA-based reconstructions. The tests were performed on a sample of 8900 events.
  • Perfect agreement with the CUDA code has been achieved. A small deviation of the order of e(-22) on the mean has been observed when comparing the reconstructed amplitudes of the Alpaka GPU code with the CPU, which is consistent with observations from other studies. Perfect agreement between the Portable CPU and the native CPU has also been observed.
  • The 29634.612 workflow has also been run successfully.

@cmsbuild
Contributor

cmsbuild commented Jan 17, 2025

cms-bot internal usage

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47124/43345

  • Found files with invalid states:

    • RecoLocalCalo/EcalRecProducers/plugins/EcalUncalibRecHitConvertPortable2CPUFormat.cc:
  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild
Contributor

A new Pull Request was created by @Jakub-Gajownik for master.

It involves the following packages:

  • RecoLocalCalo/EcalRecProducers (reconstruction)

@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@ReyerBand, @apsallid, @argiro, @denizsun, @missirol, @rchatter, @salimcerci, @thomreis, @wang0jin, @youyingli this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@Jakub-Gajownik
Contributor Author

type ecal

@cmsbuild cmsbuild added the ecal label Jan 17, 2025
@Jakub-Gajownik
Contributor Author

enable gpu

@thomreis
Contributor

enable gpu

@thomreis
Contributor

please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2405fb/43825/summary.html
COMMIT: de7ec1c
CMSSW: CMSSW_15_0_X_2025-01-16-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47124/43825/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3819085
  • DQMHistoTests: Total failures: 82
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3818983
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 214 log files, 184 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 33
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 53038
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

@jfernan2
Contributor

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@jfernan2
Contributor

+1

Contributor

@makortel makortel left a comment

Is this code tested in any workflow?

Comment on lines 36 to 37
EcalPhase2DigiToPortableProducer::EcalPhase2DigiToPortableProducer(edm::ParameterSet const &ps)
: inputDigiToken_(consumes<EBDigiCollectionPh2>(ps.getParameter<edm::InputTag>("BarrelDigis"))),
Contributor

We are in the process of requiring Alpaka-based EDProducers to pass the edm::ParameterSet to the base class constructor (with the goal of preventing use of the base class default constructor in 15_0_0_pre3). It would be helpful to do that already in this PR. I'd suggest rebasing to 15_0_0_pre2 (or any IB of this year) and

Suggested change
- EcalPhase2DigiToPortableProducer::EcalPhase2DigiToPortableProducer(edm::ParameterSet const &ps)
-     : inputDigiToken_(consumes<EBDigiCollectionPh2>(ps.getParameter<edm::InputTag>("BarrelDigis"))),
+ EcalPhase2DigiToPortableProducer::EcalPhase2DigiToPortableProducer(edm::ParameterSet const &ps)
+     : EDProducer<>(ps),
+       inputDigiToken_(consumes<EBDigiCollectionPh2>(ps.getParameter<edm::InputTag>("BarrelDigis"))),
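
The effect of this suggestion can be illustrated with a minimal standalone sketch. It assumes hypothetical stand-ins (`ParameterSet`, `ProducerBase`, `DigiProducer`) and does not use the real CMSSW classes: once the base class no longer offers a default constructor, a derived producer compiles only if it forwards its configuration to the base explicitly.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for edm::ParameterSet; not the CMSSW class.
struct ParameterSet {
  std::string inputTag;
};

// Hypothetical stand-in for an Alpaka-based EDProducer base class
// whose default constructor has been removed.
class ProducerBase {
public:
  explicit ProducerBase(ParameterSet const& ps) : configuredFrom_(ps.inputTag) {}
  ProducerBase() = delete;

  std::string const& configuredFrom() const { return configuredFrom_; }

private:
  std::string configuredFrom_;
};

// Derived producer: the base is initialised first, then the members,
// mirroring the order used in the suggested change.
class DigiProducer : public ProducerBase {
public:
  explicit DigiProducer(ParameterSet const& ps)
      : ProducerBase(ps),  // omitting this line would fail to compile
        inputTag_(ps.inputTag) {}

  std::string const& inputTag() const { return inputTag_; }

private:
  std::string inputTag_;
};

// Neither class can be constructed without a configuration.
static_assert(!std::is_default_constructible<ProducerBase>::value, "base needs a ParameterSet");
static_assert(!std::is_default_constructible<DigiProducer>::value, "producer needs a ParameterSet");
```

Constructing `DigiProducer` from a `ParameterSet{"BarrelDigis"}` then makes the same tag visible through both `inputTag()` and the base class accessor `configuredFrom()`.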

digisHostCollView.size() = i;

//copy collection from host to device
alpaka::memcpy(event.queue(), digisDevColl.buffer(), digisHostColl.buffer());
Contributor

Since this EDProducer only copies data from host to device (without any kernels), in 15_0_0_pre2 this explicit copy from host to device could be avoided by the EDProducer producing the EcalDigiPhase2HostCollection as a host-side data product (i.e. with edm::EDPutTokenT). The framework would then implicitly copy that to the device (see https://github.com/cms-sw/cmssw/blob/80d1847/HeterogeneousCore/AlpakaCore/README.md#edproducer for more details).

Contributor Author

Could you clarify whether the Digi producer in this case becomes a normal edm::global producer and should be moved outside of the Alpaka namespace?

Contributor

@fwyzard fwyzard Mar 14, 2025

It should still be an Alpaka-based producer in the ALPAKA_ACCELERATOR_NAMESPACE namespace.

See EventFilter/HcalRawToDigi/plugins/alpaka/HcalDigisSoAProducer.cc for an example.


#include "DataFormats/EcalDigi/interface/alpaka/EcalDigiPhase2DeviceCollection.h"
#include "DataFormats/EcalRecHit/interface/alpaka/EcalUncalibratedRecHitDeviceCollection.h"
#include <alpaka/alpaka.hpp>
Contributor

It would be clearer to have the #includes of external packages and CMSSW headers in separate "blocks", and to list the CMSSW headers in alphabetical order.

Comment on lines 23 to 24
template <typename TAcc>
ALPAKA_FN_ACC void operator()(TAcc const &acc,
Contributor

Could be

Suggested change
- template <typename TAcc>
- ALPAKA_FN_ACC void operator()(TAcc const &acc,
+ ALPAKA_FN_ACC void operator()(Acc1D const &acc,

};

template <typename TAcc>
ALPAKA_FN_ACC void Phase2WeightsKernel::operator()(
Contributor

There would be less repetition (of function arguments) if the implementation were placed at the function declaration a few lines above.
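
As a rough illustration of the repetition being discussed, here is a standalone sketch with hypothetical names (`DemoAcc`, `ScaleKernelSplit`, `ScaleKernelInline`); it is not Alpaka code and omits `ALPAKA_FN_ACC`. The split form writes the template header and the full argument list twice, while the inline form writes them once.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical accelerator stand-in; not an Alpaka type.
struct DemoAcc {};

// Split form: declaration inside the class, definition outside.
struct ScaleKernelSplit {
  template <typename TAcc>
  void operator()(TAcc const& acc, double const* in, double* out, std::size_t n) const;
};

// The template header and every argument have to be repeated here.
template <typename TAcc>
void ScaleKernelSplit::operator()(TAcc const&, double const* in, double* out, std::size_t n) const {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = 2.0 * in[i];
  }
}

// Inline form: the body sits at the declaration, so nothing is repeated.
struct ScaleKernelInline {
  template <typename TAcc>
  void operator()(TAcc const&, double const* in, double* out, std::size_t n) const {
    for (std::size_t i = 0; i < n; ++i) {
      out[i] = 2.0 * in[i];
    }
  }
};
```

Both functors behave identically when called as `ScaleKernelSplit{}(DemoAcc{}, in, out, n)` or `ScaleKernelInline{}(DemoAcc{}, in, out, n)`; only the amount of duplicated signature text differs.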

@@ -0,0 +1,2 @@
import RecoLocalCalo.EcalRecProducers.ecalUncalibRecHitPhase2WeightsProducerPortable_cfi as _mod
ecalUncalibRecHitPhase2Portable = _mod.ecalUncalibRecHitPhase2WeightsProducerPortable.clone()
Contributor

What is the purpose of this file? I mean, is it useful to have both ecalUncalibRecHitPhase2WeightsProducerPortable_cfi and ecalUncalibRecHitPhase2Portable_cfi?


private:
const edm::EDGetTokenT<EBDigiCollectionPh2> inputDigiToken_;
const device::EDPutToken<EcalDigiPhase2DeviceCollection> outputDigiDevToken_;
Contributor

Just out of curiosity, does the EcalDigiPhase2DeviceCollection have any potential use other than as the input for EcalUncalibRecHitPhase2WeightsProducerPortable, or do you foresee alternative modules that would produce EcalDigiPhase2DeviceCollection as well?

I'm asking only because if there is high certainty that this and EcalUncalibRecHitPhase2WeightsProducerPortable will always come together, and nothing else will use EcalDigiPhase2DeviceCollection (or EcalDigiPhase2HostCollection), then these two modules could be merged into one module (that would lead to a somewhat simpler setup, especially in the configuration). I'm not suggesting changing the setup in this PR, only offering something to think about.

Contributor Author

Yes, we have other modules in mind. The weights producer is a baseline, but we are thinking of porting the multifit as well.


namespace ALPAKA_ACCELERATOR_NAMESPACE {

class EcalPhase2DigiToPortableProducer : public stream::EDProducer<> {
Contributor

Looks like this module could be very easily made global::EDProducer<>.

Comment on lines 15 to 17
namespace ALPAKA_ACCELERATOR_NAMESPACE {
namespace ecal {
namespace weights {
Contributor

These could be expressed as

Suggested change
- namespace ALPAKA_ACCELERATOR_NAMESPACE {
-   namespace ecal {
-     namespace weights {
+ namespace ALPAKA_ACCELERATOR_NAMESPACE::ecal::weights {

in order to have shorter indentation in the code below.
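
For reference, a small standalone sketch of the C++17 nested namespace definition this suggestion relies on. The macro `DEMO_ACCELERATOR_NAMESPACE` and the functions `nestedForm` and `compactForm` are hypothetical, introduced only to show that both spellings open the same namespace, including when the outermost name comes from a macro.

```cpp
#include <cassert>

// Hypothetical macro standing in for ALPAKA_ACCELERATOR_NAMESPACE.
#define DEMO_ACCELERATOR_NAMESPACE demo_serial_sync

// Pre-C++17 spelling: three nested blocks, three levels of indentation.
namespace DEMO_ACCELERATOR_NAMESPACE {
  namespace ecal {
    namespace weights {
      inline int nestedForm() { return 1; }
    }  // namespace weights
  }  // namespace ecal
}  // namespace DEMO_ACCELERATOR_NAMESPACE

// C++17 nested namespace definition: one block, same namespace.
namespace DEMO_ACCELERATOR_NAMESPACE::ecal::weights {
  // nestedForm() is found without qualification because both
  // definitions extend the same namespace.
  inline int compactForm() { return nestedForm() + 1; }
}  // namespace DEMO_ACCELERATOR_NAMESPACE::ecal::weights
```

Both functions are reachable as `demo_serial_sync::ecal::weights::nestedForm()` and `demo_serial_sync::ecal::weights::compactForm()`. The compact form requires compiling as C++17 or later.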

double const *timeWeightsdata,
EcalDigiPhase2DeviceCollection::ConstView digisDev,
EcalUncalibratedRecHitDeviceCollection::View uncalibratedRecHitsDev) const {
constexpr int nsamples = EcalDataFrame_Ph2::MAXSAMPLES;
Contributor

I'd suggest to use either ecalPh2::sampleSize or EcalDataFrame_Ph2::MAXSAMPLES (I see the latter is defined to be the same as the former) both here and where weightsData and timeWeightsdata are allocated/defined so that a future reader can easily convince themselves that the loop below does not go over the bounds of weightsData/timeWeightsdata.
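
The point about a single shared constant can be sketched in standalone C++. Everything named here (`kSampleSize`, `Samples`, `weightedSum`) is hypothetical, and the value 16 is an assumption made for the sketch; the real code would use `ecalPh2::sampleSize` or `EcalDataFrame_Ph2::MAXSAMPLES`. The sketch covers only a plain weighted sum over samples, not the full reconstruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical constant for this sketch (value assumed). Using the same
// constant for storage and for the loop bound is the point of the example.
constexpr std::size_t kSampleSize = 16;

// Weights and samples share one array type, so their sizes cannot diverge.
using Samples = std::array<double, kSampleSize>;

// Weighted sum over one digi: sum_i weights[i] * samples[i].
// The loop bound is the same constant that sizes both arrays, so a reader
// can see at a glance that no access goes out of bounds.
inline double weightedSum(Samples const& weights, Samples const& samples) {
  double sum = 0.0;
  for (std::size_t i = 0; i < kSampleSize; ++i) {
    sum += weights[i] * samples[i];
  }
  return sum;
}
```

If the weights were instead stored in a buffer sized by a different constant, the compiler would reject the call to `weightedSum` rather than letting a mismatch surface at run time.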

Change the UncalibratedRecHitsProducer to a global producer

Run code-checks and code-format
@cmsbuild
Contributor

-1

Failed Tests: RelVals-CUDA
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2405fb/45649/summary.html
COMMIT: e8bf0bd
CMSSW: CMSSW_15_1_X_2025-04-19-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47124/45649/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-CUDA

  • 12834.406: 12834.406_TTbar_14TeV+2024_Patatrack_PixelOnlyTripletsAlpaka/step3_TTbar_14TeV+2024_Patatrack_PixelOnlyTripletsAlpaka.log
  • 12834.403: 12834.403_TTbar_14TeV+2024_Patatrack_PixelOnlyAlpaka_Validation/step3_TTbar_14TeV+2024_Patatrack_PixelOnlyAlpaka_Validation.log
  • 12834.402: 12834.402_TTbar_14TeV+2024_Patatrack_PixelOnlyAlpaka/step3_TTbar_14TeV+2024_Patatrack_PixelOnlyAlpaka.log

Comparison Summary

Summary:

  • You potentially added 5 lines to the logs
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3916361
  • DQMHistoTests: Total failures: 49
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3916292
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

ROCM Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

@fwyzard
Contributor

fwyzard commented Apr 22, 2025

ignore tests-rejected with ib-failure

@fwyzard
Contributor

fwyzard commented Apr 22, 2025

I've opened #47918 to track the follow-up issues discussed here and elsewhere.

@fwyzard
Contributor

fwyzard commented Apr 22, 2025

+heterogeneous

@fwyzard
Contributor

fwyzard commented Apr 22, 2025

For me it's fine to work on the follow up issues in separate PRs.
Of course, if you prefer to fix some of them already here, that is fine as well.

@jfernan2
Contributor

+1

@Moanwar
Contributor

Moanwar commented Apr 22, 2025

+Upgrade

@AdrianoDee
Contributor

+pdmv

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (test failures were overridden). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @rappoccio, @sextonkennedy, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit 88ee799 into cms-sw:master Apr 23, 2025
16 of 17 checks passed