Running a recent HLT menu in CMSSW_15_0_0_pre2, I see the runtime error in [1].
Some facts and circumstantial evidence:
- The issue is not fully reproducible (so, right now I don't really have a reproducer).
- The issue happens pretty frequently in the HLT workflow I'm testing. The latter consists of running 8 jobs in parallel, each with 32 threads and 24 concurrent events, on a machine with the same hardware as a "2022 HLT node", e.g. hilton-c2b02-44-01 (2 AMD Milan CPUs + 2 NVIDIA GPUs). FWIW, a readme + example of what I'm running is in [2] and [3] (the recipe assumes the use of one of the HLT/GPU nodes in the CMS online network; the instructions could be adapted to lxplus if needed).
- I have seen the issue with and without offloading to GPUs ("without" meaning options.accelerators = ["cpu"]; see the sketch after this list).
- I have seen the issue starting with CMSSW_15_0_0_pre2, and I see it also in more recent 15_0_X IBs.
- I ran the same workflow more than once in CMSSW_15_0_0_pre1, and I have not seen this runtime error in that pre-release so far.
- Just for the record, I saw two PRs ([DAQ] Input source improvements and initial Phase-2 DTH format support #47068, and Improve behavior of streamer system #47073) that entered 15_0_0_pre2 and seemed to me loosely related to output modules; I locally reverted both PRs on top of 15_0_0_pre2, and I still see the same runtime error as in [1].
- So far, I have failed to reproduce the problem with simpler configurations (as opposed to a full-blown HLT menu running in multiple jobs using all threads).
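As a reference for the settings mentioned in the list above, this is a minimal sketch (an illustration, not a verbatim excerpt of the configurations I run) of how the CPU-only mode and the per-job thread/stream counts can be expressed in a cmsRun configuration:
import FWCore.ParameterSet.Config as cms
# minimal sketch (assumption: appended to an existing HLT menu dump that defines 'process')
process.options.accelerators = cms.untracked.vstring('cpu')  # "cpu" = no offloading to GPUs
process.options.numberOfThreads = cms.untracked.uint32(32)   # threads per job
process.options.numberOfStreams = cms.untracked.uint32(24)   # concurrent events (streams) per job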
From discussions with @fwyzard and @makortel, the issue looks compatible with a race condition.
One suggestion by @makortel was to check whether the error occurs at the very beginning of the job or later. I managed to reproduce the error after enabling more MessageLogger output, and it happened ~60 events into the job (using 32 threads and 24 streams in the job), so early in the job but not at the very beginning.
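For reference, a minimal sketch (an assumed customisation, not necessarily the exact one I used) of how the MessageLogger output can be made verbose enough to locate where the job fails:
# minimal sketch (assumption): report every event instead of the default sparse reporting,
# so the last FwkReport line indicates roughly how far the job got before the error
process.MessageLogger.cerr.FwkReport.reportEvery = 1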
FYI: @cms-sw/hlt-l2
Edit-1 (Feb-09): a script used to reproduce the error on lxplus was added in #47287 (comment).
[1]
----- Begin Fatal Exception 26-Jan-2025 10:02:31 CET-----------------------
An exception of category 'FatalRootError' occurred while
[0] Processing Event run: 386593 lumi: 94 event: 213402124 stream: 6
[1] Running path '@finalPath'
[2] Calling method for module GlobalEvFOutputModule/'hltOutputParkingSingleMuon8'
Additional Info:
[a] Fatal Root Error: @SUB=TClass::StreamerDefault
fStreamerImpl not properly initialized (0)
----- End Fatal Exception -------------------------------------------------
[2]
# working-directory name and CMSSW release to use
dirName=MY_TEST_DIR
cmsswRel=CMSSW_15_0_0_pre2

# CMSSW environment and online site configuration
export SCRAM_ARCH=el8_amd64_gcc12
source /cvmfs/cms.cern.ch/cmsset_default.sh
export SITECONFIG_PATH="/opt/offline/SITECONF/local"

# ssh agent, Kerberos ticket, and a SOCKS proxy through cmsusr.cms
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
kinit $(logname)@CERN.CH
ssh -f -N -D18081 $(logname)@cmsusr.cms

# set up the CMSSW release area
mkdir -p /fff/user/"${USER}"/"${dirName}"
cd /fff/user/"${USER}"/"${dirName}"
cmsrel "${cmsswRel}"
cd "${cmsswRel}"/src
cmsenv
scram b
cd "${OLDPWD}"

# scripts used to run the HLT throughput measurements
git clone git@github.com:missirol/hltThroughputUtils.git -o missirol -b master
cd hltThroughputUtils
git clone git@github.com:missirol/patatrack-scripts.git -o missirol -b master_old
[3]
#!/bin/bash -e
[ $# -eq 1 ] || exit 1
jobLabel="${1}"
runNumber=386593
outDir=/fff/user/"${USER}"/output/hltThroughputUtils
run() {
  # arguments: ${1} = config file, ${2} = patatrack-scripts directory, ${3} = output directory,
  #            ${4} = number of jobs, ${5} = threads per job, ${6} = streams per job
  [ ! -d "${3}" ] || exit 1
  mkdir -p $(dirname "${3}")
  foo=$(printf "%125s") && echo ${foo// /-} && unset foo
  printf " %s\n" "${3}"
  foo=$(printf "%125s") && echo ${foo// /-} && unset foo
  rm -rf run"${runNumber}"
  ${2}/benchmark "${1}" -E cmsRun -r 4 -j "${4}" -t "${5}" -s "${6}" -e 40100 -g 1 -n --no-cpu-affinity -l "${3}" -k resources.json --tmpdir "${outDir}"/tmp |& tee "${3}".log
  ./merge_resources_json.py "${3}"/step*/pid*/resources.json > "${3}".json
  mv "${3}".log "${3}".json "${3}"
  cp "${1}" "${3}"
}
https_proxy=http://cmsproxy.cms:3128/ \
hltConfigFromDB --configName /dev/CMSSW_14_2_0/GRun/V11 > tmp.py
cp /gpu_data/store/data/Run2024*/EphemeralHLTPhysics/FED/run"${runNumber}"_cff.py .
# ensure MPS is disabled at the start
./stop-mps-daemon.sh
### Intermediate configuration file
cat <<@EOF >> tmp.py
process.load('run${runNumber}_cff')
from HLTrigger.Configuration.customizeHLTforCMSSW import customizeHLTforCMSSW
process = customizeHLTforCMSSW(process)
process.GlobalTag.globaltag = '150X_dataRun3_HLT_v1'
process.PrescaleService.lvl1DefaultLabel = '2p0E34'
process.PrescaleService.forceDefault = True
process.hltPixelTracksSoA.CAThetaCutBarrel = 0.00111685053
process.hltPixelTracksSoA.CAThetaCutForward = 0.00249872683
process.hltPixelTracksSoA.hardCurvCut = 0.695091509
process.hltPixelTracksSoA.dcaCutInnerTriplet = 0.0419242041
process.hltPixelTracksSoA.dcaCutOuterTriplet = 0.293522194
process.hltPixelTracksSoA.phiCuts = [
832, 379, 481, 765, 1136,
706, 656, 407, 1212, 404,
699, 470, 652, 621, 1017,
616, 450, 555, 572
]
# remove check on timestamp of online-beamspot payloads
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
# same source settings as used online
process.source.eventChunkSize = 200
process.source.eventChunkBlock = 200
process.source.numBuffers = 4
process.source.maxBufferedFiles = 2
# taken from hltDAQPatch.py
process.options.numberOfConcurrentLuminosityBlocks = 2
# write a JSON file with the timing information
if hasattr(process, 'FastTimerService'):
    process.FastTimerService.writeJSONSummary = True
# remove HLTAnalyzerEndpath if present
if hasattr(process, 'HLTAnalyzerEndpath'):
    del process.HLTAnalyzerEndpath
@EOF
### Final configuration file (dump)
edmConfigDump tmp.py > "${jobLabel}"_dump.py
rm -rf tmp.py
### Throughput measurements (benchmark)
jobDirPrefix="${jobLabel}"-"${CMSSW_VERSION}"
## GPU MPS
unset CUDA_VISIBLE_DEVICES
./start-mps-daemon.sh
sleep 1
run "${jobLabel}"_dump.py ./patatrack-scripts "${outDir}"/"${jobDirPrefix}"-gpu_mps 8 32 24
./stop-mps-daemon.sh
sleep 1
rm -rf "${jobLabel}"*{cfg,dump}.py
rm -rf run"${runNumber}"
rm -rf run"${runNumber}"_cff.py
rm -rf __pycache__ tmp
rm -rf tmp.py
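For completeness: script [3] expects a single argument, the job label. Assuming it is saved as a (hypothetical) runBenchmark.sh inside the hltThroughputUtils working directory set up in [2], it would be invoked e.g. as ./runBenchmark.sh test_pre2.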