fStreamerImpl not properly initialized error at runtime #47287

Closed

Description

@missirol

Running a recent HLT menu in CMSSW_15_0_0_pre2, I see the runtime error in [1].

Some facts and circumstantial evidence:

  • The issue is not fully reproducible (so, right now, I don't have a reliable reproducer).
  • The issue happens pretty frequently in the HLT workflow I'm testing. The latter consists of running 8 jobs in parallel, each with 32 threads and 24 concurrent events, on a machine with the same hardware as a "2022 HLT node", e.g. hilton-c2b02-44-01 (2 AMD Milan CPUs + 2 NVIDIA GPUs). FWIW, a readme and an example of what I'm running are in [2] and [3] (the recipe assumes the use of one of the HLT/GPU nodes in the CMS online network; the instructions could be adapted to lxplus if needed).
  • I have seen the issue both with and without offloading to GPUs ("without" meaning options.accelerators = ["cpu"]; see the config sketch right after this list).
  • I have seen the issue starting with CMSSW_15_0_0_pre2, and I see it also in more recent 15_0_X IBs.
  • I ran the same workflow more than once in CMSSW_15_0_0_pre1, and I have not seen this runtime error in that pre-release so far.
  • Just for the record, I noticed two PRs ([DAQ] Input source improvements and initial Phase-2 DTH format support #47068 and Improve behavior of streamer system #47073) that entered 15_0_0_pre2 and seemed loosely related to output modules; I locally reverted both on top of 15_0_0_pre2, and I still see the same runtime error as in [1].
  • So far, I have failed to reproduce the problem with simpler configurations (as opposed to a full-blown HLT menu running in multiple jobs using all threads).
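
For reference, a minimal sketch of the CPU-only customization mentioned above, applied on top of the HLT menu dump in the same way as the snippet appended in [3] (process is the cms.Process object of the menu; the exact customization used may have differed slightly):

# force CPU-only execution: restricting the list of accelerators to "cpu"
# means that no modules are offloaded to the GPUs
process.options.accelerators = ['cpu']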

Based on discussions with @fwyzard and @makortel, the issue looks compatible with a race condition.

One suggestion by @makortel was to check whether the error occurs at the very beginning of the job or later. I managed to reproduce the error with additional MessageLogger output enabled (a sketch of the customization is below), and it happened ~60 events into the job (which uses 32 threads and 24 streams), so early on, but not at the very beginning.
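
For the record, a minimal sketch of the kind of MessageLogger customization used to get the extra output, again appended to the menu dump (the exact settings may have differed):

# report every event on stderr, so that the point in the job at which
# the error shows up can be located
process.MessageLogger.cerr.FwkReport.reportEvery = 1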

FYI: @cms-sw/hlt-l2

Edit-1 (Feb-09): a script used to reproduce the error on lxplus was added in #47287 (comment).

[1]

----- Begin Fatal Exception 26-Jan-2025 10:02:31 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Processing  Event run: 386593 lumi: 94 event: 213402124 stream: 6
   [1] Running path '@finalPath'
   [2] Calling method for module GlobalEvFOutputModule/'hltOutputParkingSingleMuon8'
   Additional Info:
      [a] Fatal Root Error: @SUB=TClass::StreamerDefault
fStreamerImpl not properly initialized (0)

----- End Fatal Exception -------------------------------------------------

[2]

dirName=MY_TEST_DIR
cmsswRel=CMSSW_15_0_0_pre2

# CMSSW environment and local (online) site configuration
export SCRAM_ARCH=el8_amd64_gcc12
source /cvmfs/cms.cern.ch/cmsset_default.sh
export SITECONFIG_PATH="/opt/offline/SITECONF/local"

# load the ssh key into an ssh agent
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

# kerberos token, and a SOCKS proxy (local port 18081) through cmsusr.cms
kinit $(logname)@CERN.CH
ssh -f -N -D18081 $(logname)@cmsusr.cms

# create and enter the working directory
mkdir -p /fff/user/"${USER}"/"${dirName}"
cd /fff/user/"${USER}"/"${dirName}"

# set up the CMSSW release area
cmsrel "${cmsswRel}"
cd "${cmsswRel}"/src
cmsenv
scram b
cd "${OLDPWD}"

git clone [email protected]:missirol/hltThroughputUtils.git -o missirol -b master

cd hltThroughputUtils

git clone [email protected]:missirol/patatrack-scripts.git -o missirol -b master_old

[3]

#!/bin/bash -e

# exactly one argument expected: the job label
[ $# -eq 1 ] || exit 1

jobLabel="${1}"
runNumber=386593
outDir=/fff/user/"${USER}"/output/hltThroughputUtils

# run(): execute one throughput measurement
#   $1: cmsRun configuration file
#   $2: path to the patatrack-scripts directory
#   $3: output directory (must not exist yet)
#   $4: number of jobs, $5: threads per job, $6: streams (concurrent events) per job
run() {
  [ ! -d "${3}" ] || exit 1
  mkdir -p $(dirname "${3}")
  foo=$(printf "%125s") && echo ${foo// /-} && unset foo
  printf " %s\n" "${3}"
  foo=$(printf "%125s") && echo ${foo// /-} && unset foo
  rm -rf run"${runNumber}"
  ${2}/benchmark "${1}" -E cmsRun -r 4 -j "${4}" -t "${5}" -s "${6}" -e 40100 -g 1 -n --no-cpu-affinity -l "${3}" -k resources.json --tmpdir "${outDir}"/tmp |& tee "${3}".log
  ./merge_resources_json.py "${3}"/step*/pid*/resources.json > "${3}".json
  mv "${3}".log "${3}".json "${3}"
  cp "${1}" "${3}"
}

# dump the HLT menu from ConfDB, going through the online proxy
https_proxy=http://cmsproxy.cms:3128/ \
hltConfigFromDB --configName /dev/CMSSW_14_2_0/GRun/V11 > tmp.py

# cff file with the input data of run ${runNumber} (loaded into the configuration below)
cp /gpu_data/store/data/Run2024*/EphemeralHLTPhysics/FED/run"${runNumber}"_cff.py .

# ensure MPS is disabled at the start
./stop-mps-daemon.sh

### Intermediate configuration file
cat <<@EOF >> tmp.py

process.load('run${runNumber}_cff')

from HLTrigger.Configuration.customizeHLTforCMSSW import customizeHLTforCMSSW

process = customizeHLTforCMSSW(process)

process.GlobalTag.globaltag = '150X_dataRun3_HLT_v1'

process.PrescaleService.lvl1DefaultLabel = '2p0E34'
process.PrescaleService.forceDefault = True

process.hltPixelTracksSoA.CAThetaCutBarrel = 0.00111685053
process.hltPixelTracksSoA.CAThetaCutForward = 0.00249872683
process.hltPixelTracksSoA.hardCurvCut = 0.695091509
process.hltPixelTracksSoA.dcaCutInnerTriplet = 0.0419242041
process.hltPixelTracksSoA.dcaCutOuterTriplet = 0.293522194
process.hltPixelTracksSoA.phiCuts = [
    832, 379, 481, 765, 1136,
    706, 656, 407, 1212, 404,
    699, 470, 652, 621, 1017,
    616, 450, 555, 572
]

# remove check on timestamp of online-beamspot payloads
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)

# same source settings as used online
process.source.eventChunkSize = 200
process.source.eventChunkBlock = 200
process.source.numBuffers = 4
process.source.maxBufferedFiles = 2

# taken from hltDAQPatch.py
process.options.numberOfConcurrentLuminosityBlocks = 2

# write a JSON file with the timing information
if hasattr(process, 'FastTimerService'):
    process.FastTimerService.writeJSONSummary = True

# remove HLTAnalyzerEndpath if present
if hasattr(process, 'HLTAnalyzerEndpath'):
    del process.HLTAnalyzerEndpath
@EOF

### Final configuration file (dump)
edmConfigDump tmp.py > "${jobLabel}"_dump.py
rm -rf tmp.py

### Throughput measurements (benchmark)
jobDirPrefix="${jobLabel}"-"${CMSSW_VERSION}"

## GPU MPS
unset CUDA_VISIBLE_DEVICES
./start-mps-daemon.sh
sleep 1
run "${jobLabel}"_dump.py ./patatrack-scripts "${outDir}"/"${jobDirPrefix}"-gpu_mps 8 32 24
./stop-mps-daemon.sh
sleep 1

### Cleanup
rm -rf "${jobLabel}"*{cfg,dump}.py
rm -rf run"${runNumber}"
rm -rf run"${runNumber}"_cff.py
rm -rf __pycache__ tmp
rm -rf tmp.py
