Skip to content

_sim_debug_flush(): Use fsync().#454

Open
bscottm wants to merge 1 commit intoopen-simh:masterfrom
bscottm:debug_flush_loop
Open

_sim_debug_flush(): Use fsync().#454
bscottm wants to merge 1 commit intoopen-simh:masterfrom
bscottm:debug_flush_loop

Conversation

@bscottm
Copy link
Copy Markdown
Contributor

@bscottm bscottm commented Mar 17, 2025

Fix race condition in _debug_fwrite_all() where the FILE *f output file can be permanently NULL, causing fwrite() to permanently return zero and resulting in an infinite loop on mulithreaded SIMH. This pathology primarily occurs when main thread calls flush_svc() and one of the Ethernet threads (reader, writer) emits debugging output.

Linux, macOS invoke fsync(). Windows invokes a fsync() wrapper that invokes _commit().

Issue traced to _sim_debug_flush() calling sim_set_deboff(0, NULL) to simulate fsync().

@bscottm
Copy link
Copy Markdown
Contributor Author

bscottm commented Mar 17, 2025

@pkoning2: Please merge PR #450 so that CI/CD can get past the LTO issue so that this and other builds have some chance at success (barring simulator test failures.)

@pkoning2
Copy link
Copy Markdown
Member

I'm a bit confused here. You pointed out that the simulated fsync can't be used in multithreaded simh. So why is it still in the code? If it's broken, would it not be more logical to remove it, especially since everyone has fsync?

@bscottm
Copy link
Copy Markdown
Contributor Author

bscottm commented Mar 17, 2025

I'm a bit confused here. You pointed out that the simulated fsync can't be used in multithreaded simh. So why is it still in the code? If it's broken, would it not be more logical to remove it, especially since everyone has fsync?

It can still be used for single threaded code, i.e., simulators that don't use AIO.

Also, I don't have a good handle on which platforms SIMH is supposed to support (the makefile seems to support a lot more than just Linux, macOS and Windows... Haiku? SunOS variants?) So, I left it in there as CYA.

Very happy to take it out and rewrite code to assume everyone has fsync(), except for Windows, which needs the diaper.

@bscottm
Copy link
Copy Markdown
Contributor Author

bscottm commented Mar 18, 2025

I'm a bit confused here. You pointed out that the simulated fsync can't be used in multithreaded simh. So why is it still in the code? If it's broken, would it not be more logical to remove it, especially since everyone has fsync?

@pkoning2: It's all just fsync now. If there's a platform out there that needs to be supported (VMS?), an issue is probably the best way to address it. Now back to the ETH_MAC * discussion...

Comment thread scp.c
*
* f == NULL: fwrite() returns 0 on Linux and Windows. len never gets decremented
* and an infinite loop ensues (cue "Forever NULL" sung to the tune of Alphaville's
* "Forever Young".)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply treat length written == 0 as an error? But in my testing, a NULL file pointer causes a segfault (on recent Linux as well as on Mac OS).

Copy link
Copy Markdown
Contributor Author

@bscottm bscottm Aug 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pkoning2: I left the "dragon" note as more of a "Don't Do This" warning. I added extra checks deal with the f == NULL,len_written == 0 and errno != EAGAIN cases. In each case, SIMH outputs a diagnostic message (the "why") and the buffer's contents that weren't written to the debug output.

@bscottm bscottm force-pushed the debug_flush_loop branch 3 times, most recently from af23c88 to 072fe53 Compare August 21, 2025 16:28
Fix race condition in _debug_fwrite_all() where the FILE *f output
file can be permanently NULL, causing fwrite() to permanently return
zero and resulting in an infinite loop on mulithreaded SIMH. This
pathology primarily occurs when main thread calls flush_svc() and
one of the Ethernet threads (reader, writer) emits debugging output.

Linux and macOS platforms invoke fsync() to synchronize file data to
disk. Windows has a fsync() wrapper calling _commit() providing the
same semantics.

Issue traced to _sim_debug_flush() calling sim_set_deboff(0, NULL) to
simulate fsync().
@pmetzger
Copy link
Copy Markdown

So I'm looking at this for inclusion in ZIMH and I can't understand why the old behavior of closing and reopening the stream made any sense or why an fsync() is needed over a simple fflush; this is for debugging logging after all, not a database where you need an atomic commit to durable storage.

Can someone explain to me why this can't just be:

/* Flush pending debug output. */
static t_stat _sim_debug_flush (void)
{
    if (sim_deb == NULL)                                /* no debug? */
        return SCPE_OK;

    _sim_debug_write_flush ("", 0, TRUE);

    if (fflush (sim_deb) != 0)
        return SCPE_IOERR;

    return SCPE_OK;
}

If no one has a good explanation to the contrary, I think I'm just going to do it that way. fflush is nice and portable C after all.

pmetzger added a commit to pmetzger/zimh that referenced this pull request Apr 29, 2026
Fix the debug-output flush race described in open-simh/simh#454, but
with a much simpler local implementation.

The old debug flush path drained SIMH's internal debug buffer, then
closed and reopened ordinary debug files to force the stdio buffers out.
That temporarily cleared and replaced the global debug stream, which
could race with other threads emitting debug output.

Keep the internal debug-buffer flush, then call fflush() on the current
debug stream. This preserves the useful flush behavior without closing
or replacing sim_deb.

Upstream-PR: open-simh/simh#454
@bscottm
Copy link
Copy Markdown
Contributor Author

bscottm commented May 1, 2026

@pmetzger: fsync() is a hard gaurantee that data is sync-ed to the file system, whereas fflush() doesn't make that gaurantee. (Note: fsync() doesn't guarantee that file system metadata is sync-ed, only the data.)

Like you, I've gone a different direction to re-implement in Rust, although I could be convinced to contribute to ZIMH. I had enough of the trolling.

@pmetzger
Copy link
Copy Markdown

pmetzger commented May 1, 2026

@bscottm My intention is to rebuild the entire thing in Rust once I have a full set of tests in place so that I feel confident that subsystems work as they're converted. Happy to discuss how the whole thing is to work; the discussions feature is open on the ZIMH repo.

@pmetzger
Copy link
Copy Markdown

pmetzger commented May 1, 2026

I've thought about it, and I went with just the fflush; the fsync didn't seem necessary because if the host machine crashes hard at that moment you don't have any real guarantees about where the simulated machine got to in a debugging run anyway. I can be argued out of that position, but it's better to do it in the ZIMH repo issues and discussions.

@pkoning2
Copy link
Copy Markdown
Member

pkoning2 commented May 3, 2026

Actually, the definition of fsync is that it commits both data and metadata. You may be thinking of fdatasync which (as you can tell from the name) is metadata only.
fflush is something different, it writes buffered data to the device for stdio type buffered I/O. The Mac OS documention does not say, but the Linux one does, that the two are separate and a complete commit to disk requires fflush followed by fsync.

@pmetzger
Copy link
Copy Markdown

pmetzger commented May 3, 2026

Yes, but there is no need to sync a log file hard to the durable device (hard disk or ssd); that lowers system performance (including for unrelated processes!) unnecessarily for the very rare case where the system crashes during your work, and if it crashed right then, the probability that you care about the log that much is nil. fsync is there for when you're dong a database transaction or the like, and this is not a database transaction. fflushing'ing a log, on the other hand, is perfectly reasonable. So there's no need for the fsync, but an fflush is reasonable. So I fully cleaned up this code in my own fork just to fflush.

@rcornwell
Copy link
Copy Markdown
Member

I don't see any reason to call fsync, if the data is flushed and the application crashes, it will still get written to disk. fsync is designed to make sure disk is in sync with file, mostly for multi application access.

@pkoning2
Copy link
Copy Markdown
Member

pkoning2 commented May 3, 2026

I don't see any reason to call fsync, if the data is flushed and the application crashes, it will still get written to disk. fsync is designed to make sure disk is in sync with file, mostly for multi application access.

Yes, that's true. fsync is needed if you're worried about OS crashes. For application issues fflush is sufficient.

@pmetzger
Copy link
Copy Markdown

pmetzger commented May 4, 2026

And this is an application issue; the goal is to get debugging information out, not to deal with rare host operating system crashes. You can clean this whole thing up into a very short routine that is much easier to understand. (I did.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants