Skip to content

fix assertion failure in distributed barrier#6871

Open
the-ivii wants to merge 6 commits intoTheHPXProject:masterfrom
the-ivii:fix/barrier-assertion-6430
Open

fix assertion failure in distributed barrier#6871
the-ivii wants to merge 6 commits intoTheHPXProject:masterfrom
the-ivii:fix/barrier-assertion-6430

Conversation

@the-ivii
Copy link
Copy Markdown
Contributor

@the-ivii the-ivii commented Feb 2, 2026

Fixes #6430

Proposed Changes

  • added a null check for barrier nodes in barrier::synchronize to prevent a crash when the barrier is destroyed early.
  • added a regression test barrier_synchronize to verify the fix works as expected.
  • updated the test suite cmake file to include the new verification test.

Any background context you want to provide?

Checklist

Not all points below apply to all pull requests.

  • I have added a new feature and have added tests to go along with it.
  • I have fixed a bug and have added a regression test.
  • I have added a test using random numbers; I have made sure it uses a seed, and that random numbers generated are valid inputs for the tests.

@the-ivii the-ivii requested a review from hkaiser as a code owner February 2, 2026 11:22
@StellarBot
Copy link
Copy Markdown

Can one of the admins verify this patch?

@the-ivii the-ivii force-pushed the fix/barrier-assertion-6430 branch 3 times, most recently from 6048f0b to 5af888e Compare February 3, 2026 07:19
init_args.desc_cmdline = description;

HPX_TEST_EQ(hpx::init(argc, argv, init_args), 0);
hpx::distributed::barrier::synchronize();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a few HPX facilities can be used when the runtime is not active (like here). The distributed barrier is not one of them.

Did you see actual code that is doing this? Or is this an artificial construct meant to test your change to synchronize() above?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a test designed to simulate a race condition we saw during teardown in distributed runs. It is not meant to encourage post runtime usage, but just to ensure that if a sync call overlaps with cleanup, it returns gracefully instead of triggering a hard assertion failure.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't your solution just hide the problem?

Also, I'm glad you now mention under what circumstances you saw the original issue. If it happens during system teardown, then it's definitely not a problem in the initialization (as I initially assumed) but a possible sequencing error while the runtime exits (possibly caused by the sequencing of global destructors being undefined).

IIRC, the barrier in this case is used to make sure that all localities have finished executing.

I'd appreciate it if you could provide a stack backtrace of the original assert you were seeing. That may allow us to see which object is destroyed too late.

@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Feb 8, 2026

@the-ivii Could you please rebase your branch onto master (and resolve the conflicts)?

@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Feb 26, 2026

@the-ivii What are your plans regarding this PR? Will you have the time to work on it?

@the-ivii the-ivii force-pushed the fix/barrier-assertion-6430 branch from a42a94b to 191ad62 Compare February 26, 2026 14:24
HPX_TEST_EQ(hpx::init(argc, argv, init_args), 0);

return hpx::util::report_errors();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test reproduce the problem for you (without your change)?

the-ivii added 2 commits March 3, 2026 11:49
Signed-off-by: the-ivii <divyanshitalreja12@gmail.com>
Signed-off-by: the-ivii <divyanshitalreja12@gmail.com>
@the-ivii the-ivii force-pushed the fix/barrier-assertion-6430 branch from c663e38 to bafd7c6 Compare March 3, 2026 06:27
@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Mar 3, 2026

The issue reported by the clang-format CI should be fixed. Also, the MacOS builders fail your new test. Please have an eye on the CIs yourself.

@the-ivii the-ivii force-pushed the fix/barrier-assertion-6430 branch 4 times, most recently from f7fbcf8 to 26b73d6 Compare March 3, 2026 19:27
@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Mar 17, 2026

@the-ivii there is still the clang-format issue, also your changes cause compilations errors. Please have a look.

Signed-off-by: the-ivii <divyanshitalreja12@gmail.com>
Signed-off-by: the-ivii <divyanshitalreja12@gmail.com>
@the-ivii the-ivii force-pushed the fix/barrier-assertion-6430 branch from 26b73d6 to a807991 Compare March 21, 2026 14:19
Signed-off-by: the-ivii <divyanshitalreja12@gmail.com>
Signed-off-by: the-ivii <divyanshitalreja12@gmail.com>
@the-ivii the-ivii force-pushed the fix/barrier-assertion-6430 branch from cb127a5 to 74d11aa Compare March 22, 2026 09:24
@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Mar 28, 2026

@the-ivii ping?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HPX distributed barrier runs into assertion error

3 participants