Skip to content

Implement remove and remove_if algorithms for datapar execution#6970

Open
BhoomishGupta wants to merge 16 commits intoTheHPXProject:masterfrom
BhoomishGupta:remove_datapar
Open

Implement remove and remove_if algorithms for datapar execution#6970
BhoomishGupta wants to merge 16 commits intoTheHPXProject:masterfrom
BhoomishGupta:remove_datapar

Conversation

@BhoomishGupta
Copy link
Copy Markdown
Contributor

Partially Fixes #2333

Proposed Changes

-Added datapar/remove.hpp implementing SIMD-accelerated sequential_remove_if and sequential_remove via tag_invoke overloads for vectorpack execution policies
-Added detail/remove.hpp with sequential_remove_if_t and sequential_remove_t tag-fallback CPOs used by both scalar and datapar paths
-Added datapar unit tests remove_datapar.cpp and remove_if_datapar.cpp covering simd, par_simd, simd(task), and par_simd(task) policies with both random-access and forward iterators; registered both in CMakeLists.txt

Any background context you want to provide?

Checklist

Not all points below apply to all pull requests.

  • I have added a new feature and have added tests to go along with it.
  • I have fixed a bug and have added a regression test.
  • I have added a test using random numbers; I have made sure it uses a seed, and that random numbers generated are valid inputs for the tests.

Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>
@BhoomishGupta BhoomishGupta requested a review from hkaiser as a code owner March 6, 2026 23:14
@StellarBot
Copy link
Copy Markdown

Can one of the admins verify this patch?

@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Mar 7, 2026

@BhoomishGupta From what I can see, your new implementations do not take advantage of data parallelisation but simply do a per-element processing once the first matching element is found. Do you see a way of taking advantage of data parallelism for the remainder of the sequence as well?

@BhoomishGupta
Copy link
Copy Markdown
Contributor Author

@BhoomishGupta From what I can see, your new implementations do not take advantage of data parallelisation but simply do a per-element processing once the first matching element is found. Do you see a way of taking advantage of data parallelism for the remainder of the sequence as well?

Thanks for the feedback! I will look into it.

Just as a quick reminder, my university exams are still going on right now, so my availability is quite limited. I will dive into this and work on an improved approach right after my exams finish on March 14th.

Thanks for your patience!

BhoomishGupta and others added 2 commits March 20, 2026 08:24
"must not be a simd (vectorpack) execution policy");
return std::forward<ExPolicy>(policy);
}
} to_non_simd{};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved these mapping CPOs from https://github.com/STEllAR-GROUP/hpx/blob/master/libs/core/executors/include/hpx/executors/datapar/execution_policy_mappings.hpp to libs/core/executors/include/hpx/executors/execution_policy_mappings.hpp

These CPOs were originally defined in the datapar header, but I moved them to the common execution policy mappings header so they are available even when datapar/SIMD is not enabled. This is needed for code paths that reference the mappings independently of datapar support.
The datapar header now just includes the common header (guarded by HPX_HAVE_DATAPAR) for compatibility.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would this be needed if datapar is not enabled?

Copy link
Copy Markdown
Contributor Author

@BhoomishGupta BhoomishGupta Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I reviewed this. And you are right.
I originally fell back to the non_simd policy (and consequently had to move the execution_policy_mappings.hpp header to make it globally visible) because the f1 loop was failing to compile under the SIMD backends.

However, doing this forced the parallel chunks to evaluate the predicate sequentially, which indeed defeats the purpose of the SIMD implementation.

The root cause of the original compilation failure was that the flags array was declared as bool[]. In the std::simd backends, boolean vector arrays instantiate as proxy mask objects rather than standard packed primitives, which broke the vector_pack traits when it tried to iterate and assign. Also, the f1 lambda was explicitly taking (zip_iterator it) instead of (auto it), which prevented the datapar loop_n from passing in the vectorized chunks.

To resolve this properly, I am thinking of doing this:

  1. I will change the flags array allocation from bool to std::uint8_t so it reliably packs.
  2. I will change the f1 lambda parameter to auto it and revert its loop policy back to the native std::decay_t so that the heavy predicate evaluation actually vectorizes.
  3. I will completely revert the execution_policy_mappings.hpp move, as to_non_simd will no longer be leaking unconditionally into the generic algorithm.

std::size_t part_size) -> void {
// MSVC complains if pred or proj is captured by ref below
util::loop_n<std::decay_t<ExPolicy>>(part_begin, part_size,
util::loop_n<inner_policy_type>(part_begin, part_size,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't invoking the loop with a non-simd policy defeat the whole purpose of introducing SIMD functionalities to remove_if?

Copy link
Copy Markdown
Contributor Author

@BhoomishGupta BhoomishGupta Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right. Invoking the loop with inner_policy_type (non-SIMD) was a workaround because of a compilation error with the SIMD backend. I've since identified that the root cause was the bool[] flags array and the explicit zip_iterator parameter in the lambda. I'm reverting this to use the original execution policy to ensure the predicate evaluation is properly vectorized, as detailed in my other comment

else if (!hpx::parallel::traits::all_of(msk))
{
//mixed
for (std::size_t i = 0; i < size; ++i)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For std::simd (std::experimental::simd) you could utilize https://en.cppreference.com/w/cpp/experimental/simd/find_first_set.html to possibly reduce the amount of iterations needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the other datapar backends (EVE, VC, arm simd) you may have to emulate this or simply always return 0 and simd_width respectively)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up! It looks like HPX already has a generic implementation for this across all backends via hpx::parallel::traits::find_first_of(msk) in libs/core/execution/include/hpx/execution/traits/vector_pack_find.hpp

I'll update the loop to use it.

Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>
Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>
Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>
Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>
@codacy-production
Copy link
Copy Markdown

codacy-production bot commented Apr 1, 2026

Not up to standards ⛔

🔴 Issues 1 medium

Alerts:
⚠ 1 issue (≤ 0 issues of at least minor severity)

Results:
1 new issue

Category Results
ErrorProne 1 medium

View in Codacy

🟢 Metrics 0 complexity · 0 duplication

Metric Results
Complexity 0
Duplication 0

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Apr 1, 2026

@BhoomishGupta Would you be able to follow up on the compilation errors reported by the CIs?

@BhoomishGupta
Copy link
Copy Markdown
Contributor Author

@BhoomishGupta Would you be able to follow up on the compilation errors reported by the CIs?

Working on it!

…xecution

Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>
@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Apr 2, 2026

@BhoomishGupta Please pay attention to the CIs. ARM is failing to compile. Also, the remove test is now failing to compile even on CIs that do not enable SIMD functionalities.

Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>
@BhoomishGupta
Copy link
Copy Markdown
Contributor Author

@BhoomishGupta Please pay attention to the CIs. ARM is failing to compile. Also, the remove test is now failing to compile even on CIs that do not enable SIMD functionalities.

My mistake! I missed an include. Just pushed a fix. Watching the CIs now

Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>
@BhoomishGupta BhoomishGupta force-pushed the remove_datapar branch 2 times, most recently from 7bd012e to 7d1fae7 Compare April 3, 2026 12:52
Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>
Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>
Signed-off-by: BhoomishGupta <bhoomishguptabits@gmail.com>

#include <hpx/config.hpp>
#include <hpx/algorithms/traits/projected.hpp>
#include <hpx/execution/traits/vector_pack_conditionals.hpp>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please alsways use only the generted module includes (except inside a module itself):

Suggested change
#include <hpx/execution/traits/vector_pack_conditionals.hpp>
#include <hpx/modules/execution.hpp>

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will keep to generated module includes in non-module code paths from now on.

typename std::iterator_traits<Iter>::value_type>;
using flag_t =
std::conditional_t<use_vectorpack_predicate, value_t, bool>;
using zip_iterator = hpx::util::zip_iterator<Iter, flag_t*>;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refactor the code similarily to what we have done for he other algorithms. The code related to datapar should be kept fully separate from the non-datapar code paths. If datapar is disabled, e.g., the is_vectorpack_execution_policy you use here are not available.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I will refactor the code such that datapar-specific assumptions are out of generic algorithm paths so non-datapar builds remain valid.

{
// Self-assignment must be detected.
util::loop_n<execution_policy_type>(
util::loop_n<hpx::execution::sequenced_policy>(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this lose information? Why is this change needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not lose algorithm/data information.
That loop has a loop-carried dependency on dest plus self-assignment checks, so running it with vectorpack/unsequenced execution is not safe without a prefix-sum style compaction step. Using sequenced_policy here preserves correctness and ordering; the SIMD work remains in the predicate-evaluation phase f1. So the change drops vectorization in f2.

hpx::is_invocable_v<Pred,
typename std::iterator_traits<FwdIter>::value_type
>
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain the rationale of this change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale is to keep the range execution-policy overload aligned with current dispatch behavior in remove_if.
If we required only the policy-aware projected callable check, scalar-only predicates would be rejected at the CPO boundary, even though the implementation can still run them through the scalar predicate-evaluation path when vectorized invocation is not available.
Adding the scalar invocable check preserves backward compatibility and keeps iterator/range policy-overload behavior consistent.
This changes only whether predicate evaluation can be vectorized for a given call; it does not change remove_if semantics or output correctness.

If you prefer stricter behavior for par_simd and related policies, I can tighten this to require only the policy-aware projected callable check and fail early otherwise.

@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Apr 8, 2026

@BhoomishGupta ping?

@BhoomishGupta
Copy link
Copy Markdown
Contributor Author

@BhoomishGupta ping?

Sorry for the delay. Resolving the comments asap

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement datapar for parallel algorithms

3 participants