Skip to content

Add generic module reload validation and harden fastrpc unload hang coverage #336

Open
smuppand wants to merge 2 commits intoqualcomm-linux:mainfrom
smuppand:kernel
Open

Add generic module reload validation and harden fastrpc unload hang coverage #336
smuppand wants to merge 2 commits intoqualcomm-linux:mainfrom
smuppand:kernel

Conversation

@smuppand
Copy link
Contributor

@smuppand smuppand commented Mar 9, 2026

Add a new generic Module_Reload_Validation suite as per the new requirement #326 and wire in an initial fastrpc profile for unload/reload regression coverage.

Key changes in this PR:

  • reusable module reload engine in lib_module_reload.sh
  • new Baseport Module_Reload_Validation testcase and YAML
  • profile-based execution so future modules can be covered without  duplicating testcase logic
  • add fastrpc-specific daemon lifecycle handling around adsprpcd/cdsprpcd
  • improve timeout handling so stuck rmmod/modprobe commands do not hang  the harness
  • capture clearer timeout-path evidence and print relevant artifact paths  to stdout for CI/LAVA debugging
  • narrow module-state collection to the target module being tested

The first target is fastrpc, where unload can hang after daemon activity.This suite is meant to make that regression reproducible in automation andto preserve useful evidence when it happens.

root@rb3gen2-core-kit:/tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation# ./run.sh --module fastrpc
[INFO] 1970-01-01 00:01:06 - ---------------- Starting Module_Reload_Validation ----------------
[INFO] 1970-01-01 00:01:06 - Platform Details: machine='Qualcomm Technologies, Inc. Robotics RB3gen2' target='Kodiak' kernel='6.18.7-g0e842c87bef3-dirty' arch='aarch64'
[INFO] 1970-01-01 00:01:06 - Args: module='fastrpc' iterations=3 unload_timeout=30 load_timeout=30 settle_timeout=20 mode='profile-default' sysrq_dump=1
[INFO] 1970-01-01 00:01:06 - Sysrq hang dump policy: enabled on timeout paths only
[INFO] 1970-01-01 00:01:06 - [fastrpc] module=fastrpc mode=daemon_lifecycle iterations=3
[INFO] 1970-01-01 00:01:06 - [fastrpc] warmup: unmasking and starting adsprpcd.service
[INFO] 1970-01-01 00:01:07 - [fastrpc] warmup: unmasking and starting cdsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: stopping adsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: masking adsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: killing remaining adsprpcd.service cgroup processes
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: stopping cdsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: masking cdsprpcd.service
[INFO] 1970-01-01 00:01:09 - [fastrpc] quiesce: killing remaining cdsprpcd.service cgroup processes
[INFO] 1970-01-01 00:01:12 - [fastrpc] sending SIGTERM to pid=587 pattern=/usr/bin/adsprpcd
[INFO] 1970-01-01 00:01:28 - [fastrpc] iter 1/3 exec: rmmod fastrpc
[FAIL] 1970-01-01 00:02:02 - [fastrpc] iter 1/3 unload timed out after 30s (pid=12476)
[INFO] 1970-01-01 00:02:02 - [fastrpc] unload timeout log: /tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01/unload.log
[INFO] 1970-01-01 00:02:02 - [fastrpc] collecting hang evidence in: /tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01/hang_evidence

Necessary logs will be available

root@rb3gen2-core-kit:/tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01# ls -l
total 4
drwxr-xr-x 2 root root 340 Jan  1 00:02 hang_evidence
drwxr-xr-x 2 root root 260 Jan  1 00:01 pre_state
-rw-r--r-- 1 root root 185 Jan  1 00:02 unload.log
drwxr-xr-x 2 root root 120 Jan  1 00:01 unload_timeout
root@rb3gen2-core-kit:/tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01#

smuppand added 2 commits March 9, 2026 23:07
…ence capture

Avoid wedging the test harness when unload or load commands time out
by returning immediately after timeout evidence collection.

Also narrow module state capture to the target module and improve
timeout-path stdout reporting for easier debugging.

Signed-off-by: Srikanth Muppandam <smuppand@qti.qualcomm.com>
Update the fastrpc profile to stop and mask rpc daemons, verify
remaining daemon processes, and provide clearer stdout context when
quiesce or unload paths fail.

Signed-off-by: Srikanth Muppandam <smuppand@qti.qualcomm.com>
@mwasilew
Copy link
Contributor

Is there an example job using this code? It almost seems like there is too much code for one commit.

@basak-qcom
Copy link

basak-qcom commented Mar 11, 2026

I'm also puzzled by the complexity of this. I was expecting <30 lines in total! For example, while I agree it makes sense to make the load/unload test generic of the specific module being tested, why is fastrpc.profile 438 lines, when I'd expect it to then declare the names of a kernel module, a couple of systemd services, and perhaps a few other parameters only? I imagine adding a second profile for a different kernel module and having to duplicate a huge bunch of code. So then what did the abstraction do for us?

@smuppand
Copy link
Contributor Author

I'm also puzzled by the complexity of this. I was expecting <30 lines in total! For example, while I agree it makes sense to make the load/unload test generic of the specific module being tested, why is fastrpc.profile 438 lines, when I'd expect it to then declare the names of a kernel module, a couple of systemd services, and perhaps a few other parameters only? I imagine adding a second profile for a different kernel module and having to duplicate a huge bunch of code. So then what did the abstraction do for us?

Thanks — this is a fair concern.
The intent of the abstraction was to keep the runner generic (run.sh + lib_module_reload.sh) and isolate only the module-specific lifecycle in the profile. In the current fastrpc case, that profile ended up carrying too much procedural logic because the unload hang is tightly coupled to adsprpcd / cdsprpcd stop-mask-kill-wait handling and the evidence collection around that path.

I agree the current split is not ideal if future profiles have to repeat this much code. The right end state is for a profile to be mostly declarative (module name, services, patterns, mode, a few knobs), with the common service/process wait/kill helpers moved into lib_module_reload.sh.

For this PR, I started with the fastrpc-specific flow to make the unload-hang regression reproducible first. I can follow up by shrinking fastrpc.profile and moving the reusable pieces into the shared library so that adding a second module profile is much smaller and cleaner.

@smuppand
Copy link
Contributor Author

Is there an example job using this code? It almost seems like there is too much code for one commit.

https://lava.infra.foundries.io/scheduler/job/158642

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants