[WIP] test: Increase etcd IOPS for AWS scale jobs by hakuna-matatah · Pull Request #17741 · kubernetes/kops

hakuna-matatah · 2025-11-04T21:38:52Z

To test some theories

k8s-ci-robot · 2025-11-04T21:38:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hakman for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

tests/e2e/scenarios/scalability/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hakuna-matatah · 2025-11-04T21:39:33Z

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-11-04T22:33:48Z

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-11-05T12:01:11Z

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-11-05T20:32:10Z

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-12-19T15:10:11Z

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-12-19T16:36:13Z

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-12-19T17:56:11Z

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-12-20T20:28:55Z

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-12-20T22:08:09Z

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-12-20T23:20:01Z

@dims @hakman It seems that tests (both presubmits and periodics) are now unable to even create clusters. The error says kops is unable to get bucket details.

I1220 07:05:34.766288   18101 subnets.go:224] Assigned CIDR 10.0.128.0/18 to subnet us-east-2c
Error: error building complete spec: failed to get bucket details for "s3://k8s-infra-kops-discovery-d19a-20251220070534/scale-5000.periodic.test-cncf-aws.k8s.io": Could not retrieve location for AWS bucket k8s-infra-kops-discovery-d19a-20251220070534

Error: error building complete spec: failed to get bucket details for "s3://k8s-infra-kops-discovery-8d16-20251220204549/e2e-fa029a0ba8-a2033.tests-kops-aws.k8s.io": Could not retrieve location for AWS bucket k8s-infra-kops-discovery-8d16-20251220204549

Do we know if these buckets still exist? I don't have access to the account to check this.
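For anyone with access to the account, a small sketch like the one below could pull the affected bucket names out of the error lines quoted above so they can be checked. The helper names are hypothetical and not part of kops. Reading the 14-digit suffix as a `YYYYMMDDHHMMSS` timestamp is an assumption based only on how the names look; the first one lines up with the `I1220 07:05:34` log line.

```python
import re
from datetime import datetime

# Error lines copied from the comment above.
LOG_LINES = [
    'Error: error building complete spec: failed to get bucket details for '
    '"s3://k8s-infra-kops-discovery-d19a-20251220070534/scale-5000.periodic.test-cncf-aws.k8s.io": '
    'Could not retrieve location for AWS bucket k8s-infra-kops-discovery-d19a-20251220070534',
    'Error: error building complete spec: failed to get bucket details for '
    '"s3://k8s-infra-kops-discovery-8d16-20251220204549/e2e-fa029a0ba8-a2033.tests-kops-aws.k8s.io": '
    'Could not retrieve location for AWS bucket k8s-infra-kops-discovery-8d16-20251220204549',
]

BUCKET_RE = re.compile(r"Could not retrieve location for AWS bucket (\S+)")
SUFFIX_RE = re.compile(r"-(\d{14})$")


def failing_buckets(lines):
    """Return the unique bucket names from the error lines, in first-seen order."""
    seen = []
    for line in lines:
        match = BUCKET_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def suffix_as_timestamp(bucket):
    """Interpret a trailing 14-digit suffix as YYYYMMDDHHMMSS (an assumption)."""
    match = SUFFIX_RE.search(bucket)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    except ValueError:
        return None


if __name__ == "__main__":
    for name in failing_buckets(LOG_LINES):
        print(name, suffix_as_timestamp(name))
```

If the suffix really is a creation timestamp, the buckets would be created per run, which would point at bucket creation or lookup in the job rather than a long-lived bucket having been deleted. That remains a guess until someone checks the account.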

hakuna-matatah · 2025-12-29T02:28:45Z

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-12-29T02:30:00Z

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-12-29T06:16:40Z

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

hakman · 2025-12-29T06:55:09Z

@hakuna-matatah Looks like cluster validation passed with 5k nodes. I think this is mostly ready to merge.

hakuna-matatah · 2025-12-30T02:29:49Z

> Looks like cluster validation passed with 5k nodes. I think this is mostly ready to merge.

Unfortunately, not yet. It appears the Prometheus stack failed to set up, and we need to understand why. The job has also somehow been in the running state for the last 20 hours and hasn't dumped its logs yet: https://gcsweb.k8s.io/gcs/kubernetes-ci-logs/pr-logs/pull/kops/17741/presubmit-kops-aws-scale-amazonvpc-using-cl2/2005523352457842688/

W1229 07:09:21.681067   58568 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: [binary-encoded v1 Status: Failure, reason ServiceUnavailable, message: no endpoints available for service "prometheus-k8s"]
F1229 07:09:21.685136   58568 clusterloader.go:335] Error while setting up prometheus stack: timed out waiting for the condition
exit status 1
F1229 07:09:22.092900   53903 cl2.go:161] failed to run clusterloader2 tester: exit status 1

I will re-run to see whether that kills the old job and whether the Prometheus stack setup failure above is consistent. It is hard to debug without logs showing why the Prometheus stack failed to set up.

hakuna-matatah · 2025-12-30T02:30:16Z

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

hakuna-matatah · 2025-12-30T02:32:53Z

@ameukam @BenTheElder It looks like the prow job has been stuck in the running state for the last 20 hours: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kops/17741/presubmit-kops-aws-scale-amazonvpc-using-cl2/2005523352457842688

I vaguely remember this happening in the past, and there was a fix for it on the infra side. Do you happen to know if it has regressed?

hakman · 2025-12-30T06:22:36Z

> > Looks like cluster validation passed with 5k nodes. I think this is mostly ready to merge.
>
> Unfortunately, not yet. It appears the Prometheus stack failed to set up, and we need to understand why. The job has also somehow been in the running state for the last 20 hours and hasn't dumped its logs yet: https://gcsweb.k8s.io/gcs/kubernetes-ci-logs/pr-logs/pull/kops/17741/presubmit-kops-aws-scale-amazonvpc-using-cl2/2005523352457842688/
>
> I will re-run to see whether that kills the old job and whether the Prometheus stack setup failure above is consistent. It is hard to debug without logs showing why the Prometheus stack failed to set up.

I remember @upodroid investigating this and it being related to perf-test. Not sure why it's so frequent these days.

upodroid · 2025-12-30T07:41:28Z

Double-check whether the PVC of the Prometheus pod is up and running.

Also, we should be running 100-node jobs every other hour, like we do for GCE.

hakuna-matatah · 2025-12-30T12:27:31Z

> Double check if the pvc of the prometheus pod is up and running.

I think even for this we would need APIServer audit logs.

It appears that the Prometheus stack setup went fine in the last test, but the test itself failed because the API SLOs were breached.

<failure type="Failure">:0 [measurement call APIResponsivenessPrometheus - APIResponsivenessPrometheus error: top latency metric: there should be no high-latency requests, but: [got: &{Resource:events Subresource: Verb:DELETE Scope:namespace Latency:perc50: 1m0s, perc90: 1m0s, perc99: 1m0s Count:120 SlowCount:68}; expected perc99 <= 30s]] :0</failure>

Unfortunately, we don't have APIServer audit logs with the kops setup to debug where the latency is coming from.
I wonder whether requests are waiting in the AP&F queue for a long time before executing, and thus breaching the SLO, or whether etcd is the culprit.
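To make the failure above easier to read at a glance, here is a rough sketch that parses the measurement line and compares perc99 with the expected threshold. The function names and the regular expressions are assumptions written against this one message; this is not how clusterloader2 evaluates the SLO. One thing it makes visible: all three percentiles are exactly 1m0s, which may mean the requests are hitting a timeout rather than just running slowly. That reading is a guess and would need server-side logs to confirm.

```python
import re

# Relevant part of the failure message quoted above.
FAILURE = (
    "got: &{Resource:events Subresource: Verb:DELETE Scope:namespace "
    "Latency:perc50: 1m0s, perc90: 1m0s, perc99: 1m0s Count:120 SlowCount:68}; "
    "expected perc99 <= 30s"
)

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# Two-letter units come first so that "ms" is not read as "m".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
_ENTRY = re.compile(
    r"Resource:(\S*) Subresource:(\S*) Verb:(\S+) Scope:(\S+) "
    r"Latency:perc50: (\S+), perc90: (\S+), perc99: (\S+) "
    r"Count:(\d+) SlowCount:(\d+)\}"
)
_THRESHOLD = re.compile(r"expected perc99 <= ([0-9.a-zµ]+)")


def go_duration_seconds(text):
    """Convert a Go-style duration string such as '1m0s' to seconds."""
    parts = _PART.findall(text)
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def summarize(failure):
    """Extract the latency entry and report whether perc99 exceeds the threshold."""
    entry = _ENTRY.search(failure)
    limit = _THRESHOLD.search(failure)
    if entry is None or limit is None:
        raise ValueError("failure message not in the expected shape")
    resource, _sub, verb, scope, p50, p90, p99, count, slow = entry.groups()
    p99_s = go_duration_seconds(p99)
    limit_s = go_duration_seconds(limit.group(1))
    return {
        "call": f"{verb} {resource} ({scope})",
        "perc50_s": go_duration_seconds(p50),
        "perc90_s": go_duration_seconds(p90),
        "perc99_s": p99_s,
        "threshold_s": limit_s,
        "breached": p99_s > limit_s,
        "slow_ratio": int(slow) / int(count),
    }


if __name__ == "__main__":
    for key, value in summarize(FAILURE).items():
        print(f"{key}: {value}")
```

With the message above, more than half of the 120 sampled `DELETE events` calls count as slow, so this is not a single outlier.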

k8s-ci-robot · 2026-01-27T08:03:41Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-02-05T06:40:11Z

@hakuna-matatah: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
presubmit-kops-gce-small-scale-ipalias-using-cl2	`e91dee9`	link	true	`/test presubmit-kops-gce-small-scale-ipalias-using-cl2`
pull-kops-kubernetes-e2e-ubuntu-gce-build	`e91dee9`	link	false	`/test pull-kops-kubernetes-e2e-ubuntu-gce-build`
presubmit-kops-aws-scale-amazonvpc-using-cl2	`e91dee9`	link	false	`/test presubmit-kops-aws-scale-amazonvpc-using-cl2`
pull-kops-gce-master-scale-performance-100	`e91dee9`	link	true	`/test pull-kops-gce-master-scale-performance-100`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

BenTheElder · 2026-02-05T18:37:33Z

> @ameukam @BenTheElder It looks like the prow job has been stuck in the running state for the last 20 hours: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kops/17741/presubmit-kops-aws-scale-amazonvpc-using-cl2/2005523352457842688
>
> I vaguely remember this happening in the past, and there was a fix for it on the infra side. Do you happen to know if it has regressed?

I was out for the holidays and this got buried. This usually means the job was OOMKilled; see kubernetes-sigs/prow#210.

upodroid · 2026-02-05T20:54:40Z

Yeah, we haven't applied the single_process_oom_kill kubelet config in AWS yet.

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 4, 2025

k8s-ci-robot requested review from dims and hakman November 4, 2025 21:38

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 4, 2025

hakuna-matatah force-pushed the iops branch from 3525f28 to 5e49033 Compare November 4, 2025 22:33

hakuna-matatah force-pushed the iops branch from 5e49033 to 294ada9 Compare November 5, 2025 20:31

Increase etcd iops

1d783f9

hakuna-matatah force-pushed the iops branch from 294ada9 to 545d9d9 Compare December 19, 2025 15:23

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 19, 2025

update master instance type

e91dee9

hakuna-matatah force-pushed the iops branch from 545d9d9 to e91dee9 Compare December 19, 2025 15:38

hakman changed the title ~~[WIP] Increase etcd iops~~ [WIP] test: Increase etcd IOPS for AWS scale jobs Dec 29, 2025

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 27, 2026

5 participants