/cc @mattcary
These are old and appear to be misconfigured for KMS, which is causing some PR jobs to fail sporadically.
We have way more headroom than we need, so drop the following:
spend breakdown: https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e1
FWIW I can't access this
What exactly do we see as outstanding here?
I agree with capping this off as the first pass. I think we'll want to revisit how we track our budget in the new year, and that should probably be a separate issue.
Things you might want to consider before capping this off:
I'll leave it to @ameukam or others to close if you're fine with this as-is.
Merge pull request #112989 from ameukam/bump-golang.org/x/text-to-v0.3.8
Bump golang.org/x/text to v0.3.8
Merge pull request #112997 from liggitt/dep-approver
Add liggitt to dep-approvers alias
Merge pull request #112978 from logicalhan/kcm-fg
add 'metrics/slis' to kcm health checks
Merge pull request #112643 from SergeyKanzhelev/removeDynamicKubeletConfig
remove DynamicKubeletConfig feature gate from the code
rewrite signature of function StartEventWatcher
function changes
changes in non-test files
Merge pull request #112944 from kishen-v/fix_test_failures_go_1_20
Switch to assert.ErrorEquals from assert.Equal to check error equality
changes in test files
Bumped image version, upgraded to buster, and bumped QEMUVERSION to v7.1.0-2 #109295
add support for parsing gauge func
Change-Id: Id0b9cd51dead5ee9f4adac804d62f5d9742320a7
fix parsing error on labels
Change-Id: I990967b93b10dbfa9a564ca4286ffbd051c69697
parse time signatures for maxAge
Change-Id: I91e330d82c4ebbfa38bc52889beb64e6689bfb77
Merge pull request #112785 from MartinForReal/master
CloudProvider: service update event should be triggered when appProtocol in port is changed
cleanup printlns
Change-Id: I49a48446029ba2e66b09f138a1477b837d55766a
Adding ndixita@ as KubeletCredentialProviders feature owner, and capitalizing GA
kubelet: fix nil crash in allocateRemainingFrom
unparameterize 'webhook' from conversion metrics since it's the only one
Change-Id: I6dda5c033786f128e9b2d5d889e47f3dc7937ed5
Merge pull request #113014 from logicalhan/stability-v2
add support for parsing gauge func
add metrics/slis to kube-scheduler health checks
/ok-to-test
I think this is a reasonable thing to re-examine, but I'm not sure it's wise to introduce this late in the release cycle
/close
Yeah just came here to say it should be up and running now, ref: https://github.com/kubernetes/test-infra/pull/27831#issuecomment-1310907343
I'm not happy with the amount of time triage is taking to update clusters, which the stack traces may be exacerbating, but I don't have time to look into it
Per https://github.com/kubernetes/test-infra/issues/27869#issuecomment-1310902224 expect triage to fall up to 6h out of date
kettle's back up and running so go.k8s.io/triage now mostly reflects the results of this (triage looks at the past 14d of data, the most recent 10d of which use the new format), e.g. https://storage.googleapis.com/k8s-triage/index.html#6da66348a32f994adcb7
Nov 5 01:01:31.666: error dialing backend: dial timeout, backstop
test/e2e/storage/utils/local.go:259
k8s.io/kubernetes/test/e2e/storage/utils.(*ltrMgr).setupLocalVolumeDirectoryLinkBindMounted(0xc00143d590, 0xc001f9fac0, 0x65bd500?)
test/e2e/storage/utils/local.go:259 +0x19a
k8s.io/kubernetes/test/e2e/storage/utils.(*ltrMgr).Create(0x0?, 0xc001f9fac0, {0x74bee29, 0x14}, 0x0)
test/e2e/storage/utils/local.go:318 +0x1b4
k8s.io/kubernetes/test/e2e/storage/drivers.(*localDriver).CreateVolume(0xc00147ee00, 0xc0025a3380, {0x74a389f, 0x10})
test/e2e/storage/drivers/in_tree.go:1687 +0xd8
k8s.io/kubernetes/test/e2e/storage/framework.CreateVolume({0x7e6a418, 0xc00147ee00}, 0xc001fb6580?, {0x74a389f, 0x10})
test/e2e/storage/framework/driver_operations.go:43 +0xd2
k8s.io/kubernetes/test/e2e/storage/framework.CreateVolumeResourceWithAccessModes({0x7e6a418, 0xc00147ee00}, 0xc0025a3380, {{0x750bef7, 0x1f}, {0x0, 0x0}, {0x74a389f, 0x10}, {0x0, ...}, ...}, ...)
test/e2e/storage/framework/volume_resource.go:70 +0x225
k8s.io/kubernetes/test/e2e/storage/framework.CreateVolumeResource({0x7e6a418, 0xc00147ee00}, 0x0?, {{0x750bef7, 0x1f}, {0x0, 0x0}, {0x74a389f, 0x10}, {0x0, ...}, ...}, ...)
test/e2e/storage/framework/volume_resource.go:56 +0x110
k8s.io/kubernetes/test/e2e/storage/testsuites.(*subPathTestSuite).DefineTests.func1()
test/e2e/storage/testsuites/subpath.go:129 +0x26e
k8s.io/kubernetes/test/e2e/storage/testsuites.(*subPathTestSuite).DefineTests.func4()
test/e2e/storage/testsuites/subpath.go:206 +0x4d
/close k8s-gubernator tables are no longer stale, metrics-kettle check is happy: https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=20
I had hoped clearing the db would give us some breathing room, but things are still pretty brittle. tl;dr kettle is refreshing k8s-gubernator:build
tables every ~2h, and triage is updating clusters every ~3h, so https://go.k8s.io/triage may be up to 6h stale
I don't have time to look further, so hopefully ~6h will do.
In case anyone is interested:
File "stream.py", line 351, in <module>
stop=StopWhen(OPTIONS.stop_at))
File "stream.py", line 229, in main
emitted = insert_data(bq_client, table, make_json.make_rows(db, builds))
File "stream.py", line 148, in insert_data
errors = retry(bq_client.insert_rows, table, chunk, skip_invalid_rows=True)
File "stream.py", line 106, in retry
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/google/cloud/bigquery/client.py", line 2995, in insert_rows
return self.insert_rows_json(table, json_rows, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/google/cloud/bigquery/client.py", line 3142, in insert_rows_json
timeout=timeout,
File "/usr/local/lib/python3.6/dist-packages/google/cloud/bigquery/client.py", line 640, in _call_api
return call()
File "/usr/local/lib/python3.6/dist-packages/google/api_core/retry.py", line 291, in retry_wrapped_func
on_error=on_error,
File "/usr/local/lib/python3.6/dist-packages/google/api_core/retry.py", line 189, in retry_target
return target()
File "/usr/local/lib/python3.6/dist-packages/google/cloud/_http.py", line 484, in api_request
raise exceptions.from_http_response(response)
google.api_core.exceptions.GoogleAPICallError: 413 POST https://bigquery.googleapis.com/bigquery/v2/projects/k8s-gubernator/datasets/build/tables/day/insertAll?prettyPrint=false: <!DOCTYPE html>
I1110 18:03:46.639728 125 cluster.go:323] Finished clustering 6476 unique tests (2489910 failures) into 13107 clusters in 2h34m36.987214382s
vs.
I1030 08:09:24.406476 126 cluster.go:323] Finished clustering 4837 unique tests (744844 failures) into 5941 clusters in 54m13.089545513s
I manually ran updates for the daily and weekly table just to make sure things ran smoothly. You can see this reflected:
I'm going to use the regular kettle deployment to handle the other table (which triage sources from). First I wanted to make sure I landed some slightly better logging:
gcr.io/k8s-testimages/kettle:v20221110-bba2146583
kettle: fix make_db lint errors
Kettle uses an indexed auto-incrementing rowid to keep track of builds it has scraped from GCS. We can't just reuse the old build_emitted
tables, since they're populated with rowids that don't correspond to the newly generated rowids.
So, take a best guess that relies on the fact that gcs_paths follow a similar pattern and end with a monotonically increasing (but not sequential) number (call it a prow id, since "build id" is already taken in this context), e.g.
sqlite> select rowid,gcs_path from build where rowid in (select build_id from build_emitted_1 order by build_id desc limit 2) order by rowid desc;
28694161|gs://kubernetes-jenkins/pr-logs/pull/112553/pull-kubernetes-dependencies/1586686521493164032
28694160|gs://kubernetes-jenkins/logs/e2e-kops-aws-cni-cilium-ipv6/1586680051955404800
We'll assume that everything with a prow id of less than 158667.*
has already been sent. So, create and populate new build_emitted
tables with the new corresponding rowids.
#!/bin/bash
wildcards=(
158667
158666
158665
158664
158663
158662
158661
158660
15865
15864
15863
15862
15861
15860
1585
1584
1583
1582
1581
1580
157
156
155
# earliest prow id I saw started with 155, but just to be safe
154
)
paths=(
gs://kubernetes-jenkins/pr-logs/pull/%/%
gs://kubernetes-jenkins/logs/%
)
sqlite3 build.db "create table if not exists build_emitted(build_id integer primary key, gen);"
# Backfill the main emitted table from builds whose gcs_path matches a known prow-id prefix.
for path in "${paths[@]}"; do
  for wildcard in "${wildcards[@]}"; do
    insert_stmt="insert into build_emitted select rowid as build_id, 0 from build where gcs_path like '${path}/${wildcard}%' order by build_id"
    sqlite3 build.db "${insert_stmt}"
  done
done
# Seed the per-window tables (1/7/30 days) from the main table.
for days in 1 7 30; do
  sqlite3 build.db "create table if not exists build_emitted_${days}(build_id integer primary key, gen);"
  sqlite3 build.db "insert into build_emitted_${days} select * from build_emitted"
done
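Before redeploying, a quick sanity check I'd run (a minimal sketch; table names as created above) to confirm the backfilled tables line up:
for table in build_emitted build_emitted_1 build_emitted_7 build_emitted_30; do
  echo -n "${table}: "
  sqlite3 build.db "select count(*) from ${table};"
done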
Related:
I've been running a custom deployment using an image built from these changes to babysit regenerating kettle's db. I'd like to merge them in before I redeploy kettle
feat(prow/config): add support for jenkins jobs in folder
capz: run AKS test if test configuration is changed
fix(prow/config): fix variable naming in deduplication code
Merge branch 'kubernetes:master' into feature/jenkins-job-in-folder
update merge dashboard to match tests
ci: separate pull-containerd-node-e2e for 1.5 branch
containerd v1.5.x supports CRI v1alpha2, the API that was available at the time containerd v1.5 was released. containerd v1.6.x supports both CRI v1alpha2 and v1, and is being designated a long-term support release.
kubelet master is removing support for CRI v1alpha2; this forces kubernetes master (and kubernetes r.next+) users to move up to containerd v1.6.x, where both CRI v1 and v1alpha2 are supported.
Therefore we need to separate out the pull-containerd-node-e2e job for the containerd 1.5 branch, so that patches can still be made to the 1.5 branch until its EOL. Instead of running against kubernetes master, it will run against the k8s release-1.25 branch (the last release that supports CRI v1alpha2).
Ref: https://github.com/kubernetes/kubernetes/pull/110618
Signed-off-by: Akhil Mohan makhil@vmware.com
Update golang to 1.19
feat(prow/jenkins): use full project name in the Job field of prowapi.ProwJobSpec
When the input is folder/abc-job, the Jenkins job name in the API path will be folder/job/abc-job
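In other words, the API path is built by inserting job/ between folder segments; a rough shell illustration of the mapping (not the actual prow code):
# "folder/abc-job" -> "folder/job/abc-job"
echo "folder/abc-job" | sed 's|/|/job/|g'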
k8s-infra: move pull-oci-proxy-build to k8s-infra
Signed-off-by: Arnaud Meukam ameukam@gmail.com
Re-enable ci-cri-containerd-e2e-cos-gce-alpha-features with Alpha feature tests
Add detail to assign plugin error message
Merge pull request #27925 from ameukam/oci-proxy-buid-k8s-infra
k8s-infra: move pull-oci-proxy-build to k8s-infra
Update OpenShift testgrid definitions by auto-testgrid-generator job at Wed, 09 Nov 2022 00:02:36 UTC
Updating image repo lists used for Windows tests for 1.24 clusters and below
Signed-off-by: Mark Rossetti marosset@microsoft.com
capz: separate self-managed, AKS jobs
Create CI & staging jobs for porche
Executing specific scripts in the repo.
Merge pull request #27937 from marosset/fix-windows-e2e-image-repo-list
Updating image repo lists used for Windows tests for 1.24 clusters and below
Pin OS version for e2e-gce-device-plugin job
Explicitly pass the OS version for each e2e-gce-device-plugin job. Pin the job to COS-97 for the latest master version and pin the old OS version (COS-85) for existing release branches.
Signed-off-by: David Porter porterdavid@google.com
use systemd cgroup driver
update kubelet flags to use systemd cgroup driver since cgroupv2 is enabled in cos-101
Signed-off-by: Akhil Mohan makhil@vmware.com
update job name to match required format
job names should match the regex "^[a-z0-9]([-a-z0-9]*[a-z0-9])?$"
Signed-off-by: Akhil Mohan makhil@vmware.com
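Assuming the required format is the RFC 1123 label pattern quoted above (an assumption on my part), a quick way to check a candidate name, illustrative only:
re='^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'
name="pull-containerd-node-e2e-1-5"
[[ "${name}" =~ ${re} ]] && echo "ok" || echo "invalid"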
Database is still rebuilding. It's been getting repeatedly OOMkilled while loading in junit files from GCS, then rescraping GCS after each restart. Down from 660K pending to 200K pending
I1109 10:52:09.068] 3994/198350 gs://kubernetes-jenkins/logs/ci-kubernetes-unit/1589858362311315456 1 6331377
kettle: fix skip-gcs flag
kettle: add --skip-gcs to make_db
https://prow.k8s.io/tide-history?repo=kubernetes%2Fk8s.io looks like no, tide didn't think these were batch mergeable