spiffxp
183 repos · 258 followers · 14 following

Events

spiffxp delete branch owner_update
Created at 2 weeks ago
delete branch
spiffxp delete branch prune-gce-projects
Created at 1 month ago
issue comment
boskos: prune misconfigured gce-projects

/cc @mattcary

Created at 2 months ago
create branch
spiffxp create branch prune-gce-projects
Created at 2 months ago
pull request opened
boskos: prune misconfigured gce-projects

These are old and appear to be misconfigured for proper use of KMS, which is causing some PR jobs to fail sporadically.

We have way more headroom than we need, so drop the following:

  • gce-up-c1-3-glat-up-clu-n
  • gce-up-c1-4-glat-up-clu-n
  • gce-up-c1-4-glat-up-clu
  • gce-up-c1-4-glat-up-mas
  • gce-up-g1-3-clat-up-clu-n
  • gce-up-g1-4-clat-up-clu
  • gce-up-g1-4-glat-up-clu-n
  • gce-up-g1-4-glat-up-mas
  • k8s-gce-soak-1-5
  • k8s-gci-gce-soak-1-5
  • k8s-jkns-clusterloader
  • k8s-jkns-gce-soak
  • k8s-jkns-gci-autoscaling-migs
  • k8s-jkns-gci-autoscaling
Created at 2 months ago
issue comment
Setup a budget and budget alerts

spend breakdown: https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e1

FWIW I can't access this

What exactly do we see as outstanding here?

I agree with capping this off as the first pass. I think we'll want to revisit how we track our budget in the new year, and that should probably be a separate issue.

Things you might want to consider before capping this off:

  • The current budget alerts fire at 90% and 100% of $250K/mo ($3M/yr). Since we're running over that rate, the alerts are going to be noise for anyone watching "are we out of credits for the year". Disable it and set up a new budget that tracks our remaining spend for the year? (rough sketch after this list)
  • The alerts currently get sent out to k8s-infra leads; consider adding a wider audience?
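
A rough sketch of the first bullet's idea (all figures hypothetical except the $250K/mo / $3M/yr rate from this thread; not an actual GCP budget configuration):

ANNUAL_CREDITS = 3_000_000   # $3M/yr, i.e. $250K/mo
spent_ytd = 2_750_000        # hypothetical year-to-date spend
for threshold in (0.9, 1.0):
    if spent_ytd >= threshold * ANNUAL_CREDITS:
        print(f"alert: crossed {threshold:.0%} of annual credits")
print(f"remaining credits for the year: ${ANNUAL_CREDITS - spent_ytd:,}")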

I'll leave it to @ameukam or others to close if you're fine with this as-is.

Created at 2 months ago
delete branch
spiffxp delete branch wip-expand-proxy-e2e-verb-coverage
Created at 2 months ago
delete branch
spiffxp delete branch wip-e2e-coverage-fix
Created at 2 months ago

Merge pull request #112989 from ameukam/bump-golang.org/x/text-to-v0.3.8

Bump golang.org/x/text to v0.3.8

Merge pull request #112997 from liggitt/dep-approver

Add liggitt to dep-approvers alias

Merge pull request #112978 from logicalhan/kcm-fg

add 'metrics/slis' to kcm health checks

Merge pull request #112643 from SergeyKanzhelev/removeDynamicKubeletConfig

remove DynamicKubeletConfig feature gate from the code

rewrite signature of function StartEventWatcher

function changes

changes in non-test files

Merge pull request #112944 from kishen-v/fix_test_failures_go_1_20

Switch to assert.ErrorEquals from assert.Equal to check error equality

changes in test files

bumped image version, upgraded to buster, and bumped QEMUVERSION to v7.1.0-2 #109295

add support for parsing gauge func

Change-Id: Id0b9cd51dead5ee9f4adac804d62f5d9742320a7

fix parsing error on labels

Change-Id: I990967b93b10dbfa9a564ca4286ffbd051c69697

parse time signatures for maxAge

Change-Id: I91e330d82c4ebbfa38bc52889beb64e6689bfb77

Merge pull request #112785 from MartinForReal/master

CloudProvider: service update event should be triggered when appProtocol in port is changed

cleanup printlns

Change-Id: I49a48446029ba2e66b09f138a1477b837d55766a

Adding ndixita@ to KubeletCredentialProviders feature owner, and capitalizing GA

kubelet: fix nil crash in allocateRemainingFrom

unparameterize 'webhook' from conversion metrics since it's the only one

Change-Id: I6dda5c033786f128e9b2d5d889e47f3dc7937ed5

Merge pull request #113014 from logicalhan/stability-v2

add support for parsing gauge func

add metrics/slis to kube-scheduler health checks

Created at 2 months ago
issue comment
tests: network: Prefer internal IPs first

/ok-to-test

I think this is a reasonable thing to re-examine, but I'm not sure it's wise to introduce this late in the release cycle

Created at 2 months ago
issue comment
go.k8s.io/triage is clustering stack traces for kubernetes/kubernetes

/close

Created at 2 months ago
issue comment
go.k8s.io/triage is clustering stack traces for kubernetes/kubernetes

Yeah just came here to say it should be up and running now, ref: https://github.com/kubernetes/test-infra/pull/27831#issuecomment-1310907343

I'm not happy with the amount of time triage is taking to update clusters, which the stack traces may be exacerbating, but I don't have time to look into it.

Per https://github.com/kubernetes/test-infra/issues/27869#issuecomment-1310902224, expect triage to fall up to 6h out of date

Created at 2 months ago
issue comment
kettle: combine failure message and backtrace

kettle's back up and running, so go.k8s.io/triage now mostly reflects the results of this (triage looks at the past 14d of data, the most recent 10d of which use the new format), e.g. https://storage.googleapis.com/k8s-triage/index.html#6da66348a32f994adcb7

Nov  5 01:01:31.666: error dialing backend: dial timeout, backstop
test/e2e/storage/utils/local.go:259
k8s.io/kubernetes/test/e2e/storage/utils.(*ltrMgr).setupLocalVolumeDirectoryLinkBindMounted(0xc00143d590, 0xc001f9fac0, 0x65bd500?)
	test/e2e/storage/utils/local.go:259 +0x19a
k8s.io/kubernetes/test/e2e/storage/utils.(*ltrMgr).Create(0x0?, 0xc001f9fac0, {0x74bee29, 0x14}, 0x0)
	test/e2e/storage/utils/local.go:318 +0x1b4
k8s.io/kubernetes/test/e2e/storage/drivers.(*localDriver).CreateVolume(0xc00147ee00, 0xc0025a3380, {0x74a389f, 0x10})
	test/e2e/storage/drivers/in_tree.go:1687 +0xd8
k8s.io/kubernetes/test/e2e/storage/framework.CreateVolume({0x7e6a418, 0xc00147ee00}, 0xc001fb6580?, {0x74a389f, 0x10})
	test/e2e/storage/framework/driver_operations.go:43 +0xd2
k8s.io/kubernetes/test/e2e/storage/framework.CreateVolumeResourceWithAccessModes({0x7e6a418, 0xc00147ee00}, 0xc0025a3380, {{0x750bef7, 0x1f}, {0x0, 0x0}, {0x74a389f, 0x10}, {0x0, ...}, ...}, ...)
	test/e2e/storage/framework/volume_resource.go:70 +0x225
k8s.io/kubernetes/test/e2e/storage/framework.CreateVolumeResource({0x7e6a418, 0xc00147ee00}, 0x0?, {{0x750bef7, 0x1f}, {0x0, 0x0}, {0x74a389f, 0x10}, {0x0, ...}, ...}, ...)
	test/e2e/storage/framework/volume_resource.go:56 +0x110
k8s.io/kubernetes/test/e2e/storage/testsuites.(*subPathTestSuite).DefineTests.func1()
	test/e2e/storage/testsuites/subpath.go:129 +0x26e
k8s.io/kubernetes/test/e2e/storage/testsuites.(*subPathTestSuite).DefineTests.func4()
	test/e2e/storage/testsuites/subpath.go:206 +0x4d
Created at 2 months ago
issue comment
k8s-gubernator:build tables are stale

/close

k8s-gubernator tables are no longer stale; the metrics-kettle check is happy: https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=20

Created at 2 months ago
issue comment
k8s-gubernator:build tables are stale

I had hoped clearing the db would give us some breathing room, but things are still pretty brittle. tl;dr kettle is refreshing k8s-gubernator:build tables every ~2h, and triage is updating clusters every ~3h, so https://go.k8s.io/triage may be up to 6h stale

I don't have time to look further, so hopefully ~6h will do.

In case anyone is interested:

  • kettle is supposed to stream results from gcs after a full refresh (which currently takes ~2h), but it's crashing, so it just goes back to another full refresh (a mitigation sketch follows this list):
File "stream.py", line 351, in <module>
  stop=StopWhen(OPTIONS.stop_at))
File "stream.py", line 229, in main
  emitted = insert_data(bq_client, table, make_json.make_rows(db, builds))
File "stream.py", line 148, in insert_data
  errors = retry(bq_client.insert_rows, table, chunk, skip_invalid_rows=True)
File "stream.py", line 106, in retry
  return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/google/cloud/bigquery/client.py", line 2995, in insert_rows
  return self.insert_rows_json(table, json_rows, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/google/cloud/bigquery/client.py", line 3142, in insert_rows_json
  timeout=timeout,
File "/usr/local/lib/python3.6/dist-packages/google/cloud/bigquery/client.py", line 640, in _call_api
  return call()
File "/usr/local/lib/python3.6/dist-packages/google/api_core/retry.py", line 291, in retry_wrapped_func
  on_error=on_error,
File "/usr/local/lib/python3.6/dist-packages/google/api_core/retry.py", line 189, in retry_target
  return target()
File "/usr/local/lib/python3.6/dist-packages/google/cloud/_http.py", line 484, in api_request
  raise exceptions.from_http_response(response)
google.api_core.exceptions.GoogleAPICallError: 413 POST https://bigquery.googleapis.com/bigquery/v2/projects/k8s-gubernator/datasets/build/tables/day/insertAll?prettyPrint=false: <!DOCTYPE html>
  • triage's clustering runs are also taking much longer than before:
I1110 18:03:46.639728     125 cluster.go:323] Finished clustering 6476 unique tests (2489910 failures) into 13107 clusters in 2h34m36.987214382s
vs.
I1030 08:09:24.406476     126 cluster.go:323] Finished clustering 4837 unique tests (744844 failures) into 5941 clusters in 54m13.089545513s
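
The 413 in that traceback means a single insertAll payload exceeded BigQuery's request size limit. A minimal mitigation sketch (not kettle's actual code; insert_rows_chunked is a hypothetical helper) that shrinks the batch whenever the API rejects a payload as too large:

from google.api_core import exceptions

def insert_rows_chunked(bq_client, table, rows, start_chunk=500):
    """Insert rows via insert_rows_json, halving the batch on HTTP 413."""
    i, chunk = 0, start_chunk
    while i < len(rows):
        batch = rows[i:i + chunk]
        try:
            errors = bq_client.insert_rows_json(table, batch, skip_invalid_rows=True)
            if errors:
                raise RuntimeError(f"insert errors: {errors}")
            i += len(batch)
            chunk = start_chunk  # reset after a successful insert
        except exceptions.GoogleAPICallError as e:
            if e.code == 413 and chunk > 1:
                chunk //= 2  # payload too large: retry this slice in halves
            else:
                raise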
Created at 2 months ago
issue comment
k8s-gubernator:build tables are stale

I manually ran updates for the daily and weekly tables just to make sure things ran smoothly. You can see this reflected:

  • in the latest kettle-monitor job: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/metrics-kettle/1590495158367948800
  • with actual results in: http://storage.googleapis.com/k8s-metrics/flakes-latest.json

I'm going to use the regular kettle deployment to handle the other table (which triage sources from). First I wanted to make sure I landed some slightly better logging:

  • https://github.com/kubernetes/test-infra/pull/27955
  • https://storage.googleapis.com/kubernetes-jenkins/logs/post-test-infra-push-kettle/1590494682872287233/build-log.txt
  • gcr.io/k8s-testimages/kettle:v20221110-bba2146583
Created at 2 months ago
delete branch
spiffxp delete branch kettle-repair
Created at 2 months ago

kettle: fix make_db lint errors

Created at 2 months ago
issue comment
k8s-gubernator:build tables are stale

Kettle uses an indexed auto-incrementing rowid to keep track of builds it has scraped from gcs. We can't just use the old build_emitted tables since they'll be populated with rowids that don't correspond to the newly generated rowids.

So, take a best guess that relies on the fact that gcs_paths follow a similar pattern and have a monotonically increasing (but not sequential) number (call it a prow id since "build id" is already taken in this context) at the end, e.g.

sqlite> select rowid,gcs_path from build where rowid in (select build_id from build_emitted_1 order by build_id desc limit 2) order by rowid desc;
28694161|gs://kubernetes-jenkins/pr-logs/pull/112553/pull-kubernetes-dependencies/1586686521493164032
28694160|gs://kubernetes-jenkins/logs/e2e-kops-aws-cni-cilium-ipv6/1586680051955404800

We'll assume that everything with a prow id of 158667.* or lower has already been sent. So, create and populate new build_emitted tables with the corresponding new rowids.

#!/bin/bash
wildcards=(
   158667 
   158666
   158665 
   158664 
   158663 
   158662
   158661
   158660
   15865
   15864
   15863
   15862
   15861
   15860
   1585
   1584
   1583
   1582
   1581
   1580
   157
   156
   155
   # earliest prow id I saw started with 155, but just to be safe
   154
)
paths=(
    gs://kubernetes-jenkins/pr-logs/pull/%/%
    gs://kubernetes-jenkins/logs/%
)
sqlite3 build.db "create table if not exists build_emitted(build_id integer primary key, gen);"
for path in "${paths[@]}"; do
    for wildcard in "${wildcards[@]}"; do
        insert_stmt="insert into build_emitted select rowid as build_id, 0 from build where gcs_path like '${path}/${wildcard}%' order by build_id"
        sqlite3 build.db "${insert_stmt}"
    done
done
for days in 1 7 30; do
    sqlite3 build.db "create table if not exists build_emitted_${days}(build_id integer primary key, gen);"
    sqlite3 build.db "insert into build_emitted_${days} select * from build_emitted"
done
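
As a quick sanity check afterwards (a sketch, assuming the build.db layout above), the stdlib sqlite3 module can confirm the tables were populated:

import sqlite3

conn = sqlite3.connect("build.db")
for table in ["build_emitted"] + [f"build_emitted_{d}" for d in (1, 7, 30)]:
    (count,) = conn.execute(f"select count(*) from {table}").fetchone()
    print(f"{table}: {count} builds marked as already emitted")
conn.close()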
Created at 2 months ago
pull request opened
Kettle repair

Related:

  • Part of: https://github.com/kubernetes/test-infra/issues/27869

I've been running a custom deployment using an image built from these changes to babysit regenerating kettle's db. I'd like to merge them in before I redeploy kettle.

Created at 2 months ago

feat(prow/config): add support for jenkins jobs in folder

capz: run AKS test if test configuration is changed

fix(prow/config): fix variable naming in deduplication code

Merge branch 'kubernetes:master' into feature/jenkins-job-in-folder

update merge dashboard to match tests

ci: separate pull-containerd-node-e2e for 1.5 branch

containerd v1.5.x supports CRI v1alpha2, the API version available at the time of containerd v1.5's release. containerd v1.6.x supports both CRI v1alpha2 and v1, and is being designated a long-term support release.

kubelet master is removing support for CRI v1alpha2; this has the effect of forcing kubernetes master (and kubernetes r.next+) users to move up to containerd v1.6.x, where both CRI v1 and v1alpha2 are supported.

Therefore we need to separate out the pull-containerd-node-e2e job for the containerd 1.5 branch, so that patches can still be made to the 1.5 branch until its EOL. Instead of running against kubernetes master, it will run against the k8s release-1.25 branch (the last release that supports CRI v1alpha2); see the sketch after this message.

Ref: https://github.com/kubernetes/kubernetes/pull/110618

Signed-off-by: Akhil Mohan makhil@vmware.com
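
The compatibility constraints described above, as a small sketch (branch names as used in the message; the support sets come from the message itself, not an authoritative matrix):

CRI_SUPPORT = {
    "containerd-1.5": {"v1alpha2"},
    "containerd-1.6": {"v1alpha2", "v1"},   # long-term support release
}
KUBE_CRI = {
    "release-1.25": {"v1alpha2", "v1"},     # last release supporting v1alpha2
    "master": {"v1"},                       # v1alpha2 support removed
}
for cd, k8s in (("containerd-1.5", "release-1.25"), ("containerd-1.6", "master")):
    shared = CRI_SUPPORT[cd] & KUBE_CRI[k8s]
    print(f"{cd} against {k8s}: shared CRI versions {shared or 'NONE'}")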

Update golang to 1.19

feat(prow/jenkins): use full project name in the Job field of prowapi.ProwJobSpec

When the input is folder/abc-job, the Jenkins job name in the API path will be folder/job/abc-job

k8s-infra: move pull-oci-proxy-build to k8s-infra

Signed-off-by: Arnaud Meukam ameukam@gmail.com

Re-enable ci-cri-containerd-e2e-cos-gce-alpha-features with Alpha feature tests

Add detail to assign plugin error message

Merge pull request #27925 from ameukam/oci-proxy-buid-k8s-infra

k8s-infra: move pull-oci-proxy-build to k8s-infra

Update OpenShift testgrid definitions by auto-testgrid-generator job at Wed, 09 Nov 2022 00:02:36 UTC

Updating image repo lists used for Windows tests for 1.24 clusters and below

Signed-off-by: Mark Rossetti marosset@microsoft.com

capz: separate self-managed, AKS jobs

Create CI & staging jobs for porche

Executing specific scripts in the repo.

Merge pull request #27937 from marosset/fix-windows-e2e-image-repo-list

Updating image repo lists used for Windows tests for 1.24 clusters and below

Pin OS version for e2e-gce-device-plugin job

Explicitly pass the OS version for each e2e-gce-device-plugin job. Pin the job to COS-97 for the latest master version, and pin the old OS version (COS-85) for existing release branches.

Signed-off-by: David Porter porterdavid@google.com

use systemd cgroup driver

update kubelet flags to use the systemd cgroup driver, since cgroupv2 is enabled in cos-101

Signed-off-by: Akhil Mohan makhil@vmware.com

update job name to match required format

job names should match the regex "^[a-z0-9]([-a-z0-9]*[a-z0-9])?$"

Signed-off-by: Akhil Mohan makhil@vmware.com
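
For reference, a quick check of names against that pattern (a sketch; the pattern is assumed to be the standard RFC 1123 DNS-label rule that Kubernetes-style names follow):

import re

# Assumption: job names must be RFC 1123 DNS labels: lowercase alphanumerics
# and dashes, starting and ending with an alphanumeric.
JOB_NAME_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

for name in ("pull-containerd-node-e2e-1-5", "Pull_Containerd", "e2e-"):
    print(name, "ok" if JOB_NAME_RE.match(name) else "invalid")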

Created at 2 months ago
issue comment
k8s-gubernator:build tables are stale

Database is still rebuilding. It's been getting repeatedly OOMkilled while loading junit files from GCS, then rescraping GCS after each restart. Down from 660K pending to 200K pending.

I1109 10:52:09.068] 3994/198350 gs://kubernetes-jenkins/logs/ci-kubernetes-unit/1589858362311315456 1 6331377
Created at 2 months ago

kettle: fix skip-gcs flag

Created at 2 months ago

kettle: add --skip-gcs to make_db

Created at 2 months ago
issue comment
releng: Image promotion for kubernetes v1.23.14 / v1.23.15-rc.0

Per https://prow.k8s.io/tide-history?repo=kubernetes%2Fk8s.io, it looks like no: tide didn't think these were batch mergeable

Created at 2 months ago