jasonliu747
Repos
19
Followers
59
Following
26

👨‍💻 ❤️ 💻 Reference for undergraduate programming assignments at the School of Software, Shanghai Jiao Tong University

650
111

QoS based scheduling system for hybrid orchestration workloads on Kubernetes, bringing workloads the best layout and status.

638
132

Training operators on Kubernetes.

1176
486

A Cloud Native Batch System (Project under CNCF)

2728
626

A high performance and generic framework for distributed DNN training

3303
457

Events

[proposal] Improve the readability of the PreFilter error message of the DeviceShare plugin

Totally agree with you. Will improve readability ASAP. /assign

Created at 17 hours ago
Created at 5 days ago
koordlet: PSI collector and use Prometheus to record interference metrics

Hi @songtao98, nice work! Please rebase onto the latest code and resolve the conflicts. Thanks.

Created at 1 week ago
[proposal] support network qos

TODO, more description by @jasonliu747

What is your proposal:

Why is this needed:

Is there a suggested solution, if so, please add it:

Created at 1 week ago
[proposal] support network qos

/close duplicate issue

Created at 1 week ago
koord-scheduler: fix elasticQuota ut fail

@xulinfei1996 Could you upload the logs from the failing test?

Created at 1 week ago
Error on koordlet after installed

What happened: I followed the steps here to install koordinator with helm: https://koordinator.sh/docs/installation/

But the koordlet pod stays in the Error state and keeps restarting.

What you expected to happen: koordlet pod is running

Environment:

  • Koordinator version: - v1.0.0
  • Kubernetes version (use kubectl version): v1.22.16 (also tried in v1.25.4)
  • docker/containerd version: docker 20.10.21
  • OS (e.g: cat /etc/os-release): Ubuntu 22.04.1 LTS
  • Kernel (e.g. uname -a): Linux worker-00 5.15.0-52-generic # 58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Anything else we need to know: Here is the log of the koordlet pod. I can provide more detail if needed.

I1114 00:03:16.733301 1751256 feature_gate.go:245] feature gates: &{map[Accelerators:true AllAlpha:true AuditEvents:true AuditEventsHTTPHandler:true BECPUEvict:true BECPUSuppress:true BECgroupReconcile:true BEMemoryEvict:true CPUBurst:true CgroupReconcile:true NodeTopologyReport:true PerformanceCollector:false RdtResctrl:true]}
I1114 00:03:16.733359 1751256 feature_gate.go:245] feature gates: &{map[AllAlpha:true CPUSetAllocator:true GPUEnvInject:true GroupIdentity:true]}
I1114 00:03:16.733509 1751256 main.go:74] Setting up client for koordlet
I1114 00:03:16.733648 1751256 koordlet.go:75] NODE_NAME is worker-00,start time 1.668355396e+09
I1114 00:03:16.733667 1751256 koordlet.go:78] sysconf: &{CgroupRootDir:/host-cgroup/ CgroupKubePath:kubepods/ SysRootDir:/host-sys/ SysFSRootDir:/host-sys-fs/ ProcRootDir:/proc/ VarRunRootDir:/host-var-run/ NodeNameOverride: RuntimeHooksConfigDir:/host-etc-hookserver/ ContainerdEndPoint: DockerEndPoint:},agentMode:dsMode
I1114 00:03:16.733687 1751256 koordlet.go:79] kernel version INFO : {IsAnolisOS:false}
I1114 00:03:16.737171 1751256 koordlet.go:101] can not detect cgroup driver from 'kubepods' cgroup name
I1114 00:03:16.846901 1751256 koordlet.go:121] Node worker-00 use 'systemd' as cgroup driver
I1114 00:03:16.861144 1751256 callback_runner.go:95] states informer callback runtime-hooks-reconciler has registered for type RegisterTypeAllPods
I1114 00:03:16.861167 1751256 hooks.go:45] hook GroupIdentity is registered
I1114 00:03:16.861211 1751256 bvt.go:68] update system supported info to false for plugin GroupIdentity
I1114 00:03:16.861221 1751256 reconciler.go:72] register reconcile function reconcile pod level cpu bvt value finished, detailed info: level=pod, filename=cpu.bvt_warp_ns
I1114 00:03:16.861245 1751256 reconciler.go:72] register reconcile function reconcile kubeqos level cpu bvt value finished, detailed info: level=kubeqos, filename=cpu.bvt_warp_ns
I1114 00:03:16.861251 1751256 runtimehooks.go:103] runtime hook plugin GroupIdentity enable true
I1114 00:03:16.861259 1751256 hooks.go:45] hook CPUSetAllocator is registered
I1114 00:03:16.861265 1751256 hooks.go:45] hook CPUSetAllocator is registered
I1114 00:03:16.861269 1751256 hooks.go:45] hook CPUSetAllocator is registered
I1114 00:03:16.861276 1751256 reconciler.go:72] register reconcile function set container cpuset and unset container cpu quota if needed finished, detailed info: level=container, filename=cpuset.cpus
I1114 00:03:16.861283 1751256 reconciler.go:72] register reconcile function unset pod cpu quota if needed finished, detailed info: level=pod, filename=cpu.cfs_quota_us
I1114 00:03:16.861287 1751256 runtimehooks.go:103] runtime hook plugin CPUSetAllocator enable true
I1114 00:03:16.861292 1751256 hooks.go:45] hook gpu env inject is registered
I1114 00:03:16.861297 1751256 runtimehooks.go:103] runtime hook plugin GPUEnvInject enable true
I1114 00:03:16.861303 1751256 callback_runner.go:95] states informer callback runtime-hooks-rule-node-slo has registered for type RegisterTypeNodeSLOSpec
I1114 00:03:16.861309 1751256 callback_runner.go:95] states informer callback runtime-hooks-rule-node-topo has registered for type RegisterTypeNodeTopology
I1114 00:03:16.861457 1751256 main.go:99] Starting the koordlet daemon
I1114 00:03:16.861464 1751256 koordlet.go:147] Starting daemon
I1114 00:03:16.861487 1751256 main.go:89] Starting prometheus server on :9316
I1114 00:03:16.861544 1751256 states_informer.go:137] setup statesInformer
I1114 00:03:16.861566 1751256 states_informer.go:139] starting callback runner
I1114 00:03:16.861579 1751256 states_informer.go:143] starting informer plugins
I1114 00:03:16.861588 1751256 states_informer.go:131] plugin nodeTopoInformer has been setup
I1114 00:03:16.861625 1751256 states_informer.go:131] plugin nodeInformer has been setup
I1114 00:03:16.861629 1751256 states_informer.go:131] plugin podsInformer has been setup
I1114 00:03:16.861649 1751256 states_informer.go:131] plugin nodeSLOInformer has been setup
I1114 00:03:16.861654 1751256 states_informer.go:179] starting informer plugin nodeSLOInformer
I1114 00:03:16.861660 1751256 states_informer.go:179] starting informer plugin nodeTopoInformer
I1114 00:03:16.861678 1751256 states_informer.go:179] starting informer plugin nodeInformer
I1114 00:03:16.861682 1751256 states_informer.go:179] starting informer plugin podsInformer
I1114 00:03:16.861686 1751256 states_informer.go:148] waiting for informer syncing
I1114 00:03:16.861705 1751256 states_nodeslo.go:90] starting node slo informer
I1114 00:03:16.861709 1751256 states_nodeslo.go:92] node slo informer started
I1114 00:03:16.861740 1751256 states_noderesourcetopology.go:98] starting node topo informer
I1114 00:03:16.861757 1751256 states_pods.go:86] starting pod informer
I1114 00:03:16.861787 1751256 states_node.go:87] starting node informer
I1114 00:03:16.861790 1751256 reflector.go:219] Starting reflector *v1alpha1.NodeSLO (12h0m0s) from pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/reflector.go:167
I1114 00:03:16.861806 1751256 states_node.go:89] node informer started
I1114 00:03:16.861809 1751256 reflector.go:255] Listing and watching *v1alpha1.NodeSLO from pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/reflector.go:167
I1114 00:03:16.861876 1751256 reflector.go:219] Starting reflector *v1.Node (12h0m0s) from pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/reflector.go:167
I1114 00:03:16.861888 1751256 reflector.go:255] Listing and watching *v1.Node from pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/reflector.go:167
I1114 00:03:16.862198 1751256 metric_cache.go:781] expired metric data before 2022-11-13 23:33:16.861531641 +0800 HKT m=-1799.832286552 has been recycled, remaining in db size: nodeResCount=0, podResCount=0, containerResCount=0, beCPUResCount=0, podThrottledResCount=0, containerThrottledResCount=0
I1114 00:03:16.864744 1751256 states_nodeslo.go:125] update nodeSLO content: old null, new {"kind":"NodeSLO","apiVersion":"slo.koordinator.sh/v1alpha1","metadata":{"name":"worker-00","uid":"68cd37d5-cd9e-43e6-bf79-aeec7d907ef2","resourceVersion":"88451","generation":1,"creationTimestamp":"2022-11-13T16:01:46Z","managedFields":[{"manager":"koordinator-manager","operation":"Update","apiVersion":"slo.koordinator.sh/v1alpha1","time":"2022-11-13T16:01:46Z","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{".":{},"f:cpuBurstStrategy":{".":{},"f:cfsQuotaBurstPercent":{},"f:cfsQuotaBurstPeriodSeconds":{},"f:cpuBurstPercent":{},"f:policy":{},"f:sharePoolThresholdPercent":{}},"f:resourceQOSStrategy":{},"f:resourceUsedThresholdWithBE":{".":{},"f:cpuSuppressPolicy":{},"f:cpuSuppressThresholdPercent":{},"f:enable":{},"f:memoryEvictThresholdPercent":{}}}}}]},"spec":{"resourceUsedThresholdWithBE":{"enable":true,"cpuSuppressThresholdPercent":65,"cpuSuppressPolicy":"cpuset","memoryEvictThresholdPercent":70},"resourceQOSStrategy":{"lsrClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}},"lsClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}},"beClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}}},"cpuBurstStrategy":{"policy":"none","cpuBurstPercent":1000,"cfsQuotaBurstPercent":300,"cfsQuotaBurstPeriodSeconds":-1,"sharePoolThresholdPercent":50}},"status":{}}
I1114 00:03:16.864782 1751256 states_nodeslo.go:66] create NodeSLO {"kind":"NodeSLO","apiVersion":"slo.koordinator.sh/v1alpha1","metadata":{"name":"worker-00","uid":"68cd37d5-cd9e-43e6-bf79-aeec7d907ef2","resourceVersion":"88451","generation":1,"creationTimestamp":"2022-11-13T16:01:46Z","managedFields":[{"manager":"koordinator-manager","operation":"Update","apiVersion":"slo.koordinator.sh/v1alpha1","time":"2022-11-13T16:01:46Z","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{".":{},"f:cpuBurstStrategy":{".":{},"f:cfsQuotaBurstPercent":{},"f:cfsQuotaBurstPeriodSeconds":{},"f:cpuBurstPercent":{},"f:policy":{},"f:sharePoolThresholdPercent":{}},"f:resourceQOSStrategy":{},"f:resourceUsedThresholdWithBE":{".":{},"f:cpuSuppressPolicy":{},"f:cpuSuppressThresholdPercent":{},"f:enable":{},"f:memoryEvictThresholdPercent":{}}}}}]},"spec":{"resourceUsedThresholdWithBE":{"enable":true,"cpuSuppressThresholdPercent":65,"cpuSuppressPolicy":"cpuset","memoryEvictThresholdPercent":70},"resourceQOSStrategy":{},"cpuBurstStrategy":{"policy":"none","cpuBurstPercent":1000,"cfsQuotaBurstPercent":300,"cfsQuotaBurstPeriodSeconds":-1,"sharePoolThresholdPercent":50}},"status":{}}
I1114 00:03:16.864853 1751256 rule.go:81] applying 2 rules with new RegisterTypeNodeSLOSpec, detail: {"resourceUsedThresholdWithBE":{"enable":true,"cpuSuppressThresholdPercent":65,"cpuSuppressPolicy":"cpuset","memoryEvictThresholdPercent":70},"resourceQOSStrategy":{"lsrClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}},"lsClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}},"beClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}}},"cpuBurstStrategy":{"policy":"none","cpuBurstPercent":1000,"cfsQuotaBurstPercent":300,"cfsQuotaBurstPeriodSeconds":-1,"sharePoolThresholdPercent":50}}
I1114 00:03:16.864867 1751256 rule.go:88] system unsupported for rule GroupIdentity, do nothing during UpdateRules
I1114 00:03:16.962563 1751256 shared_informer.go:270] caches populated
I1114 00:03:16.962949 1751256 states_pods.go:114] pod informer started
E1114 00:03:16.963011 1751256 pleg.go:150] failed to watch path /host-cgroup/cpu/kubepods.slice err inotify_add_watch /host-cgroup/cpu/kubepods.slice: no such file or directory
F1114 00:03:16.963029 1751256 states_pods.go:111] Unable to run the pleg: %!(EXTRA *fs.PathError=inotify_add_watch /host-cgroup/cpu/kubepods.slice: no such file or directory)
goroutine 113 [running]:
k8s.io/klog/v2.stacks(0x1)
	/home/runner/go/pkg/mod/k8s.io/klog/v2@v2.10.0/klog.go:1026 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x3a12d00, 0x3, {0x0, 0x0}, 0xc0009a6310, 0x0, {0x2cf53a3, 0xc000e59670}, 0x0, 0x0)
	/home/runner/go/pkg/mod/k8s.io/klog/v2@v2.10.0/klog.go:975 +0x63d
k8s.io/klog/v2.(*loggingT).printf(0xc000b6cfb0, 0x9f5abe, {0x0, 0x0}, {0x0, 0x0}, {0x234ae90, 0x18}, {0xc000e59670, 0x1, ...})
	/home/runner/go/pkg/mod/k8s.io/klog/v2@v2.10.0/klog.go:753 +0x1e5
k8s.io/klog/v2.Fatalf(...)
	/home/runner/go/pkg/mod/k8s.io/klog/v2@v2.10.0/klog.go:1514
github.com/koordinator-sh/koordinator/pkg/koordlet/statesinformer.(*podsInformer).Start.func2()
	/home/runner/work/koordinator/koordinator/pkg/koordlet/statesinformer/states_pods.go:111 +0xcd
created by github.com/koordinator-sh/koordinator/pkg/koordlet/statesinformer.(*podsInformer).Start
	/home/runner/work/koordinator/koordinator/pkg/koordlet/statesinformer/states_pods.go:109 +0x4cf

Created at 1 week ago
Error on koordlet after installed

@MercusChan We are working on making koordlet compatible with cgroup v2. For more details, you can check the current proposal here.

Created at 1 week ago
jasonliu747 delete branch jasonliu747-patch-1
Created at 1 week ago
ci: simplify go build command

/approve

Created at 1 week ago
pull request opened
ci: simplify go build command

Signed-off-by: Jason Liu <jasonliu747@gmail.com>

Ⅰ. Describe what this PR does

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

V. Checklist

  • [ ] I have written necessary docs and comments
  • [ ] I have added necessary unit tests and integration tests
  • [ ] All checks passed in make test
Created at 1 week ago
jasonliu747 create branch jasonliu747-patch-1
Created at 1 week ago
koordlet: add runtime hook plugin batch resource

/area koordlet

Created at 1 week ago
delete branch
jasonliu747 delete branch codeql
Created at 1 week ago