v2.6.0
Major features
- Optimize scheduler log
- Support enflame gcu-share
- Support metax GPU and metax sGPU
- Helm chart add checksum annotation for restarting hami component after ConfigMap modification
- Support for using RuntimeClass with nvidia devices
- Add support for profiling via net/http/pprof package
- Add nvidia gpu topology score registry to node
- Feat: vGPUmonitor support MigInfo metrics
Major bug fixes
- Fix stuck in driver 570+
- Fix device memory not counted properly in comfyUI task
- Fix cambricon devices not allocated properly
- Fix wrong log and container request device count error
- Fix vgpu-devices-allocated annotations are inconsistent
- Fix removing node devices from node manager
- Fix: Dynamic GPU partitioning lacks single-GPU-level granularity
- Fix device memory count error on cuMallocAsync
- Fix scheduler crash if a 'mig' task running accidentally on a 'hami-core' GPU
- Fix multi-process device memory count
What's Changed
Dependencies
- Bump docker/build-push-action from 6.11.0 to 6.13.0 by (@dependabot) in #837
- Bump golang.org/x/net from 0.26.0 to 0.35.0 by (@dependabot) in #859
- Bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 by (@dependabot) in #941
- Bump docker/login-action from 3.3.0 to 3.4.0 by (@dependabot) in #942
- Bump docker/build-push-action from 6.13.0 to 6.15.0 by (@dependabot) in #899
- build(deps): bump docker/build-push-action from 6.15.0 to 6.16.0 by (@dependabot) in #1024
- build(deps): bump docker/build-push-action from 6.16.0 to 6.17.0 by (@dependabot) in #1052
- build(deps): bump docker/build-push-action from 6.17.0 to 6.18.0 by (@dependabot) in #1091
Other Changes
- fix: Enhance GPU metrics collection and error handling in vGPU monitor by (@haitwang-cloud) in #827
- refactor: update service configurations for device plugin and scheduler by (@haitwang-cloud) in #799
- add ut for scheduler/score by (@shijinye) in #853
- add ut for device/metax by (@shijinye) in #850
- Remove duplicate log fields by (@learner0810) in #860
- [docs] Fix default nvidia.resourceCoreName value in config.md by (@chinaran) in #842
- Update libvgpu.so by (@archlitchi) in #876
- update example.png by (@rockpanda) in #874
- support ascend 910B2 by (@ouyangluwei163) in #885
- fix docs typos by (@JinVei) in #869
- Accelerate node score calculations using multiple goroutines by (@learner0810) in #824
- Support Metax SGPU to sharing GPU by (@Kyrie336) in #895
- docs: fix broken commmunity links by (@agilgur5) in #907
- add config gpu core isolation policy for webhook by (@lengrongfu) in #901
- feat: support scheduler replicas > 1 by (@Azusa-Yuan) in #898
- docs: add syntax highlighting to various code blocks by (@agilgur5) in #906
- Fix UT not be properly executed during CI phase by (@archlitchi) in #911
- typo: fix typos in log and comment by (@popsiclexu) in #917
- feat: Add kube-qps and kube-burst parameters. by (@chaunceyjiang) in #769
- docs: Update MAINTAINERS file with current contributor information by (@Nimbus318) in #918
- Nominate chaunceyjiang to reviewer by (@chaunceyjiang) in #926
- build: update dependencies and remove unused cdiapi by (@yxxhero) in #903
- add lengrongfu to reviewers by (@lengrongfu) in #937
- chore: add namespace override for multi-namespace deployments by (@chinaran) in #924
- fix: hygon dcu concurrent creation conflict by (@joy717) in #921
- Fix the wrong describe of device registry in protocol.md by (@hurricane1988) in #910
- chore: helm chart support scheduler webhook cert-manager by (@chinaran) in #951
- refactor(scheduler): replace init methods with constructor functions by (@yxxhero) in #905
- add Dependencies policy and Security policy by (@yangshiqi) in #934
- scheduler: fix blocked the nodeNotify channel when node changes by (@Iceber) in #964
- docs: Update Ascend910 support documentation by (@zhaikangqi331) in #988
- update iluvatar's docs by (@yangshiqi) in #995
- refactor: replace interface{} with any in various files by (@yxxhero) in #1000
- scheduler: fix duplicate handling of the node label selector by (@Iceber) in #965
- refactor(.github/workflows/ci.yaml): Update golangci-lint to v2.0 and modify .golangci.yaml by (@yxxhero) in #1002
- update hami arch by (@wawa0210) in #1007
- Update README.md by (@yowenter) in #1005
- refactor: simplify code by using modern constructs by (@Shouren) in #978
- scheduler: fix removing node devices from node manager by (@Iceber) in #966
- feat: Add support for profiling via net/http/pprof package by (@Shouren) in #963
- Support Enflame gcushare for enflame devices by (@archlitchi) in #1013
- docs: Remove ACTIVE_OOM_KILLER environment variable description by (@chinaran) in #1015
- refactor(vGPUmonitor): change Run to RunE and return errors by (@yxxhero) in #999
- refactored the filter logs and event messages to enhance their clarity, by (@Wangmin362) in #1023
- feat: Support for using RuntimeClass with nvidia devices by (@chinaran) in #1021
- fix wrong log and container request device count error by (@Wangmin362) in #1020
- feat: helm chart add checksum annotation for restarting hami component after ConfigMap modification by (@chinaran) in #1022
- fix vgpu-devices-allocated annotations are inconsistent #991 by (@ouyangluwei163) in #1012
- add Enflame GCU S60 into roadmap. by (@winston-zhang-orz) in #1030
- add nvidia-smi command show cuda version info by (@lengrongfu) in #953
- Separate options from client to make the responsibility more clear. by (@yangshiqi) in #938
- Add nvidia gpu topology score registry to node by (@lengrongfu) in #1018
- fix(cicd): update ci.yaml to upload coverage to Codecov by (@Shouren) in #1056
- feat(Actions): Add an action to label pr automatically by (@Shouren) in #1053
- fix: Improve Metax GPU usability and fix related issues by (@Kyrie336) in #1063
- fix(chart): support GKE pre-release versions via kubeVersion '-0' by (@Nimbus318) in #1072
- fix: Dynamic GPU partitioning lacks single-GPU-level granularity. (#1… by (@Goend) in #1061
- update maintainer information by (@wawa0210) in #1079
- add LIBCUDA_LOG_LEVEL env to device-plugin by (@lengrongfu) in #1087
- fix: missing apiVersion in serviceMonitor dashboard docs by (@ntheanh201) in #1077
- test(pkg/util): Add some unit tests for pkg/util by (@Shouren) in #1067
- feat: vGPUmonitor support MigInfo metrics by (@ouyangluwei163) in #1048
- update hami-core version by (@lengrongfu) in #1082
New Contributors
- rockpanda (@rockpanda)
- ouyangluwei163 (@ouyangluwei163)
- JinVei (@JinVei)
- Shouren (@Shouren)
- Kyrie336 (@Kyrie336)
- agilgur5 (@agilgur5)
- Azusa-Yuan (@Azusa-Yuan)
- popsiclexu (@popsiclexu)
- hurricane1988 (@hurricane1988)
- Iceber (@Iceber)
- zhaikangqi331 (@zhaikangqi331)
- yowenter (@yowenter)
- Wangmin362 (@Wangmin362)
- winston-zhang-orz (@winston-zhang-orz)
- Goend (@Goend)
- ntheanh201 (@ntheanh201)
Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.3...v2.6.0









