Version: v1.1 (EOL)

Upgrade from v1.1.1/v1.1.2 to v1.1.3

General information

An Upgrade button appears on the Dashboard screen whenever a new Harvester version that you can upgrade to becomes available. For more information, see Start an upgrade.

For air-gapped environments, see Prepare an air-gapped upgrade.

Known Issues

1. The upgrade process is stuck when pre-draining a node. (Case 1)

Starting from v1.1.0, Harvester waits for all volumes to become healthy before upgrading a node (for clusters with three or more nodes).

When this issue occurs, you can check the health of the affected volumes on the embedded Longhorn UI. You can also check the pre-drain job logs. For more troubleshooting information, see Phase 4: Upgrade nodes.

2. The upgrade process is stuck when pre-draining a node. (Case 2)

An upgrade is stuck, as shown in the screenshot below:

Harvester is unable to proceed with the upgrade and the status of two or more nodes is SchedulingDisabled.

$ kubectl get nodes
NAME    STATUS                     ROLES                       AGE   VERSION
node1   Ready                      control-plane,etcd,master   20d   v1.24.7+rke2r1
node2   Ready,SchedulingDisabled   control-plane,etcd,master   20d   v1.24.7+rke2r1
node3   Ready,SchedulingDisabled   control-plane,etcd,master   20d   v1.24.7+rke2r1

Related issue:
- [BUG] Multiple nodes pre-drains in an upgrade
Workaround:
- https://github.com/harvester/harvester/issues/3216#issuecomment-1328607004

3. The upgrade process is stuck on the first node.

Harvester attempts to upgrade the first node but is unable to proceed. The upgrade eventually fails because the job is not completed by the expected end time.

4. The status of a Fleet bundle after the upgrade indicates that deployment errors occurred.

After an upgrade is completed, the status of a bundle managed by Fleet may be ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]. The errors that occurred while deploying the bundle may block the next Harvester upgrade or managedChart update if not addressed.

To check the status of bundles, run the following command:

kubectl get bundles -A

The following output indicates that the issue exists in your cluster.

NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             0/1                       ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]
fleet-local   local-managed-system-agent                    1/1
fleet-local   mcc-harvester                                 1/1
fleet-local   mcc-harvester-crd                             1/1
fleet-local   mcc-local-managed-system-upgrade-controller   1/1
fleet-local   mcc-rancher-logging                           1/1
fleet-local   mcc-rancher-logging-crd                       1/1
fleet-local   mcc-rancher-monitoring                        1/1
fleet-local   mcc-rancher-monitoring-crd                    1/1

5. The upgrade process stops after the upgrade repository is created.

Harvester is unable to retrieve the harvester-release.yaml file and proceed with the upgrade. The following error message is displayed: Get "http://upgrade-repo-hvst-upgrade-mldzx.harvester-system/harvester-iso/harvester-release.yaml": context deadline exceeded (Client.Timeout exceeded while awaiting headers) context deadline exceeded (Client.Timeout exceeded while awaiting headers)` message:

This issue was fixed in v1.1.2. For v1.1.0 and v1.1.1 users, however, the workaround is to restart the upgrade process.

Related issue:
- https://github.com/harvester/harvester/issues/3729
Workaround:
- Start over an upgrade

6. The upgrade is stuck in the "Pre-drained" state.

This issue could be caused by a misconfigured pod disruption budget (PDB).

You can perform the following steps to confirm the cause and use the current workaround. In this example, the affected node is harvester-node-1.

Check the name of the instance-manager-e or instance-manager-r pod on the node.

$ kubectl get pods -n longhorn-system --field-selector spec.nodeName=harvester-node-1 | grep instance-manager
instance-manager-r-d4ed2788          1/1     Running   0              3d8h

The output shows that the instance-manager-r-d4ed2788 pod is on the node.

Check the Rancher logs and verify that the instance-manager-e or instance-manager-r pod cannot be drained.

$ kubectl logs deployment/rancher -n cattle-system
...
2023-03-28T17:10:52.199575910Z 2023/03/28 17:10:52 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-4f8cb698b24a,custom-a0f714579def
2023-03-28T17:10:55.034453029Z evicting pod longhorn-system/instance-manager-r-d4ed2788
2023-03-28T17:10:55.080933607Z error when evicting pods/"instance-manager-r-d4ed2788" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

Check if a PDB is associated with the node.

$ kubectl get pdb -n longhorn-system -o yaml | yq '.items[] | select(.spec.selector.matchLabels."longhorn.io/node"=="harvester-node-1") | .metadata.name'
instance-manager-r-466e3c7f

Check the owner of the instance manager for the associated PDB.
```
$ kubectl get instancemanager instance-manager-r-466e3c7f -n longhorn-system -o yaml | yq -e '.spec.nodeID'
harvester-node-2
```
If the output does not show the affected node, the issue exists in your cluster. In this example, the output shows harvester-node-2 instead of harvester-node-1.
Check the health of all volumes.
```
kubectl get volumes -n longhorn-system -o yaml | yq '.items[] | select(.status.state == "attached")| .status.robustness'
```
The output should show that all volumes are marked healthy. If not, consider uncordoning nodes to improve volume health.

Remove the misconfigured PDB.

kubectl delete pdb instance-manager-r-466e3c7f -n longhorn-system

Related issue:
- [BUG] 3 Node AirGapped Cluster Upgrade Stuck v1.1.0->v1.1.2-rc4

General information​

Known Issues​

1. The upgrade process is stuck when pre-draining a node. (Case 1)​

2. The upgrade process is stuck when pre-draining a node. (Case 2)​

3. The upgrade process is stuck on the first node.​

4. The status of a Fleet bundle after the upgrade indicates that deployment errors occurred.​

5. The upgrade process stops after the upgrade repository is created.​

6. The upgrade is stuck in the "Pre-drained" state.​