Skip to content

Upgrade from v1.0.2 to v1.0.3

General information

The Harvester GUI Dashboard page should have an upgrade button to perform an upgrade. For more details please refer to start an upgrade.

For the air-gap env upgrade, please refer to prepare an air-gapped upgrade.

Known issues


1. Fail to download the upgrade image

Description

Downloading the upgrade image can't be complete or fails with an error.

Related issues

Workaround

Delete the current upgrade and start over. Please see "Start over an upgrade".


2. An upgrade is stuck, a node is in "Pre-drained" state (case 1)

Description

Users might see a node is stuck at the Pre-drained state for a while (> 30 minutes).

This might be caused by instance-manager-r-* pod on node harvester-z7j2g can’t be drained. To verify the above case:

  • Check rancher server logs:

    kubectl logs deployment/rancher -n cattle-system
    

    Example output:

    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod longhorn-system/instance-manager-r-10dd59c4
    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod longhorn-system/instance-manager-r-10dd59c4
    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod longhorn-system/instance-manager-r-10dd59c4
    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    
  • Verify the pod longhorn-system/instance-manager-r-10dd59c4 is on the stuck node:

    kubectl get pod instance-manager-r-10dd59c4 -n longhorn-system -o=jsonpath='{.spec.nodeName}'
    

    Example output:

    harvester-z7j2g
    
  • Check degraded volumes:

    kubectl get volumes -n longhorn-system
    

    Example output:

    NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE              AGE
    pvc-08c34593-8225-4be6-9899-10a978df6ea1   attached   healthy      True        10485760      harvester-279l2   3d13h
    pvc-526600f5-bde2-4244-bb8e-7910385cbaeb   attached   healthy      True        21474836480   harvester-x9jqw   3d1h
    pvc-7b3fc2c3-30eb-48b8-8a98-11913f8314c2   attached   healthy      True        10737418240   harvester-x9jqw   3d
    pvc-8065ed6c-a077-472c-920e-5fe9eacff96e   attached   healthy      True        21474836480   harvester-x9jqw   3d
    pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599   attached   degraded     True        10737418240   harvester-x9jqw   2d23h
    pvc-9a6539b8-44e5-430e-9b24-ea8290cb13b7   attached   healthy      True        53687091200   harvester-x9jqw   3d13h
    

    Here we can see volume pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 is degraded.

    Notes

    the user needs to check all degraded volumes one by one.

  • Check degraded volume’s replica state:

    kubectl get replicas -n longhorn-system --selector longhornvolume=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 -o json | jq '.items[] | {replica: .metadata.name, healthyAt: .spec.healthyAt, nodeID: .spec.nodeID, state: .status.currentState}'
    

    Example output:

    {
      "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-15e31246",
      "healthyAt": "2022-07-25T07:33:16Z",
      "nodeID": "harvester-z7j2g",
      "state": "running"
    }
    {
      "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-22974d0f",
      "healthyAt": "",
      "nodeID": "harvester-279l2",
      "state": "running"
    }
    {
      "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5",
      "healthyAt": "",
      "nodeID": "harvester-x9jqw",
      "state": "stopped"
    }
    

    Here the only healthy replica is pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-15e31246 and it’s on node harvester-z7j2g. So we can confirm the instance-manager-r-* pod resides on node harvester-z7j2g and avoids the drain.

Related issues

Workaround

We need to start the “Stopped” replica, from the previous example, the stopped replica’s name is pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5.

  • Check the Longhorn manager log, we should see a replica waiting for the backing image. First, we need to get the manager's name:

    kubectl get pods -n longhorn-system --selector app=longhorn-manager --field-selector spec.nodeName=harvester-x9jqw
    

    Example output:

    NAME                     READY   STATUS    RESTARTS   AGE
    longhorn-manager-zmfbw   1/1     Running   0          3d10h
    
  • Get pod log:

    kubectl logs longhorn-manager-zmfbw -n longhorn-system | grep pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5
    

    Example output:

    (...)
    time="2022-07-28T04:35:34Z" level=debug msg="Prepare to create instance pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5"
    time="2022-07-28T04:35:34Z" level=debug msg="Replica pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5 is waiting for backing image harvester-system-harvester-iso-n7bxh downloading file to node harvester-x9jqw disk 3830342d-c13d-4e55-ac74-99cad529e9d4, the current state is in-progress" controller=longhorn-replica dataPath= node=harvester-x9jqw nodeID=harvester-x9jqw ownerID=harvester-x9jqw replica=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5
    time="2022-07-28T04:35:34Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5\", UID:\"c511630f-2fe2-4cf9-97a4-21bce73782b1\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"632926\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5"
    

    Here we can determine the replica is waiting for the backing image harvester-system-harvester-iso-n7bxh.

  • Get the disk file map from the backing image:

    kubectl describe backingimage harvester-system-harvester-iso-n7bxh -n longhorn-system
    

    Example output:

    (...)
    Disk File Status Map:
        3830342d-c13d-4e55-ac74-99cad529e9d4:
          Last State Transition Time:  2022-07-25T08:30:34Z
          Message:
          Progress:                    29
          State:                       in-progress
        3aa804e1-229d-4141-8816-1f6a7c6c3096:
          Last State Transition Time:  2022-07-25T08:33:20Z
          Message:
          Progress:                    100
          State:                       ready
        92726efa-bfb3-478e-8553-3206ad34ce70:
          Last State Transition Time:  2022-07-28T04:31:49Z
          Message:
          Progress:                    100
          State:                       ready
    

    The disk file with UUID 3830342d-c13d-4e55-ac74-99cad529e9d4 has the state in-progress.

  • Next, we need to find backing-image-manager that contains this disk file:

    kubectl get pod -n longhorn-system --selector=longhorn.io/disk-uuid=3830342d-c13d-4e55-ac74-99cad529e9d4
    

    Example output:

    NAME                              READY   STATUS    RESTARTS   AGE
    backing-image-manager-c00e-3830   1/1     Running   0          3d1h
    
  • Restart the backing-image-manager by deleting its pod:

    kubectl delete pod -n longhorn-system backing-image-manager-c00e-3830
    

3. An upgrade is stuck, a node is in "Pre-drained" state (case 2)

Description

Users might see a node is stuck at the Pre-drained state for a while (> 30 minutes).

Here are some steps to verify this issue has happened:

  • Visit the Longhorn GUI: https://{{VIP}}/k8s/clusters/local/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/#/volume (replace VIP with an appropriate value) and check degraded volumes. The degraded volume might contain a healthy replica only (with blue background) and the healthy replica resides on the "Pre-drained" node:

  • Hover the mouse to the red scheduled icon, the reason is toomanysnapshots:

Related issues

Workaround

  • In the "Snapshots and Backup" panel, toggle the "Show System Hidden" switch and delete the latest system snapshot (which is just before the "Volume Head").

  • The volume will continue rebuilding to resume the upgrade.