EsoErik

Tuesday, January 5, 2016

 

Arch ZFS root with 4.1.15-1-lts kernel: boot fails with "ZFS: Unable to import pool root_pool", but "zpool import root_pool" in rescue shell works!

On a particular system with heaps of drives and controllers (booting involves 10+ minutes of controller BIOSes loading, enumerating, scanning, installing, etc.), with a ZFS root, the following Arch Linux issue has come, gone, and come again:
  • The initrd step of booting fails out to a rescue shell with the error, "ZFS: Unable to import pool root_pool".
  • In that rescue shell, running "zpool import root_pool" succeeds, leading me to wonder why the same thing had apparently just failed.
To get past this error, I've previously banged gongs, chanted in tongues, burned incense, sent postcards to Jesus, and things such as that.  However, the issue never quite went away.

Sigh.  What a pain.  And so, we arrive at our very last and final reserve strategy: using our brain.  Alright, brain.  Get to thinking.  In the meantime, I'm going to play some video games.

45 minutes later, our brain says:
"So, when you later try the thing that failed, the thing that failed then succeeds.  Maybe it just fails the first time because it's idiotic and can only ever work when retried, or perhaps the delay itself is the important part?  Who cares!  Just make it retry once or twice with increasing delays.  This is a good plan.  Do it.  No more thinking is required at this time.  Signing off."

Thanks, brain!  Let's try that.  We boot an Arch image with ZFS integrated from a thumb drive, then run "zpool import root_pool -R /mnt", and then we "arch-chroot /mnt".  There's this file, /usr/lib/initcpio/hooks/zfs.  We edit it.  The edited portion is the nested sleep-and-retry logic in zfs_mount_handler that follows the first failed "zpool import", starting at the "Sleeping for 10 seconds" message.

ZPOOL_FORCE=""
ZPOOL_IMPORT_FLAGS=""

zfs_get_bootfs () {
    for zfs_dataset in $(/usr/bin/zpool list -H -o bootfs); do
        case ${zfs_dataset} in
            "" | "-")
                # skip this line/dataset
                ;;
            "no pools available")
                return 1
                ;;
            *)
                ZFS_DATASET=${zfs_dataset}
                return 0
                ;;
        esac
    done
    return 1
}

zfs_mount_handler () {
    local node=$1
    if [ "$ZFS_DATASET" = "bootfs" ] ; then
        if ! zfs_get_bootfs ; then
            # Lets import everything and try again
            /usr/bin/zpool import $ZPOOL_IMPORT_FLAGS -N -a $ZPOOL_FORCE
            if ! zfs_get_bootfs ; then
                echo "ZFS: Cannot find bootfs."
                return 1
            fi
        fi
    fi

    local pool="${ZFS_DATASET%%/*}"
    local rwopt_exp=${rwopt:-ro}

    if ! "/usr/bin/zpool" list -H $pool > /dev/null 2>&1 ; then
        if [ "$rwopt_exp" != "rw" ]; then
            msg "ZFS: Importing pool $pool readonly."
            ZPOOL_IMPORT_FLAGS="$ZPOOL_IMPORT_FLAGS -o readonly=on"
        else
            msg "ZFS: Importing pool $pool."
        fi

        if ! "/usr/bin/zpool" import $ZPOOL_IMPORT_FLAGS -N $pool $ZPOOL_FORCE ; then
            echo "ZFS: Unable to import pool $pool.  Sleeping for 10 seconds and then trying again."
            sleep 10
            if ! "/usr/bin/zpool" import -d /dev -N $pool -f ; then
                echo "ZFS: Unable to import pool $pool.  Sleeping for 60 seconds and then trying yet again."
                sleep 60
                if ! "/usr/bin/zpool" import -d /dev -N $pool -f ; then
                    echo "ZFS: Unable to import pool $pool, even after waiting a good long time for devices to show up."
                    return 1
                fi
            fi
        fi
    fi

    local mountpoint=$("/usr/bin/zfs" get -H -o value mountpoint $ZFS_DATASET)
    if [ "$mountpoint" = "legacy" ] ; then
        mount -t zfs -o ${rwopt_exp} "$ZFS_DATASET" "$node"
    else
        mount -o zfsutil,${rwopt_exp} -t zfs "$ZFS_DATASET" "$node"
    fi
}

run_hook() {
    # Force import the pools, useful if the pool has not properly been exported
    # using 'zpool export <pool>'
    [[ $zfs_force == 1 ]] && ZPOOL_FORCE='-f'
    [[ "$zfs_import_dir" != "" ]] && ZPOOL_IMPORT_FLAGS="$ZPOOL_IMPORT_FLAGS -d $zfs_import_dir"

    if [ "$root" = 'zfs' ]; then
        mount_handler='zfs_mount_handler'
    fi

    case $zfs in
        auto|bootfs)
            ZFS_DATASET='bootfs'
            mount_handler="zfs_mount_handler"
            ;;
        *)
            ZFS_DATASET=$zfs
            mount_handler="zfs_mount_handler"
            ;;
    esac

    if [ ! -f "/etc/hostid" ] ; then
        echo "ZFS: No hostid found on kernel command line or /etc/hostid. ZFS pools may not import correctly."
    fi

    # Allow up to 10 seconds for zfs device to show up
    for i in 1 2 3 4 5 6 7 8 9 10; do
        [ -c "/dev/zfs" ] && break
        sleep 1
    done
}
 

Next, we run "mkinitcpio -p linux-lts", exit the chroot, "zpool export root_pool", reboot, and we see that the first zpool import attempt still fails.  However, we are then into the code we added, which waits ten seconds and tries zpool importing again.  This second attempt succeeds.
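For reference, the whole repair session from the live environment can be sketched as one small script. This is only a sketch under assumptions: the run wrapper, the DRY_RUN and POOL variables, and the repair_session function are names made up here, the editor step falls back to vi if EDITOR is unset, and the commands are passed to arch-chroot as arguments instead of being typed into an interactive chroot. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the repair session, meant to run from the live Arch+ZFS thumb drive.
# DRY_RUN, POOL, run and repair_session are illustrative names, not part of any tool.
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands, anything else = execute them
POOL=${POOL:-root_pool}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repair_session() {
    # Import the root pool with /mnt as its alternate root.
    run zpool import -R /mnt "$POOL"
    # Edit the hook, then rebuild the initramfs for the LTS kernel, inside the chroot.
    run arch-chroot /mnt "${EDITOR:-vi}" /usr/lib/initcpio/hooks/zfs
    run arch-chroot /mnt mkinitcpio -p linux-lts
    # Export cleanly so the next boot does not need a forced import.
    run zpool export "$POOL"
    run reboot
}

repair_session
```

Running it with DRY_RUN=0 would actually execute the steps, including the reboot, so it is worth reading the printed commands first.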

In summary: most likely, the "waiting for uevents" step that precedes the first import attempt does not do quite enough waiting. Waiting more and retrying works. There is some possibility that the import just fails the first time and works the second, but the point remains that waiting more and retrying works, and that's what we do.
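The nested ifs in the hook do the job, but the same "retry with increasing delays" idea can be written as a small helper. What follows is a minimal sketch, not what the hook ships with: retry_with_delays and flaky are hypothetical names, and unlike the edit above (where the retries use "-d /dev ... -f" instead of the original flags) the helper reruns exactly the same command on every attempt.

```shell
#!/bin/sh
# retry_with_delays "<delays>" cmd [args...]
# Hypothetical helper: runs cmd once; after each failure it sleeps for the next
# delay in the space-separated list and tries again. Returns 0 on the first
# success and 1 if every attempt fails. Uses plain (global) variables to stay POSIX.
retry_with_delays() {
    delays=$1
    shift
    "$@" && return 0
    for delay in $delays; do
        echo "Attempt failed.  Sleeping for $delay seconds and then trying again." >&2
        sleep "$delay"
        "$@" && return 0
    done
    return 1
}

# Demo: a stand-in command that fails twice and succeeds on the third call.
attempts=0
flaky() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

retry_with_delays "0 0" flaky && echo "succeeded after $attempts attempts"
```

In the hook, the equivalent of the edit above might look like retry_with_delays "10 60" /usr/bin/zpool import -d /dev -N "$pool" -f, with the failure message and "return 1" handled by the caller.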
