How to handle “Data error (cyclic redundancy check)” (system error code 23)

If you get an error like this during image backups:

Error on client occurred: Error while reading from shadow copy device (1). Data error (cyclic redundancy check).

Or like this during a file backup:

Error getting file patch for “x/y/z.file” from CLIENT. Errorcode: ERRORCODES (11)
Remote Error: Reading from file failed. System error code: 23

Or, if you get the same error while copying or accessing a file in any program (Windows Explorer, etc.): you have damaged sectors on your hard disk. One way to confirm this is to have a look at your disk’s S.M.A.R.T. values (e.g. with SpeedFan; please comment with a better alternative if you have any).

The raw value of “Current pending sector count” should be greater than zero.
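If you prefer a command-line tool, smartmontools can show the same values (the device name /dev/sda is just an example; adjust it to your disk):

smartctl -A /dev/sda

Look for the raw values of the “Current_Pending_Sector” and “Reallocated_Sector_Ct” attributes in the output.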

As an explanation: your hard disk is designed such that the probability of any given sector becoming unreadable is really low. Nevertheless, it is still possible and expected during a hard disk’s lifetime, so the hard disk is engineered to handle this failure case. The hard disk has a few spare sectors. If a sector becomes damaged, the hard disk replaces it with one of those spare sectors. The good case is that the hard disk detects that a sector has problems while still being able to read the data in it (e.g. by retrying or using storage redundancy). In this case it can read the data, write it to a spare sector and use the spare sector from then on (the S.M.A.R.T. value “Reallocated Sector Count” is increased by one).
If it cannot read the data, it has to return an error saying that it cannot read the sector (and increase “Current Pending Sector Count” by one). The next time the sector is written to, it’ll write the data to a spare sector and replace the damaged sector with the spare one (decreasing “Current Pending Sector Count” by one while increasing “Reallocated Sector Count” by one).
The error being returned by Windows if the hard disk says a sector is unreadable is “Data error (cyclic redundancy check)” (system error code 23).


If you don’t have a copy of the damaged sector, you have lost data. You could retry a few times, or put the hard disk in a freezer and then retry, but chances are it won’t work.

This doesn’t mean your hard disk is completely broken. It may be an indication that it is going to break completely soon, but you also might have just been unlucky during normal operation.

How to fix it

If the error message includes a file path, replace the file with a backup copy. If you don’t have a backup copy, you can either delete the whole file, or use the scrub tool below to only delete the damaged parts of the file (if it is a movie file or an image it might be mostly okay).

If it is occurring during image backups, the scrub tool has the option to find out which files are damaged. You can then either replace the files with backup copies, delete the files or let the scrub tool delete the damaged parts of the file.

If the damaged files are Windows system files, they might get fixed by running a system integrity check. Open an admin console (WINDOWS + X, then “Windows PowerShell (Administrator)”), then enter “sfc /scannow”.

Scrub tool

The UrBackup client (starting with 2.4.x) includes a tool that scans a whole disk, lists the damaged files and can optionally delete the damaged parts of a file.

  • Download and install UrBackup client from https://www.urbackup.org/download.html#client_windows
  • Go to “C:\Program Files\UrBackup” in Explorer
  • Right-click “scrub_disk.bat” then select “Run as Administrator”
  • Select the volume you want to repair
  • If it finds unreadable sectors it asks if you want to list the damaged files (this might take some time) and if you want to delete those unreadable sectors

Btrfs file system stability status update

I wrote before in a blog entry about btrfs stability. That was almost exactly seven years ago. The issue I had back then (a limit on the number of hard links in the same directory) was fixed a long time ago.

Since then btrfs has come a long way. The most significant development was that SUSE started using btrfs as root file system and put several developers to work on finding and fixing bugs. Simultaneously, only a few new features were added. This significantly improved stability (at least for the root file system use case).

As just hinted, the main problem is that each developer looks after their own set of use cases. Developers from Facebook mainly look at their use case of running btrfs on their HDFS storage nodes. They have redundancy etc. at a higher level, so they are probably using btrfs only with single disks and focusing on making btrfs performant and stable within this use case. SUSE, as said, uses it as root file system (system files), where there are single disks or, at most, mirrored disks. File system repair isn’t too important there, as those file systems don’t get very large and contain non-unique data.

Compared to that, when using UrBackup with btrfs:

  • Storage is for backup and/or long term archival (RAID6 etc. would be appropriate)
  • Can have a lot of threads writing data to the file system at once
  • Can have combination of threads writing lots of small files and large files (image backups)
  • Lots of snapshots (backups) of different sub-volumes, some of which can persist for a long time
  • Deduplication (at least file level)
  • Can’t simply reboot the server when it gets stuck, as you could with e.g. an HDFS node
  • UrBackup imposes a database-style workload (a db for deduplication), where latency is important, as well as the backup workload, where throughput is important
  • File system can get very large and it is very time consuming to move data from a damaged (read-only mounted) file system to a new one

All of this combined was guaranteed to cause problems when seriously using btrfs as UrBackup storage. The most persistent issue was premature ENOSPC, where btrfs reports an error saying it is out of space and remounts the file system read-only, even though there is still a lot of free space (both for data and metadata) available on the disk(s). The problem seems to be solved (on the systems I observe) with Linux 5.1, or with Linux 4.19.x plus this patch. RAID56 is still officially “unstable”.

Btrfs isn’t alone in having issues with this workload. When using UrBackup seriously (a large number of clients with many simultaneous backups) with e.g. ZFS, I experienced a lot of different issues with ZFSOnLinux (especially in connection with memory/thread management), making ZFSOnLinux unusable. ZFS on FreeBSD was a little better, but there were also issues, occurring about once per week, that caused hangs.

Btrfs also isn’t the only part of the Linux storage stack that needed improvement. For example, writeback throttling was significantly improved a few years ago. This improves memory management, making Linux better able to handle the mixed database and archival workload mentioned above. Not to say that all errors are fixed: for example, I recently learned that a call to synchronize the file system, to make sure that data is actually written to disk, does not return errors on Linux. There is now a work-around for that in UrBackup, at least for btrfs, but there hasn’t been a general fix on the Linux side yet.

Another important consideration is performance. One thing to keep in mind w.r.t. btrfs is that there is a feature, or better, a trade-off: there is a back-reference from each disk data location to each file/item using that disk data location. So you can ask btrfs to list all files having data at e.g. disk offset 726935388160, and it will do so fast compared to other file systems like EXT, ZFS, NTFS, etc., which would have to scan through all files (an example follows after the list below). Managing the backref metadata comes at the cost of having more metadata in general, though. Operations such as writing new data to files, deleting files, deleting snapshots, etc. become a bit slower, as the backref metadata has to be added or deleted in addition to the (forward) metadata pointing from file to data. Btrfs goes to great lengths to make this fast, though (delayed backrefs etc.). There are a number of unique features of btrfs which could be implemented because of the presence of back-references:

  • File system shrinking
  • Device removal
  • Balancing
  • If there is a corruption, it can show all the files (and offsets into files) affected
  • Quotas
  • Btrfs send
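As an illustration of the offset query mentioned above, btrfs-progs can resolve a logical disk offset to the files referencing it (the mount point /mnt/backup and the offset are placeholders):

btrfs inspect-internal logical-resolve 726935388160 /mnt/backup

This prints the path of every file that has data at that disk location.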

The performance problem with backrefs is that if a file has a lot of reflinks or snapshots, it may have a lot of backrefs. Running one of the operations above then involves a lot of iteration through all those backrefs, making the operation (unusably) slow. UrBackup naturally creates a lot of snapshots and may reflink certain files many times. The work-around (at least for me) is to avoid the above operations at all costs and to patch out the reflink walking in btrfs send.

Conclusion

Should you use btrfs with UrBackup? It is certainly the file system with which UrBackup works best. Beyond that, you’d have to see if you can meet the following:

  • A Linux kernel of at least 4.19.x (I suggest using the latest LTS kernel). If you have ENOSPC issues with 4.19.x, apply the patch.
  • The ability to avoid the operations above, such as file system shrinking or device removal
  • The ability to avoid btrfs RAID56 (i.e. use RAID1(0) instead, or use btrfs in single mode on top of a storage layer that provides integrity, such as Ceph or the UrBackup Appliance)
  • ECC memory (strongly suggested), as repairing btrfs once it is damaged is mostly not possible

Connect clients with an HTTPS CONNECT web proxy

With 2.4.x you can use UrBackup with an HTTPS proxy. This way you can have the web interface and the clients connecting on the same port, secured by the same transport encryption (SSL). This post shows how to do this in combination with the Apache web server.

The idea is that the client connects to the web server and issues an HTTP CONNECT request for the actual UrBackup server.
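On the wire this looks roughly like the following (the target is the UrBackup Internet port used in the configuration below; the request ends with an empty line):

CONNECT 127.0.0.1:55415 HTTP/1.1
Host: 127.0.0.1:55415

HTTP/1.1 200 Connection Established

After the “200” response, the connection is a transparent tunnel to the UrBackup server.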

First, enable the CONNECT proxy module in Apache. On Debian via

a2enmod proxy_connect  

Then allow connections to the UrBackup server Internet port by adding

AllowConnect 55415

to your Apache configuration.

Next, in your Apache virtual host configuration, set proxy options such as the timeout, allow proxy connections to the UrBackup server, and disallow them to every other host:

ProxyTimeout 600
ProxyRequests On  
<Proxy 127.0.0.1:55415>
</Proxy>
<ProxyMatch "^(?!127\.0\.0\.1:55415$).*$">
    Order Deny,Allow
    Deny from all
</ProxyMatch>
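Afterwards, restart Apache so the new module and configuration take effect. On Debian:

systemctl restart apache2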

Then, go to your UrBackup server web interface and set your web server URL as the Internet client proxy (https://example.com) and the Internet server name/IP as 127.0.0.1. Internet clients should then start connecting via your web server to your UrBackup server. Once all clients connect this way, you could turn off UrBackup’s built-in Internet transfer encryption and rely on SSL.
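If clients don’t connect, you can test the tunnel by hand: open a TLS connection and type the CONNECT request shown earlier, followed by an empty line (example.com again stands for your web server):

openssl s_client -quiet -connect example.com:443 -servername example.com

If everything is set up correctly, Apache answers with “HTTP/1.1 200 Connection Established”.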

Fixing client IP addresses

You may notice that on the status page all Internet clients now show the IP address of your web server as their IP address. Fixing this is a bit difficult, as there is no standard way to forward the client IP address information from the web server (unlike with a normal HTTP proxy, where there is an X-Forwarded-For header). So, a bit of hacking is in order. I modified the mod_proxy_connect Apache module to forward the client IP information in a 50 byte buffer to the backend: mod_proxy_connect.c
On Debian you can replace your original mod_proxy_connect with the modified one via the following commands:

apt install apache2-dev
wget https://gist.githubusercontent.com/uroni/143c0d7ed6169e89f2d6c59a870dd4cc/raw/28dd30b1f82938777c504f2afdc5f162fd91b3fd/mod_proxy_connect.c
apxs -i -a -c mod_proxy_connect.c

Then in the UrBackup server advanced global settings set “List of server IPs (proxys) from which to expect endpoint information (forwarded for) when connecting to Internet service (needs server restart)” to include your web server IP (127.0.0.1 in the example here). After a server restart you should be able to see the actual client IP instead of the web server IP on the status screen.

Fixing SNI errors

If you have multiple virtual hosts with SSL, there is an issue with SNI. Apache automatically compares the hostname in the CONNECT request with the server name in the SSL connection (SNI) and rejects the request if they differ. The only solution (or ugly hack) I found was to add the hostname with the target IP to /etc/hosts and then use the hostname instead of the IP in the CONNECT request. I.e., add “127.0.0.1 example.com” to /etc/hosts, then replace 127.0.0.1 with example.com in all the configuration above.
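On the web server this can be done with e.g.:

echo "127.0.0.1 example.com" >> /etc/hosts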

Additional proxy authentication

As an additional security layer, one can require proxy authentication. Clients then need to know a username and password to get through the web server to the UrBackup server. With Apache e.g.:

htpasswd -c -b /etc/apache2/urbackup_password urbackup passw0rd

Then modify the proxy section to:

<Proxy 127.0.0.1:55415>
    AuthType Basic
    AuthName "Restricted UrBackup"
    AuthBasicProvider file
    AuthUserFile "/etc/apache2/urbackup_password"
    Require user urbackup
</Proxy>

Afterwards, add the username and password to the proxy URL, e.g. https://urbackup:passw0rd@example.com

UrBackup Windows Installer creator

If you wanted to create a generic UrBackup client installer for your Internet server, you previously had to create one yourself following the instructions here.

I have made https://installercreator.urbackup.org/ to make this process easier. Given the correct parameters, it’ll create an executable that creates a client on your server, downloads the client installer for this newly created client and then runs this installer. It’ll use the computer name as the default client name, optionally appending a random string so that name collisions do not occur.

RAID calculator/simulator

Go to RAID calculator/simulator

This is a RAID calculator/simulator. Given a desired failure percentage and space overhead, it simulates a few disk arrangements and then outputs the one where the failure percentage is lower than or equal to the desired one.

So you don’t have to read up on when it is best practice to use RAID5 vs. RAID6, or into how many disk groups to arrange your RAID50.
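As a crude back-of-the-envelope example of the kind of calculation involved (assuming independent failures, ignoring rebuild windows and UREs, and counting any year with two or more failed disks in a group as data loss, which overestimates the real risk): with a 5% annual failure probability per disk, a 6-disk RAID5 group fails with probability

1 − (1 − 0.05)^6 − 6 · 0.05 · (1 − 0.05)^5 ≈ 3.3% per year

The simulator does this kind of calculation for the different disk arrangements and picks one that meets the desired failure percentage.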

A few notes:

  • The default disk failure percentage per year is 5%. This is at the upper end. Depending on the disk model, how much the disks are used, and things like temperature and vibration isolation, one would probably set it to more like 1% (see e.g. the statistics here)
  • The Unrecoverable Read Error (URE) rate is one error per 1e14 bits read by default. As mentioned in the Wikipedia article, this is often also 1e15. You can get this info from the disk data sheets, for example here for the Seagate IronWolf.

What can still be improved:

  • Currently it assumes that disk failures are independent (“independent and identically distributed random variables” from probability theory), i.e. one disk failing doesn’t make it more likely that others will fail. This may not be the case if you put more than one disk of the same model, or even the same production batch, into the same RAID group, because those disks may all have the same flaws, which can make them die at about the same time after identical usage (taking your RAID down with them). So, try to avoid putting identical disks into the same RAID group
  • As one can see, it only uses a single failure probability per year. A more accurate simulation could probably be had if the failure probability were adjusted for age: disks often fail early on, then fail less, then start to fail more often as they get older.

Infscape UrBackup Appliance uses this RAID calculator/simulator to automatically arrange the RAID configuration according to the storage failure percentage and overhead you specify. This has a few advantages compared to manually setting up a RAID:

  • It is easier because you don’t have to manually arrange the disks, and it also avoids accidental misconfiguration
  • When a disk fails, it can simply recalculate the RAID configuration with one disk less and use this configuration to write new data. After waiting a bit for the disk to be replaced, it can also automatically reshape the data into this new configuration
  • When a disk becomes full, it can calculate a RAID configuration without the full disk and write more data using this new configuration, allowing the RAID to fill disks of different sizes while keeping the overall failure guarantees of the storage

Skewed Color

In which I debug why the tray icon on my laptop doesn’t turn yellow during backups anymore and just cannot find the issue:

Turns out I have f.lux on and it is night time and f.lux makes the yellow look like white. Now is that a bug or not?

Image mounting in UrBackup Server 2.1.x

UrBackup 2.1.x can now mount image backups. That is, it lists the image backups it has on the web interface, and you can browse into them and e.g. download files or directories as a ZIP file.
I am particularly proud that this works on Linux as well as on Windows, with both raw image files and VHD(z) files. On FreeBSD it currently only works with raw image files. The screenshots are from a Windows UrBackup server.

On Linux it uses libguestfs-tools to mount images in a sandboxed virtual machine. On Windows/FreeBSD there is no such sandbox, so mounting a hostile image may be a dangerous operation there.

How I backup LVM volumes on my Xen server

I’ve got a Xen server which runs a couple of Linux and Windows VMs. The VMs are stored in LVM volumes in an LVM volume group which is on a bcache device. The bcache device consists of a mirrored SSD pair (using mdraid) as cache and a mirrored HDD pair (also using mdraid) as backing storage. The SSD caching gives a nice performance boost, but nowadays I would go with SSD-only storage, because bcache caused some problems (it did not play nice with udev during boot).
The Windows VMs are backed up by installing the UrBackup client in the VMs. To restore, I’d need to boot the restore CD in Xen or restore the Windows images via the command line in the hypervisor.
The Linux VMs are backed up at hypervisor level in the Xen dom0 (which is Debian in this case) using LVM snapshots. To create and remove LVM snapshots I have the following snapshot creation and removal scripts (the volume group on which the volumes reside is mirror-vg).

Snapshot creation script at /usr/local/etc/urbackup/create_filesystem_snapshot:

#!/bin/bash
set -e
# UrBackup passes the snapshot uid as the first and the volume name as the fifth argument
SNAP_UID=$1
VOLNAME="$5"
VGNAME="mirror-vg"
if [[ $VOLNAME == "" ]]; then
        echo "No volume name specified"
        exit 1
fi
# The volume "other-data" lives in a different volume group
if [[ $VOLNAME == "other-data" ]]; then
        VGNAME="data2-vg"
fi
if [[ $SNAP_UID == "" ]]; then
        echo "No snapshot uid specified"
        exit 1
fi
export LVM_SUPPRESS_FD_WARNINGS=1
# Create the snapshot, using all remaining free space in the volume group
lvcreate -l100%FREE -s -n "$SNAP_UID" "/dev/$VGNAME/$VOLNAME"
# Remove the snapshot again if anything below fails
SUCCESS=0
trap 'test $SUCCESS = 1 || lvremove -f /dev/$VGNAME/$SNAP_UID' EXIT
mkdir -p "/mnt/urbackup_snaps/${SNAP_UID}"
mount -o ro "/dev/$VGNAME/$SNAP_UID" "/mnt/urbackup_snaps/${SNAP_UID}"
SUCCESS=1
# Tell UrBackup where the snapshot is mounted
echo "SNAPSHOT=/mnt/urbackup_snaps/$SNAP_UID"
exit 0

Snapshot removal script at /usr/local/etc/urbackup/remove_filesystem_snapshot:

#!/bin/bash
set -e
# UrBackup passes the snapshot uid as the first and the mountpoint as the second argument
SNAP_UID=$1
SNAP_MOUNTPOINT="$2"
if [[ $SNAP_UID == "" ]]; then
        echo "No snapshot uid specified"
        exit 1
fi
if [[ "$SNAP_MOUNTPOINT" == "" ]]; then
        echo "Snapshot mountpoint is empty"
        exit 1
fi
if ! test -e "$SNAP_MOUNTPOINT"; then
        echo "Snapshot at $SNAP_MOUNTPOINT was already removed"
        exit 0
fi
# If nothing is mounted there, just clean up the mountpoint directory
if ! df -T -P | egrep "${SNAP_MOUNTPOINT}\$" > /dev/null 2>&1; then
        echo "Snapshot is not mounted. Already removed"
        rmdir "${SNAP_MOUNTPOINT}"
        exit 0
fi
# Find the LVM volume backing the mountpoint (older lsblk versions lack --paths)
if lsblk -r --output "NAME,MOUNTPOINT" --paths > /dev/null 2>&1; then
        VOLNAME=`lsblk -r --output "NAME,MOUNTPOINT" --paths | egrep " ${SNAP_MOUNTPOINT}\$" | head -n 1 | tr -s " " | cut -d" " -f1`
else
        VOLNAME=`lsblk -r --output "NAME,MOUNTPOINT" | egrep " ${SNAP_MOUNTPOINT}\$" | head -n 1 | tr -s " " | cut -d" " -f1`
        VOLNAME="/dev/mapper/$VOLNAME"
fi
if [ "x$VOLNAME" = x ]; then
    echo "Could not find LVM volume for mountpoint ${SNAP_MOUNTPOINT}"
    exit 1
fi
if [ ! -e "$VOLNAME" ]; then
    echo "LVM volume for mountpoint ${SNAP_MOUNTPOINT} does not exist"
    exit 1
fi
echo "Unmounting $VOLNAME at /mnt/urbackup_snaps/${SNAP_UID}..."
# Retry the unmount once after a short wait in case the volume is still busy
if ! umount "/mnt/urbackup_snaps/${SNAP_UID}"; then
        sleep 10
        umount "/mnt/urbackup_snaps/${SNAP_UID}"
fi
rmdir "${SNAP_MOUNTPOINT}"
echo "Destroying LVM snapshot $VOLNAME..."
export LVM_SUPPRESS_FD_WARNINGS=1
lvremove -f "$VOLNAME"

The snapshot scripts are specified via the file /usr/local/etc/urbackup/snapshot.cfg:

create_filesystem_snapshot=/usr/local/etc/urbackup/create_filesystem_snapshot
remove_filesystem_snapshot=/usr/local/etc/urbackup/remove_filesystem_snapshot
volumes_mounted_locally=0

Then I have a virtual client for each LVM volume that needs to be backed up. I have put those virtual clients into a settings group with the default path to back up set to “/|root/require_snapshot”.

For restore I need to recreate the LVM volume, create a file system on it (e.g. with mkfs.ext4) and mount it in the hypervisor (a sketch of these steps follows after the command below), and then restore via:

urbackupclientctl restore-start --virtual-client VOLUMENAME -b last --map-from / --map-to /mnt/localmountpoint
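A minimal sketch of the preparation steps (the volume size, VOLUMENAME and the mountpoint are placeholders; use the size the volume had before):

# recreate the logical volume in the mirror-vg volume group
lvcreate -L 50G -n VOLUMENAME mirror-vg
# create a file system and mount it as the restore target
mkfs.ext4 /dev/mirror-vg/VOLUMENAME
mkdir -p /mnt/localmountpoint
mount /dev/mirror-vg/VOLUMENAME /mnt/localmountpoint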

Windows Backup API support in UrBackup 2.1.x

UrBackup 2.1.x has more complete Windows Backup API support. Previously, the backup API was only used to create snapshots of the specified paths to backup (volume shadow copy snapshots). Now it does a so-called component level backup, if configured to do so. You can select the components to back up via the client user interface; the selected components then automatically communicate to UrBackup which files need to be backed up, and on restore they communicate where the files should be restored and whether e.g. services should be restarted during/after the restore.

This works with applications which in turn support the Windows Backup API, for example Microsoft Exchange, Microsoft SQL Server, Microsoft Hyper-V and Oracle DB on Windows.

Testing backup and restore with those different applications is now the big item on the to-do list. Any help and pointers to applications where backup or restore is broken will be appreciated.

Visual Studio 2015 runtime and MSI installer

If you are using the MSI installer to install either the UrBackup Client or Server on Windows, there is a potential problem you might run into.

Starting with Visual Studio 2015, with which UrBackup Client/Server 2.x are compiled, Microsoft decided to split the Visual Studio runtime into an operating-system-level component (the Universal C Runtime) and a “normal” runtime component, whereas earlier there was only the “normal” runtime component.
The operating system component is installed via Windows Update and cannot be installed by an MSI installer. On Windows 10 it is always present, but on Windows Vista up to 8.1 the system needs to be up to date for the system component to be present (KB2999226), otherwise UrBackup will not start.

Another work-around is to use the .EXE (NSIS) installer, which includes the operating system component. The MSI route depends on Windows Update functioning correctly (which it may not).

Let’s just say this change does not make software developers’ and users’ lives easier.