RAID and hard disk read errors

27.02.2021 19:07

I have been preoccupied with data storage issues lately. What I thought would be a simple installation of a solid state drive into my desktop PC turned into a month-long project. I found out I needed to design and make new drive rails, decided to do some overdue restructuring of the RAID array and then had to replace two failing drives. In any case, a thing that caught my eye recently was a warning about RAID 5 that I saw repeated in a few places on the web:

Note: RAID 5 is a common choice due to its combination of speed and data redundancy. The caveat is that if one drive were to fail and another drive failed before that drive was replaced, all data will be lost. Furthermore, with modern disk sizes and expected unrecoverable read error (URE) rates on consumer disks, the rebuild of a 4TiB array is expected (i.e. higher than 50% chance) to have at least one URE. Because of this, RAID 5 is no longer advised by the storage industry.

This is how I understand the second part of the warning: it talks about a RAID 5 array with total usable capacity of 4 TiB. Such an array would typically consist of three 2-terabyte disks. In the described scenario one of the disks has failed and, after adding in a replacement drive, the array is restored by reading the contents of the remaining two disks. This means we need to read out 2 times 2 terabytes of data without errors to successfully restore the array.

I was surprised by the stated higher-than-50% chance of a read error during this rebuild procedure. It seemed too high given my experience. Hence I've looked up the reliability section of the datasheet for the new P300-series, 2 TB Toshiba desktop-class hard drive I just bought:

Reliability section of the datasheet for Toshiba P300 hard drive.

I'm a bit suspicious of the way probability is specified here. Strictly reading the exponential notation, 10E14 means that the probability of an unrecoverable error (URE) is one error per 10·10^14 bits. Expressed as the probability of an error when reading a single bit:

P_{URE} = \frac{1}{10\cdot10^{14}} = 10^{-15}

In another datasheet for a different series of drives (however this time for data center instead of consumer use) the error rate is given as 10 errors per 10^16 bits. This again gives the same error probability of 10^-15.

Consider this probability for a second. It's such a fantastically low number. I don't remember ever encountering an actual technical specification that would involve a probability that has a one preceded by fifteen zeros - or in other words - fifteen nines of reliability.

The number is just on the edge of what you can represent with the common 64-bit double-precision floating-point format. When using a tool like numpy, which works in double precision by default, any calculations with such values need to be done extra carefully to ensure that loss of numerical precision doesn't lead to nonsensical results.
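As a quick Python illustration of the pitfall: forming the complement 1 - p directly already loses some bits of such a small p, which is exactly what functions like math.log1p are for.

```python
import math

p = 1e-15

# Naively forming 1 - p rounds to the nearest double. Subtracting that
# back from 1 therefore does not return the original p.
recovered = 1.0 - (1.0 - p)
print(recovered == p)  # False: some bits of p were lost in the subtraction

# math.log1p(-p) evaluates log(1 - p) without ever forming 1 - p,
# avoiding that rounding step.
print(math.log1p(-p))
```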

Hard drives tend to use SI prefixes instead of binary, so I'll do the calculation for 4 terabytes instead of 4 tebibytes like it says in the quote:

n = 4 \cdot 10^{12} \cdot 8 \mathrm{bits}

For this calculation it doesn't matter whether we're reading this number of bits from one drive or two, since the URE probabilities are assumed to be independent. The probability of getting at least one error during the rebuild is:

P_{rebuild-error} = 1 - (1 - P_{URE})^n \approx 3.1\%

Note that if I read 10E14 in the original reliability specification as 10^14, the probability of a rebuild error goes up to 27%.
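Both figures are easy to reproduce with a short Python sketch, using math.log1p and math.expm1 to sidestep the precision loss mentioned earlier:

```python
import math

n = 4 * 10**12 * 8  # bits read during the rebuild of a 4 TB array

def p_rebuild_error(p_ure):
    # 1 - (1 - p)^n, computed as -expm1(n * log1p(-p)) to avoid
    # rounding problems when p is close to the double-precision epsilon
    return -math.expm1(n * math.log1p(-p_ure))

print(p_rebuild_error(1e-15))  # about 0.031, i.e. 3.1%
print(p_rebuild_error(1e-14))  # about 0.27, i.e. 27%
```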

This comes out a bit more optimistic than the higher-than-50% figure given in the warning, at least for this specific series of hard drives. I guess whether 3.1% is still too high depends on how much you value your data. However, consider that in the original scenario this is the probability of an error given that another hard drive in the array has already catastrophically failed. So the actual probability of data loss is this multiplied by the (unknown) probability of a complete drive failure.

Then again, consider that this is a desktop drive. It is not meant to be put into a RAID array and is typically used without any redundancy. Some people will even scream at you if you use desktop drives in a RAID due to timeout issues. Without any redundancy this probability directly becomes the probability of data loss. And that seems exceedingly high - especially considering that drives up to 8 TB seem to be sold with this same error rate specification. Even with that amazing reliability of reading a single bit, modern drives are simply so large that the vanishingly tiny error probabilities add up.

Posted by Tomaž | Categories: Life | Comments »

Reading RAID stride and stripe_width with dumpe2fs

20.02.2021 20:08

Just a quick note, because I found this confusing today. stride and stripe_width are extended options for ext filesystems that can be used to tune their performance on RAID devices. Many sources on the Internet claim that the values for these settings on existing filesystems can be read out using tune2fs or dumpe2fs.

However it is possible that the output of these commands will simply contain no information that looks related to RAID settings. For example:

$ tune2fs -l /dev/... | grep -i 'raid\|stripe\|stride'
$ dumpe2fs -h /dev/... | grep -i 'raid\|stripe\|stride'
dumpe2fs 1.44.5 (15-Dec-2018)

It turns out that the absence of any lines relating to RAID means that these extended options are simply not defined for the filesystem in question. It means that the filesystem is not tuned to any specific RAID layout and was probably created without the -E stride=...,stripe_width=... option to mke2fs.

However I've also seen some filesystems that were created without this option still display a default value of 1. I'm guessing this depends on the version of mke2fs that was used to create the filesystem:

$ dumpe2fs -h /dev/... |grep -i 'raid\|stripe\|stride'
dumpe2fs 1.44.5 (15-Dec-2018)
RAID stride:              1

For comparison, here is what the output looks like when these settings have actually been defined:

$ dumpe2fs -h /dev/md/orion\:home |grep -i 'raid\|stripe\|stride'
dumpe2fs 1.44.5 (15-Dec-2018)
RAID stride:              16
RAID stripe width:        32
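For completeness, the usual rule of thumb for choosing these values is stride = RAID chunk size / filesystem block size and stripe_width = stride × number of data-bearing disks. Here is a small Python sketch of that calculation (the helper function is hypothetical, not part of e2fsprogs); a 64 KiB chunk with 4 KiB blocks and two data-bearing disks is one combination consistent with the 16 and 32 seen above:

```python
def ext_raid_tuning(chunk_kib, block_kib, data_disks):
    # stride: filesystem blocks per RAID chunk
    # stripe_width: filesystem blocks per full data stripe
    stride = chunk_kib // block_kib
    return stride, stride * data_disks

# e.g. RAID 5 over three disks (two carry data), 64 KiB chunk, 4 KiB blocks
print(ext_raid_tuning(64, 4, 2))  # (16, 32)
```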
Posted by Tomaž | Categories: Code | Comments »

Showing printf calls in AtmelStudio debugger window

11.02.2021 16:26

Writing debugging information to a serial port is common practice in embedded development. One problem however is that sometimes you can't connect to the serial port. Either the design lacks a spare GPIO pin or you can't physically access it. In those cases it can be useful to emulate such a character-based output stream over the in-circuit debugger connection.

A few years back I wrote about how to monitor the serial console on the ARM-based VESNA system over JTAG. Back then I used a small GNU debugger script to intercept strings that were intended for the system's UART and copy them to the gdb console. This time I found myself with a similar problem on an AVR-based system, using the AtmelStudio 7 IDE for development. I wanted the debugger window to display the output of various printf statements strewn around the code. I only had a single-wire UPDI connection to the AVR microcontroller using an mEDBG debugger. Following is the recipe I came up with. Note that, in contrast to my earlier instructions for ARM, these steps require preparing the source code in advance and making a debug build.

Define a function that wraps around the printf function that is built into avr-libc. It should render the format string and any arguments into a temporary memory buffer and then discard it. Something similar to the following should work. Adjust buf_size depending on the length of lines you need to print out and the amount of spare RAM you have available.

#include <stdio.h>
#include <stdarg.h>

int tp_printf_P(const char *__fmt, ...)
{
	const int buf_size = 32;
	char buf[buf_size];

	va_list args;

	va_start(args, __fmt);
	vsnprintf_P(buf, buf_size, __fmt, args);
	va_end(args);

	// <-- put a tracepoint here
	return 0;
}

We will now define a tracepoint in the IDE that will be triggered whenever tp_printf_P is called. The tracepoint will read out the contents of the temporary memory buffer and display it in the debugger window. The wrapper is necessary because the built-in printf function in avr-libc outputs strings character by character. As far as I know there is no existing buffer where we could find the entire rendered string like this.

The tracepoint is set up by right-clicking on the marked source line, selecting Breakpoint and Insert Tracepoint in the context menu. This should open Breakpoint settings in the source code view. You should set it up like in the following screenshot and click Close:

Setting up a tracepoint to print out the temporary buffer.

The ,s after the variable name is important. It makes the debugger print out the contents of the buffer as a string instead of just giving you a useless pointer value. This took me a while to figure out. AtmelStudio is just a customized and rebranded version of Microsoft Visual Studio. The section of the manual about tracepoints doesn't mention it, but it turns out that the same format specifiers that can be used in the watch list can also be used in tracepoint messages.

Another thing worth noting is that compiler optimizations may make it impossible to set the tracepoint at this specific point. I haven't seen this happen with the exact code I showed above. It seems my compiler will not optimize out the code even though the temporary buffer isn't used anywhere. However I've encountered this problem elsewhere. If the tracepoint icon on the left of the source code line is an outlined diamond instead of a filled diamond, and you get The breakpoint will not currently be hit message when you hover the mouse over it, this will not work. You will either have to disable some optimization options or modify the code somehow.

Example of a tracepoint that will not work.

To integrate the tp_printf_P function into the rest of the code, I suggest defining a macro like the one below. My kprintf can be switched at build time between the true serial output (or whatever else is hooked into avr-libc to act as stdout), the tracepoint output, or it can be turned off entirely for non-debug builds:

#ifdef DEBUG
#  ifdef DEBUG_TRACEPOINT
#    define kprintf(fmt, ...) tp_printf_P(PSTR(fmt), ##__VA_ARGS__);
#  else
#    define kprintf(fmt, ...) printf_P(PSTR(fmt), ##__VA_ARGS__);
#  endif
#else
#  define kprintf(fmt, ...)
#endif

With DEBUG_TRACEPOINT preprocessor macro defined during the build and the tracepoint set up as described above, a print statement like the following:

kprintf("Hello, world!\n");

...will result in the string appearing in the Output window of the debugger like this:

"Hello, world!" string appearing in the debug output window.

Unfortunately the extra double quotes and a newline seem to be mandatory. The Visual Studio documentation suggests that using a ,sb format specifier should print out just the bare string. However this doesn't seem to work in my version of AtmelStudio.

It's certainly better than nothing, but if possible I would still recommend using a true serial port instead of this solution. Apart from the extra RAM required for the string buffer, the tracepoints are quite slow. Each print stops the execution for a few hundred milliseconds in my case. I find that I can usually get away with prints over a 9600 baud UART in most code that is not particularly time sensitive. However with prints over tracepoints I have to be much more careful not to trigger various timeouts or watchdogs.

I also found this StackExchange question about the same topic. The answer suggests simply replacing prints with tracepoints. Indeed "print debugging" has kind of a bad reputation, and using tracepoints to monitor specific variables certainly has its place when debugging an issue. However I find that well-instrumented code with print statements in strategic places is hard to beat when you need to understand the big picture of what the code is doing. Prints can often point out problems in places where you wouldn't otherwise think of putting a tracepoint. They also have the benefit of being stored with the code, rather than being just an ephemeral setting in the IDE.

Posted by Tomaž | Categories: Code | Comments »

Making replacement Chieftec drive rails, 2

08.02.2021 19:47

Two weeks ago I wrote about 3D-printable replacements for the 3.5" drive rails used in an old Chieftec PC enclosure I have. The original plastic rails became brittle with time. They often broke when I was replacing hard drives and I eventually ran out of spares. I drew up a copy in FreeCAD and modified the design so that it is easily printable on an FDM printer. At the time of my last post I was waiting to get a sample pair printed. This is a quick update on that. I've tried the new rails and found that they fit well, but need some minor improvements.

3.5" hard drive with the 3D printed rails.

This is how the new rails look when mounted on a 3.5" hard drive. Unfortunately I specified a wrong diameter for the holes in the STL file, so the imperial-sized screws that go into these drives don't fit. In the photo above I've manually drilled the holes to a larger diameter, but obviously it's better if the rails come out correct from the printer in the first place.

These pieces were kindly printed for me by Matjaž on a Prusa i3 MK3S in PETG filament. They feel sturdy and more than strong enough for their purpose. The only thing I was slightly worried about was them getting softer when mounted on warm disk drives. According to the Prusa website the material should be good up to 68°C. My monitoring says that no drive got hotter than 50°C in the last 12 months, so even in summer they should be fine.

Drive on the new rails being inserted into the Chieftec case.

I was happy to see that the drive with the new rails fits into the case perfectly. Much better, in fact, than with the original rails, which were very difficult to handle even when new. These slide in and out with just enough force. The plastic arcs on the top and bottom of the rails engage nicely with the guides and provide enough friction so that the drive doesn't rattle in the case. The strengthened tabs on the side also fit nicely. I saw no need to change any of the basic dimensions.

Apart from the holes, the only other part that needed fixing was the flexible latch. This latch has to be printed as a separate piece. My original idea was to glue it to the base part of the rail. However, the latches all broke off when I was testing them. Part of the problem was that the cyanoacrylate (superglue) I was using doesn't seem to have good adhesion to PETG plastic. I'll probably find a glue that works better, but I still wanted to change the design so that it depends less on the strength of the bond between the two plastic pieces.

The picture below shows how I've slightly modified the latch. The red part is the latch and purple is the base part of the rail. This picture shows a cross-section in FreeCAD. See my last post for the renders of the complete rail.

The new and the old design for the latch on the drive rail.

The new latch design (top) has a tab that inserts into a slot in the rail. When the latch bends, the slot itself should take most of the torque and hold it in place. The glue should only hold it so that the latch doesn't fall out of the slot when not under tension. In the old latch design (bottom) the glue itself was taking all of the torque when the latch was bending.

I would be even happier with a design that wouldn't require glue at all and where the latch would click into place somehow. However I felt that coming up with the right tolerances for that would require several more round trips between FreeCAD and the printer and, since I'm already at version 4 of the design, I didn't want to waste any more time on this.

Anyway, I've put the updated STL files at the same place as last time. Again I'm waiting for the printouts. Hopefully when I get the chance to test them no further changes will be necessary and I can finally install a new stack of hard drives into my case.

Posted by Tomaž | Categories: Life | Comments »