Formatting SD Cards for Speed and Lifetime

Have you ever re-formatted an SD card and noticed the performance degrade? You formatted it FAT32 just like it came from the factory, but it runs slower. Except you didn’t actually format it like the factory. The following is a document I wrote mid-2013 about formatting microSD cards with a single FAT32 partition with a layout that is optimized for speed and lifetime. It also includes details about using this card as a bootloader source for TI OMAP3/4 processors.

Optimal Micro-SD Card Formatting for Performance and Wear-Leveling

version 1.3

Version  Date     Changes
1.0      5/8/13   First draft
1.1      5/9/13   Added SD Card and Flash Internals section
1.2      5/9/13   Added a few more notes about other types of flash, smaller and larger
1.3      5/13/13  Link to Linaro SD card research article

Introduction & Background

MicroSD cards contain one or more flash chips and a small microprocessor that acts as an interface between the SD card specification and the flash chip(s). Cards are typically formatted at the factory for near-optimal performance. However, most operating systems’ default partitioning and formatting utilities treat the cards like traditional hard drives. What works for traditional hard drives results in degraded performance and lifetime for flash-based cards. This document outlines the relevant concerns and details how to properly partition and format microSD cards.

Raw flash chips are divided up into several integral read and write sizes. It is important to take these sizes and their alignment into consideration, or a card may end up performing two write operations for each write operation intended, thereby cutting the life of the flash in half. The microSD interface chip is called a Flash Translation Layer (FTL) chip or sometimes a controller chip. The SD card specification calls for the use of FAT32 as a file system, and modern FTL chips are optimized for best performance using FAT32. The SD card specification also makes some alignment recommendations, and most manufacturers follow these guidelines. Choosing proper alignment boundaries and placement of key file system structures greatly impacts card performance.

Partitioning

Newer high-capacity SD cards (SDHC) are typically designed to have a few high-priority 4MiB blocks at the beginning of the disk. For best performance, put the partition table at the beginning of the disk in its traditional location, and place the first partition at absolute address 4MiB (measured from the beginning of the disk). Only the first 512 bytes (1 sector) of that first 4MiB block will be used.

Size     | 4MiB | 4MiB                                | remainder of disk
Contents | MBR  | First Partition Header + FAT tables | First Partition Data

The MBR, Master Boot Record, should be a traditional DOS-style partition table. For a FAT32 partition, we’ll use partition type ‘b’ or ‘c’. Type ‘b’ is used for FAT32 and ‘c’ for FAT32 with LBA (Logical Block Addressing). Here’s an example table.

$ sudo fdisk -H 255 -S 63 -u -c /dev/sdb

Command (m for help): p

Disk /dev/sdb: 7948 MB, 7948206080 bytes
255 heads, 63 sectors/track, 966 cylinders, total 15523840 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000edc02

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *        8192    15523839     7757824    b  W95 FAT32

Note that the start and end units are sectors of 512 bytes. Also, 8,192 sectors equals 4MiB (8192 * 512 = 4 * 1024 * 1024). This table can be recreated using the SD Card Association Formatting Tool with the Logical Addressing Adjustment option enabled. The Logical Addressing Adjustment option will cause the partition table to use values 255 and 63 for “heads” and “sectors/track”. This can be important for the ROM code of some System On Chip processors when they attempt to boot directly from an SD card.
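The alignment arithmetic above can be checked mechanically. The following is a minimal sketch, assuming 512-byte sectors and the 4MiB boundary used in this document. The function names are illustrative and do not belong to any existing tool.

```shell
#!/bin/sh
# Sectors per 4MiB boundary, assuming 512-byte sectors: 4 * 1024 * 1024 / 512
BOUNDARY_SECTORS=8192

# Succeed (exit status 0) if the given start sector lies on a 4MiB boundary.
is_4mib_aligned() {
    [ $(( $1 % BOUNDARY_SECTORS )) -eq 0 ]
}

# Print the first 4MiB-aligned sector at or after the given sector.
align_up_4mib() {
    echo $(( ($1 + BOUNDARY_SECTORS - 1) / BOUNDARY_SECTORS * BOUNDARY_SECTORS ))
}

if is_4mib_aligned 8192; then
    echo "sector 8192 is 4MiB aligned"
fi

# 2048 is a start sector that some partitioning tools choose by default.
align_up_4mib 2048
```

The last call prints 8192, the start sector used in the partition table above.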

Formatting

FAT32 has existed since the mid-1990s and the filesystem layout is well understood. FAT32 partitions start with a BPB (BIOS Parameter Block) in their first sector, followed by some number of reserved sectors, followed by two copies of the FAT (File Allocation Table). After that, the remainder of the partition is divided up into clusters which contain all file data and metadata. FAT32 supports several power-of-two cluster sizes, but the only two we should consider are 32kiB and 64kiB. Using 64kiB will yield the best speed at the expense of wasted space for small files.

The first 4MiB block of the partition, which is the 2nd 4MiB block on the disk, is usually the one optimized in the FTL chip for fast read/write access, since the FATs are frequently read and modified. So, it’s important to tweak the FAT32 formatting to ensure that this 4MiB block is reserved for the FATs. To do this, perform a regular format of the partition and record the size of the FATs. Each FAT is sized according to the number of clusters that will fit in the partition and the amount of space FAT32 needs to keep track of each cluster. Therefore, the size of the FAT depends on disk and cluster size. Once the size of the FAT in sectors is known, double it and subtract from 8,192 (4MiB in 512-byte sectors). The result is the number of reserved sectors (which includes the BPB sector) needed before the two FATs so that the file system data starts in the next 4MiB block.
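The FAT size can also be estimated before formatting. The sketch below assumes 512-byte sectors, two FATs, and 4-byte FAT32 entries, and it treats each cluster as costing its data plus one entry in each FAT. It is an approximation for planning only; the value reported by mkdosfs for your card remains authoritative. The function name is hypothetical.

```shell
#!/bin/sh
# Estimate the size of one FAT32 FAT in sectors.
# $1 = partition size in sectors, $2 = reserved sectors, $3 = sectors per cluster
# Note: needs shell arithmetic wider than 32 bits (true on 64-bit Linux).
estimate_fat_sectors() {
    # Bytes left for FATs and data, minus the two reserved entries in each of the two FATs.
    avail_bytes=$(( ($1 - $2) * 512 - 2 * 2 * 4 ))
    # Each cluster needs its data plus a 4-byte entry in each of the two FATs.
    clusters=$(( avail_bytes / ($3 * 512 + 2 * 4) ))
    # Round the FAT (clusters + 2 reserved entries, 4 bytes each) up to whole sectors.
    echo $(( ((clusters + 2) * 4 + 511) / 512 ))
}

# The 15515648-sector partition from the fdisk listing, 32 reserved sectors,
# 64kiB clusters (128 sectors per cluster).
estimate_fat_sectors 15515648 32 128
```

For the example card this prints 947, which agrees with the FAT size that mkdosfs reports in the workflow in this section.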

The following is an example workflow showing the process of aligning the FAT32 data area to a 4MiB boundary. The output of each command has been abbreviated.

Format the card for 64kiB clusters (128 sectors / cluster)

$ sudo mkdosfs -n NAME -s 128 -v /dev/sdb1
FAT size is 947 sectors, and provides 121200 clusters.

Notice FAT size of 947 sectors. Now examine the partition to see where the data area begins.

$ sudo dosfsck -vn /dev/sdb1
       512 bytes per logical sector
     65536 bytes per cluster
        32 reserved sectors
    484864 bytes per FAT (= 947 sectors)
Data area starts at byte 986112 (sector 1926)

Data area needs to start at logical sector 8192 (note: logical means relative to the start of the partition whereas absolute would be 16384 sectors from the start of the disk). Calculate a new value for “reserved sectors”.

$ echo "8192 - (2 * 947) " | bc
6298

Re-format the partition using this number.

$ sudo mkdosfs -n NAME -s 128 -R 6298 -v /dev/sdb1

Re-examine the partition. Notice that the data area now begins at sector 8192.

$ sudo dosfsck -vn /dev/sdb1
       512 bytes per logical sector
     65536 bytes per cluster
      6298 reserved sectors
First FAT starts at byte 3224576 (sector 6298)
    484864 bytes per FAT (= 947 sectors)
Data area starts at byte 4194304 (sector 8192)
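The calculation in this workflow can be wrapped in two small helpers. This is a minimal sketch assuming two FATs and 512-byte sectors; the function names are hypothetical. Rounding up to a later boundary when the FATs do not fit in a single 4MiB block is an assumption added here for larger cards, not part of the procedure above.

```shell
#!/bin/sh
# Print the reserved-sector count that makes the data area start on a boundary.
# $1 = FAT size in sectors, $2 = boundary in sectors (8192 = 4MiB)
reserved_for_alignment() {
    fats=$(( 2 * $1 ))
    # Smallest multiple of the boundary that leaves room for the boot sector and both FATs.
    target=$(( (fats + 1 + $2 - 1) / $2 * $2 ))
    echo $(( target - fats ))
}

# Print the sector (relative to the partition) where the data area starts.
# $1 = reserved sectors, $2 = FAT size in sectors
data_start_sector() {
    echo $(( $1 + 2 * $2 ))
}

reserved=$(reserved_for_alignment 947 8192)
echo "reserved sectors: $reserved"
echo "data area starts at sector: $(data_start_sector "$reserved" 947)"
```

With the 947-sector FAT from the example, this prints 6298 reserved sectors and a data area starting at sector 8192, the same values dosfsck reports after the second format.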

Concerns

Zero’d Out BPB Blocks

In practice, we’ve seen microSD cards returned with zero’d out BPBs. How can we prevent that?

What if we put the BPB at the end of the first 4MiB block instead of at the beginning of the next 4MiB block?
It should never change once formatted.
What if we ensured that the BPB was in its own erase block?
Flash parts can be written in small page sizes, 16kiB for example, but they can only be erased in larger erase block sizes, 128kiB for example. Placing the BPB in its own erase block should ensure safety since that entire block should never require an erase and write. In fact, using the formatting routine in this document should ensure that the BPB is in its own erase block, but not necessarily guarantee that the two FATs will be. I now wonder if the cards we’ve seen with this failure in the past had their BPBs in separate erase blocks or if they had been re-formatted incorrectly.
What if each FAT were in its own erase block?
It would be nice if we could go further putting each FAT on an erase block boundary. This is not currently possible using mkdosfs. This would require either a patch to that program or manually modifying the partition after the above formatting process. You would need to figure out if the location of the 2nd FAT is hard coded to always be right after the 1st, or if it can be specified. If it can be specified, you would need to do the following:
– move the 1st FAT forward to an erase block boundary that leaves enough space for…
– move the 2nd FAT forward to an erase block boundary
– adjust the number in the BPB that points to the 1st FAT
– see if there’s a number that can be adjusted in the BPB to indicate the location of the second FAT

If that didn’t work, the next best thing would be:
– move both FATs forward together such that the 2nd FAT is on an erase block boundary. The 1st may not necessarily end up on an erase block boundary
– adjust the BPB to point to the new start of the 1st FAT

For both methods, you’ll need to make sure that the data area’s location is calculated independently of the 1st FAT. Even if those two methods are incorrect, you should have an idea of how to proceed.

Configuring for TI OMAP Booting

The TI OMAP processors we use can boot from a microSD card with a FAT partition as long as the card follows some formatting rules. It may be possible to optimally format our microSD cards and still make them compatible with the OMAP Boot ROM code.

More specifically, the ROM code will boot a specially formatted second stage boot loader with the filename MLO. It will load that file from a FAT32 partition on a single MMC/SD card on the first MMC controller (which is a typical setup and therefore something you can forget). The file must be no larger than 128kiB. The common usage is to place an MLO file and a u-boot.bin file (loaded by a u-boot derived MLO) in the root of an SD card and the ROM code will boot those rather than the copies stored in onboard flash. From the TI OMAP3 Technical Reference manual, there are two ways the MLO file may be loaded, raw and file system.

Raw Mode

Spruf98u.pdf, 25.4.7.6.3 Read Sector Procedure

In raw mode, an image can be at offset 0 or 128KB and must not be bigger than 128KB. Raw mode is detected by reading sector 0 and sector 256.

  • It may be possible to place an MLO file at sector 256 in the unused space of the first 4MiB block without having to create a special FAT32 partition for the MLO file. The docs are unclear, but it appears that it tries raw mode before File System Mode. However, the ROM code might be written to skip checking sector 256 in raw mode if it first finds a partition table at sector 0. The MLO code, typically derived from U-Boot, will then need to know where to find the next stage, U-Boot or an operating system.

File System Mode

Spruf98u.pdf, 25.4.7.6.4 File System Handling

The MMC/SD cards can hold a file system that ROM code reads. The image used by the booting procedure is taken from a booting file named MLO. This file must be in the root directory on an active primary partition of type FAT12/16 or FAT32.

File System Detection Details

Spruf98u.pdf, 25.4.7.6.4.1 MBR and FAT File System

  • the ROM code will scan one active, primary partition marked as type ‘b’ or ‘c’
    • so, we might be able to put a small FAT32 boot partition at the end of a disk
    • if there is more than one active partition, the docs say the ROM code will fail
  • the TI technical reference manual, spruf98u, doesn’t mention this, but from experience, the ROM code will not boot a valid MLO file in the root directory if the disk is large and lots of files have been written before MLO. I don’t know if it’s an issue of fragmentation, although with a 64kiB cluster size that shouldn’t really be a problem since MLO is typically less than 64kiB, with a maximum of 128kiB or 2 clusters. I also suspect that the ROM code will only look for data within the first X sectors or X FAT entries. Either might be the case. Regardless of whether the issue is fragmentation, disk location, or FAT entry location, creating a special small FAT32 partition for the ROM code is advised, which should avoid all three problems.
  • A method of testing the fragmentation or exceeded disk/FAT offset hypothesis is as follows:
    • Create a large FAT32 volume. Fill it with lots of files and then write an MLO file into the root directory. Repeat this procedure until the device fails to boot. It should be easy to do.
    • [optional] If you have the technical expertise, examine the FAT to determine which FAT entry (offset from 0) contains the MLO file. See if the file is fragmented across multiple entries and, if so, whether they are contiguous on disk. Note the location of the MLO file on disk.
    • Remove all files in the partition except for the MLO file. I think that the MLO file should be left unchanged by this, i.e. still possibly fragmented or too far out on disk. See if the processor will boot. I suspect it will still fail.
    • Defragment the drive using whatever FAT32 defragmentation utility you can find. Select any options that re-order files on disk to match the directory structure. This should move the MLO file to the beginning of the disk and the FAT table. Try booting. I think this may pass. Success here would indicate that our hypothesis is correct, but without the knowledge from the optional inspection step, we won’t know for which of the three reasons the boot failed.
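The fill-and-retry step of the test procedure above could be automated along these lines. This is a rough sketch that assumes the FAT32 volume is already mounted; the function name, the filler file names, and the 4kiB filler size are arbitrary choices for illustration.

```shell
#!/bin/sh
# Create $2 small filler files under directory $1, then copy the MLO file $3 in last.
# $1 is assumed to be the mount point of the FAT32 volume under test.
fill_then_add_mlo() {
    i=0
    while [ "$i" -lt "$2" ]; do
        # One 4kiB filler file per iteration.
        dd if=/dev/zero of="$1/filler_$i.bin" bs=4096 count=1 2>/dev/null || return 1
        i=$(( i + 1 ))
    done
    # MLO is written after the fillers so it lands late in the directory and FAT.
    cp "$3" "$1/MLO" && sync
}
```

A call such as `fill_then_add_mlo /mnt/sdcard 10000 ./MLO` (the mount point is hypothetical) would be followed by unmounting the card and attempting a boot. Increase the file count between attempts until the boot fails.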

Checking FAT32 for Errors

The common tool used for checking FAT32 disks is dosfsck from the dosfstools package. The dosfsck man page lists the problems it will detect and fix. The problem with the tool is that some errors are benign while others are catastrophic if you need to trust your disk. The tool doesn’t give any way to determine the severity of the results. It merely returns an exit code of 1 for any problems fixed. For this reason, the only reasonable thing to do is to distrust the disk if a 1 is returned. In the future, either dosfsck could be altered to return a range of return codes or code could be written that parses the output of dosfsck and generates a severity based on the error messages.

$ sudo dosfsck -a /dev/sdb1 ; echo $?
dosfsck 3.0.7, 24 Dec 2009, FAT32, LFN
/dev/sdb1: 90 files, 80/121152 clusters
0

If you’re not writing to the disk, it should be safe to do a non-destructive check using options -an. You may also want to use -v if used on other disks such as USB drives. The verbose flag outputs configuration details for the disk which may be useful in the logs for later debugging.
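The "distrust the disk on any non-zero result" policy can be captured in a small wrapper. This is a minimal sketch; the function names and the labels it prints are made up for illustration. The meanings assigned to exit statuses 0, 1, and 2 follow the dosfsck man page as I understand it, so verify them against the version installed on your system.

```shell
#!/bin/sh
# Map a dosfsck exit status to a trust decision.
classify_fsck_status() {
    case "$1" in
        0) echo "trusted" ;;        # no recoverable errors detected
        1) echo "distrust" ;;       # errors were detected; severity unknown
        2) echo "usage-error" ;;    # dosfsck did not access the file system
        *) echo "unknown" ;;        # e.g. dosfsck is not installed
    esac
}

# Run a read-only check (-n) on the device given as $1 and print the decision.
check_card() {
    dosfsck -n "$1" >/dev/null 2>&1
    classify_fsck_status "$?"
}
```

Calling `check_card /dev/sdb1` prints a single word that a boot script or logger can act on. Anything other than "trusted" should be treated as a reason not to rely on the card.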

SD Card and Flash Internals

Notes

SD card addressing is logical, not physical
This may seem obvious as you read this, but it is not obvious in practice when you go back and forth between raw flash and managed flash that has its own controller. When I write a 2kiB page to address 0 on my raw NAND flash, the page is written to physical address 0. If physical address 0 contains a bad block, then my write may have been in error. If I write a 2kiB page to address 0 on an SD card, the card will write it to physical address 0. However, if physical address 0 is bad, it may write it to another physical address and then refer to that physical address as logical address 0 to the outside world. This is important to remember so that you don’t over-think the steps you take for safety with SD cards.
SD card controllers cache writes
SD cards will cache write operations in order to extend the lifetime of the flash. For example, if you’re appending lines to a log file on a regular basis, the SD card will cache the pages containing that file in a small amount of RAM in the controller chip. After an algorithm determines that the cache is full or that some time has elapsed with no further activity, the chip will flush the cache to the actual flash. This means that you may write to a small file, close it, and then tell Linux to sync the file to disk. The Linux operating system will perform all writes, flushing its own write buffer, and then report success. However, Linux has only reported that its part has been completed. The SD card might not have flushed your small file to flash yet. The solution to this is unknown to me at this time, but it probably involves telling the SD card driver to send one or more commands to the card to cause it to flush. Unmounting the file system or ejecting the card most likely causes the flush commands to be sent.
Why do we use the 4MiB boundaries and ignore flash details like erase block sizes?
From using and reading the documents accompanying the flashbench tool (see glossary), you’ll learn that most high capacity cards (today, mid 2013) are shipped with these optimized 4MiB boundaries regardless of flash internals. The 4MiB sizes are actually recommended by the SD card specifications. In addition, the page and erase block sizes are not only always smaller than 4MiB, but 4MiB is also an even multiple of those smaller units. As for FAT32, the largest cluster size available is 64kiB, so you don’t get a choice there. If your page size is 64kiB or less, then cluster sizes of 32kiB or 64kiB should be multiples of your page size. If your page size is bigger, there’s nothing you can do about it. Because FAT32 and the SD card specification have limited our effective choices, there’s really no point in figuring out the details of your flash unless you try to do something more advanced. One last note is that the SD cards that use the 4MiB optimizations tend to have 2 or 4MiB allocation groups. So, if you’re going to create more than one partition, align your other partitions on 4MiB boundaries and you should be ok.
Small cards (<= 2GiB) and Large SSD Drives (>= 128GiB)
This document was targeted at users of microSD cards in the 4-16GB range with a focus on embedded Linux development with TI OMAP processors. Consequently, the recommended 4MiB boundaries are based on that size range. These numbers may not be appropriate for smaller cards and larger SSD drives. Cards with sizes less than 2GiB will be formatted using FAT16 by the SD Card Association tool and may have optimization sizes other than 4MiB. You can use the flashbench tool to detect the size and location of the optimized areas. Alternatively, you can examine the flashbench results of other cards of similar size and speed class. If you see 2MiB across the board for them, chances are that your card is the same. Larger SSD drives have much larger page sizes, erase block sizes, and allocation groups. You may want to use flashbench to examine your drive.

Glossary

This is going to be a brief rundown of important flash and flash controller chip terminology. Example numbers used below are based on the 512MB Hynix NAND part in the Gumstix Overo FireCOM boards.

page size
minimum write or read size from a flash part, 2kiB
erase size (write size)
minimum size that must be erased in order to write, 128kiB. For example, if you change 1 byte in a file, the controller chip must update the 2kiB page containing that byte. In order to do that, it must read the 128kiB erase block containing that page into memory, replace the one changed page, erase the block, and then write the block. Erase blocks are erased by resetting all bits to 1s. Writes to flash can only really write 0s. If a 0 must be changed back to a 1, then the entire read/erase/write cycle must be performed.
page alignment
pages are aligned in the flash chips starting from address 0. Formatting a partition in flash to start at address 2kiB with 4kiB file system block sizes will result in poor performance on a part with 4kiB pages, for example. If the updated page occurs in the middle of an erase block, then two pages (because each file system block overlaps two flash pages) will need to be updated, but only one erase will need to be performed. The worst scenario is an update to a block that straddles an erase block boundary. Now two pages need to be updated and two erase cycles need to be performed. This is actually made worse once you know about allocation groups (see below). There could be many more than 2 erase cycles that must be performed if an allocation group must be flushed to flash.
allocation groups (au)
Flash controller chips have a small amount of RAM cache for reducing wear on the flash of a frequently updated area and for increasing performance. The amount of space cached is called an allocation group (sometimes called an allocation unit or au) and is named such because it typically contains more than one erase block. For example, an allocation group of 4MiB may exist for a flash chip with an erase block size of 256kiB. The chips can usually cache one or more of these allocation groups. The number of allocation groups available greatly affects the random read-write performance of the card. Most cards have 1 or 2 allocation groups, which is sufficient for a digital camera that only reads or writes one file at a time. Sandisk cards have been noticed to have as many as 9 allocation groups. Allocation groups become important when an SD card is used for a root file system, since a card with one allocation group must flush its cache each time a file is accessed outside of that group’s allocation.
read_ahead_kb
There is a sysfs control file on Linux systems for each disk named read_ahead_kb, found under /sys/block/<device>/queue/. The typical value is 128, meaning 128kiB. It governs how much additional data the kernel will attempt to read past each requested read in anticipation that another read for subsequent data will soon be coming. In practice with 8GB class 10 microSD cards, I’ve found that 512kiB is a good number. However, the only way to tell is to try changing the number and performing many read tests to empirically compare the performance.
flashbench
a Linux program that attempts to use reads and writes of very specific sizes and offsets, combined with the time taken to perform those tasks, to determine the characteristics of the flash part(s) inside an SD card. SD cards use controller chips and therefore mask characteristics such as page size from the user. However, as you now know, it can be important to know these numbers for proper formatting and use of the card. Flashbench is easy to compile and use. It is hard to interpret, though. The best method of learning is to search the linaro-dev mailing list (which is noted in flashbench’s README file) for flashbench. In the search results, you’ll see others post their flashbench output and can read the analysis of experts. It is through this process that you’ll best equip yourself to understand your own results.
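The cost of misalignment described under page alignment in this glossary can be counted directly. The sketch below is an illustration only: the function name is hypothetical, and the 4kiB page size is an assumption chosen so that a 4kiB file system block maps to exactly one page when aligned.

```shell
#!/bin/sh
# Print how many units of $3 bytes are touched by a write of $2 bytes at byte offset $1.
units_touched() {
    first=$(( $1 / $3 ))
    last=$(( ($1 + $2 - 1) / $3 ))
    echo $(( last - first + 1 ))
}

PAGE=4096            # assumed page size for this illustration
ERASE_BLOCK=131072   # 128kiB erase block, as in the glossary example

units_touched 0 4096 "$PAGE"               # aligned 4kiB block: 1 page
units_touched 2048 4096 "$PAGE"            # block shifted by 2kiB: 2 pages
units_touched 129024 4096 "$ERASE_BLOCK"   # block straddling a boundary: 2 erase blocks
```

The same function works for allocation groups by passing the group size (for example 4194304 for 4MiB) as the third argument.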

Format Entire Disk FAT32 Script

Further Research

Time permitting…