bad block
What is a bad block?
A bad block is an area of storage media that is no longer reliable for storing and retrieving data because it has been physically damaged or corrupted. Bad blocks are also referred to as bad sectors.
There are two types of bad blocks. A physical, or hard, bad block comes from damage to the storage medium. A soft, or logical, bad block occurs when the operating system (OS) is unable to read data from a sector even though the medium itself is undamaged. An example of a soft bad block is when the cyclic redundancy check (CRC), or error correction code, stored for a particular block doesn't match the data the disk reads from that block.
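As a rough illustration of the CRC mismatch described above, the following Python sketch stores a checksum with a block's data and rechecks it on every read. It uses zlib.crc32 from the standard library as a stand-in; real drives compute their own error-checking codes in firmware, and the names write_block and read_block are hypothetical.

```python
import zlib

def write_block(data: bytes) -> dict:
    """Store the data together with a CRC computed at write time."""
    return {"data": bytearray(data), "crc": zlib.crc32(data)}

def read_block(block: dict) -> bytes:
    """Return the data, or raise if it no longer matches the stored CRC."""
    if zlib.crc32(bytes(block["data"])) != block["crc"]:
        raise IOError("CRC mismatch: block would be reported as bad")
    return bytes(block["data"])

block = write_block(b"hello, storage")
assert read_block(block) == b"hello, storage"

# Simulate a single flipped bit in the stored data.
block["data"][0] ^= 0x01
try:
    read_block(block)
except IOError as err:
    print(err)
```

Rewriting the block with fresh data and a fresh CRC clears this kind of error, which is why soft bad blocks are usually recoverable.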
On magnetic hard disk drives (HDDs), physical or hard bad blocks can happen when a location on the recording surface is defective or damaged. On NAND flash drives, the transistors that comprise storage blocks can become worn from use, making them unreliable or unusable after a certain number of write and erase cycles.
What causes bad blocks?
Today, digital information isn't stored or accessible at the bit or byte level. Storage devices simply can't provide the granularity needed. Instead, storage devices are organized into a series of storage areas. Think of these as a large series of cubbyholes, such as a wall of mailboxes in a post office. Each cubby holds a fixed volume of information. These storage spaces are called storage blocks or simply blocks.
When a computer stores, or writes, a file from its memory to storage media, it breaks the file into block-sized pieces and stores each piece in an available block somewhere on the storage media. When the file is recalled, or read, from the storage media, the computer retrieves the contents of the blocks where the file is stored and reassembles the pieces in memory, where applications can operate on the file again.
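The split-and-reassemble idea can be sketched in a few lines of Python. This is a simplified model, not how any particular file system works: the 4,096-byte block size and the function names are assumptions, and real file systems also track where each block lives on the device.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes, for illustration only

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a file's contents into block-sized pieces (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def reassemble(blocks: list[bytes]) -> bytes:
    """Join the pieces back together in order."""
    return b"".join(blocks)

file_data = bytes(10_000)              # a 10,000-byte file of zeros
blocks = split_into_blocks(file_data)
print(len(blocks))                     # 3 pieces: 4096 + 4096 + 1808 bytes
assert reassemble(blocks) == file_data
```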
Storage media operates on two levels. The physical level is where data is stored. The logical level defines how an OS interacts with the physical media, such as formatting and error checking.
Although the terms sector and block are often used interchangeably, they refer to different concepts:
- A sector is the smallest storage space available on a storage medium. Sectors are typically termed hard sectors or device blocks.
- A block is the smallest storage space available to the file system. These are typically called file system blocks or input/output blocks.
A block can use only a single sector, but it almost always involves multiple sectors on a physical storage device. This is an example of where the logical storage activities of the computer relate to the physical storage characteristics of the actual storage device. The important issue here is that a problem in any physical sector will adversely affect the logical block with which that sector is associated.
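To make the relationship concrete, here is a small Python sketch that maps file system blocks to the sectors beneath them. The sizes (512-byte sectors, 4,096-byte blocks) and the function names are assumptions chosen for illustration.

```python
SECTOR_SIZE = 512    # assumed bytes per physical sector
BLOCK_SIZE = 4096    # assumed bytes per file system block

SECTORS_PER_BLOCK = BLOCK_SIZE // SECTOR_SIZE  # 8 sectors per block

def sectors_for_block(block_no: int) -> range:
    """Sectors that make up a given file system block."""
    start = block_no * SECTORS_PER_BLOCK
    return range(start, start + SECTORS_PER_BLOCK)

def block_for_sector(sector_no: int) -> int:
    """File system block that contains a given sector."""
    return sector_no // SECTORS_PER_BLOCK

def affected_blocks(bad_sectors) -> set[int]:
    """Every block that contains at least one bad sector."""
    return {block_for_sector(s) for s in bad_sectors}

print(list(sectors_for_block(2)))             # sectors 16 through 23
print(sorted(affected_blocks({17, 18, 40})))  # [2, 5]
```

Two bad sectors inside the same block still cost only one block, but a single bad sector is enough to make the whole block unreliable.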
All bad blocks represent a failure of the storage media, whether within the transistors of solid-state devices or across the magnetic coatings of traditional rotating hard disks. Physical defects can't be corrected, though they can be mitigated. Logical defects can typically be corrected, though there might be some data loss.
Causes of bad blocks include the following:
- Magnetic defects. Storage drives can ship from the factory with defective blocks that originated in the manufacturing process. This is a common issue. Before the device leaves the factory, these bad blocks are marked as defective and remapped to the drive's extra blocks.
- Physical damage. A bad block can result from physical damage to a device that makes it impossible for the OS to access data. On HDDs, mishaps, such as dropping a laptop, can cause the drive head to damage the platter. Dust and natural wear can also damage HDDs.
- Logical defects. Software problems cause soft bad sectors. For instance, if a computer unexpectedly shuts down, the hard drive could turn off in the middle of writing to a block. In this case, the block could contain data that doesn't match its CRC or error correction code, and it would be identified as a bad sector. Similarly, storage and retrieval aren't perfect; rare errors can occur, such as a bit changing state en route to or from storage. That situation can present as a bad block even though there's no damage or defect in the underlying storage media.
- Solid-state defects. Damage to a solid-state drive (SSD) or other flash storage-type device can occur when a transistor fails within the memory array. Storage cells can also become unreliable over time, such as when the NAND flash substrate in a cell becomes unusable after a certain number of program-erase cycles. The erase process on an SSD requires sending a small electrical charge through the flash cell. Over time, this degrades the oxide layer that separates the floating gate transistors from the flash memory silicon substrate and the bit error rates increase. The device's storage controller can use error detection and correction mechanisms to fix these errors. However, at some point, the errors outstrip the controller's ability to correct them, and the cell becomes unreliable or fails outright.
What do bad blocks do?
In broad terms, bad blocks disrupt the computer's storage operations. Generally, this produces error messages when reading or writing to storage. The way in which bad blocks manifest themselves -- and the data involved in the bad blocks -- will affect how a computer responds. Here are some signs and symptoms of bad blocks:
- If the bad blocks involve the computer's OS, the computer might fail to boot the OS, perhaps causing a blue screen error, or it might return an error message during the boot process. Performing an OS recovery can let the storage device use reserve blocks and restore normal operation.
- If the bad blocks involve applications, such as a word processor, the application might fail to load and could return a storage error. Reinstalling or repairing the afflicted application might let the storage device use reserve blocks and restore normal operation.
- If the bad blocks involve data, such as a document or video file, the file might fail to load or save properly, and the application could return an error message. Disk utility software might be able to remap the bad blocks, but the affected data would have to be recovered from a backup.
- Bad blocks can cause performance problems with storage access, resulting in slower save or load times as the storage device identifies and attempts to remap the bad blocks.
- The computer might not properly recognize a storage device afflicted with bad blocks, interpreting it as an "unknown device."
Disk utility software, such as CHKDSK on Microsoft Windows systems or badblocks on Linux systems, scans storage media and marks the failed sectors so the OS doesn't use them. The firmware on an HDD controller can also identify and mark a bad block as unusable. This usually happens when a block is being overwritten with new data. The controller automatically remaps the bad block to a spare sector. Once a sector is identified as bad, it isn't used in future operations.
Bad blocks identified during post-manufacturing testing of a drive are listed on what's called the P-List, short for permanent or primary defect list. Bad blocks found after the drive is in use, caused by physical damage or deterioration of the recording surface, are recorded on the G-List, short for growing list.
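The remapping and defect lists described above can be modeled with a toy Python class. This is only a sketch under simplifying assumptions: real drive firmware is proprietary and far more involved, and the class name, method names and sector numbers are all hypothetical.

```python
class DriveDefectManager:
    """Toy model of sector remapping with a P-List and a G-List."""

    def __init__(self, factory_defects, spare_sectors):
        self.spares = list(spare_sectors)  # reserve sectors not yet used
        self.remap = {}                    # bad sector -> spare sector
        self.p_list = []                   # defects found at the factory
        self.g_list = []                   # defects found during use
        for sector in factory_defects:
            self._retire(sector, self.p_list)

    def _retire(self, sector, defect_list):
        """Record the defect and point the sector at the next spare."""
        if not self.spares:
            raise RuntimeError("no spare sectors left")
        defect_list.append(sector)
        self.remap[sector] = self.spares.pop(0)

    def report_bad(self, sector):
        """Record a defect that appeared after the drive entered service."""
        if sector not in self.remap:
            self._retire(sector, self.g_list)

    def resolve(self, sector):
        """Return the sector actually used for a requested sector."""
        return self.remap.get(sector, sector)

drive = DriveDefectManager(factory_defects=[7], spare_sectors=[1000, 1001, 1002])
drive.report_bad(42)
print(drive.resolve(7), drive.resolve(42), drive.resolve(5))  # 1000 1001 5
```

When the pool of spares runs out, the sketch raises an error; on a real drive, a fast-growing G-List is one of the warning signs that the device is wearing out.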
When a NAND flash drive identifies a bad block, it's recorded in the device's bad block table (BBT). Before reading from or writing to a NAND device, the controller checks the device's BBT to avoid bad blocks. Flash drives use two kinds of BBTs: NAND-resident ones are retained across system boots and RAM-resident BBTs are recreated each time the system boots.
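The BBT lookup can be sketched as follows. The policy of skipping ahead to the next good block is an assumption made for this example, as are the class and method names; actual controllers differ in how they redirect writes.

```python
class NandDevice:
    """Toy NAND device that consults a bad block table before each write."""

    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        self.bbt: set[int] = set()          # block numbers known to be bad
        self.blocks: dict[int, bytes] = {}  # block number -> stored data

    def mark_bad(self, block_no: int) -> None:
        """Add a block to the BBT and drop whatever it held."""
        self.bbt.add(block_no)
        self.blocks.pop(block_no, None)

    def next_good_block(self, start: int = 0) -> int:
        """First block at or after start that is not listed in the BBT."""
        for block_no in range(start, self.num_blocks):
            if block_no not in self.bbt:
                return block_no
        raise IOError("no good blocks available")

    def write(self, block_no: int, data: bytes) -> int:
        """Write at block_no, skipping past any blocks listed in the BBT."""
        target = self.next_good_block(block_no)
        self.blocks[target] = data
        return target

nand = NandDevice(num_blocks=8)
nand.mark_bad(2)
nand.mark_bad(3)
print(nand.write(2, b"payload"))  # 4
```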
How to manage bad blocks
Bad blocks generally require little, if any, direct management. Onboard storage firmware algorithms and controllers automatically handle storage media and defect management in the background. The best way to fix an HDD file affected by a bad block is to write over the original file. This causes the hard disk to rewrite the incorrect bits properly, remapping the bad block or fixing the CRC or data.
Bad block management is critical to improving NAND flash drive reliability and endurance. Unlike magnetic storage media, flash can't be overwritten in place at the byte level. All changes must be written to a new block, and data in the original block must be marked for deletion. This is an important nuance for solid-state flash storage devices: A block must be erased before it can be rewritten. However, the erasure process is damaging to flash storage cells, so the entire device is written before any blocks are rewritten. This is a central premise of flash wear leveling.
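The erase-before-rewrite rule can be modeled in a few lines of Python. This is a deliberately small sketch: the FlashBlock class, its four-page size and its method names are assumptions, not a description of any real controller.

```python
class FlashBlock:
    """Toy flash block: pages can be programmed once per erase cycle."""

    def __init__(self, pages: int = 4):
        self.pages = [None] * pages  # None means the page is erased
        self.erase_count = 0

    def program(self, page_no: int, data: bytes) -> None:
        """Write a page; refuse if it already holds data."""
        if self.pages[page_no] is not None:
            raise ValueError("page must be erased before it can be rewritten")
        self.pages[page_no] = data

    def erase(self) -> None:
        """Erase the whole block, which costs one program-erase cycle."""
        self.pages = [None] * len(self.pages)
        self.erase_count += 1

block_a = FlashBlock()
block_b = FlashBlock()
block_a.program(0, b"version 1")
try:
    block_a.program(0, b"version 2")  # in-place overwrite is rejected
except ValueError as err:
    print(err)
block_b.program(0, b"version 2")      # the update goes to a fresh page elsewhere
block_a.erase()                       # the stale block is erased later
print(block_a.erase_count)            # 1
```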
Once a flash drive fills up, the controller must start clearing out blocks marked for deletion before it can write new data. To do this, it consolidates good data by copying it to a new block. This process requires extra writes to consolidate the good data and results in write amplification where the number of actual writes exceeds the number of writes requested. Write amplification can decrease a flash drive's performance and lifespan.
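Write amplification is commonly expressed as a ratio of the writes the flash actually performs to the writes the host requested. The short Python sketch below computes that ratio; the function name and the page counts are made up for illustration.

```python
def write_amplification(host_writes: int, flash_writes: int) -> float:
    """Ratio of pages physically written to pages the host asked to write."""
    if host_writes <= 0:
        raise ValueError("host_writes must be positive")
    return flash_writes / host_writes

# The host writes 100 pages, and consolidating good data during cleanup
# forces the controller to copy another 50 pages.
host_pages = 100
copied_pages = 50
print(write_amplification(host_pages, host_pages + copied_pages))  # 1.5
```

A ratio of 1.0 would mean no extra writes at all; the higher the ratio, the faster the drive consumes its limited program-erase cycles.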
Flash vendors use several techniques to control write amplification. One, known as garbage collection, involves freeing up blocks that were previously written to; this proactively consolidates data. When this is done properly, these reallocated sectors can reduce the need to erase entire blocks of data for every write operation.
Vendors also use data reduction technologies, such as compression and deduplication, to minimize the amount of data being written and erased on a drive. In addition, an SSD's interface can help decrease write amplification. The TRIM command for Serial Advanced Technology Attachment (SATA) drives and the UNMAP command for SAS drives identify data blocks that are no longer in use and can be wiped out. This approach minimizes garbage collection and frees up space on the drive, improving performance.
To extend the life of a solid-state device, the controller software that manages a NAND device can implement an algorithm to distribute program-erase cycles evenly across a drive and ensure no block has excessive use compared with other blocks. With wear leveling, the flash device remaps storage blocks each time a write occurs. This approach ensures that write cycles are spread across all the memory cells and no one block is written to more than others, reducing the chance that blocks fail prematurely.
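One simple way to picture wear leveling is a controller that always directs the next write to the free block with the fewest erases. The Python sketch below follows that policy; it is an assumption-laden toy (the class name, the free-block pool and the tie-breaking rule are all hypothetical), not a vendor algorithm.

```python
class WearLeveler:
    """Toy wear leveler: each write goes to the least-erased free block."""

    def __init__(self, num_physical_blocks: int):
        self.erase_counts = [0] * num_physical_blocks
        self.mapping = {}                           # logical -> physical block
        self.free = set(range(num_physical_blocks))

    def write(self, logical_block: int) -> int:
        if not self.free:
            raise RuntimeError("no free blocks; garbage collection needed")
        # Pick the free block with the fewest erases (ties: lowest number).
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        old = self.mapping.get(logical_block)
        if old is not None:
            # The previous copy is stale: erase it and return it to the pool.
            self.erase_counts[old] += 1
            self.free.add(old)
        self.mapping[logical_block] = target
        return target

leveler = WearLeveler(num_physical_blocks=4)
for _ in range(8):
    leveler.write(logical_block=0)  # rewrite the same logical block 8 times
print(leveler.erase_counts)         # [2, 2, 2, 1]
```

Even though the host rewrote one logical block eight times, the erases are spread across all four physical blocks instead of landing on a single one.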
To support wear leveling and garbage collection, vendors overprovision flash capacity on a drive. That way, a drive has an inventory of cells available to support write operations, improve performance and replace cells that wear out. Software-enabled flash can ease garbage collection downsides.
How to use CHKDSK to inspect a device
The CHKDSK utility is a diagnostic tool PC users and systems administrators can use to inspect a storage device, determine its functional status and identify issues such as bad blocks. CHKDSK is available through a Windows command prompt, the Advanced Startup Options and System Recovery Options. Here's how it works:
- From Windows, click Start and search for the Run command line tool.
- Type powershell.exe into the Run command box and press Ctrl+Shift+Enter to launch PowerShell in administrator mode. A PowerShell dialog will open.
- Type "CHKDSK" and press Enter. If no runtime switches are added to the CHKDSK command, the utility runs in read-only mode and leaves the storage device intact.
- The CHKDSK utility will run for a few moments and return a comprehensive report on the state of the storage device. The report includes a summary at the bottom that reports on the bad sectors.
The CHKDSK utility also provides a variety of command line switches or parameters that can be added to tailor CHKDSK behavior or select appropriate actions for a storage device. The most important switches are the following:
- /f, which will fix errors detected on the storage device.
- /v, which causes a detailed, or verbose, response from CHKDSK as it runs.
- /r, which will attempt to recover readable information from bad blocks.
A complete listing of CHKDSK switches can be displayed with the help switch, /?, as in chkdsk /?.
For example, to fix errors and recover data wherever possible, use the /f and /r switches together, as in chkdsk /f /r.
It's always best to run CHKDSK in read-only mode first to see the state of the storage device before deciding to fix, recover or allow CHKDSK to perform other functions on the device. It may be wise to perform a complete system backup before using CHKDSK to repair a storage device.
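For administrators who script these checks, the switch combinations above can be assembled programmatically. The following Python sketch is one hedged way to do it: build_chkdsk_command and run_chkdsk are hypothetical helper names, the optional volume argument (for example C:) is standard CHKDSK syntax that isn't covered in the steps above, and the defaults keep CHKDSK in read-only mode.

```python
import platform
import subprocess

def build_chkdsk_command(volume=None, fix=False, recover=False, verbose=False):
    """Assemble a CHKDSK command line; no switches means read-only mode."""
    cmd = ["chkdsk"]
    if volume:
        cmd.append(volume)
    if fix:
        cmd.append("/f")
    if verbose:
        cmd.append("/v")
    if recover:
        cmd.append("/r")
    return cmd

def run_chkdsk(**options):
    """Run CHKDSK and capture its report (Windows only, elevated prompt)."""
    if platform.system() != "Windows":
        raise OSError("CHKDSK is only available on Windows")
    return subprocess.run(build_chkdsk_command(**options),
                          capture_output=True, text=True)

print(" ".join(build_chkdsk_command()))                        # chkdsk
print(" ".join(build_chkdsk_command(fix=True, recover=True)))  # chkdsk /f /r
```

Keeping the read-only form as the default mirrors the advice above: inspect first, and add /f or /r only after reviewing the report and backing up.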
Do bad blocks mean a storage device is failing?
All bad block errors represent a problem with storage media. The consequences of the problem will depend on the severity and cause of the bad block. Examples of how to assess a bad block problem include the following:
- Manufacturing. Some bad blocks are a normal side effect of the manufacturing process. Modern storage devices contain billions of blocks. A common 2-terabyte HDD, for example, contains roughly four billion sectors of 512 bytes each. Flaws in magnetic or solid-state media can result in bad blocks, though they affect only a tiny fraction of the total. Such bad device blocks are identified and mapped out to prevent use, and users never even know that they are present. In this case, bad blocks don't represent a failing storage device.
- Normal use. Many logical bad blocks occur due to random errors. These can arise from fluctuations in power levels or in the high-frequency signals passing between the storage device and the rest of the computer. For example, a particle of cosmic radiation might flip a bit once in billions of bytes, resulting in a CRC error that suggests a bad block. These are logical errors and are easily remedied with no ill effects to the storage device. For example, simply resaving a file or recovering a backup can often overcome a logical bad block error.
- Failure. Storage devices are remarkably resilient and reliable devices, but both magnetic and solid-state storage devices can wear out and fail over time. There's no easy or direct way to distinguish a bad block caused by normal use from a bad block caused by a worsening failure in progress. IT professionals have the tools to track and analyze logs and replace storage devices pre-emptively to mitigate storage problems.
Individual computer users typically don't have such resources and must rely on personal experience to determine whether to fix bad blocks and what the proper remediation would be. In most cases, an occasional low-impact storage error can be ignored. If storage errors increase in frequency or seriousness as the computer ages, the user must decide when to replace the storage device.
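The sector count quoted in the manufacturing example above follows from simple division. The Python sketch below works it out both ways, because the result lands just under or just over four billion depending on whether "2 terabytes" is read as a decimal or a binary quantity. The 512-byte sector size comes from the text; everything else is arithmetic.

```python
SECTOR_SIZE = 512              # bytes per sector, as stated in the text

decimal_2tb = 2 * 10**12       # 2 TB as drive vendors usually count it
binary_2tib = 2 * 2**40        # 2 TiB as many operating systems count it

print(decimal_2tb // SECTOR_SIZE)  # 3906250000
print(binary_2tib // SECTOR_SIZE)  # 4294967296
```

Against totals of this size, a few hundred remapped sectors is a vanishingly small share of the drive, which is why factory-mapped defects go unnoticed.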