Fixing a Corrupted Ext4 File System

The Ext4 file system on our server got corrupted, causing the folder holding all my latest simulator code and experimental data to disappear.

This post describes how I got the data back and fixed the file system.

Background

Our lab received four fairly new ARM servers last year, and each server comes with a disk of only about 200 GB. There were four new 2.5-inch SSDs (800 GB each) left unused in our lab, and the ARM servers happen to support NVMe. Therefore, we bought adapter disk trays and mounted one SSD in each server. To make further use of the disks, I set up NVMe-over-TCP to let the servers share the SSDs with each other (the root cause of the corruption). At the beginning it seemed good: from server 1 I could see all the files on the SSD of server 2. Soon, I realized that writes from different servers to the same disk do not synchronize until the disk is remounted. I ran a few tests; for example, I let server 1 add a line to a file on the disk (physically) attached to server 2 through the networked mount. When I opened the file on server 2, the new line was not there, and the change was not visible to other servers (say, server 3) with remote access either.

I am not sure why disks mounted via NVMe-over-TCP behave like this, and the phenomenon did not raise enough caution in me, as we were not yet in a situation where we had to use the remote disks. (In hindsight, the likely explanation is that NVMe-over-TCP only exports a raw block device, while ext4 is not a cluster file system: each host caches blocks and metadata independently, so mounting the same file system from several hosts at once lets their views diverge and can corrupt it.)

After some tests on reading and writing files on the remote SSDs, I found that my folder bambo "hides" from the root folder of the disk. I could still change directory into it, but it was no longer listed in the root folder. Also, my simulator code folder is a git repository under bambo, and some of the git commands failed, complaining about files under .git.

So there were actually many signs of file system corruption, but I was busy and ignored them.

Problem and First Aid

After a system reboot, I totally lost my folder (I could not even access it by manually typing its path as before). All my work since winter break was gone with the folder (really scary). There was a lost+found folder, but nothing was inside.

I first unmounted the disks and stopped all remote mounting via NVMe-over-TCP. The file system corruption was likely caused by the remote accesses, and with a file system already corrupted, any further access could be destructive.

Then I found a portable disk with enough capacity to make a sector-level backup of the SSD.

$ sudo dd if=/dev/nvme0n1p1 of=<path to the portable disk mount point> bs=512

Probably because it copies sector by sector and the destination is a portable disk connected via USB 2.0, four days had elapsed when the backup finally finished.
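As a sanity check on those four days, here is a quick back-of-the-envelope calculation (assuming the nominal 800 GB was copied; the numbers are illustrative only):

```python
# Effective throughput of the sector-by-sector backup (illustrative numbers).
size_bytes = 800 * 10**9          # nominal SSD capacity
seconds = 4 * 24 * 3600           # roughly four days
throughput_mb_s = size_bytes / seconds / 1e6
print(round(throughput_mb_s, 1))  # 2.3 -- far below USB 2.0's ~30-40 MB/s in
                                  # practice, so the bs=512 per-call overhead dominates
```

A larger block size (e.g., bs=4M) and dd's status=progress option would likely have shortened the wait considerably.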

While waiting, I learned that e2fsck is the usual Linux command to check a file system and repair a few types of corruption. I guessed e2fsck would probably be able to find my folder and put it under lost+found, but I was not so sure about it. So I looked at other disk rescue tools recommended by online articles, such as testdisk. When I tried testdisk on another, working file system, it did find a file I had deleted on purpose, which made it look very promising.

However, after the backup was done and I let testdisk scan the corrupted SSD, it reported nothing found. I then had to look for other tools. The one I ended up using is debugfs, which is not as automated as testdisk but lets you perform many fundamental operations to investigate and modify the disk (e.g., dump a block, modify an inode).

The file system on the SSD is ext4, which I knew little about. This time, strongly motivated to get my folder back, I gained hands-on experience with the ext4 on-disk structure with the help of debugfs.

The command debugfs should only be used on an unmounted disk partition or an image file; otherwise the dynamically changing disk content might make debugfs malfunction. On my SSD with the corrupted file system, it complained about mismatched bitmaps and checksums. The workaround is the command-line option -c, which skips loading the block and inode bitmaps and opens the disk in read-only mode. I actually preferred read-only while I was still in the investigation phase. Later, to correct the file system, I had to open the disk writable and disable the checksum verification with -n:

$ sudo debugfs -w -n -z <path to undo file> /dev/nvme0n1p1

It is worth mentioning that -z records the original content of every block you overwrite into an undo file, in case you want to revert the changes later (with e2undo).

Ext4 File System

block and superblock

Just as memory is broken into pages for management, disks are divided into blocks. A block is a group of sectors (the sector is a concept from mechanical HDDs; I am not sure there is a true equivalent for SSDs), and block sizes range from 1 KB to 64 KB. Multiple blocks form a block group. Metadata such as the block size and the number of blocks per group is part of the configuration of a specific file system and is saved in fields of the superblock.

The location of the superblock is fixed: 1024 bytes from the very beginning of the disk partition. The first 1024 bytes are reserved on purpose in case some partitioning scheme needs them. For my disk, the block size is 4096 bytes, so the superblock sits inside Block 0, occupying the 1024 bytes right after the reserved region. There are backup superblocks in later block groups (group 1, and groups whose index is a power of 3, 5, or 7), and the following command can find them:

$ sudo dumpe2fs /dev/nvme0n1p1 | grep super
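To make the layout concrete, here is a minimal Python sketch that reads a few superblock fields straight from a partition image. The field offsets follow the ext4 on-disk layout; the image path and the choice of fields are just for illustration:

```python
import struct

def read_superblock(image_path):
    """Parse a few fields of the ext4 primary superblock, which always
    starts 1024 bytes into the partition and is 1024 bytes long."""
    with open(image_path, "rb") as f:
        f.seek(1024)
        sb = f.read(1024)
    magic, = struct.unpack_from("<H", sb, 0x38)            # 0xEF53 for ext2/3/4
    log_block_size, = struct.unpack_from("<I", sb, 0x18)   # block size = 1024 << this
    blocks_per_group, = struct.unpack_from("<I", sb, 0x20)
    inodes_per_group, = struct.unpack_from("<I", sb, 0x28)
    return {
        "magic": magic,
        "block_size": 1024 << log_block_size,
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
    }
```

Running it against a healthy partition image should report the magic value 0xEF53 and, for a disk like mine, a block size of 4096.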

group descriptor

Following the superblock, there is a table of group descriptors. A group descriptor provides information about its block group, such as the unused block/inode counts in that group and the locations of its block/inode bitmaps (which record usage; see the bitmaps section). My disk has 5962 block groups, so the group descriptor table spans from Block 1 to Block 94 (64 bytes per group descriptor). By default, every block group holds a full copy of the superblock and the group descriptor table; if the sparse_super flag is set in the superblock, the copies appear only in the groups that hold backup superblocks.
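The Block 1 to Block 94 span is simple arithmetic; a tiny sketch with my disk's numbers (the 64-byte descriptor size assumes the 64bit feature is enabled):

```python
import math

block_size = 4096     # from the superblock
group_count = 5962    # number of block groups on my disk
desc_size = 64        # bytes per group descriptor (32 without the 64bit feature)

gdt_blocks = math.ceil(group_count * desc_size / block_size)
print(gdt_blocks)  # 94 -- the table occupies Block 1 through Block 94
```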

bitmaps

There are bitmaps for blocks and inodes. The block bitmap records which blocks have been used by setting the corresponding bits to 1. The inode bitmap similarly marks the allocated inodes in the inode table (see the next section). Both bitmaps cover only the blocks/inodes belonging to their own block group, not the entire disk (this is different from the group descriptor table). Usually the block bitmap and inode bitmap are at the beginning of each block group, and the group descriptor points out their locations. However, if the flex_bg flag is set in the superblock, a certain number of block groups form a bigger flexible group, and the bitmaps for the block groups in a flexible group are concatenated and placed in the first block group of the flexible group. The number of block groups per flexible group is decided by a field in the superblock. This design allows ext4 to store huge files contiguously.
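Reading a bitmap is straightforward; a minimal sketch, assuming the usual little-endian bit order (bit 0 of byte 0 corresponds to the group's first block or inode):

```python
def bit_is_set(bitmap: bytes, index: int) -> bool:
    """Check whether block/inode `index` (relative to its block group)
    is marked as in use in the group's bitmap."""
    byte, bit = divmod(index, 8)
    return bool((bitmap[byte] >> bit) & 1)
```

For example, with a bitmap byte of 0x05 (binary 101), items 0 and 2 are in use while item 1 is free.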

inode and inode table

After the bitmaps, there is a table of inodes. In ext4, every file is represented by an inode (a directory is considered a special file, and so are links). Inodes are indexed from 1 up to a certain big number (decided by the superblock configuration: inodes per group × number of groups) and are unique within a file system. The inode table in each block group covers only the inodes belonging to that group, and might be relocated, together with the inode bitmap, to the first group of the flexible group if flex_bg is set.
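Given those superblock parameters, mapping an inode number to its slot on disk is a small calculation; a sketch (the parameter values in the example below are hypothetical):

```python
def locate_inode(ino: int, inodes_per_group: int, inode_size: int):
    """Map an inode number (numbering starts at 1) to its block group
    and its byte offset within that group's inode table."""
    group, index = divmod(ino - 1, inodes_per_group)
    return group, index * inode_size
```

With 8192 inodes per group and 256-byte inodes, the root directory (always inode 2) lands in group 0 at offset 256, and inode 8193 starts group 1's table.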

Most, if not all, information about a file is stored in the inode, such as the type, owner, permission mode, create/access/modification/delete times, number of links, and so on. It is worth mentioning that the number of links here is the count of hard links; for a directory it is at least two (one from the entry in the parent folder, another from the '.' entry in its own listing).

Other than that basic information, the most important fields in the inode are the ones related to the extent tree. Unlike ext2/ext3, where the inode reserves fields for block pointers of different depths, ext4 stores an extent tree header followed by either leaf entries (when the depth in the header is 0) or index entries. The header begins with the magic value 0xF30A, followed by the valid entry count (0-4 inside the inode; 1 to (blocksize-12)/12 in an interior extent block). The header also records the depth (a 16-bit field, but the practical maximum is no more than 5; anything deeper would exhaust the 32-bit block index first). A leaf entry carries three pieces of information: 1) the logical index/offset of its first block within the file; 2) the number of contiguous blocks it covers; 3) the physical (file-system-wide) index of its first block. The latter two fields follow an idea similar to segmentation, base + length, and the maximum data covered by one leaf entry is 32768 × blocksize. An index entry has just one physical block index (pointing to the next-level extent block) and the logical index/offset of the first data block it covers.
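The 12-byte header and 12-byte leaf entries described above can be decoded with a few struct calls; a sketch following that layout (fed with synthetic bytes, not a real disk):

```python
import struct

EXTENT_MAGIC = 0xF30A

def parse_extent_header(raw: bytes):
    """Decode the 12-byte extent tree header (magic, valid entry count,
    maximum entries, depth, generation)."""
    magic, entries, maximum, depth, _generation = struct.unpack_from("<HHHHI", raw, 0)
    if magic != EXTENT_MAGIC:
        raise ValueError("not an extent tree header")
    return {"entries": entries, "max": maximum, "depth": depth}

def parse_leaf_extent(raw: bytes, offset: int):
    """Decode one leaf entry: logical first block, length, and the 48-bit
    physical first block (stored as 16 high bits + 32 low bits on disk)."""
    logical, length, start_hi, start_lo = struct.unpack_from("<IHHI", raw, offset)
    return logical, length, (start_hi << 32) | start_lo
```

A depth-0 header means the entries that follow are leaves; otherwise they are index entries pointing to deeper extent blocks.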

directory structure

If the inode is a regular file, then following the extent tree leads you to the raw data in the data blocks. However, if the inode is a directory, the raw data is a list of directory entries (of varying length, aligned to 4 bytes). Each directory entry begins with the 32-bit inode index, then a 16-bit entry length, an 8-bit name length (a file name is at most 255 bytes long), an 8-bit type code, followed by the file name. Every directory has at least two entries (one for '.', and one for '..').
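Walking such a directory block is mostly pointer chasing on the entry length; a sketch of the classic (non-hashed) entry format, fed with synthetic bytes rather than a real block:

```python
import struct

def parse_dir_block(block: bytes):
    """List (inode, name) pairs from one block of linear directory entries."""
    entries, offset = [], 0
    while offset + 8 <= len(block):
        ino, rec_len, name_len, _ftype = struct.unpack_from("<IHBB", block, offset)
        if rec_len == 0:
            break  # malformed entry; bail out instead of looping forever
        if ino != 0:  # inode 0 marks an unused (e.g., deleted) entry
            name = block[offset + 8 : offset + 8 + name_len].decode("utf-8", "replace")
            entries.append((ino, name))
        offset += rec_len
    return entries
```

Note that the last entry's length is padded out to the end of the block, which is why the walk stops naturally at the block boundary.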

Rescue

The first challenge was to find the inode index of my missing folder, bambo. The data block of the root directory seemed to have been overwritten, and there was no entry for bambo any more. There are many subfolders at different depths below bambo, and once I hit one of them, I can follow the '..' entries all the way up to bambo. Therefore, I skimmed through the inode table to find some candidate directories (bits in the first field of the inode indicate whether it is a directory or not) and checked their contents. Later I realized that debugfs's ncheck <inode index> reports the path of an inode, which is very powerful and much easier than checking directory contents.
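The directory check mentioned above looks at the file-type bits of the inode's i_mode field (its first field); a minimal sketch of that test:

```python
S_IFMT = 0xF000   # mask for the file-type bits of i_mode
S_IFDIR = 0x4000  # the directory type

def is_directory(i_mode: int) -> bool:
    """True if the inode's mode field says it is a directory."""
    return (i_mode & S_IFMT) == S_IFDIR
```

A typical directory mode such as 0o040755 passes this test; a regular-file mode such as 0o100644 does not.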

Anyway, I successfully found the inode index for bambo. Luckily, the inode for bambo seemed intact, and the corresponding bit in the inode bitmap was properly set, so I was not sure why there was the error message about the bitmap checksum.

There are two subcommands of debugfs, dump and rdump. The former copies a file out of the file system to a destination (not a destination in the same, not-yet-mounted file system, but on the working file system), and the latter copies a directory recursively. With those two subcommands, I finally got all my data copied out of the corrupted file system.

I wanted to see whether I could fix the file system in place, because the corruption seemed minor: just one entry missing from the root directory. I did not find a way to modify a raw block via debugfs. So I tried something similar to the third trick in the article Fun with e2fsck and debugfs. I created a new directory named bambo under the root directory and modified its inode to force its extent to point to the same extent as my old bambo inode. After the change, debugfs ls / showed bambo under the root directory, and all accesses into /bambo worked fine. I wrote the disk, exited debugfs, and mounted the disk normally. However, only /bambo showed up, and I could not access anything further under /bambo (it said "Bad message"). I checked all the inode fields I could modify with debugfs, but it did not help. It is also worth mentioning that debugfs mi seems to present the extent field of an ext4 inode under the old ext2/3 scheme: the actual extent block appears as Direct Block #5, while the header content is shown as Direct Block #0 through #4. By mistake, I touched Direct Block #0 in my first attempt. Although I tried to revert the change, the header was probably messed up, which I believe was the cause of the "Bad message".

Scanning through the subcommands that debugfs offers, I found one named ln that is said to create a link for an inode. Soon I realized it must be the proper way to connect folders. So I removed the newly created bambo folder and ran ln <old bambo inode> /bambo inside debugfs. This time, it worked perfectly, and the disk seems back to normal after remounting.

Credits