A while ago, I was dealing with an unwelcome bug. It was corrupting the
slab and occasionally causing an oops.
The syslog looked like this:
Slab corruption: 094f77bc start=094f77c0, len=444
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<0811de0f>](ext2_destroy_inode+0x41/0x46)
0f0: 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Prev obj=094f75f4 start=094f75f8, len=444
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<00000000>](nosmp+0xf7fb7000/0x14)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj=094f7984 start=094f7988, len=444
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<0811dd90>](ext2_alloc_inode+0x14/0x52)
000: 36 8c 01 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
The first thing to do when chasing slab corruption is to enable the debug config options CONFIG_DEBUG_SLAB and CONFIG_DEBUG_SLAB_LEAK. Enabling CONFIG_DEBUG_VM doesn't hurt either.
Each allocatable memory unit is called an object in slab terminology. The first line of the syslog excerpt shows where the object that the slab allocator considers corrupted starts. Slab corruption is detected while allocating an object. Typically, when an object is freed it is "poisoned" (a specific byte value is written over the object's entire memory), and on allocation the allocator checks whether the poison values are still intact. If some values differ, the memory was written to after it was freed. There are three possible scenarios in which this can happen:
- over-running the allocated memory - writing to an address past the end of the allocated memory
- under-running the allocated memory - writing to an address before the start of the allocated memory
- using memory after freeing it
Wrong pointer arithmetic could lead to the first two cases above. In such cases, the "Next" and "Prev" objects listed in the syslog hint at where the stray write may have come from. The "Last user" in the syslog is the last function that allocated or freed the object; for the corrupted object it is the function that freed it. [<00000000>] or nosmp indicates that the object has not been used yet.
Since I didn't have much pointer arithmetic in my code, it was likely that I was using freed memory. The memory in question was an inode, which is freed when all references to it are dropped. A reference to an inode is dropped with iput.
I audited all the iputs in the code but couldn't find any problems. The iputs are what make my code complex. Typical file systems have just one inode per file to deal with, so they usually need no iputs of their own, as most of it is taken care of by the VFS. ChunkFS, however, has many more (continuation) inodes and hence many more igets and iputs.
Next, I sprinkled printks around to see what was going on. That didn't help either; it only took me more than a day to work out from piles of logs what exactly was happening, whether anything absurd was going on, and whether there was any particular corruption pattern. It is stressful as well as fun to build a mental map of the execution paths and the likely values of the variables just from reading the log. Luckily, it was (almost) all single-threaded.
I figured out that the slab was mostly corrupted during the creation of continuation inodes for directories, and in a desperate attempt to fix it fast, I resorted to a debugger. There's a reason, btw, why there's no debugger in the Linux kernel: a debugger makes developers lazy. They no longer inspect the code to look for problems but quickly turn to debugging them. While debugging, I looked at the code and saw where the problem was. I somehow hadn't audited the code closely enough to catch it. This patch fixed it:
@@ -264,10 +266,10 @@ static int chunkfs_mkdir(struct inode *
d_instantiate(dentry, inode);
out:
- if (parent)
- iput(parent);
if (dentry->d_parent->d_inode != dir)
mutex_unlock(&dir->i_mutex);
+ if (parent)
+ iput(parent);
return err;
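In the code, the variable parent is equal to the variable dir if a continuation inode is created while creating a directory. In that case the old code could drop the last reference to dir through iput(parent) and then write to the freed inode in mutex_unlock(&dir->i_mutex), which would be consistent with the single 6c byte among the 6b poison bytes in the log above.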