Prying Bugs II - file handle corruption
The one I'm working on right now is fairly simple to understand and always reproducible: volumes with a minor number greater than 256 are not being mounted. The client process waits for the server's response and later dies with EIO.
I started hunting the bug by enabling the available debug info, and it was clear that the client wasn't getting an FSINFO reply. A plunge into the code was required to find the reason. For every client, the server maintains an authentication cache (the export map) and a cache for file handles (the expkey map). Since file handles have to be unique within the server, device numbers are used, which are fairly unique. Some client info and the device number together decide the expkey. However, minor numbers greater than 256 are encoded differently, with a new encoding logic (which arguably even ethereal doesn't know of; I tried adding it, but only got as far as adding the new file handle version and couldn't do more for lack of time).
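To make the two encodings concrete, here is a minimal sketch. It assumes they correspond to the kernel's old_encode_dev() and new_encode_dev() helpers in include/linux/kdev_t.h, which is my reading and not something I have confirmed for every code path in nfsd. The function names encode_dev_old and encode_dev_new are made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: the old 16-bit layout, 8 bits of major
 * followed by 8 bits of minor. Minor bits above bit 7 are lost. */
uint32_t encode_dev_old(uint32_t major, uint32_t minor)
{
    return ((major & 0xff) << 8) | (minor & 0xff);
}

/* Hypothetical helper: the newer 32-bit layout. The low 8 bits of
 * the minor stay in bits 0-7, the major starts at bit 8, and the
 * remaining minor bits are shifted up to start at bit 20. */
uint32_t encode_dev_new(uint32_t major, uint32_t minor)
{
    return (minor & 0xff) | (major << 8) | ((minor & ~0xffu) << 12);
}
```

For a minor that fits in 8 bits the two functions return the same value, so such devices look identical under either scheme. Once the minor needs more than 8 bits, the old layout silently truncates it (minor 300 collides with minor 44), while the new layout produces a different, wider key. Both sides of a conversation have to agree on which layout is in use.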
After staring at the code for quite a while, I tried a lot of debug printing and learned that fh_verify (file handle verification) was failing. A day later I realized that fh_verify failed because the expkey cache entry was stale and never got updated. Things stagnated for some time after that. Later, after some thoughtful insights from the author of the original (Sun) NFSv3 and its RFC (1813) himself (he sits two cubicles away from me) and some more code tracing, I found that the cache should be updated by the kernel when it receives a request for a file handle on /proc/fs/nfsd/.getfs or /proc/fs/nfsd/.getfd from mountd.
Then came a network trace and the realization that mountd was not returning the correct file handle. This was all due to the different encoding scheme for higher minors. I changed the code that encodes the expkeys of devices with higher minor numbers to use the old encoding scheme (the one used for devices with lower minor numbers) and changed nfs-utils correspondingly. It worked. But I don't know why devices with higher minor numbers have to be encoded differently when we have 32 bits to fit the key into.
This is in no way a fix, just a workaround. I still need to look up where things are slipping for the higher minors and why the file handle isn't reported correctly.
Right now, time for some weight lifting and biking (which I've started to love) :-)