Prying Bugs I - fcntl / close race
Recently I've been put on bug fixing in addition to the regular chores, mostly NFS. Last week I was working on a bug in which locks obtained using fcntl were not gracefully revoked when a process trying to hold a _non-blocking_ lock terminated abnormally. This was an interesting problem: a very obvious bug, easily reproducible, which made me wonder how it had survived so long. So long that Red Hat had to hire me to fix it. It is a 2.4 kernel bug. It's even upstream.
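For anyone who hasn't used them, here is a rough sketch of what a non-blocking fcntl lock looks like from user space. It uses Python's fcntl.lockf wrapper on a temporary local file, so it only shows the API semantics, not the NFS bug itself. The helper names (try_lock, child_can_lock, demo) are made up for illustration, and it assumes a Unix system where os.fork is available.

```python
import fcntl
import os
import tempfile


def try_lock(fd):
    """Attempt a non-blocking exclusive fcntl lock; True if acquired."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        # Someone else holds a conflicting lock; we do not wait for it.
        return False


def child_can_lock(path):
    """Fork a second process and report whether it could take the lock."""
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            fd = os.open(path, os.O_RDWR)
            code = 0 if try_lock(fd) else 1
        finally:
            # Exiting closes the fd, which also drops any lock the child got.
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


def demo():
    with tempfile.NamedTemporaryFile() as f:
        held = try_lock(f.fileno())             # parent takes the lock
        while_held = child_can_lock(f.name)     # child is refused at once
        fcntl.lockf(f.fileno(), fcntl.LOCK_UN)  # parent releases
        after_release = child_can_lock(f.name)  # child now succeeds
        return held, while_held, after_release


if __name__ == "__main__":
    print(demo())
```

The point to notice is that the second process gets an immediate refusal rather than sleeping, and that a process's locks go away when its file descriptors are closed, including at exit.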
When a process dies (for any reason), the locks it holds are revoked as its files are closed. The Linux kernel maintains locks in two places: in the inode (i_flock) and in a global linked list of locks. Acquiring a lock on a local filesystem is easy: only the conflicts need to be resolved. But for a network filesystem, the kernel first calls the filesystem's lock routine, which puts the request on the wire and waits for the reply, and only then resolves conflicts, if any. The bug was actually a race between a file close and fcntl, and I bet it would exist on any Linux network filesystem, for it's in the VFS itself. This was the scene:
---
P1 requests the lock from the server
Server grants the lock
P1 sends a release request for the lock
Server grants the release request...
...but before the release response reaches P1 and i_flock is updated, P2 gets going
P2 requests the same lock from the server
Server grants the lock
P2 tries to update the i_flock field, but finds a conflict
BUG: the error goes unhandled
P1 runs and updates i_flock
---
The bug is that the conflict is detected but the server is never sent an unlock request, so the server still thinks it has granted the lock. Remember, the locks are non-blocking. Later, when you test for the lock, the server reports it as held, but in fact the process to which it granted the lock has ceased to exist. The fix was to send the unlock whenever a conflict is detected after the server has granted the lock.
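The sequence above can be replayed in a toy model. This is only a sketch under my own assumptions, not the kernel code: Server, run_race and the send_unlock_on_conflict flag are hypothetical names, a plain set stands in for the client's i_flock list, and the interleaving is scripted by hand instead of coming from real concurrency.

```python
class Server:
    """Toy lock server: remembers who it believes holds the lock."""

    def __init__(self):
        self.owner = None

    def lock(self, who):
        if self.owner is None:
            self.owner = who
            return True
        return False

    def unlock(self, who):
        if self.owner == who:
            self.owner = None


def run_race(send_unlock_on_conflict):
    """Replay the P1/P2 interleaving; return (server owner, local locks, P2 result)."""
    server = Server()
    local = set()  # stand-in for the client's i_flock list

    # P1 takes the lock: server first, then the local record.
    assert server.lock("P1")
    local.add("P1")

    # P1 releases: the server has processed it, but the local entry
    # has not been removed yet.
    server.unlock("P1")

    # P2 asks for the same lock and the server grants it.
    granted = server.lock("P2")

    # P2 now hits the stale P1 entry in the local list.
    conflict = bool(local)
    if granted and conflict:
        if send_unlock_on_conflict:
            server.unlock("P2")  # the fix: hand the lock back to the server
        p2_got_lock = False
    else:
        local.add("P2")
        p2_got_lock = True

    # P1 finally runs and drops its local entry.
    local.discard("P1")
    return server.owner, sorted(local), p2_got_lock


if __name__ == "__main__":
    print("buggy:", run_race(False))
    print("fixed:", run_race(True))
```

In the buggy run the server ends up believing P2 holds the lock while the client has no record of it at all. In the fixed run both sides agree that nobody holds it.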
Bugs are making me more inquisitive...they increase my hunger and make me desperate to track them down. I'm liking it :)