Hey list... Sorry for all the emails lately about this bug I'm working on. I finally got tired of waiting for support to track it down so today I found it myself and fixed it. Of course I'm going through normal support channels to get an official [supported] fix; but if anyone's interested here's a patch that'll do it. And you never know... it might speed up the process a bit by posting this explanation. I tested the patch (it only changes a few lines) and it successfully fixed the problem without any side-effects. I'll try to explain it as clearly as possible but programmers aren't always the most brilliant communicators. ;) VERSIONS AFFECTED: all that I'm aware of (including 1.0.10-1) BUG: the EXCLUSIVE_LOCK on the DirNode is not being released after a file is created if the directory has multiple DirNodes and the last DirNode (not the one used for locking) has ENABLE_CACHE_LOCK set. ENABLE_CACHE_LOCK can be set on the last DirNode if the DirNode has this lock when the 255th file is added because ocfs copies the entire structure byte-for-byte (including locks) to the new DirNode it is creating and does not clear out the locks. CAUSE: ocfs_insert_file() in Common/ocfsgendirnode.c (line 1594 in the code available through subversion at oss.oracle.com right now) looks at the incorrect variable (DirNode, the last one instead of LockNode the locking one) to check for ENABLE_CACHE_LOCK and does not release the EXCLUSIVE_LOCK because it thinks that the node is holding an ENABLE_CACHE_LOCK instead of an EXCLUSIVE_LOCK. I've attached a patch that fixes it if anyone cares to have a look... STEPS TO REPRODUCE BUG: create a new ocfs directory. when you initially create it it will have ENABLE_CACHE_LOCK set... *BEFORE* accessing the directory from *ANY* other node (this will release the ENABLE_CACHE_LOCK) create at least 255 files in the directory. optionally you can check with debugocfs -D to see if the last (not-locking) DirNode has ENABLE_CACHE_LOCK set. *NOW* access it from the other node (to change the locking DirNode from ENABLE_CACHE_LOCK to NO_LOCK) and then create a file from either node -- the EXCLUSIVE_LOCK will not be released. attempting any operation from the opposite node that requires an EXCLUSIVE_LOCK will cause that node to hang until you do an operation that successfully releases the EXCLUSIVE_LOCK (e.g. deleting a file). FIX: Of course the best solution is to fix ocfs_insert_file() to look at the correct DirNode for the lock. Another thought would be to update the code that creates a new DirNode and have it clear out locks after copying the DirNode, although this just seems extraneous to me. PATCH: I have tested this patch (it is a minimal change) and it successfully fixes the bug. It releases the lock from the correct pointer (either the variable "DirNode" or "LockNode" depending on whether they come from the same disk DirNode) but it now also checks the correct variable for ENABLE_CACHE_LOCK. REQUESTED COMPENSATION: hehe... email me for the address where you can send pizza. I wonder if any oracle developers really do read this list... Index: ocfsgendirnode.c =================================================================== --- ocfsgendirnode.c (revision 5) +++ ocfsgendirnode.c (working copy) @@ -1591,17 +1591,26 @@ indexOffset = -1; } - if (DISK_LOCK_FILE_LOCK (DirNode) != OCFS_DLM_ENABLE_CACHE_LOCK) { - /* This is an optimization... */ - ocfs_acquire_lockres (LockResource); - LockResource->lock_type = OCFS_DLM_NO_LOCK; - ocfs_release_lockres (LockResource); + if (LockNode->node_disk_off == DirNode->node_disk_off) { + if (DISK_LOCK_FILE_LOCK (DirNode) != OCFS_DLM_ENABLE_CACHE_LOCK) { + /* This is an optimization... */ + ocfs_acquire_lockres (LockResource); + LockResource->lock_type = OCFS_DLM_NO_LOCK; + ocfs_release_lockres (LockResource); - if (LockNode->node_disk_off == DirNode->node_disk_off) + /* Reset the lock on the disk */ + DISK_LOCK_FILE_LOCK (DirNode) = OCFS_DLM_NO_LOCK; + } + } else { + if (DISK_LOCK_FILE_LOCK (LockNode) != OCFS_DLM_ENABLE_CACHE_LOCK) { + /* This is an optimization... */ + ocfs_acquire_lockres (LockResource); + LockResource->lock_type = OCFS_DLM_NO_LOCK; + ocfs_release_lockres (LockResource); + /* Reset the lock on the disk */ - DISK_LOCK_FILE_LOCK (DirNode) = OCFS_DLM_NO_LOCK; - else DISK_LOCK_FILE_LOCK (LockNode) = OCFS_DLM_NO_LOCK; + } } status = ocfs_write_dir_node (osb, DirNode, indexOffset);