Since January of this year, AppleFileServer has been crashing regularly on a server at work. This server is our primary student account server, which at any given time has about 40-80 students logged in (network home directories).
Many days, AFP crashes several times. Every time, it’s the same error: kern_protection_failure. The thread that crashes always references ByteRangeLockTreeKey. The only good thing about this problem is that seemingly AFP comes back up, and people’s computers reconnect (go autofs!). But this is a very poor consolation prize, since for some people the crash does cause a problem. Anyone with Mail open usually gets an error about not being able to access their inbox, asking whether they want to rebuild or quit. Some others occasionally get Final Cut project file corruption (this is rare, and only seems to impact those who have their autosave vault set to their home directory rather than the local HD).
So, Apple was notified about this, officially, on Jan 22nd, 2009. Ticket number 6517425. After getting back to me and asking for some follow-up info, they proceeded to roll the ticket into another one (6237420). This ticket, apparently, was not related, and after I told our Sales Engineer about this, he had them un-merge the tickets. Apple then rolled my bug into another ticket, 5859645. An even older ticket! From what I’ve gathered, this ticket may be related to some lower-level issue than AFP… either filesystem level (perhaps ACLs?!?) or even general I/O level.
All the while, I am in contact with someone in Minnesota who is having the same issue, and has also opened tickets (and has the luxury of having AppleCare for 10.5 server, the high end AppleCare to boot). He had two open case numbers with them. He even had a regional service engineer come by and take a look at his system. The engineer said it was set up correctly, and that there was nothing more they could do to help alleviate the problem until a patch was available.
So, also during this time, someone from London contacted me and said he was having the same issue as well. He has a paid Developer account, so he tried a beta of 10.5.7. It did not fix the issue. Around this time, I downgraded to 10.5.4 hoping the issue would be lessened (long story short, it wasn’t). But a few weeks later, the gent from London said he’d fixed his problem by removing the “deny all” ACL from all his share points and the folders within them. The “deny all” ACL was added around 10.5.4 or so to mitigate something… no one’s sure what. Anyway, he then told Apple about this “fix” and they replied that it’s an “unacceptable workaround” and that they’re working on a fix. He did this on April 9th.
Well, so, 10.5.7 dropped last Tuesday (May 12th, 2009). I installed it on the server experiencing the issue Friday night, at about 2am. I didn’t have a single crash until Sunday, May 17th, 2009, at 5:52pm. Same exact error.
So, not only was Apple notified AT LEAST 110 days prior to 10.5.7 shipping, but they were notified of an actual “fix” about 33 days beforehand. I really wish Apple’s bug database was public, so that I could post links to my bugs, but, alas, it is not.
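For anyone who wants to check those day counts, here is a quick sketch using the dates given in this post (original report on Jan 22nd, workaround reported on April 9th, 10.5.7 released on May 12th). The variable and function names are just mine for illustration.

```python
from datetime import date

# Dates as stated in this post (all in 2009).
first_report = date(2009, 1, 22)         # ticket 6517425 filed with Apple
workaround_reported = date(2009, 4, 9)   # ACL removal "fix" reported to Apple
release_10_5_7 = date(2009, 5, 12)       # 10.5.7 shipped


def days_between(earlier, later):
    """Number of calendar days from `earlier` to `later`."""
    return (later - earlier).days


print(days_between(first_report, release_10_5_7))         # 110
print(days_between(workaround_reported, release_10_5_7))  # 33
```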
However, here are a few threads on the issue:
At this point, I’m going to start actively poking buttons and prodding people until I get an answer. The last email I sent to firstname.lastname@example.org resulted in the pat response, “There is no new information at this time”. What a load of horse crap. They know of at least one “option”… the least they could do would be to educate someone having this issue about that “fix” and its repercussions. Given the amount of time that 10.5.7 took to hit the street, and how far in advance I notified them about this bug, I have very little hope this will get fixed before 10.6. If we’re lucky, we’ll see the fix backported, but I doubt it.
To cap this all off, the main reason I’m posting this is for posterity, as well as the hope that anyone else that has this bug can actually see they’re not alone! And that they can contact Apple and say “hey, I have some bug numbers here of others having this issue”. If you are having this issue, please, don’t hesitate to contact me and I’ll work to get you in contact with others having this issue, or with someone at Apple that will actually listen.
UPDATE 1: Today I got a call from the local Education SE, who has created an escalation of this issue. Assuming it gets signed off by his boss, I should be hearing from Apple Engineering in the next few days… which is good since AFP crashed 5 times today. I have decided, in the interim, to remove the “group:everyone deny delete” ACL from many of the home folders on the server. Hopefully this will ease the problem. We’ll have to see. And I’ll post more once I hear from Engineering.
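If you want to try the same ACL removal, something like the following might be a starting point. It is only a rough sketch, not the exact procedure I ran: it assumes the entry shows up as “group:everyone deny delete” in `ls -led` output, that the home folders sit directly under one parent directory (the `/Volumes/Homes` path here is a made-up example), and that macOS’s `chmod -a` is used to remove a matching ACL entry. It only builds the commands so you can review them first; nothing is changed on disk.

```python
import os

# ACL entry text as quoted in this post; confirm the exact form on your
# own server with `ls -led <folder>` before removing anything.
ACL_ENTRY = "group:everyone deny delete"


def build_acl_removal_commands(homes_root):
    """Return one `chmod -a` argument list per immediate subfolder.

    Dry run only: the commands are built and returned, never executed.
    """
    commands = []
    for name in sorted(os.listdir(homes_root)):
        path = os.path.join(homes_root, name)
        if os.path.isdir(path):
            # `chmod -a "<entry>" <path>` removes a matching ACL entry on macOS.
            commands.append(["chmod", "-a", ACL_ENTRY, path])
    return commands


if __name__ == "__main__":
    homes_root = "/Volumes/Homes"  # hypothetical location of the home folders
    if os.path.isdir(homes_root):
        for cmd in build_acl_removal_commands(homes_root):
            print(" ".join('"%s"' % arg if " " in arg else arg for arg in cmd))
```

Printing the commands first makes it easy to spot-check a handful of folders by hand before touching every home directory, which seems prudent given that Apple itself called this an “unacceptable workaround”.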