Posts Tagged ‘ACL’

10.6 Server, Xsan 2.2.1, and ACL oddities

December 31st, 2009 staze No comments

UPDATE: There was one more issue going on with this. After re-reading all the Xsan 2.2 documentation, I found it says that the primary MDC should be either an Open Directory replica or the OD Master. In my setup, the backup MDC is the Master, but the primary MDC was only “Connected”. Apparently this doesn’t work right. So, I made the primary a replica, and everything now works. So, while the below is true, I’d make sure the above is also true if you’re running Xsan.

So, as I talked about earlier, we recently updated to 10.6 Server, and along with that, Xsan 2.2.1. Since then, we’d been seeing odd ACEs (Access Control Entries) on folders that are on the Xsan when viewed from the 10.6 Servers (the 10.5 Server saw everything just fine). The 10.6 Servers would show many of the ACEs as FFFFEEEE-DDDD-CCCC-BBBB-AAAA82xxxxxx, where xxxxxx is a hex equivalent of something (seemingly not the user/group ID).
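For anyone wanting to compare those trailing hex digits against their own user/group IDs, here is a minimal Python sketch. It only assumes the fixed prefix quoted above and treats the last eight hex digits as a number; the sample value is made up, and nothing here claims to know what the number actually represents.

```python
import re

# Fixed prefix as quoted in the post; the last 8 hex digits are captured.
PLACEHOLDER = re.compile(
    r"^FFFFEEEE-DDDD-CCCC-BBBB-AAAA([0-9A-F]{8})$", re.IGNORECASE
)

def trailing_value(uuid_text):
    """Return the last 8 hex digits as an int, or None if the text
    does not have the placeholder form."""
    match = PLACEHOLDER.match(uuid_text.strip())
    if match is None:
        return None
    return int(match.group(1), 16)

if __name__ == "__main__":
    sample = "FFFFEEEE-DDDD-CCCC-BBBB-AAAA82000401"  # hypothetical value
    print(trailing_value(sample))
```

Printing the decimal value next to the uid/gid from the directory makes it quick to see whether the two line up or not.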

Removing and reapplying the ACLs wouldn’t help. Some of the ACEs would show fine, but some would show up as the above no matter what. So obviously there was an issue with the 10.6 client looking up the user/group associated with that ACE (yet 10.5 worked).

The solution came to me a few days ago. As I said previously, our Open Directory server has been around for a while. It started life as a 10.1 or 10.2 server, and has been upgraded since that point all the way to 10.6. And several of the groups/users have stayed the same on this system since then. This relates to some issues I had a while back with iCal Server not working for our older users. Accounts/groups back in the 10.2 days didn’t have a UUID created and assigned to them. I fixed this for the user accounts about 10 months ago with a script that generates UUIDs and adds them to the user record. But at the time, I didn’t think to do the same for the groups. Now I wish I had. Once I added GeneratedUIDs to the groups that didn’t have them, and then removed and re-added the ACEs, everything seems to have worked. We still have a couple that don’t resolve right visually, but access to the files seems to work fine, so no clue why that’s happening.

All in all, kind of an annoying issue. Apple really should have their upgrade from 10.x to 10.x check for users/groups that don’t have GeneratedUIDs and add them to the record, since some people have thousands of users, and have been upgrading since the days before LDAP (NetInfo is what used to hold directory info).

Ah well. So, anyone having a similar issue, check the Inspector in WGM for a GeneratedUID on the group/user in question. My script linked above should be easy to modify to add GeneratedUIDs for groups as well.
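As a rough illustration of the group fix (this is not the script mentioned above, just a sketch), the Python below parses a listing of groups and builds the dscl command that would add a GeneratedUID to each group missing one. The assumptions: the listing comes from something like `dscl <node> -list /Groups GeneratedUID`, a group without the attribute shows up as a name with no value after it, group names contain no spaces, and the node path `/LDAPv3/127.0.0.1` is only an example. The commands are built, not run, and writing to the directory would also need admin authentication.

```python
import uuid

def records_missing_value(listing):
    """Return record names that appear with no attribute value.

    Assumes one record per line: a name, optionally followed by
    whitespace and the attribute value.
    """
    missing = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) == 1:  # name only, no GeneratedUID printed
            missing.append(fields[0])
    return missing

def create_command(node, group, new_uid=None):
    """Build (but do not run) a dscl argv that sets GeneratedUID on a group."""
    if new_uid is None:
        new_uid = str(uuid.uuid4()).upper()
    return ["dscl", node, "-create", "/Groups/" + group, "GeneratedUID", new_uid]

if __name__ == "__main__":
    # Hypothetical listing; real output would come from dscl.
    sample = "staff    4A1B2C3D-0000-0000-0000-000000000014\noldgroup\n"
    for name in records_missing_value(sample):
        print(" ".join(create_command("/LDAPv3/127.0.0.1", name)))
```

Reviewing the printed commands before running any of them is the point of building them as lists first. The ACEs still need to be removed and re-added afterwards, as described above.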

My AFP problem

May 18th, 2009 staze No comments

Since January of this year, I’ve been actively seeing AppleFileServer crash regularly on a server at work. This server is our primary student account server, which at any given time has about 40-80 students logged in (network home directories).

Many days, AFP crashes several times. Every time, it’s the same error: kern_protection_failure. The thread that crashes is always talking about ByteRangeLockTreeKey. The only good thing about this problem is that AFP seemingly comes back up, and people’s computers reconnect (go autofs!). But this is a very poor consolation prize, since for some people this does cause a problem. Anyone with Mail open usually gets an error about not being able to access their inbox, asking whether they want to rebuild or quit. Some others occasionally get Final Cut project file corruption (this is rare, and only seems to impact those that have their autosave vault set to their home directory, and not the local HD).

So, Apple was notified about this, officially, on Jan 22nd, 2009. Ticket number 6517425. After getting back to me and asking for some follow-up info, they proceeded to roll the ticket into another one (6237420). This ticket, apparently, was not related, and after I told our Sales Engineer about this, he had them un-merge the tickets. Apple then rolled my bug into another ticket, 5859645. An even older ticket! From what I’ve gathered, this ticket may be related to some lower level issue than AFP… either filesystem level (perhaps ACLs?!?) or even general I/O level.

All the while, I am in contact with someone in Minnesota who is having the same issue, and has also opened tickets (and has the luxury of having AppleCare for 10.5 Server, the high end AppleCare to boot). He had two open case numbers with them. He even had a regional service engineer come by and take a look at his system. The engineer said it was set up correctly, and that there was nothing more they could do to help alleviate the problem until a patch was available.

So, also during this time, someone from London contacts me and says he’s having the same issue as well, and has a Developer account (the paid kind), so he tries a beta of 10.5.7. It does not fix the issue. Around this time, I downgrade to 10.5.4 hoping the issue will be lessened (long story short, it isn’t). But, a few weeks later, the gent from London says he’s fixed his problem by removing the “deny all” ACL from all his share points and folders within share points. The “deny all” ACL was added around 10.5.4 or so to mitigate something… no one’s sure what. Anyway, he then tells Apple about this “fix” and they reply that it’s an “unacceptable workaround” and that they’re working on a fix. He did this on April 9th.

Well, so, 10.5.7 dropped last Tuesday (May 12th, 2009). I installed it on the server experiencing the issue Friday night, at about 2am. I didn’t have a single crash until Sunday, May 17th, 2009, at 5:52pm. Same exact error.

So, not only was Apple notified AT LEAST 110 days prior to 10.5.7 shipping, but they were notified of an actual “fix” about 33 days beforehand. I really wish Apple’s bug database was public, so that I could post links to my bugs, but, alas, it is not.
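For anyone who wants to check those day counts, a quick Python sketch using the dates given in this post (ticket opened Jan 22nd, workaround reported April 9th, 10.5.7 released May 12th, all 2009):

```python
from datetime import date

reported = date(2009, 1, 22)             # ticket opened with Apple
workaround_reported = date(2009, 4, 9)   # "deny all" workaround reported
released = date(2009, 5, 12)             # 10.5.7 release date

days_since_report = (released - reported).days
days_since_workaround = (released - workaround_reported).days

print(days_since_report, days_since_workaround)  # prints "110 33"
```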

However, here are a few threads on the issue:

    http://www.afp548.com/forum/viewtopic.php?showtopic=23311
    http://discussions.apple.com/thread.jspa?threadID=1975848
    http://discussions.apple.com/thread.jspa?messageID=8857952

At this point, I’m going to start actively poking buttons and prodding people until I get an answer. The last email I sent to devbugs@apple.com resulted in the “pat” answer, “There is no new information at this time”. What a load of horse crap. They know of at least one “option”… the least they could do would be to educate someone having this issue about that “fix” and its repercussions. Given the amount of time that 10.5.7 took to hit the street, and how far in advance I notified them about this bug, I have very little hope this will get fixed before 10.6. If we’re lucky, we’ll see the fix backported, but I doubt it.

To cap this all off, the main reason I’m posting this is for posterity, as well as the hope that anyone else that has this bug can actually see they’re not alone! And that they can contact Apple and say “hey, I have some bug numbers here of others having this issue”. If you are having this issue, please, don’t hesitate to contact me and I’ll work to get you in contact with others having this issue, or with someone at Apple that will actually listen.

UPDATE 1: Today I got a call from the local Education SE, who has created an escalation of this issue. Assuming it gets signed off by his boss, I should be hearing from Apple Engineering in the next few days… which is good since AFP crashed 5 times today. I have decided, in the interim, to remove the “group:everyone deny delete” ACL from many of the home folders on the server. Hopefully this will ease the problem. We’ll have to see. And I’ll post more once I hear from Engineering.
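To see which home folders carry a deny entry before removing anything, something like the Python sketch below could be pointed at saved `ls -le` style output. The line format it expects (an index, a colon, a principal, an optional `inherited` flag, then `allow` or `deny` and the permissions) is an assumption based on how such listings usually look, and the sample text is made up.

```python
import re

# Matches ACE lines such as " 0: group:everyone deny delete".
ACE_LINE = re.compile(
    r"^\s*(\d+):\s+(\S+)\s+(?:inherited\s+)?(allow|deny)\s+(.+)$"
)

def deny_entries(ls_output):
    """Return (index, principal, permissions) for every deny ACE found."""
    found = []
    for line in ls_output.splitlines():
        match = ACE_LINE.match(line)
        if match and match.group(3) == "deny":
            found.append(
                (int(match.group(1)), match.group(2), match.group(4).strip())
            )
    return found

if __name__ == "__main__":
    # Hypothetical listing for one home folder.
    sample = (
        "drwx------+ 15 student  staff  510 May 18 09:00 /Users/student\n"
        " 0: group:everyone deny delete\n"
        " 1: user:student allow list,search\n"
    )
    print(deny_entries(sample))
```

This only reports entries; the removal itself would still be done by hand or with chmod once the list looks right.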

Systems upgrade tonight through Monday…

June 20th, 2008 staze No comments

So, I get the ability to take stuff down for more than an hour! Starting tonight, I get to take down the SAN, back up everything onto another array, upgrade Xsan (to 2.1), wipe the SAN (to facilitate upping the block size from 4k to 8k), then copy everything back. I could just do an upgrade of the system, but that wouldn’t let me change the block size… 

Actually… in looking online… maybe I shouldn’t change my block size. I mean, there are a lot of prefs and such that are small files (smaller than 4k) that would balloon to 8k with a block size change.
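A back-of-the-envelope way to see the ballooning, as a Python sketch. It assumes every file occupies a whole number of blocks and ignores metadata and anything Xsan-specific, so treat the numbers as an illustration only.

```python
def allocated_bytes(file_size, block_size):
    """Space a file takes if it must occupy whole blocks."""
    if file_size == 0:
        return 0
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size

if __name__ == "__main__":
    for size in (1000, 3000, 5000):
        print(size, allocated_bytes(size, 4096), allocated_bytes(size, 8192))
```

Under that assumption, a 1000 byte prefs file takes 4096 bytes with 4k blocks and 8192 bytes with 8k blocks, while a 5000 byte file takes 8192 bytes either way. So it’s only the files under 4k that double.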

Okay… hmmm… maybe I’ll backup, install, wipe, recreate with a 4k block size, then copy stuff back. I pretty much need to do the wipe to get ACLs back working. 

So, yeah… fun weekend of the constant: start a job, wait, check it, wait, check it, wait… start another, wait…

Categories: System Administration