Posts Tagged ‘Open Directory’

AFP, Kerberos, and 10.6

January 16th, 2010 staze No comments

UPDATE: Monday, after a scheduled outage, I demoted my OD replicas to standalone (safer this way, I think), then ran `mkpassdb -kerberize` on the OD Master. About 5 minutes later, I had gone from 3500 Kerberos principals to about 9500 (about 1500 of those are really old entries that I’ll clear out over the summer). I then added the replicas back. At that point, `kinit username` worked for previously failing users. We shall see.

UPDATE 2: Two days after the above, we have not seen any users having problems logging in. I will be talking to my AppleCare Enterprise friend tomorrow and seeing if he can shed some light on why AFP is trying to use Kerberos even though it’s supposed to only do “Standard” auth. More to come…

UPDATE 3: Well, that was nice while it lasted. Starting Monday of this week, we began getting 2-3 users per day that couldn’t log in. Restarting AFP is the only way to get them logging in again, so I’ve been doing that at about 6:45am each morning. I’ve got an open case with Apple at this point to see what they can figure out. So far, it’s completely stumped them. We’ll have to see.

Starting with 10.5.7, I would occasionally see a small subset of users who, when they tried to log in from a managed client (loginwindow, 10.5.8 client), would get an error stating “You cannot login at this time because an error occurred”. If you then went to an unmanaged computer and used “Go->Connect to Server” to connect to the server over AFP, you would be presented with their home directory, only blank. Connecting over SMB would work, and everything was there.

The only way to make AFP work again would be to restart the AFP process. Obviously, this was really annoying, but I never could figure out the cause. Over the course of the summer break, we upgraded to 10.6 server, and didn’t see any instances of it.

Cue Fall term. We started seeing this problem the first week of the term, though slightly different. First, the clients are still 10.5.8 since we have about 36 PPC machines still in use (all G5 iMacs). Affected users would still get “Unable to login at this time because an error occurred”. So, I email a buddy that works for AppleCare Enterprise and forward on some log entries when it happens. The only things he sees are some IPv6-related messages (which is odd, since IPv6 is disabled), and maybe a Kerberos message… which I don’t think much about at the time. Trying to connect to the server via “Go->Connect to Server” from an unmanaged client over AFP would result in a message saying “You do not have permission for any shares on this server”. Connecting over SMB would show the shares, but trying to mount them would give you a permission denied error.

So, I go over my notes from the 10.5.x server days, and because it seemed to make things better with 10.5 server, I change AFP’s Authentication setting from “Any” to “Standard”. No change in results.

I bang my head against this for several days, trying many different options, but really don’t hit upon anything until an unrelated issue, where I am playing with some ACLs, and notice that if I “Deny” “Full Control” on a folder to a certain group, the folder disappears for that group. Not just “No access”, but it full-on disappears. Huh. So maybe the issue is some kind of permissions thing. But, as my friend at AppleCare Enterprise mentions, the Effective Permissions Inspector (http://docs.info.apple.com/article.html?path=ServerAdmin/10.5/en/c2fs28.html) shows the permissions are fine for the user’s home folder. Okay…

So, I dig around some more, and randomly try “kinit” for an affected user. “kinit: Unable to acquire credentials for ‘user@REALM.EXAMPLE.COM’: Client not found in Kerberos database”. Hmmm. So I try for another affected user… same thing. I try it for all the users I’ve got records for having seen this issue. All are missing Kerberos records. Well shit. So, I use kadmin to add a record for one of the users that’s seeing the problem (`kadmin -p admin -q "addprinc user@REALM.EXAMPLE.COM"`, then type in the admin password, and their password twice). It adds, and after propagating, I can kinit. But AFP still doesn’t work. A few hours later, I try AFP again, and I am allowed to mount their home, but it’s blank. Holy crap! Back to the 10.5.8 symptom. So obviously I’m getting somewhere. Later that night, I restart AFP, and suddenly the user account works perfectly. Ah ha!

K, so I get a list of all the kerberos principals on the server, ~3500. Hmm… given we have about 7600 users in the OD, that seems like a problem. But, after looking at most of the users that are seeing this, I find they’re all older user accounts. Meaning they were created when the OD Master was an older machine (G4 Xserve, or an old Quicksilver) running 10.3.9 or 10.4.x (depending on how old the accounts were). All the newer accounts seem to have Kerberos records. But, when we upgraded to 10.6 Server on the OD from 10.5, it seems ALL accounts got an attribute added that says “altSecurityIdentities: Kerberos:user@REALM.EXAMPLE.COM”. Hmm… I guess I could see this causing an issue.
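A rough sketch of how that comparison (directory users vs. Kerberos principals) could be done. This is only an illustration, not the exact commands used here: the function name `users_without_principal` is made up, and it assumes you have already dumped the user short names into one file (one per line, for example from `dscl /LDAPv3/127.0.0.1 -list /Users`) and the principals into another (for example from `kadmin.local -q listprincs`).

```shell
#!/bin/sh
# Hypothetical helper: print users that have no matching Kerberos principal.
#   $1 = file of user short names, one per line
#   $2 = file of principals, one per line (e.g. user@REALM.EXAMPLE.COM)
#   $3 = realm name (e.g. REALM.EXAMPLE.COM)
users_without_principal() {
    awk -v realm="$3" '
        FILENAME == ARGV[1] {
            # Principals file: remember plain user principals in our realm,
            # skipping service principals such as krbtgt/REALM@REALM.
            suffix = "@" realm
            cut = length($0) - length(suffix)
            if (index($0, "/") == 0 && cut > 0 && substr($0, cut + 1) == suffix)
                have[substr($0, 1, cut)] = 1
            next
        }
        # Users file: print every non-empty name without a principal.
        $0 != "" && !($0 in have) { print }
    ' "$2" "$1"
}
```

Something like `users_without_principal users.txt princs.txt REALM.EXAMPLE.COM` would then list the accounts worth trying `kinit` against.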

So the question, other than “why do these users not have Kerberos principals?” is “Why is AFP using Kerberos if its authentication is set to Standard?” This seems like a bug, or there’s something going on I’m not understanding. Obviously it seems the auth system in SMB changed a bit too between 10.5 and 10.6, since it used to behave differently.

Either way, I’ll be running `mkpassdb -kerberize` on the OD Master on Monday during our systems outage (there is a scheduled 2-hour power outage to test power resiliency on campus). I already ran a test case on a test OD master, and it did add Kerberos entries for all the users, so that’s nice. This should hopefully resolve the issue permanently. I will update this post once I’ve kerberized all users and things work, and I’ll update again later next week once I know whether or not it resolved the issue. I’m also expecting some info back from my friend about why this might be happening with AFP.

One thing I will say… this has really got me looking at Kerberos. Previous to this, I didn’t really use it at all on our systems. But since playing with it, it seems pretty damn cool. =)

Well, more in a few days.

10.6 Server, Xsan 2.2.1, and ACL oddities

December 31st, 2009 staze No comments

UPDATE: So there was one more issue going on with this. After re-reading all the Xsan 2.2 documentation, I found that the primary MDC should be either a replica or the Master OD server. In my setup, the backup MDC is the Master, but the Primary MDC is only “Connected”. Apparently this doesn’t work right. I made the Primary a replica, and everything now works. So, while the below is true, I’d make sure the above is also true if you’re running Xsan.

So, as I talked about earlier, we recently updated to 10.6 Server, and along with that, Xsan 2.2.1. Since then, we’d been seeing odd ACEs (Access Control Entries) on folders that are on the Xsan when viewed from the 10.6 Servers (the 10.5 Server saw everything just fine). The 10.6 Servers would show many of the ACEs as FFFFEEEE-DDDD-CCCC-BBBB-AAAA82xxxxxx, where xxxxxx is a hex equivalent of something (seemingly not the user/group ID).

Removing and reapplying the ACLs wouldn’t help. Some of the entries would show fine, but some would show up as the above no matter what. So obviously there is an issue with the 10.6 machines looking up the user/group associated with that entry (yet 10.5 works).

The solution came to me a few days ago. As I said previously, our Open Directory server has been around for a while. It started life as a 10.1 or 10.2 server, and has been upgraded since that point to 10.6 now. And several of the groups/users have stayed the same on this system since then. This relates to some issues I had a while back with iCal Server not working for our older users. Accounts/Groups back in the 10.2 days didn’t have a UUID created and assigned to them. I fixed this for the user accounts about 10 months ago with a script that generates UUIDs and adds them to the user record. But at the time, I didn’t think about the groups. Now I wish I had. Once I added GeneratedUIDs to the groups that didn’t have them, and then removed and re-added the ACEs, everything seems to have worked. We still have a couple that don’t resolve right visually, but access to the files seems to work fine, so no clue why that’s happening.

All in all, kind of an annoying issue. Apple really should have their upgrade from 10.x to 10.x check for users/groups that don’t have GeneratedUIDs and add them to the record, since some people have thousands of users, and have been upgrading since the days before LDAP (NetInfo is what used to hold directory info).

Ah well. So, anyone having a similar issue, check the inspector in WGM for a GeneratedUID for the group/user in question. My script linked above should easily be able to be modified to add GUIDs for groups as well.

Open Directory oddities…

December 28th, 2009 staze No comments

Just this last weekend, I upgraded our primary systems from 10.5.8 Server to 10.6.2 Server, and the Xsan to 2.2.1 from 2.1.1. All in all, everything went well, though one odd issue arose.

Since the update, I’ve seen something like the following error every 2 hours on the 10.6 machines: “Dec 25 14:03:26 server DirectoryService[29]: Misconfiguration detected in hash ‘Global GID’ – see /Library/Logs/DirectoryService/DirectoryService.error.log for details”

You look in DirectoryService.error.log, and you find:
2009-12-25 14:03:26 PST - T[0x0000000104781000] - Group 'wheel' (/LDAPv3/od.example.com) - ID 0 - UUID 9E733C05-88DE-4F83-9E09-038A887F1327 - SID S-1-5-21-4096-2147483678-1391576524-1001
2009-12-25 14:03:26 PST - T[0x0000000104781000] - Group 'wheel' (/Local/Default) - ID 0 - UUID ABCDEFAB-CDEF-ABCD-EFAB-CDEF00000000 - SID S-1-5-21-4171259825-3059450906-1974363594-1001

This error is there for several system level groups: daemon, kmem, sys, wheel, etc. Basically, the OD clients are all complaining that there is a conflict between the local group “wheel”, and the “wheel” that exists in the directory. These accounts, seemingly, shouldn’t exist within the directory, as they’re local accounts that exist on all the OD clients.
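If you have a lot of these entries, a small filter can pull out which groups are involved. The following is a minimal sketch that assumes the log lines look exactly like the two shown above; the function names `conflicting_groups` and `directory_side_groups` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read DirectoryService.error.log lines on stdin and
# print "group node id" for every line describing a conflicting group.
conflicting_groups() {
    sed -n "s/.*Group '\([^']*\)' (\([^)]*\)) - ID \(-\{0,1\}[0-9][0-9]*\) .*/\1 \2 \3/p"
}

# Hypothetical helper: list (once each) the groups whose conflicting copy
# lives on an LDAPv3 directory node rather than on the local node.
directory_side_groups() {
    conflicting_groups | awk '$2 ~ /^\/LDAPv3\// { print $1 }' | sort -u
}
```

Feeding it the log, e.g. `directory_side_groups < /Library/Logs/DirectoryService/DirectoryService.error.log`, should give a short list of group names to go look at in the directory.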

So, at this point, I think I’m safe removing them from the directory. Looking at an ldif dump of the directory, it shows these groups were created in 2003, when I upgraded the directory server from 10.2 to 10.3 (NetInfo to LDAP).

All told, there are probably 15 of these groups. They all conflict with other groups on the local directories, or are antiquated and don’t need to exist on the directory.

UPDATE: I successfully removed all of these groups, and it seems to have resolved the error messages, and had no ill effects. So, if you’re getting a bunch of the above errors, check to make sure you don’t have some weird group sitting on your directory that’s conflicting with a local system group. In general, all the groups you create should start with GID 1000 or above. There are only a few that are supposed to exist on the directory (admin, staff, domainadmins, domaincomputers, domainusers… I think that’s it).
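A quick way to spot candidates is to list directory groups with a numeric ID below 1000 that aren’t on the short “supposed to be there” list. This is only a sketch: the function name `low_gid_groups` is hypothetical, the allowed names are just the ones mentioned above, and it assumes input lines of the form “name gid” (which is roughly what `dscl /LDAPv3/127.0.0.1 -list /Groups PrimaryGroupID` prints).

```shell
#!/bin/sh
# Hypothetical helper: read "name gid" lines on stdin and print the groups
# whose GID is below the minimum (default 1000) and that are not in the
# allowed list taken from the post.
low_gid_groups() {
    min_gid="${1:-1000}"
    awk -v min="$min_gid" '
        BEGIN {
            n = split("admin staff domainadmins domaincomputers domainusers", names, " ")
            for (i = 1; i <= n; i++) allowed[names[i]] = 1
        }
        # Only consider lines with a numeric second column.
        NF >= 2 && $2 ~ /^-?[0-9]+$/ && $2 + 0 < min + 0 && !($1 in allowed) {
            print $1, $2
        }
    '
}
```

Anything it prints still deserves a manual look before deleting; the list of allowed groups may well be incomplete for your setup.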

Good luck, and hopefully I’ll post again shortly after the new year, once students return from vacation.

Missing OD Attributes

September 18th, 2009 staze No comments

A while back, when 10.5 first shipped, my boss and I were intrigued by iCal Server. And after getting it going, we found that for some reason, we couldn’t set a calendar server for our users. We’d set the calendar in Workgroup Manager (WGM hereafter), hit save, and right after that, WGM would uncheck the box, and set it back to none. Something was weird with our accounts. My boss deleted his account and recreated it, and that fixed it. So, the problem was obviously something with the “old” accounts on the system (e.g. the accounts that had been migrated from 10.1 -> 10.2 -> 10.3 -> 10.4 -> 10.5). Some attribute didn’t get added.

At first glance, the obvious gap was that there was only one authAuthority entry in our user accounts. Just the ApplePasswordServer entry, no Kerberosv5 entry. And deleting and recreating his account remedied that. So, I put it on the back burner because we figured out that iCal Server was kinda buggy, and later found out that the iPhone couldn’t write to it, just read (no good). When 10.6 was announced, and we heard that iCal Server 2 supported read/write from the iPhone (with version 3.x+ of the iPhone software), I started thinking about it again. So, I looked at the attribute, and figured out that a single “sed” would create the attribute I needed. So, I coded up a script for it. And figured, problem solved.

Alas, it was not. While testing a migration of 10.4 email (cyrus) to a 10.6 server (dovecot) using the migrate_mail_data.pl script, I noticed that at the end, the script renames the IMAP spool folders after the user’s UUID. It failed to rename them for several users (like me) because we didn’t have a UUID. Then it hit me: iCal Server’s URLs contain the UUID for the user. No UUID, no iCal Server calendar. DOH! So, I took the Kerberos script, changed it a bit (using uuidgen), and batch added UUIDs for all the users that didn’t have them (script here). Re-ran the migrate_mail_data.pl script, and voilà, all the spools got renamed. Now, I’m not sure if this really fixes the iCal Server 1 problem, since I don’t have an iCal Server set up, but iCal Server 2 works (though it works differently, so this issue doesn’t occur). I also got word from a bug I filed about the initial issue, that Apple thinks 10.6 fixed it, so I’d imagine when you upgrade your OD server to 10.6, it adds missing attributes.
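For anyone wanting the general shape of that batch job, here is a minimal sketch. It is not the script mentioned above, just an illustration of the idea under some assumptions: the function name `emit_generateduid_commands` is made up, the node path and admin name are whatever you pass in, and it deliberately only prints the `dscl` commands (a dry run) rather than running them. It expects the names of records that are missing a GeneratedUID on stdin, one per line, and works for either Users or Groups.

```shell
#!/bin/sh
# Hypothetical dry-run helper: for each record name on stdin, print a dscl
# command that would add a freshly generated GeneratedUID to that record.
#   $1 = directory node (e.g. /LDAPv3/127.0.0.1)
#   $2 = record type (Users or Groups)
#   $3 = directory admin short name used for authentication
emit_generateduid_commands() {
    node="$1"
    record_type="$2"
    admin="$3"
    while IFS= read -r name; do
        # Skip blank lines in the input list.
        [ -n "$name" ] || continue
        # uuidgen prints one new UUID; normalize it to upper case.
        uuid=$(uuidgen | tr 'a-f' 'A-F')
        printf 'dscl -u %s -p %s -create /%s/%s GeneratedUID %s\n' \
            "$admin" "$node" "$record_type" "$name" "$uuid"
    done
}
```

Printing the commands first makes it easy to eyeball the list (and test on a spare OD master) before anything touches the real directory.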

More later, work related. Thought I should keep topics distinct.

Thursday

April 2nd, 2009 staze 1 comment

 

Maybe I’ll get into posting on Thursday.

So, I just added some nifty ajax to poll the power usage every 20 seconds. While the file is only updated every 60 seconds, I figure every 20 seconds is low enough bandwidth and frequent enough to catch each update promptly. If I can figure it out, I’ll try to make this only update every minute, right after the file gets updated (it gets updated about 2-3 seconds after the minute). I mainly did this for my own benefit since I wanted to be able to watch the power usage without reloading the whole page. Maybe I’ll set up an RSS script so I can subscribe to my energy usage. =P No, no twitter… seems like a waste of twitter’s bandwidth.
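The “poll once a minute, right after the update” timing boils down to a bit of modular arithmetic. Here is a small sketch of just that calculation (in shell for convenience, not the page’s actual ajax); the function name `seconds_until_next_poll` and the default offset of 3 seconds are assumptions based on the “2-3 seconds after the minute” figure.

```shell
#!/bin/sh
# Hypothetical helper: given the current time in epoch seconds and the
# second-of-the-minute at which the file has been refreshed (default 3),
# print how many seconds to wait until the next poll.
seconds_until_next_poll() {
    now="$1"
    offset="${2:-3}"
    delay=$(( (offset - now % 60 + 60) % 60 ))
    # If we are exactly on the offset, wait a full minute for the next one.
    if [ "$delay" -eq 0 ]; then
        delay=60
    fi
    echo "$delay"
}
```

For example, at 5 seconds past the minute this gives 58, so the next poll lands 3 seconds into the following minute.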

I was up pretty late last night trying to fix OD replication that hasn’t worked right since my 5 hour marathon session prepping for the term. During that marathon, I demoted and re-promoted the OD Master, and apparently when I set the replica back up, it didn’t work right. Changes to passwords, or creation of user accounts, weren’t getting replicated. So, I reconfigured stuff. I’m pretty convinced this is the best way to do it:

  1. Demote replica to standalone, reboot.
  2. Remove directory server from Directory Utility, if present. Reboot. If not present, go on.
  3. Check Permissions on replica, if things are clean, go to step 4. Otherwise, reboot again.
  4. Promote to replica. Once replication is complete (by watching Password Replication Log in Server Admin, or in /Library/Logs/PasswordServer/), reboot.
  5. Reboot the OD Master. 
  6. Do a test password change, account disable, something. Watch the replication log on the Master, or Replica to make sure replication takes place (You should see the replication nearly instantly if you have the master set to replicate upon any change). 
  7. Only other thing would be to add the directory info for replica SSL support, and change the slapd.plist to allow SSL. See here: http://www.afp548.com/article.php?story=20080624005724638 
  8. Optional: Reboot any systems that pointed at the Replica before you started this. There seems to be some directory caching that kept ahold of the “Incorrect Password/Invalid User” responses that the replica was returning when things weren’t working. 

That pretty much fixed it. Bitch is, it took me nearly an hour to do when I thought it would take about 15 minutes. Some of the reboots could probably be removed, but it seems safer this way. Replication now works better than it ever has, I think (before, no matter what, replication seemed to happen based on a time period, rather than instantly). Easiest way to check this is to make a change on the Master in WGM or via the command line, then do a “mkpassdb -dump | grep username”. You should get back the password slot for that user, and a change date that corresponds to your change on the master.

As always, make sure you are using NTP on all the servers/clients (ideally the same NTP server; even if it’s wrong, at least all the clients think it’s the same time), and that DNS is working (“changeip -checkhostname” is your friend). If either of these things is broken, Kerberos most likely won’t work, and other things may not work right. DNS not working can cause weird issues, as can time differences (Kerberos won’t work AT ALL if the times between client and server are off by more than a few minutes). Thankfully pool.ntp.org exists, and most campuses and corporations run their own NTP servers (AD has one built in, as does Mac OS X Server 10.4 and later).
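To put a number on “a few minutes”: the usual Kerberos default tolerance is 300 seconds (5 minutes) of clock skew. A minimal sketch of that check follows; the function name `within_kerberos_skew` is made up, and how you collect the two timestamps (say, `date +%s` on each machine) is left to you.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if two clocks, given as
# epoch seconds, are within the allowed skew (default 300 seconds).
#   $1 = client time, $2 = server time, $3 = optional max skew in seconds
within_kerberos_skew() {
    client="$1"
    server="$2"
    max="${3:-300}"
    diff=$(( client - server ))
    # Take the absolute value of the difference.
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    [ "$diff" -le "$max" ]
}
```

Usage would be along the lines of `within_kerberos_skew "$client_time" "$server_time" || echo "clock skew too large"`, where both variables are timestamps you gathered yourself.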

See ya all later.