UPDATE: Monday after a scheduled outage, I demoted my OD replicas to standalone (safer this way, I think), then ran `mkpassdb -kerberize` on the OD Master. About 5 minutes later, I had gone from 3500 Kerberos Principals to about 9500 (about 1500 of those are really old entries that I’ll clear out over the summer). I then added the replicas back. At that point, `kinit username` for previously failing users. We shall see.
UPDATE 2: Two days after the above, we have not seen any users having problems logging in. I will be talking to my AppleCare Enterprise friend tomorrow and seeing if he can shed some light on why AFP is trying to use Kerberos even though it’s supposed to only do “Standard” auth. More to come…
UPDATE 3: Well, that was nice while it lasted. Starting this week, on Monday, we started getting 2-3 users per day that couldn’t log in. Restarting AFP is the only way to get them logging in again. So, I’ve been doing that at about 6:45am each morning. So, I’ve got an open case with Apple at this point seeing what they can figure out. So far, it’s completely stumped them. We’ll have to see.
Starting with 10.5.7, I would occasionally see users (a small subset of users) that when they tried to login from a managed client (loginwindow, 10.5.8 client), they would get an error stating “You cannot login at this time because an error occurred”. If you then went to a computer that was unmanaged, and attempted to do a “Go-Connect to Server” and connect to the server over AFP, you would be presented with their home directory, only blank. Trying to connect over SMB would work, and everything was there.
The only way to make AFP work again would be to restart the AFP process. Obviously, this was really annoying, but I never could figure out the cause. Over the course of the summer break, we upgraded to 10.6 server, and didn’t see any instances of it.
Queue Fall term. We started seeing this problem the first week of the term, though slightly different. First, the clients are still 10.5.8 since we have about 36 PPC machines still in use (all G5 iMacs). Affected users would still get “Unable to login at this time because an error occurred”. So, I email a buddy that works for AppleCare Enterprise and forward on some log entries when it happens. Only things he sees are some IPv6 related messages (which is odd, since IPv6 is disabled), and maybe a Kerberos message… which, I don’t think much about at the time. Trying to connect to the server from “Go->Connect to Server” from an unmanaged client over AFP would result in a message saying “You do not have permission for any shares on this server”. Over SMB would result in seeing the shares, but trying to mount them would give you a permission denied error.
So, I go over my notes from the 10.5.x server days, and because it seemed to make things better with 10.5 server, I change AFP’s Authorization from “Any” to “Standard”. No change in results.
I bang my head against this for several days, trying many different options, but really don’t hit upon anything until an unrelated issue, where I am playing with some ACLs, and notice that if I “Deny” “Full Control” on a folder to a certain group, the folder disappears for that group. Not just “No access”, but it full on disappears. Huh. So many the issue is some kind of permissions thing. But, as my friend at AppleCare Enterprise mentions, the Effective Permissions Inspector (http://docs.info.apple.com/article.html?path=ServerAdmin/10.5/en/c2fs28.html) shows the permissions are fine for the user’s home folder. Okay…
So, I dig around some more, and randomly try “kinit” for an affected user. “kinit: Unable to acquire credentials for ‘[email protected]’: Client not found in Kerberos database”. Hmmm. so I try for another affected user… same thing. I try it for all the users I’ve got records for having seen this issue. All are missing kerberos records. Well shit. So, I use kadmin to add a record for one of the users that’s seeing the problem (`kadmin -p admin -q addprinc [email protected]` then type in the admin password, and their password twice). It adds, and after propagating, I can kinit. But, AFP still doesn’t work. Few hours later, I try AFP again, and I am allowed to mount their home, but it’s blank. Holy crap! Back to the 10.5.8 symptom. So obviously I’m getting somewhere. Later that night, I restart AFP, and suddenly the user account works perfectly. Ah ha!
K, so I get a list of all the kerberos principals on the server, ~3500. Hmm… given we have about 7600 users in the OD, that seems like a problem. But, after looking at most of the users that are seeing this, I find they’re all older user accounts. Meaning they were created when the OD Master was an older machine (G4 Xserve, or an old Quicksilver) running 10.3.9 or 10.4.x (depending on how old the accounts were). All the newer accounts seem to have Kerberos records. But, when we upgraded to 10.6 Server on the OD from 10.5, it seems ALL accounts got an attribute added that says “altSecurityIdentities: Kerberos:[email protected]”. Hmm… I guess I could see this causing an issue.
So the question, other than “why do these users not have kerberos principals?” is “Why is AFP using Kerberos if it’s authorization is set to Standard?” This seems like a bug, or there’s something going on I’m not understanding. Obviously it seems the auth system in SMB changed a bit too between 10.5 and 10.6, since it used to behave differently.
Either way, I’ll be running “mkpassdb -kerberize” on the OD Master on Monday during our systems outage (there is a scheduled, 2 hour, power outage to test power resiliency on campus) (I already ran a test case on a test OD master, and it did add kerberos entries for all the users. So, that’s nice). This should hopefully resolve this issue permanently. I will update this post once I’ve kerberized all users, and things work, and I’ll update again later next week once I know whether or not it resolved the issue. I’m also expecting some info back from my friend about why this might be happening with AFP.
One thing I will say… this has really got me looking at Kerberos. Previous to this, I didn’t really use it at all on our systems. But since playing with it, it seems pretty damn cool. =)
Well, more in a few days.