NFS VCS Solaris 10 Issues

Here's what we figured out along the way (and how to fix it, too ;) For our purposes today (and the way it was then), the NFS cluster component works fine on node-b, but node-a can't mount the NFS resource once it has failed over to node-b.

1. The first thing most people do in any investigation is to check that the basic stuff is up and running. We don't like to be different, so we duly confirmed that all of the required VCS resources were up and online. They were; which explained the puzzling ONLINE state ;)
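For reference, this is the sort of quick status check we mean. Both commands are standard VCS; the grep pattern is just our example, since your resource names will differ:

node-b # hastatus -sum <-- summary of all service groups and resources
node-b # hares -state | grep -i nfs <-- state of just the NFS-related resources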

2. We then proceeded to verify that node-b was, in fact, sharing out the NFS resource. Commands like showmount indicated that it was. A little research showed that the issue we ended up having can also show itself as an RPC failure at this point, but it's best to try step 3 as well, just to be sure the problem isn't confined to a single server (although the fix is the same no matter which way your story goes ;)
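The checks we ran on node-b itself looked something like this (the exported path is just our example). Note that these can all succeed locally even while the same queries fail from the other node, which is a hint at where this story is headed:

node-b # share <-- should list the NFS-shared path
node-b # showmount -e node-b
node-b # rpcinfo -p node-b <-- confirm rpcbind answers on the local node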

3. Then we finally struck gold, and got an actual error, when we tried to hit the mount from node-a:

node-a # showmount -e node-b
showmount: node-b: RPC: Rpcbind failure - RPC: Authentication error
node-a # rpcinfo -p node-b
rpcinfo: can't contact portmapper: RPC: Authentication error; why = Failed (unspecified error)

4. Unspecified errors are the best kind of errors you can get, since there's a much wider variety of possible solutions you can come up with... Or maybe I have that backwards... There's really not much more to step 4. This step is an exercise in surrealism ;)

5. It turns out that the answer lay in changing an rpcbind property away from its default on both servers. The fix actually makes more sense than the way things "usually" work. What we needed to do was allow rpcbind to answer global (remote) requests on both nodes; by default, it was restricted to local_only. Oddly enough, we double-checked other cluster setups we have running, where everything is hunky-dory, and local_only is still set there, too. You need to do these steps on both nodes (or all nodes) in your cluster; here, we're only showing what we typed on the active NFS-resource-sharing node:

node-b # svcprop network/rpc/bind:default | grep local_only <-- See if the local_only property is set
config/local_only boolean true <-- and there it is!

Then move on to fixing the problem (again, on both nodes) by setting the rpcbind configuration to global (which, in the case of rpcbind, actually means setting the local_only property to "false"):

node-b # svccfg
svc:> select network/rpc/bind
svc:/network/rpc/bind> setprop config/local_only=false
svc:/network/rpc/bind> quit

6. Then, just double check to make sure you've gotten it all set up correctly:

node-b # svcprop network/rpc/bind:default | grep local_only
config/local_only boolean true

...well, that's not right, but don't give up just yet! The new value doesn't take effect until the service instance is refreshed. Keep typing. Type, Forrest, Type! ;)

node-b # svcadm refresh network/rpc/bind:default
node-b # svcprop network/rpc/bind:default | grep local_only
config/local_only boolean false

there... that's better.
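As a side note, if you'd rather skip the interactive svccfg session, the same change (plus the refresh from step 6) can be done with two one-liners on each node:

node-b # svccfg -s network/rpc/bind setprop config/local_only=false
node-b # svcadm refresh network/rpc/bind:default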

7. Finally, just make sure you can mount your NFS resource from whichever node isn't currently hosting it. You don't necessarily have to test from both nodes, once you've fixed this issue on both, but why risk the near-future embarrassment?

node-a # showmount -e node-b
export list for node-b:
/our/shared/directory (everyone)
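And, if you want to go the whole way, actually mount it from the non-hosting node (the mount point below is just a throwaway example):

node-a # mkdir -p /mnt/nfstest
node-a # mount -F nfs node-b:/our/shared/directory /mnt/nfstest
node-a # ls /mnt/nfstest
node-a # umount /mnt/nfstest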