I totally understand your point. What I mentioned was based purely on NFSv3 and on the number of controller upgrades I have done over the years. This does not mean that NFSv4 is unfit for production use; that would be completely false. NFSv4 is much more optimized and efficient than NFSv3, and there is no doubt about that. We just need a solid understanding and some testing to determine the recommended settings, and there is very little documentation on this, especially around node failover and giveback.
I think it's not just about NFSv4/4.1 on the server side (NetApp/ONTAP); the onus is also on the client side (*nix/VMware) to re-establish the stateID.
According to RFC 7530 (NFSv4):
If the server loses locking state (usually as a result of a restart or reboot), it must allow clients time to discover this fact and re-establish the lost locking state. A client can determine that server failure (and thus loss of locking state) has occurred, when it receives one of two errors. The NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a client ID invalidated by reboot or restart.
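To make the quoted passage a bit more concrete, here is a minimal Python sketch of the decision a client makes when it sees one of those two errors. This is only an illustration of the RFC text, not how ESXi or any real NFS client is implemented. The `Nfs4Status` enum and the `recovery_steps` helper are hypothetical names I made up, and the enum values are just the symbolic error names, not wire values.

```python
from enum import Enum


class Nfs4Status(Enum):
    # Symbolic names only; these are NOT the numeric wire values.
    OK = "NFS4_OK"
    STALE_CLIENTID = "NFS4ERR_STALE_CLIENTID"
    STALE_STATEID = "NFS4ERR_STALE_STATEID"


# The two errors that, per the RFC text quoted above, signal a server restart.
RESTART_ERRORS = {Nfs4Status.STALE_CLIENTID, Nfs4Status.STALE_STATEID}


def recovery_steps(status):
    """Return the ordered recovery actions for a status (hypothetical helper)."""
    if status in RESTART_ERRORS:
        return [
            "establish a new client ID with the server",
            "reclaim opens and locks within the server's grace period",
        ]
    # Any other status needs no restart recovery in this simplified sketch.
    return []


for status in Nfs4Status:
    print(status.value, "->", recovery_steps(status) or "no recovery needed")
```

The point is that both errors lead to the same client-side work, so a failover only goes smoothly if the client completes these steps in the time the server allows.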
I believe more testing is required to determine the recommended settings on the ESX side, alongside the Best Practices guides already published by NetApp and other vendors for NFSv4 adoption. Until then, I would say we must find out the reasons for the failure. Since you have already experienced the time-outs and disconnections first-hand, can we look at the logs on the ESXi side to see which errors were reported, so that we can tune the NFS advanced settings accordingly? There are a number of these; one such setting is 'NFS.DiskFileLockUpdateFreq'. I am not an expert on the NFS protocol, but it is very neatly documented in a public RFC, which helps in understanding the basic nature of the communication between client and server.
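As a starting point for the log review, something like the following Python sketch could pull the NFS-related lines out of a copy of the host log. Please treat it as a rough aid under assumptions: the helper names are mine, the default search term "nfs" is only a guess at what is worth matching, and I am assuming the log of interest is /var/log/vmkernel.log. The exact wording of the ESXi error messages is something we would need to confirm from the real logs.

```python
def filter_log_lines(lines, terms=("nfs",)):
    """Keep lines containing any search term, case-insensitively (hypothetical helper)."""
    lowered = [term.lower() for term in terms]
    return [
        line.rstrip("\n")
        for line in lines
        if any(term in line.lower() for term in lowered)
    ]


def scan_file(path, terms=("nfs",)):
    """Read a log file and return the matching lines.

    The path is supplied by the caller; /var/log/vmkernel.log is an assumption.
    """
    with open(path, errors="replace") as handle:
        return filter_log_lines(handle, terms)
```

For example, `scan_file("vmkernel.log", terms=("nfs", "stale"))` run against a copied log would narrow things down to lines around the failover window, which we could then compare with the controller reboot timestamps.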
The best case would be to get VMware and NetApp to look at the case jointly (which is doable) and investigate the reasons for the lost connections during a controller node reboot.