Archive

Archive for the ‘Windows 2003 Clustering’ Category

Missing registery settings in cluster nodes for SQL Server

October 27, 2014 Leave a comment

I run into this occasionally, I think in last 3-4 years being SQL Server PFE, I have seen this issue total of 4 times.  So its not common, issue.  So for this post, I’ll use example architecture.  Two-Node Cluster, Node A and NodeB running SQLFCI1 on it.  SQLFCI1 runs fine on NodeA but fails on NodeB.  Looking at the Application Log we see strange messages like “Could not open error log file ”“.  Other messages might around missing various configuration settings that SQL Server needs to start up.  So how can that happen?

When SQL Server is running as a Failover Cluster Instance (FCI); its configuration settings (a.k.a registry keys under HKLM\Software\Microsoft\Microsoft SQL Server) are saved in a cluster hive in registry.  So when the node fails over from active to passive these settings get carried over and applied to passive node.  That is why we have best practice to make all configuration settings on active node only, if you make it on passive node, or if instance is offline.  The Cluster Service will over write them with what it know of the settings.  This is called CheckPoint process.

We can check if all the required SQL Server keys being copied to cluster hive or not.  We can do that from Command Prompt using following command:

cluster.exe . res “SQL Network Name (SQLFCI1)” /CheckPoints

You will get a output similar to below:

Listing registry checkpoints for resource ‘SQL Network Name (SQLFCI1)’…

Resource                   Registry Checkpoint
————————– —————————————————————————-
SQL Network Name (SQLFCI1) ‘SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\Cluster’
SQL Network Name (SQLFCI1) ‘SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\MSSQLServer’
SQL Network Name (SQLFCI1) ‘SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\Replication’
SQL Network Name (SQLFCI1) ‘SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\Providers’
SQL Network Name (SQLFCI1) ‘SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\SQLServerSCP’
SQL Network Name (SQLFCI1) ‘SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\CPE’
SQL Network Name (SQLFCI1) ‘SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\SQLServerAgent’

However if you get only some or no record back we have an issue.

PLEASE NOTE DO THIS ON NODE THAT IS WORKING.  IF YOU DO IT ON NODE THAT IS NOT, YOU WILL LOSE ALL YOUR REGISTRY SETTINGS.

  1. Backup the HKLM\Software\Microsoft\Microsoft SQL Server\ hive on both NodeA and NodeB (just in case, you ignore my warning above/ or murphy’s law kicks in).
  2. Confirm instance is on NodeA, if not failback to NodeA from NodeB (NodeA was the good guy in my scenario above).
  3. Execute following commands to add each of the key registry settings to cluster checkpoint.

cluster.exe . res “SQL Network Name (SQLFCI1)” /Addcheckpoint:”SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\Cluster”
cluster.exe . res “SQL Network Name (SQLFCI1)” /Addcheckpoint:”SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\MSSQLServer”
cluster.exe . res “SQL Network Name (SQLFCI1)” /Addcheckpoint:”SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\Replication”
cluster.exe . res “SQL Network Name (SQLFCI1)” /Addcheckpoint:”SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\Providers”
cluster.exe . res “SQL Network Name (SQLFCI1)” /Addcheckpoint:”SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\SQLServerSCP”
cluster.exe . res “SQL Network Name (SQLFCI1)” /Addcheckpoint:”SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\CPE”
cluster.exe . res “SQL Network Name (SQLFCI1)” /Addcheckpoint:”SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLServer\SQLServerAgent”

After this re-run the /CheckPoints command above to verify they were added successfully.

Adding new disk resource to cluster

January 10, 2010 3 comments

My current configuration includes two node cluster with active-passive configuration.  I had requested the Storage team to add a new LUN to the Cluster so I can relocate tempDB from its current location to separate disk.

They had added the disk to the server; but when I went to Active Node (Node1)’s Computer Management > Storage > Disk Management and tried to initialize the disk I got the following error:

“The requsted operation cannot be completed because the media is write-protected.”

Doing research on the internet lead me to think the LUN was not proper configured for the cluster; I tried Actions > Rescan Disk option also with no success.  The new disk showed up with “” and Not Initialized message on the Node 1. I checked the passive node it was similar icon in Disk Management; I tried to Initialize the disk on Passive Node (Node 2); I was able to initialize it successfully.  I figured when the new LUN was added to the server it would be active on one node only as I couldn’t access it on Node 1.

After initializing the disk on Node 2, I failed all the Cluster Resources over from Node 1 to Node 2.  Since new disk was going to be used by SQL Server and was going to hold the temp database few additional steps must be completed before using the disk.  Format the disk making sure 64KB blocks are used (disk offset should also be 64KB off; Storage guys do that).

Adding New Disk as Cluster Resource:

  • Launch Cluster Administrator on Active Node.
  • Go to SQLServer Group.
  • Right click in left pane, select New.
  • Select Resource.
  • Give it a Name, for example Disk T:
  • Select Resource Type, Physical Disk.
  • Click Next.
  • Select the possible owners for the resource, in my case since it is only two node cluster both nodes get added automatically.
  • Click Next.
  • Select dependencies; that is which resources should be online before Disk T is brought online.  This option for disk is not required unless you are using Mount Points; then you would want to make sure Mount Point Disk is listed as a dependencies.
  • After adding dependencies, click Next.
  • From the Drop Down select the new Physical Disk.
  • Click Finish.

Adding New Disk as Dependencies to SQL Server:

  • Launch Cluster Administrator on Active Node.
  • Go to SQLServer Group.
  • Right Click on SQL Server, select Take Offline.
  • Right Click on SQL Server, select Properties.
  • Select “Dependencies” from Tabs at top.
  • Click Modify.
  • Select the New Disk Resource that was just added.
  • Click OK on both windows to get back to Cluster Admin console.
  • Right Click on SQL Server, select Bring Online.

Just FYI, if you don’t set the dependencies information you will not be able to use that disk in SQL Server; when trying to relocate TempDB you will get the following error from SQL Server:

Msg 5184, Level 16, State 1, Line 1
Cannot use file ‘T:\MSSQL.1\MSSQL\Data\tempdb.mdf’ for clustered server. Only formatted files on which the cluster resource of the server has a dependency can be used. Either the disk resource containing the file is not present in the cluster group or the cluster resource of the Sql Server does not have a dependency on it.

Msg 5184, Level 16, State 1, Line 1
Cannot use file ‘T:\MSSQL.1\MSSQL\Data\templog.ldf’ for clustered server. Only formatted files on which the cluster resource of the server has a dependency can be used. Either the disk resource containing the file is not present in the cluster group or the cluster resource of the Sql Server does not have a dependency on it.

After all this was done I tried failing resources back to Node 1 with no luck, it still gave me similar error on disk.  Talking to Storage guys they assumed it was an Windows issue, but after digging into the configuration settings on DMX it seems you can have different settings for read/write properties for same LUN on two different nodes.  After fixing the setting to read/write on both nodes; drive started working successfully.

Windows 2003 Cluster: SQL Node Not able to Join Cluster after Changing MaxTokenSize

July 12, 2009 Leave a comment

As per Cannot Generate SSPI Context article, Link, I had changed the MaxTokenSize to fix the issue. I had read various articles before jumping into this change as being the only proper solution to the problem.

However after changing the setting and rebooting the passive node the node refused to join the cluster; looking at the error logs I have messages like …

  • Unable to get join version data from sponsor xxx.xxx.xxx.xxx using NTLM package, status 5.
  • Unable to connect to any sponsor node.
  • Failed to join cluster, status 53.
  • Physical Disk : [DiskArb] Failed to read (sector 12), error 170.

Researching on net I could not figure out results; first article I ran into was KB886717, [1]. It suggested the issue might be because the C:\Windows\Cluster folder was over size of 10MB; which in this case it was. So I removed the log file and tried to restart services with no luck.

I started reading through the log file and started to pick through error messages and looked and articles [2], [3], [5], [6], and [7]. I verified all the settings each article suggested, Group Policies, Security, firewall, etc. None of it seems to help in making sure the passive node would join the cluster successfully.

I started trouble ticket with Microsoft; they found few settings, on of being NTML Compatibility Level that needed changing as part of their troubleshooting but even after changing these settings we were still getting NTLM, state 5 error messages in Cluster.log file (State 5 means permissions denied). Talking to Microsoft they referenced few more KB articles in addition to what I had found already like [8], [9], [10], & [11] that indicated what “might” be the issue; but none of them seem to help resolve our issue.

This whole time we had not rebooted the active node as it was working successfully; but since we were hitting stone-wall every turn we decided to further troubleshoot the issue node 1 (active node) must be restarted because the errors that were being generated on Quorum disk. After rebooting Active Node, the Passive Node came active and Clustering was working successfully.

I had not read any KB article indicating the issue with MaxTokenSize and Windows Clustering, and neither had the Microsoft guys. So talking to the Kerberos experts we figured issue was similar to [7], in which if you change password or the password length is less then 15 characters of the Cluster Services account permissions or security settings are not properly hashed and generates errors when authenticating the new node to the cluster.

So if you are changing the MaxTokenSize setting on SQL Server and it is a cluster please make sure you change it on EVERY NODE; or you will have lots of strange issues that probably shouldn’t exist.

Reference Links:
[1] KB886717 Issue with Cluster Log file, Link.
[2] Problems with Microsoft Clusters, Link.
[3] How to manually re-create the Cluster service account, Link.
[4] A Windows Server 2003 based-computer that is running the Cluster service may be unable to join a cluster after the computer is first restarted, Link.
[5] Cluster Service May Not Start After You Restrict Available IP Ports for Remote Procedure Call, Link.
[6] Ask Core!, Troubleshooting Cluster Logs 101 Why did the resource failover to other node?, Link.
[7] Cluster service account password must be set to 15 or more characters if the NoLMHash policy is enabled, Link.
[8] You cannot add an additional node to a Windows Server 2003-based server cluster, and error code “0x8007042b” is logged in the ClCfgSrv.log file, Link.
[9] You receive an “Error 0x8007042b” error message when you add or join a node to a cluster if you use NTLM version 2 in Windows Server 2003, Link.
[10] How to enable NTLM 2 authentication, Link.
[11] Cluster service does not start on joining node in Windows 2000 Cluster, Link.

%d bloggers like this: