Archive

Posts Tagged ‘Performance Counters’

Help! How do I find disk I/O problems?

October 3, 2014

This is a big topic, so the advice below is not meant to be comprehensive; it is more of a general troubleshooting walkthrough that you can use to try to identify the problem.

Disk performance is generally measured by the disk latency we experience for reads and writes.  For reads going to disk we would like latency between 8 and 15 ms (measured by Avg. Disk Sec/Read), and for writes backed by a write cache we would like latency between 4 and 8 ms (measured by Avg. Disk Sec/Write).  In addition, we need to understand how many IOPS (input/output operations per second) are being generated on the server; these can be measured with the Disk Reads/sec, Disk Writes/sec, and Disk Transfers/sec counters.  We need to understand the IOPS because there is generally an upper limit on the number of IOPS a disk can handle; for example, a 15,000 RPM disk can handle approximately 150-225 IOPS.  If we are working with a SAN, then understanding the true volume it can sustain is beneficial.  It might be difficult to assess how many IOPS your SAN can support, but you can have an intelligent discussion with your storage vendor if you know what IOPS you are generating.
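
If you prefer to see the same picture from inside SQL Server rather than perfmon, sys.dm_io_virtual_file_stats exposes cumulative stall times per database file.  Below is a minimal sketch; the column aliases are my own, and the numbers are averages since the last restart rather than a live sample:

-- Cumulative read/write latency per database file since SQL Server last started.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    -- average stall per I/O in milliseconds (NULLIF guards against divide by zero)
    vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_read_latency_ms DESC;

Files whose averages sit well above the 8-15 ms read and 4-8 ms write ranges mentioned above are the ones worth chasing first.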

For SAN communication over Fibre Channel, we also have to consider the HBA queue depth setting.  By default the HBA queue depth nowadays is set to 32, which is usually sufficient; however, we can use SQLIO (a stress-testing tool) and SQLIOSIM (which tests the SAN/server with I/O patterns similar to those SQL Server generates) to test other settings, and in some cases a higher HBA queue depth can lead to better performance.  I recommend working with your SAN and HBA vendors to figure out the best setting for you.

After this we have the throughput being generated: we can look at how much data is being moved per transfer and per second (Avg. Disk Bytes/Transfer and Disk Bytes/sec).  This helps us assess the total data being pushed through the HBA (Fibre Channel) or the network (iSCSI).  A 4 Gb/s HBA can support at most about 476 MB/s of throughput (4 Gb/s divided by 8 bits per byte is roughly 500 MB/s, or about 476 MiB/s), and a 10 Gb/s link at most about 1192 MB/s.  However, both methods are limited by the fibre and switches the traffic must travel through and by the speed they negotiate.

We also like to look at the average disk queue length for both reads and writes (Avg. Disk Read Queue Length, Avg. Disk Write Queue Length, and Avg. Disk Queue Length).  It is hard to define a healthy range for these values, because the old rule of thumb required knowing the number of spindles assigned to a LUN, which is difficult to assess on most modern SANs.  That said, if this number gets large, say 30 or 40+, it still represents an issue; when that happens you should see multiple I/O stalls in SQL Server, and disk latency will be poor.

These counters, together with an understanding of the underlying hardware configuration, can be used to find bottlenecks in the disk subsystem.  For example, let's say we have a blade center enclosure with 16 servers and 4 HBA cards, each 4 Gb/s (a total throughput of 1904 MB/s), shared among the 16 servers.  However, each server in the blade center can access at most two HBAs at any given time, so each server's effective throughput is limited to 952 MB/s.  That 952 MB/s is then shared by at least 8 servers within the enclosure, so without knowing the other servers' workloads, and assuming equal distribution, we are left with a final bandwidth limit of 119 MB/s per blade server.  Understanding the underlying hardware therefore helps us interpret the numbers we collect: are they an issue, or are they within acceptable parameters?

Another indication of an I/O subsystem bottleneck is stalled I/O messages appearing in the SQL Server error log.  In addition, you can see stalled I/O requests in the SQL Server DMVs (sys.dm_io_pending_io_requests) and the waits being generated by disk by looking at I/O and latch waits in sys.dm_os_wait_stats or sys.dm_os_latch_stats.
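
As a starting point, here is a minimal sketch of both checks; the PAGEIOLATCH/WRITELOG filter is just my assumption about which wait types are the most interesting for disk issues:

-- I/O requests SQL Server has issued that the OS/disk has not completed yet.
SELECT io_type, io_pending, io_pending_ms_ticks, scheduler_address
FROM sys.dm_io_pending_io_requests
ORDER BY io_pending_ms_ticks DESC;

-- Cumulative I/O-related waits since startup.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%' OR wait_type IN ('WRITELOG', 'IO_COMPLETION')
ORDER BY wait_time_ms DESC;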

“SQL Server is running slow!?!” Help!

April 15, 2013

How many times have we heard from an end user that SQL Server is running slow, but we can't track down why or what happened?  In this post I hope to point out some of the things you can consider.  I'll approach this scenario two ways: 1) SQL Server is running slow because the processor is pinned, and 2) SQL Server is running slow but the processor is not pinned.

SQL Server is running slow because the processor is pinned

What are some of the things that require CPU in SQL Server?  When dealing with a processor-related issue, we need to know what might cause CPU utilization and what kind of utilization we have (privileged or user).  Before we start troubleshooting any kind of issue on the server, ask the first obvious question: is it really sqlservr.exe (the SQL engine) causing the issue?  In the CPU's case it is pretty easy; we can look at Task Manager.  However, if we want to use Windows Performance Monitor (perfmon), then we should look at at least three counters: Processor\% Processor Time (_Total), Process\% User Time (sqlservr), and Process\% Privileged Time (sqlservr).

An interesting difference between the Processor and Process counters: the Processor counter is an average across all processors visible to the operating system, whereas the Process counter is a summation across all processors visible to the operating system.

Image 1: Processor Utilization

If we look at the utilization, Process\% User Time is much higher than Processor\% Processor Time, so to see the true utilization for SQL Server from the Process counters we need to average the value across processors.  When looking at Processor\% Processor Time, I would ideally like the values to stay below a maximum of around 70-80%; if they go above that for an extended period of time, then we need to figure out what is causing the CPU utilization.  Maybe it is time to migrate some workloads off this server, maybe it is time to purchase new hardware, or maybe we have runaway queries and need to do some performance tuning.  All of this normally depends on having a proper baseline of these values.  However, let's say we do have an issue; now what?
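
One way to answer "is it really SQL Server?" without leaving SQL Server is the scheduler monitor ring buffer, which records CPU utilization roughly once a minute.  This is a common community query, shown here as a minimal sketch:

-- Recent CPU utilization as recorded by SQL Server itself (newest first).
-- ProcessUtilization is sqlservr.exe; the remainder is other processes plus idle.
SELECT TOP (30)
    record.value('(./Record/@id)[1]', 'int') AS record_id,
    record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_pct,
    record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS system_idle_pct
FROM (
    SELECT CONVERT(xml, record) AS record
    FROM sys.dm_os_ring_buffers
    WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
      AND record LIKE '%<SystemHealth>%'
) AS rb
ORDER BY record_id DESC;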

First thing to note: everything in SQL Server uses CPU.  Like life, nothing comes free.  Memory reads (logical reads), disk reads/writes (physical reads/writes), hash operations, execution plans, etc. all require CPU.  The question is: what is using up the CPU?

Maybe it's physical operations?

If it's physical operations, I should see my Process\% Privileged Time higher than normal.  Generally for SQL Server we recommend this value not exceed 25-30%.  In addition, we should see high disk usage or some kind of memory pressure that is causing SQL Server to write dirty pages out, or a high number of scans causing a large number of physical reads.  So we potentially need to look at the Logical Disk or Physical Disk counters, the SQL Server: Access Methods counters, and the SQL Server: Buffer Manager counters.  In SQL Server we might see PAGEIOLATCH_* wait types in the sys.dm_os_waiting_tasks DMV.
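
A minimal sketch of that last check, listing sessions currently stuck waiting on physical page I/O (the ordering is just my preference):

SELECT wt.session_id, wt.wait_type, wt.wait_duration_ms, wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.wait_type LIKE 'PAGEIOLATCH%'
ORDER BY wt.wait_duration_ms DESC;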

Maybe it's logical operations?

If it's logical operations, I should not see a high number in Processor\% Privileged Time but should see a high number in Process\% User Time.  In SQL Server I might want to run a query against sys.dm_exec_query_stats to get the most expensive queries by logical reads and see what is causing the excessive logical reads.  You can look at the Get Top 50 SQL Statements/Query Plans post on how to get this information.
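
The linked post has the full version; as a minimal sketch (the TOP value and columns here are arbitrary), something along these lines pulls the heaviest cached statements by logical reads:

SELECT TOP (10)
    qs.total_logical_reads,
    qs.execution_count,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;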

Maybe it's hash operations?

Again, this should show up as user time on the CPU, and if I have lots of hash operations, I have the potential for high TempDB usage.  So I might look at the counter SQLServer:Access Methods\Workfiles Created/sec in relation to SQLServer:SQL Statistics\Batch Requests/sec; we like to see this ratio around 1:5.  In SQL Server we might want to look for queries with a high amount of worker time.  In addition, we might see the SOS_SCHEDULER_YIELD wait_type in sys.dm_os_waiting_tasks if the box is very busy.
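
Both of those counters are also exposed in sys.dm_os_performance_counters as cumulative values, so one way to approximate the ratio is to sample twice and take the difference.  A minimal sketch, where the 10-second window is arbitrary:

DECLARE @wf1 bigint, @br1 bigint, @wf2 bigint, @br2 bigint;

SELECT @wf1 = cntr_value FROM sys.dm_os_performance_counters
WHERE counter_name = 'Workfiles Created/sec';
SELECT @br1 = cntr_value FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';

WAITFOR DELAY '00:00:10';  -- sample window

SELECT @wf2 = cntr_value FROM sys.dm_os_performance_counters
WHERE counter_name = 'Workfiles Created/sec';
SELECT @br2 = cntr_value FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';

-- Per-second rates over the window; compare workfiles to batches against the ~1:5 guideline.
SELECT (@wf2 - @wf1) / 10.0 AS workfiles_created_per_sec,
       (@br2 - @br1) / 10.0 AS batch_requests_per_sec;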

Maybe it's execution plans?

If it's execution plans, then we might have a high number of SQLServer:SQL Statistics\SQL Compilations/sec or SQLServer:SQL Statistics\SQL Re-Compilations/sec.  If it's re-compilations, then in sys.dm_exec_query_stats we might find execution plans with a high plan_generation_num.  Alternatively, we might have high compilations/sec because lots of ad-hoc T-SQL (any code that is not an object within SQL Server, such as a stored procedure or trigger) is being executed (more than your baseline).
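
To find the plans that keep recompiling, here is a minimal sketch against sys.dm_exec_query_stats (the TOP value is arbitrary):

SELECT TOP (10)
    qs.plan_generation_num,
    qs.execution_count,
    st.text AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE qs.plan_generation_num > 1
ORDER BY qs.plan_generation_num DESC;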

SQL Server is running slow but the processor is not pinned

Well, we looked at some of the scenarios where the CPU might be pinned; what about when the CPU is not pinned?  What might be the issue then?  In these cases, if your CPU is not pinned but things are running slow, the problem might be resource contention, which usually means you have blocking in SQL Server.  If you look at Activity Monitor you might see wait types such as LCK_M_*, ASYNC_NETWORK_IO, etc.

We can look at the DMVs sys.dm_os_waiting_tasks, sys.dm_exec_requests, and sys.dm_tran_locks to see where the contention is.  From there we can dig further into the SQL statement using sys.dm_exec_sql_text and then into the execution plan using sys.dm_exec_query_plan.
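
A minimal sketch tying a few of those DMVs together: currently blocked requests, the session blocking each of them, and the statement being run by the blocked side:

SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time,
    r.wait_resource,
    st.text AS blocked_statement
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;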

In addition, if you need to track blocking in SQL Server over time, please look at the Blocked Process Report in SQL Profiler post.

Conclusion

I am sure everyone has run into their fair share of issues; unfortunately the answer is rarely simple, especially when it comes to performance.  However, I hope this post gave you some ideas for your further troubleshooting.  Another thing to consider: what if the issue happened a few hours ago or the day before, and you are only finding out about it now?  That is really outside the scope of this post, but some things that can help in that case are the Management Data Warehouse, looking at previous executions in the DMVs, or using external monitoring tools such as Microsoft System Center Operations Manager.
