This is a big topic, so the advice below is not meant to be comprehensive; it is more of a general troubleshooting walkthrough that you can use to try to identify the problem.
Disk performance is generally measured by the disk latency we experience for reads and writes. For reads going to disk we would like latency between 8 – 15 ms (measured by Avg. Disk Sec/Read), and for cache-backed writes we would like latency between 4 – 8 ms (measured by Avg. Disk Sec/Write). In addition to these, we need to understand how the IOPS (Input/Output Operations Per Second) are being generated on the server; these can be measured by the Disk Reads/sec, Disk Writes/sec, and Disk Transfers/sec counters. We need to understand the IOPS because there is generally an upper limit on the number of IOPS a disk can handle. For example, a 15,000 RPM disk can handle approximately 150 – 225 IOPS; and if we are working with a SAN, then understanding the true volume can be beneficial. It might be difficult to assess how many IOPS your SAN can support; however, you can have an intelligent discussion with your vendor if you know what IOPS you are generating.
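The latency targets and per-disk IOPS figures above can be sketched as a small check. This is a hypothetical helper, not a real tool; the function names and thresholds are illustrative, using the rule-of-thumb ranges from this section.

```python
def classify_latency(avg_disk_sec_read, avg_disk_sec_write):
    """Return simple verdicts for read/write latency. Inputs are in
    seconds, as reported by the Avg. Disk Sec/Read and
    Avg. Disk Sec/Write counters."""
    read_ms = avg_disk_sec_read * 1000
    write_ms = avg_disk_sec_write * 1000
    read_ok = read_ms <= 15      # 8 - 15 ms target for reads from disk
    write_ok = write_ms <= 8     # 4 - 8 ms target for cache-backed writes
    return read_ok, write_ok

def estimated_spindle_count(measured_iops, iops_per_disk=150):
    """Rough number of 15,000 RPM disks (~150 - 225 IOPS each) needed
    to sustain the measured Disk Transfers/sec, using the conservative
    end of the range by default."""
    return -(-measured_iops // iops_per_disk)   # ceiling division

print(classify_latency(0.012, 0.006))   # -> (True, True)
print(estimated_spindle_count(1200))    # -> 8
```

For example, a server generating 1200 transfers/sec would need roughly eight such disks behind the LUN just to keep up, before any RAID write penalty is considered.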
For SAN communication over Fibre Channel, we also have to consider the HBA Queue Depth setting. By default the HBA queue depth nowadays is set to 32, which is usually sufficient; however, we can use SQLIO (a stress-testing tool) and SQLIOSim (which tests the SAN/server with I/O patterns similar to those SQL Server generates) to test other settings, and in some cases a higher HBA queue depth can lead to better performance. I recommend working with your SAN and HBA vendors to figure out the best setting for you.
After this we have the throughput being generated: we can look at how many bytes are being transferred per second (Avg. Disk Bytes/Transfer multiplied by Disk Transfers/sec). This can help us assess the total data being pushed through the HBA (Fibre Channel) or the network (iSCSI). A 4Gb/s HBA can support at most about 476MB/s of throughput; a 10Gb/s HBA can support at most about 1192MB/s. However, both methods are also limited by the fibre and switches the traffic must travel through and the speed they negotiate.
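The maximum-throughput figures above come straight from the link speed. A minimal sketch of the conversion, assuming the nominal Gb/s rating is divided by 8 bits per byte and expressed in binary megabytes (which reproduces the 476MB/s and 1192MB/s numbers):

```python
def link_max_mb_per_sec(gigabits_per_sec):
    """Convert a nominal link speed in Gb/s to a theoretical MB/s
    ceiling (binary MB), ignoring protocol overhead and negotiation."""
    return gigabits_per_sec * 1_000_000_000 / 8 / 2**20

print(int(link_max_mb_per_sec(4)))    # -> 476
print(int(link_max_mb_per_sec(10)))   # -> 1192
```

In practice Fibre Channel encoding and switch hops shave some of this off, so treat these as upper bounds rather than achievable sustained rates.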
We also like to look at the average disk queue length for both reads and writes (Avg. Disk Read Queue Length, Avg. Disk Write Queue Length, and Avg. Disk Queue Length). It is hard to define a healthy range for these values because this calculation used to be done by knowing the number of spindles assigned to a LUN, and on most modern SANs that is difficult to assess. That said, if this number is large, say 30 or 40+, it still represents an issue; if that happens you should also see multiple I/O stalls in SQL Server and poor disk latency.
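The old spindle-based calculation can be sketched as follows. The classic rule of thumb (an assumption here, not something knowable on most modern SANs) was that a sustained queue depth above roughly 2 outstanding I/Os per spindle suggested the LUN was saturated:

```python
def queue_depth_per_spindle(avg_disk_queue_length, spindles):
    """Outstanding I/Os per physical disk behind the LUN. Sustained
    values above ~2 were traditionally read as a saturation signal;
    on a SAN, treat this as a rough indicator only."""
    return avg_disk_queue_length / spindles

# A queue length of 40 on an 8-spindle LUN is 5 outstanding I/Os per disk:
print(queue_depth_per_spindle(40, 8))   # -> 5.0
```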
These counters, together with an understanding of the underlying hardware configuration, can be used to identify bottlenecks in the disk subsystem. For example, let's say we have a blade center enclosure with 16 servers and 4 HBA cards, each 4Gb/s (a total throughput of 1904MB/s), shared among the 16 servers. However, each server in the blade center can access at most two HBAs at any given time, so each server's effective throughput is limited to 952MB/s. That 952MB/s is in turn shared by at least 8 servers within the enclosure; without knowing the other servers' workloads, and assuming an equal distribution, we are left with a final bandwidth limit of 119MB/s per blade center server. Understanding the underlying hardware therefore helps us interpret the numbers we collected: are they a real issue, or are they within acceptable parameters?
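The blade enclosure arithmetic above can be walked through in a few lines. All inputs here describe the hypothetical setup in the example (4 HBAs at 476MB/s each, two visible per blade, 8 blades sharing a path), not any particular product:

```python
def per_server_bandwidth_mb(hba_count, mb_per_hba,
                            hbas_visible_per_server, servers_sharing):
    """Returns (enclosure total, per-blade path ceiling, fair share
    per server assuming an even split of the shared path)."""
    total = hba_count * mb_per_hba                          # 4 x 476 = 1904
    per_blade_path = hbas_visible_per_server * mb_per_hba   # 2 x 476 = 952
    fair_share = per_blade_path / servers_sharing           # 952 / 8 = 119
    return total, per_blade_path, fair_share

print(per_server_bandwidth_mb(4, 476, 2, 8))   # -> (1904, 952, 119.0)
```

The point of the exercise is the last number: a measured 110MB/s on one blade may look far below the HBA rating yet be right at the practical ceiling for this topology.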
Another indication of an I/O subsystem bottleneck is that we start seeing stalled I/O messages in the SQL Server error logs. In addition, you can see stalled I/O requests in the SQL Server DMVs (sys.dm_io_pending_io_requests), and the waits being generated on disk, by looking at I/O-related latch waits in sys.dm_os_wait_stats or sys.dm_os_latch_stats.