A very slow SQL Server

A few days ago I found that replication into a database on one of our SQL 2005 servers had slowed down. The same publisher was replicated to multiple other subscribers without any issue. Since this db was used for reporting, I first checked it for blocking processes:

-- blocking processes in MyDatabase: spids that show up in someone else's blocked column
select * from sysprocesses
where
spid in (select blocked from sysprocesses where spid <> blocked)
and dbid = db_id('MyDatabase')

(The where spid <> blocked clause was needed because since SQL 2000 SP4, "self-blocking" is also reported as blocking, which wasn't interesting in this case.) I expected to spot one of our report writers running a bloated query with massive joins and sophisticated filters, but that wasn't the case. I found nothing. I suspected that restarting the server would make the phenomenon disappear, but as the server was still in a usable state, I decided to dig deeper, because I hate unresolved mysteries.

I started to check the distribution agent history in Replication Monitor, but the only thing I found was that the agents were very slow. I looked at them in Activity Monitor and noticed that they were spending a lot of time in the RESOURCE_SEMAPHORE wait state. The excellent white paper on SQL Server wait types says this usually means that, due to high concurrent load, queries have to wait for a memory grant. So I checked the memory usage a bit.
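
For those who prefer a query over the GUI, something along these lines shows the same waits (a rough sketch against sys.dm_exec_requests and sys.dm_exec_query_memory_grants; I did my own checking in Activity Monitor):

-- requests currently stuck waiting for a memory grant
select session_id, wait_type, wait_time, last_wait_type
from sys.dm_exec_requests
where wait_type = 'RESOURCE_SEMAPHORE'

-- the memory grant queue itself; grant_time stays NULL while a request is still waiting
select session_id, requested_memory_kb, granted_memory_kb, grant_time, wait_time_ms
from sys.dm_exec_query_memory_grants
order by wait_time_ms desc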

In SQL 2000 I would have used DBCC MEMUSAGE, but it disappeared in SQL 2005. The good news is that there's something even better: while DBCC MEMUSAGE showed you only the top 30 memory-consuming objects, the sys.dm_os_buffer_descriptors DMV gives you this information for every object you want. I picked the sample scripts below from the Books Online topic on sys.dm_os_buffer_descriptors and queried the memory usage at both the server and the database level.

-- server level
SELECT count(*) AS cached_pages_count
    ,CASE database_id 
        WHEN 32767 THEN 'ResourceDb' 
        ELSE db_name(database_id) 
        END AS Database_name
FROM sys.dm_os_buffer_descriptors
GROUP BY db_name(database_id) ,database_id
ORDER BY cached_pages_count DESC

-- and database level
SELECT count(*) AS cached_pages_count
    ,name ,index_id 
FROM sys.dm_os_buffer_descriptors AS bd 
    INNER JOIN 
    (
        SELECT object_name(object_id) AS name 
            ,index_id ,allocation_unit_id
        FROM sys.allocation_units AS au
            INNER JOIN sys.partitions AS p 
                ON au.container_id = p.hobt_id 
                    AND (au.type = 1 OR au.type = 3)
        UNION ALL
        SELECT object_name(object_id) AS name   
            ,index_id, allocation_unit_id
        FROM sys.allocation_units AS au
            INNER JOIN sys.partitions AS p 
                ON au.container_id = p.hobt_id 
                    AND au.type = 2
    ) AS obj 
        ON bd.allocation_unit_id = obj.allocation_unit_id
WHERE database_id = db_id()
GROUP BY name, index_id 
ORDER BY cached_pages_count DESC

It turned out that 95% of the total server memory was used by my troublesome database, and within that, 95% (approx. 140K pages) was used by the MSreplication_subscriptions table. That surprised me, as I was under the impression that it was a small table. I decided to query the table with some stats:

set statistics io on
select * from MSreplication_subscriptions 

And got this:

(13 row(s) affected)
Table 'MSreplication_subscriptions'. Scan count 1, logical reads 142543, physical reads 1234, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Shocking… 13 rows, 142,000 reads, and no LOB data in the table… not exactly a perfect day, with an apparently broken database on my hands… I ran DBCC on the table:

DBCC CHECKTABLE ('MSreplication_subscriptions')

It returned this info, 5 hours and 43 minutes later:

DBCC results for 'MSreplication_subscriptions'.
There are 13 rows in 147460 pages for object 'MSreplication_subscriptions'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

I started to get confused… I thought something was broken, but DBCC didn't scream, and I couldn't imagine a corruption so bad that even DBCC would fail to spot it. So I decided to allow myself a break and left the computer. Needless to say, my brain couldn't stop, and within two minutes it occurred to me that I hadn't checked the physical table structure. I went back to query the index stats, because the clustered index contains the records themselves at the leaf level, so querying the clustered index stats provides information about the table itself. For a moment I wondered whether the table had simply become heavily fragmented, but that was a silly idea, given that the pages significantly outnumbered the records (and still no LOB data). So I queried the index details:

SELECT * FROM 
sys.dm_db_index_physical_stats(db_id(), object_id('dbo.MSreplication_subscriptions'), 
null, null, 'detailed')

As you can see, it wasn’t even a DMV, but a DMF, that is, a dynamic management function. It returned everything I needed, and even more:

(Columns that were identical in every row: database_id = 6, object_id = 2073058421, index_id = 1,
partition_number = 1, index_type_desc = CLUSTERED INDEX, alloc_unit_type_desc = IN_ROW_DATA,
index_depth = 5, forwarded_record_count = NULL.)

index_level  avg_fragmentation_in_percent  fragment_count  avg_fragment_size_in_pages  page_count  avg_page_space_used_in_percent
0            98.1227621828284              147942          1.00244014546241            148303      47.3060785767235
1            99.7432605905006              3895            1                           3895        48.5910180380529
2            99                            100             1                           100         50.2309117865085
3            0                             2               1                           2           64.6466518408698
4            0                             1               1                           1           1.42080553496417

index_level  record_count  ghost_record_count  version_ghost_record_count  min_record_size_in_bytes  max_record_size_in_bytes  avg_record_size_in_bytes
0            13            1                   1608002                     31                        4374                      347.846
1            148303        0                   0                           11                        118                       101.346
2            3895          0                   0                           11                        118                       102.433
3            100           0                   0                           11                        118                       102.69
4            2             0                   0                           11                        102                       102.119

At first sight it's a mess, but there is a logic to it. Every row in the result set corresponds to one level of an index on the table (recall that indexes are B-tree structures in SQL Server). We have a single clustered index here, with 5 levels (see the index_depth and index_level columns). The page_count column shows the number of pages used by that level and record_count the number of records on it. On level 0, that is, on the leaf level, we have 13 records but 148303 pages (the table was growing all the time, hence the different page counts above). For the solution, let's check the two additional, alternative record count columns.
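
By the way, if those are the only columns you care about, the same call can be stripped down; a small sketch (note that the ghost counts come back as NULL in LIMITED mode, so DETAILED is needed for the full picture):

-- just the page/record bookkeeping, one row per index level
select index_level, page_count, record_count,
    ghost_record_count, version_ghost_record_count
from sys.dm_db_index_physical_stats(db_id(), object_id('dbo.MSreplication_subscriptions'),
    null, null, 'detailed')
order by index_level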

The ghost_record_count column tells you how many records are still physically present in the pages but are logically gone, due to a delete (or an update that needed more space). Such records are removed by the background ghost cleanup task, so if the cleanup is quick enough you may never see anything other than 0 in this column. The other column, version_ghost_record_count, counts records that are logically no longer in the table but cannot be cleaned up yet, because an open transaction using row versioning (running under snapshot isolation) may still need the original version of a row that has since been changed. In our case we have a single ghost record and 1608002 version ghost records. That's pretty tough for a table with 13 rows, even if 113 distribution agents belong to those 13 rows and they run continuously, updating them at least once a minute. So I checked for the oldest transaction:

DBCC OPENTRAN

Transaction information for database 'MyDatabase'.

Oldest active transaction:
    SPID (server process ID): 74
    UID (user ID) : -1
    Name          : user_transaction
    LSN           : (21:201:1)
    Start time    : May  13 2009  1:48:22:733PM
    SID           : 0x010500000000000515000000119d203ab2692ec2891a29a8ed030000
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

And all of this was happening on 28th May. I checked the process and found that it was a distribution agent, run by a DISTRIB.EXE OS process. Given that I had restarted the SQL Agent a few hours earlier, and that there had been no replication latency the day before, I was pretty sure it had been sitting there as a zombie agent since 13th May.
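
For the record, the checks boil down to something like this (a rough sketch; 74 is simply the SPID that DBCC OPENTRAN reported, and the second query is just another view of the same problem through the row versioning DMVs):

-- who owns the suspicious SPID, and when did it last do anything?
select spid, program_name, hostname, loginame, login_time, last_batch, open_tran
from sysprocesses
where spid = 74

-- long-running row versioning transactions that keep version ghost records alive
select session_id, transaction_id, elapsed_time_seconds
from sys.dm_tran_active_snapshot_database_transactions
order by elapsed_time_seconds desc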

I killed the process, the ghost_record_count went up to 1.5M, and an hour later, when I checked back, the whole table was down to 8 pages. Happy ending.
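
In case anyone wants the exact incantation, it was nothing more than this (again a sketch; 74 is the SPID from the DBCC OPENTRAN output above, and sp_spaceused is just a convenient way to watch the table shrink back):

-- kill the zombie distribution agent session holding the ancient transaction
kill 74

-- then check back periodically: once the ghost cleanup has caught up,
-- the reserved/data figures collapse to a handful of pages
exec sp_spaceused 'MSreplication_subscriptions'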

Although this turned out to be a bit of a long post, I hope someone will benefit from this demonstration of the pretty huge armory available to a DBA.

3 Comments

  1. John B:

What would cause the ghost record count to remain so high when there are no open transactions? LOB_DATA ghost_record_count is 14 mil and there are no version ghost records.

    Thanks
    John

  2. Erik:

    John,
No idea, other than the ghost cleanup being turned off with a special trace flag, but that sounds unlikely. If you can post/show the whole result set, it might help.

    thanks,
    Erik

  3. Thomas LeBlanc:

    Great article, I am having a similar problem and this info helped me.

    Thanks,
    Thomas
