I am going to take a small break from the first series of this blog to discuss performance issues and considerations. This will only apply to data warehouses with raw data only (relational, not OLAP cubes).
So our data warehouse went into QA testing this morning and the users came back with some very interesting results. All of our pre-QA performance testing was done against the indexes that was created in our environment, so we were on the mindset that everything was golden because queries were all taking <30 seconds to return. Oh how we were wrong. It’s the normal approach of never letting a developer sign off and test his/her own work; you only test what the system exactly how it was built and have blinders on to any exceptions (that external users will definitely find).
Our main problem relates to query performance. Some ad-hoc queries were taking 15 minutes to process, while others were taking longer than an hour.
The first recommendation that I made to the DBA team was to set the database to read-only mode.
Since this DW will only be updated once a day (3:00am PST), we modified our ETL package to execute the following before any updates/writes occurred:
USE master;
ALTER DATABASE [db_DataWarehouse] SET MULTI_USER;
This allows writes and updates to occur in the DW while we are performing our ETL, and at the end of the package this code is executed to bring the state back into READ-ONLY:
USE master;
ALTER DATABASE [db_DataWarehouse] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
ALTER DATABASE [db_DataWarehouse] SET READ_ONLY;
Setting this option in the DW has increased query performance by an avg of 25% (I will post actual before & after results on this later). This is because the overhead of having to lock rows and tables is eliminated because the optimizer knows that locking is not necessary.
Granted, this is only 1 tip to improving the ad-hoc query performance against our DW. As more improvements are made, I will be sure to share them here.
Cheers,
- Reagan