In large data-warehousing applications, the efficiency of the storage system is critical. With the same hardware and software release, we have seen queries that never terminated (i.e. were still running after a week) under one configuration finish within hours after these optimizations.
Of course, in the case of a database like DB2 there are many other factors to consider; in this article I’m focusing on the OS-level tuning advice from my friend Ben.
When using a SAN over FC (Fibre Channel), it is possible to get close to wire speed, as the FC protocol has very low overhead. It is usually the nature of the I/O itself that slows down performance. Synchronous I/O is going to be slower – more on that later.
A good way to test this is to compile the latest version of IOtest on AIX, then run it against a file or a junk raw LV while supplying various parameters.
Things you will usually see:
- Synchronous I/O is going to be far slower than you would expect, no matter what you do. Synchronous I/O is produced by commands like dd, cp, mv, and cat, as well as by databases and applications that do not use AIO (the Unidata database, WebSphere, Apache, etc. all use synchronous I/O).
- The more writes you have, the slower you will go. Most SAN disk manufacturers design their arrays (and even their spindles) to handle 80% reads and 20% writes optimally, so even if you have a large amount of disk cache, you will fill up the allotted “slot” for writes very quickly and ultimately be limited by the ability of the back-end disks to absorb writes.
- The smaller the I/O sizes, the lower your throughput.
- The fewer process threads performing I/O, the lower your throughput. The more process threads you run, the better the saturation and I/O consolidation, and so the better your throughput (but only up to a point).
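The I/O-size effect above is easy to see for yourself with plain dd, which issues synchronous I/O (so absolute numbers will be modest). This is a rough sketch, not IOtest itself; the file path and sizes are arbitrary:

```shell
# Write the same 100 MB to a scratch file twice: once in 4 KB I/Os,
# once in 1 MB I/Os. The second run should finish noticeably faster.
time dd if=/dev/zero of=/tmp/ddtest bs=4096 count=25600       # 100 MB in 4 KB I/Os
time dd if=/dev/zero of=/tmp/ddtest bs=1048576 count=100      # 100 MB in 1 MB I/Os
rm -f /tmp/ddtest
```

Because dd is synchronous, this also illustrates the first point: even the large-block run will fall well short of what the same hardware can do with AIO and many parallel threads.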
Detecting a back end disk problem
You can tell whether you really have a back-end disk problem with filemon by running these three commands (start the trace, let it collect data for a while, then stop it – filemon keeps tracing until trcstop is issued):
filemon -T 20000000 -o /tmp/filemon.out -O lv,pv
sleep 60
trcstop
That writes out a file, /tmp/filemon.out, with a lot of information in it. We extract what we are looking for with this command:
grep times /tmp/filemon.out
This should give us a bunch of read and write times. If an “avg” value is greater than 20 (milliseconds), the back-end disks are not fast enough to support the load on them. Usually the solution is to add more disks and spread the load over more spindles.
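The 20 ms check can be scripted. The sample below mimics the general shape of filemon’s “times (msec): avg …” lines (the exact column layout of real filemon output may differ on your release), pulls out each avg value, and flags anything over the threshold:

```shell
# Synthetic sample mimicking filemon "times" lines (real layout may differ)
cat > /tmp/filemon.sample <<'EOF'
read times (msec):      avg  7.412   min 0.201  max 41.3    sdev 3.9
write times (msec):     avg  33.108  min 0.550  max 120.7   sdev 11.2
EOF

# Print each avg and flag values above the 20 ms rule of thumb
grep times /tmp/filemon.sample |
awk '{ for (i = 1; i <= NF; i++)
         if ($i == "avg") {
           v = $(i+1) + 0                       # force numeric comparison
           print $1, "avg:", v, (v > 20 ? "<-- too slow" : "ok")
         } }'
rm -f /tmp/filemon.sample
```

Against the sample above this flags only the write line, which matches the common pattern: reads served from cache look fine while the back-end spindles struggle to absorb writes.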
Increasing I/O performance: general rules
- Use AIO if possible (asynchronous I/O)
- Increase queue depths if necessary (especially if you are representing large arrays of disks as a single LUN on the host)
- Do as much work in a single disk operation as possible
- Run as many I/O-generating threads in parallel as you can. In DB2 this is done by increasing NUM_IOSERVERS. We had good results with one I/O server per spindle on a RAID10 LUN.
- Make sure dynamic tracking is on (the fscsi device attribute dyntrk). More on this later.
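For the DB2 side of this, the prefetcher count is the NUM_IOSERVERS database configuration parameter. A sketch of applying the one-I/O-server-per-spindle rule of thumb above – the database name MYDB and the value 16 (for a hypothetical 16-spindle RAID10 LUN) are placeholders, not values from this article:

```shell
# MYDB and 16 are hypothetical; match NUM_IOSERVERS to your spindle count
db2 update db cfg for MYDB using NUM_IOSERVERS 16
db2 get db cfg for MYDB | grep -i ioservers    # verify the new value
```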
In reality, since there are other constraints to observe, these are usually about all you can actually do:
- On each disk, set the queue depth to approximately: 8 * n / h
where n = number of spindles in the biggest RAID array you are using
where h = number of HBAs (fibre adapters) that can see the disk
You can check the current value with “lsattr -El hdiskX” and set it with “chdev -l hdiskX -a queue_depth=NN”.
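As a worked instance of the 8 * n / h formula (the numbers are hypothetical, not from a real system):

```shell
# Hypothetical: 16 spindles in the biggest RAID array, 2 HBAs see the disk
n=16
h=2
qd=$(( 8 * n / h ))
echo "suggested queue_depth: $qd"    # prints 64
# On AIX you would then apply it with (not run here):
#   chdev -l hdiskX -a queue_depth=$qd
```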
- On each HBA (fibre adapter), set your SCSI queue to 2048 with “chdev -l fcsX -a num_cmd_elems=2048”. If the disks on it are in use, you will have to use “chdev -Pl fcsX -a num_cmd_elems=2048” instead and then reboot for it to take effect.
- Make sure you are using dynamic tracking (same commands as above, but with the fscsiX devices and the dyntrk attribute), i.e. “chdev -Pl fscsi0 -a dyntrk=yes”.
- If you are using AIX 5.x, make sure lru_file_repage is set to 0 with vmo (set it with “vmo -po lru_file_repage=0”).
Use this advice at your own risk. Please be very careful when making any changes, and make sure you have backups.