Five IBM Tivoli Storage Manager backup errors and how to prevent them
Learn how to correct the top five IBM Tivoli Storage Manager backup errors in this tip from backup expert Pierre Dorion.
What you will learn in this tip: Like most enterprise-class backup tools, IBM Tivoli Storage Manager (TSM) is flexible and customizable. However, care must be taken when configuring TSM to avoid inappropriate settings that could make the backup environment difficult to manage and lead to errors. In this tip, learn about five of the most common Tivoli Storage Manager backup errors and how to prevent them.
Database backups and recovery log size
One of the most common Tivoli Storage Manager backup errors is due to improper sizing of the recovery logs. During data backup operations, the TSM server writes backup transaction information to the recovery log and then commits it to the database. The recovery log is used to allow a failed database to be restored from its last backup, and all transactions recorded in the recovery log after the last database backup are applied to the recovered database to bring it to its latest state (known as roll-forward mode).
As data backup activity takes place, the recovery log uses more space as it stores transaction information, and whenever the TSM database is backed up, the recovery log is “flushed” of all transaction information. If the recovery log reaches full capacity before the next database backup, the server stops accepting new transactions, halts automatically and cannot be restarted until the recovery log is extended manually. This is a common error that is usually caused by failed database backups (e.g., no scratch tapes available) and is typically the result of poor activity monitoring.
In earlier versions of TSM, the maximum size for the recovery log was 13 GB. This has been increased to 128 GB as of TSM version 6.1. While this increased log size allows more latitude, it must still be monitored to prevent the TSM server from unexpectedly halting, should that new log capacity be reached. The best way to prevent this situation is to ensure there are always enough scratch tapes for database backups and to monitor the database backup success and recovery log utilization.
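As a rough illustration of this monitoring, the following administrative commands (issued from a dsmadmc session, or saved as a server macro) cover the basic daily checks; DBBACKUP_DEV is a placeholder device class name and the exact output fields differ between TSM 5.x and 6.x:

/* Check recovery log and database utilization */
query log format=detailed
query db format=detailed
/* Confirm scratch volumes are available for the next database backup */
query libvolume
/* Run a manual full database backup if the scheduled one failed */
backup db devclass=DBBACKUP_DEV type=full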
Undersized primary disk storage pools
In environments where both disk and tapes are used as backup data storage targets, disk is frequently used as the primary storage pool where backup data is initially written (or staged) and later transferred (migrated) to a tape pool. A common error encountered in these environments is the undersizing of the primary disk pool, which can cause backup delays and even missed backup windows.
When a disk pool fills up or reaches its high utilization threshold, it automatically starts migrating data to a designated tape pool. This can cause serious delays to the backup process, depending on the number of tape devices available and the number of concurrent backup sessions when the migration started. Hard drives are random-access devices, so a disk storage pool can support as many simultaneous backup sessions as the I/O capacity of the server will permit. However, TSM writes to tape devices sequentially and allows only a single data stream at a time between the disk pool and an individual tape device; it will run multiple migration processes only if multiple tape drives are available (one migration process per tape drive). If insufficient tape devices are available, backup sessions are queued and will use storage resources as they become available. In significantly undersized disk pool environments, this may cause some backup sessions to run beyond allowable windows or even fail.
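As a rough illustration, the disk pool's utilization, migration thresholds and number of migration processes can be reviewed and tuned from a dsmadmc session; DISKPOOL is a placeholder pool name and the threshold values below are examples only, not recommendations:

/* Review current utilization, migration thresholds and the next (tape) pool */
query stgpool DISKPOOL format=detailed
/* Example values: start migration at 70% utilization, stop at 30%, run two processes */
update stgpool DISKPOOL highmig=70 lowmig=30 migprocess=2
/* If needed, drain the pool manually ahead of the backup window */
migrate stgpool DISKPOOL lowmig=0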
The best-practice approach to prevent these migration bottlenecks is to size disk pools to hold as much backup data as will be directed to them during a backup window. In other words, if 700 GB of data is backed up daily and sent to a disk pool, this pool should be at least 700 GB in capacity. Alternatively, clients that back up large files, such as database servers, can be configured to bypass disk pools and back up directly to tape, provided there are enough tape devices available to allow it.
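One way this direct-to-tape approach is commonly configured is by pointing the backup copy group of those clients' management class at the tape pool; the domain, policy set, class and pool names below are placeholders for illustration:

/* Send backups bound to this management class straight to the tape pool */
update copygroup DBDOMAIN DBPOLICY DBCLASS standard type=backup destination=TAPEPOOL
/* Validate and activate the modified policy set so the change takes effect */
validate policyset DBDOMAIN DBPOLICY
activate policyset DBDOMAIN DBPOLICY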
Inadequate tape check-out procedures
A properly configured TSM environment will have a scheduled process that copies all primary storage pool data to a copy pool and performs a daily database backup to be sent offsite for disaster recovery purposes. These tape volumes must be properly “checked out” of the tape library (ejected) and sent offsite. It's not uncommon to encounter environments where these tapes are checked out and sent offsite only once a week. This can lead to up to one week of data loss in the event of destructive events such as floods, tornadoes, hurricanes or fires, depending on when the tapes were last sent to the vault. If company policies dictate that data must be backed up daily to meet a defined recovery point objective (e.g., 24 hours), then an offsite copy must be created and sent offsite daily.
The situations described above can be easily circumvented with proper procedures and capacity planning. Tapes should be ejected from the library and sent offsite daily if an RPO of 24 hours is required. If handling tape media on a daily basis is not practical, then a remote replication solution (disk-based, VTL or TSM native) must be considered. With respect to tape library capacity, the library must be upgraded with enough capacity to hold the entire backup data set, with room for growth. Alternatively, part of the data on tape could be migrated to a deduplication-capable disk-based solution and/or the data retention policy can be reduced.
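As a sketch of what the daily offsite cycle can look like, assuming the Disaster Recovery Manager (DRM) feature is licensed and configured, and with placeholder pool and device class names:

/* Copy primary storage pool data to the offsite copy pool */
backup stgpool DISKPOOL OFFSITEPOOL
backup stgpool TAPEPOOL OFFSITEPOOL
/* Back up the TSM database for offsite rotation */
backup db devclass=DBBACKUP_DEV type=full
/* Eject the DRM-managed volumes and record them as headed to the vault */
move drmedia * wherestate=mountable tostate=vault remove=bulk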
Expiration process
The expiration process frees up storage pool space by marking files for deletion once they have reached a predefined retention period. It also removes the corresponding entries from the TSM database, which helps manage the size of the database. In larger TSM environments, the expiration process is sometimes not allowed to finish before it is interrupted. In fact, the TSM software allows administrators to set a “duration” parameter to limit how long this resource-intensive process runs each day. While the ability to interrupt the expiration process can be convenient, it can cause other issues. If the expiration process is never allowed to finish, more new database records may be created each day than are expired. This can inflate the size of the TSM database and prevent the expiration process from ever catching up.
There are many reasons why the expiration process can take longer than usual to complete, including the recent manual deletion of a significant amount of backup data, a resource-constrained TSM server (CPU, memory, I/O), or a TSM database that has grown beyond a manageable size for a single TSM server instance. This situation can be prevented with proactive monitoring to address server performance issues before they seriously affect daily TSM processing. Avoiding the deletion of large amounts of data at once can help in already heavily utilized environments. In addition, deploying a second TSM server instance may be required in large environments.
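As one possible illustration, expiration can be run as a scheduled administrative command with a bounded run time so it executes at a predictable point in the daily cycle; the schedule name, start time and four-hour duration below are examples only:

/* Run inventory expiration every day, limited to 240 minutes */
define schedule expire_daily type=administrative cmd="expire inventory duration=240" active=yes starttime=10:00 period=1 perunits=days
/* Check on running processes and database growth */
query process
query db format=detailed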
Maximum concurrent sessions
Another common error involves the setting for the maximum number of allowable concurrent backup sessions, or the MAXSESSIONS TSM server option. This error is common in newly implemented TSM environments or in environments where a large number of new backup clients are added at once. The default number of allowed concurrent sessions is 25, a limit that is easy to overlook and exceed. This condition, combined with a short backup start window (the default is one hour), can cause backups to be missed. And if this goes undetected, it can create an exposure to data loss. As with many of the other errors outlined earlier, adequate and regular monitoring can easily detect this type of issue. That said, the best way to prevent it from occurring in the first place is to develop a checklist of configuration items to review when adding new backup clients or making any other significant changes in your TSM environment.
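As a quick sketch, the configured limit can be compared against the actual session load with the commands below; the value of 50 is an example only and should reflect the number of clients that genuinely need to back up concurrently:

/* Display server settings, including the configured maximum number of sessions */
query status
/* List the client and administrative sessions currently connected */
query session
/* Raise the limit for the running server; depending on the version, MAXSESSIONS in dsmserv.opt may also need updating so the change persists across restarts */
setopt maxsessions 50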
IBM TSM is a highly customizable and flexible backup product and, when properly implemented, it's an effective backup solution. That said, the product’s flexibility also adds to its complexity, which sometimes leads to configuration and operational errors. As with any other backup software, the key to early error detection and correction is proper monitoring. Insufficient monitoring is by far the top error, but it's also the easiest one to prevent.
About this author:
Pierre Dorion is the data center practice director and a senior consultant with Long View Systems Inc. in Phoenix, Ariz., specializing in the areas of business continuity and DR planning services and corporate data protection.