Leading Causes of Data Loss:
Natural Disasters 3%
Viruses 7%
Human Errors 32%
Software Malfunction 14%
Hardware & System Malfunction 44%
Computers are relied upon more now than ever or, more to the point, so is
the data contained on them. In nearly every instance the system itself can be
easily repaired or replaced, but the data, once lost, may not be recreatable.
That's why the Data Recovery Center stresses the importance of regular
system backups and the implementation of some preventative measures.
The chart above lists the most common reasons that data recovery is
needed. In all cases there are steps that you, the user, can take to
minimize your risk of data loss.
1. Natural Disasters
While the least likely cause of data loss, a natural disaster can have a
devastating effect on the physical drive. However, Data Recovery Center has
rescued data from fires, floods, lightning strikes and the subsequent power
surges.
In instances of severe housing damage, such as scored platters from fire,
water emulsion due to flood, or broken or crushed platters, the drive
may become unrecoverable.
The best way to prevent data loss from a natural disaster is an off-site
backup. Since it is nearly impossible to predict the arrival of such an event,
more than one copy of the system backup should be kept, one on-site
and one off. The type of media you back up to will depend on your system,
software, and how frequently you need to back up. Can you proceed
with a day's data loss? A week's? A month's? Also be sure to check your
backups to be certain that they have completed properly. There's nothing
worse than attempting to restore data from a blank medium.
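One way to check that a backup really contains your data is to compare a checksum of each original file with its copy. The following is a minimal sketch, assuming the backup is a plain directory copy of the source; the function names sha256_of and verify_backup are illustrative and do not belong to any particular backup product.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir: Path, backup_dir: Path) -> List[str]:
    """List files that are missing from the backup or differ from the source.

    Assumption: the backup mirrors the source directory layout.
    """
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        copy = backup_dir / relative
        if not copy.is_file():
            problems.append(f"missing: {relative}")
        elif sha256_of(copy) != sha256_of(source_file):
            problems.append(f"differs: {relative}")
    return problems
```

An empty list means every source file has an identical copy. A blank or truncated copy shows up as "differs".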
2. Viruses
Viral infection increases at a rate of roughly 200-300 new trojans, exploits
and viruses every month. There are approximately 56,712 "wild" or
risk-posing viruses and about 105,000 total known viruses, some of which
are considered non-threatening. With those numbers growing every day, you
are at an ever-increasing risk of becoming infected with a virus.
There are several ways to protect yourself against a viral threat:
a. Install a firewall on your system to prevent hackers from accessing your
data.
b. Install an anti-virus program on your system and use it regularly to
scan for infection. Many viruses will lie dormant or
perform many minor alterations that can cumulatively disrupt the way your
system works. Be sure to check for updates on a regular basis.
c. Back up, and be sure to test your backups for infection as well. There
is no use in removing the virus only to restore it again from your
backup.
d. Be wary of any email containing an attachment. If you don't know where
it came from or what it is, then don't open it.
e. If you have contracted a "wild" virus for which there is no known
cure, quarantine it to that system and contact the Data Recovery Center
for further information and assistance.
3. Human Errors
Even in today's era of highly trained, certified, and computer-literate staff,
there is always room for accidents. Sometimes referred
to as the U.S.E.R. virus, human mistakes are made daily all over the world.
There is not much we can do as users to prevent the intervention of Murphy's
Law, except to be cautious. Here are a few things you might want to try:
a. Be aware. It sounds simple enough to say, but it is not so easy to do.
When transferring data, be sure it is going to the destination you had
in mind. If asked "Would you like to replace the existing file?",
make sure you do before clicking "yes".
b. If you are even a little bit uncertain about a task you are about to carry
out, make sure there is a copy of the data to restore from.
c. Take extra care when using any software that may manipulate your drive's
data storage, such as partition mergers, format changes, or even disk
checkers.
d. Before upgrading to a new operating system, back up your most important
files or directories in case there is a problem during the install. Keep in
mind that if you have a slaved data drive, it may be formatted as well.
e. Never shut the system down while programs are running. The open files
will more than likely become truncated and non-functional.
4. Software Malfunction
Software malfunction is a necessary evil when using a computer. Even the
world's top programmers cannot anticipate every error that may occur in any
given program. There are still a few things you can do to lessen the risk:
a. Be sure you are using the software ONLY for its intended purpose. Misusing
a program may cause it to malfunction.
b. Avoid pirated copies of programs. They may cause the software to
malfunction, resulting in corruption of your data files.
c. Be sure that you have the proper amount of memory installed if you plan
to run multiple programs simultaneously. If a program shuts down or freezes
up you may lose or corrupt what you were working on.
d. Back up, back up, back up. A tedious task, but you will be glad you did
if the software corrupts your customer database.
5. Hardware Malfunction
The most common cause of data loss, hardware malfunction or hard drive
failure, is another necessary evil inherent to computing. There is usually
little to no warning that your drive will fail, but some steps can be taken to
minimize the need for data recovery from a hard drive failure:
a. Do not stack drives on top of each other; leave space for ventilation.
An overheated drive is likely to fail. Be sure to keep the computer away
from heat sources and make sure it is well ventilated.
b. Purchase a UPS (Uninterruptible Power Supply) to lessen malfunctions caused
by power surges.
c. NEVER open the casing on a hard drive. Even the smallest grain of dust
settling on the platters in the interior of the drive can cause it to
fail.
Introduction
The capacity planner's role is critical for efficient backup and recovery
for any Datacenter. This white paper is intended to provide a capacity
planner with detailed information and guidelines for performing effective
capacity planning. This paper includes two main sections:
Overview of Backup Technology
Capacity Planning
For those who are not highly familiar with current backup technology,
the first section, Overview of Backup Technology, provides a useful foundation
for the Capacity Planning section.
Overview of Backup Technology
The need to reliably back up and retrieve data has reached a new level
of urgency as companies are realizing the importance of saving and
accessing large volumes of data. Today's corporate databases and on-line
applications routinely manipulate hundreds of gigabytes (GB) of data,
and databases one terabyte (TB) and larger are becoming increasingly common.
The amount of corporate data collected electronically is growing dramatically
each year. And companies are realizing the value in saving un-sifted data,
for example, to glean information about market trends that can make or
break their future success.
This reliance on full-time availability of data means the time to back up
data is shrinking, and the demands for 100% availability of important
data and for frequent backups are growing. These trends are placing enormous
pressure on Information Technology organizations to increase the speed
of backups while reducing the degree to which they intrude on day-to-day
operations. Equally important is the need to recover files quickly and
efficiently. Thus scheduled backups and rapid recoveries are activities
that must be predictable, stable, reliable, and fast.
Basics of Backup and Recovery Technology
Current backup technology allows most of the backup process to be automated,
with the exception of initial configuration and subsequent adjustment
as storage requirements expand.
Physical and Logical Backups
There are two basic backup and recovery processes: physical and logical
backups. Physical backups copy a byte-for-byte image of all of the database
disk storage to a backup device. Logical backups copy all of the logical
entities in the database to a backup device. Each process presents a different
configuration problem. Physical backups are usually much faster than logical
backups, because the source is read sequentially and the data can be retrieved
at full device speed. The drawback is that the entire volume must be backed
up as a single entity. Thus raw device backups are most useful when the
entire device must be backed up. In contrast, a logical backup program
reads the superblock to obtain the names of all the directories in the
file system, and then reads logical entities such as directory entries
one by one, almost always not in device order. While slower, a benefit
of logical backups is their ability to inspect the last-modified date
of each file and decide whether or not the file has been updated since
the most recent backup.
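The last-modified check that a logical backup performs can be illustrated with a short sketch. It assumes the time of the most recent backup is stored as a POSIX timestamp; the function name files_changed_since is hypothetical and is not part of any backup product mentioned in this paper.

```python
import os
from pathlib import Path
from typing import List


def files_changed_since(root: Path, last_backup_time: float) -> List[Path]:
    """Return files under root modified after last_backup_time.

    last_backup_time is a POSIX timestamp (seconds since the epoch).
    Files are visited one by one in directory order, not device order,
    which is why this style of backup is slower than a raw device read.
    """
    changed = []
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = Path(directory) / name
            if path.stat().st_mtime > last_backup_time:
                changed.append(path)
    return sorted(changed)
```

Only the files returned by this function would need to be written during an incremental backup.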
Fully-Consistent Dumps
Two backup strategies can be implemented when fully-consistent dumps are
required. One way to make the file system being dumped inaccessible to
modifications is to simply unmount the file system before dumping it.
The file system can then be remounted read-only if such access is required
during the backup. Another option is to lock the file system against
updates while the backup is being performed. Because these strategies prevent
the file system from being modified during backup, they are nearly always
used off-hours. This is usually not a problem unless user batch jobs are
run overnight, as they can be substantially degraded during the backup.
Full-Time Availability
Datacenters that require full-time availability of data can use software
or hardware mirroring to replicate crucial data onto two or more separate
disks. By itself, mirroring does not solve the real backup problem (nor
do other protected storage mechanisms, such as RAID-5), because mirrored
data is also susceptible to application bugs and operator or user error,
and mirrored disks must also be backed up. When full-time availability
is required, a number of options are available, for example, hot database
backups, and the use of snapshot images--read-only copies of data for
backups.
Database Backup Technology
There are three basic types of full database backups: on-line, off-line
and raw device backups. On-line backups are logical backups of a database
that can be simultaneously handling transactions. Off-line backups are
logical backups of a database that is quiet and is not available for transactions.
Raw device backups are physical backups of the raw disk devices.
On-line Backups
On-line backups are the least-intrusive strategy, and they are a popular
solution for databases that must be available 24 hours per day. On-line
database backups are facilitated with software such as Oracle Enterprise
Backup Utility (EBU), which can provide a consistent snapshot of all database
table spaces to backup utilities such as Sun StorEdge Enterprise
NetBackup software. With several parallel streams of data provided by the
database, Sun StorEdge Enterprise NetBackup software utilizes the backup
drives to their maximum capacity, multiplexing multiple streams onto single
devices where feasible.
Because transactions must be logged during the backup process, database
performance may be degraded while on-line backups are performed. One way
to back up a database that must sustain high transaction rates is to mirror
the database and perform a physical backup of the mirror. This requires
first altering the database to begin backup, which establishes a quiescent
database image. Then the mirror is detached so that a static image of
the database is maintained on the detached mirror. The database is then
altered to end backup, which allows logged transactions to be rolled forward
into the tablespaces while a raw device backup of the mirror is done.
When the backup is complete, the mirror is re-attached and the mirroring
mechanism synchronizes the two disk images once again.
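The ordering of the mirror-split procedure above matters more than any single step, so it can help to write it down as a sequence. The sketch below is illustrative only: every callable passed in is a hypothetical placeholder for a site-specific database or volume manager command, and none of the names corresponds to a real Oracle or Sun interface.

```python
from typing import Callable

Step = Callable[[], None]


def mirror_split_backup(
    begin_backup: Step,     # hypothetical: alter the database to begin backup
    detach_mirror: Step,    # hypothetical: split one side of the mirror
    end_backup: Step,       # hypothetical: alter the database to end backup
    backup_mirror: Step,    # hypothetical: raw device backup of the detached mirror
    reattach_mirror: Step,  # hypothetical: re-attach and resynchronize the mirror
) -> None:
    """Run the mirror-split backup steps in the order described in the text."""
    begin_backup()
    try:
        detach_mirror()
    finally:
        # Always leave backup mode, even if the split fails.
        end_backup()
    try:
        backup_mirror()
    finally:
        # Always restore redundancy, even if the tape backup fails.
        reattach_mirror()
```

The two try/finally blocks are a design choice of this sketch rather than something the paper prescribes: they keep the database from being left in backup mode and the mirror from being left detached when a step fails.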
Off-line Backups
For very large databases that can be taken out of use for short periods
of time, off-line backups are often the choice. This approach uses a utility
such as Oracle EBU to make the database unavailable for normal transactions.
It synchronizes the state of all its tables and provides a consistent
view of the database to Sun StorEdge Enterprise NetBackup software. Off-line
backups typically outperform on-line backups because of the lack of contention
for system resources, and the fact that they have no impact on transaction
rates once the database is back in use again. Today, with high-performance
backup solutions such as Sun StorEdge Enterprise NetBackup software, off-line
backups are once again being considered viable solutions.
Raw Device Backups
Raw device backups are the simplest way to back up a database, as they
directly copy the raw disk devices to tape. This requires the database
to be in a quiescent state, and uses a utility such as Sun StorEdge Enterprise
NetBackup software to manage the high-speed transfer of disk data onto
tape. Raw device backups are fast because the database itself is not involved
in the process, eliminating all but the essential overhead. They are also
fast because the disk devices are read sequentially, providing data to
Sun StorEdge Enterprise NetBackup software at high speeds.
Advances in Backup Technology
In the past, IT organizations have turned to mainframes as the solution
to large database and high-speed backup needs. While UNIX® systems
have typically delivered a 50-70 gigabyte per hour backup throughput,
mainframes and their high-speed tape drives have managed throughput nearly
six times faster. Several recent developments have turned the tables on
this equation and have enabled sustained backup rates of more than one
terabyte per hour on Sun servers--while at the same time decreasing the
intrusiveness of backup operations.
Faster Throughput Rates
Tape drive technology has seen dramatic improvements in throughput rates.
The familiar Sun 7 GB 8 mm tape drive provides native (uncompressed) throughput
of 1 MB/second. The Sun StorEdge DLT tape 4000 tape drive improves on
this rate by half, managing 1.5 MB/second, and the familiar IBM 3490E manages
three times the throughput of the 7 GB drive with a rate of 3 MB/second.
Newer drives that significantly change the character of database backup
capabilities include the Sun 20 GB 8 mm AME tape drive with a transfer
rate of 3 MB/second, the Sun StorEdge DLT tape 7000 tape drive that transfers
5 MB/second, and the Storage Technology RedWood SD-3 tape drive that,
with 12 MB/second throughput, outperforms the IBM 3490E by a factor of
four.
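To see what these native rates mean for a backup window, the time needed to stream a dataset to tape can be estimated directly. The sketch below uses the rates quoted above; it assumes 1 GB = 1024 MB, no compression, and drives that stream at full speed in parallel, and the function name backup_hours is illustrative.

```python
def backup_hours(dataset_gb: float, drive_mb_per_s: float, drives: int = 1) -> float:
    """Hours needed to stream dataset_gb to tape at the given native rate.

    Assumptions: 1 GB = 1024 MB, no compression, and all drives
    streaming in parallel at their full native rate.
    """
    total_mb = dataset_gb * 1024
    seconds = total_mb / (drive_mb_per_s * drives)
    return seconds / 3600


# Native (uncompressed) rates quoted in the text, in MB/second.
NATIVE_RATES = {
    "Sun 7 GB 8 mm": 1.0,
    "Sun StorEdge DLT tape 4000": 1.5,
    "IBM 3490E": 3.0,
    "Sun 20 GB 8 mm AME": 3.0,
    "Sun StorEdge DLT tape 7000": 5.0,
    "StorageTek RedWood SD-3": 12.0,
}

if __name__ == "__main__":
    for name, rate in NATIVE_RATES.items():
        print(f"{name}: {backup_hours(100, rate):.1f} hours for 100 GB")
```

Real backups rarely sustain the native rate for the whole run, so these figures are best read as lower bounds on elapsed time.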
Greater Capacities
Along with these improvements in speed have come improvements in capacity.
Sun's StorEdge 20 GB 8 mm AME tape drive stores almost three times the
native capacity of the 7 GB previous generation. The Sun StorEdge DLT
tape 4000 and DLT tape 7000 tape drives have native capacities of 20 GB
and 35 GB, respectively. With a capacity of up to 50 GB on a single data
cartridge, the StorageTek SD-3 drive can hold up to 250 times more data
than a standard 18-track cartridge, and 125 times more data than a
36-track cartridge.
The result is smooth, high-performance backups because less tape handling
is required.
Automated Backup and Recovery Management Procedures
Another important development that is changing the character of backups
is the advent of management software that automates backup policies and
optimally feeds data to tape devices--ensuring integrity and speeding
the backup process. After all, raw tape speed and high-capacity drives
are meaningless without the ability to effectively manage the transfer
of data.
New Approaches to On-line Backups Using Database Technology
Recognizing the need for high-speed backups that require no down time,
database vendors have developed approaches to on-line backups that enable
specialized backup software such as Sun StorEdge Enterprise NetBackup
software to transfer data from the database management system (DBMS) to
backup devices using parallel streams of data. One example is Oracle's
Enterprise Backup Utility (EBU). This utility is responsible for managing
the creation of a consistent database snapshot and feeding parallel data
streams to the Sun StorEdge Enterprise NetBackup software server for multiplexing
onto tape devices. Whereas once this process required dumping database
tables to separate ASCII files and then backing up the files, EBU now
provides convenient interfaces that can be effectively utilized by third-party
utilities.
All of these developments in database backup technology require processing
power and I/O bandwidth in order to work in concert to speed the backup
process. Sun's Ultra Enterprise servers provide scalable, symmetric multi-processing,
scaling from one to 64 high-performance UltraSPARC processors, up to 64
GB of memory, and supporting up to 20 TB of disk storage. The advent of
scalable I/O platforms such as these allows DBMSs to be configured with
the optimal balance of processing power and I/O bandwidth--enabling on-line
backups to proceed without impacting database performance.
Capacity Planning
Capacity planning is critical to the success of efficient backup and recovery
for any Datacenter. Bad performance is usually the result of unrealistic
expectations and poor planning. Realistic expectations and good planning
must consider current and future needs. They must include a plan for the
time and skill to configure the Datacenter, and a plan for training personnel
to operate and fix problems as they arise.
Capacity planning is part science and part art. The capacity planner
must account for numerous variables and virtually unlimited configuration
permutations. Systems are often underconfigured and the wrong products
are often selected for the job. Because installation and configuration
are complex, there is much room for error. Furthermore, because there
are always interrelated bottlenecks, a major aspect of capacity planning
is choosing the preferred bottleneck.
The main role of the capacity planner is to choose hardware and software
for efficient backup and recovery in the Datacenter. To do this, the planner
must first determine the following:
Volume of data the Datacenter will be managing
Availability of that data
How the data will be spread out across the network
Policies for backing up the data
The capacity planner can use this information to derive the following
types of requirements:
Backup servers
Network
Storage
Backup device
Finally, the planner can determine the configuration requirements.
Understanding the Enterprise
Perhaps the single most important factor the capacity planner needs to
assess is the environment to be backed up. This section presents the information
the planner needs to assess the environment.
Dataset Size
The planner's first step is to determine how much data there is to back up
or archive on a regular basis. Two main factors the planner needs to determine
are the total size of the data and the size of the dataset that changes.
Total Dataset Size
The total size of existing data is an indication of the following:
Minimum amount of storage capacity required
Amount of data to be backed up during a full backup
Predictor of total required capacity
Total data size is often one of the easiest pieces of information to obtain,
and tends to be specified as part of the requirements. In addition to
obtaining the total data size, the planner must know or estimate the following
factors:
Number of separate files. The total volume of data may be composed of
a few large files or millions of small files. Certain types of data (e.g.,
databases) may not reside in files at all, but be built on top of raw
volumes. In filesystem backups, there is often a small fixed overhead
per file: the file record needs to be added to the backup database, the
directory information read, and the disk needs to seek to the beginning
of the file.
Knowing the number of files also helps the planner determine the size
of the backup index database retained by the backup software. On average,
Sun StorEdge Enterprise NetBackup software suggests planning for 150 bytes
in the database per file revision retained on media. That works out to
over seven million file records per gigabyte of index database.
Average file size. By knowing the above two pieces of information, the
capacity planner can calculate the average file size in the enterprise.
If there is a large skew in file size distribution (e.g., many small files
and a couple of very large files that throw off the average), the average
may not be a good predictor of behavior. Therefore the planner must plan
for slightly different performance when backing up small files versus
large files.
Average directory depth. The directory structure into which the files
are organized may also have an effect on the performance of the backup
system. This is partly because long directory paths result in multiple
seeks to the disk. Longer paths also result in larger records, because
each filepath backed up is recorded in the database as a variable size
entry. Therefore, longer paths tend to make the backup index databases
grow faster.
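The index sizing rule quoted above (150 bytes per file revision retained on media) lends itself to a quick estimate. The sketch below assumes binary gigabytes (1 GB = 1024^3 bytes), which is the reading under which the "over seven million records per gigabyte" figure holds; the function names are illustrative.

```python
BYTES_PER_FILE_REVISION = 150  # planning figure quoted in the text
BYTES_PER_GB = 1024 ** 3       # assumption: binary gigabytes


def records_per_gb() -> int:
    """Number of file records that fit in one gigabyte of index database."""
    return BYTES_PER_GB // BYTES_PER_FILE_REVISION


def index_size_gb(file_count: int, revisions_retained: int) -> float:
    """Estimated index database size for file_count files,
    each kept in revisions_retained versions on media."""
    total_bytes = file_count * revisions_retained * BYTES_PER_FILE_REVISION
    return total_bytes / BYTES_PER_GB


def average_file_size_bytes(total_bytes: int, file_count: int) -> float:
    """Average file size, from the total data size and the number of files."""
    return total_bytes / file_count
```

For example, one million files retained in ten revisions each would need roughly 1.4 GB of index space under these assumptions. Longer paths enlarge each record, so the true figure may be higher.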
Size of the Dataset that Changes
The size of the dataset that changes determines the volume of data that
needs to be saved during incremental backups. As the number of changed
files or blocks increases, the volume of data that needs to be written
to tape grows. The capacity planner must know or estimate the following
factors:
The frequency of the dataset change. The frequency of the dataset change
determines the frequency for performing backups. The frequency with which
datasets change can vary widely. For example, some directories never change, some
change only when something is upgraded, some change only at the end of
the month, and some, like user mailboxes, typically change on a minute-by-minute
basis. In addition, the frequency of dataset change, in part, determines
the volume of data written during incremental backups, because incremental
backups only save the files that have changed.
Amount of data to be backed up. The planner needs to decide whether to
back up all the data or only the changed portions. While it is usually
faster to save only the changed portions, it is also usually faster to
restore whole directories and filesystems from full backups than from
incremental ones. This is because of the restore process: restores from
incremental backups need to first restore from the full backup, and then
from all the incremental backups, until the latest versions of all files
have been restored. This multi-step process often results in numerous
tape mount requests and multiple retrieves of the same piece of data.
The choice of performing full backups or incremental ones tends to be
a matter of which case is most important: a regularly scheduled backup
or an emergency after data has been lost on disk. While the former is
done much more frequently, the latter tends to be a more time-critical
situation.
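The multi-step restore described above can be made concrete by listing which backups must be read, and in what order, to recover the state on a given day. This is a minimal sketch that assumes each incremental backup holds the files changed since the previous backup of any kind; the Backup record and restore_chain function are hypothetical names, not part of any backup product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    day: int   # day number on which the backup was taken
    kind: str  # "full" or "incremental"


def restore_chain(backups: List[Backup], target_day: int) -> List[Backup]:
    """Return the backups to restore, oldest first, to recover target_day.

    Assumption: each incremental contains the files changed since the
    previous backup, so every incremental after the last full is needed.
    """
    usable = sorted((b for b in backups if b.day <= target_day), key=lambda b: b.day)
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup on or before the target day")
    return usable[full_positions[-1]:]
```

The length of the returned list is a rough proxy for the number of tape mounts a restore will need, which is why more frequent full backups shorten emergency restores at the cost of longer scheduled backups.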
Data Type
The type of data to be backed up relates mostly to the level of compression
that could be expected from the backup hardware or software. There is
no guarantee that the types of data to be compressed will exhibit similar
properties, so it is safest to assume the data has not already been
compressed, and to compress all the data to be backed up.
Database or higher-level application data plays a special role in effective
capacity planning. Unless the enterprise has relatively simple availability
requirements for their data, backup will require special modules to save
the data in a consistent state for restore. These modules are available
for many popular database and application environments for both Sun StorEdge
Enterprise NetBackup and Solstice Backup software packages.
The following are types of data the planner needs to consider. The various
data types mentioned below include an example compression ratio for the
DLT tape 7000 tape drive.
Text or natural language. Text or natural language tends to have a lot
of redundancy, and can therefore be well compressed by both software and
hardware. For example, in tests using sample English texts, the DLT tape
7000 hardware compressed the data at a ratio of approximately 1.4:1.
Databases and high-level applications. Many popular database packages
and application environments have corresponding backup modules for Sun
StorEdge Enterprise NetBackup and Solstice Backup software packages. For
example, backup modules exist for Oracle, Informix, and Sybase database
packages as well as for application environments like SAP. These modules
enable backing up and restoring data in a consistent state, without taking
the database off-line and making it unavailable to users.
Additionally, while databases and high-level applications tend to have
widely varying contents and structure, they often contain text or numeric
data with a lot of redundancy. This makes them more compressible. For
example, in tests with sample databases from a TPC-C benchmark, the DLT
tape 7000 hardware compressed the data at a ratio of approximately 1.6:1.
Graphics. Many applications require manipulating numerous large graphical
objects. The fact that graphic files tend to be larger than text files
does not imply the filesystem will consist of a few large files. This
is because applications create composite objects from a myriad of smaller
isolated objects.
In general, graphic objects tend to be previously compressed, making
further compression in hardware or software unlikely. Indeed, the nature
of hardware compression algorithms often inflates files that are already
optimally compressed. For example, in tests with Motion JPG data, the
DLT tape 7000 hardware compression showed a compression ratio of approximately
0.93:1.
Combined file types. Data residing on network file servers and internet
servers, the most common server types, is usually a mix of text, graphics,
and binary files. Because these datasets often consist of many small files,
the capacity planner must also evaluate system performance. These mixed
file types compress well. For example, in tests with files from network
file servers and internet servers, the DLT tape 7000 hardware had a compression
ratio of approximately 1.6:1.
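The example compression ratios above translate directly into how much source data fits on one cartridge. The sketch below combines them with the 35 GB native capacity quoted earlier for the DLT tape 7000; the ratios are the sample figures from this section, not guarantees for any particular dataset, and the function names are illustrative.

```python
import math

DLT7000_NATIVE_GB = 35  # native capacity quoted earlier in the text

# Example hardware compression ratios quoted in this section.
COMPRESSION_RATIOS = {
    "text": 1.4,
    "database (TPC-C sample)": 1.6,
    "Motion JPG": 0.93,
    "mixed file server": 1.6,
}


def effective_capacity_gb(native_gb: float, ratio: float) -> float:
    """Source data that fits on one cartridge at a given compression ratio."""
    return native_gb * ratio


def cartridges_needed(dataset_gb: float, native_gb: float, ratio: float) -> int:
    """Whole cartridges needed to hold dataset_gb of source data."""
    return math.ceil(dataset_gb / effective_capacity_gb(native_gb, ratio))
```

Note that a ratio below 1, as with the Motion JPG sample, means the data takes up more tape than its original size, so such datasets need more cartridges than the native capacity alone would suggest.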
File Structure
Another factor the planner must consider is the structure of the files:
will they be backed up using a filesystem or dumped from a raw device?
As mentioned previously, raw dumps copy all the bits from the storage
volume to the backup media. This captures the bits for any filesystem
or database metadata, as well as the actual application data written on
that volume. However, the metadata may be out of synch with the data in
the volume. This is because the metadata on the volume is not interpreted,
and the volume cannot differentiate the backup from another access. To
prevent this problem, the volume is typically taken off-line to prevent
updates to both data and metadata. Another solution is to mark all entities
on that volume read-only for the duration of the backup.
The level of this problem varies depending on the types of filesystems
and databases to be backed up. On-line filesystems maintain consistency,
and do not require periods of unavailability. However, some higher-level
applications may keep their data and metadata in the filesystem, and may
need to be taken off-line or otherwise prevented from updating their files
during the backup. Prevention from file updates during backups is required
so that all the application data can be simultaneously saved and restored
in a consistent state.
Another consideration between raw volume and filesystem backup is the
atomicity of the data. The raw volume is treated as one large entity,
while filesystems are divided into many small logical pieces. To recover
even one portion of data (e.g., a file or database row) from a raw volume
dump in a consistent state, the entire dataset needs to be restored.
Restoring the entire dataset not only takes longer,
but it also overwrites any changes to all the other data that had been
made since the dump. In addition, incremental backups are currently impossible
with raw volumes, because an update to any part of the volume compromises
the integrity of the whole. In this case, the whole volume needs to be
dumped again. With filesystems, only those files that changed since the
last backup need be saved again.
The main advantage of raw volume dumps is the sheer efficiency of dumping
raw bits without further interpretation by the system. The disk accesses
tend to be large and sequential, minimizing the overhead of system calls
and eliminating seeks by the disk drive arms (which are orders of magnitude
slower than data transfers).
In contrast, filesystems add additional overhead. The data from file
accesses is, by default, buffered in the virtual memory system, and this
incurs copies in the kernel. In addition, files are read from disk in
directory order and may be scattered in various areas of the disk, causing
seeks to pass from one file to the next. This process may reduce the data
rate from the disk volume. To perform closer to the level of raw dumps,
the filesystem inefficiencies can be minimized through careful configuration
and tuning. Nevertheless, there are certain situations where raw dumps
are superior, if only for their sheer simplicity.
Filesystems can also offer a number of features that benefit effective
backup configuration and planning. Chief among these is the ability to
turn on Direct I/O. Direct I/O is a method of accessing files in the filesystem
as though they were raw devices. This mainly bypasses the virtual memory
buffering, which may result in a large saving in CPU time, memory usage,
and overall wall time. (Despite the benefits of Direct I/O, seeking to
various positions on the disk to reach the beginning of each file cannot be
avoided.) A recent study showed that Direct I/O saved an average of approximately
20% CPU cycles, and kept the system from thrashing during extraordinarily
heavy loads.
Direct I/O is available in both VxFS and UFS (starting with the Solaris
2.6 Operating Environment software). VxFS provides various mechanisms
for engaging Direct I/O, including a per-I/O option. The most common method,
however, is to use a mount-time option to enable this feature for the
entire filesystem. UFS also allows Direct I/O to be turned on for the
entire filesystem. One additional benefit of VxFS is that a filesystem
can be remounted with different options without first unmounting the filesystem.
This allows users to remain on-line and active, even when Direct I/O is
toggled. This can be a benefit in enterprises where continuous operation
is necessary.
Lastly, the VxFS filesystem provides a quick snapshot capability that
can mount an additional filesystem as a read-only snapshot of the original.
This is done while the original is still active and available. This feature
is implemented via a copy-on-write mechanism that makes sure any blocks
from the original filesystem are copied out to a special area before the
block is changed on disk. The filesystem snapshot capability requires much
less additional disk space than the equivalent capability of the
logical volume manager, because only blocks that changed during
the snapshot need to be duplicated.
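The copy-on-write mechanism described above can be modelled in a few lines. The sketch below is a toy in-memory model for illustration only; it is not how VxFS is implemented, and the class name SnapshotVolume is hypothetical.

```python
from typing import Dict, List


class SnapshotVolume:
    """Toy block store with a single copy-on-write snapshot."""

    def __init__(self, blocks: List[bytes]) -> None:
        self.blocks = list(blocks)         # live, writable data
        self.saved: Dict[int, bytes] = {}  # originals of blocks changed since the snapshot
        self.snapshot_active = False

    def take_snapshot(self) -> None:
        """Start a snapshot; no data is copied at this point."""
        self.saved.clear()
        self.snapshot_active = True

    def write(self, index: int, data: bytes) -> None:
        """Write a block, saving its original contents the first time it changes."""
        if self.snapshot_active and index not in self.saved:
            self.saved[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, index: int) -> bytes:
        """Read a block as it was when the snapshot was taken."""
        if not self.snapshot_active:
            raise RuntimeError("no snapshot is active")
        return self.saved.get(index, self.blocks[index])
```

The extra space used is the size of the saved dictionary, which grows only with the number of distinct blocks overwritten while the snapshot exists. A backup reading through read_snapshot sees a frozen image while writes continue against the live blocks.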
Data Origin
Knowing where the data is coming from will help the planner to plan an
appropriate configuration. The configuration needed for a local backup
at high speed is very different from that needed to back up hundreds of
small PCs over a metropolitan area network. The considerations below
explore this issue in more depth.
Is the Server Where the Data Resides the One Doing the Backups?
When the server where the data resides does the backups, the complication
of configuring networks is eliminated, and the planning focus is narrowed
to the disk and tape subsystems and server processing capabilities. The
server needs to have sufficient tape bandwidth to meet the backup window
requirements--the available time period for backing up a specified quantity
of data. To ensure capacity for multiple backups of the data (e.g., daily
differential, weekly cumulative, and monthly fulls), tape capacity should
be configured for at least three to five times the dataset size.
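The three-to-five-times guideline translates directly into a cartridge count. The following is a minimal sketch; the dataset size and the cartridge capacity are hypothetical example values, not figures from any product specification.

```python
import math


def tape_capacity_gb(dataset_gb, multiplier):
    """Total tape capacity to configure for a given dataset size."""
    return dataset_gb * multiplier


def cartridges_needed(capacity_gb, cartridge_gb):
    """Whole cartridges required to provide capacity_gb."""
    return math.ceil(capacity_gb / cartridge_gb)


dataset_gb = 500    # hypothetical dataset size
cartridge_gb = 35   # hypothetical native cartridge capacity

low = tape_capacity_gb(dataset_gb, 3)
high = tape_capacity_gb(dataset_gb, 5)
print(low, high, cartridges_needed(low, cartridge_gb), cartridges_needed(high, cartridge_gb))
```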
Disk bandwidth should be configured to meet the backup window requirements
and keep the tapes streaming. (To keep from back-hitching, the DLT tape
7000 tape drive needs to receive data at a rate no less than 3.5 MB/second.)
This may be difficult to ensure, because the server and disk subsystem
are often already in place and tuned to perform a specific set of tasks.
In this case, to determine if the desired backup window is feasible before
planning for a specific set of tape devices, it often helps to measure
the sequential rate of the disk subsystem. If the backup window is feasible,
but backup performance still suffers due to slow disks, the planner needs
to consider reconfiguring or upgrading the disk subsystem as part of the
system upgrade path.
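One rough way to measure the sequential rate is to time a front-to-back read. The sketch below assumes a file path is an adequate stand-in for the disk subsystem; the function name is hypothetical, and the small scratch file is used only so the example is self-contained.

```python
import os
import tempfile
import time


def sequential_read_rate(path, block_size=1024 * 1024):
    """Read `path` from start to end and return the observed rate in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # unbuffered at the Python level
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_bytes / (1024 * 1024) / elapsed


# Demonstration on a small scratch file; a real test would target the
# volume to be backed up and read far more data than fits in memory.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(os.urandom(4 * 1024 * 1024))
    scratch_path = scratch.name
try:
    rate = sequential_read_rate(scratch_path)
    print(f"{rate:.1f} MB/s")
finally:
    os.remove(scratch_path)
```

A freshly written file is likely to be served from the filesystem cache, so the figure printed by this demonstration overstates what the disks can sustain. The measurement is only meaningful when the data read is much larger than system memory or the cache is bypassed.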
Lastly, the planner needs to consider the CPU resources necessary for
local backup. Fortunately, these tend to be minimal, especially if Direct
I/O is used to access the filesystems. For example, with Direct I/O, a
single 250 MHz CPU should be sufficient to back up at 50 MB/second from
local disk to tape. If the backups will be concurrent with regular operation
and the system is already fully loaded, the additional CPU resources needed
for backup may need to be added.
There are some additional factors the planner must consider. If the system
has spare processing capacity, the planner must determine how much head-room
exists and whether it will be sufficient to meet demands. Secondly, if
the backups will be performed at off-peak hours, the planner must determine
if there are any other scheduled processes to be run concurrently with
the backup, and how much CPU is available for both. The planner also needs
to consider sizing and tuning memory, especially if Direct I/O is not
used. The main consideration in that case is the shared memory buffers
used to coordinate between various backup processes, although memory is
needed for essentially all system activities.
Is the Data on Remote Clients?
The planner must consider the requirements for backup of remote clients.
This involves planning for the networking requirements to meet the backup
window and other considerations. There is no single recipe because of
the virtually boundless varieties and configuration possibilities of enterprise
networks. The planner must carefully plan for a successful network backup
infrastructure, and have a good knowledge of network performance.
Even with the latest networking technologies, network bandwidth tends
to lag behind the bandwidth of storage subsystems. Gigabit Ethernet is
theoretically 100 times faster than Ethernet, but at the same time, Fibre Channel
Arbitrated Loop (FC-AL) offers twice the available bandwidth of Gigabit
Ethernet. This discrepancy in bandwidth is unlikely to change anytime
soon, because the tolerances in network connectivity tend to be much tighter
than for storage. Network bandwidth issues are further complicated by
the relatively high cost of upgrading the network infrastructure. While
new storage devices can just be plugged in, adding network capacity may
mean re-wiring parts of the enterprise. Such infrastructure tends to be
very expensive and needs to be planned years in advance. Therefore, even
if the upgrade is committed, there is often a period of time where the
backup solution needs to work around inadequate network bandwidth.
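To see what inadequate network bandwidth means for a backup window, the nominal link speeds can be turned into transfer times. This is a sketch under simplifying assumptions: the 100 GB dataset is hypothetical, and the efficiency parameter (protocol overhead, contention) is left at an idealized 1.0.

```python
# Nominal link speeds in megabits per second.
LINK_MBIT = {"Ethernet": 10, "Fast Ethernet": 100, "Gigabit Ethernet": 1000}


def transfer_hours(dataset_gb, link_mbit, efficiency=1.0):
    """Hours needed to move dataset_gb (decimal GB) over a link of link_mbit."""
    mb_per_s = link_mbit / 8 * efficiency  # megabits -> megabytes per second
    return dataset_gb * 1000 / mb_per_s / 3600


for name, mbit in LINK_MBIT.items():
    print(f"{name}: {transfer_hours(100, mbit):.2f} hours")
```

Even at the theoretical maximum, 100 GB takes roughly 22 hours over 10 Mbit/second Ethernet, which is why the questions that follow concentrate on working around the network.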
Because of these network bandwidth issues, a frequent challenge when
planning backup solutions is to find ways to satisfy backup requirements
within the confines of a given network bottleneck. To understand the overall
situation and to obtain a satisfactory solution, the planner needs to
find the answers to the following five key questions:
How many clients are there? Knowing the number of clients helps the planner
understand the overall scale of the enterprise. It also helps the planner
determine aspects of backup planning such as level of multiplexing. Knowing
the number of clients is also important because it ties in with the clients'
location in the network in relation to the backup server.
What types of clients are there? To understand the client processing
capabilities, the planner needs to know the types (i.e., the architecture
and operating system) of clients that need to be backed up. For example,
if a client has powerful processing capabilities but little network bandwidth
to the server, software compression may be a good choice in backing up
that client. Both Sun StorEdge Enterprise NetBackup and Solstice Backup
software packages offer client-side modules for most platforms.
Do the clients have their own backup devices? If the clients have their
own backup devices, the best configuration may be a hierarchical master-slave
configuration. In this configuration, the master server initiates and
tracks backups, but data goes to the local device. This configuration
saves network bandwidth, and can often be significantly faster. The master-slave
configuration is recommended on large clients connected to the backup
server by a slow network. The backup server is often less powerful than
the client it controls, and the main backup devices are attached to the
slave clients.
How are the clients distributed? Knowing where in the enterprise network
various clients reside helps the planner determine the available network
bandwidth between the clients and server. This is necessary information
for predicting backup times and data rates available from the clients
to disk. Because the network bandwidth is often inadequate, a hybrid solution
is most appropriate, in which both network backup of some clients and
master-slave configurations are used.
How autonomous are the client systems? Sometimes the client systems are
located in remote offices connected to the backup server via WAN (wide-area
network) links. These systems often do not have dedicated technical support,
and hence need to be managed remotely. By centralizing management, Sun
StorEdge Enterprise NetBackup software helps make that task easier. However,
certain tasks are necessarily manual, and involve personnel at the remote
site. These people will need to be trained to carry out specific tasks
associated with backup (e.g., changing tapes in stand-alone drives).
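The software compression trade-off mentioned under client types can be reasoned about with a simple model. The sketch below assumes the client can overlap compression with transmission, so the slower of the two stages sets the pace. The function names and the example rates are hypothetical.

```python
def effective_rate_mb_s(link_mb_s, compress_mb_s, ratio):
    """Rate at which original (uncompressed) data leaves the client.

    link_mb_s     -- network bandwidth to the backup server
    compress_mb_s -- how fast the client CPU can compress data
    ratio         -- compression ratio (2.0 means data shrinks to half)
    """
    return min(compress_mb_s, link_mb_s * ratio)


def compression_helps(link_mb_s, compress_mb_s, ratio):
    """True when compressing on the client beats sending the data as is."""
    return effective_rate_mb_s(link_mb_s, compress_mb_s, ratio) > link_mb_s


# Powerful client on a slow link: compression doubles the effective rate.
print(effective_rate_mb_s(1.0, 4.0, 2.0), compression_helps(1.0, 4.0, 2.0))
# Same client on a fast link: the CPU becomes the bottleneck.
print(effective_rate_mb_s(10.0, 4.0, 2.0), compression_helps(10.0, 4.0, 2.0))
```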
What Does the Disk Subsystem Look Like?
It is critical to obtain the optimal disk subsystem for good backup performance
with modern tape technologies. This is because the disk becomes the next
most likely bottleneck, assuming the network bandwidth is sufficient or
the backups are being performed locally. The performance of the disk subsystem
depends on numerous factors. To plan backup solutions, the planner can
use the questions below as guidelines for addressing some of the more
important disk-related performance issues:
How are the data on the disks laid out? The data layout on the disk affects
throughput rate, because it determines whether access to the disk is mostly
sequential or random. If the access pattern requires frequent seeks between
portions of the disk, the overall throughput rate of data from the disk
will dramatically decrease.
There are three reasons that the access pattern may require frequent
seeks. The most common one is that the data on the disk was created over
a long period of time. In this case, deleted files are left on scattered
parts of the disk, and they are subsequently filled by newer files. A
seek may then occur to get the next file, because the disk is backed up
in directory order. (In this case, one way to obtain mostly sequential
access to the existing files--albeit not an ideal process--is to back up
all the files once, recreate the filesystem on the device, and then restore
all the files from tape.)
Another common cause for this access pattern is that multiple processes
are accessing different regions of the disk simultaneously. This results
in seeks between the various regions. This can occur, for example, if
two different filesystems on the same disk are being backed up simultaneously.
In this case, it may be possible to serialize the access by scheduling
the backups differently.
A third reason for this pattern is that outer regions of the disk (lower
numbered cylinders) tend to be faster than inner regions. Data that needs
to be accessed more quickly may be laid out on the outer cylinders.
How are the disks arranged into logical volumes? The logical volume configuration
significantly affects performance. To add levels of performance or reliability
to the disk subsystem, most enterprise server environments will involve
some level of logical volume management, using software or hardware RAID.
RAID-0 (striped) volumes tend to increase overall performance, but
significantly reduce overall volume reliability. Various combinations
of RAID-1 (mirroring) and RAID-0 increase performance while also increasing
reliability. RAID-5 also tends to increase both performance and reliability.
However, RAID-5 has performance characteristics which slightly complicate
backup planning. Approximately two to three times more time should be
planned for restoring data to a RAID-5 volume than it took to back it
up, because RAID-5 writes (especially small random writes) take significantly
longer than reads. The expected reliability of the logical volumes plays
a role in determining backup frequency. The RAID volume should probably
be backed up more frequently if the following are all the case: the volume
has poor reliability (e.g., RAID-0), it is updated often, and it contains
valuable data.
How are the disks managed? Another important consideration is the mechanism
by which the individual disks are managed or configured into logical volumes.
Two possible mechanisms are host-based and hardware RAID. Host-based RAID
imposes slightly more overhead on the server system than hardware RAID,
but tends to be more flexible. Various volume managers offer different
RAID configuration options (e.g., RAID 1+0 vs. RAID 0+1). Some volume
managers also offer additional features (e.g., snapshot) that are attractive
for backup solutions. A large number of server clients and most workstation/PC
clients do not implement logical volume management at all, and are limited
to the performance and reliability characteristics of the individual component
disks (i.e., JBOD).
What are the disk capabilities? The capabilities of the individual disks
also affect disk subsystem performance and reliability. Newer disks tend
to be faster and more reliable than older disks. This is not only because
of age, but also because of rapid advances in disk technologies. Each disk
tends to be capable of a certain data rate when doing sequential I/O, and
a certain seek rate when doing random I/O. When the disks are managed as RAID
volumes, these capabilities place limitations on the overall logical volume
performance. Additionally, different disks have different MTBF (mean-time-between-failures).
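The effect of seeks on throughput, raised in the data layout question, can be estimated with a one-line model: each file costs one seek plus its sequential transfer time. The seek time, sequential rate, and file size below are hypothetical example values.

```python
def effective_disk_rate(file_mb, seq_rate_mb_s, seek_s):
    """Throughput in MB/s when every file read is preceded by one seek."""
    return file_mb / (seek_s + file_mb / seq_rate_mb_s)


small_files = effective_disk_rate(file_mb=8 / 1024, seq_rate_mb_s=10.0, seek_s=0.010)
large_files = effective_disk_rate(file_mb=100.0, seq_rate_mb_s=10.0, seek_s=0.010)
print(f"{small_files:.2f} MB/s for 8 KB files, {large_files:.2f} MB/s for 100 MB files")
```

Under these assumptions, a disk that streams at 10 MB/second delivers well under 1 MB/second when the data consists of scattered 8 KB files, while large files are barely affected.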
Data Destination
Several key questions below provide guidelines for the planner to plan
for factors related to the tape subsystem, the data's target location.
What Does the Tape Subsystem Look Like?
The tape subsystem is another critical consideration, but tends to be
slightly less complex than the disk subsystem. Overall, tape devices tend
to be relatively predictable and generally behave as advertised. The most
difficult task associated with a high-performance tape subsystem tends
to be installation and configuration rather than planning.
Planning tape subsystem capabilities is often a matter of using the device
specifications to amass the required storage capacity and throughput.
The planner can use the following questions to consider related issues:
Where do the tape devices reside? The planner needs to determine whether
the tape devices are stand-alone desktop or rack-mounted units that need
to be loaded by hand, or if they are mounted in a robotic library. If
they are the former, the planner needs to consider planning for the human
interaction required to implement an effective backup solution.
The robotic library is a superior choice for enterprise-level backup
solutions. There are many variations of tape libraries, but most commonly
they offer multiple tape drives and internal storage capacities in the
hundreds of gigabytes.
By knowing the required data capacities, the planner can plan for a sufficient
number of libraries to house all the data and to have room to grow. It
may be more reliable to purchase a number of smaller libraries than a
single very large library, because most tape libraries have only a single
robot mechanism.
How many tape drives are there? The planner needs to determine the number
of tape drives needed to meet the throughput requirements, and to configure
at least that many as part of the libraries. The planner must also remember
the SCSI or FC-AL slots on the server needed to connect the tape robotic
devices. If there is an existing tape subsystem, they must determine its
capabilities and supplement them with new equipment, if necessary. They
must also be aware of any forward or backward compatibility issues with
the media, because tape formats change almost as frequently as the underlying
hardware.
What are the drive capabilities? Each individual type of tape drive has
its own characteristics and capabilities. These include native-mode throughput,
tape capacity, effectiveness of compression, compatibility of tape formats,
and recording inertia. While throughput and capacity are relatively simple,
the others also need to be carefully considered.
The actual compression ratio achieved depends mostly on the type of data,
but it also depends on the compression algorithm implemented by the drive
hardware. For example, the DLT tape 7000 algorithm prefers to trade throughput
for compaction, while the EXB-8900 Mammoth 8 mm drive prefers the opposite.
Not all tape drives are capable of using older media, even if the form-factor
is identical. Most can read tapes written with older formats but cannot
write in the older format.
If the backup images are to be archived for a number of years, the upgrade
path is also important. The drive technology will chiefly determine the
recording inertia. For example, linear recording technologies like the
DLT tape 7000 and STK Redwood drives tend to have a stationary read/write
head and quickly moving tape. To perform well, these drives need to be
fed data above a specific rate. Helical-scan technologies like 8 mm and
4 mm tapes have lower recording inertia and are thus less sensitive to
data input rates, but have overall lower throughput capabilities. It is
difficult to balance all these factors, but as long as some minimal requirements
are met, a suboptimal choice usually has little real effect on the overall
performance.
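The drive count and the minimum feed rate can be checked together. In this sketch the 3.5 MB/second streaming minimum is the DLT 7000 figure quoted earlier in this paper; the 5 MB/second native rate and the 18 MB/second target are assumed example values, and compression is ignored.

```python
import math


def drives_needed(target_mb_s, native_mb_s):
    """Smallest number of drives whose combined native rate meets the target."""
    return math.ceil(target_mb_s / native_mb_s)


def can_stream(source_mb_s, drives, min_stream_mb_s):
    """True if the source can feed every drive at or above its streaming minimum."""
    return source_mb_s / drives >= min_stream_mb_s


drives = drives_needed(target_mb_s=18.0, native_mb_s=5.0)
print(drives, can_stream(18.0, drives, 3.5), can_stream(10.0, drives, 3.5))
```

The second check matters for linear drives: if the disks or network deliver only 10 MB/second to four drives, each drive receives 2.5 MB/second and falls below its streaming rate.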
How Are the Tape Devices Distributed?
It is also important to optimally position tape devices throughout the
enterprise. This mainly depends on where it is advantageous to make the
extra effort and attach backup devices directly to servers where the data
resides. The following questions can help the planner examine the relationship
between the tape devices and data, and may help them to focus on the relevant
considerations:
Are all tapes on the master server? If all tape devices reside on the
master server and the bulk of the data is elsewhere, the network needs
to support the transfer rates necessary to move data from the remote clients
to the centralized backup server. This configuration often simplifies
day-to-day management at the cost of a complex networking infrastructure.
As noted previously, networks are traditional bottlenecks for backup applications,
and need to be configured for optimal performance.
Are tape libraries attached to important servers? An effective backup
architecture is to add tape devices to servers where large quantities
of data reside, and task them with being backup slave servers, centrally
managed from the master server. With this architecture, the only information
that is communicated over the network between master and slaves is the
file record information, about 200 bytes per file backed up. Both Sun
StorEdge Enterprise NetBackup and Solstice Backup software packages support
this option.
How close are the tape drives to the data? The proximity of the tape
drives to the data is usually an issue of network bandwidth. This is because
shorter network distances tend to be covered by higher speed network links.
If the tape devices and data are separated by hundreds of kilometers,
the link bandwidth is likely to be low. In contrast, if they are located
in the same data center, it may be simple to configure a point-to-point link,
dedicated for backups, between the two. This is mainly important when
deciding where to locate the master server in a widely distributed enterprise,
because the network architecture and data locations tend to be fixed.
A general guideline is to locate the master server as close as possible
to the bulk of the data, and hopefully close to a central location in
the network topology.
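The figure of about 200 bytes of file record information per file makes it easy to estimate the network traffic between a master and its slave servers. The file count below is a hypothetical example.

```python
BYTES_PER_FILE_RECORD = 200  # approximate figure quoted above


def catalog_traffic_mb(num_files):
    """Approximate file-record traffic in decimal MB for num_files backed up."""
    return num_files * BYTES_PER_FILE_RECORD / 1_000_000


print(catalog_traffic_mb(1_000_000))  # one million files
```

One million files therefore generate roughly 200 MB of record traffic, regardless of how much data the files contain.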
Tape Environment
The operating environment influences the reliability and longevity of the tape
subsystem. The planner can use the three questions below to address the
main factors:
What are the temperature and humidity like? Tapes perform best in moderate
temperatures and relatively low humidity. The operating temperature affects
things like tape tension and strength, drive part tolerances, and temperature
of internal electronic components in the drive. Humidity may affect the
longevity of the magnetic coating on the tape. This is because high humidity
causes the surface of the tape to become gummy. The ideal operating conditions
tend to be listed as part of the media packaging. For example, the DLT
CompacTape IV lists operating conditions as 10-40 degrees C, storage as
16-32 degrees C, and humidity between 20-80%. Long-term archive storage
(20+ years) requires even more stringent conditions.
How often are the drive heads cleaned? Drive heads need to be cleaned
periodically because they pick up deposits with continual use. This is
usually accomplished by inserting a cleaning tape. Drives operating in
dirty conditions (e.g., near printers) need to be cleaned more frequently,
as do drives that operate outside of environmental specifications. Brand
new tapes tend to have some manufacturing debris on the surface, and drives
that frequently use brand new tapes should also be cleaned frequently.
Both backup software and tape library hardware are capable of automatically
inserting cleaning tapes after a certain number of uses.
How old are the drives and tapes? As they get older, tape drives tend
to eventually wear out and encounter errors more frequently. Each tape
technology has an associated MTBF (mean-time-between-failures), and media
has a certain rated number of passes before it is expected to wear
out. These statistics, available from the manufacturers, tend to be optimistic.
The Data's Path
One of the last considerations in the overall system is the path the
data takes from the disks where it originates, to the tape cartridges
where it is destined. The planner can explore this factor through the
following questions:
Are Data and Tape Local to Backup Servers?
If data and tape are local to backup servers, the planner should focus
configuration and tuning on moving data quickly through the
system between the devices. They should also focus on supporting the potentially
large number of processes involved in managing the backup streams. These
tend to fall into two areas: using memory effectively and providing local
host/RPC capacity.
Is the filesystem buffer cache used? Backups are more efficient when
avoiding the filesystem buffer cache. The buffer cache can be bypassed
by either using Direct I/O to access individual files, or backing up the
raw volume rather than the filesystem.
How much system memory exists? Backup relies on system memory in two
capacities. Primarily, it is used for shared memory regions used to implement
interprocess communication between various backup/restore processes. Memory
is also used when buffering filesystem data in the virtual memory cache.
If data is cached in virtual memory faster than old pages can be purged,
the system may begin to thrash. More memory temporarily forestalls this
condition. However, if the system is in a condition where data is cached
faster than purged, it will likely thrash at some point during the course
of a long backup.
The most elegant solution is to avoid the buffer cache in the first place,
but if that is impossible, the planner needs to tune the memory reclaim
rates to be more aggressive. In addition, to improve I/O to the swap device,
they also need to stripe-swap across multiple spindles. This may eliminate
thrashing, or at least reduce its impact.
What software is being used? The software used determines the overall
efficiency with which data is moved from disk to tape. Both Sun StorEdge
Enterprise NetBackup and Solstice Backup Power Edition software packages
move data very efficiently, but Solstice Backup Network Edition software
is a little less efficient. The Solaris Operating Environment software
utilities such as tar and ufsdump are not particularly efficient and should
not be used to implement enterprise backup solutions.
How much shared memory is available? The amount of shared memory the
system can allocate is controlled in the /etc/system file. This file determines
the memory used for interprocess communication (IPC) between the reader and
writer processes in the system. For efficient backup and restore, a certain
amount of shared memory should be configured per device and data stream.
What are the TCP tunings like? Tuning various TCP parameters in the
kernel helps determine the buffer sizes used by the system, and how
quickly closed connections lingering in various TCP wait states are
flushed from the system.
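The observation in the system memory question, that more memory only temporarily forestalls thrashing, follows from simple arithmetic. The sketch below assumes constant fill and purge rates, which real systems do not have; all numbers are hypothetical.

```python
def seconds_until_exhausted(free_mb, fill_mb_s, purge_mb_s):
    """Time until free memory is consumed, or None if purging keeps up."""
    net_fill = fill_mb_s - purge_mb_s
    if net_fill <= 0:
        return None
    return free_mb / net_fill


print(seconds_until_exhausted(1024, 20, 12))  # data cached faster than purged
print(seconds_until_exhausted(2048, 20, 12))  # doubling memory doubles the delay
print(seconds_until_exhausted(1024, 10, 12))  # purging keeps up
```

Doubling memory only doubles the time before the problem appears, which is small compared with the length of a long backup. Bypassing the cache or raising the purge rate removes the net fill altogether.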
Are Data and Backup Server(s) Distributed on the Network?
If the data is connected to its eventual destination on tape by a network,
the planner needs to place emphasis on making sure the connectivity is
uninterrupted and of sufficient bandwidth to meet requirements. To do
this, the planner should consider the following questions.
What kind of network is it? Not all networks behave similarly, although
all networks tend to be described in terms of their bandwidth. Different
networking technologies have different properties. Ethernet variants tend
to be inexpensive and common, but their range tends to be limited to local
area networks. Within local area networks, there are various topologies
that have different performance characteristics (e.g., switched to the
desktop vs. hub vs. shared segment).
In addition, the nature of Ethernet causes overall bandwidth to degrade
as more nodes are active on the network simultaneously. ATM (asynchronous
transfer mode) and FDDI (fiber distributed data interface) networks have
longer ranges and degrade more gracefully under heavier loads. However,
they use fiberoptic connections, which make them less common and more
expensive to install. Gigabit Ethernet and Sun Quad FastEthernet are growing
in popularity due to their familiarity and ease of management, but are
still not common in existing enterprises.
What is the available network bandwidth? A typical enterprise network
consists of multiple segments and various network technologies. The available
network bandwidth from one client to another may be vastly different.
The planner must estimate the available bandwidth for each key path between
backup server and client. This often entails constructing a detailed map
of the enterprise network, which may not be available or up to date.
Obtaining this information may require several days of planning.
How many simultaneous clients are sharing it? If all clients are active
at once, the network is more likely to become overloaded as the number of
clients on each network segment grows. However, when there are more clients, the
level of multiplexing to the tape drives can be increased. This allows
them to stream when a single client is too slow to feed data to the tape
at a sufficient rate.
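The level of multiplexing needed to keep a drive streaming can be estimated from the slowest acceptable feed rate. This sketch assumes every client delivers the same rate and that the network can carry the combined streams. The 3.5 MB/second minimum is the DLT 7000 figure quoted earlier; the client rates are hypothetical.

```python
import math


def multiplex_level(min_stream_mb_s, client_mb_s):
    """Number of concurrent client streams needed to reach the streaming minimum."""
    return max(1, math.ceil(min_stream_mb_s / client_mb_s))


print(multiplex_level(3.5, 0.5))  # slow clients: several must be interleaved
print(multiplex_level(3.5, 4.0))  # one fast client is enough
```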
Enterprise Backup Requirements
Backup requirements tend to fall into a few discrete camps. These are
primarily concerned with the following:
Backing up the data in a certain period of time
Restoring the data as needed
Limiting the impact of the process on day-to-day operations
The planner should thoroughly investigate these requirements before suggesting
any particular solution. The following questions give the planner a useful
start to that investigation:
What Is the Backup Window?
The planner's first step is to determine the backup window. However, a
backup window is not a given, and there may not be an ideal period of
time in which to perform the backup. Some applications and services need
to be available 24 hours a day, seven days a week. In those cases, other
methods of obtaining consistent backups need to be employed. In less extreme
situations, the amount of time necessary to perform the backup may exceed
the natural period of inactivity. These situations require compromises,
in terms of one of the following:
What is backed up
Backup frequency
Data availability
Performance impact
When are people least likely to need access to the data? Times when demand
for data is light tend to create natural backup windows. This is usually
when few users are on the system, typically at night or on weekends. There
may be other predictable periods of time when system activity is low (e.g.,
lunch time, after quarterly processing is completed, and holidays), and
these are also good opportunities to schedule backups. If these natural
periods of inactivity are insufficient, the planner needs to consider
how or when they could be extended. The planner's goal is to ease the
burden in terms of required throughput necessary to back everything up
during the backup window.
How much data needs to be backed up (full and incremental)? The other
part of the equation is the amount of data that needs to be backed up.
For consistency and recovery purposes, the ideal backup saves the full
set of data. The down side is that the full dataset is usually very large,
consuming a lot of time and tape capacity. Most installations choose to
perform full backups occasionally, and supplement those with more frequent
incremental backups that record only the data that changed.
Sun StorEdge Enterprise NetBackup software offers a number of incremental
backup options. Differential backups record files that changed since the
last backup (either full or incremental). Cumulative backups record all
files that changed since the last full backup. A drawback of cumulative
backups is that they usually record more data than differential backups.
An advantage is that restoring requires retrieving only from the last
full and the last cumulative, rather than fetching the last full and potentially
many differential images.
Solstice Backup software offers similar mechanisms, including multiple
levels of cumulative backups similar to the levels used by ufsdump(1M).
By knowing the potential backup targets of the software and data usage
patterns at the site, the planner can estimate approximately how much
data will be saved during each type of backup. A target data rate that
the backup system should plan to achieve can be obtained by dividing that
amount by the estimated time available. Various margins of error can be
built into the calculations for added control.
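The estimate described above can be sketched in a few lines, together with the restore consequence of each incremental type. The assumptions are loud ones: each day is taken to change a distinct slice of the data, the dataset size, change fraction, window, and margin are hypothetical, and the function names are illustrative rather than part of either backup product.

```python
def backup_sizes_gb(dataset_gb, daily_change_fraction, days_since_full):
    """Rough size of each backup type, assuming non-overlapping daily changes."""
    differential = dataset_gb * daily_change_fraction
    cumulative = min(dataset_gb, differential * days_since_full)
    return {"full": dataset_gb, "differential": differential, "cumulative": cumulative}


def target_rate_mb_s(amount_gb, window_hours, margin=0.0):
    """Amount divided by available time, padded by a safety margin."""
    return amount_gb * 1024 * (1 + margin) / (window_hours * 3600)


def images_needed(history):
    """Indexes of the backup images a full restore must read.

    history lists 'full', 'cumulative', or 'differential', oldest first,
    and must contain at least one 'full'.
    """
    last_full = max(i for i, kind in enumerate(history) if kind == "full")
    chain = [last_full]
    later = range(last_full + 1, len(history))
    cumulatives = [i for i in later if history[i] == "cumulative"]
    start = last_full
    if cumulatives:
        start = cumulatives[-1]  # the latest cumulative supersedes earlier increments
        chain.append(start)
    chain += [i for i in range(start + 1, len(history)) if history[i] == "differential"]
    return chain


sizes = backup_sizes_gb(400, 0.05, 4)
rate = target_rate_mb_s(sizes["cumulative"], window_hours=4, margin=0.25)
print(sizes, f"{rate:.2f} MB/s")
print(images_needed(["full", "differential", "differential", "differential"]))
print(images_needed(["full", "cumulative", "cumulative", "cumulative"]))
```

The last two lines show the trade-off: three differentials after a full require four images at restore time, while three cumulatives require only two.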
What Is the Acceptable Impact of Performing the Backup?
If the window of inactivity is not sufficient to save the required volume
of data without resorting to extravagant hardware, the planner needs to
estimate the impact of performing backup concurrent with regular system
use. To minimize the impact, there are a number of options available,
with no hard rules for planning what to do when. The planner must evaluate
all possible choices and select the most appropriate one.
Is data unavailability acceptable? The planner's central consideration
is whether data can be kept from the users for some period of time. If
it can, the planner needs to determine how that dedicated time might be
best used to perform the backup. If data can be unavailable for some length
of time, it is usually possible to back it up faster than if it were kept on-line.
This may be in the form of shutting down any databases and backing up
the raw volume, or unmounting any filesystems and backing up the underlying
devices.
Is degraded performance acceptable? If data needs to be continually available
but the overall performance of the system may be somewhat degraded, one
choice is to continue backups concurrent with user activity. There are
a number of mechanisms for on-line backup, and each has a different degree
of impact on performance. The planner needs to assess the trade-offs and
choose the best possible compromise.
How long is degraded performance acceptable? If data unavailability or
degraded performance is acceptable, the planner needs to determine the
period of time that must not be exceeded. This period is usually shorter
for data unavailability than for degraded performance, but lower performance
may lead to overall lower productivity, and thus should be minimized.
If databases are used, are appropriate modules available? Not all commercial
databases have corresponding backup modules for the Sun StorEdge Enterprise
NetBackup and Solstice Backup software packages. If hot backups of a database
or some other high-level system are needed, the planner must verify that
an appropriate module is available.
What Availability Concerns Should the Solution Address?
Each solution should address the real concerns and objectives of the customer.
It is vital to understand the availability concerns that the backup attempts
to address. For example, a good solution for retrieving accidentally deleted
files is probably not a good solution for disaster recovery in which the
whole site may be destroyed. To start thinking about the relevant issues,
the planner can use the following three questions:
Is it critical to minimize impact of user or operator error? If the major
concern being addressed is loss of individual files, the solution should
be designed to retrieve the file quickly and with minimal effort on the
part of the administrator. Minor issues include tape storage, duplicate
media, and offsite import/export. An important issue is backup frequency,
because the copy on tape should be as close as possible to the file's
final state. The level of multiplexing can be high, because the overall
throughput is not an issue when retrieving a small set of files, unless
the files are very large.
To address such issues, planners may choose to use disk-based rather
than tape solutions. Such solutions include, for example, keeping a third
mirror of the volume off-line and readable in case something needs to
be retrieved, or backing up important files to a disk directory rather
than tape.
Is it critical to minimize impact from loss of equipment? If the goal
is to minimize the impact of failed hardware (e.g., disk-head crash),
backups can be structured to keep data from the same equipment arranged
on the same set of media, and perhaps to duplicate the media. This would
minimize tape fetch time for data that spans several tapes. To reduce
the chance of losing data to failed hardware, RAID software or devices
can be used.
Impacts from hardware failure also relates to highly available and cluster
configurations. Configuring backup for these environments is potentially
difficult and requires some experience. In situations where the entire
system needs to be highly available, the best solution may be specialty
contractors like Comdisco.
Is it critical to minimize impact in case of disaster? Disaster recovery
and preparations need to encompass all aspects of the operation. These
aspects range from frequent training for data center personnel, to using
customized scripts for the backup software. The more common steps, however,
are to keep multiple copies of media, one local and another archived at
a remote site. Another option is to have another site where the data is
imported by the backup software and ready for a restore. Some companies
choose to have a "hot site" available to go on-line within a
few minutes of a disaster, where the configuration has the same capabilities
as the original site.
Expectations
A critical part of effective capacity planning is maintaining realistic
expectations. This is important because a few areas of confusion can cause
a disproportionate number of problems.
Compression
Compression can be problematic for a number of reasons. The main reason
is that the benefit of compression varies with the data being compressed
as well as with the compression mechanism being used. With the same compression
algorithm, different types of data compress to different degrees. The
level of compression depends on how much redundancy can be identified
and remapped in the time available. Some types of data (e.g., video) have
little or no redundancy to eliminate. Therefore, these will not compress
well regardless of the compression scheme that is used. Hardware compression
in the tape drive typically relies on a small buffer in which to temporarily
hold the data as it is compressed. The size of this buffer limits how
much of the data may be examined for redundant patterns. Lastly, the amount
of time necessary to locate all redundant patterns may be longer than
available to the compression mechanism. This is because the compression
needs to happen in real-time, as the data streams into the tape device
and onto the tape.
People often expect either the 2:1 compression ratio frequently quoted
in the tape hardware literature, or ratios similar to those they see
with compression utilities like compress(1) or GNU zip. In the
past, the 2:1 number was sometimes touted as "typical", but
in truth it was typical only of the special test patterns manufacturers
use to test their algorithms. When compressing diverse types of data in
the field, the compression ratios were often lower. If capacity planning
was done expecting 2:1 compression, the system was often inadequate to
the task.
Another typical compression mistake is to compress the target data on
the system using compression programs, and use the observed compression
ratios to estimate hardware compression. This mistake stems from the different
natures of hardware and software compression. Compression utilities can
use all of the system memory to perform the compression, and are under
no time constraints. Hardware compression is limited by the hardware buffer
size, and must be performed in real time. The compression ratio observed
with software utilities will usually be much better than the drive hardware
can deliver. Capacity planners who use those numbers risk specifying
an inadequate system.
Compression ratios for various types of data (as observed in simple tests)
are shown in Table 1. For hardware compression, the more "typical"
compression ratio to expect is closer to 1.4:1, although some data types
appear to do better. If attempting to save data with little to no redundancy
(e.g., compressed video like MPEG or MJPG), it is better for compression
in the drive to be turned off. In addition, the compression mechanism
has two effects. The first is to speed up the rate at which data is processed
by the device, and the second is to compact the data written to tape so
that tape can hold more information. 1
Table 1: Compression Ratios

Mode          Speedup Ratio    Compaction Ratio
None          1:1              1:1
Text          1.46:1           1.44:1
Motion JPG    0.93:1           0.92:1
Database      1.60:1           1.57:1
Fileserver    1.60:1           1.63:1
Webserver     1.57:1           1.82:1
Aggregate     1.32:1           1.39:1
Overhead
The planner should plan for a certain amount of metadata overhead, on
top of the data that is being saved and restored. The backup software
keeps a database of files residing on tape, with a record for each instance
of the file. An estimated 150-200 bytes are needed per file record; Solstice
Backup software typically requires slightly more bytes than Sun StorEdge
Enterprise NetBackup software. This means that a database containing a
million file records is typically between 143 MB and 191 MB. The planner
should plan for reliable, fast disk space to accommodate the file database,
and should configure a regular schedule to back up the database itself.
The software also writes a certain amount of metadata to tape in order
to keep track of what is being written where. This metadata tends to be
minor in relation to the dataset size. Simple tests indicate that the
metadata written to tape by the Sun StorEdge Enterprise NetBackup and Solstice
Backup software packages is typically below 1% of the data. Other software (e.g.,
ufsdump) may write more metadata to the tape, depending on the format
used.
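The file database estimate above can be reproduced with a few lines of
arithmetic. The following is a minimal sketch, not part of either backup
product; the function names are hypothetical, and it assumes that "MB" means
1024 x 1024 bytes, which is what the 143 MB and 191 MB figures imply.

```python
# Assumption: 1 MB = 1024 x 1024 bytes, which matches the 143-191 MB figures.
BYTES_PER_MB = 1024 * 1024


def catalog_size_mb(file_records, bytes_per_record):
    """Estimated size in MB of a file database holding file_records records."""
    return file_records * bytes_per_record / BYTES_PER_MB


def catalog_size_range_mb(file_records, low_bytes=150, high_bytes=200):
    """(low, high) estimates using the 150-200 bytes-per-record range."""
    return (
        catalog_size_mb(file_records, low_bytes),
        catalog_size_mb(file_records, high_bytes),
    )


if __name__ == "__main__":
    low, high = catalog_size_range_mb(1_000_000)
    print(f"1,000,000 file records: {low:.0f} MB to {high:.0f} MB")
```

Because a record is kept for each instance of a file, the record count to
plug in is the number of files multiplied by the number of backup copies
retained, not just the number of files on disk.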
Recovery Performance
Another common planning error is to assume that restore performance will
be identical to backup performance. Initial rules of thumb suggested expecting
the recovery to take approximately three times longer than the backup.
While this is probably a safe metric to use, recent measurements indicate
that it may be too conservative for Sun's latest systems and software.
With proper tuning and adequate hardware support, it is possible to have
restores perform within 10% of backups. When no other information is available,
it may be safer to use a compromise, such as assuming restores take 50% or
75% longer than backups, because the best-case performance is predicated
on a number of assumptions.
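As a rough illustration, the alternative planning assumptions can be turned
into restore-time estimates as shown below. This is only a sketch: the
dictionary keys and function name are hypothetical, and reading "within 10%
of backups" as a factor of 1.1 is an interpretation rather than a measured
value.

```python
# Multipliers applied to the backup duration. The labels are hypothetical;
# the values come from the rules of thumb discussed in the text.
RESTORE_FACTORS = {
    "rule_of_thumb": 3.0,    # early rule: restore takes about 3x the backup
    "compromise_75": 1.75,   # restore takes 75% longer than the backup
    "compromise_50": 1.5,    # restore takes 50% longer than the backup
    "well_tuned": 1.1,       # assumption: "within 10%" read as 1.1x
}


def restore_hours(backup_hours, assumption="compromise_50"):
    """Estimate restore duration from backup duration and a planning assumption."""
    return backup_hours * RESTORE_FACTORS[assumption]


if __name__ == "__main__":
    for name in RESTORE_FACTORS:
        print(f"{name}: a 4 hour backup restores in about "
              f"{restore_hours(4, name):.1f} hours")
```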
The main reason for this performance discrepancy appears to be the nature
of writes versus reads. For various reasons, writes to stable storage
often take longer than reads. There is also more frequent demand for writes
to be performed synchronously (in order to guarantee consistency). For
example, creating files requires several synchronous writes to update
the metadata keeping track of the file information. Those updates need
to be performed in order to preserve file integrity.
Another component of the longer restore times is the browse delay introduced
at the start of the request. When a restore request is initially issued,
the software needs to browse the file record database and locate all records
that need to be retrieved. This may take some time for large databases
containing millions of records.
The situation is even more complicated for multiplexed restores. This
is because the software usually waits to make sure all requests are received
before initiating the actual restore. Alternatively, it may go back to
retrieve files that were requested after the restore had already begun.
This occurs in order to resynchronize the retrieval of file data intermingled
on the same length of media. Otherwise, the restore operation needs to
be serialized, constantly rewinding the tape to get each additional backup
stream.
Ease of Use and Training Requirements
Modern storage management software offers powerful features behind
easy-to-use graphical user interfaces (GUIs). Modern library hardware has also
been streamlined for ease of use and reliability (e.g., the GUI touch-screen
controls on the Sun StorEdge L3500 tape library). However, the entire
area of backup and data protection is very complex. It will inevitably
be up to the planners, installers, and operators to make complex decisions
which will affect the long-term success of the installation. This requires
training on the products involved, hands-on experience, and at least a
rudimentary understanding of the issues.
It is naive to expect to take the software out of the shrink-wrap, uncrate
the hardware, and put together a well-tuned backup solution. On top of
careful planning, even moderately complex backup installations call for
trained and experienced personnel to install, configure, and tune the
various components. This usually takes several days of dedicated effort.
For the most complex installations, it may take multiple weeks to have
everything optimally running.
The most successful approach is to bring in experienced consultants (e.g.,
Sun, Veritas, or Legato professional services) to install and configure
the system for current needs, and to teach on-site personnel the basics
of maintaining and operating the configured system. The on-site personnel
then need to develop in-depth knowledge to be able to modify the configuration
to meet increasing demands; this can be achieved through further training
or other means. Meeting on-going demands is certainly also possible through
long-term contracts with the consulting services that initially configured
the systems.
Measurements and Calculations
A number of simple measurement techniques and calculations are useful
in reaching correct capacity planning decisions. The following sections
should provide the necessary background and tools for the planner to make
simple bandwidth estimates to match capacities. This information can also
provide an example methodology that can be adapted for more complex decisions.
These sections serve as a reference as well as a learning aid.
Network Sizing
Accurately sizing networks can be tricky. There are many different networking
technologies in place today, and more are being added over time. Each
technology has its own characteristics, and these are complexly interrelated.
There are often multiple paths between any two points in the network,
and different paths offer different bandwidths. All these factors combine
to challenge planners trying to understand the layout of the corporate
intranet.
It is usually easiest for the planner to start from scratch and plan
for additional new networks dedicated for backup. Unfortunately, adding
these may involve pulling additional wiring between distant corners of
the enterprise, which is far more expensive than the purchase of a few switches
and adapters. To meet the new backup demands, most planners need to understand
how to efficiently use the existing network infrastructure.
Principles
To effectively perform network capacity planning, there are a few simple
but powerful techniques that the planner can use. When working without
an existing network configuration for backup, the goal is to configure
sufficient bandwidth between the data location and the tape device. This
can be done by allocating multiple links between source and destination
until the aggregate bandwidth is adequate. Table 2 may help the planner
choose technologies best suited to the task.
Table 2: Estimated Rates for Various Network Technologies

Technology                     Theoretical Speed    Realistic Speed
Modem                          28.8 KBaud           2 KB/sec
ISDN                           128 Kb/sec           10 KB/sec
Frame Relay 256                256 Kb/sec           20 KB/sec
Frame Relay 512                512 Kb/sec           39 KB/sec
T-1                            1.54 Mb/sec          115 KB/sec
T-3                            44.7 Mb/sec          3.4 MB/sec
Ethernet (10BaseT)             10 Mb/sec            0.75 MB/sec
FastEthernet (100BaseT)        100 Mb/sec           7.5 MB/sec
GigabitEthernet (1000BaseT)    1000 Mb/sec          50 MB/sec
FDDI                           100 Mb/sec           8 MB/sec
CDDI                           100 Mb/sec           8 MB/sec
ATM 155                        155 Mb/sec           11.6 MB/sec
ATM 622                        622 Mb/sec           50 MB/sec
HIPPI-s                        800 Mb/sec           60 MB/sec
Most realistic environments will already contain significant investment
in network infrastructure that can be leveraged for backup and recovery.
With the high cost of installing additional wiring, it is usually preferable
to strategically place backup servers to use these existing networks.
The planner's first step is to sketch a map of the existing network.
Many enterprises will already have such a map, or know who to turn to
for this information. The goal is to produce a map showing all relevant
network links in relation to one another. These links are then labeled
with their expected available bandwidth during the projected backup window.
The full bandwidth of the link may not be available for backup, because
the networks are usually shared with other users. Network administrators
often keep usage statistics that may point to a time when the networks
are nearly idle, an ideal time to perform backups. The planner also needs
to note how much backup data is located on each segment, and on which
machines it is located. Systems that have a large concentration of data
may turn into hotspots; therefore, they need to be adjusted.
Once a map of the existing network infrastructure is available, the planner
needs to locate the most central point in the network. This is the one
that has the most access to plenty of bandwidth and minimizes the overall
number of hops to the data. This central point is the ideal place to locate
the master server from the standpoint of the network. (Administrative
issues may be a different story.) Once the master server is placed, the
planner needs to estimate the available bandwidth from the various key
data sources to the master server.
If the above estimation process shows that the network would be a bottleneck,
the planner needs to consider adding slave servers to various network
segments. One situation the planner needs to consider is when there are
a few systems that hold the bulk of the data on that segment, and one
system in particular is either the largest or least busy. In this case,
the planner must consider converting that machine into a slave server
by adding a tape subsystem that is sufficient to service the local backup
needs. If no such machine is available and all existing machines are fully
utilized, the planner should consider adding an additional system on that
segment to be the slave backup server. The slave servers will direct all
backup data to themselves, limiting the network transfer to the master
server to just the file record information.
Estimating Available Bandwidth
The planner needs to use the network map to estimate available bandwidth
without accessing the actual network. For each link, the planner needs
to locate the route from data source to the nearest backup server. The
bandwidth of that route is the bandwidth of the slowest link in that route.
To estimate the rates of individual links, the planner should use the
realistic rates listed in Table 2. If multiple streams will traverse a
link simultaneously, it is best for the planner to assume that all streams
will equally share the link. As the number of active hosts increases,
all network technologies degrade somewhat, but Ethernet variants, in particular,
degrade rather quickly. For links using variants of the Ethernet protocol,
the planner must try to keep the number of simultaneous streams below
twenty.
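The estimation rules above (a route is as fast as its slowest link, and
simultaneous streams share a link equally) can be sketched as follows. The
function names and the example route are hypothetical; the rates are the
realistic rates from Table 2.

```python
ETHERNET_STREAM_LIMIT = 20  # guideline from the text: stay below twenty streams


def route_bandwidth(link_rates_mb_s):
    """Bandwidth of a route is the bandwidth of its slowest link."""
    return min(link_rates_mb_s)


def stream_bandwidth_on_route(links):
    """Per-stream bandwidth over a route.

    links is a list of (realistic_rate_mb_s, simultaneous_streams) tuples,
    one per link. Each link is assumed to be shared equally by its streams,
    and the route is limited by the link with the smallest share.
    """
    return min(rate / streams for rate, streams in links)


def exceeds_ethernet_guideline(streams):
    """True if an Ethernet link carries twenty or more simultaneous streams."""
    return streams >= ETHERNET_STREAM_LIMIT


if __name__ == "__main__":
    # Hypothetical route: a FastEthernet client link carrying one stream,
    # then a GigabitEthernet backbone shared by ten streams.
    route = [(7.5, 1), (50.0, 10)]
    print(f"{stream_bandwidth_on_route(route):.1f} MB/sec per stream")
```

In this hypothetical route the shared backbone, not the slower client link,
ends up being the limiting factor, which is why the stream count per link
needs to be recorded on the network map along with the link type.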
Another approach the planner can use is to measure the available bandwidth
across key points of the network. There is no fail-proof way of doing
this, because most bandwidth measurement tools are invasive, and may or
may not mimic the type of load applied by the application in question.
Perhaps the easiest way for the planner to measure the available bandwidth
is to use the ftp utility. To do this, the planner can use the following
procedure:
1. Create a large file on the client system. A tar file containing some relevant
   files is ideal, although you can just concatenate a number of smaller
   files.
2. From the client, connect to the server system and turn on binary transfer
   mode.
3. Put the large test file onto the server as /dev/null. This will transfer
   the bits for the file over the network, but not store them on the other
   side.
4. Use the transfer rate estimated by ftp as the bandwidth of that particular
   route.
The above method is simple and uses commonly available tools. However,
the interactive component makes it difficult to script for testing portions
of a large network, and the high overhead of the ftp protocol is likely
to underestimate the bandwidth available to the backup software. To estimate
the bandwidth, a more accurate and flexible method is to use network benchmarks
like NetPerf 2 . The NetPerf tools tend to be relatively simple to use,
and come with directions and sample scripts, but there is a slight learning
curve involved. If this is going to be a once-only exercise, the ftp method
may be preferable.
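If the ftp procedure needs to be repeated across many routes, it can be
scripted rather than run interactively. The sketch below uses Python's
standard ftplib module to upload a test file as /dev/null and time the
transfer. The function names, host, and credentials are hypothetical, and it
assumes the FTP server allows the login account to write to /dev/null, as
the manual procedure does.

```python
import time
from ftplib import FTP


def rate_mb_s(byte_count, seconds):
    """Convert bytes and elapsed seconds to MB/sec (1 MB = 1024 x 1024 bytes)."""
    return byte_count / (1024 * 1024) / seconds


def ftp_put_rate_mb_s(host, user, password, local_path, remote_path="/dev/null"):
    """Upload local_path in binary mode and return the observed MB/sec."""
    sent = 0

    def count(block):
        # Called by ftplib after each block has been sent.
        nonlocal sent
        sent += len(block)

    with FTP(host) as ftp, open(local_path, "rb") as source:
        ftp.login(user, password)
        start = time.perf_counter()
        # storbinary switches to binary (image) mode before transferring.
        ftp.storbinary(f"STOR {remote_path}", source,
                       blocksize=64 * 1024, callback=count)
        elapsed = time.perf_counter() - start
    return rate_mb_s(sent, elapsed)


if __name__ == "__main__":
    # Hypothetical host, account, and test file; replace with real values.
    print(ftp_put_rate_mb_s("backupserver.example.com", "tester", "secret",
                            "/tmp/testfile.tar"))
```

Like the manual procedure, this still carries the overhead of the ftp
protocol, so the result is best treated as a lower bound on the bandwidth
available to the backup software.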
Once the bandwidth across key routes is estimated, the planner needs
to compute the time necessary to transfer the data from the source to
the destination:
    Transfer time = Amount of data / Route bandwidth
One thing the planner needs to remember is that the units used to describe
network and storage bandwidth tend to be different. Network bandwidth
is usually listed as Mb/sec or Mbps, and refers to 1000 x 1000 bits per
second. In contrast, storage bandwidth is usually listed as MB/sec or
MB/s, and refers to 1024 x 1024 bytes per second. For example, storage
bandwidth of 1 MB/s is equivalent to network bandwidth of 8.39 Mbps.
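The unit difference is easy to get wrong, so a small conversion helper can be
useful. The sketch below uses the two definitions given above; the function
names are hypothetical, and the transfer-time helper assumes the simple
relation of data size divided by route bandwidth, with 1 GB taken as 1024 MB.

```python
BITS_PER_MBIT = 1000 * 1000      # network megabit
BYTES_PER_MBYTE = 1024 * 1024    # storage megabyte


def mbytes_per_sec_to_mbps(mb_per_sec):
    """Convert storage bandwidth (MB/sec) to network bandwidth (Mbps)."""
    return mb_per_sec * BYTES_PER_MBYTE * 8 / BITS_PER_MBIT


def mbps_to_mbytes_per_sec(mbps):
    """Convert network bandwidth (Mbps) to storage bandwidth (MB/sec)."""
    return mbps * BITS_PER_MBIT / 8 / BYTES_PER_MBYTE


def transfer_hours(data_gb, bandwidth_mb_per_sec):
    """Hours needed to move data_gb (1 GB = 1024 MB) at the given MB/sec.

    Assumption: transfer time is simply data size divided by bandwidth.
    """
    return data_gb * 1024 / bandwidth_mb_per_sec / 3600


if __name__ == "__main__":
    print(f"1 MB/sec = {mbytes_per_sec_to_mbps(1):.2f} Mbps")
    print(f"100 GB at 7.5 MB/sec takes {transfer_hours(100, 7.5):.1f} hours")
```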
Disk Sizing
Simpler bandwidth estimation for disks makes capacity planning generally
less troublesome than for networks, but modern tape subsystems can be
sensitive to slow disks. In general, the planner does not need to be concerned
with simple configurations like single spindles and striped JBOD arrays.
Disks should be configured and tuned for their primary purpose first,
and adjusted for backup second. If this planning exercise is an opportunity
to put together a new system from scratch, it might make sense for the
planner to plan for backup as well as for some other primary activity.
Principles
When planning disks for backups, a few simple principles apply. First,
reads tend to be faster than writes. This has to do with factors like
data integrity and prefetching. This is mostly a challenge during large-scale
restores, because the majority of disk accesses will then be writes.
When the disks are combined into logical volumes, RAID performance principles
also come into play. Stripes and mirrors tend to aggregate the performance
of their component disks. It is often sufficient to simply check that
an adequate number of spindles are configured. RAID-5 volumes are more
complex, but tend to have good read performance. If large volumes of data
need to be restored quickly, RAID-5 volumes need to be carefully configured
so that the RAID segment size matches the restore I/O size for optimal
performance. To make sure that the configuration is satisfactory, the planner
needs to test the restore performance of the RAID-5 volumes. Otherwise,
the planner could be unpleasantly surprised when trying to restore in an
emergency.
If long-term performance of the backup appears to be problematically slowed
by the disk subsystem, the planner can consider upgrading or reconfiguring
the storage.
Estimating Available Bandwidth
A number of methods are available to estimate the bandwidth of the disk
subsystem. The first is to analyze the existing or projected configuration,
using established values for each component. For each expected backup
stream, the planner must calculate the number of spindles (individual
disks) available to service that stream. If the spindles are striped together,
the planner should consider the available bandwidth to be 70% of the aggregate
disk bandwidth. If they are striped and mirrored, the planner should use 70%
for reads and 35% for writes. If the disks are configured together as
a RAID-5 volume, the planner can estimate the read bandwidth as:
where:
N is the number of data disks,
M the number of parity disks,
N plus M adds up to the total number of spindles in the volume.
For write performance, the planner should use half that value as the
estimate. This gives a rough estimate of the raw performance available
directly from the disks, assuming no bus or channel bandwidth limitations
have been reached. The planner can use Table 3 to estimate the spindle
rates and the overall abilities of the storage arrays in question.
Table 3: Estimated Rates for Various Disk Technologies

Disk Technology      Peak Read Throughput    Peak Write Throughput
4GB 5400rpm disk     5.6 MB/sec              2.8 MB/sec
4GB 7200rpm disk     9.3 MB/sec              4.2 MB/sec
9GB 7200rpm disk     8.7 MB/sec              4.1 MB/sec
9GB 10000rpm disk    11-16 MB/sec
18GB 7200rpm disk    14-21 MB/sec
SSA                  18 MB/sec               16 MB/sec
A1000                30 MB/sec               14 MB/sec
A3000                35 MB/sec               20 MB/sec
A5000                168 MB/sec              76 MB/sec
DASD (3390)          3.5-4.2 MB/sec
PC Clients           2-8 MB/sec
For any logical volume configuration, if filesystems are used with Direct
I/O on top of the logical volumes, the planner needs to reduce the value
by an additional 10% for reads and 15% for writes. If for some reason Direct
I/O cannot be used, the planner can divide the calculated raw value by
2 for reads, and 3 for writes.
Once the available disk bandwidth is calculated for all logical volumes,
the planner should consider how the volumes are laid out on top of the
multiple busses and I/O channels (e.g., SCSI, FC-AL, SBus). If the aggregate
volume bandwidth exceeds the bus bandwidth, they can assume all logical
volumes can share the bus equally, and divide the available bus bandwidth
among the competing volumes.
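The striping, mirroring, filesystem, and bus-sharing rules above can be
combined into a short calculation. This is a sketch under stated
assumptions: the function names are hypothetical, the aggregate is taken as
the per-spindle rate times the number of spindles, the spindle count for a
mirrored volume is whatever the planner decides to count, and RAID-5 is left
out entirely. Capping each volume at its own rate when the bus is shared is
an addition of this sketch, not a rule from the text.

```python
def raw_volume_bandwidth(spindle_read, spindle_write, spindles, layout):
    """(read, write) MB/sec for a striped or striped-and-mirrored volume."""
    agg_read = spindle_read * spindles
    agg_write = spindle_write * spindles
    if layout == "stripe":
        return 0.70 * agg_read, 0.70 * agg_write
    if layout == "stripe+mirror":
        return 0.70 * agg_read, 0.35 * agg_write
    raise ValueError(f"unsupported layout: {layout}")


def filesystem_bandwidth(read, write, direct_io=True):
    """Apply the filesystem adjustment on top of the raw volume estimate."""
    if direct_io:
        return read * 0.90, write * 0.85   # 10% and 15% reductions
    return read / 2, write / 3             # no Direct I/O


def share_bus(volume_rates, bus_rate):
    """Divide the bus equally among volumes if together they exceed it."""
    if sum(volume_rates) <= bus_rate:
        return list(volume_rates)
    share = bus_rate / len(volume_rates)
    # Sketch-only cap: a volume never gets more than it can deliver.
    return [min(rate, share) for rate in volume_rates]


if __name__ == "__main__":
    # Four 9GB 7200rpm disks from Table 3, striped, filesystem with Direct I/O.
    raw = raw_volume_bandwidth(8.7, 4.1, 4, "stripe")
    read, write = filesystem_bandwidth(*raw, direct_io=True)
    print(f"read {read:.1f} MB/sec, write {write:.1f} MB/sec")
```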
Another method the planner can use for estimating available bandwidth
is to measure it using simple tools. The easiest tool to come by is
dd(1), which can easily generate a sequential stream of accesses to either
a file or a raw device. To test potential backup performance, the planner
can create a large file on the source disk subsystem. On the host, they
can time a dd process reading from the large file and writing to /dev/null
using blocksizes similar to the backup software (a 64 KB block size is
a good guess). The planner can divide the file size by the time it took
to read all the contents, obtaining an approximation for the
disk bandwidth. If the raw device bandwidth is required, they can read
a certain number of blocks from the raw device, and compute the rate based
on the number of blocks transferred rather than the file size.
If restore performance is the goal, they can write to a file from /dev/zero
for filesystem performance. Writing to a raw device only works if there
is no valid data on that device. (The planner must take caution when writing
to a raw device using dd, because this will likely destroy any data on
that device.) These estimates are probably higher than the performance achieved
during backup, so the planner can use perhaps 80% of the measured value in planning.
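Where dd(1) is not convenient, the same read measurement can be approximated
with a short script. The following is a minimal sketch of the technique
(a sequential read in 64 KB blocks, timed, then derated to 80%); it is not a
replacement for dd, the function names are hypothetical, and the test file
path is an assumption. Results can be inflated if the file is already held
in the operating system's cache.

```python
import time


def measure_read_mb_s(path, block_size=64 * 1024):
    """Sequentially read a file in fixed-size blocks and return MB/sec."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as source:
        while True:
            chunk = source.read(block_size)
            if not chunk:          # end of file
                break
            total += len(chunk)    # data is discarded, as with /dev/null
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("file too small to time reliably")
    return total / (1024 * 1024) / elapsed


def planning_rate(measured_mb_s, derate=0.80):
    """Use perhaps 80% of the measured value for planning purposes."""
    return measured_mb_s * derate


if __name__ == "__main__":
    # Hypothetical test file on the disk subsystem being measured.
    measured = measure_read_mb_s("/data/testfile.tar")
    print(f"measured {measured:.1f} MB/sec, "
          f"plan for {planning_rate(measured):.1f} MB/sec")
```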
A more accurate method would be to use the actual programs used by the
backup software and direct their output to /dev/null. This measures the
exact data access load on the disk in isolation from other potential bottlenecks,
such as networks and tape drives. The exact invocation varies from package
to package and filesystem versus raw device. The software CLI documentation
should provide the necessary details to conduct this test, although this
method is most useful when troubleshooting or tuning an existing installation.
This is because it requires the software and data to already be in place.
Lastly, it is generally not a good idea to measure backup performance
using standard utilities like tar or ufsdump. These
programs are not especially tuned for high performance, and may bottleneck
somewhere other than the disk subsystem.
Tape Sizing
Tape sizing is concerned equally with adequate on-line capacity
and available bandwidth. Fortunately, both calculations tend to be simple.
The only complication tends to be potential back-hitching for linear tape
devices. This can be addressed by a combination of using fewer devices
and higher levels of multiplexing to the tapes.
Principles
Without other specific requirements, the planner can configure the on-line
capacity as approximately three to five times existing data and expected
near term growth. This allows for multiple copies of the data to reside
on-line, as is necessary when using full and multiple incremental backup
schedules. Tape bandwidth should be configured to match or be slightly
below the bandwidth of networks and disk. This tends to be easy to accommodate,
and reduces the chance of back-hitching.
When trying to back up an existing enterprise with multiple slow clients,
networks, or logical volumes, the planner should configure multiplexing
in the backup schedules. This allows each tape device to be fully utilized.
Each multiplexed stream uses a finite amount of resources (e.g., TCP ports,
buffers, CPU) on the server, so the total number of backup streams handled
by a server simultaneously should be kept below approximately 120 3 .
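These two guidelines translate directly into simple checks. The sketch below
is illustrative only: the function names are hypothetical, and treating the
total stream count as the number of tape devices times the multiplexing level
is an assumption about how the streams add up on one server.

```python
MAX_STREAMS_PER_SERVER = 120  # approximate upper bound from the text


def online_capacity_range_gb(existing_gb, near_term_growth_gb):
    """(low, high) on-line capacity: three to five times data plus growth."""
    base = existing_gb + near_term_growth_gb
    return 3 * base, 5 * base


def streams_within_guideline(tape_devices, multiplex_level):
    """True if simultaneous streams stay below the per-server guideline.

    Assumption: total streams = tape devices x multiplexing level.
    """
    return tape_devices * multiplex_level < MAX_STREAMS_PER_SERVER


if __name__ == "__main__":
    low, high = online_capacity_range_gb(400, 100)
    print(f"on-line capacity: {low} GB to {high} GB")
    print("4 drives at multiplex level 8 within guideline:",
          streams_within_guideline(4, 8))
```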
Estimating Available Bandwidth
The planner can use the advertised native rates to estimate bandwidth
available to tape devices. Table 4 lists capacities and rates for common
tape devices. Most devices include some level of hardware compression.
If compression is going to be used, the planner should take this into
account. As a guideline to plan for compression, the planner should use
a 1.4:1 compression ratio, because the 2:1 advertised ratio tends to be
overly optimistic.
It is also important for the planner to consider the SCSI bus bandwidth.
Generally, they should plan on a maximum of two or three tape devices
per SCSI bus, and should not mix tape and disk devices on the same bus.
Empirical tests of tape bandwidth can be easily accomplished using the
dd(1) and mt(1) commands, although access to library robotics from the
system requires additional software to drive the robot.
Table 4: Estimated Rates and Capacities for Various Tape Technologies

                    No.      Capacity (GB)      Throughput (MB/sec and GB/hr)
Device              Drives   1:1      1.4:1     1:1      1:1      1.4:1    1.4:1
                                                MB/sec   GB/hr    MB/sec   GB/hr
DDS-3               1        12       16.8      1        3.5      1.4      4.9
DDS-3 Autoloader    1        72       100.8     1        3.5      1.4      4.9
EXB-8900            1        20       28        3        10.5     4.2      14.8
DLT tape 7000       1        35       49        5        17.6     7        24.6
L280                1        280      392       5        17.6     7        24.6
L400                2        400      560       6        21.1     8.4      29.5
L1000               4        1000     1400      20       70.3     28       98.4
L1800               4        1800     2520      20       70.3     28       98.4
L3500               7        3500     4900      35       123      49       172.3
L11000              16       11000    15400     80       281.3    112      393.8
IBM 3490            1        0.2      N/A       3        10.55    N/A      N/A
IBM 3590E           1        0.4      0.56      6        21.1     8.4      29.5
STK Redwood         1        50       71.5      10.5     36.9     14.7     51.7
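Most of the derived columns in Table 4 follow from the native figures by
simple arithmetic: multiply by 1.4 for compression, and convert MB/sec to
GB/hr using 3600 seconds per hour and 1024 MB per GB. The sketch below shows
that arithmetic with hypothetical function names, so the same estimate can be
made for a device that is not listed.

```python
PLANNING_COMPRESSION_RATIO = 1.4  # guideline ratio, instead of the advertised 2:1


def with_compression(native_value, ratio=PLANNING_COMPRESSION_RATIO):
    """Scale a native capacity or rate by the planning compression ratio."""
    return native_value * ratio


def gb_per_hour(mb_per_sec):
    """Convert MB/sec to GB/hr (3600 seconds per hour, 1024 MB per GB)."""
    return mb_per_sec * 3600 / 1024


if __name__ == "__main__":
    native_rate = 5       # MB/sec, DLT tape 7000 row
    native_capacity = 35  # GB
    print(f"capacity with compression: {with_compression(native_capacity):.0f} GB")
    print(f"native: {gb_per_hour(native_rate):.1f} GB/hr")
    print(f"compressed: {gb_per_hour(with_compression(native_rate)):.1f} GB/hr")
```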
Memory Sizing
General server memory sizing guidelines should be adequate except in cases
of large-scale local backup of many filesystems in parallel. In those
cases, all attempts should be made to use Direct I/O to eliminate buffering
backup data in the virtual memory cache. If Direct I/O cannot be used,
the kernel memory reclamation rates should be adjusted to be more aggressive
and free up buffers faster than they are used by the backup processes.
When adjusting other system parameters, such as the number of filenames and
inodes cached in system memory, the planner should also consider memory size.
Another consideration is the amount of shared memory configured for inter-process
communication between backup processes.
System Sizing
Backup performance may be impacted by various aspects of system sizing
configuration. Sun's Ultra Enterprise architecture can perform very well
for backup and restore, as demonstrated in the terabyte-per-hour benchmark,
and a number of other studies. When planning to configure an existing
server for local backup, most choices are already made. The remaining
decisions consist of adding any additional I/O boards to accommodate the
tape hardware, memory for larger buffers, and CPUs if the system is already
close to full utilization. The number of additional I/O boards needed
depends on the number of devices that need to be configured. This follows
directly from capacity of the boards and host bus adapters.
To simplify the configuration of CPU capacity, the planner can estimate
how many CPU cycles are needed to move data at a certain rate. Simple
experiments have shown that a useful, conservative estimate is 5 MHz of
UltraSPARC CPU capacity per 1 MB/second of data that needs to be moved.
This means that for every MB/second of data moved (whether over the network,
from disk, or to tape), the system should have 5 MHz of processing power
available for the transfer. For example, a system that needed to back
up a number of clients over the network to local tape at a rate of 10
MB/second would need 100 MHz of available CPU power. This includes 50
MHz to move data from the network to the server, and another 50 MHz to
move data from the server to the tapes. This would keep a 300 MHz UltraSPARC
processor at 33% utilization. As another example, a system that needed
to back up a database residing on local disks to local tape device at
a rate of 35 MB/second would need 350 MHz of available CPU power. The
actual software overhead is small, and is included in the 5 MHz per MB/second
number.
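The two worked examples above follow the same pattern: 5 MHz per MB/second
for each time the data is moved. A minimal sketch of that calculation is
shown below. The function names are hypothetical, and counting two data
moves per backup (in to the server, then out to tape) is taken from the
examples rather than stated as a general rule.

```python
MHZ_PER_MB_PER_SEC = 5  # conservative UltraSPARC estimate from the text


def cpu_mhz_needed(rate_mb_per_sec, data_moves=2):
    """CPU capacity in MHz needed to move data at the given rate.

    data_moves counts each hop: for example, network to server plus server
    to tape, or local disk to server plus server to tape.
    """
    return MHZ_PER_MB_PER_SEC * rate_mb_per_sec * data_moves


def utilization(mhz_needed, cpu_mhz, cpus=1):
    """Fraction of the available CPU capacity consumed by the transfer."""
    return mhz_needed / (cpu_mhz * cpus)


if __name__ == "__main__":
    needed = cpu_mhz_needed(10)
    print(f"10 MB/sec needs {needed} MHz, "
          f"{utilization(needed, 300):.0%} of one 300 MHz CPU")
    print(f"35 MB/sec needs {cpu_mhz_needed(35)} MHz")
```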
Conclusion
Backup and recovery are essential processes because of the large volumes
of data retained in today's Datacenters. Thus the role of the capacity
planner is critical for designing the optimal backup architecture for
the Datacenter and system requirements.
Capacity planning is not a straightforward procedure; it requires knowing
how to efficiently use the network infrastructure, and understanding network
performance and bandwidth issues. The planner needs to configure the network
for optimal backup and recovery performance. And because networks are
traditional bottlenecks for backup applications, the capacity planner
often needs to choose the preferred bottleneck.
Although the process is complex, the planner can follow a series of guidelines
and use a number of available tools and methods to obtain the information
necessary for making good decisions. The planner first needs to assess the environment
to be backed up. This includes obtaining the following information: (1)
the data type, (2) the file structure, (3) the data origin, (4) the data
destination, and (5) the data's path. The planner also needs to know whether
data and backup servers are distributed on the network, and if so, how
they are distributed.
Knowing the backup requirements of the enterprise is also essential.
This includes determining the time period available for backups, the needs
for restoring the data, and ways to limit the impact of the process on
day-to-day operations.
Finally, the planner needs to maintain realistic expectations. This means
accounting for data overhead caused by additional metadata, understanding
the ease of use in the backup process and assessing training requirements,
and understanding the recovery performance.
Glossary
ATM
Asynchronous transfer mode. A standard for switching and routing all types
of digital information, including video, voice, and data. With ATM, digital
information is broken up into standard-sized packets, each with the "address"
of its final destination.
atomicity
Refers to an operation that is never interrupted or left in an incomplete
state under any circumstance.
backup
A copy on a diskette, tape, or disk of some or all of the files from a
hard disk. There are two types of backups: a full backup and an incremental
backup. Synonymous with "dump."
bus
(1) A circuit over which data or power is transmitted, one that often
acts as a common connection among a number of locations. (2) A set of
parallel communication lines that connect the major components of a computer
system, including CPU, memory, and device controllers.
cache
A buffer of high-speed memory filled at medium speed from main memory,
often with instructions. A cache increases effective memory transfer rates
and processor speed.
data base management system (DBMS)
A software system facilitating the creation and maintenance of a data
base and the execution of programs using the data base.
Ethernet
A type of local area network that enables real-time communication between
machines connected directly together through cables. Ethernet was developed
by Xerox in 1976, originally for linking minicomputers at the Palo Alto
Research Center. A widely implemented network from which the IEEE 802.3
standard for contention networks was developed, Ethernet uses a bus topology
(configuration) and relies on the form of access known as CSMA/CD to regulate
traffic on the main communication line. Network nodes are connected |