Data Deduplication

Data deduplication is a feature of disk-to-disk backup. According to the definition by the Storage Networking Industry Association (SNIA), data deduplication is the replacement of multiple copies of data with references to a shared copy in order to save storage space and bandwidth. It is accomplished by examining a data set or I/O stream at the sub-file level, and storing and sending only unique data.

In other words, data deduplication eliminates redundant data from your data storage at the block (sub-file) level. For example, files with different names might still contain duplicate data at the block level. That data is identified and replaced by pointers to a single stored copy.

Block-level data deduplication looks at a sequence of data and divides it into variable-length blocks. It then identifies repeating blocks and replaces them with pointers. Pointers are much smaller than the original data, so data deduplication saves a considerable amount of disk space.
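The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the boundary rule in split_blocks, the DedupStore class, and all parameter values are hypothetical, and SHA-256 digests play the role of the pointers.

```python
import hashlib


def split_blocks(data, mask=0x3F, min_size=16, max_size=256):
    """Split data into variable-length blocks.

    A block ends where the low bits of a simple running hash are zero
    (a toy content-defined boundary rule), or when max_size is reached.
    """
    blocks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == 0) or size >= max_size:
            blocks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        blocks.append(data[start:])  # trailing partial block
    return blocks


class DedupStore:
    """Keeps each unique block once; a file becomes a list of pointers."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes

    def write(self, data):
        recipe = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if new
            recipe.append(digest)                  # pointer to the block
        return recipe

    def read(self, recipe):
        return b"".join(self.blocks[digest] for digest in recipe)

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = DedupStore()
report = bytes(range(256)) * 8
recipe_a = store.write(report)            # first file
size_after_first = store.stored_bytes()
recipe_b = store.write(report)            # same content under another name
print(store.stored_bytes() == size_after_first)  # no new blocks were stored
```

Writing the same content a second time adds only a new list of pointers; the block store itself does not grow.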

As an example, consider a disk staging area used for short-term storage of disk-to-disk backups. The backups contain many repeating blocks, and after data deduplication you might see a 50-fold reduction in the amount of disk space needed. From the disk staging area, the backups might be replicated to another location for disaster recovery (DR) purposes. During replication, data deduplication identifies the blocks that were already sent to the target disk and sends only pointers in their place, significantly reducing the bandwidth used.
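The replication step can be illustrated with a short sketch. It assumes the backups are already split into blocks and that the target keeps an index of block digests; the replicate function, the 4096-byte blocks, and the backup contents are hypothetical.

```python
import hashlib


def replicate(blocks, target_index):
    """Replicate blocks to a target and return the number of bytes sent.

    target_index maps digest -> block for everything the target holds.
    A block the target already has is sent as its digest (pointer) only.
    """
    bytes_sent = 0
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest in target_index:
            bytes_sent += len(digest)               # pointer only
        else:
            target_index[digest] = block
            bytes_sent += len(digest) + len(block)  # pointer plus data
    return bytes_sent


target_index = {}
monday = [b"A" * 4096, b"B" * 4096, b"C" * 4096]
tuesday = [b"A" * 4096, b"B" * 4096, b"D" * 4096]  # one block changed

sent_monday = replicate(monday, target_index)    # 3 * (32 + 4096) = 12384
sent_tuesday = replicate(tuesday, target_index)  # 32 + 32 + (32 + 4096) = 4192
print(sent_monday, sent_tuesday)
```

In this made-up case the second backup transfers about a third of the bytes of the first, because two of its three blocks are already on the target.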

Depending on your company policies, you might need a disk staging area capable of keeping a large number of backups for a longer period of time. These backups, which are of significant size, will have to be replicated.

By considerably reducing the size of the data being replicated, data deduplication improves network performance. It also saves considerably on disk storage costs, because it reduces the storage needed both in the disk staging area and on the replication target disk.

In essence, the benefits of data deduplication are:

  • - Cost-effective use of disk systems: much more data can be stored on disk than with conventional disk systems.
  • - Much better network performance when replicating backups from disk staging to a DR site.
  • - For long-term storage, data can be pushed to tape, either from the disk staging area or from the replication target area. In both cases, data deduplication will considerably reduce the number of tapes needed.
  • - Tape can also be written from a deduplication data store, where data is usually kept for a few weeks or months for immediate recovery. After that, the data is sent to tape for archiving purposes.

During the backup process, deduplication can be done using different methods:

  • - Target data deduplication: Done after the backup software has collected the data, but before the data is written to the disk staging area. In this case, an inline appliance installed between the backup server and the target deduplicates the backup stream. More data is sent over the network by your backup software than with source deduplication.
  • - Source data deduplication: Done on the source host during the backup. Less data is sent over the connection to the staging disk (less bandwidth is consumed), but the method could slow down the backup itself and other applications running on that host.
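The trade-off between the two methods can be made concrete with a small model. This is a simplified sketch under stated assumptions: the backup function, the method names "source" and "target", and the block sizes are hypothetical, and the hashing cost that source deduplication places on the host is not modeled.

```python
import hashlib

DIGEST_SIZE = 32  # length of a SHA-256 digest in bytes


def backup(blocks, method):
    """Return (bytes sent over the network, bytes written to staging disk)."""
    seen = set()
    sent = written = 0
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        duplicate = digest in seen
        seen.add(digest)
        deduplicated = DIGEST_SIZE if duplicate else DIGEST_SIZE + len(block)
        if method == "source":
            # The host deduplicates before sending anything.
            sent += deduplicated
        elif method == "target":
            # The host sends every block; an inline appliance deduplicates
            # the stream before it is written to disk.
            sent += len(block)
        else:
            raise ValueError("method must be 'source' or 'target'")
        written += deduplicated
    return sent, written


blocks = [b"x" * 1024] * 9 + [b"y" * 1024]
print(backup(blocks, "source"))  # (2368, 2368)
print(backup(blocks, "target"))  # (10240, 2368)
```

Both methods write the same amount to the staging disk in this model; they differ only in how much data crosses the network and in where the deduplication work is done.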

Limitations of Data Deduplication

Not all environments benefit from data deduplication. When data changes very frequently, there is likely to be little duplicate data. Encrypted data (because it is mostly unique) and images (because they are typically already compressed and cannot be reduced further) are not well suited for data deduplication. Careful consideration is also needed before deciding to deduplicate at the source, because the overhead might slow down applications.
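A quick experiment shows why such data deduplicates poorly. The sketch below uses fixed-size blocks for brevity rather than variable-length ones, and random bytes from os.urandom as a stand-in for encrypted data; both choices, and the dedup_ratio helper itself, are assumptions made for illustration.

```python
import hashlib
import os


def dedup_ratio(data, block_size=4096):
    """Return total blocks divided by unique blocks (1.0 means no savings)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not blocks:
        return 1.0
    unique = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique)


repetitive = b"\x00" * (16 * 4096)     # 16 identical blocks
random_like = os.urandom(16 * 4096)    # stand-in for encrypted data

print(dedup_ratio(repetitive))   # 16 blocks collapse into 1
print(dedup_ratio(random_like))  # every block is almost certainly unique
```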

Data Deduplication Devices and Software

  • - Virtual Tape Library: a disk-based device that emulates a tape backup unit and uses Fibre Channel interfaces.
  • - Appliances with a NAS interface, using Ethernet.
  • - Software data deduplication, done by the backup software.