
Patent Issued for Techniques for Managing Deduplication Based on Recently Written Extents

August 19, 2014



By a News Reporter-Staff News Editor at Information Technology Newsweekly -- According to news reporting originating from Alexandria, Virginia, by VerticalNews journalists, a patent by the inventors Xiangping Chen (Shrewsbury, MA) and Philippe Armangau (Acton, MA), filed on June 28, 2012, was published online on August 5, 2014.

The assignee for this patent, patent number 8799601, is EMC Corporation (Hopkinton, MA).

Reporters obtained the following quote from the background information supplied by the inventors: "Block deduplication is the process of (i) finding block mappings that map to separate instances of identical data, and (ii) updating those block mappings to refer to a single instance of that data. Using block deduplication, data storage systems are able to eliminate storage of redundant copies of host data.

"One conventional approach to performing block deduplication in a data storage system involves closely evaluating each block of host data stored by the data storage system for possible deduplication. In particular, the data storage system applies a hash algorithm to each block of host data stored by the data storage system. After the data storage system computes a hash result from a particular block of host data, the data storage system compares that hash result to a database of stored hash results previously computed from other blocks of host data. If the data storage system finds a matching hash result in the database, the data storage system performs a bit-by-bit comparison to determine whether the blocks of host data are identical. If so, the data storage system shares a single instance of the block of host data among block mappings. Otherwise, the data storage system adds a new record to the database, i.e., the data storage system adds the hash result computed from the particular block of host data to the database for possible matching in the future.

"When a host modifies a block of host data that has been deduplicated, the data storage system splits that shared block of host data into separate instances. Along these lines, suppose that a data storage system maintains a first block mapping and a second block mapping to a single instance of host data. Further suppose that a host issues an IO command to modify the block of host data as referenced by the second block mapping, while the first block mapping is intended to continue to reference the original block of host data. The data storage system responds by maintaining the original instance of the host data on behalf of the first block mapping, and creating a new instance which includes the modification on behalf of the second block mapping."
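The conventional flow described in the background can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the inventors' or EMC's implementation: the class name BlockStore, its methods, the in-memory dictionaries, and the use of SHA-256 as the hash algorithm are all hypothetical, and a byte-string equality check stands in for the bit-by-bit comparison.

```python
import hashlib


class BlockStore:
    """Toy block store: logical addresses map to shared physical instances."""

    def __init__(self):
        self.instances = {}   # physical id -> block contents
        self.mapping = {}     # logical address -> physical id (the block mappings)
        self.hash_db = {}     # hash result -> physical id (database of hash results)
        self._next_id = 0

    def _new_instance(self, data):
        pid = self._next_id
        self._next_id += 1
        self.instances[pid] = data
        return pid

    def _release(self, pid):
        # Hypothetical helper: free an instance once no mapping refers to it.
        if pid is not None and pid not in self.mapping.values():
            del self.instances[pid]

    def write(self, addr, data):
        """Host write. The new data always lands in a fresh instance, so a
        shared instance is split rather than modified in place."""
        old = self.mapping.get(addr)
        self.mapping[addr] = self._new_instance(data)
        self._release(old)

    def deduplicate(self, addr):
        """Evaluate one block: hash it, look up the hash, compare, then
        either share the existing instance or record the new hash."""
        pid = self.mapping[addr]
        data = self.instances[pid]
        digest = hashlib.sha256(data).digest()
        match = self.hash_db.get(digest)
        # A record may point at an instance that no longer exists (a stale
        # record), in which case .get() returns None and the compare fails.
        if match is not None and match != pid and self.instances.get(match) == data:
            self.mapping[addr] = match   # share the single instance
            self._release(pid)
            return True
        self.hash_db[digest] = pid       # keep for possible matching in the future
        return False
```

In this sketch, writing identical data to two addresses and deduplicating both leaves one instance behind; a later write to either address splits the mapping again, and the old record stays in hash_db until something overwrites or removes it.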

In addition to obtaining background information on this patent, VerticalNews editors also obtained the inventors' summary information for this patent: "Unfortunately, there are deficiencies to the above-described conventional approach to performing block deduplication which involves methodically evaluating each block of host data stored by the data storage system for possible deduplication. For example, a host may frequently overwrite certain blocks with new data. In such a situation, deduplication of frequently overwritten blocks may result in cycles of hash result computation, instance sharing and instance splitting, i.e., inefficient use of deduplication processing.

"Additionally, as these blocks get overwritten and reconsidered for deduplication, the data storage system tends to add new records to the database of previously computed hash results. That is, the data storage system adds new records which refer to the same physical block location thus filling the database of previously computed hash results with stale records. Accordingly, the database becomes unnecessarily large in size thus consuming excess memory as well as increasing the amount of time needed to complete database searches.

"Furthermore, even if deduplication is performed in the background on the data storage system (i.e., during idle system time), consumption of resources for deduplication iterations of frequently overwritten blocks takes away resources that otherwise could be devoted to other services. For example, another background process which is configured to remove stale records from the database of previously computed hash results may be prevented from running as often.

"In contrast to the above-described conventional approach to performing block deduplication which may inefficiently deduplicate frequently overwritten blocks, an improved technique is directed to managing deduplication based on recently written extents. Such operation enables a data storage system to avoid evaluation of frequently overwritten extents and thus save processing and memory resources involved in hash computation, comprehensive block comparisons, and so on. Additionally, such operation provides for a smaller extent sharing index table used for deduplication since the technique is able to eliminate adding table entries corresponding to recently written extents. Furthermore, such operation enables quick cleanup of the extent sharing index table by simply deleting any table entries corresponding to recently written extents. Accordingly, the technique enjoys lower memory consumption by the extent sharing index table as well as quicker table searching.

"One embodiment is directed to a method of managing deduplication of extents which is performed in a data storage apparatus having processing circuitry and memory which stores the extents (e.g., blocks). The method includes constructing, by the processing circuitry, a recently written extent list which identifies recently written extents stored within the memory. The method further includes referencing the recently written extent list to bypass (or skip over) extents identified by the recently written extent list when obtaining a candidate extent for possible deduplication. The method further includes processing the candidate extent for possible deduplication. Here, by identifying frequently overwritten extents on the recently written extent list, the data storage apparatus is able to easily avoid deduplicating and then splitting frequently overwritten extents.

"In some arrangements, the data storage apparatus maintains an extent sharing index table having entries which (i) have existing hash values and (ii) identify extents. In these arrangements, processing the candidate extent for possible deduplication includes digesting the candidate extent to produce a current hash value, and searching the extent sharing index table for an existing entry having an existing hash value which matches the current hash value. Processing the candidate extent for possible deduplication further includes, when an existing entry in the extent sharing index table is found to have an existing hash value which matches the current hash value: (i) searching the recently written extent list to confirm that an existing extent, which is identified by the existing entry, is not identified by the recently written extent list, (ii) when the existing extent is not identified by the recently written extent list, performing a comprehensive compare operation to determine whether to deduplicate the candidate extent with the existing extent, and (iii) when the existing extent is identified by the recently written extent list, adding a new entry to the extent sharing index table, the new entry having the current hash value and identifying the candidate extent. Additionally, processing the candidate extent for possible deduplication further includes, when no existing entry in the extent sharing index table is found to have an existing hash value which matches the current hash value, adding a new entry to the extent sharing index table, the new entry having the current hash value and identifying the candidate extent. Here, if it is discovered that one of the extents has been recently modified, the data storage apparatus is able to avoid performing a comprehensive compare operation between the candidate extent and the existing extent (normally performed to determine whether the extents are identical) and thus preventing deduplication iterations of frequently written extents and conserving resources.

"In some arrangements, the method further includes performing a cleanup operation to reduce the size of the extent sharing index table based on the recently written extent list. In particular, the data storage apparatus deletes, from the extent sharing index table, existing entries which identify extents identified by the recently written extent list. Such operation reduces the amount of memory consumed by the extent sharing index table and improves table searching efficiency.

"Other embodiments are directed to systems, apparatus, processing circuits, computer program products, and so on. Some embodiments are directed to various methods, devices, electronic components and circuitry which are involved in managing deduplication of extents based on recently written extents."
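The three pieces named in the summary (the recently written extent list, candidate processing against the extent sharing index table, and the cleanup operation) can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the class Deduplicator, its method names, the use of SHA-256 for digesting, and the choice of a dictionary with one entry per hash value, so that "adding a new entry" replaces an older entry with the same hash. The quoted text does not say how extents leave the recently written extent list or what happens when the comprehensive compare fails, so the sketch leaves both open.

```python
import hashlib


def digest(data):
    # Assumed hash function; the summary only says the extent is "digested".
    return hashlib.sha256(data).digest()


class Deduplicator:
    def __init__(self, extents):
        self.extents = extents          # extent id -> contents (stands in for memory)
        self.recently_written = set()   # the recently written extent list
        self.index = {}                 # extent sharing index table: hash -> extent id
        self.shared = {}                # deduplicated extent id -> extent it shares

    def note_write(self, extent_id, data):
        """Record a host write and put the extent on the recently written list."""
        self.extents[extent_id] = data
        self.recently_written.add(extent_id)

    def candidates(self):
        """Yield candidate extents, bypassing those on the recently written list."""
        for extent_id in list(self.extents):
            if extent_id in self.recently_written or extent_id in self.shared:
                continue
            yield extent_id

    def process(self, candidate):
        """Process one candidate extent for possible deduplication."""
        current = digest(self.extents[candidate])
        existing = self.index.get(current)
        usable = (existing is not None
                  and existing != candidate
                  and existing not in self.recently_written)
        if usable:
            # Comprehensive compare, only when the existing extent is not
            # on the recently written list.
            if self.extents[existing] == self.extents[candidate]:
                self.shared[candidate] = existing
                return "deduplicated"
            return "kept"   # mismatch handling is not described in the source
        # No matching entry, or the matching extent was recently written:
        # add an entry with the current hash identifying the candidate.
        self.index[current] = candidate
        return "indexed"

    def cleanup(self):
        """Delete index entries that identify recently written extents."""
        stale = [h for h, e in self.index.items() if e in self.recently_written]
        for h in stale:
            del self.index[h]
        return len(stale)
```

A set is used for the recently written extent list because the sketch only needs membership tests; a real system would presumably also need a policy for aging extents off the list, which the excerpt does not cover.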

For more information, see this patent: Chen, Xiangping; Armangau, Philippe. Techniques for Managing Deduplication Based on Recently Written Extents. U.S. Patent Number 8799601, filed June 28, 2012, and published online on August 5, 2014. Patent URL: http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=8799601.PN.&OS=PN/8799601RS=PN/8799601

Keywords for this news article include: EMC Corporation, Information Technology, Information and Data Storage.

Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC





Source: Information Technology Newsweekly

