
Researchers Submit Patent Application, "Dynamic Caching Technique for Adaptively Controlling Data Block Copies in a Distributed Data Processing...

June 24, 2014



Researchers Submit Patent Application, "Dynamic Caching Technique for Adaptively Controlling Data Block Copies in a Distributed Data Processing System", for Approval

By a News Reporter-Staff News Editor at Information Technology Newsweekly -- From Washington, D.C., VerticalNews journalists report that a patent application by the inventors Subbiah, Sethuraman (Santa Clara, CA); Soundararajan, Gokul (Sunnyvale, CA); Shastri, Tanya (Sunnyvale, CA); Bairavasundaram, Lakshmi Narayanan (San Jose, CA), filed on November 30, 2012, was made available online on June 12, 2014.

The patent's assignee is Netapp, Inc.

News editors obtained the following quote from the background information supplied by the inventors: "The present disclosure relates to data processing systems and, more specifically, to caching of data in a distributed data processing system.

"In many current analytics frameworks, distributed data processing systems may be used to process and analyze large datasets, such as files. An example of such a framework is Hadoop, which provides data storage services using a distributed file system and data processing services though a cluster of commodity servers. The Hadoop based distributed system partitions the datasets into blocks of data for distribution and storage among local storage devices coupled to the servers to enable processing of the data by the servers in accordance with one or more data analytics processes. MapReduce is an example of a computational model or paradigm employed by Apache Hadoop to perform distributed data analytics processes on large datasets using the servers.

"Broadly stated, a MapReduce process is organized into a Map step and a Reduce step. In the Map step, an analytics request or 'job' is apportioned into a plurality of sub-jobs or 'tasks' that are distributed to the servers. Each server performs its tasks independently on its stored data blocks and produces intermediate results. The servers then execute the Reduce step to combine all of the intermediate results into an overall result. Apache Hadoop is a specific example of a software framework designed for performing distributed data analytics on large datasets.

"When deployed in an enterprise environment, however, such distributed systems typically suffer from problems including reliance on a single storage tier (i.e., the local storage device tier) for both performance and reliability, as well as lack of data management features. To address these problems, the system may be enhanced through the addition of a storage system and a caching layer distributed among the servers that increases the number of storage tiers, e.g., a shared storage tier and a distributed cache tier. Yet, the enhanced distributed system may be subjected to congestion conditions, such as local and remote cache bottlenecks at the servers, data popularity at the servers, and shared storage bottleneck at the storage system, that may adversely affect throughput and performance.

"According to the distributed data analytics process, a block of data may reside on a local storage device of a server, as well as on the shared storage system. Different tasks pertaining to multiple jobs that require that block of data may be scheduled on the server. If all the tasks requests the data block, the local storage device may become a local bottleneck, which adversely impacts throughput of the device and server. Each server may also be assigned a limited number of 'slots' or tasks that may be run in parallel. If the slots are occupied by existing tasks, new tasks may be scheduled in a different server, resulting in traffic forwarded from remote servers and creating a remote bottleneck at the different server.

"In addition, a failure may occur to a server of the cluster, requiring that the server's block of data be accessed from the shared storage system, e.g., during reconstruction. If multiple servers of the cluster experience failures, there may be an increase in traffic to the shared storage system to access multiple blocks. The resulting increase in traffic may effectively reduce the size of the cluster supported by the shared storage system and create a shared storage bottleneck. Moreover, there may be one or more blocks residing on the local storage device of a server that are popular in the sense that multiple requests from other servers are directed to those blocks. The increased traffic at the server due to popularity of these data blocks may degrade performance of the server and its local storage device."

As a supplement to the background information on this patent application, VerticalNews correspondents also obtained the inventors' summary information for this patent application: "Embodiments described herein provide a dynamic caching technique that adaptively controls a number of copies of data blocks stored within caches ('cached copies') of a caching layer distributed among servers of a distributed data processing system. A cache coordinator of the distributed system illustratively implements the dynamic caching technique to increase (i.e., replicate) the number of cached copies of the data blocks to thereby alleviate congestion in the system and improve processing performance of the servers. Alternatively, the technique may decrease (i.e., consolidate) the number of cached copies to reduce storage capacity and improve storage efficiency of the servers. In particular, the technique may increase the number of cached copies when it detects local and/or remote cache bottleneck conditions at the servers, a data popularity condition at the servers, or a shared storage bottleneck condition at the storage system. Otherwise, the technique may decrease the number of cached copies at the servers.
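As a rough illustration of the adaptive control described in the summary, the per-block decision might look like the following sketch; the statistics fields, thresholds, and replica limits are invented for the example and are not the inventors' implementation:

    def adjust_copies(stats, copies, min_replicas=1, max_replicas=4):
        # Return the new number of cached copies for one data block.
        congested = (stats["local_cache_busy"] or stats["remote_cache_busy"]
                     or stats["popular"] or stats["shared_storage_busy"])
        if congested and copies < max_replicas:
            return copies + 1   # replicate: relieve the detected bottleneck
        if not congested and copies > min_replicas:
            return copies - 1   # consolidate: reclaim cache capacity
        return copies

    stats = {"local_cache_busy": True, "remote_cache_busy": False,
             "popular": False, "shared_storage_busy": False}
    print(adjust_copies(stats, copies=1))  # 2, because a bottleneck was detected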

"In one or more embodiments, the cache coordinator may cooperate with a statistics manager of the distributed system to maintain statistics pertaining to the data blocks stored on the servers of the distributed system in order to render decisions regarding adaptive cache replication/consolidation. The cache coordinator may then utilize the statistics to implement the dynamic caching technique to adaptively control the number of cached copies of a data block in the distributed system. To that end, the technique may include a replication phase and a consolidation phase. The replication phase is directed to identifying one or more servers, as well as one or more data blocks, that contribute to congestion in the system. Illustratively, the server (i.e., a source server) is designated as congested when the number of data block requests assigned to the server exceeds the total number of data block requests that can be processed, in parallel, by the server. In that case, the technique identifies and selects another server (i.e., a target server) that is not congested and that can accommodate replication of the data block, as well as data block requests directed to that data block from the congested server. The data block is then replicated (copied) to the target server and the data block requests are redirected to the copied data block. In contrast, the consolidation phase is directed to identifying copies of a data block that exceed a minimum number of replicas and then consolidating the copies of the data block in the system. Illustratively, consolidation is achieved by removing a copy of the data block from a source server and redirecting data block requests directed to the removed block at the source server to a target server that stores the data block and that can accommodate the redirected requests.

"Advantageously, the dynamic caching technique adaptively controls the cached copies of data blocks stored within caches of the caching layer to optimize distributed analytics running on the shared storage infrastructure of the distributed system. That is, the dynamic caching technique may increase or decrease the number of cached copies of data blocks to allow users greater flexibility and address problems that customers may encounter in an enterprise environment, such as bottlenecks, failures, and system reconfigurations. The dynamic technique also allows users to balance between performance and storage efficiency.

BRIEF DESCRIPTION OF THE DRAWINGS

"is The above and further advantages of the embodiments herein may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identically or functionally similar elements, of which:

"FIG. 1 is a block diagram of a distributed data processing system;

"FIG. 2 is a block diagram of a storage system of the distributed data processing system;

"FIG. 3 is a block diagram of a server of the distributed data processing system;

"FIG. 4 is is a block diagram of a statistics manager of the distributed data processing system;

"FIG. 5 is a flowchart illustrating a replication phase of a dynamic caching technique;

"FIG. 6 is a flowchart illustrating a find_target_server routine of the dynamic caching technique;

"FIG. 7 is a block diagram of an example distributed data processing system illustrating a local cache bottleneck condition;

"FIG. 8 is a block diagram of an example distributed data processing system illustrating reduction of the local cache bottleneck condition in accordance with the dynamic caching technique;

"FIG. 9 is a block diagram of an example distributed data processing system illustrating a remote cache bottleneck condition;

"FIG. 10 is a block diagram of an example distributed data processing system illustrating reduction of the remote cache bottleneck condition in accordance with the dynamic caching technique;

"FIG. 11 is a block diagram of an example distributed data processing system illustrating a bottleneck caused by a data popularity condition;

"FIG. 12 is a block diagram of an example distributed data processing system illustrating reduction of the bottleneck caused by the data popularity condition in accordance with the dynamic caching technique

"FIG. 13 is a flowchart illustrating a consolidation phase of the dynamic caching technique;

"FIG. 14 is a flowchart illustrating a consolidate_block routine of the dynamic caching technique;

"FIG. 15 is a block diagram of an example distributed data processing system prior to implementation of the consolidation phase of the dynamic caching technique; and

"FIG. 16 is a block diagram of an example distributed data processing system after implementation of the consolidation phase of the dynamic caching technique."

For additional information on this patent application, see: Subbiah, Sethuraman; Soundararajan, Gokul; Shastri, Tanya; Bairavasundaram, Lakshmi Narayanan. Dynamic Caching Technique for Adaptively Controlling Data Block Copies in a Distributed Data Processing System. Filed November 30, 2012 and posted June 12, 2014. Patent URL: http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=695&p=14&f=G&l=50&d=PG01&S1=20140605.PD.&OS=PD/20140605&RS=PD/20140605

Keywords for this news article include: Netapp Inc, Information Technology, Information and Data Analytics, Information and Data Processing.

Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC





Source: Information Technology Newsweekly







