News Column

Patent Application Titled "Cache Swizzle with Inline Transposition" Published Online

July 1, 2014

By a News Reporter-Staff News Editor at Information Technology Newsweekly -- According to news reporting originating from Washington, D.C., by VerticalNews journalists, a patent application by the inventors Kuesel, Jamie R. (Rochester, MN); Kupferschmidt, Mark G. (Bothell, WA); Schardt, Paul E. (Rochester, MN); Shearer, Robert A. (Woodinville, WA), filed on December 12, 2012, was made available online on June 19, 2014.

The assignee for this patent application is International Business Machines Corporation.

Reporters obtained the following quote from the background information supplied by the inventors: "As semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.

"One area where parallelism continues to be exploited is in the area of execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has exited the pipeline and calculated its result. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of an earlier instruction is said to introduce 'bubbles,' or cycles where no productive operations are being performed, into the pipeline.

"One technique that may be used to extract higher utilization from a pipelined execution unit and remove unused bubbles is to introduce multithreading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives the utilization and hence the aggregate throughput up. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as 'vectorizing' the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A vector execution unit typically includes multiple processing lanes that handle different datapoints in a vector and perform similar operations on all of the datapoints at the same time. For example, for an architecture that relies on quad(4)word vectors, a vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.
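The four-lane quadword model described above can be sketched in a few lines of Python. This is a toy illustration of the lockstep-lanes idea, not the patented hardware; the function and variable names are invented for the example.

```python
# Toy model of a 4-lane SIMD execution unit: one instruction causes
# the same operation to be applied to all four words of each vector.
LANES = 4

def simd_op(op, a, b):
    """Apply a two-operand operation across all lanes in lockstep."""
    assert len(a) == LANES and len(b) == LANES
    return [op(x, y) for x, y in zip(a, b)]

va = [1.0, 2.0, 3.0, 4.0]
vb = [10.0, 20.0, 30.0, 40.0]
vsum = simd_op(lambda x, y: x + y, va, vb)  # one "instruction", four results
```

The point of the model is that a single issued operation produces one result per lane, which is what makes keeping the lanes supplied with correctly formatted data so important.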

"The aforementioned techniques may also be combined, resulting in a multithreaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a vector execution unit to process 'vectors' of data points at the same time. Typically, a scheduling algorithm is utilized in connection with issue logic to ensure that each thread is able to proceed at a reasonable rate, with the number of bubbles in the execution unit pipeline kept at a minimum.

"Despite the significant performance capabilities of SIMD execution units, it has been found that there is a substantial amount of processing overhead consumed in arranging data into a format that takes advantage of the multiple lane SIMD execution units. This problem is aggravated, for example, when data is stored in memory in an array of structures (AOS) format and an execution unit processes data in a structure of arrays (SOA) format. Furthermore, in many instances, one process may require the data in one format, while another will require the data to be in a different format, which often forces data to be stored in memory in one format, with a processor loading and re-ordering the data into the other format before processing the data with an SIMD execution unit.

"One conventional approach to this problem is to load all the data, and then move it around in the vector register file. This approach, however, typically wastes many instructions. Another approach is to 'swizzle,' or rearrange, the load data right before entering it into the register file. While this approach typically saves functional instructions, the approach still typically requires every load to make multiple accesses into a data cache.

"As an example, many typical workloads that rely on SIMD operations follow a simple loop where there is a vector load, followed by a SIMD floating point operation such as a multiply add, and then followed by a vector store. In many conventional processor architectures, this three instruction sequence will be processed as a four cycle load, a single cycle math operation, and a four cycle store, resulting in a loop that is very cache bandwidth heavy and that does not take full advantage of the processing capabilities of an SIMD execution unit.

"Therefore, a significant need continues to exist in the art for a manner of minimizing the performance overhead associated with arranging data in a suitable format for execution in a data processing system, particularly for execution using an SIMD execution unit."

In addition to obtaining background information on this patent application, VerticalNews editors also obtained the inventors' summary information for this patent application: "The invention addresses these and other problems associated with the prior art by providing a method and circuit arrangement that selectively swizzle data in one or more levels of cache memory coupled to a processing unit based upon one or more swizzle-related page attributes stored in a memory address translation data structure such as an Effective To Real Translation (ERAT) or Translation Lookaside Buffer (TLB). A memory address translation data structure may be accessed, for example, in connection with a memory access request for data in a memory page, such that attributes associated with the memory page in the data structure may be used to control whether data is swizzled, and if so, how the data is to be formatted in association with handling the memory access request. As such, when the data is retrieved from the cache memory for processing by a processing unit, the data is formatted in a form that is optimized for efficient processing of the data by the processing unit.

"Therefore, consistent with one aspect of the invention, data is accessed in a data processing system by, in response to a memory access request initiated by a processing unit in the data processing system, accessing a memory address translation data structure to perform a memory address translation for the memory access request; accessing at least one swizzle-related page attribute in the memory address translation data structure to determine whether data from the memory page associated with the memory access request should be swizzled; and causing data from the memory page to be stored in a cache memory in a swizzled format based upon the at least one swizzle-related page attribute.

"These and other advantages and features, which characterize the invention, are set forth in the claims annexed hereto and forming a further part hereof. However, for a better understanding of the invention, and of the advantages and objectives attained through its use, reference should be made to the Drawings, and to the accompanying descriptive matter, in which there are described exemplary embodiments of the invention.


"FIG. 1 is a block diagram of exemplary automated computing machinery including an exemplary computer useful in data processing consistent with embodiments of the present invention.

"FIG. 2 is a block diagram of an exemplary NOC implemented in the computer of FIG. 1.

"FIG. 3 is a block diagram illustrating in greater detail an exemplary implementation of a node from the NOC of FIG. 2.

"FIG. 4 is a block diagram illustrating an exemplary implementation of an IP block from the NOC of FIG. 2.

"FIG. 5 is a block diagram illustrating an example swizzle operation consistent with the invention.

"FIG. 6 is a block diagram of an exemplary data processing system incorporating memory address translation-based swizzling consistent with the invention.

"FIG. 7 is a block diagram of an exemplary ERAT entry format for the ERAT referenced in FIG. 6.

"FIG. 8 is a block diagram illustrating an exemplary memory access using a data processing system supporting memory address translation-based swizzling consistent with the invention.

"FIG. 9 is a flowchart illustrating an exemplary sequence of operations for performing a load access in the data processing system of FIG. 8.

"FIG. 10 is a flowchart illustrating an exemplary sequence of operations for performing a cast out in the data processing system of FIG. 8.

"FIG. 11 is a block diagram illustrating an exemplary data processing system including multiple levels of address translation-based swizzling consistent with the invention.

"FIG. 12 is a block diagram illustrating swizzling of packet headers using address translation-based swizzling consistent with the invention."

For more information, see this patent application: Kuesel, Jamie R.; Kupferschmidt, Mark G.; Schardt, Paul E.; Shearer, Robert A. Cache Swizzle with Inline Transposition. Filed December 12, 2012 and posted June 19, 2014. Patent URL:

Keywords for this news article include: Information Technology, Information and Data Processing, Information and Data Architecture, International Business Machines Corporation.

Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC


Source: Information Technology Newsweekly
