News Column

Researchers Submit Patent Application, "Caching of Virtual to Physical Address Translations", for Approval

June 24, 2014



By a News Reporter-Staff News Editor at Information Technology Newsweekly -- From Washington, D.C., VerticalNews journalists report that a patent application by the inventor ISLOORKAR, Nitin (Bangalore, IN), filed on December 5, 2012, was made available online on June 12, 2014.

The patent's assignee is Arm Limited.

News editors obtained the following quote from the background information supplied by the inventors: "Data processing systems may use virtual addresses to indicate storage locations, while the processing system uses physical addresses that represent the actual locations on the silicon in which the data is stored. Virtual addresses may be used to reduce the number of bits required to identify an address location or to allow several processes to access a restricted memory space.

"A complete set of current mappings for the virtual to physical addresses is stored in memory; however, in order to decrease access time to these mappings, recently used mappings are stored in caches that can be accessed more quickly by the processor. There may be an L1 cache that is fast to access, acts as a micro TLB and stores a small subset of recently used mappings, and a slower L2 cache that is the macro TLB and stores a set of currently used mappings, while a full set of page tables of mappings is stored in memory. The mapping of virtual to physical address space is done in blocks and these may vary in size, being for example blocks of 1 Gbyte, 2 Mbyte or 4 Kbyte. The number of bits that need to be mapped for a translation depends on the size of the block that an address is located in. If, for example, an address is in a 2 Mbyte block, then only the higher n to 21 bits need to be found from the translation tables, while if it is in a 4 Kbyte block then bits n to 12 need to be found.
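The relationship between block size and the number of bits a translation must supply can be illustrated as follows (a minimal Python sketch, not taken from the patent; the 32-bit address width, with n = 31, and the exact bit positions are assumptions):

```python
# Illustrative only: for each block size, the page tables must supply the
# address bits from the top bit n down to the block's lowest mapped bit;
# everything below that is an offset copied straight from the virtual address.
BLOCK_LOW_BIT = {"1G": 30, "2M": 21, "4K": 12}  # assumed standard block sizes

def bits_to_translate(block: str, top_bit: int = 31) -> int:
    """Number of address bits the translation tables must provide."""
    return top_bit - BLOCK_LOW_BIT[block] + 1

# A 2 Mbyte block needs bits 31..21 (11 bits); a 4 Kbyte page needs 31..12 (20 bits).
```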

"The mappings and the final page sizes are therefore stored in tables in the memory, with the first n to 31 bits, which represent 1 Gbyte blocks, being stored in a first table, the next 30 to 21 bits being stored in a next table, and so on. A page table walk in memory that is used to retrieve the mappings is performed in steps or walks, and where the address is in a larger block only the first step(s) need to be performed.
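The stepped walk described above, terminating early when a larger block is found, might be modelled like this (a hypothetical sketch; the three-level layout, 9-bit indices and entry format are assumptions, not details from the patent):

```python
# Hypothetical three-level walk: each level of tables resolves one slice of
# the virtual address, and a "block" entry at any level ends the walk early,
# so addresses in larger blocks need fewer steps.
LEVEL_SHIFTS = [30, 21, 12]  # 1 Gbyte, 2 Mbyte, 4 Kbyte granularity (assumed)

def table_walk(root, va):
    """root: dict mapping a 9-bit index to ('table', next_table) or ('block', base)."""
    table = root
    for shift in LEVEL_SHIFTS:
        index = (va >> shift) & 0x1FF          # 9 index bits per level (assumption)
        kind, value = table[index]
        if kind == "block":                    # larger mapping found: stop early
            return value | (va & ((1 << shift) - 1))
        table = value                          # descend to the next-level table
    raise KeyError("no mapping found")
```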

"When a mapping has been found following a page table walk it is stored in the L2 macro TLB and in the L1 micro TLB.

"As can be appreciated, retrieving these mappings from memory is expensive in both time and power.

"In order to increase efficiency in some data processing systems, the RAM used as the level 2 cache for the macro TLB is also used as a walk cache and as a prefetch buffer, such that in addition to retrieving the requested mapping, a mapping for the subsequent virtual address is retrieved during the same page table walk. This can then be written into the prefetch buffer, after the originally requested mapping has been written into the macro TLB.
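The prefetch behaviour described above, in which the mapping for the subsequent address is fetched during the same walk, can be sketched as follows (illustrative Python; the dictionary-based caches, the `translate` helper and the 4 Kbyte page size are assumptions):

```python
# Illustrative sketch: on a macro-TLB miss, the walk fetches the requested
# mapping and, speculatively, the mapping for the next sequential page, which
# is written into the prefetch buffer after the requested mapping is stored.
PAGE = 4096  # assumed page size

def translate(va, macro_tlb, prefetch, page_tables):
    page, offset = va & ~(PAGE - 1), va & (PAGE - 1)
    if page in macro_tlb:
        return macro_tlb[page] | offset
    macro_tlb[page] = page_tables[page]        # result of the page table walk
    nxt = page + PAGE
    if nxt in page_tables:                     # fetched during the same walk
        prefetch[nxt] = page_tables[nxt]
    return macro_tlb[page] | offset
```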

"The walk cache will store intermediate translations following some of the page table walks, the stored values being an input to a final page table walk required to finish the translation. In some systems there may be double virtualisation. This occurs where there are several guest OSs and each references a large amount of memory. The memory of the data processing system is not sufficiently large for them all to reference distinct portions of memory. Instead they share some memory area, but this is not visible to them and is managed by the hypervisor, with each guest OS using intermediate physical addresses (IPA) that the hypervisor then maps to real physical addresses. In this way the hypervisor controls the memory use and allows each guest OS to believe it has sight of a large dedicated memory space. In this case each guest OS manages the page tables for the VA to IPA translations and the hypervisor manages tables of translations from IPA to PA. In these cases an IPA2PA cache may be provided in the level 2 RAM storing intermediate steps in the IPA to PA translations."
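The nested VA-to-IPA-to-PA translation described above amounts to two successive table lookups, which can be sketched as follows (a simplified model; single-level tables and a 4 Kbyte page size are assumed for brevity):

```python
# Simplified nested translation: the guest OS's tables map VA -> IPA and the
# hypervisor's tables map IPA -> PA (single-level tables are an assumption;
# real systems walk multi-level tables at each stage).
PAGE = 4096

def nested_translate(va, guest_tables, hyp_tables):
    ipa = guest_tables[va & ~(PAGE - 1)] | (va & (PAGE - 1))   # stage 1: VA -> IPA
    return hyp_tables[ipa & ~(PAGE - 1)] | (ipa & (PAGE - 1))  # stage 2: IPA -> PA
```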

As a supplement to the background information on this patent application, VerticalNews correspondents also obtained the inventor's summary information for this patent application: "A first aspect provides a data processing apparatus comprising: at least one initiator device for issuing transactions, a hierarchical memory system comprising a plurality of caches and a memory and memory access control circuitry, said initiator device identifying storage locations using virtual addresses and said memory system storing data using physical addresses, said memory access control circuitry being configured to control virtual address to physical address translations; wherein said plurality of caches, comprise a first cache and a second cache; said first cache being configured to store a plurality of address translations of virtual to physical addresses that said initiator device has requested; and said second cache being configured to store a plurality of address translations of virtual to physical addresses that it is predicted that said initiator device will subsequently request; wherein said first and second cache are arranged in parallel with each other such that said first and second caches can be accessed during a same access cycle.

"The technology described herein recognises that the retrieval of address translations from memory may be a time consuming process. It also recognises that many memory accesses are performed in an ordered manner, such that when one translation is required the next translation that is likely to be required may be to some extent predictable. Thus, it may be that for much of the time translations are retrieved consecutively, or every other translation is retrieved, or some other general rule may be followed which allows the next translation required to be predicted with some degree of accuracy. Thus, fetching two translations together and storing them in parallel in two caches enables their accesses to be performed efficiently. Although the use of two parallel caches has a slight area overhead, the performance improvement is considerable.

"In this way by recognising that it would be advantageous to store translations in caches in parallel, the time to store the translations and access them is improved.

"It should be noted that an initiator device is any device that issues transactions; thus, it may be a master such as a processor, a GPU or a DMA controller.

"In some embodiments, said memory access control circuitry is configured in response to receipt of a request for an address translation from said initiator device, to look for said translation in said first cache and said second cache and in response to said requested translation not being present, to retrieve said translation from a lower hierarchical data store and to retrieve a translation for a speculative subsequently required virtual address while accessing said lower hierarchical data store and to update said first cache with said retrieved translation and said second cache with said retrieved speculative subsequently required translation at a same time.

"The present technique recognises that many memory accesses are performed in an ordered manner, such that when one translation is required the one that is likely to be required next can be predicted, and it therefore fetches a requested translation and a predicted next translation in a same access. In this way the time to retrieve the two is reduced, and as they are fetched together they are ready to be stored at a same time. If these two fetched translations were to be stored in a single cache this would require two sequential accesses to the cache. However, by providing two caches that are arranged in parallel the access can be made in parallel, in some cases in a single access cycle, thereby improving performance.
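The parallel-fill idea might be modelled as follows (an illustrative sketch, not the patent's implementation; the single combined `walk` callback, the dictionary caches and the 4 Kbyte page size are assumptions):

```python
# Illustrative sketch: both caches are probed together; on a miss a single
# walk returns the requested and the predicted translation, and because the
# caches are separate structures both can be filled in the same step rather
# than through two sequential accesses to one cache.
PAGE = 4096  # assumed page size

def lookup_and_fill(page, macro_tlb, prefetch, walk):
    if page in macro_tlb:                  # both caches probed in parallel
        return macro_tlb[page]
    if page in prefetch:
        return prefetch[page]
    requested, predicted = walk(page)      # one walk, two translations
    macro_tlb[page] = requested            # filled together, not sequentially
    prefetch[page + PAGE] = predicted
    return requested
```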

"In some embodiments, said first cache comprises a macro table lookaside buffer and said second cache comprises a prefetch buffer.

"The first cache may be a macro table lookaside buffer and the second cache a prefetch buffer storing translations for transactions that it has been predicted the master or initiator will issue.

"In some embodiments, said first cache and said second cache comprise level two caches, said data processing apparatus comprising a level one cache comprising a micro table lookaside buffer, said memory being configured to store address lookup tables comprising virtual to physical address translations.

"The hierarchy of the system may be such that the macro TLB is a level 2 cache and there is a level 1 micro TLB storing a subset of the translations and a memory that stores the page tables for the complete set of virtual to physical address translations.

"In some embodiments, said memory access control circuitry is configured in response to detecting said requested translation in said second cache to transfer said translation to a same line of said first cache.

"In some embodiments, if the memory access control circuitry detects that the currently requested translation is in the second cache then it will transfer it to the first cache, and in some cases to the same line in the first cache. This means that where, for example, a predicted translation has been stored in the prefetch buffer, it is a simple matter to retrieve it, remove it from the second cache and store it in the first cache; having the caches arranged in parallel facilitates this procedure. Although the translation may not be stored in the same line in the first cache, it may be in cases where the two caches are of a same size and the translations are indexed by virtual address, such that the virtual address determines the storage location. In many embodiments the two caches may not have the same size and the storage of translations may be indexed in a different way.
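The transfer of a prefetch-buffer hit into the first cache might look like this in a simplified model (the `probe` helper and the dictionary caches are illustrative assumptions, not the patent's design):

```python
# Illustrative sketch: a hit in the prefetch buffer is removed from it and
# written into the macro TLB, which the parallel arrangement makes simple.
def probe(page, macro_tlb, prefetch):
    if page in macro_tlb:
        return macro_tlb[page]
    if page in prefetch:
        macro_tlb[page] = prefetch.pop(page)   # transfer to the first cache
        return macro_tlb[page]
    return None                                # miss in both caches
```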

"In some embodiments, said data processing apparatus comprises a data processing apparatus configured to operate a double virtualisation system that uses a set of virtual addresses and a set of intermediate physical addresses, wherein an operating system operating on said data processing apparatus uses said set of virtual addresses and said intermediate set of physical addresses, and a hypervisor manages the memory space by translating said set of intermediate physical addresses to physical addresses, said data processing apparatus further comprising: a third cache and a fourth cache; said third cache being configured to store a plurality of partial virtual to physical address translations, said partial translations comprising translations for higher bits of said addresses, said partial translations forming an input to a lookup to memory required for completing said translations; and said fourth cache being configured to store a plurality of partial translations of said intermediate physical addresses to corresponding physical addresses, said plurality of partial translations corresponding to results from lookup steps performed in lookup tables in said memory during said translation; wherein said memory access control circuitry is configured to store in said fourth cache a final step in said intermediate physical address to physical address translation at a same time as storing said partial translation in said third cache.

"Many modern systems use double virtualisation so that they can support plural guest operating systems that may each access large sections of memory. A hypervisor will manage this by providing intermediate physical addresses that the guest operating systems will regard as the real physical addresses. Such a system requires further address translations, and where two caches have been provided in parallel for virtual address to physical address translations, further caches can also be arranged in parallel to store translations relating to the conversion of intermediate physical addresses to physical addresses. It may be advantageous to store partial translations, as retrieving translations from memory is a lengthy process requiring several steps or walks. The first few steps may be common to translations within a block of memory and thus may be used by several different translations. Thus, storing these partial translations will reduce the steps required for these later translations. Similarly, the intermediate physical address to physical address translations may be used frequently, as a particular intermediate physical address space may be used by the hypervisor to service several guest OSs. Thus, the intermediate steps in the translation may be required several times within a relatively short timeframe and therefore caching these can improve performance.

"It should be noted that the third cache may be present without the fourth cache in cases where there is no double virtualisation.

"The present technique also recognises that where a nested translation is occurring, one of the intermediate physical address to corresponding physical address translation steps will be performed at the end of the partial virtual to physical address translation. Thus, the results for these two will be retrieved from the page tables at the same time, and if the two caches storing this information are arranged in parallel the results can be written in the same access step, once again saving time.

"In some embodiments, said third cache comprises a walk cache and said fourth cache comprises an IPA2PA translation lookaside buffer.

"A walk cache is a cache that stores a pointer to the final step in the page table walk for a virtual to physical address translation while an IPA2PA translation lookaside buffer stores the intermediate physical to physical address partial translation steps.

"In some embodiments, said first and second caches each comprise four set associative RAMs.

"Although the caches may be formed in a number of ways, a four-way set associative cache stores the data in a manner that provides a reasonable hit rate without too high a hardware overhead.
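A four-way set associative arrangement can be sketched as follows (an illustrative model; the 64-set size and the FIFO eviction policy are assumptions, not details from the patent):

```python
# Minimal 4-way set-associative cache sketch: the index bits of a key pick a
# set, and the four ways of that set are compared in parallel in hardware.
NUM_SETS, WAYS = 64, 4  # assumed geometry

class SetAssocCache:
    def __init__(self):
        # each set holds up to four (tag, value) pairs, one per way
        self.sets = [[] for _ in range(NUM_SETS)]

    def lookup(self, key):
        for tag, value in self.sets[key % NUM_SETS]:
            if tag == key:
                return value
        return None

    def insert(self, key, value):
        s = self.sets[key % NUM_SETS]
        if len(s) == WAYS:
            s.pop(0)            # evict the oldest way (FIFO stand-in for a real policy)
        s.append((key, value))
```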

"In some embodiments, said first, second, third and fourth caches are formed from RAMs, said first and second caches being on different RAMs and said third and fourth caches being on different RAMs.

"If two RAMs are provided to form the first and second caches then it is convenient if the third and fourth caches are placed in the same parallel RAMs. In this way two parallel RAMs will provide four caches. If the caches are formed of four-way set associative caches then each way is formed from a RAM, so that each cache is formed of four set associative RAMs, and the first and second caches are in different sets and similarly the third and fourth caches are in different sets, the first and either the third or fourth being in the same set, while the second and either the fourth or third are in the same set. It has been found that the first and second caches are conveniently arranged in parallel, as there may be accesses to these two that are requested at the same time and therefore performing them in parallel is efficient. Furthermore, the third and fourth caches have accesses that may be made at the same time, and therefore arranging them in parallel also provides improved performance. Providing four RAMs in parallel would require additional area and would not provide an advantage, as in general the third and fourth caches are not updated at the same time as the first and second are.

"In some embodiments, said memory access control circuitry is configured to detect a request to invalidate a set of translations for a context and in response to detection of said invalidate request to perform said invalidation and while said invalidation is pending: to determine whether any update or lookup requests for said context are received; and in response to detecting a lookup request, said memory access control circuitry is configured to signal a miss in said plurality of caches without performing said lookup; and in response to detecting an update request said memory access control circuitry is configured to transmit an accept reply to said update request and not to perform said update.

"The present technique recognises that in hierarchical memory systems that use address translations there may be times when the processor or other master switches context, and the software controlling the master, or the hypervisor, sends a request to invalidate a set of translations for a context, where for example the context has completed. It will then send this invalidate request to the control circuitry that controls the coherency of these hierarchical memories. There may be pending update requests for some of the translations that are being invalidated, and therefore it is advantageous if the memory access control circuitry can, in response to detecting an invalidation request, determine whether any update or lookup request for that context has been received and, where a lookup request has been received, simply signal a miss without performing the lookup and, where an update request has been received, transmit an accept reply to the update request and not perform the update. Performing the lookups and the updates would simply cost power and take time, and as these entries will be invalidated there is no need to perform them. However, it is important that nothing is stalled waiting for them to complete and therefore it is convenient if accept replies and miss signals are sent.
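The behaviour while an invalidation is pending, with forced misses for lookups and accepted-but-dropped updates, can be sketched like this (a hypothetical model; the class and method names are illustrative, not from the patent):

```python
# Illustrative sketch: while a context's entries are being invalidated,
# lookups for that context report a miss without probing the caches, and
# updates are acknowledged (so nothing upstream stalls) but then dropped.
class TlbControl:
    def __init__(self):
        self.cache = {}            # (context, page) -> translation
        self.invalidating = set()  # contexts with an invalidation in flight

    def lookup(self, ctx, page):
        if ctx in self.invalidating:
            return None            # forced miss: no point probing doomed entries
        return self.cache.get((ctx, page))

    def update(self, ctx, page, translation):
        if ctx in self.invalidating:
            return "accepted"      # acknowledge, then silently drop the update
        self.cache[(ctx, page)] = translation
        return "accepted"
```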

"A second aspect provides a data processing apparatus comprising: an initiator device for issuing a stream of transaction requests, a hierarchical memory system comprising a plurality of caches and a memory and memory access control circuitry, wherein said initiator device identifies storage locations using virtual addresses and said memory system stores data using physical addresses, said memory access control circuitry being configured to control virtual address to physical address translations; wherein said memory access control circuitry is configured to detect a request to invalidate a set of virtual to physical address translations for a context and in response to detection of said invalidate request to perform said invalidation and while said invalidation is pending: to determine whether any update or lookup requests for said context are received; and in response to detecting a lookup request said memory access control circuitry is configured to signal a miss in said plurality of caches without performing said lookup; and in response to detecting an update request said memory access control circuitry is configured to transmit an accept reply to said update request and not perform said update.

"As noted previously, where address translations for a context are to be invalidated it is convenient if the memory access control circuitry can control this process and detect any update or lookup requests and not process them. Thus, update requests can be accepted and then simply dropped, while lookup requests will signal a miss in the caches but will not actually perform the lookup. As noted previously this will save power and avoid performing any unnecessary accesses to the cache, and as responses have been sent there will be no upstream transactions waiting for them to complete. It may, however, in some cases decrease the hit rate. This occurs where, for example, the page table to be invalidated is a smaller sized page table, say 4 Kbyte. In such a case the larger tables need to be accessed first before it is known that the page size is 4 Kbyte. The invalidate may not affect the larger sized table, but the lookup to this table is still returned as a miss.

"A third aspect provides a method of storing virtual to physical address translations in a plurality of caches comprising the steps of: receiving a request for a virtual to physical address translation from an initiator device; in response to determining that said address translation is not stored in one of said plurality of caches; retrieving said translation from a lower hierarchical data store and also retrieving a translation for a predicted subsequent virtual address; updating a first cache with said retrieved translation; and updating a second cache with said retrieved subsequent translation in a same access cycle.

"A fourth aspect provides a data processing apparatus comprising: at least one initiator means for issuing transactions, a hierarchical memory system comprising a plurality of caches and a memory and memory access control means, said initiator means identifying storage locations using virtual addresses and said memory system storing data using physical addresses, said memory access control means being for controlling virtual address to physical address translations; wherein said plurality of caches comprise a first caching means and a second caching means; said first caching means being for storing a plurality of address translations of virtual to physical addresses that said initiator means has requested; and said second caching means being for storing a plurality of address translations of virtual to physical addresses that it is predicted that said initiator means will subsequently request; wherein said first and second caching means are arranged in parallel with each other such that said first and second caching means can be accessed during a same access cycle.

"The above, and other objects, features and advantages of this invention will be apparent from the following detailed description of illustrative embodiments which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

"FIG. 1 shows a data processing apparatus according to the present technique;

"FIG. 2 shows a configuration for storing page walk tables;

"FIG. 3 shows caches used to store information relating to virtual to physical address translations;

"FIGS. 4a and 4b schematically show steps in a nested address translation;

"FIG. 5 shows four way set associative caches;

"FIG. 6 shows a further example of a processing system according to the present technique;

"FIG. 7 shows a flow diagram illustrating steps in a method of performing a VA to PA translation according to the present technique; and

"FIG. 8 shows a flow diagram illustrating steps in a method of performing an IPA to PA translation according to the present technique."

For additional information on this patent application, see: ISLOORKAR, Nitin. Caching of Virtual to Physical Address Translations. Filed December 5, 2012 and posted June 12, 2014. Patent URL: http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=542&p=11&f=G&l=50&d=PG01&S1=20140605.PD.&OS=PD/20140605&RS=PD/20140605

Keywords for this news article include: Arm Limited, Information Technology, Information and Data Processing.

Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC





Source: Information Technology Newsweekly