The assignee for this patent, patent number 8626815, is
Reporters obtained the following quote from the background information supplied by the inventors: "This invention relates to the use of programmable integrated circuit devices (e.g., field-programmable gate arrays or other programmable logic devices (PLDs)) to perform matrix multiplication operations.
"In a multiplication of two input matrices AA and BB to form resultant matrix CC, each resultant element c.sub.ij in the resultant matrix CC will be the dot product of the ith row in matrix AA and the jth column in matrix BB. For example, c.sub.57 will be the dot product of the fifth row of matrix AA and the seventh column of matrix BB. The length of a row of (i.e., the number of columns, k, in) matrix AA is equal to the height of a column of (i.e., the number of rows in) matrix BB. As is evident, the computation of each element c.sub.ij requires k multiplications. Moreover, there are i.times.j elements c.sub.ij in matrix CC, for a total of i.times.j.times.k multiplications. For large matrices (with, e.g., hundreds of elements per dimension), there may not be enough multipliers and logic resources available on a programmable integrated circuit device, such as an FPGA, to perform even the k multiplications to multiply just one row and column together for a single resultant element c.sub.ij. k-1 adders also are required to add the individual products to obtain the dot product."
In addition to obtaining background information on this patent, VerticalNews editors also obtained the inventor's summary information for this patent: "This disclosure describes a method and a structure whereby multiple small dot products can be effectively combined to generate a larger dot product, for each element of a matrix multiplication. A programmable integrated circuit device may be configured as such a structure, to carry out the method.
"Specifically, to deal with the need for a greater number of multipliers than are available just to perform one row-column dot product for one resultant element c.sub.ij, each row and each column can be broken into manageable blocks, with each block loaded in turn to compute a smaller dot product, and then the results can be added together to obtain the desired row-column dot product. The earliest results for each dot product are saved for a number of clock cycles equal to the number of portions N into which each row or column is divided. This can be done with an N-element shift register. The contents of the elements are then added, using N-1 adders, to provide an element c.sub.ij of the resultant matrix. No accumulation is required.
"However, as described in more detail below, this results in repeated loading and unloading of the same blocks at different times as different elements are computed. Moreover, one must have sufficient bandwidth to load all of the values, and memory bandwidth decreases with increasing memory size (because the ratio of edge to area decreases), so that the delays in multiple loadings and unloadings of the same blocks is magnified by the bandwidth bottleneck, increasing the number of clock cycles required to compute a single c.sub.ij calculation.
"Accordingly, pursuant to another aspect of the invention, instead of performing all parts of one c.sub.ij calculation in order and then moving on to the next c.sub.ij calculation, each block or portion of a row is loaded and all calculations that use that block or portion with a block or portion of any column--for any of the c.sub.ij--are carried out. As a result, no c.sub.ij computation can be completed until the partial calculations using the last block or portion of the row in question begin. Therefore, the partial calculations are stored in a set of cache memories.
"In one embodiment, the number of caches is equal to the number N of portions into which each row or column is divided. Each nth cache stores the respective dot products of an nth row-block of matrix AA with the respective nth column-blocks of the columns of matrix BB. Therefore, each c.sub.ij is spread across corresponding locations in the N caches--i.e., c.sub.ij is spread across the N jth locations of the N caches. However, once the Nth cache begins to fill, each c.sub.ij can be burst out as soon as the corresponding location in the Nth cache is computed. Thus, once results start to become available, a new result is burst out on each clock cycle. Specifically, each c.sub.ij for the ith row will be available on (N(N-1)+j)th clock cycle of computations for that row.
"By using one loading of a partial row of matrix AA to compute all products of that partial row and any partial column of matrix BB with which it must be multiplied, this approach increases the effective bandwidth of the memory used to store matrix AA, and reduces power consumption by reducing memory access.
"Therefore, in accordance with the present invention, there is provided a method of configuring a programmable integrated circuit device to perform multiplication of a first multiplicand matrix by a second multiplicand matrix to form a resultant matrix, where the first multiplicand matrix has a first number of rows and a second number of columns, the second multiplicand matrix has that second number of rows and a third number of columns, and the resultant matrix has a number of elements equal to a product of the first and third numbers. The method includes configuring logic of the programmable integrated circuit device as a fourth number of multipliers, where the fourth number is one-Nth of the second number. Logic of the programmable integrated circuit device is configured to break down each row of the first multiplicand matrix into N row-blocks and to break down each column of the second multiplicand matrix into N column-blocks, and to use the fourth number of multipliers to form a respective dot-product of each of the row-blocks with a respective one of the column-blocks to form N partial dot products of each respective row of the first multiplicand matrix and a corresponding column of the second multiplicand matrix. Logic of the programmable integrated circuit device is configured to save each of the N partial dot products until all of the N partial dot products have been computed. Logic of the programmable integrated circuit device is configured to add the N partial dot products to provide an element of the resultant matrix corresponding to the respective row of the first multiplicand matrix and the corresponding column of the second multiplicand matrix.
"A programmable logic device so configured, and a machine-readable data storage medium encoded with software for performing the method, are also provided."
For more information, see this patent: Langhammer, Martin. Configuring a Programmable Integrated Circuit Device to Perform Matrix Multiplication. U.S. Patent Number 8626815, filed
Keywords for this news article include:
Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC
Most Popular Stories
- Obama Administration Releases Proposal to Regulate For-Profit Colleges
- Apple, HP, Intel May Take a Hit from Slowdown in Smartphone Sales Growth
- Elizabeth Vargas' Husband Marc Cohn Addresses Rumors
- Keurig Adds Peet's coffee, Alters Starbucks deal
- U.S. to Relinquish Gov't Control Over Internet
- Motley Crue's Nikki Sixx Marries Model Courtney Bingham
- Quiznos Files for Chapter 11
- Chinese e-Commerce Giant Alibaba Gears for IPO in U.S.
- FDIC Files Lawsuit on Behalf of Banks Allegedly Hurt by Libor Scandal
- Some California Cities Seeking Water Independence