The assignee for this patent, patent number 8626725, is
Reporters obtained the following quote from the background information supplied by the inventors: "By way of background concerning conventional compression, when a large amount of data is stored in a database, such as when a server computer collects large numbers of records, or transactions, of data over long periods of time, other computers sometimes desire access to that data or a targeted subset of that data. In such case, the other computers can query for the desired data via one or more query operators. In this regard, historically, relational databases have evolved for this purpose, and have been used for such large scale data collection, and various query languages have developed which instruct database management software to retrieve data from a relational database, or a set of distributed databases, on behalf of a querying client.
"Traditionally, relational databases have been organized according to rows, which correspond to records, having fields. For instance, a first row might include a variety of information for its fields corresponding to columns (name1, age1, address1, sex1, etc.), which define the record of the first row and a second row might include a variety of different information for fields of the second row (name2, age2, address2, sex2, etc.). However, traditionally, querying over enormous amounts of data, or retrieving enormous amounts of data for local querying or local business intelligence by a client have been limited in that they have not been able to meet real-time or near real-time requirements. Particularly in the case in which the client wishes to have a local copy of up-to-date data from the server, the transfer of such large scale amounts of data from the server given limited network bandwidth and limited client cache storage has been impractical to date for many applications.
"For instance, currently, scanning and aggregating 600 million rows of data having approximately 160 bytes of data each (about 100 Gigabytes of data), using two 'group by' operations and four aggregate operations as a sample query, the fastest known relational database management system (RDBMS), as measured by industry standard TPC-H metrics, can deliver and process the data in about 39.9 seconds. This represents delivery at an approximate bit rate of 2.5 Gb/sec, or about 15 million rows/sec. However, today's state of the art system runs almost
"By way of further background, due to the convenience of conceptualizing differing rows as differing records with relational databases as part of the architecture, techniques for reducing data set size have thus far focused on the rows due to the nature of how relational databases are organized. In other words, the row information preserves each record by keeping all of the fields of the record together on one row, and traditional techniques for reducing the size of the aggregate data have kept the fields together as part of the encoding itself.
"Run-length encoding (RLE) is a conventional form of data compression in which runs of data, that is, sequences in which the same data value occurs in many consecutive data elements, are stored as a single data value and count, rather than as the original run. In effect, instead of listing 'EEEEEE' as an entry, a run length of '6 Es' is defined for the slew of Es. RLE is useful on data that contains many such runs, for example, relatively simple graphic images such as icons, line drawings, and animations. However, where data tends to be unique from value to value, or pixel to pixel, etc., or otherwise nearly unique everywhere, RLE is known to be less efficient. Thus, sometimes RLE, by itself, does not lend itself to efficient data reduction, wasting valuable processing time for little to no gain.
"Another type of compression that has been applied to data includes dictionary encoding, which operates by tokenizing field data values to a reduced bit set, such as sequential integers, in a compacted representation via a dictionary used alongside of the resulting data to obtain the original field data values from the compacted representation.
"Another type of compression that has been applied to data includes value encoding, which converts real numbers into integers by performing some transformation over the data enabling a more compact representation, e.g., applying an invertible mathematical function over the data, which reduces the number of bits needed to represent the data. For instance, real numbers, such as float values, take up more space in memory than integer values, and thus invertibly converting float values to integer values reduces storage size and then a processor that uses the data can derive the float values when needed.
"Still another type of compression that has been applied to data includes bit packing, which counts the number of distinct values of data or determines the range over which the different values span, and then represents that set of numbers or values with the minimum number of bits as determined by an optimization function. For instance, perhaps the each field of a given column spans only a limited range, and thus instead of representing each value with, e.g., 10 bits as originally defined for the field, it may turn out that only 6 bits are needed to represent the values. Bit packing re-stores the values according to the more efficient 6 bit representation of the data.
"Each of these conventional compression techniques has been independently applied to the row-organized information of relational databases, e.g., via rowset operators, yet, each of these techniques suffers disadvantages in that none adequately address the problem of satisfying the delivery of huge amounts of data from a database quickly to a consuming client, which may have real-time requirements, for up-to-date data. Mainly, the conventional methodologies have focused on reducing the size of data stored to maximize the amount of data that can be stored for a given disk size or storage limit.
"However, these techniques on their own can actually end up increasing the amount of processing time over the data according to a scan or query of the data due to data intensive decoding or the monolithic size of the compressed storage structures that must be transmitted to complete the inquiry. For instance, with many conventional compression techniques, the longer it takes to compress the data, the greater the savings that are achieved with respect to size; however, on the other hand, the longer it takes to compress the data with such conventional compression schemes, the longer it takes to decompress and process as a result. Accordingly, conventional systems fail to provide a data encoding technique that not only compresses data, but compresses the data in a way that makes querying, searching and scanning of the data faster.
"In addition, limitations in network transmission bandwidth inherently limit how quickly compressed data can be received by the client, placing a bottleneck on the request for massive amounts of data. It would thus be desirable to provide a solution that achieves simultaneous gains in data size reduction and query processing speed. It would be further desirable to provide an improved data encoding technique that enables highly efficient compression and processing in a query based system for large amounts of data.
"The above-described deficiencies of today's relational databases and corresponding compression techniques are merely intended to provide an overview of some of the problems of conventional systems, and are not intended to be exhaustive. Other problems with conventional systems and corresponding benefits of the various non-limiting embodiments described herein may become further apparent upon review of the following description."
In addition to obtaining background information on this patent, VerticalNews editors also obtained the inventors' summary information for this patent: "A simplified summary is provided herein to help enable a basic or general understanding of various aspects of exemplary, non-limiting embodiments that follow in the more detailed description and the accompanying drawings. This summary is not intended, however, as an extensive or exhaustive overview. Instead, the sole purpose of this summary is to present some concepts related to some exemplary non-limiting embodiments in a simplified form as a prelude to the more detailed description of the various embodiments that follow.
"Embodiments of processing of column based data encoded structures are described. Various non-limiting embodiments enable efficient query processing over large scale data storage. A subset of columns implicated by a query are received as integer encoded and compressed sequences of values corresponding to different columns of data. Query processing buckets are defined that span over the subset of columns based on transitions of compression type occurring in the integer encoded and compressed sequences of values of the subset of data. The query is then processed in memory on a bucket by bucket basis and based on type of current bucket, e.g., pure bucket, single impurity bucket, double impurity bucket, etc., as defined in greater detail below.
"The column based organization of the data, and the application of a hybrid run length encoding and bit packing technique, enable a highly efficient and speedy query response in real-time. Synergy of the hybrid data reduction techniques in concert with the column-based organization, coupled with gains in scanning and querying efficiency owing to the column based compact representation, results in substantially improved data compression at a fraction of the cost of conventional systems, e.g., a factor of 400 times faster at less than 1/10 the cost of the fastest known conventional system.
"These and other embodiments are described in more detail below."
For more information, see this patent:
Our reports deliver fact-based news of research and discoveries from around the world. Copyright 2014, NewsRx LLC