OLAP cube

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Yorrose (talk | contribs) at 21:31, 12 October 2006 (linked in hyperion page). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

OLAP (on-line analytical processing) was a term coined by E.F. Codd & Associates who published a white paper in 1994, commissioned by Arbor Software (now Hyperion Solutions), entitled ‘Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate’. (see also article on OLAP).

Codd was one of the originators of the relational database and so his opinions were respected. However Codd was paid to write the white paper to support Arbor's existing Essbase product, rather than the concept being mathematically-based. Neverthless Codd's reputation legitimised a useful new arrangement of data in arrays to allow fast analysis. These arrays are called cubes. The arrangement of data into cubes avoids a limitation of relational databases which are not well suited for near instantaneous analysis of large amounts of data. Relational databases are better suited for creating records from a series of transactions (known as OLTP or on-line transaction processing). Although many report-writing tools exist for relational databases, these are slow when the whole database must be summarised. Codd's paper was viewed as propaganda that users could show to skeptical IT departments who had often thought that everything their users wanted could be done from a relational database.

OLAP cubes can be thought of as extensions to the two-dimensional array of a spreadsheet. For example a company might wish to analyse some financial data by product, by time-period, by city, by type of revenue and cost, and by comparing actual data with a budget. These additional methods of analysing the data are known as dimensions.

A financial analyst might want to view the data in various ways, such as displaying all the cities down the page and all the products across a page. This could be for a specified period, version and type of expenditure. Having seen the data in this particular way the analyst might then immediately wish to view it in another way. The cube could effectively be re-oriented so that the data displayed now had periods across the page and type of cost down the page. Because this re-orienation involved re-summarising very large amounts of data, this new view of the data had to be generated efficiently to avoid wasting the analyst's time, ie within seconds, rather than the hours a relational database and conventional report-writer might have taken.

Each of the elements of a dimension could be summarised using a hierarchy. For example May 2005 could be summarised into Second Quarter 2005 which in turn would be summarised in the Year 2005. Similarly the cities could be summarised into regions, countries and then global regions; products could be summarised into larger categories; and cost headings could be grouped into types of expenditure. Conversely the analyst could start at a highly summarised level such as the total difference between the actual results and the budget and drill down into the cube to discover which locations, products and periods had produced this difference.

Because there can be more than three dimensions in an OLAP system the term hypercube is sometimes used. The commercial OLAP products have different methods of creating the cubes and hypercubes and of linking cubes and hypercubes (see Types of OLAP in the article on OLAP.)

Linking cubes is a method of overcoming sparsity. Sparsity arises when not every cell in the cube is filled with data and so valuable processing time is taken by effectively adding up zeros. For example revenues may be available for each customer and product but cost data may not be available with this amount of analysis. Instead of creating a sparse cube, it is sometimes better to create another separate, but linked, cube in which a sub-set of the data can be analysed into great detail. The linking ensures that the data in the cubes remain consistent.

The data in cubes may be updated at times, perhaps by different people. Techniques are therefore often needed to lock parts of the cube while one of the users is writing to it and to recalculate the cube's totals. Other facilities may allow an alert that shows previously calculated totals are no longer valid after the new data has been added, but some products only calculate the totals when they are needed.

Technical definition

In database theory, an OLAP cube is an abstract representation of a projection of an RDBMS relation. Given a relation of order N, consider a (hopefully PK-preserving) projection that subtends X, Y, and Z as the key and W as the residual attribute. Characterizing this as a function,

W : (X,Y,Z) → W,

the attributes X, Y, and Z correspond to the axes of the cube, while the W value into which each ( X, Y, Z ) triple maps corresponds to the data element that populates each cell of the cube.

Insofar as two-dimensional output devices cannot readily characterize four dimensions, it is more practical to project "slices" of the data cube (we say project in the classic vector analytic sense of dimensional reduction, not in the SQL sense, although the two are clearly conceptually homologous), perhaps

W : (X,Y) → W

which, although not PK-preserving, may have some semantic significance nevertheless, perhaps a slice of the triadic functional representation for a given Z value of interest.

The motivation behind OLAP displays harks back to the cross-tabbed report paradigm of 1980s DBMS. One may wish for a spreadsheet-style display, where—to appropriate the Microsoft Excel paradigm—values of X populate row $1; values of Y populate column $A; and values of W : ( X, Y ) → W populate the individual cells "southeast of" $B2, so to speak, $B2 itself included. While one can certainly use the DML (Data Manipulation Language) of traditional SQL to display ( X, Y, W ) triples, this output format is not nearly as convenient as the cross-tabbed alternative: certainly, the former requires one to hunt linearly for a given ( X, Y ) pair in order to determine the corresponding W value, while the latter enables one to arguably more conveniently scan for the intersection of the proper X column with the proper Y row.

The MultiDimensional eXpressions (MDX) language has been developed as a facile means for expressing OLAP problems. While it is possible to translate some of these into traditional SQL, it would frequently require the synthesis of clumsy SQL expressions even for very simple MDX expressions. MDX has been embraced by wide majority of OLAP vendors and became de-facto standard for OLAP systems.

amitava