Title: Building Very Large, Distributed Object Databases
Author: Jamie Shiers
Address: CERN, 1211 Geneva 23, Switzerland
Telephone: +41 22 767 4928
Fax: +41 22 767 8630
The European Laboratory for Particle for Physics, more commonly known as CERN, is in the process of constructing a new accelerator to further its study into the fundamental building blocks of nature and the forces that act between them. This new facility, known as the Large Hadron Collider or LHC, is scheduled to enter operation in late 2005. Aside from the exciting physics that will be possible with this machine, its construction and deployment bring with it many challenges, not least in the area of computing. To address these issues, a number of R&D activities have been established at CERN over recent years. In this article, we discuss one of these projects (RD45), which has been set up to find solutions to the problems of storing, managing and accessing the vast quantities of data - of the order of 100PB or one hundred million terabytes - that will be produced at the LHC.
CERN was established in the mid-50s as part of an effort to rekindle fundamental research in a war-ravaged Europe. From the outset, it has been an excellent example of international collaboration - some 20 nations currently contribute to the funding of CERN, which is staffed primarily, but not exclusively, by personnel from these countries. Over the years, CERN has built a number of accelerators, which have typically been reused as part of the infrastructure for later machines. The largest accelerator that is currently in operation at CERN is the Large Electron Positron collider (LEP), a 27 km ring situated approximately 100 metres below the border between France and Switzerland, just outside Geneva. True to the tradition of reuse, the LHC will in fact be housed in the LEP tunnel and be fed by the existing complex of accelerators. Four major experiments are planned for the LHC. Two of these, ATLAS and CMS, will record about 1PB of data per year at data rates of around 100MB/second. A third experiment, ALICE, which will also store about 1PB/year, will take data at even higher rates - up to 1.5GB/second. A fourth experiment, LHC-B, completes the picture giving rise to a total data volume per year, integrated over all experiments and including simulated and derived data, of approximately 5PB. As the LHC is expected to run for about 20 years, the total volume of data that must be stored is around 100PB, or 1017 bytes. This is approximately twice would have been recorded had 1 bit been stored for every second of the Universe's estimated age of 15 billion years! This clearly gives rise to a data management problem on a large scale, and hence a project, known internally as RD45, was established in early 1995 to investigate and propose solutions to this problem.
The RD45 Project
The RD45 project was established to find solutions to the problems of handling many PB of High Energy Physics (HEP) event data. The project was tasked to explicitly address the issue of handling persistent objects, it being assumed that the LHC experiments would adopt object oriented solutions and C++ as an implementation language. Not surprisingly, we are now also including Java in our investigations. In addition, the project was told to focus on commercial, standards-conforming solutions, if at all possible. As such, we concentrated our initial investigations on the various possibilities discussed in books such as . Although we considered many alternatives, such as language extensions, so-called light-weight object managers and so forth, we fairly rapidly reached the conclusion that only a fully-fledged ODBMS could come close to meeting our initial set of requirements. Indeed, no known ODBMS could meet all of our needs off-the-shelf - this is clearly something that we did not expect. For the rest of this article, we focus on our attempts at using a commercial ODBMS to handle HEP event data and the areas where we believe that enhancements are required.
Object Databases and Mass Storage Systems - the Prognosis
One of the many challenges of a project with such a long time scale is the uncertainty concerning which products will be available and the evolution of technology. It is not enough to be blindly optimistic - one must establish realistic scenarios based on conservative estimates of what will be available, including fallback possibilities should technology not advance as fast as predicted, to protect against the failure of suppliers and so forth. When trying to plan for a quarter of a century into the future, it can be instructive to look as far back in the past. 25 years ago, many of today's giants in the computing industry either did not exist or had yet to break into the mainstream. Architectures such as Digital Equipment Corporation's VAX were yet to be invented … and replaced. Codasyl databases were all the rage.
Looking into the future, it is clearly unrealistic to be too specific - one cannot depend on a given implementation or define precise characteristics. It is more meaningful to ask if there will be a market for a product offering certain functionality. For example, it is hard to believe that the storage market will do anything other than grow, although one cannot predict the exact form factor that disks will have in ten years, nor is it meaningful to attempt to do so. Such an analysis suggests that the basic hardware building blocks necessary to create a multi-PB store will be both available and affordable, although the relative costs of disk versus tape (or its replacement?) are much harder to predict. Indeed, the long-term market for tape as a medium may simply disappear.
In our overall strategy, we include the need for a Mass Storage System as it is far from clear that it will be financially possible to store all of the data on disk storage, and hence the need for some cheaper storage, plus software to manage it. A discussion of the requirements on such a system is outside the scope of this article. However, it is clear such a system must be transparently integrated with the overall data management solution. Latency aside, a user should not (need to) be aware whether his/her data or on disk, or have been faulted in from tape by the system.
Choosing an ODBMS
Procedures for choosing an ODBMS have been described in a number of books and articles, including .
The first thing to realise is that there is no "best" product - the systems that are available differ widely in architecture and functionality, and it is a question of matching requirements against a number of potential solutions. In a perfect world, there would be several alternatives meeting all of the mandatory requirements, and one would be able to select the primary supplier based on non-critical issues. In reality, this is typically not the case for large projects - requirements must be prioritised and compromises made. In our case, the most critical requirement is that of scale. A solution that does not, or could not, with a reasonable number of enhancements, scale to meet our needs in terms of data volume is simply ruled out. Clearly, a single physical database of 100PB is impossible (at least today), and so some form of distributed database would appear necessary. This already brings into more scalability issues - even if one permits physical databases to be as large as 100GB, already very large by today's standards, then one would need to manage one million such databases to reach 100PB. These databases should appear to the user as a single, logical database with transparent, cross-database references and a single (hopefully replicated) shared, and hence consistent, set of schema.
A close second to scalability is performance. Not only must the solution be sufficiently scalable, but it must provide sufficient performance. This can be expressed in terms of the raw data rate that must be supported - from 100MB/second for ATLAS and CMS to 1.5GB/second for ALICE - to the aggregate data rate needed to perform a typical physics analysis. Of course, absolute performance depends on many things, including the layers that reside below the database. Hence, we prefer to express performance as a percentage of the performance of the underlying systems.
The Need for a Fall-back Strategy
In order to maintain a maximum of vendor-independence, it is our strategy to adhere to the programming interfaces defined by the ODMG . In principle, an application that is built using for the example the ODMG C++ binding can be ported from one conforming implementation to another simply by a re-compile. There is even a data interchange format so that one can convert the data too. In reality, this is unlikely to be enough. The ODMG - deliberately - does not specify implementation, only interface. The various products that offer various degrees of conformance to the ODMG bindings have widely differing architectures. Apart from trivial example programs, production applications will almost certainly be widely influenced by the architecture of the product on which they are deployed. What is best for your application? Object server or page server? What level of locking granularity? How is object clustering implemented? These issues mean that a quick re-compile and conversion of the existing, say 25PB of data are far from enough should one need to change vendor. Of course, this is not something that should be undertaken lightly. However, it is a possibility that might occur - and the probability is certainly not reduced with time. Including an escrow clause into the contract is not necessarily the answer. Can you afford to take over the development and support of the product in-house?
Nevertheless, the need for a fall-back strategy seems clear - one cannot afford to cancel the whole programme simply because of a problem in one area. However, one should not imagine that there are necessarily two solutions that are equal in functionality or performance. The fall-back solution almost certainly implies reduced functionality, perhaps a more restrictive computing model, maybe even more time from data taking to the publication of the first paper.
As stated above, the best protection against such problems is a large, healthy market - in other words, adopt commodity solutions where-ever you can! Unfortunately, the demand for fully distributed, highly scalable ODBMSs is not yet at the level that they can be called commodity products. Nor will the HEP market even be big enough that it will be able to sustain such products by itself. However, some of the markets in which ODBMSs are currently successful, such as the telecoms market, can be assumed to be large enough to support at least one ODBMS product. Performance and distribution are also critical in this area, so the question is more can we exploit technology "designed" for (and supported by) this market?
Tests of Scalability
As mentioned above, our key requirement is that of scalability, particular in terms of data volume. The tests described below were performed with Objectivity/DB, and it is first necessary to give a brief overview of its architecture. Objectivity/DB supports distributed databases with shared, consistent, schema. Up to 216 physical databases form a federation in Objectivity's terminology. Currently, each of these databases maps to a normal file in the underlying filesystem. Hence, when using 32-bit filesystems they are limited to 2GB in size, whereas in 64-bit filesystems their size is, for all practical purposes, unlimited. Databases are, in turn, made up of containers, each of which is composed of database pages, on which the objects themselves are stored. Essentially, the 4 16-bit fields of Objectivity's 64-bit object identifier map to the database id, the container id, the logical page and the slot on that page. Our tests included hitting every documented limit in this architecture, although not simultaneously - and trying to find undocumented ones! These tests and others besides are documented in the many RD45 project documents, which can be accessed through the RD45 Web pages . In these tests, the largest federation that was created was a mere 0.5TB - here we were limited by the available disk space and not Objectivity/DB's architecture. We were, however, able to demonstrate physical databases well beyond the 2GB limit from 32-bit filesystems and show that a federation containing the maximum number of databases was indeed possible. By combining these two - which of course still has to be verified - distributed databases up to the PB range are possible. Unfortunately, our requirements go way beyond the PB range - we need to be able to store up to 100PB of data. To achieve such massive databases, we feel that the architecture should, on paper, scale to at least one order of magnitude beyond this requirement. This is to avoid arbitrary constraints, such as being required to fill every slot on every page, every container up to its limit and so forth. This leads to the two main requirements described below, namely the architectural changes to permit much larger federations, and the mass storage interface, so that the storage can actually be afforded!
The main enhancement requests that we have made can basically be split into three categories:
We will not discuss ODMG compliance here, expect to point out that, sadly, vendors have been somewhat slow in supporting the complete ODMG bindings. How many vendors offer training courses based on the ODMG bindings?
ODBMSs typically store their data in files, using the native filesystem on the host machine in question. In other words, some level of the containment hierarchy maps to a standard file, albeit with complex internal structure. Although 64-bit filesystems are becoming fairly widespread, we still feel that a maximum realistic filesize needs to be imposed. For example, if inactive data are stored offline on tape, the basic quantum that is moved to/from tape should a) be no larger than a single tape volume b) be migrated recalled within a "reasonable" period of time. 1000 seconds would be acceptable, 100 seconds preferable. Although there are other reasons for limiting the filesize, such rule of thumb arguments suggest that a maximum filesize of a few GB is reasonable today, perhaps 10GB in a year or so and maybe as much as 100GB in 2005. Using such a maximum filesize, one can show - on paper - that at least 2 ODBMS architectures permit distributed databases in the PB range. However, our requirement is for 100PB - ideally the architecture should permit much larger databases, to avoid any arbitrary constraints.
Basically, if one limits the maximum filesize, then there is only one way of permitting very large databases, and that is by increasing the number of files that can comprise a single logical database. A simple way that this can be done is by increasing the object identifier. However, this is not necessarily the optimal solution, as it has a direct impact on the storage overhead, which can be significant for small objects. Alternative solutions impact the logical - physical mapping. In some systems, this can be achieved by mapping one database to multiple files, often called volumes. In the case of Objectivity/DB, for example, this could be implemented by mapping containers, rather than databases to files. Essentially, this would now permit 232 files - even if these files are kept relatively "small", i.e. a few GB, it is still possible, architecturally, to reach the EB region. These leaves extended OIDs as an option for region the Zettabyte - 1021 bytes - region!
A detailed discussion of the integration of an ODBMS with a Mass Storage System is outside the scope of this article. Suffice it to say that the integration is done at the level of the filesystem - the database issues "standard" I/O calls and the underlying system faults files back in from tape if required, ensuring that there is enough free space to do so by moving inactive data to tape. This of course requires efficient clustering - the latency involved in such operations is significant and one wants to avoid reloading a file to retrieve just a few objects. Hence, objects that are likely to be accessed together must be stored together, or at least for the major access patterns.
An issue that will clearly be very important with the anticipated data volumes is that of data clustering. Already complex in its own right, the combination of an object database with a mass storage system adds further complications. The latency involved in bringing and object into memory from a tape-resident database is significantly larger than that of a disk resident or server-side cached object. Furthermore, whilst some architectures transfer individual objects from server to client, this will clearly not directly from tertiary storage - much larger data volumes must be moved to and from tape than a single object. A related issue is that of metadata. If much of the data is tape-resident most of the tape, maintaining consistent metadata - such as schema - across very large numbers of databases is essentially impossible. For example, if a single schema change required every database to be restored to disk, modified, and eventually flushed out to tape again, the system could never work in production. In our environment, we are helped by the nature of our data. The data that corresponds to individual events - the result of an interaction between the accelerated particles that is subsequently observed in the detector - is independent. There are no inter-event relationships. Furthermore, there is a natural hierarchy within individual events. The information read out of the detectors - the raw data, which forms the bulk of the total information - is processed essentially sequentially, and infrequently (e.g. once per year). Information that is derived from the raw data can also be divided into several categories, leading to small subsets that are used for physics analysis. It is also possible to divide events into different categories - by physics channel. Clearly, these data subsets will overlap, and it will not be possible to define a clustering strategy that is optimal for all types of access. However, we believe that it will be possible to improve substantially over simple time-based clustering, and that we will often, if not always, be able to make use of the random-access capabilities of the database to avoid duplicating data in multiple streams. Some classes of data will remain permanently disk-resident. These include "hot" physics channels, such as Higgs-event candidates, and metadata. Examples of metadata in our field include detector calibration information - these are required to correct for detector response and even determine if an event should be rejected or not, and event characteristics. The latter we call "event tags", and are expected to be the primary entry point into the data. Users will select collections of events based on high-level characteristics - which could be as simple as "Higgs candidates", or "high Pt selection", or based on properties such as the number of electron candidates etc.. Each event will be tagged by its properties, and users will be able to select subsets of data based upon these properties, define new tags and collections and so on. The tag objects will contain references to the main event data, and will so allow users to make use not only of the information that is "copied out" into the tag, but also the full information from the event - right down to the raw data. This is a major advantage of a global, database-oriented solution, over previous, file-based solutions that were used in High Energy Physics. In the latter case, "navigation" from the subsamples used for analysis to the full event information was an ad-hoc and time-consuming operation. Using the database approach, one is able to store not just the data, but also the selection criteria used to produce (logical) subsamples, as well as final results, in a consistent, managed fashion.
Although the main goal of our work is to find solutions that meet the requirements of the LHC experiments, which will only enter the production phase in 2005, an early demonstration of the feasibility of such an approach is clearly necessary. Our initial studies used legacy data - from LEP and other High Energy Physics experiments. However, it late 1995 we were contacted by a running CERN experiment, named CERES or NA45, which had just taken the decision to move to C++. This was felt to be an excellent opportunity for both sides - CERES were looking for help in converting to C++ and we were looking for people willing to test elements of the proposed strategy. During 1996, the CERES experiment reimplemented much of their software in C++, using an ODBMS for persistence. Although the total amount of data stored was small - a few tens of GB - they were able to show that such a system could be used for storing physics data, including a demonstration of writing concurrently from 32 processes into the database. This latter point is particularly important for the LHC, where widespread use of parallelism will be essential to handle the required data rates. Encouraged by this success, CERES plan to use the ODBMS-based solution for future data taking runs, storing some 30TB of data in 1999. A number of other experiments have also adopted the same solution, including the COMPASS experiment at CERN, which will store over 300TB of data per year from 2000 on at data rates of around 35MB/second, and experiments such as BaBar at SLAC in the US. If all proceeds according to plan, we can expect at least 1PB/year of High Energy Physics data to be stored in an ODBMS from 2000 on. Arguably, these is even more of a challenge than the 5PB/year expected at LHC, as it will not be possible to benefit from all the advances in technology that the LHC will be able to profit from.
Mixed Language Support and Java
Given the timescales of our project, perhaps the only thing that can be stated with certainty is that there will be change, and that we had better prepare for it. In the past, it was possible to choose a computing environment at the start of a project and stay with it throughout the entire lifecycle. This already broke down during the LEP - CERN's existing collider - era at CERN. During the construction phase of LEP, the bulk of the offline computing was done using Fortran on mainframes. Now, with perhaps 3 more years of data taking followed by a further 2 of analysis to go, mainframes are a thing of the past. Considerable changes have had to be made to adopt from centralised to distributed computing. More and more analysis is done in C or even C++ rather than Fortran. For LHC, with its longer timescale and the increased rate of change in the computing industry, even more changes can be expected. Although C++ is currently the programming language of choice for the LHC, more and more developments are being done in Java. On the database side, it is important to note that many products already offer multiple language bindings, Java being the most recent. To date, our experience with the Java binding has primarily been limited to its use for developing the user interface to database administration tools. Here, any performance arguments are hardly relevant, and the standard "write-once, run-anywhere" arguments prevail. We have yet to perform detailed cross-language tests, but it is clear that we need to understand to what extent "legacy" persistent C++ objects can be accessed by Java applications. Of course, this is a particularly easy case - there is a lot of similarity between the two languages and they coexist. How we can plan for the languages that will inevitably come in the next twenty-five years is a harder problem…
Our future directions can be stated in a single word - production! Although this work has already been used in production by the CERES experiment, and others, at CERN, the next 2 years will see a dramatic growth in the volume of data that will be stored in ODBMSs are CERN. We expect to remain at the order of a few TB during 1998, reaching tens, perhaps even one hundred TB in 1999, finally entering full swing in the year 2000. At this time, a single experiment at CERN, namely COMPASS, will be storing 1/3 - 1/2 PB per year. HEP experiments at other laboratories around the world are also expected to store similar amounts of data on the same time scale.
Summary and Conclusions
We have successfully deployed an off-the-shelf ODBMS in a production environment at CERN. Although a small amount of code has been written to facilitate the use of this product within our community, the man-power, and hence cost, savings that the adoption of this technology have made are impressive. A single "database" application at LEP - written in-house - is estimated to have taken 10 man-years to write. The equivalent application, based on an ODBMS, was implemented last year by a student. Using this technology, we anticipate being able to manage the vast volumes of data that will be generated at the LHC, using essentially the same amount of manpower as was required at LEP - which produced 3 orders of magnitude less data.