EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH
CERN/LHCC 97-6
LCB Status Report/RD45
7 February, 1997
This document has been produced for the March
1997 LCB review of the RD45 project. In this paper, we present
the status of the project, including a summary of the responses
to the milestones set at the 1996 review by the LCRB, suggestions
for future activities and a risk analysis of the current RD45
strategy.
In addition, we describe activities undertaken
within various experiments and projects, including NA45, ATLAS,
CMS, ALICE, BaBar, BELLE and Zeus.
This document is complemented by more detailed
reports covering the individual milestones.
RD45 documents may be obtained through the
Web (see http://wwwinfo.cern.ch/asd/cernlib/rd45/index.html) or
via e-mail request to the spokesman.
TABLE OF CONTENTS
1. Executive Summary
1.1 Summary of Activities During Second Year
1.2 Conclusions
2. Introduction
3. Overview of the First Year's Activities
4. Overview of Activities During the Second Year
5. Milestones Set at the March 1996 LCRB Review
6. Collaboration with ATLAS and CMS on Their CTPs
7. Provision of Persistence Service for GEANT-4
8. Milestone 1 - the Impact of Using an ODBMS
8.1 Object Model Issues
8.1.1 The ODMG Object Model
8.1.2 Differences Between the ODMG Object Model and the C++ Object Model
8.1.3 Impact on Existing Object Models and Modelling Guidelines
8.1.4 Conclusions
8.2 Issues Related to the ODMG and Objectivity/DB C++ Binding
8.2.1 Impact on Existing Code
8.2.2 Conclusions
8.3 The Use of an ODBMS with Third-Party Class Libraries
8.4 CASE Tools, Object Databases and Persistent Applications
8.4.1 Classify/DB
8.4.2 Rational/ROSE
8.4.3 StP
8.4.4 Conclusions
8.5 The Impact of an ODBMS on Object Granularity
8.6 End-User Issues
8.6.1 Access to the Database Run-time Environment
8.6.2 Access to Database Catalogue
8.6.3 Summary
8.7 Requested Enhancements
8.8 Conclusions from Milestone 1
9. Milestone 2 - Object Database Features
9.1 Schema Evolution
9.1.1 Areas of Potential Use in HEP
9.1.2 Prototype Investigations
9.1.3 Requested Enhancements for Schema Evolution
9.2 Object Versioning
9.2.1 Areas of Potential Use in HEP
9.2.2 Prototype Investigations
9.2.3 Requested Enhancements for Object Versioning
9.3 Data Replication
9.3.1 Areas of Potential Use in HEP
9.3.2 Prototype Investigations
9.3.3 Requested Enhancements for Data Replication
9.4 Conclusions from Milestone 2
10. Milestone 3 - Performance Comparison with PAW+Ntuples
10.1 Current Practice
10.2 ODBMS Capabilities
10.3 ODBMS versus Ntuples
10.4 Raw Performance Measurements
10.4.1 Read/Write Performance of Test-Bed System
10.5 Comparisons with PAW and Ntuples
10.5.1 Full Tag Comparisons
10.5.2 Reduced Tag
10.5.3 Queries Using Indices
10.6 The Effectiveness of Using an ODBMS
10.7 Conclusions
11. Risk Analysis
11.1 Support for Multiple Federations
11.2 Number of Databases per Federation
11.3 Number of Containers per DB, Size of Containers
11.4 Navigation Across Multiple Containers and Databases
11.5 Very Large Numbers of Associations
11.6 Very Large Collections
11.7 Re-clustering and the Effect on Existing Collections
11.8 Handling Multiple Containers and/or Databases
11.9 Database Administration Issues
11.10 Alternative ODBMS Products
11.11 Alternative Mass Storage Systems
11.12 Conclusions
12. Use of Objectivity/DB in HEP and Related Disciplines
13. Collaboration with Other Projects
13.1 ALICE
13.2 ATLAS
13.3 CMS
13.4 CERES/NA45
13.5 GEANT-4
13.6 AMY
13.7 BaBar
13.8 BELLE
13.9 ZEUS
14. Standards Activities
14.1 ODMG-related Activities
15. Objectivity/DB Workshops
16. Objectivity/DB User Meeting
17. ODBMS to MSS Coupling
17.1 Integrating an ODBMS with an MSS at the Filesystem Level
17.2 Integrating Objectivity/DB with HPSS
17.3 Conclusions
18. Other Database Developments
19. Future Activities
20. Proposed Milestones for 1997-1998
21. Conclusions
22. Glossary
23. References

Executive Summary
The RD45 project is investigating solutions to the problem of
providing persistency to physics data of the LHC experiments,
assumed to be in the form of (collections of) objects.
At the end of the first year, a potential solution, based on standards-conforming
products, was presented. Key elements of this solution, which
proposes the use of an Object Database Management System (ODBMS)
that conforms to the Object Database Management Group (ODMG) standards
[20], together with a Mass Storage System (MSS) that is built
according to the Reference Model for Mass Storage Systems developed
by the IEEE Computer Society (IEEE MSS), have already been used
in production for the storage and management of High Energy Physics
data for the NA45 experiment, as well as in the GEANT-4 (RD44)
project. Prototyping is also going on at other HEP laboratories
with the same or similar technology, including at DESY (Zeus),
KEK (Belle) and LBL (BaBar).
In this report, we summarise the activities of the RD45 project during the past year, including progress on the milestones set by the LCRB, experience with NA45 and other projects such as GEANT-4, together with proposals for future activities.
During the past year the RD45 collaboration has:
The proposed solution to object persistence for LHC event data, based upon a commercial ODBMS and MSS, has been accepted as the baseline solution for both ATLAS and CMS,
pending the final results of the RD45 investigations. The performance
and scalability of such a solution, together with the associated
risks, have been analysed. Whilst more work needs to be done,
particularly in the area of MSS integration and efficient data
access, we believe that this solution is still by far the most
promising of those considered, offering both the functionality
and the scalability that are required.
The RD45 project, which was approved in February 1995, is investigating
solutions to the problems related to providing persistent object
services for the LHC experiments. This includes, but is not limited
to, fully distributed heterogeneous architectures capable of scaling,
at least architecturally, to the multi-PB region. Various potential
solutions to this problem were investigated as part of the first
year's activities, including language extensions, object managers
and full-blown Object Databases (ODBMS). It was the conclusion
of the first year that only full ODBMSs provide sufficient functionality
to satisfy a preliminary list of HEP requirements and that
only a few of the currently available ODBMS products have an architecture
that is sufficiently scalable to meet our needs.
During the past year, the RD45 collaboration has focused on ODMG-compliant
solutions, and has demonstrated the use of a standard, off-the-shelf
ODBMS product for storing and managing HEP event data in a production
environment.
Despite the focus on ODBMSs, RD45 continues to follow progress
in other areas, such as persistent object managers, e.g. SHORE,
Object-Relational Databases, including object-oriented offerings
from the traditional relational (RDBMS) vendors and so forth.
RD45 continues to participate in the Object Database Management Group (ODMG) - the standards body that defines and maintains the various standards for ODBMSs, as well as the OMG, and the IEEE Computer Society Executive Committee on Mass Storage Systems (IEEE MSS EC).
During its first year, the RD45 collaboration investigated several
different approaches to solving the object persistency problem,
including language extensions, persistent object managers and
ODMG-compliant ODBMS products.
Using the definition of an ODBMS from the "Object-Oriented
Database Manifesto" [12], it was our conclusion that HEP
requires a system offering all of the facilities listed as mandatory
in this manifesto, all of the features listed as optional, and
indeed several others besides!
On the other hand, language extensions and the known persistent
object managers, whether HEP-specific or not (see Cattell [10]
for a list), impose major restrictions, such as a lack of support
for platform or language heterogeneity, no support for the full
C++ language model (e.g. no virtual functions), lack of ODMG compliance,
lack of scalability and so forth. In addition, it is the conclusion
of the RD45 collaboration that such systems would require considerably
more manpower to extend and maintain than an existing solution
that is already deployed to many hundreds of thousands of end-users.
RD45 was also able to identify an ODBMS product with an architecture
offering the required scalability, and this product has been used
for all of the prototypes built during the past two years, as
well as for the NA45 physics production run.
Although the choice of a system for the current prototyping is
clearly de-coupled from the eventual choice of a system for the
production phase of the LHC experiments, long term support issues
are extremely important and should not be under-estimated. The
lifetime of the LHC experiments will probably be some 10-15 years,
perhaps more, and thus we need guaranteed support until 2020/2025.
Existing object managers, such as SHORE, are research projects
which are unlikely to last more than a few years before a follow-on
project is launched - long-term support is not under consideration
by these groups. On the other hand, products such as commercial
ODBMSs are used to build production systems, such as telecoms
applications, for which long-term support is mandatory. Nevertheless,
the RD45 collaboration continues to monitor fully-featured ODBMSs,
the object extensions being added by the traditional RDBMS vendors,
and simpler approaches such as persistent object managers,
and places great emphasis on avoiding dependence on a single product.
The activities of 1995 can thus be summarised as follows:
Further details can be found in the RD45 status report for 1995,
CERN/LHCC 96-15 [1], also available via
During the past year, and in addition to the work on the milestones
and recommendations described below, the RD45 project has performed
an initial risk analysis of the current strategy, and established
a joint project with Digital, aimed at providing a test-bed where
detailed performance and scalability measurements can be made.
This test-bed has been used to make measurements directed at milestone
3 [6], described in section 10, but will also be used
to understand issues related to parallel filesystems, very large
memories, and so forth.
Numerous presentations of the project have been made both at CERN
and outside, including at external laboratories such as DESY and
KEK.
The number of full-time equivalents at CERN working on the project
has approximately doubled.
We have continued to work within the framework of the ODMG to
ensure that the future evolution of the ODMG standard satisfies
HEP requirements. Additional features that have been requested
as part of the V2.0 version of the ODMG standard include support
for distributed databases, read/write access to the database schema,
schema evolution and user data replication. In addition to the
standards-related activities within the ODMG, we have continued
to make our requirements available to the ODBMS vendors, and to
Objectivity in particular. The latter has been achieved through
regular workshops at CERN, and through the Objectivity user group.
Requested enhancements to Objectivity/DB include an extended Object
Identifier (OID), an interface to the High Performance Storage
System (HPSS) - a Mass Storage System - and support for Parallel
Query and Very Large Memories.
In addition to this status report, and three supporting documents
[4,5,6], each corresponding to one of the three milestones set
by the LCRB, we have produced a draft set of guidelines for Objectivity/DB
Database Administrators [9], available via
and two internal documents, which can be obtained upon request:
The RD45 project was reviewed by the LCRB in March 1996, and recommended
for continuation for a further year, with the following milestones
and comments:
The project has made excellent progress in identifying and
applying solutions for object persistence for HEP based on standards
and commercial products. The milestones set (as revised by the
LCRB in November 1995) have been met.
The LCRB agrees with the program of future work outlined in
the RD45 report (CERN/LHCC 96-15) [1] and sets the following
milestones for the second year of the project:
In addition, the project was asked to:
Detailed reports [4], [5], [6] describing the work on the LCRB milestones are available in printed form from the spokesman or via e-mail request to Heplib.Support@cern.ch.
Web versions of these documents can be found via the web address
listed below.
The RD45 status report and predictions concerning Object Databases and Mass Storage Systems ("Object Databases and Mass Storage Systems - The Prognosis", CERN/LHCC 96-16 [3]) have been used by both ATLAS and CMS in the preparation of their Computing Technical Proposals [24] [25], as have the reports of the two technology tracking teams in which RD45 is involved. In addition, we have participated in most of the regular meetings of these working groups, made numerous presentations and commented on the draft documents. It is expected that the work on the current and future RD45 milestones will be referenced in future updates of the CTPs and that outstanding questions from these working groups will strongly influence the future activities of the project.
We continue to work closely with the GEANT-4 (RD44) collaboration,
with whom regular meetings are held, to work on the persistent
aspects of GEANT-4. An Objectivity/DB course was arranged for
RD44 members earlier this year, and several sessions at Objectivity
workshops have been devoted to understanding the impact of using
an ODBMS for persistence in GEANT-4, the object model and performance.
We have provided technical assistance in introducing persistence
to the GEANT-4 "Hits" class. We have also investigated
the suitability of an Express/ODL converter, available from Micram,
the distributors of Objectivity/DB in Germany. As part of this
support activity, we have acquired sufficient run-time licenses
for Objectivity/DB for the users of the first prototype of GEANT-4.
Further details are given in section 13.5.
The work on this milestone, namely to "Identify and analyse
the impact of using an ODBMS for event data on the Object Model,
the physical organisation of the data, coding guidelines and the
use of third party class libraries", has been divided
into issues related to the following:
As the physical organisation of the data has a strong impact on
performance, the bulk of the work on this issue has been covered
in the context of milestone 3 [6]. The work on milestone 1 [4]
has been limited to high-level issues, such as object granularity.
Although this milestone is largely oriented towards developer
issues, we comment briefly on the impact on end-users of using
an ODBMS for production applications.
The work on this milestone is covered in more detail in [4].
In this section, we describe the main features of the ODMG Object Model and compare it with the C++ object model.
The ODMG object model defines persistence (for C++) to be by
inheritance. That is, for a class to be persistence-capable,
it must derive from the ODMG base class d_Object.
Instances of such classes may be either persistent, i.e.
stored in the database, or transient, i.e. deleted either
explicitly or automatically when they go out of scope. Transient
instances are in any case limited to the lifetime of the creating
process. Whether an instance is persistent or transient is decided
at object-creation time, through an overloaded new operator.
The model also includes fixed-length implementations of the basic types, such as int, float and double (e.g. d_Short, d_Long, d_Float, d_Double). These are required to provide support for heterogeneity, as implementations of the basic C++ types can vary from platform to platform. Persistent object references are provided through a type-safe smart pointer, d_Ref<T>. Associations, both uni- and bi-directional, are provided, as are container classes and various utility classes, such as date, timestamp and interval.
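The pattern of persistence by inheritance, with the persistent or transient decision taken at creation time, can be sketched in plain C++ without any database attached. This is only an illustration: `PersistentBase`, `Store` and `TrackHit` are hypothetical stand-ins (for d_Object, a database and a user class respectively), the "store" merely counts objects, and the real ODMG C++ binding differs in detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical stand-in for a database: it only counts objects placed in it.
struct Store {
    std::size_t objectCount = 0;
};

// Stand-in for the ODMG base class d_Object. Deriving from it makes a class
// persistence-capable; the overloaded operator new decides, per instance,
// whether the object is created as persistent or transient.
class PersistentBase {
public:
    virtual ~PersistentBase() = default;

    // Transient creation: ordinary heap allocation.
    static void* operator new(std::size_t size) { return ::operator new(size); }

    // "Persistent" creation: the extra argument names the target store.
    static void* operator new(std::size_t size, Store& store) {
        void* memory = ::operator new(size);
        ++store.objectCount;
        return memory;
    }

    static void operator delete(void* memory) { ::operator delete(memory); }

    // Matching placement delete, used only if a constructor throws.
    static void operator delete(void* memory, Store& store) {
        --store.objectCount;
        ::operator delete(memory);
    }
};

// A persistence-capable user class with fixed-width members, in the spirit
// of d_Long and d_Float (assumed here to be 32-bit quantities).
struct TrackHit : PersistentBase {
    std::int32_t detectorId = 0;
    float energy = 0.0f;
};
```

With these definitions, `new TrackHit` creates a transient instance, while `new (db) TrackHit` (for some `Store db`) records the instance in the store; the class definition itself is the same in both cases.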
The ODMG object model extends the standard C++ object model in
a number of respects:
It also imposes a number of minor constraints over the standard
model, namely:
A number of applications, which, for historical reasons were not
designed from the start with persistence in mind, have been ported
to an ODBMS without major problems. These include applications
from NA45 and GEANT-4, as well as the histogram classes being
developed in the context of LHC++.
The following guidelines are recommended for creating persistent
object models:
Essentially, these guidelines may be condensed into a single rule,
namely:
The ODMG object model can be considered to extend the C++ object model in a very natural way. It has the advantage of being language-independent, and provides additional (required) functionality, such as associations, which would otherwise have to be implemented by 3rd-party class libraries. Implementing persistence by inheriting from a special base class does not pose any major problem for the definition of an object model typical of HEP event data.
There are a number of code changes that need to be made to transient
applications to make them persistent using an ODMG-compliant ODBMS.
By far the most significant of these relates to the use of C++
pointers, which must be avoided in the case of persistent classes.
References to persistent-capable objects must be made using the
ODMG-defined d_Ref<T> smart pointer. In most cases, however,
it is sufficient to change only the type definitions of the pointers
concerned - the user code remains largely unaffected. Pending
the support for standard C++ (STL) containers, changes also need
to be made to switch from the transient, typically Rogue Wave,
containers, to those provided by the ODBMS. In addition, one must
also design and implement appropriate clustering and locking strategies
and handle the database session.
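The claim that only the pointer type definitions need to change can be illustrated with a small sketch. `Ref<T>` below is a hypothetical, much simplified stand-in for d_Ref<T> (it wraps a raw pointer and does no database access), and `Cluster` and `totalEnergy` are invented for the example.

```cpp
#include <vector>

// Hypothetical stand-in for the ODMG smart pointer d_Ref<T>.
template <class T>
class Ref {
public:
    Ref() = default;
    Ref(T* target) : target_(target) {}
    T* operator->() const { return target_; }
    T& operator*() const { return *target_; }
private:
    T* target_ = nullptr;
};

struct Cluster {
    double energy = 0.0;
};

// The two type definitions between which a port would switch.
using TransientClusterPtr = Cluster*;
using PersistentClusterRef = Ref<Cluster>;

// User code written against "something that behaves like a pointer"
// is the same for both handle types.
template <class Handle>
double totalEnergy(const std::vector<Handle>& clusters) {
    double sum = 0.0;
    for (const Handle& cluster : clusters) {
        sum += cluster->energy;
    }
    return sum;
}
```

Because both handle types support `->`, the body of `totalEnergy` does not change when the handle type is switched, which is the sense in which user code remains largely unaffected.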
A prototype of a small package to both simplify the porting of new applications, and also to insulate applications from vendor-specific features, is in use within the RD45 collaboration, and a production version of this software will eventually be made available to the HEP community, as part of the LHC++ framework that is currently being built up.
The implementation of persistence provided by an ODMG-compliant ODBMS is a very natural extension of the normal heap allocation performed by C++. The impact of introducing an ODBMS to existing C++ applications is very small compared to traditional I/O systems, which require explicit I/O calls to be coded. A small set of design rules, described in detail in the supporting document for milestone 1, are sufficient to port existing applications or to design new ones that will use an ODBMS for persistence. The provision of a small layer of middle-ware allows a high-level interface to the database to be developed, isolating the application from vendor-specific details whilst also simplifying locking and clustering strategies.
In most cases, there is no incompatibility between 3rd
party class libraries, such as graphics and GUI libraries, and
an ODBMS. We have built a number of prototypes that use libraries
such as OpenInventor, or work in frameworks such as IRIS Explorer,
where the introduction of a database has been transparent.
The exception to this rule is that of libraries such as RogueWave's
Tools.h++, the de-facto standard for collection classes, and the
forthcoming standard C++ library - in other words, collection
and container classes.
Today, persistent versions of Tools.h++ are provided for a number
of databases, including Objectivity/DB. However, given the emergence
of the Standard Template Library (STL), adopted into the draft
C++ library, albeit with a number of changes, the long-term strategy
should clearly be to use this library, rather than Tools.h++.
V1.2 of the ODMG standard has made some initial steps in migrating
towards full STL compliance, and V2.0 will introduce significant
enhancements in this respect.
Until STL-support is provided by Objectivity/DB, we have provided
two container classes to assist in the migration of applications
from transient to persistent.
We have requested that Objectivity support the ODMG-defined STL subset in the next point release of the product, expected in mid-'97.
Classify/DB is a product of Micram Technology GmbH, the distributors
of Objectivity/DB in Germany. It is the only CASE product designed
explicitly to work with Objectivity/DB - or indeed any ODBMS -
and was the first product to support any of the ODMG bindings.
Classify/DB is based upon the OMT notation, and is capable of
generating the ODMG Object Definition Language (ODL), as well
as the DDL used by Objectivity/DB. It also supports reverse
engineering, and can handle a variety of other formats in addition
to those mentioned above, including Step/EXPRESS.
The fact that Classify/DB uses an Objectivity database to store the model information is a strong plus, and it demonstrates that CASE tools capable of generating ODBMS schema can indeed be produced. However, it is our opinion that the same tool should be used for both persistent and transient applications, and that Micram do not have the resources to compete with larger companies such as Rational.
ROSE is a CASE tool produced by Rational, a company that currently employs many of the leading authorities on object-oriented analysis and design. It is the tool used by the CMS and GEANT-4 collaborations. Although ROSE does not directly support the generation of ODMG ODL, we are aware of several customisations of the product that do enable ODL to be produced indirectly. Two of these customisations are available commercially - through the distributors of Objectivity/DB in Japan and through the distributors in Italy. We have looked at both of these customisations, and both appear somewhat baroque for something that should be relatively straightforward. Indeed, some preliminary studies within CMS suggest that the necessary customisation of the ROSE output filter to produce ODL would be simple to perform, although it would clearly be desirable to have support from the product directly.
StP is the CASE tool used by the ATLAS collaboration. StP includes support for requirements definition, object modeling, information modeling, structured development and testing, with code generation support for C++, Smalltalk, OMG IDL, Forte TOOL, Ada, C and SQL for the main relational database management systems (RDBMS). StP stores model information in Sybase (an RDBMS) and, like ROSE and other CASE tools, can be customised. Given that the ODMG's ODL is a superset of the OMG's IDL, which in turn is based upon C++ syntax, it should be relatively easy to customise StP to produce ODL. This step, however, has not yet been taken.
With the exception of Classify/DB, there is no CASE tool that currently supports the generation of ODL directly. Experience from NA45 and GEANT-4 suggests that this is not a major impediment to producing good persistent object models, although, as market penetration of ODBMS products increases, we would expect to see ODL generation directly supported in future releases of the major tools. As the existing output filters have demonstrated, there is no conceptual reason why this should be impossible, or even difficult. It is our recommendation that direct support for ODL generation and reverse engineering be raised as a requirement with the appropriate vendors.
Although the ODMG standard permits implementations where the persistent
base class, d_Object, is a dummy, existing implementations
typically incur a fixed overhead. In the case of Objectivity/DB,
this overhead is currently 14 bytes. In other words, an object
that contained a single float would increase in size by
the equivalent of 3.5 additional floats as a result of
becoming persistence-capable. In addition, associations involve
a storage overhead. As a rule of thumb, objects that are less
than about 10 words in size should not be made persistent directly
- smaller objects can be made persistent by containment in a persistent
object.
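The arithmetic behind this rule of thumb can be made explicit. The sketch below assumes a 4-byte float and a 4-byte word, and it assumes that a containing object adds nothing beyond the same fixed 14-byte overhead; neither assumption is stated in the text, and association overheads are ignored.

```cpp
#include <cassert>
#include <cstddef>

// Fixed per-object overhead quoted above for Objectivity/DB.
constexpr std::size_t kOverheadBytes = 14;

// Ratio of stored size to payload size when each object is made
// persistent individually.
double individualExpansion(std::size_t payloadBytes) {
    assert(payloadBytes > 0);
    return static_cast<double>(payloadBytes + kOverheadBytes) /
           static_cast<double>(payloadBytes);
}

// The same ratio when `count` small objects are contained in a single
// persistent object, so that the fixed overhead is paid only once.
// Assumption: the container adds no bookkeeping of its own.
double containedExpansion(std::size_t payloadBytes, std::size_t count) {
    assert(payloadBytes > 0 && count > 0);
    const double total =
        static_cast<double>(payloadBytes) * static_cast<double>(count);
    return (total + static_cast<double>(kOverheadBytes)) / total;
}
```

Under these assumptions a single persistent float occupies 18 bytes, i.e. 4.5 times its payload (the original float plus the equivalent of 3.5 more), a 10-word object grows by 35%, and a thousand contained floats grow by well under 1%.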
Reasons for choosing separate, rather than contained, objects
include database support for:
Both of these scenarios, i.e. individual objects and containers of small objects, have been tested using the GEANT-4 "hits" object, as described below.
The use of an ODBMS for storing and managing physics data clearly has implications for end-users. We describe here the main issues related to running applications that use an ODBMS for persistence. Those related to using an ODBMS as input to the analysis stage are covered further under milestone 3.
It is clear that, no matter what system is used for object persistency, access to the run-time environment is required. In the case of Objectivity/DB, access to a single library, available in static, shareable and debug versions, is required. As mentioned earlier in this report, in static form, this library is approximately 40% of the size of the "PACKLIB" component of the CERN Program Library, and slightly less than twice the size of the "KERNLIB" component. The Objectivity/DB server is more than 5 times smaller than the existing ZEBRA-server, but is in any case not required to access data on local or NFS-served disks.
In addition to the database run-time library, persistent applications need to access the database catalogue, which contains the location of the various physical databases that make up a given federated database, plus also the schema, i.e. class definitions, of the objects that are stored in the database. The federated database catalogue is automatically replicated by the database system, so that it is not necessary to access a single, central server. Subsets of the catalogue can also be extracted and stored on mobile computers, e.g. lap-tops, although clearly these catalogue subsets can only be automatically updated with new schema and database information when the host on which they reside is connected to the network.
The ODBMS that is currently being used for prototyping within the RD45 collaboration, namely Objectivity/DB, imposes minimal and acceptable restrictions on the run-time environment. Without, for example, embedding the database schema and/or catalogue location into persistent applications, which are clearly highly undesirable strategies, it would not be possible to reduce further these restrictions.
Based upon the work for this milestone, the following enhancements
to Objectivity/DB have been requested:
The use of an ODBMS to provide object persistence for HEP applications
implies minimal changes to existing applications, and these changes
can be further reduced by the provision of a small layer of software.
A prototype version of such a layer is under development by the
RD45 collaboration, and will eventually be made available through
the LHC++ framework.
With the exception that it is clearly necessary to have consistent
object models, both transient and persistent, an ODBMS imposes
no restrictions on the object model. The impact on physical data
organisation is limited to performance - optimal data clustering
will reduce redundant I/Os and result in improved performance.
Similarly, the storage overhead imposed by ODBMSs means that very
small objects should be avoided. However, this overhead is smaller
than for existing, Fortran-based systems and is hence not a new
constraint.
With the exception of class libraries providing collections or
containers, 3rd-party class libraries can be used freely
with applications that use an ODBMS for persistence. In the case
of collection/container libraries, changes must be made to avoid
storing raw C++ pointers in such collections. However, ODBMS-capable
versions of the principal class libraries involved are available
and the ODMG is following the C++ standard in this respect.
We have evaluated the support in existing products for schema
evolution, object versioning and data replication and analysed
the usefulness of these features in solving data management problems
typical of HEP event data. It is important to point out that none
of these features are currently defined by the ODMG standard.
Schema evolution and object versioning, including support for
configurations, have both been on the list of future enhancements
since V1.2 was finalised, and it has been requested that replication
be added to the list for post-V2.0 developments.
As these features are not yet standardised, any reliance on these
capabilities currently implies the use of vendor-specific enhancements.
We have, therefore, chosen to compare the implementation of these
features in at least two products.
The work on this milestone is covered in more detail in [5].
Schema evolution refers to the process of changing the
definition of a class - its schema - and typically also to the
ability to migrate objects created using previous versions of
the schema to the new representation. This latter capability is
known as object (instance) migration.
Schema evolution operations vary from simple changes, such
as adding or renaming a data member in a class definition, to
complex operations like adding a non-leaf base class or changing
the class of origin of a data member. Schema evolution
operations may require modifications to the affected objects.
There are several ways, which can be used in combination, that
the affected objects can be modified:
In the case of deferred mode conversion, no special steps need
to be taken by the user. The first access to the objects in question
will trigger the conversion, although only update transactions
will result in these changes being stored persistently in the
database.
Examples of immediate and on-demand mode conversion are shown
below.
Immediate mode conversion:

    ooTrans trans ;
    ooHandle(ooFDObj) fdH ;
    trans.upgrade() ;
    trans.start() ;
    fdH.open("TstFB", oocUpdate) ;
    fdH.upgradeObjects() ;
    trans.commit() ;

On-demand mode conversion:

    ooTrans trans ;
    // declare fdH, dbH, contH
    trans.start() ;
    fdH.open("TstFD", oocUpdate) ;
    dbH.open(fdH, "tstDB", oocUpdate) ;
    contH.open(dbH, "tstC", oocUpdate) ;
    contH.convertObjects() ;
    // or dbH.convertObjects() ;
    trans.commit() ;
Additionally, for every changed class a conversion function
can be registered, extending the standard object conversion according
to user requirements.
Finally, schema evolution operations may require that applications are rebuilt in order that the changed objects can be seen.
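The difference between deferred and on-demand conversion, and the role of a registered conversion function, can be illustrated with a toy model. This is not the Objectivity/DB API: `StoredObject`, `Container`, `read` and `storedVersion` are hypothetical names, and objects are modelled simply as a schema version number plus a map of named fields.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stored record: a schema version plus named numeric fields.
struct StoredObject {
    int schemaVersion = 1;
    std::map<std::string, double> fields;
};

class Container {
public:
    using Converter = std::function<void(StoredObject&)>;

    explicit Container(int currentVersion) : currentVersion_(currentVersion) {}

    void add(const StoredObject& object) { objects_.push_back(object); }

    // Optional user hook that extends the standard conversion.
    void registerConverter(Converter converter) {
        converter_ = std::move(converter);
    }

    // Deferred mode: the first access converts the object. The stored copy
    // is rewritten only if the access is part of an update transaction.
    StoredObject read(std::size_t index, bool updateTransaction) {
        StoredObject view = objects_.at(index);
        if (view.schemaVersion != currentVersion_) {
            convert(view);
            if (updateTransaction) objects_[index] = view;
        }
        return view;
    }

    // On-demand mode: convert every outdated object in this container now.
    std::size_t convertObjects() {
        std::size_t converted = 0;
        for (StoredObject& object : objects_) {
            if (object.schemaVersion != currentVersion_) {
                convert(object);
                ++converted;
            }
        }
        return converted;
    }

    int storedVersion(std::size_t index) const {
        return objects_.at(index).schemaVersion;
    }

private:
    void convert(StoredObject& object) const {
        object.schemaVersion = currentVersion_;  // standard conversion
        if (converter_) converter_(object);      // user-registered extension
    }

    int currentVersion_;
    Converter converter_;
    std::vector<StoredObject> objects_;
};
```

In this model a read-only access sees the converted object but leaves the stored copy at the old schema version, whereas an update access or an explicit `convertObjects()` call makes the change persistent.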
Support for schema evolution is typically not provided by existing
HEP data management packages, but is clearly required. Schema
are subject to change throughout the lifetime of an experiment,
and a flexible mechanism that permits changes to be made not only
to the schema, but also to the affected objects, must be provided.
Given the volumes of data involved, flexibility in object instance
migration is also mandatory - it would be inconceivable to migrate
the entire event store synchronously each time a schema change
was made.
We consider support for schema evolution to be a mandatory requirement that is not tied to a specific area - it must be supported across the full range of applications, ranging from production to end-user.
A number of prototypes have been built to help understand the
impact of schema evolution on persistently stored objects,
as well as on existing applications. A major goal in this work
was to identify scenarios that permitted existing applications
to continue to access data without having to be rebuilt. Although,
by adhering to the guidelines listed below, this can sometimes
be achieved, it is not always possible.
To minimise the impact on existing applications, one should consider
the following:
To perform object conversion effectively, especially when large databases are involved, one should use a combination of deferred and on-demand conversions: deferred mode can be applied to convert objects as they are accessed, until it is convenient to finish converting all objects within a part of a database, a file, or the entire database using the on-demand mode. Immediate conversion is efficient when only a small subset of the data needs to be converted.
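The combination of deferred and on-demand conversion can be illustrated with a small stand-alone sketch. It deliberately does not use the Objectivity/DB API: the `StoredObject` and `Store` types, the `schemaVersion` field and the trivial `convert` step are hypothetical stand-ins, chosen only to show that objects touched by applications are converted as they are accessed, and that a later sweep then only has to deal with the remainder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a persistent object; not an Objectivity/DB type.
struct StoredObject {
    int schemaVersion;  // schema version the object was last written with
    double value;
};

// Hypothetical stand-in for a container of persistent objects.
class Store {
public:
    Store(std::size_t count, int currentVersion)
        : objects_(count), current_(currentVersion) {
        assert(currentVersion > 1);
        for (std::size_t i = 0; i < objects_.size(); ++i) {
            objects_[i].schemaVersion = 1;  // everything starts in the old shape
            objects_[i].value = 0.0;
        }
    }

    // Deferred mode: an object is converted only when it is accessed.
    StoredObject& access(std::size_t i) {
        convert(objects_.at(i));
        return objects_[i];
    }

    // On-demand mode: sweep the store and convert whatever is still old.
    // Returns the number of objects converted by this call.
    std::size_t convertAll() {
        std::size_t converted = 0;
        for (std::size_t i = 0; i < objects_.size(); ++i)
            if (convert(objects_[i])) ++converted;
        return converted;
    }

    // Number of objects still stored in an old shape.
    std::size_t pending() const {
        std::size_t n = 0;
        for (std::size_t i = 0; i < objects_.size(); ++i)
            if (objects_[i].schemaVersion != current_) ++n;
        return n;
    }

private:
    // A real conversion would also remap attributes; here only the
    // version stamp changes.
    bool convert(StoredObject& o) {
        if (o.schemaVersion == current_) return false;
        o.schemaVersion = current_;
        return true;
    }

    std::vector<StoredObject> objects_;
    int current_;
};
```

In this sketch, repeated access to the same object converts it only once, so the cost of the final sweep shrinks as applications touch more of the data.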
Based upon our experience with schema evolution support
in the current release of Objectivity/DB, a number of enhancement
requests have been made. These include:
Object versioning is the capability to manage more than
one version of the same logical entity. This implies objects created
using the same schema and not different versions of the
same schema. Support for versions is often very similar to that
offered by code management systems, e.g. CVS, for revision control,
including both branch and linear versioning. In
ODBMS systems, each version of an object is stored separately,
although typically using the same clustering strategy. It is possible
to define a default version, in which case one object per
"genealogy" (set of all versions of an object)
is marked as such and returned unless a specific version is requested.
Versioning features include the ability to access all versions
as a single object, as well as to merge multiple versions into
one. Navigation from a given version in a genealogy to
any other version is trivial. Versioning can be customised, simply
by inheriting from the class that implements versioning, and customising
as appropriate.
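The notions of genealogy, default version, and linear versus branch versioning can be made concrete with a short sketch. The `Version` and `Genealogy` types below are hypothetical and are not the Objectivity/DB versioning classes; they only model the behaviour described above, namely that each version is stored separately, that one version per genealogy is the default, and that the default is returned unless a specific version is requested.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the Objectivity/DB versioning classes.
struct Version {
    std::string payload;
    int parent;  // index of the version this one was derived from, -1 for the first
};

class Genealogy {
public:
    explicit Genealogy(const std::string& first) : default_(0) {
        Version v = {first, -1};
        versions_.push_back(v);
    }

    // Derive a new version from an existing one. Deriving twice from the
    // same parent gives a branch; deriving from the newest gives a linear chain.
    int derive(int parent, const std::string& payload) {
        assert(valid(parent));
        Version v = {payload, parent};
        versions_.push_back(v);
        return static_cast<int>(versions_.size()) - 1;
    }

    void setDefault(int v) {
        assert(valid(v));
        default_ = v;
    }

    // Returns the default version unless a specific one is requested.
    const Version& get(int v = -1) const {
        return versions_.at(static_cast<std::size_t>(v < 0 ? default_ : v));
    }

    std::size_t size() const { return versions_.size(); }

private:
    bool valid(int v) const {
        return v >= 0 && static_cast<std::size_t>(v) < versions_.size();
    }

    std::vector<Version> versions_;
    int default_;
};
```

Keeping the parent index with each version is what makes navigation within a genealogy straightforward in this model.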
There are numerous areas of HEP data management to which the concept of versions could be applied. These include:
Although all of these areas merit further investigation, manpower constraints have limited us to an initial investigation of user-level versioning, i.e. management of selections in an analysis environment.
A prototype application has been built to investigate the usefulness of versioning features for managing event selections. The prototype supports versioning of both selections and their associated cuts, which are versioned separately. The user must first specify a set of cuts (predicate), including the names and types of the individual cuts, and is then provided with a powerful and convenient interface for managing an essentially unlimited number of versions, including retrieval of the full history of the cuts, the possibility to name selections and/or predicates, the ability to change the default version of a selection or predicate and to set the values to be used in the individual cuts. Cuts may be combined using logical AND or OR. Event collections built using such selections may have associations between them and the selections and cuts used to build the collection - an important feature that helps ensure reproducibility.
The following enhancements to the support for object versioning
in Objectivity/DB have been requested:
Replication refers to the case when more than one "copy" of an object or set of objects is maintained by the system, typically in different locations, and is a technique that is important for both reliability and performance. Users may continue to access data from a local image ("copy") of a database, even if some parts of the network are down. Certain implementations provide a "voting" mechanism, which guarantees that data integrity is maintained - only the partition with the majority of votes may continue to modify data in a replicated database. Users in those partitions which have a minority of votes may still read data from local images, or may choose to wait for the connection to be restored.
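The voting rule can be stated in a few lines. The sketch below is a hypothetical model, not the Objectivity/DB implementation; it assumes that a "majority" means a strict majority of all votes, which is the condition under which at most one partition can ever be allowed to write.

```cpp
#include <cassert>
#include <cstddef>

// Access rights of a network partition in a replicated database.
// Hypothetical model for illustration only.
enum class Access { ReadOnly, ReadWrite };

// A partition may modify data only if it can reach a strict majority
// of the votes; otherwise it may still read from its local images.
inline Access partitionAccess(std::size_t votesReachable, std::size_t votesTotal) {
    assert(votesReachable <= votesTotal);
    return (2 * votesReachable > votesTotal) ? Access::ReadWrite
                                             : Access::ReadOnly;
}
```

With an even number of votes split equally, neither side holds a strict majority in this model, so both fall back to read-only access until the connection is restored.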
Replication is potentially useful in HEP for a number of areas:
Local area replication can be achieved by other means, e.g. by
mirroring disks or using RAID systems. However, database replication
is more flexible in that it permits data to be replicated to different
physical servers, which may even be distributed in the wide area.
It may nevertheless be combined with e.g. disk mirroring to give
even better performance and/or resilience to hardware failure.
Data distribution and collection has traditionally been performed using magnetic tape. Although the limits of affordable network bandwidth may require that at least a portion of the data continues to be distributed using tape or other media, considerable savings in the area of book-keeping and general data management can be made if this distribution is performed under the control of the database.
Data replication is a relatively new feature in ODBMSs and is
still far from mature. Objectivity/DB offers data replication
as from V4.0 of the product, which was only released in early
1997. We have participated in the field-test of this product,
for NT systems only, since mid-96, and were able to make some
preliminary tests of its functionality. These tests show that
replication is indeed transparent to the user application and
that the automatic fail-over from one "image" to another
works. However, we were unable to perform more extensive tests,
such as evaluation of wide area replication, including over both
relatively slow and unreliable connections.
Nevertheless, we intend to pursue this area actively, and will start wide area tests as soon as possible. Tests are planned between CERN, KEK, Krakow and LBL.
Based upon our early evaluation of the replication support in
Objectivity/DB, a number of enhancement requests have been raised.
These are as follows:
All of the features described above are clearly important techniques
for solving HEP data management problems. Their implementation
in ODBMS products is relatively recent, and it is clear that a
number of important enhancements need to be made to existing systems
in these areas.
With the possible exception of object versioning, which can be
managed at the application level, it is our opinion that these
features should form part of the list of mandatory requirements
that must be satisfied by a HEP persistent object manager. Although
versioning could be handled by the application, it would clearly
be an advantage if this too were directly supported by the system.
The work on this milestone has been divided into two parts: strict
performance measurements, using both PAW+Ntuples and the raw performance
of the underlying storage systems for reference, and an evaluation
of the effectiveness of using an ODBMS+MSS as input to
physics analysis.
As no appropriate MSS has been available at CERN for these tests,
we report below on the use of an ODBMS with secondary (disk) storage
only, although we have analysed the impact of using different
storage strategies, including striping and parallel filesystems
for performance.
The performance and effectiveness evaluations described below
have been performed using NA45 data - both a standard NA45 Ntuple
and the corresponding data stored in an ODBMS.
An Objectivity performance expert is scheduled to come to CERN for two weeks in March 1997, during which time a workshop, focussing on performance and availability will be held. We intend to use the results of this workshop to finalise the supporting document [6], which will be completed around the end of March. We report below on the results that were available at the time that this document was submitted to the LCB - more recent results will be presented to the LCB open session in March.
The current Physics Analysis Workstation system (PAW) [23] requires
that the input data be converted to a special format, namely HBOOK
[21] Ntuples. Two types of Ntuple are supported - the original
"row-wise" Ntuple, which consisted of a table of single-precision
floating point numbers, and the more recent "column-wise"
Ntuple (CWN). CWNs support all Fortran data types, and allow variable
length blocks to be used. In these performance comparisons, we
have tested both column and row-wise Ntuples, and also the approximate
equivalent using an ODBMS, namely storing each row as a separate
tag object (RWN), or each attribute of a tag object separately
(CWN).
In principle, one of the main performance advantages of CWNs over
RWNs is that only those columns that are referenced by a given
query are read in, offering corresponding performance improvements
when only a few columns are needed. Studies have shown that many
queries only use a small fraction - say 20% - of the columns present
in a given Ntuple, and hence significant gains are to be expected
from using such a strategy. However, this may merely reflect one
of the known weaknesses of the current Ntuples. Creating an Ntuple
is typically a lengthy process, requiring an ad-hoc batch job
which processes a large subset of the data. If it is discovered
that more information than is present in the Ntuple is required,
or if one or more columns needs to be recalculated, then this
lengthy process must be repeated. Hence, the observation that
only 20% of the columns are referenced in typical queries may
simply indicate that users are trying to minimise the
number of times that the Ntuple must be recreated, and store extra
information "just in case".
In addition, the Ntuple stores both the information that is used
for queries and the information that will be analysed -
in principle, the selection of events can be based upon a small
subset of the event characteristics and should not force a common
clustering strategy for the data used for selection and the data
that is to be e.g. histogrammed. Using an Ntuple from NA45 for
comparison, we examine below the benefits of separating the data
used for queries from that needed for analysis.
It is our opinion that the analysis framework should not impose a particular data model or format, and that converting data to such a format is a major inconvenience which should be avoided in future systems. This is particularly important given the volume of data involved in an LHC experiment - redundant copies must be avoided at all costs.
In principle, an ODMG-compliant ODBMS supports the full C++ object model. Whilst this is essentially true, there are a number of important considerations that need to be borne in mind, if an efficient physical model is to be implemented. That having been said, any C++ object model can be implemented using an ODBMS, with the proviso that associations between objects are implemented using ODMG smart-pointer classes. Thus, the logical object model is unconstrained, whilst, for performance reasons, some basic guidelines, such as those outlined in section 8.5 on page 12, should be followed.
Unlike HBOOK CWNs, ODBMSs support significantly more general and/or
complex data models. Although some minor constraints are likely
to be imposed by performance considerations, such as avoiding
the use of very small objects (less than 10 words or so), an ODBMS
provides access to all of the data of an experiment, and
not just that subset that has been extracted into a format dictated
by the analysis tool.
Although it would theoretically be possible to encode enough
navigational information into a CWN to permit an application to
reference the complete event data from such an Ntuple, this is
not supported by the current analysis tools, such as today's de-facto
standard, namely PAW. On the other hand, this is directly
supported by an ODBMS - transparent navigation from one element
of the data, e.g. the event tag, to another, e.g. the raw data
or "analysis objects", is provided by the ODBMS software
itself.
For performance reasons, efficient data clustering is always likely to be important. However, it would be perfectly feasible to recluster a small subsample of the data - sufficient to develop the necessary cuts etc., and then run a "production analysis" on the full, unclustered, dataset. This is an inherently more scalable solution than one that forces all data to be converted into a special format, i.e. copied, which becomes unworkable when very large volumes of data, such as those expected at the LHC, are involved.
Ideally, the performance overhead introduced by the ODBMS software should be less than a few per cent. In other words, one should be able to read and write data at approximately the speed of the underlying storage system, although this will typically be an upper limit for best-case scenarios. As is the case with all existing ODBMS products, Objectivity/DB uses the standard filesystem in which to store databases - each database appears to the operating system as a normal file. This means that standard techniques, such as parallel filesystems, file caching etc., should translate directly into improved database performance. The maximum throughput obtainable would thus depend only on the hardware resources made available.
Read/write performance of up to 100MB/second has been measured
on a Digital Alpha 4100 server. These figures exceed the initial
goal of 90MB/second. To achieve these results, the following configuration
was used:
Below, we describe performance comparisons between PAW and Ntuples
and simple TagDB implementations. The approach has been to perform
one-to-one comparisons between Ntuples and TagDB implementations
using Objectivity/DB. To this end, we have used a standard NA45/CERES
Ntuple from their 1995 production data. The same analysis has
been performed using both PAW+Ntuples and Objectivity/DB, under
a variety of different cache conditions. The benchmark environment
used in both cases was identical, using the following:
In all cases, we have measured both the first-pass ("cold")
and second-pass ("hot") cases. At the time of
writing, we have more confidence in the hot measurements - issues
such as the filesystem cache can strongly influence these performance
measurements and, short of rebooting a CS-2 node between each
measurement, it is hard to be certain that no caching is taking
place for the first-pass measurements.
The NA45 Ntuple used contains 302 columns (all floats) and some 21K rows, giving a total size of around 25MB. This is seen to be a somewhat typical size for Ntuples today, although they are often combined into larger logical units using the PAW chaining facility.
The main time during the PAW-based analysis is spent in a single
command, namely:
ntuple/loop [ntuple] ana.f
ana.f is a single, compiled, Fortran function that performs
all selection cuts and histogramming. In the current analysis, some
15 columns are typically used to make selections, although this
is expected to rise so that eventually 80 columns are used. Both
the original Fortran version of this function and the equivalent
C++ code are reproduced in an appendix of the milestone 3 document
[6].
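The figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes 4-byte single-precision floats and takes "some 21K rows" as exactly 21,000, which is an approximation; `ntupleBytes` is a hypothetical helper introduced only for this check, and it ignores any per-block or header overhead in the Ntuple file.

```cpp
#include <cassert>
#include <cstdint>

// Payload size in bytes of a table of single-precision floats
// (4 bytes per value), ignoring any file-format overhead.
inline std::uint64_t ntupleBytes(std::uint64_t columns, std::uint64_t rows) {
    assert(columns > 0 && rows > 0);
    return columns * rows * 4u;
}
```

Under these assumptions, `ntupleBytes(302, 21000)` gives 25,368,000 bytes, consistent with the quoted total of around 25MB. The 15 columns currently used for selection account for 1,260,000 bytes, or about 5% of the total, and 80 columns would account for 6,720,000 bytes, or roughly a quarter. This is the best-case saving that an ideal column-wise read of only the referenced columns could offer.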
In the case of the TagDB, the selection code is as follows:
tagItr.scan(tagCont, oocRead);
Timer t("simple scan");
long total = 0, matched = 0;
t.Start();
while (tagItr.next()) {
    total++;
    // the Match function accesses the attributes used for histogramming
    if (tagItr->Match())
        matched++;
}
t.Stop();
In the tables below, the time shown is that spent on performing the event selection, including the time to access the attributes used in the selection and those attributes used for the histogramming. The time spent in filling and displaying histograms has not been measured, as this is independent of the database.
The first tests were based upon an implementation whereby each
row in the NA45 Ntuple was converted to a separate "tag"
object, i.e. an object with one attribute corresponding to each
column in the Ntuple. This means that no traversals are required
to access additional objects to perform the query or to fill the
histograms. All the tags were stored in a single container in
an Objectivity/DB database, and clustered according to insertion
time. A compiled, user-written selection function was used in
both cases. The time taken to compile and load these functions,
and the time taken to fill the histograms, has been subtracted
from the values shown.
| | Time | Comments |
| PAW + RWN | 11.3s cold 2.5s hot | First pass - 0% cache efficiency; second pass - 100% cache efficiency |
| PAW + CWN | 16.4s cold 2.6s hot | Converted using htonew |
| TagDB | 6.2s cold 1.4s hot | |
At the time of writing, the reasons why the column-wise Ntuples show worse performance in the first-pass case are not fully understood, but are being investigated in collaboration with the PAW support team in IT/ASD. Although, in this particular case, the TagDB implementation based upon Objectivity/DB shows better performance, we interpret these results as showing comparable performance, pending further, more detailed, investigations. It is fair to say, however, that these first, unoptimised results are encouraging, particularly when one considers the significant amount of effort that has been spent on optimising the performance of PAW.
A slightly more complex implementation than the case described
above is one where the subset of the data used for the selection
is stored separately from that used in the analysis of the selected
events. In this case, two objects are used - one containing the
15 "columns" used in the selections and the other containing
the remaining data. These objects were clustered separately and
were stored in different physical databases in the same federation.
Only in the case that an event is selected is a traversal made
from the reduced tag object to the remaining event data.
| Tag Implementation | Time | Comments |
| Full Tag | 6.2s cold 1.4s hot | No traversal |
| Reduced Tag - 0% selectivity | 1.3s cold 1.0s hot | No traversal |
| Reduced Tag - 2% selectivity | 5.5s cold 1.2s hot | One traversal per selected tag |
| Reduced Tag - 20% selectivity | 7.5s cold 1.5s hot | One traversal per selected tag |
In the above table, the performance of the reduced tag, in the case of low selectivity, improves with respect to that of the full tag simply due to the decrease in I/O that is required. As no events are selected, only the tag objects are read in. As the selectivity rises, we see the effect of object clustering. The objects corresponding to a selected event may well have been brought into the client cache as a result of a previous I/O. The potential benefit is very dependent on the object size and page size. Tests using different object and page sizes have not yet been made, but will be included in the supporting document [6].
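The access pattern of the reduced tag can be sketched as follows. This is an in-memory illustration only: `ReducedTag`, `EventData`, `ScanResult` and `scan` are hypothetical names, a single selection variable stands in for the 15 selection attributes, and an array index plays the role of the association from the tag to the remaining event data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the two separately clustered objects.
struct EventData {
    std::vector<float> rest;  // the remaining, non-selection attributes
};

struct ReducedTag {
    float selectionVar;      // stands in for the attributes used for selection
    std::size_t eventIndex;  // plays the role of the association to EventData
};

struct ScanResult {
    std::size_t total;
    std::size_t matched;
    std::size_t traversals;
};

// Scan all reduced tags; follow the association only for selected events.
inline ScanResult scan(const std::vector<ReducedTag>& tags,
                       const std::vector<EventData>& events, float cut) {
    ScanResult r = {0, 0, 0};
    for (std::size_t i = 0; i < tags.size(); ++i) {
        ++r.total;
        if (tags[i].selectionVar > cut) {
            ++r.matched;
            const EventData& e = events.at(tags[i].eventIndex);  // one traversal
            (void)e;  // histogramming would read e.rest here
            ++r.traversals;
        }
    }
    return r;
}
```

At 0% selectivity no traversal is made and only the tags are read, which corresponds to the cheapest row of the table; the number of traversals then grows in proportion to the number of selected events.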
Further optimisations can be achieved by introducing indices on
the tag objects, either for the full or reduced tags. Objectivity/DB
uses a B-tree with short OIDs, which are 4 bytes rather than the
usual 8 bytes. This implies that they can only refer to objects
within the same container, although Objectivity/DB V4 also provides
federation-wide indices, based upon normal OIDs.
The storage overhead of a single index entry is the size of the
attribute together with a 4 byte overhead. Thus, a single 8KB
database page can store 1000 object references indexed on a single
32 bit field.
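The estimate of 1000 references per page follows directly from the entry size. The helper below is a hypothetical illustration that assumes the whole page is available for index entries, i.e. it ignores any page header or B-tree node overhead.

```cpp
#include <cassert>
#include <cstddef>

// Number of index entries that fit in one database page, where each
// entry costs the attribute size plus a fixed overhead (4 bytes for a
// short OID). Page header overhead is ignored in this sketch.
inline std::size_t indexEntriesPerPage(std::size_t pageBytes,
                                       std::size_t attributeBytes,
                                       std::size_t overheadBytes = 4) {
    assert(attributeBytes + overheadBytes > 0);
    return pageBytes / (attributeBytes + overheadBytes);
}
```

For an 8KB page and a 32 bit (4 byte) field, `indexEntriesPerPage(8192, 4)` gives 1024 entries of 8 bytes each, in line with the figure of about 1000 quoted above.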
Unfortunately, it was not possible to include performance measurements using indices in this report. However, detailed measurements can be found in the supporting document [6], which will be available via the Web in draft form from early March and submitted to the LCB in final form by the end of March 1997.
In principle, storing all of the data of a given experiment under
a consistent scheme offers significant benefits at the analysis
stage. Today's techniques of data reduction, which have evolved
over many years, have been driven by necessity. The cost of random
access storage relative to sequential media (tape) was so high
that successive data reductions were imperative. Even at the startup
of LEP, the idea of providing a mere 100GB of staging space per
experiment on the central mainframe was simply unaffordable.
Today, the situation has changed dramatically, and trends suggest
that this will continue into the future. Even tape media now offer
some degree of random access - typically fast block addressing
- and the amount of disk space that can be afforded has increased
enormously.
Thus, it is important that the next generation of experiments
are not constrained by the technological limits that inhibited
previous ones. Nowhere is this more true than in the area of data
management.
Today's experiments use a wide variety of data formats for rawdata, DSTs, Ntuples, calibration, meta-data and so forth. This gives rise to extreme difficulty in navigating e.g. from a histogram to the rawdata of the events corresponding to the entries in the histogram. Despite many man-years of effort, both centrally and within the experiments, this is still largely an unsolved problem and results in considerable inefficiency in extracting physics results from the data - in other words, a waste of extremely valuable resources.
An ODBMS approach offers the possibility of revolutionising our
approach to physics analysis - offering not only more efficient
access to the data, but also permitting more complicated analyses
to take place. Initial tests show that comparable performance
to today's systems can be achieved by even naïve approaches,
and that separation of data into the part required for selection
and that used for analysis (i.e. reduced tag) offers performance
improvements for sufficiently selective queries. Only minor performance
enhancements have been made so far and it is expected that significant
performance optimisations can be made with time. The ease of access
and transparency to all of the event data has been demonstrated,
and the cost of traversing associations to the event objects shown
to be small. The above results have been based on initial measurements
and only preliminary interpretations can be made at this time.
More complete information will be made available in the supporting
document for milestone 3 [6], presented at the LCB open session
in March 1997 and discussed at the RD45 workshop to be held at
CERN from March 12-14.
Extrapolating from the current prototyping activities, which are
at a scale of GB to hundreds of GB, to a production system capable
of scaling to 100PB - an increase of some 6 orders of magnitude
- requires detailed and careful analysis of the risks involved.
We present here the main risks that we have identified, and outline
ways that these issues may be better understood in the short-term
and possible fall-back scenarios.
Details of investigations concerning the limits and issues listed below can be found in [6].
The current RD45 model is that each experiment would use a single
logical database in which all of their data would be stored. Such
a single logical view is implemented in Objectivity/DB as a so-called
federated database, consisting of multiple physical databases,
which may be stored on different servers across the network.
The databases of each experiment are expected to be independent,
and there would thus be no need for a single application to access
multiple logical databases, e.g. those of ATLAS and CMS, concurrently.
Indeed, neither the current version of the ODMG standard nor Objectivity/DB
support simultaneous access to multiple (federated) databases.
Due to the current architecture of Objectivity/DB, multiple federations
may be required as part of a fall-back solution, e.g. if
the extended object identifier (OID) described in section 11.2
on page 28 is not implemented, in which case a separate federation
would be required for each year of data taking.
One could also consider the use of multiple federations to handle user data. However, it is our conclusion that multiple federations, and heterogeneous federations in particular, should be avoided, and that alternative approaches to these problems be investigated.
In the Objectivity/DB architecture, a single logical database
is composed of many physical databases. Currently, each
physical database is mapped to a file and the logical database
is termed a federated database.
The current RD45 model, using multiple physical databases limited
to some 100GB, requires that we will use many thousands of physical
databases, spread across multiple servers. Indeed, the current
64-bit OID used by Objectivity/DB implies a maximum size of a
federated database of only 6.5PB (2^16 - 1 databases
of 100GB each).
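The 6.5PB figure, and the gap between it and the 100PB target, can be checked with the following sketch. It assumes a 16-bit database number within the OID (which is what the limit of 2^16 - 1 databases implies) and decimal units, i.e. 1PB = 10^6 GB; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Largest number of databases addressable with the given number of bits,
// with one value reserved (hence the "- 1").
inline std::uint64_t maxDatabases(unsigned bits) {
    assert(bits > 0 && bits < 64);
    return (std::uint64_t(1) << bits) - 1;
}

// Maximum federation size in GB for a given per-database size limit.
inline std::uint64_t federationCapacityGB(unsigned dbBits, std::uint64_t dbSizeGB) {
    return maxDatabases(dbBits) * dbSizeGB;
}

// Number of databases of a given size needed to hold a target volume.
inline std::uint64_t databasesNeeded(std::uint64_t targetGB, std::uint64_t dbSizeGB) {
    assert(dbSizeGB > 0);
    return (targetGB + dbSizeGB - 1) / dbSizeGB;  // round up
}
```

With 16 bits and 100GB per database the capacity is 6,553,500GB, i.e. about 6.5PB, whereas 100PB would need 1,000,000 databases of 100GB, some fifteen times more than a 16-bit database number can address.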
To understand whether such large numbers of databases can really
be handled by the current architecture, we have used the existing
test-bed to build a federation containing the maximum number of
databases, but have limited their size to 1MB. This has allowed
us to understand issues relating to the number of databases, without
requiring a massive storage system in which to place them.
We were able to create a federation containing 13,000 databases,
which would limit the federation to around 1PB, if each database
were allowed to grow to 100GB, as foreseen. However, we observed
some performance problems when creating many (more than 500) databases
in the same process, or when adding new databases to an already
large federation. These issues are being pursued with Objectivity.
As it would seem unwise to plan on a physical database size of
much more than 100GB, there is a clear requirement to increase
the current 64-bit OID, or at least change the mapping from logical
to physical model to circumvent this problem.
This has been raised as a requirement with Objectivity.
The number of containers per physical database is similarly limited
to 32K (2^15 - 1). Although we have no obvious requirement
for a greater number of containers per database, we have nevertheless
tested building a database with such a large number of containers.
We were able to reach this limit without problems. Attempting
to exceed the limit results in an error, as expected.
The current Objectivity/DB architecture limits the maximum container size to a multiple of the database page size. Using database pages of 8K, the maximum container size is 2^29 bytes, or 0.5GB. Using the maximum database page size, containers are limited to 4GB. As a physical limitation, this is not considered to be a significant problem, although there is a clear need for logical containers, which group together multiple physical containers. A prototype of such logical containers has been built, although it currently lacks support for appropriate iterators, which can iterate over the entire logical container.
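The two container limits are consistent with a fixed maximum number of pages per container. The sketch below assumes a 16-bit page number, which is inferred from the figures above (0.5GB divided by 8K pages gives 2^16 pages) rather than stated in this report; the 64K page size used in the second case is likewise inferred from the 4GB limit.

```cpp
#include <cassert>
#include <cstdint>

// Maximum container size in bytes, assuming a fixed number of bits for
// the page number within a container (16 bits is an inferred value).
inline std::uint64_t maxContainerBytes(std::uint64_t pageBytes,
                                       unsigned pageNumberBits = 16) {
    assert(pageNumberBits < 32);
    return pageBytes << pageNumberBits;
}
```

Under this assumption, 8K pages give 2^13 x 2^16 = 2^29 bytes (0.5GB), and 64K pages give 2^16 x 2^16 = 2^32 bytes (4GB).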
Objectivity/DB permits associations to be established between different objects, regardless of where they are stored in the federation. To test cross-container and cross-DB navigation, we have built a number of prototypes, varying the physical implementation from a single container in a single physical database to multiple containers in multiple databases distributed across many servers.
Some object models, such as the current prototype for the ALICE
raw data, require very large numbers of associations. However,
in the case of individual events, we expect that the number of
associations that will be required will be of the order of 10-100,
or at the most 1000. The theoretical limit on numbers of associations
in Objectivity/DB is 2^32, due to their implementation
based on VArrays, and thus this is not considered a risk area.
In tests, we have been able to build up to 5 million associations for a given object without problems. As the current implementation is based upon VArrays, the actual limit depends on resources on the database client. The usage of a "paged-VArray", which only loaded the required pages into the client memory, would circumvent this problem.
Some physics channels at the LHC are estimated to include very
large numbers of events - perhaps 10^9 or even more.
It is highly unlikely that collections are a viable approach for
managing such large numbers of objects, and an approach based
on containment, i.e. where all events corresponding to a certain
channel would be stored in a given (set of) containers and/or
databases, is more appropriate.
Solutions to this problem include "collections of collections",
e.g. where a given physics channel is divided into multiple collections,
each corresponding to a data taking period, or direct support
from the database, in a manner that does not require that the
entire collection is loaded into the client cache.
Collections are implemented using a VArray of object references, and hence the limits and comments described in section 11.5 above are also applicable here.
The ODMG does not define the implementation of an OID, and various
different strategies have been adopted by the vendors. That taken by Objectivity
is to use an OID that has a direct physical mapping. This has
significant advantages in terms of performance over logical OID
implementations, but implies that object re-clustering is likely
to render existing collections invalid. If bi-directional associations
are used between the collections and the objects, then re-clustering
can be performed at any time without rendering these collections
invalid. However, if uni-directional associations are used, which
is expected to be the case for user collections, then re-clustering
will render such collections invalid.
A number of scenarios exist which minimise, or even hide, the effect of re-clustering on user collections. For example, a validity stamp could be used to determine automatically whether collections were still valid, and even update the collections if required. However, it is clear that further investigation is required to fully understand the issues involved.
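One possible reading of the validity-stamp idea is sketched below. It is an assumption-laden illustration, not a description of any existing facility: the clustering "epoch" counter, the `UserCollection` type and the remap table (old position to new position) are all hypothetical, and positions in a vector stand in for physical OIDs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of the current physical layout; the epoch is
// incremented each time the data are re-clustered.
struct Clustering {
    unsigned long epoch;
};

// Hypothetical user collection holding uni-directional references
// (positions stand in for physical OIDs).
struct UserCollection {
    std::vector<std::size_t> refs;
    unsigned long stampedEpoch;  // layout epoch the references were valid for
};

// A collection is valid only if it was stamped with the current epoch.
inline bool isValid(const UserCollection& c, const Clustering& layout) {
    return c.stampedEpoch == layout.epoch;
}

// Bring a stale collection up to date using a table that maps each old
// position to its new position, then re-stamp it.
inline void refresh(UserCollection& c, const std::vector<std::size_t>& remap,
                    const Clustering& layout) {
    for (std::size_t i = 0; i < c.refs.size(); ++i)
        c.refs[i] = remap.at(c.refs[i]);
    c.stampedEpoch = layout.epoch;
}
```

The check itself is cheap; the cost lies in keeping a remap table at each re-clustering, which is one of the issues that would need further investigation.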
Independent of the current limits on container and database sizes, there is a clear requirement for a facility whereby containers and databases can be limited to a given size, with new containers/databases created automatically as required. In addition, facilities to iterate over the multiple containers/databases must be provided. Such a facility could be provided either by the database vendor or by HEP-specific application code. The preferred solution would be for the vendor to provide such libraries, although it is highly unlikely that such implementation-specific areas will ever be standardised, and hence the usual caveats concerning vendor-specific features apply.
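A minimal sketch of such a facility is given below, written as HEP-side application code rather than as a vendor library. The `LogicalContainer` class is hypothetical: it limits each physical container by object count rather than by bytes, uses in-memory vectors in place of real containers, and offers a single `forEach` in place of a proper iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical logical container that spreads objects over several
// size-limited physical containers and iterates over all of them.
template <typename T>
class LogicalContainer {
public:
    explicit LogicalContainer(std::size_t maxPerContainer)
        : max_(maxPerContainer) {
        assert(max_ > 0);
    }

    // Add an object, creating a new physical container automatically
    // when the current one is full.
    void add(const T& obj) {
        if (parts_.empty() || parts_.back().size() >= max_)
            parts_.push_back(std::vector<T>());
        parts_.back().push_back(obj);
    }

    std::size_t physicalContainers() const { return parts_.size(); }

    // Visit every object across all physical containers, in insertion order.
    template <typename F>
    void forEach(F f) const {
        for (std::size_t i = 0; i < parts_.size(); ++i)
            for (std::size_t j = 0; j < parts_[i].size(); ++j)
                f(parts_[i][j]);
    }

private:
    std::size_t max_;
    std::vector<std::vector<T> > parts_;
};
```

The same pattern applies one level up, with databases in place of containers; only the size limit and the unit being created change.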
Deploying a fully-distributed database system will clearly involve
a certain amount of administration. Many issues need to be better
understood, including the real tolerance of the system to prolonged
network failures, the propagation of database catalogue and schema
changes over faulty networks, and the possibility of applying
"rolling-upgrades", i.e. upgrading the database software
on the various servers in turn, whilst keeping the database available
to users.
Although initial tests can be made with test configurations at
CERN, much more exhaustive studies will need to be made in the
wide area with remote sites, requiring careful coordination. A
number of projects concerning regional centres and wide-area replication
are currently being discussed, and it is expected that these issues
will be further researched in joint collaboration between RD45
and these projects.
It is clear that data management for multiple PB of data in the
fully distributed environment will always involve a non-negligible
amount of overhead. However, this overhead must
be kept as low as possible, preferably requiring less manpower,
whilst providing considerably more functionality, than today's
ad-hoc solutions.
Further investigations will be made in this area, particularly related to issues concerned with wide-area distribution.
Although the current RD45 prototyping activities are being performed
using Objectivity/DB, great care is taken to avoid using vendor-specific
features. Certain important features, such as schema evolution,
are not part of the current ODMG standard and we are therefore
working with the ODMG to extend the standard to ensure that it
is sufficiently complete to satisfy our requirements.
Some features, including DBA-related functionality, are unlikely
ever to become standardised, and hence migration from one product
to another will always require work. Nevertheless, by adhering
closely to the ODMG standard, we are able to protect ourselves
as much as possible. In addition to the portability of application
code between different vendors, the ODMG also provides an interchange
format, so that the associated data can also be moved. However,
migrating many TB or PB of data will never be a task that can
be undertaken lightly.
A recent IDC report on the ODMG estimates that the ODBMS market
is currently worth $115M per year and growing at 24% per annum.
This growth is expected to accelerate such that the total market
is estimated to reach $1.6B by the year 2000. As an indication of
current volumes, Object Design International (ODI), one of the two
ODBMS vendors that went public in 1996, announced total earnings of
nearly $10M for the 3-month period that ended in September 1996.
Like other analyses, the IDC report predicts that the Web, and the
Java binding in particular, will be important markets for ODBMSs.
The ODMG standard is widely accepted as being "the"
standard for ODBMSs, and all major products already offer partial
conformance. We can confidently expect that new products in this
market will conform to this standard, and hence that standards-conforming
ODBMS products will continue to be marketed for the foreseeable
future.
Today, the market for very large databases is small (but non-zero),
although this is predicted by many analysts to grow considerably
in the coming years. We are aware of a number of projects which
call for databases of several hundred GB to a few TB in the immediate
future, scaling to tens to hundreds of TB by the end of the decade,
and believe that many more such projects exist.
Several ODBMS products, including Objectivity/DB and Versant,
are currently targeting the telecoms market, which requires distributed
databases, scalability and performance. This market is sufficiently
large to sustain at least one, if not both, of these vendors. It is
therefore probably safe to say that a product capable of satisfying
the requirements of the telecoms industry, and hence at least a minimum
set of HEP requirements, will continue to exist.
Nevertheless, fallback strategies need to be considered, including the use of a "commodity" ODBMS, should an appropriate high-end system cease to be available.
The only known MSS that - even theoretically - offers the scalability
and functionality required for LHC is HPSS. The absence of alternatives
is indicative of the fact that this is very much a niche market.
Other MSS products exist, but are typically targeted at much
more modest volumes of data, and almost certainly could not satisfy
our requirements.
The US National Laboratories, such as Lawrence Livermore, Los
Alamos, Sandia, etc. are all involved in the HPSS consortium and
are all expected to use HPSS. Several HEP sites (CERN, DESY, FNAL,
IN2P3, etc.) are considering or planning to use HPSS, which could
provide the critical mass needed to ensure HPSS's survival.
It is clear that the effort to produce a system as powerful as
HPSS is simply not available within HEP, and so the absence of
a suitable product in this area would be a major inconvenience.
However, this is also true for the US National Labs - by adopting
a common strategy, there is much more chance that such a strategy
will survive than if we pursued separate paths. In other words,
should HPSS fail, the wisest strategy would be to combine forces
with other sites facing similar problems to build, or preferably
commission, a replacement system. The design phase for such a
system could be considerably shortened by basing the system upon
the IEEE Reference Model for Mass Storage Systems, and even using
the standard APIs that are currently being developed.
As with the ODBMS, a fallback solution needs to be considered.
Unlike the ODBMS case, no clear alternative currently exists.
Commodity products typically target the backup market, and today
have no clear way of scaling sufficiently to meet the requirements
of LHC. It is possible, however, that as backup volumes increase,
it will be feasible to use a small number of such systems, e.g.
one per year per experiment, to manage LHC event data volumes.
This is clearly an area of risk which needs to be studied further.
Many of the risk factors associated with the current strategy
can be both identified and tested today. Work over the next months
will allow us to better understand the precise risks involved,
and develop work-arounds and/or alternative strategies as appropriate.
Initial investigations of the limits and scalability of the current
Objectivity/DB architecture lead us to the following requirements:
Further work needs to be done in the area of distributed database management, and the ODBMS and MSS markets need to be followed closely, so that alternative strategies can be developed in time should, for example, the HPSS project or the current ODBMS supplier fail.
Over the past year, several projects have been started within
HEP that use an ODBMS, and Objectivity/DB in particular. In addition
to the work at BaBar, mentioned in the previous status report,
Objectivity/DB is now installed at DESY, for prototyping activities
on ZEUS, and at KEK, for work related to the BELLE experiment. It is
also under consideration at FNAL, for studies related to the use of
Objectivity/DB for run 2 physics data. More details of these
activities can be found below.
In addition to the activities described in this report, there are several other prototypes at CERN using Objectivity/DB, most particularly in CMS, including test-beam and calibration database studies, as well as the CRISTAL project. Objectivity/DB is also the database system used by one of the EDMS systems currently under study at CERN.
In addition to the activities described above, directly related to the LCRB milestones and referees' recommendations, RD45 has worked with the LHC experiments at CERN as well as experiments at other laboratories, on issues related to data management and object persistence. More details are given below.
In the context of the ALICE experiment, a first version of an
object model describing the raw data has been developed. Persistent
classes describing the raw data for the 7 main detectors have
been defined, consistent with the typical raw event size of 40MB.
In addition, a pseudo-event generator has been built, which generates
events according to this object model, but without physical content
- the data members of the objects involved are simply numbers
generated randomly within the defined limits for each quantity.
This generator has been used to test the feasibility and consistency of such a model, as well as to investigate various alternatives in the design. Finally, it has been used to test some of the ODBMS limits, corresponding to the "risk analysis" described in section 0 on page 27.
Collaboration with ATLAS has increased during the last year, and a new ATLAS sub-group, which will work closely with RD45, has recently been set up. Amongst other activities, this group will study issues such as wide-area replication, by network or tape, of physics data. In addition, members of ATLAS are proposing the use of RD45-like solutions on CDF for run II of the Fermilab collider.
There are a number of prototyping activities exploiting Objectivity/DB
within the CMS collaboration. Perhaps the most significant is
the plan to store some 50GB of test-beam data in an Objectivity/DB
database during 1997, and to use the elements of the LHC++ environment
to analyse this data. In a possible future extension, this project
may exploit the data replication option and HPSS interface of
Objectivity/DB, but in the short term will probably be limited
to disk storage and manual copying of database files from the
test beam area to the computer centre.
This activity is clearly strongly related to the proposed milestones for 1997, listed in section 20 on page 41.
Starting in late 1995, the NA45 collaboration completely redesigned
their reconstruction and filtering software and re-implemented
it in C++. The total package consists of some 30K lines of
code, and has been ported to use an ODBMS for persistence. The
system has been used to write to a single logical store from multiple
(16) processing nodes in parallel. So far, some 20GB of data have
been stored. A reprocessing is planned which will store 60GB of
data in the ODBMS.
More details concerning RD45 collaboration with NA45 can be found in [4] and [5].
Persistence for calorimeter and tracker "hits" objects
has been introduced in GEANT-4 using Objectivity/DB.
Two different implementations have been tested:
In both cases, as can be seen from the tables below, the overhead
introduced by making the objects persistent is very small. In
the case of the calorimeter hits, two collections are created,
of 19 and 17 objects respectively. The objects are accessed 100
times, as the energy deposition is accumulated. In the case of
the tracker hits, collections of 1900 and 1700 objects are created.
However, each object is accessed only once (at construction time).
The tests were performed on the SP-2 at CERN and the times shown
below are in seconds.
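| Implementation | Time (s) | Calorimeter hits, transient | Calorimeter hits, persistent | Tracker hits, transient | Tracker hits, persistent |
| Individual objects | User time | 7.96 | 9.63 | 8.80 | 13.09 |
| Individual objects | Real time | 12.2 | 14.22 | 9.63 | 26.33 |
| Persistence by containment | User time | 8.66 | 8.37 | 9.66 | 8.89 |
| Persistence by containment | Real time | 10.96 | 15.87 | 11.28 | 14.41 |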
The slightly better user time in the case of "persistence
by containment" comes from improved optimisation in the persistent
collection class, which is based upon the Objectivity-supplied
VArray, whereas the transient case uses a Rogue Wave collection
class. Further optimisations to the persistent versions are possible.
For example, the performance of the "individual objects"
implementation should improve if multiple persistent objects are
created at the same time, rather than individually, as shown above.
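The relative overhead implied by these timings can be computed directly. The following sketch is illustrative only: the figures are copied from the timing table above, the keys "rows 1-2" and "rows 3-4" simply refer to the first and second pair of data rows as printed, and the helper `relative_overhead` is a hypothetical name introduced here.

```python
# Timings in seconds, keyed by (rows, hit type, clock) -> (transient, persistent).
TIMINGS = {
    ("rows 1-2", "calorimeter", "user"): (7.96, 9.63),
    ("rows 1-2", "calorimeter", "real"): (12.2, 14.22),
    ("rows 1-2", "tracker", "user"): (8.80, 13.09),
    ("rows 1-2", "tracker", "real"): (9.63, 26.33),
    ("rows 3-4", "calorimeter", "user"): (8.66, 8.37),
    ("rows 3-4", "calorimeter", "real"): (10.96, 15.87),
    ("rows 3-4", "tracker", "user"): (9.66, 8.89),
    ("rows 3-4", "tracker", "real"): (11.28, 14.41),
}


def relative_overhead(transient: float, persistent: float) -> float:
    """Fractional change in time when going from transient to persistent."""
    return (persistent - transient) / transient


overheads = {key: relative_overhead(*pair) for key, pair in TIMINGS.items()}

# Print one line per measurement; negative values mean the persistent
# version was faster than the transient one.
for (rows, hits, clock), value in overheads.items():
    print(f"{rows:9s} {hits:12s} {clock:5s} {value:+.1%}")
```

User time and real time are kept separate because real time also includes waiting for I/O and for other activity on the machine.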
The small overhead introduced by the database is striking, and can be compared with that incurred by storing ZEBRA objects in an RZ file. In a very simple test, e.g. using a linear chain of 1900 banks, each containing 10 data words, the ZEBRA I/O overhead amounts to a small factor: the persistent case is several times slower than the transient one, rather than fractionally slower, as is seen when using an ODBMS for persistence in the GEANT-4 prototype.
The AMY experiment at Tristan, KEK, originally used a Fortran-based bank system, known as the Tristan Bank System (TBS). DST-level data has been converted from TBS format and stored in an Objectivity/DB database using a variety of different object models, and performance comparisons made of the different approaches as well as with the original TBS-based system.
The BaBar collaboration are currently planning to use an ODBMS both for calibration data and also for physics events. An evaluation of two commercial ODBMS products recommends the use of Objectivity/DB, based upon its superior performance and scalability characteristics in a HEP environment. Work is progressing on the design of an ODBMS-based event store.
Objectivity/DB is currently being evaluated at KEK for the BELLE collaboration. A system is being built up based on 7 28-node UltraSparc servers with nearly 4TB of disk space and 4 Sony tape robots attached. The system will use the Petaserve MSS from Sony, which is based on the Lachman Open Storage Manager (OSM) that is currently in use at DESY.
The ZEUS experiment have built a prototype event directory based upon Objectivity/DB. The philosophy has been to follow an evolutionary approach - first to reproduce more or less the functionality provided by the existing, ADAMO [21]-based, event directories, but with more flexibility, and then to use an ODBMS for micro-DST-level data, as input to physics analysis. Using a sample of 10^6 events, corresponding to about 100MB of data, the prototype demonstrated about the same performance as the existing, highly-optimised, solution. Work on ODBMS-based event directories continues, hopefully leading to a production system in 1997.
In the context of RD45, CERN has associate membership of
the Object Management Group (OMG) and is a reviewer member
of the Object Database Management Group (ODMG). CERN is also represented
in the IEEE Computer Society Executive Committee on Mass Storage,
which is the body to which the various standards sub-groups report.
During the past year, the only significant involvement of CERN
has been with the ODMG, although a workshop focussing on the current
theory and practice of high-end data management, to be held near
CERN, is planned for 1997.
Version 1.2 of the ODMG-93 standard, finalised in 1995, was published
in early 1996. During the past year, the ODMG has concentrated
on version 2.0 of the standard, for which CERN helped to set the
priorities. This version of the standard should be finalised in
the February/March 1997 timeframe, after which work will start
on the next release of the standard. The first meeting after finishing
V2.0 is scheduled for July 1997, to be organised by CERN.
The bindings defined by the ODMG are intended as portability
bindings. That is, an application built on top of one ODMG-compliant
database should port without source code changes to another
compliant product. In addition to providing application portability,
the ODMG have defined an interchange format, permitting data portability
between the various conforming products.
In reality, the current standard is insufficient to satisfy all
of our requirements - it does not, for example, include schema
evolution, distributed databases, replication and so forth - and
hence it is inevitable that current prototypes exploit vendor
extensions. However, RD45 places strong emphasis on working within
the ODMG to ensure that the standard is enhanced to minimise,
and perhaps eliminate, the need for reliance on such features.
This work will inevitably span numerous updates to the standard
and so cannot be considered a short-term goal.
Important new features expected in V2.0 of the standard include
the data interchange format mentioned above, access to schema
meta-objects and an ORB adaptor.
Post V2.0 options for the ODMG include merging with the OMG, which would decrease the control that the ODBMS vendors have over the direction of the standard, but give a corresponding increase in the amount of user participation that would be possible.
A number of workshops, focusing on Objectivity/DB, were held at
CERN over the past year. The first two workshops, held in February
and May 1996 respectively, were largely devoted to discussions
of initial prototypes and modelling experiences. They were extremely
useful in helping us to better understand the current Objectivity/DB
product and future enhancements, and in deciding the implications
of various implementation choices, such as object granularity.
The final workshop of 1996 focussed on the results of the work
in meeting the current LCRB milestones, plans for performance
and scalability measurements, the risk analysis described above,
and discussions on the requested Objectivity/HPSS interface. In
addition, presentations were made on requirements from ATLAS and
CMS, prototyping activities in ALICE, AMS, ATLAS, CMS, GEANT-4,
NA45 and ZEUS, and discussions of problems encountered in a number
of these prototypes.
This workshop was attended by some 20-30 people and provided important
feedback on the progress on the LCRB milestones, as well as highlighting
a number of areas where product enhancements are required.
These workshops have been attended by consultants and/or architects
from Objectivity, and have proved extremely profitable. It is
our intention to continue regular workshops as required. For 1997,
three workshops are currently planned, two of which will be held
at CERN:
The annual Objectivity/DB Developers' Conference was held in Santa Clara - close to Objectivity's headquarters in Mountain View - on April 26-27 1996. This meeting included sessions on new features of Objectivity/DB, including schema evolution, user data replication, the Java JDBC interface, Objectivity/DB-based Web servers, performance tuning etc., as well as presentations from the user community, such as RD45 and the Sloan Digital Sky Survey (SDSS). This meeting provided ample opportunity for discussions with other users of the product, as well as to meet the developers and support staff.
It is our recommendation that CERN participate regularly in these meetings, using them to provide feedback on CERN's (HEP's) requirements.
The RD45 collaboration has identified a number of ways that an
ODBMS could be coupled to an MSS. The most promising of these
are:
There are currently two investigations of integrating Objectivity/DB
at the filesystem level:
Although neither of these Mass Storage Systems is under active consideration for production deployment at CERN, these activities offer a useful existence proof of a transparent interface between an ODBMS and MSS.
A course on HPSS was held at CERN during October 1996, and was
attended by an engineer from Objectivity. During this course,
a powerful new mechanism whereby Objectivity/DB could be interfaced
to HPSS was identified.
The Objectivity/DB server - a light-weight page server - uses
basic I/O calls such as lseek(), read(), write(). The HPSS
client API provides equivalents for these routines, e.g. hpss_read().
It would thus be possible to interface the Objectivity/DB server
to HPSS without any changes to the server code, simply by providing
jacket routines that map these calls onto the HPSS library and
relinking the server with that library. This would have the considerable advantage
that client applications would remain the same. HPSS would be
responsible for managing the disk space, and also the tertiary
storage, and would move entire databases (bit-files) to/from tertiary
storage as required.
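The jacket-routine idea can be sketched as follows. This is an illustration only, written in Python rather than as link-time C jackets, and it does not call any real HPSS or Objectivity/DB interface: `PageServer`, `PosixIO`, `FakeMassStoreIO` and `PAGE_SIZE` are hypothetical names, and `FakeMassStoreIO` is an in-memory stand-in for a mass-storage client library. The point shown is that the server code uses only seek, read and write, so the storage layer behind those calls can be exchanged without touching the server or its clients.

```python
import os
import tempfile

PAGE_SIZE = 8192  # hypothetical page size, chosen only for illustration


class PosixIO:
    """Backend that forwards to the operating system calls."""

    def open(self, path):
        return os.open(path, os.O_RDWR | os.O_CREAT, 0o600)

    def lseek(self, fd, offset):
        return os.lseek(fd, offset, os.SEEK_SET)

    def read(self, fd, nbytes):
        return os.read(fd, nbytes)

    def write(self, fd, data):
        return os.write(fd, data)

    def close(self, fd):
        os.close(fd)


class FakeMassStoreIO:
    """In-memory stand-in for a mass-storage client library (NOT the HPSS API)."""

    def __init__(self):
        self._files = {}    # path -> bytearray holding the "bit-file"
        self._open = {}     # descriptor -> [path, current position]
        self._next_fd = 3

    def open(self, path):
        self._files.setdefault(path, bytearray())
        fd, self._next_fd = self._next_fd, self._next_fd + 1
        self._open[fd] = [path, 0]
        return fd

    def lseek(self, fd, offset):
        self._open[fd][1] = offset
        return offset

    def read(self, fd, nbytes):
        path, pos = self._open[fd]
        data = bytes(self._files[path][pos:pos + nbytes])
        self._open[fd][1] = pos + len(data)
        return data

    def write(self, fd, data):
        path, pos = self._open[fd]
        buf = self._files[path]
        if len(buf) < pos:
            buf.extend(bytes(pos - len(buf)))  # zero-fill any gap, like a sparse file
        buf[pos:pos + len(data)] = data
        self._open[fd][1] = pos + len(data)
        return len(data)

    def close(self, fd):
        del self._open[fd]


class PageServer:
    """Page server that relies only on the seek/read/write calls of its backend."""

    def __init__(self, io, path):
        self.io = io
        self.fd = io.open(path)

    def write_page(self, page_no, data):
        if len(data) != PAGE_SIZE:
            raise ValueError("a page must be exactly PAGE_SIZE bytes")
        self.io.lseek(self.fd, page_no * PAGE_SIZE)
        self.io.write(self.fd, data)

    def read_page(self, page_no):
        self.io.lseek(self.fd, page_no * PAGE_SIZE)
        return self.io.read(self.fd, PAGE_SIZE)

    def close(self):
        self.io.close(self.fd)


def round_trip(io, path):
    """Write one page through the server and check that it reads back unchanged."""
    server = PageServer(io, path)
    page = bytes([7]) * PAGE_SIZE
    server.write_page(2, page)
    result = server.read_page(2)
    server.close()
    return result == page


with tempfile.TemporaryDirectory() as tmp:
    posix_ok = round_trip(PosixIO(), os.path.join(tmp, "db.bin"))
mass_store_ok = round_trip(FakeMassStoreIO(), "db.bin")
```

The same `PageServer` code runs against both backends, which mirrors the relinking approach: only the layer beneath the basic I/O calls changes.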
A requirement for HPSS support has been raised with Objectivity, and there are plans for a proof-of-concept prototype by the time of CHEP '97, and a full product by the end of 1997.
A loose coupling between ODBMS and MSS, as described above, offers
an extremely simple yet powerful way of extending disk-based object
management solutions into the tertiary storage region, as required
by LHC experiments and others.
During the coming year, the most promising of these techniques will be investigated further, both at CERN and outside.
Many people predicted that the ODBMS market would take off during
the past year. Arguably, this did indeed occur, although in a
somewhat more modest fashion than foreseen. Two ODBMS companies,
ODI and Versant, went public in July 1996 and others are expected
to follow.
Although the relational vendors have largely ignored the ODBMS
market, all, except IBM, established some relationship with an
ODBMS company. ORACLE, which had marketed the Omniscience product
as ORACLE-Lite since early 1996, took over the company in November.
Almost without exception, the ODBMS vendors have pre-announced
Java bindings and put significant emphasis on the Web. It is predicted
that both of these areas will play a significant role in further
developments of the ODBMS market.
Activity on various Internet newsgroups, e.g. comp.databases and comp.databases.object, has grown considerably, indicating that many more people are working with, and developing applications on, ODBMSs.
The future activities of the RD45 project are driven almost exclusively
by the needs of the LHC experiments. As part of the development
of their CTPs, both ATLAS and CMS are developing a list of issues
that require further study, which we expect to have a strong influence
on the future milestones and activities of the project. Indeed,
the proposed milestones, listed below, draw from the milestones
from the Computing Model chapters of the CTPs.
In addition, we will continue to work with the GEANT-4 collaboration, the NA45 experiment and other groups both at CERN and outside who are investigating the same or similar technology.
We propose the following activities to be considered for the milestones
for the third year of the RD45 project. These suggestions have
been prepared in consultation with ALICE, ATLAS, CMS and NA45.
By mid-1998, a proof-of-concept MSS interface should also be demonstrated. In addition, it is expected that further investigations of the areas covered by the current milestones will be required. Examples include further performance investigations, a study of the feasibility of wide-area data replication, and so on.
We have identified and described the impact of using an ODBMS
on physics applications, and the potential benefits for HEP data
management of ODBMS features such as schema evolution, object
versioning and data replication. We have also evaluated the
effectiveness of using an ODBMS as input to physics analysis, as
compared with traditional techniques, and analysed the key risks
involved in an ODBMS+MSS based physics event store. In addition, we have
worked closely with the ATLAS and CMS Computing Model working
groups and with other projects, both at CERN and outside, that
are using or considering the use of an ODBMS for object persistency.
The use of an ODBMS+MSS for storing and managing event data is
currently the baseline assumption of both ATLAS and CMS Computing
Technical Proposals [24] [25], pending further investigations
into performance and scalability, and is also being considered
by a number of pre-LHC experiments at other laboratories. We will
continue to work with these groups to identify and investigate
the key issues and propose a strategy of gradually scaling from
the current 100GB-1TB region to the 100TB region in the years
before LHC data.
ADAMO - a system, developed in the ALEPH collaboration, based on the Entity-Relationship (ER) model.
ADSM - A storage management product from IBM
AFS - the Andrew (distributed) filesystem
CASE - Computer Aided Software Engineering
CORBA - the Common Object Request Broker Architecture, from the OMG
CORE - Centrally Operated Risc Environment
CWN - Column-wise Ntuple
CTP - Computing Technical Proposal
DFS - the OSF/DCE distributed filesystem, based upon AFS
DMIG - the Data Management Interface Group
EDMS - Engineering Data Management System
GB - 10^9 bytes
HPSS - High Performance Storage System - a high-end mass storage system developed by a consortium consisting of end-user sites and commercial companies
IEEE - the Institute of Electrical and Electronics Engineers
KB - 2^10 (1024) bytes - normally referred to as 10^3 bytes
LCB - LHC Computing Board
LCRB - LHC Computing Review Board
LIGHT - Life Cycle Global Hypertext
MB - 10^6 bytes
MSS - a Mass Storage System
NFS - the Network Filesystem, developed by Sun
ODBMS - an Object Database Management System
ODMG - the Object Database Management Group, a group of database vendors and users that develops standards for ODBMSs
OID - Object Identifier
OMG - the Object Management Group
OQL - the Object Query Language defined by the ODMG
ORB - an Object Request Broker
OSM - Open Storage Manager: a commercial MSS
PAW - the Physics Analysis Workstation
PETASERVE - an MSS based upon OSM
PB - 10^15 bytes
RWN - Row-wise Ntuple
SHORE - Scalable Heterogeneous Object REpository
SQL - Structured Query Language: the language used for issuing queries against databases
SSSWG - the Storage System Standards Working Group
STL - the Standard Template Library: part of the draft C++ standard albeit in a modified form
TB - 10^12 bytes
TOOLS.H++ - the current de-facto standard container/collection class library, now based on the STL
VLDB - Very Large Database
VLM - Very Large Memory
VMLDB - Very Many Large Databases
XBSA - the draft X/Open Backup Services Application Program Interface