EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH
CERN/LHCC 99-28
LCB Status Report/RD45
21 September 1999
The RD45 collaboration
CERN, Geneva, Switzerland
This document has been produced for the October 1999 LCB review of the RD45 project. In this report, we present the status of the project, including a summary of the responses to the milestones set at the 1998 review by the LCB, suggestions for future activities and a revised risk analysis of the current RD45 strategy.
RD45 documents may be obtained through the Web via http://wwwinfo.cern.ch/asd/cernlib/rd45/index.html.
The RD45 Collaboration
David Malon
Argonne National Laboratory, Argonne, Illinois, USA
Martin Purschke
Brookhaven National Laboratory, USA
Julian Bunn, Harvey Newman, Rick Wilkinson
Caltech, USA
Eva Arderiu Ribera, Dirk Düllmann, Bernardino Ferrero Merlino, Gunter Folger, Romuald Knap, Marcin Nowak, Andreas Pfeiffer, Jamie Shiers (spokesman), Kurt Stockinger
CERN/IT
Geneva, Switzerland
Pavel Binko, Koen Holtman, Vincenzo Innocente, Arthur Schaffer, Lucia Silvestris, Heinz Stockinger, Lassi Tuura, Ian Willers
CERN/EP
Geneva, Switzerland
Martin Gasthuber
DESY/Hamburg, Germany
David Quarrie
Lawrence Berkeley National Laboratory
Berkeley, CA, USA
Youhei Morita
KEK, Oho,
Tsukuba, Ibaraki, 305 Japan
Stansislaw Jagielski
Faculty of Physics and Nuclear Techniques,
UMM Krakow, Poland
Alexei Klimentov
MIT, USA
Christian Arnault
Laboratoire de l'Accelerateur Lineaire
Orsay, France
Yemi Adesanya, Jacek Becla, Gabriele Cosmo, Andrew Hanushevsky
Stanford Linear Accelerator Center, CA, USA
Sunanda Banerjee
Tata Institute of Fundamental Research
Bombay, India
Simona Rolli, Krzysztof Sliwa
Tufts University, USA
TABLE OF CONTENTS
1 Executive Summary
1.1 Summary of Activities Since April 1998 Review
1.2 Conclusions
2 Milestones from the April 1998 LCB Review
3 Interim Referees Reports
3.1 December 1998
3.2 June 1999
4 Milestone 1
4.1 Introduction
4.2 Production Setup
4.2.1 Introduction
4.2.2 Lock and Database File Servers
4.2.3 Installing Objectivity/DB Software
4.2.4 Objectivity/DB 4.0.2
4.2.5 Objectivity/DB 5.1
4.2.6 Starting the Lock Server and AMS Server
4.2.7 Coexistence of Objectivity/DB Releases 4.0.2 and 5.1
4.2.8 Objectivity Backup Issues
4.2.9 Lock Server Naming Conventions
4.2.10 Objectivity Databases residing in AFS
4.3 ATLAS
4.3.1 Database Servers
4.3.2 1TB Milestone
4.4 CMS
4.4.1 Test-beam Activities
4.4.2 Test Beam I/O Simulation
4.4.3 100MB/s Milestone
4.4.4 Cristal
4.5 CHORUS
4.6 NA45
4.7 COMPASS
4.8 Conclusions
5 Milestone 2
5.1 Introduction
5.2 Federation Backup for Production Environments
5.2.1 Federation Backup Goals
5.2.2 Backup Tools provided by Objectivity
5.2.3 Implementation for the Objectivity Production Service
5.2.4 Example: Backup for the NA45 production federation
5.3 Database Browsers and Administration Tools
5.3.1 The CERN DRO_Tool
5.3.2 Hudson
5.3.3 The SLAC Database Browser
5.3.4 Conclusions
5.4 Data Import/Export
5.4.1 Event Store
5.4.2 Conditions
5.4.3 Copying / Shadowing
5.4.4 Summary
5.5 Conclusions
6 Milestone 3
6.1 Introduction
6.2 Event Collections
6.3 Naming and Meta-Data
6.4 Conditions Database
6.5 Conclusions
7 Milestone 4
7.1 Wide-Area Database Usage
7.1.1 Reconstruction of Simulated CMS Events
7.1.2 WAN Tests Between CERN and KEK
7.1.3 MONARC
7.2 Clustering and Re-clustering Issues
7.3 Mass Storage Integration
7.4 Multi-User, Multi-Federation Issues
7.4.1 Introduction
7.4.2 Decoupling Environments
7.4.3 Decoupling Schema and Catalog
7.4.4 Copying a Federation
7.4.5 Backing Up a Federation
7.4.6 Summary
7.5 Conclusions
8 Experience at BaBar
8.1 Introduction
8.2 Database Requirements
8.3 Overall Design
8.4 Event Structure and Placement
8.5 Processing Framework
8.6 Transient / Persistent Decoupling
8.7 Release Builds
8.8 Schema Evolution
8.9 Data Distribution
8.10 Data Protection and Safety
8.11 Access Statistics
8.12 SLAC Configuration
8.13 Production Federations
8.14 Database Commissioning and Performance
8.15 Current Status
8.16 Summary
9 Experience at DESY
9.1 H1
9.2 ZEUS
10 RD45 "White Papers"
11 Risk Analysis
11.1 Introduction
11.2 HEP Computing Models
11.3 Choice of Technology
11.4 Choice of ODBMS Vendor
11.5 The Home-Grown Approach
11.6 Summary
11.7 References
12 Objectivity/DB Enhancement Requests
12.1 Support for STL-based Collection Classes
12.2 Support for the Linux Operating System
12.3 ODBMS to MSS Coupling
12.4 SLAC AMS Enhancements
12.5 Architectural Changes to Support VLDBs
12.6 Schema Handling Enhancements
12.7 Access Control Support
12.8 ODMG Compliance
12.9 Conclusions
13 General Database Activities
13.1 RD45 Workshops
13.2 Collaboration with MONARC
13.3 Objectivity/DB User Meeting
13.4 Objectivity/DB European Technical Forum
13.5 SIGMOD
13.6 ECOOP Workshop on ODBMS
13.7 CERN School of Computing
13.8 Licensing Issues
13.9 Objectivity/DB Support
14 Other Database Developments
14.1 Object-Relational Databases
14.2 The Object Database Market
14.3 Versant
14.4 O2
14.5 Objectivity/DB
14.6 Conclusions
15 Objectivity/DB Alternatives
15.1 Introduction
15.2 Draft Requirements
15.2.1 Introduction
15.2.2 Use Cases
15.2.3 Data and Resource Sharing
15.2.4 Scope of the Data Store
15.2.5 Data and Meta-data
15.2.6 Consistency
15.2.7 Storage Manager Requirements
15.2.8 Meta-data Requirements
15.2.9 Language Binding Requirements
15.2.10 Summary
15.3 General Design Issues
15.3.1 Scope of the Prototype
15.3.2 Feasibility Issues
15.3.3 Scalability Issues
15.3.4 HEP Data Models
15.3.5 Conclusions
15.4 Architectural Overview
15.5 Design of the Prototype
15.5.1 Introduction
15.5.2 Page Server vs Object Server
15.5.3 Physical vs Logical OID
15.5.4 Transactions and Recovery
15.5.5 Lock Server
15.5.6 Schema Handling
15.5.7 C++ Language Binding
15.6 Current Status
15.7 Future Work
15.8 Manpower Issues
15.9 Conclusions
16 Standards Activities
17 Future Activities
17.1 Introduction
17.2 Production Activities
17.3 Research Activities
17.4 Summary
18 Proposed Milestones for 1999-2000
19 Conclusions
20 Previous Milestones and Recommendations
20.1 Milestones at the end of the second year (1997)
20.2 Milestones at the end of the first year (1996)
20.3 Initial Milestones and Recommendations (1995)
21 Glossary
22 References
Since 1995, the RD45 project has been investigating solutions to the problem of providing persistency to physics data of the LHC experiments, assumed to be in the form of (collections of) objects. At the end of the first year, a potential solution, based on standards-conforming products, was presented. During the second year, this possible solution was studied further: performance comparisons with existing systems and tests of functionality and scalability were carried out, and production demonstrations were made.
During the past 18 months, production services have been offered, based on a combination of the two most promising commercial products identified in the first stage of the project, namely Objectivity/DB and HPSS. These tools, together with a small amount of HEP-specific code, are now distributed via LHC++. In parallel, research has continued on a number of critical issues identified by the experiments. In addition, an in-depth risk analysis has been performed.
In this report, we summarise the activities of the RD45 project since the last review by the LCB in April 1998. We report on progress on the milestones set by the LCB, on experience with experiments such as BaBar, COMPASS and NA45, and on the on-going risk analysis, and we make suggestions concerning possible future activities.
The RD45 collaboration has continued to focus on ODMG-compliant solutions, and has demonstrated the use of a standard, off-the-shelf ODBMS product for storing and managing HEP event data in a production environment.
Despite the focus on ODBMSs, RD45 also follows progress in other areas, such as persistent object managers and Object-Relational Databases, including object-oriented offerings from the traditional relational (RDBMS) vendors.
As in the past, RD45 participates in the Object Database Management Group (ODMG), the standards body that defines and maintains the various standards for ODBMSs, as well as in the Object Management Group (OMG) and the IEEE Computer Society Executive Committee on Mass Storage Systems (IEEE MSS EC).
Since the last LCB review the RD45 collaboration has:
The technologies and specific systems identified by RD45 are used as the basis for production services at a number of HEP laboratories, including CERN, DESY and SLAC, as well as at regional centres and other institutes involved in the corresponding experiments at these laboratories. The experience gained by these experiments, some of which are acquiring data at rates and in volumes that are similar to those expected at the LHC, will be extremely valuable in helping to make a decision concerning the choice of ODBMS to be deployed at the LHC. However, it now appears that previous predictions concerning the evolution of the ODBMS market were over-optimistic. Given the lack of any convincing commercial alternative to Objectivity/DB and HPSS, the risk analysis that has been performed by the RD45 collaboration suggests that the development of a fallback alternative is required. A first step in this direction is the preparation of a revised list of requirements and estimates of the manpower that would be needed to develop such a system. In the longer term, the possibility of using a suitable Object-Relational product, which now appears likely to become the dominant DBMS technology, should also be considered.
The RD45 project was reviewed by the LCB in April 1998, and recommended for continuation for a further 18 months, with the following milestones and comments:
"The project has achieved the initial R&D goal of investigating and identifying potential solutions to the problem of persistent data storage for LHC experiments. The proposed solution: ODBMS (Objectivity) is now adopted for data persistency not only by all the LHC experiments but by many others (BaBar, NA45, COMPASS, RHIC) ready to take data in 1-2 years.
RD45 has met the 1997 milestones set by the LCB and is congratulated for its excellent work. The project has addressed many of the initial questions about the use of object databases for data persistence during its four-year lifetime and shown that commercial solutions can be applied. The project is now moving into a new phase combining the creation of a production data management service while at the same time continuing R&D to enable the technologies to be used effectively in LHC collaborations. In this new phase, in which R&D will be carried out in parallel with production activities which are equally important in validating the overall strategy, the LHC experiments are expected to actively participate in defining and performing the necessary R&D. The LCB recommends that the project be continued with the next status report scheduled in 18 months time in order to accommodate both production and R&D activities. The LCB internal referees should make a brief status report to the LCB at 6 month intervals."
LCB endorsed the following proposed milestones:
LCB asked that a list of very specific R&D activities, concerning point 4), be identified between the project and experiments, and discussed in the next meeting.
Work on these milestones and recommendations is covered in detail below.
The following is reproduced from the minutes of the LCB meeting held in December 1998. The full minutes are available via http://www.cern.ch/Committees/LCB/public/minutespub/minutes_dec98public.html.
The LCB referees reported on the RD45 project status based on the recent RD45 workshop and the extensive published white papers and other documents. The LCB notes that excellent progress is being made on both the production and R&D milestones. Important lessons are being learned from the production service, identifying issues and potential problems that will now be solved well in advance of the LHC. The LCB appreciates the hard work by the groups responsible for running this production service.
The LCB in particular looks forward to publication of results from the ATLAS 1 TeraByte tests which will begin to answer important performance questions on production size databases. In general the experiments appear to be getting more involved with the RD45 project, although the project still awaits updated user requirements and use cases from the experiments. The cooperation between RD45 and the new MONARC project should help in this regard, as well as enhance the investigation of wide area database performance.
The LCB discussed the RD45 risk analysis and agrees that it would not make sense to do a full port to a second OODBMS vendor's product at this time.
The LCB agrees with the other suggested steps to mitigate risk, with the addition of trying to insure that user code in reconstruction and analysis programs is kept as standards compliant as possible.
The second interim referees report was made at the June 1999 LCB meeting. The minutes of that meeting were not available at the time of writing, but will eventually be accessible via http://www.cern.ch/Committees/LCB/public/meetingspub.html.
The first milestone set at the April 1998 review of the RD45 project was as follows:
"Provide, together with the IT/PDP group, production data management services based on Objectivity/DB and HPSS with sufficient capacity to solve the requirements of ATLAS and CMS test beam and simulation needs, COMPASS and NA45 tests for their '99 data taking runs."
The model that has been adopted to support production Objectivity federations has been as follows:
The following text is reproduced from "Objectivity/DB Services - Installation and System Configuration" and describes the setup of production Objectivity/DB services at CERN.
An Objectivity service comprises a Lock Server and one or more Objectivity database servers. The Database servers may exceptionally be part of the Lock Server but in general will be hosted on separate system(s).
Objectivity/DB groups databases within Federations and each Federation has a unique Federation Identification number and is described by a Federation file. The Databases themselves map onto files in UNIX systems and are subdivided into Containers. The containers are further divided into Pages. Access to the database is controlled by locks allocated by the Lock Server process and this locking is performed at the Container level. Data access however is performed at the Page level.
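The storage hierarchy described above can be pictured with a small model. The following Python sketch is illustrative only: the class names, the file path and the `ToyLockServer` are hypothetical and are not part of the Objectivity/DB API, and the toy lock is a simple exclusive lock rather than the product's real locking modes. It shows only that locks are granted per container while data are read per page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Page:
    """Unit of data access."""
    number: int
    payload: bytes = b""


@dataclass
class Container:
    """Unit of locking; holds an ordered list of pages."""
    name: str
    pages: List[Page] = field(default_factory=list)
    lock_holder: Optional[str] = None


@dataclass
class Database:
    """Maps onto one file in the UNIX file system."""
    file_path: str
    containers: Dict[str, Container] = field(default_factory=dict)


@dataclass
class Federation:
    """Groups databases under a unique federation identification number."""
    federation_id: int
    databases: Dict[str, Database] = field(default_factory=dict)


class ToyLockServer:
    """Hypothetical stand-in for the lock server (exclusive locks only)."""

    def acquire(self, container: Container, client: str) -> bool:
        # Grant the lock if it is free or already held by the same client.
        if container.lock_holder in (None, client):
            container.lock_holder = client
            return True
        return False

    def release(self, container: Container, client: str) -> None:
        if container.lock_holder == client:
            container.lock_holder = None


def read_page(container: Container, number: int) -> bytes:
    """Data are fetched one page at a time, not one container at a time."""
    return container.pages[number].payload


# Build a one-database federation (all names are made up for illustration).
federation = Federation(federation_id=1)
database = Database(file_path="/data/run001.DB")
events = Container(name="events", pages=[Page(0, b"hit-data"), Page(1, b"more")])
database.containers[events.name] = events
federation.databases["run001"] = database

lock_server = ToyLockServer()
granted_a = lock_server.acquire(events, "client-a")  # container is free
granted_b = lock_server.acquire(events, "client-b")  # container already locked
first_page = read_page(events, 0)
```

In this model a second client is refused the whole container even if it only wants a single page, which is why the choice of container granularity matters for concurrent writers.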
OBJY_VERS=5.1; export OBJY_VERS
OBJY_DIR=/usr/local/Objectivity/$OBJY_VERS; export OBJY_DIR
. /afs/cern.ch/rd45/objectivity/objyenv.sh

setenv OBJY_VERS 5.1
setenv OBJY_DIR /usr/local/Objectivity/$OBJY_VERS
source /afs/cern.ch/rd45/objectivity/objyenv.csh
Table 1 - Lockserver Hosts
ools -OO_NO_AUTOREC
Two development systems are currently in place for ATLAS:
Both the above systems run Objectivity/DB V4 and V5, including the HPSS-aware AMS.
An internal milestone of the ATLAS collaboration was to store 1TB of realistic physics data into Objectivity/DB + HPSS by the end of 1998. The main purpose of this milestone was to demonstrate the feasibility of the different elements of the ATLAS software using a first approximation of the raw data model of an ATLAS event. A secondary goal was to understand the performance of the various components of the system, although actually obtaining the maximum throughput was not the main focus of the test.
The configuration that was used is shown in Figure 1. Raw events (digitisations from jet production using GEANT-3 based software, stored in ZEBRA FZ format) were staged to disk and then converted into Objectivity/DB format using 5 concurrent clients on IBM AIX systems (RSPLUS nodes). Two systems running the Objectivity/DB AMS (data server) were used, both Sun Solaris machines. The data rates achieved were as follows:
It is understood that disk contention was the primary cause for the lower data rate in the former case.
As the tests were performed over the Christmas period, the operational efficiency was lower than would normally be achieved, at around 50%. As a result, some 19 days were required to reach the 1TB milestone.
It is intended that the test be repeated, perhaps on a smaller scale, using a modified event model based on the use of a segmented VArray with an optimised iterator.
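As a rough cross-check of these figures, the average rate implied by 1TB in 19 days can be worked out directly. The sketch below is a back-of-envelope estimate, not a measured result: it assumes 1TB = 10^6 MB and reads the 50% operational efficiency as the fraction of elapsed time during which data were actually being written.

```python
SECONDS_PER_DAY = 86_400

total_mb = 1_000_000   # 1TB taken as 10^6 MB (assumption)
elapsed_days = 19      # from the text
efficiency = 0.5       # operational efficiency quoted in the text

# Rate averaged over the whole 19-day period, idle time included.
average_rate = total_mb / (elapsed_days * SECONDS_PER_DAY)

# Rate while the system was running, if only half the time was productive.
rate_while_running = average_rate / efficiency

print(f"{average_rate:.2f} MB/s averaged over the whole period")
print(f"{rate_while_running:.2f} MB/s while the system was running")
```

Under these assumptions the same volume would have taken roughly half as long at full operational efficiency.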

Figure 1 - ATLAS 1TB Milestone Logical Configuration

Figure 2 - ATLAS 1TB Milestone Physical Configuration
CMS have used Objectivity/DB for numerous test-beam activities (H2b, X5b, T9). The data acquisition framework is common (Figure 3). Only a few specific "storage classes" need to be specialized to reflect differences in the readout electronics.
For each test beam, a separate federated database is populated. These federated databases are composed of:
Two federations are used per test-beam: one online and the other offline. Data are written locally on the online system, and the databases (files) are then copied to the offline data-server system using rfcp and attached to the offline federation (see Figure 4). All Run Databases are also stored under HPSS.
Prompt data monitoring is performed directly online, by polling the online federation while it is being populated (a "commit-and-hold" is issued every few thousand events).
Offline analysis is performed directly on the data-server or remotely, accessing the data through the AMS.
Although data are also stored in HPSS, no automatic tape staging is used yet, due to the unavailability of a reliable version of the "linkable AMS". CMS plans to start tests of the Objectivity-HPSS interface, on a non-production machine, this fall.

Figure 3 - CMS Test Beam Setup
The data volume handled in each test beam is summarized in Table 2. It is worth noting that during the last X5b test-beam up to 25 concurrent users were accessing data (mostly interactively) on the offline system without any observable degradation of its response.
A new test-beam for tracker detectors will be performed in October in X5b. About 150 GB of data are expected to be collected.

Figure 4 - CMS Online and Offline Federations
The data acquisition rate is currently limited to 1MB/s by the bandwidth of the network used by the central data recording system to transfer files from the test beam area to the offline system. An upgrade of this network to 100Mb/s Ethernet is foreseen for next year.
| Test Beam | Period (Real DAQ) | Number of Database files | Total Size (GB) | Average file size (MB) | Average data transfer speed (KB/s) | Number of users |
| T9 (Tracker) | 3/5-10/5 | 350 | 150 | 500 | 800 | 10 |
| H2b (Muon and Silicon beam telescope) | 16/7-21/7 | 100 | 6 | 30 | 800 | 3 |
| X5b | 14/8-25/8 | 560 | 500 | 900 | 900 | 25 |
Table 2 - Summary of CMS Test Beam Data Volumes (1999)
To study possible performance bottlenecks of the test-beam setup, we ran a simulation of the data recording system, focusing on the Objectivity/DB part.
The simulation was run on a SunEnterprise 450 (4 X UltraSPARC-II 296MHz) with a memory size of 512 Megabytes. The system was equipped with a local 80GB striped file-system of 19 slices 128K "away".
Such a system is capable of a raw I/O rate in excess of 90 MB/s.
We created a persistent raw data structure similar to those typically written in current test beams, with objects of random size (on average 2.4KB of useful data each).
Each process produced 4000 events with 45 raw-objects each (180K raw-objects in total) corresponding to a total of 436MB of useful data. Including event structure and overhead a total of 469MB were written to disk per "run".
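The volumes quoted above can be checked with a few lines of arithmetic. This is only a consistency sketch: it assumes 1MB = 1000KB, and because the 2.4KB object size is a rounded average, the estimate lands slightly below the quoted 436MB.

```python
events_per_process = 4000
raw_objects_per_event = 45
avg_object_kb = 2.4        # rounded average quoted in the text
useful_mb_quoted = 436     # useful data per run, from the text
written_mb_quoted = 469    # data written to disk per run, from the text

# Number of raw objects written by one process.
raw_objects = events_per_process * raw_objects_per_event

# Useful data volume implied by the rounded average (1MB = 1000KB assumed).
useful_mb_estimate = raw_objects * avg_object_kb / 1000

# Event structure and overhead, as a fraction of the useful data.
overhead_fraction = written_mb_quoted / useful_mb_quoted - 1

print(f"raw objects per run: {raw_objects}")
print(f"estimated useful data: {useful_mb_estimate:.0f} MB")
print(f"structure and overhead: {overhead_fraction:.1%} of useful data")
```

The event structure and overhead therefore add somewhat under a tenth to the useful data volume in this configuration.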
Objectivity I/O was optimized (as in the real test-beam DAQ) using a federation with a 32KB page size and a data model with no associations (only VArrayT<ooRef()>). Critical run time parameters were "INDEX mode", set to 0, and the maximum cache size, set to 100 pages. A commit-and-hold was performed every 1000 events.
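The periodic commit-and-hold pattern can be sketched as a simple batching loop. The `ToySession` class and `record_run` function below are hypothetical stand-ins written for illustration; they are not the Objectivity/DB API, and they model only the idea that written objects become durable at each checkpoint while the session stays open for further writing.

```python
class ToySession:
    """Hypothetical stand-in for a database session (not the real API)."""

    def __init__(self):
        self.pending = []      # objects written since the last checkpoint
        self.durable = []      # objects made durable by a checkpoint
        self.checkpoints = 0

    def write(self, event):
        self.pending.append(event)

    def commit_and_hold(self):
        """Make pending writes durable but keep the session open."""
        self.durable.extend(self.pending)
        self.pending.clear()
        self.checkpoints += 1


def record_run(session, events, commit_interval=1000):
    """Write events, issuing a checkpoint every commit_interval events."""
    for count, event in enumerate(events, start=1):
        session.write(event)
        if count % commit_interval == 0:
            session.commit_and_hold()
    if session.pending:        # flush the tail of the run
        session.commit_and_hold()


# One simulated run of 4000 events with a checkpoint every 1000 events.
session = ToySession()
record_run(session, range(4000), commit_interval=1000)
print(session.checkpoints, len(session.durable))
```

A larger interval means fewer checkpoints but more work lost, and more data invisible to monitoring clients, if a writer fails between checkpoints.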
The test was run changing the number of processes running in parallel (from 1 to 16) and populating the same federation.
Results are summarised in Figure 5 where the total elapsed real time (and the corresponding I/O rate) is plotted against the number of concurrent processes running.
The figure includes the time spent to create databases and containers, which requires the processes to acquire a federation lock and therefore to queue.
These results show that we can write a realistic raw-data event structure using Objectivity/DB at 5MB/s per processor. Parallel I/O does not introduce major performance penalties. Indeed, in a later test, running on two similar systems concurrently (single federation, remote "system databases", local run-databases), we were able to reach a sustained aggregate speed of 40MB/s, confirming that a remote federation catalog and remote application "system databases" are not a performance bottleneck.

Figure 5 - Real Time and I/O Rate vs Number of Concurrent Processes
An internal CMS milestone, to be achieved by June 1999, was to populate an ODBMS with raw data at 100MB/s. This milestone was successfully accomplished ahead of time, and at significantly higher data rates. Using CMS multi-jet QCD events, data rates of up to 172MB/s were obtained using the Caltech 256-node SMP Exemplar system (Figure 6).
The simulated raw events consisted of 6 objects and had an average event size of around 260KB. All Objectivity/DB client processes were running on the Exemplar system, whereas the lockserver and federation catalogue were maintained on separate HP workstations.
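To put these throughput figures in terms of events, the data rate can be divided by the average event size. The sketch below assumes 1MB = 1000KB and uses the approximate 260KB event size quoted above, so the results are indicative only.

```python
def events_per_second(rate_mb_s, event_kb):
    """Event rate implied by a data rate and an average event size."""
    return rate_mb_s * 1000 / event_kb   # 1MB = 1000KB assumed


avg_event_kb = 260       # approximate average event size, from the text
milestone_rate = 100     # MB/s, the internal CMS milestone
peak_rate = 172          # MB/s, the best rate obtained

milestone_events = events_per_second(milestone_rate, avg_event_kb)
peak_events = events_per_second(peak_rate, avg_event_kb)

print(f"milestone: about {milestone_events:.0f} events/s")
print(f"peak:      about {peak_events:.0f} events/s")
```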

Figure 6 - Aggregate Throughput vs Number of Worker Processes
The CMS Cristal project uses an Objectivity/DB federated database to store information related to product and process tracking for CMS detectors. The databases are populated from Windows/NT clients. As their disk space requirements in 1999 are rather modest, two trays of 18GB disks are connected directly to the lock server in a RAID 5 configuration, including a hot-spare disk. The total usable disk space is 200GB, of which some 1.5GB were occupied at the time of writing. However, more than 1TB of data are expected over a four-year construction period, including sites at CERN, in China, France, Italy, Russia and the UK.
The main goals of the Cristal project are to provide:
The current work of the project is focussed on:
The CHORUS experiment is looking for neutrino oscillations in the CERN muon neutrino beam. For active identification of potential tau neutrinos, 1800 kg of nuclear emulsions have been exposed. The CERN Chorus group is building automatic scanning stations, using commercial, industry-standard components in terms of hard- and software wherever possible.
Extrapolations into the emulsion sheets of tracks reconstructed by the electronic detectors are obtained from DSTs and stored in an object oriented database (Objectivity), which is implemented on a server running AIX. The four scanning stations, which are equipped with one PC running Windows NT each, run Objectivity client software to obtain the predictions. On each scanning station, a digital Megapixel CCD camera is connected via interface boards both on the microscope and the PC side, with a DSP and a frame grabber in the PC. The PC also communicates via serial interface lines with the light source controller and the motor controller. The motors move the desired section of the emulsion sheet below the microscope objective.
In automatic mode, the frame grabber serves mainly for the display of the captured images from the camera and for the monitoring of the local DAQ, whereas the DSP is in charge of controlling the camera and capturing the digital images which are processed on the DSP to recognize grains. The Pentium CPU of the PC is used to run the software to recognise tracks in the emulsion. The system is designed to eventually make optimal use of the possibilities of pipelined processing offered by the hardware. Data of a given image are then sent back to the database server, where they are stored. Because of the high network bandwidth required, FastEthernet connections are being used.
Control messages about the status of each scanning station are sent to a central dispatcher running on the AIX station, from where they can be requested by virtually any machine on the LAN. The operator interacts with the scanning stations via a dispatcher client, on top of which a Tcl/Tk panel has been implemented. Manual scanning, which is performed after automatic scanning to cross-check interesting events, also uses a Tcl/Tk based interface. Another dispatcher client uses Mathematica for further quasi-online analysis and monitoring of the data.
Most currently used software components, which have been written in the framework of the project, use C++ as implementation language, with some parts written in Tcl/Tk. While the DSP code is being re-written in Assembler for performance reasons, components which involve network communication and/or user interfaces are being migrated to Java.

Figure 7 - CHORUS Scanning System
NA45, the first HEP experiment to use Objectivity/DB in production, plans to store data at 15MB/s in a data taking run starting in November 1999.
Data is collected by 4 PCs running Linux. 2 Sun 450s are used to convert the data into Objectivity/DB format, and Gigabit Ethernet connections are used to transfer the data to the CERN Computer Centre. An aggregate bandwidth of 20MB/s has been achieved, using 4 streams per sending PC, writing databases locally to the Sun systems (no AMS).
The COMPASS experiment expects to acquire over 300TB of data per year at data rates up to 35MB/s. Both these volumes and rates are similar to those of ATLAS and CMS, but need to be handled some 5 years earlier.
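A short calculation shows how the yearly volume and the peak rate relate. The sketch assumes 1TB = 10^6 MB and continuous running at the full 35MB/s, which is an idealisation: it gives the minimum recording time needed to accumulate 300TB.

```python
SECONDS_PER_DAY = 86_400

yearly_volume_tb = 300   # from the text
peak_rate_mb_s = 35      # from the text

# Time needed to record the yearly volume at the peak rate (1TB = 10^6 MB).
seconds_needed = yearly_volume_tb * 1_000_000 / peak_rate_mb_s
days_needed = seconds_needed / SECONDS_PER_DAY

print(f"{days_needed:.0f} days of continuous recording at {peak_rate_mb_s} MB/s")
```

In other words, the quoted yearly volume corresponds to roughly a hundred days of recording at the peak rate.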
In the COMPASS system, data are sent from the online system as a byte-stream. They are received in the Computer Centre and stored in Objectivity/DB, as shown in Figure 8. The setup is designed to avoid single points of failure, including mirrored disks for critical information and a hot spare for the lockserver. Tests of robustness have been made by inducing a crash with many open transactions and replacing the lockserver with the spare.

Figure 8 - COMPASS Central Data Recording
The production infrastructure required for ATLAS, CMS, NA45 and COMPASS has been provided. A number of important internal milestones have been met by these experiments, including the ATLAS 1TB and CMS 100MB/s tests. The use of central data recording has been demonstrated by the NA45 and COMPASS collaborations at data rates of 15 and 35MB/s respectively. Production use of Objectivity/DB has also been made by other experiments, including the CHORUS collaboration. Other laboratories, in particular DESY and SLAC, have also gained valuable production experience with Objectivity/DB, and the regular RD45 workshops continue to be an important forum for information exchange, identifying new requirements and following up on outstanding enhancement requests.
The second milestone set at the April 1998 review of the RD45 project was as follows:
"Develop and provide appropriate database administration tools, (meta-)data browsers and data import/export facilities, as required for [milestone] (1)."
In order to support the production services required for milestone 1, administration tools are clearly required. Although Objectivity/DB comes with a set of tools for managing Objectivity/DB federated databases, these tools are somewhat basic and typically need to be integrated into the local environment. For example, the oobackup tool allows federated databases to be backed up and restored, but will typically be embedded in a site-specific script, much the same as any Unix backup program. Similarly, whilst Objectivity provides a generic browser, there is a clear need for an extended browser that understands the object model in question, as well as local conventions, such as those used by BaBar for database naming.
Under the LHC++ Web page, an "Objectivity/DB at CERN" page has been set up. This page contains information about how the software is installed at CERN, how a local installation may be performed, where to find and how to run various test programs and benchmarks, pointers to documentation (both local and Objectivity-provided) and a description of the various tools that have been developed. These tools are described in more detail below.
To ensure reliable use of Objectivity/DB in production, a strategy for backup and restore of the production federation has to be established. The backup procedure should cater for recovery from data loss caused by either hardware faults (e.g. disk failures) or the more likely case of data corruption by faulty application software.
Since object databases like Objectivity/DB allow data located in different files to be associated (e.g. using cross-file object IDs), the recovery of a consistent transactional state of the whole federation is usually more complicated than storing and retrieving single unrelated files from tape. The backup procedure needs to be integrated with the database lockserver and should only copy data that is in a consistent state (not currently updated by an ongoing transaction). Also one has to make sure that all copied files belong to the same transactional state of the federation.
For federations containing large data volumes, one would typically like to apply different backup strategies to different parts of the data model. In many cases one can identify databases that contain data that is relatively easy to regenerate, so that data loss could in fact be accepted. On the other hand, some parts of the federation, like the file catalogue and the schema, are of central importance for the ODBMS (or the particular application), so that data loss in this area cannot be tolerated. A flexible backup facility would therefore allow a backup run to be restricted to, for example, the subset of the database files that contain important data, in order to minimise the I/O overhead, the media costs and the time required to perform the backup.
To address the problem of federation backups, Objectivity/DB provides several tools (e.g. oobackup and oorestore), which allow either complete or incremental backups of a deployed federation. A complete backup stores the content of all database files to a backup medium (e.g. tape); the incremental mode stores only the contents of those containers that have been modified since the last backup.
To guarantee consistency throughout the federation, the Objectivity tools access every database in the federation and check it for updates. For federations containing large volumes of read-only data stored in a mass storage system, this approach is clearly not practical, since it would lead to a complete transfer of all data in the MSS to disk.
An extension of the current backup tools has therefore been requested from Objectivity, which would allow the user to specify explicitly the set of databases that should be considered for the current backup. For other databases (typically containing read-only data) the user (or the mass storage system) would take the responsibility of keeping valid data copies.
Since these extensions to the Objectivity backup tools have not yet been implemented, the production service is currently using a set of scripts that implements a partial federation backup based on database file copies. To guarantee consistency the backup script acquires an update lock for all involved databases and keeps it until all files have been successfully copied. The set of databases that are copied defaults to the federation file and the boot file. This list is supposed to be extended by the user to include any other central application databases like collection registries or calibration databases.
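The approach described above (lock all involved databases, copy the files, then release the locks) can be sketched as follows. This is not the production script: `ToyLockServer`, `locked` and `backup` are hypothetical names, the lock server is an in-memory stand-in rather than the Objectivity/DB lock server, and the file names are invented for the example.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


class ToyLockServer:
    """Hypothetical in-memory stand-in for the lock server."""

    def __init__(self):
        self.update_locks = set()

    def acquire_update_lock(self, db_file):
        if db_file in self.update_locks:
            raise RuntimeError(f"{db_file} is being updated")
        self.update_locks.add(db_file)

    def release_update_lock(self, db_file):
        self.update_locks.discard(db_file)


@contextmanager
def locked(lock_server, db_files):
    """Hold update locks on all files; release them even if a copy fails."""
    acquired = []
    try:
        for db_file in db_files:
            lock_server.acquire_update_lock(db_file)
            acquired.append(db_file)
        yield
    finally:
        for db_file in acquired:
            lock_server.release_update_lock(db_file)


def backup(lock_server, default_files, extra_files, target_dir):
    """Copy the default set plus user-supplied databases under one lock scope."""
    files = list(default_files) + list(extra_files)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with locked(lock_server, files):
        # All files are copied while every lock is held, so the copies
        # belong to the same transactional state.
        return [Path(shutil.copy2(f, target)) for f in files]


# Example with made-up file names in a scratch directory.
work = Path(tempfile.mkdtemp())
fed_dir = work / "federation"
fed_dir.mkdir()
for name in ["example.FDB", "example.boot", "calibration.DB"]:
    (fed_dir / name).write_text(f"contents of {name}")

server = ToyLockServer()
copied = backup(
    server,
    default_files=[fed_dir / "example.FDB", fed_dir / "example.boot"],
    extra_files=[fed_dir / "calibration.DB"],
    target_dir=work / "backup",
)
```

Acquiring every lock before the first copy, rather than locking and copying file by file, is what keeps the copied set mutually consistent; the cost is that writers to those databases are blocked for the duration of the copy.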
The NA45/CERES production system uses a central lockserver machine, which also stores the central database meta-data such as the boot file and the federation file. Should this machine experience a hardware or software fault, it is replaced by a second identical machine that always contains a valid disk copy of all central federation files. The backup procedure in fact uses the fallback machine to store temporarily a complete copy of the production federation until it has been copied to tape.

Figure 9 - Objectivity/DB Backup Procedure
The backup procedure involving these two machines is organized as follows:
DRO_TOOL (Data Replication Option Tool) is a tool for managing the configuration of a federated database. With this tool, the database administrator is able to observe, control and manage the fault tolerance, and detect failures in the system as soon as possible.
Objectivity/DB provides tools and programming interfaces to help perform administration tasks. The tasks and tools are similar on all platforms, but lack a friendly interface (most of them are command line programs).
DRO_TOOL is a visual tool that offers the same capabilities as the command line tools included with Objectivity/DB, together with some additional functionality that is provided in the Java binding. The tool has been implemented in Java to avoid multi-platform problems and has been tested on Windows NT and Solaris.
The tool has been developed using the JDK 1.2 beta 2, and uses the JFC Swing package included with it (Swing 1.0). It also uses the Voyager package from ObjectSpace and has recently been ported to Objectivity/DB V5.1.
The tool is available via http://wwwinfo.cern.ch/asd/rd45/tools/dro_tool.html.

Figure 10 - Creating a Federation with DRO_TOOL

Figure 11 - Moving a Database with DRO_TOOL
Hudson is a database browser developed by Micram Object Technology, the distributors of Objectivity/DB in Germany. Hudson offers a number of features that are similar to DRO_TOOL, including the ability to create new autonomous partitions and databases, as shown in Figure 13.

Figure 12 - Hudson Main Window


Figure 13 - Creating a New Database
Hudson also contains a lock monitor that displays entries for all locks currently held by applications working against the federated database opened by the browser (see Figure 14).
Each entry contains information about:
The content of the lock monitor is refreshed with a period specified in the settings dialog.

Figure 14 - Hudson Lock Monitor
The following information is taken from the BaBar database browser web page.
The BaBar database browser provides both logical and physical views of a BaBar federated database. The client is implemented in Java and uses CORBA to communicate with a C++-based server that accesses the database using the Objectivity/DB C++ interface.

Figure 15 - Starting the BaBar Browser
The main GUI panel (see Figure 16) contains two tree widgets: The Database Tree and The Event Collection Tree (see Figure 17). They allow you to browse through database files and event collections respectively. Place the mouse pointer above a database icon for a brief moment: The database ID should be displayed in the form of a "tooltip".

Figure 16 - Main Window of BaBar Browser
The Database tree presents either a logical or physical abstraction of the database hierarchy. Use the "View" menu to toggle between the two modes. The logical view is a BaBar-oriented structure of domains and authorization levels. The physical view shows the placement of databases according to file hosts and filesystem paths.
Both the Database tree and Event Collection tree have pop-up menus which may be accessed by clicking the right hand mouse button. You may hide the database/collection icons if you're browsing a large federation. Also, you may choose to display the number of events by default (rather than having to double click on every collection). Use this option with care since a read lock is required in order to retrieve the number of events.
The ability to browse event collections is a high priority in the list of requirements. Double-clicking on a collection will now allow you to view the list of BdbEventIDs and any subcollections. Double-clicking on an event entry will open up a new window displaying event information. The list of event stages will be shown. If the Event contains a tag, the tag data will be listed. Please use with caution especially if other users are writing to a federation.

Figure 17 - Event Selector Window
In the case of collections that contain descriptions of tag attributes, you can make simple tag queries. The Event browser (shown above) now displays an "Event Filter" button. If the button is active, the collection contains tag descriptions. Double-clicking on an Event entry will display its tag attributes. These attributes may be used as a means of selection. Use the "Event Filter" button to open the Filter dialog window. Select an attribute and a value. Apply your selection and now the Event browser will only display events that match the selection.

Figure 18 - TagFilter Window
A number of database browsers are available, in either prototype or production form, covering a range of functionality from database administration functions to object-level browsing. There are useful features in all of these tools and it would appear to be an area where collaboration and code-sharing would be of benefit. The ideas discussed at the RD45 workshop held in July 1999 concerning an extensible browser framework clearly require further thought.
Although the movement of databases between different federations is carried out by numerous experiments, by far the largest experience in wide-area data import and export has been gained by the BaBar collaboration. We therefore describe the data distribution techniques developed by BaBar.
BaBar currently stream their data into two streams, which are stored in separate (sets of) databases and even filesystems:
It is expected that the number of streams will increase in the future, probably to four. All data distribution is performed using the "IsPhysics" stream, with data exported to centres in France, Italy and the UK, together with a full copy at IN2P3 in Lyon.
Two sorts of data export from the event store are supported:
In both cases, the tool will extract the databases in question and attach them to the target federation.
In the case of conditions data, incremental export is supported as follows:
The databases are then extracted and attached as above.
Databases may be exported between federations using a number of techniques, including copying and shadowing (see section 7.4).
The above scheme for data import and export has been used in production at BaBar since the first day of data taking (import/export between federations at SLAC). Many improvements are needed, including crash recovery and minimisation of the interference with data taking and production.
The development of the appropriate tools for managing an Objectivity/DB federation such as administration scripts, database browsers and import/export facilities is an on-going activity. Given the different environments that need to be supported, it is unclear whether generic tools can be provided and further work is required to understand if (sufficiently) extensible browsers and administration scripts are feasible. However, the provision of the appropriate tool set is clearly an important element of a production service and work in this area will continue.
The third milestone set at the April 1998 review of the RD45 project was as follows:
"Develop and provide production versions of the HepODBMS class libraries, including reference and end-user guides."
The HepODBMS class library provides an insulating layer that completes and extends that defined by the ODMG [22]. On the one hand, it extends vendor compliance and insulates the application developer from minor API changes between releases of the product, whilst on the other it extends the standard to provide a more complete interface, particularly in areas of concern to the HEP community.
The HepODBMS class library is released and distributed as part of LHC++, ensuring consistency between the library itself, the version of the underlying database (Objectivity/DB) and applications that are built on top of the library.
The main areas of activity concerning the HepODBMS libraries over the review period have been:
The user guide and reference manuals have been updated to reflect these changes. The user guide is marked up in XML, allowing versions optimised both for the screen (currently HTML generated automatically from the XML) and printer (currently PostScript) to be generated from a single source. The reference manual is generated automatically from the source code itself, using the DOC++ scheme. In addition, as is now the standard with all LHC++ components, "CERNLIB-style" short-writeups have been produced for all packages, facilitating keyword-based search and integration into the existing CERNLIB documentation scheme.
The C++ standard includes the definition of a set of collections based upon the work of Alex Stepanov and Meng Lee. These collections are frequently referred to as the "Standard Template Library" or STL [33], although the more correct term is simply the Standard Library.
The STL includes the following:
The ODMG standard includes STL-compliant collection classes. For the main STL collections, a persistent equivalent exists, designated by a leading d_.
Objectivity/DB provides persistent STL collection classes based upon the ObjectSpace implementation, which is also available in transient form via LHC++.
In addition to the persistent STL collections described above, HepODBMS provides highly scalable collections - capable of handling very large numbers of entries. They were designed to handle up to 1000 million objects, which would not be possible using the standard persistent STL classes offered with Objectivity/DB.
HepODBMS defines the following templated collection, which is usable for any kind of persistent object:
typedef h_seq<Event> EventCollection;
The interface of this collection offers the following:
For example, we may wish to access an existing event collection by name. We first define the collection, as follows:
EventCollection evtCol("/usr/dirkd/collections/myEvents");
We then need to define an iterator for this collection (STL-like):
EventCollection::const_iterator it;
We can now iterate through the collection and read individual event objects:
it = evtCol.begin();
while( it != evtCol.end() )
{
    cout << "Event: " << (*it)->getEventNo() << endl;
    ++it;
}
// support for (some) STL algorithms
int cnt=0;
count(evtCol.begin(),evtCol.end(),1,cnt);
Writing to the collection - e.g. adding new events - is shown below.
HepRef(Event) evt;
for (int i=0; i<500000; i++)
{
    // create a new event using the clustering hint of the sequence
    evt = new(evtCol.clustering()) Event;
    // store the new object ref in the sequence (only needed for ref collections)
    evtCol.push_back(evt);
    // fill the event
    evt->setEventNo(i);
}
The following example shows how the HepODBMS collection class may be used to store events. It is available via /afs/cern.ch/sw/lhcxx/share/HepODBMS/examples/createCollection/createCollection.cpp
#include "HepODBMS/clustering/HepDbApplication.h"
#include "EventSeq.h"

d_Ref<ooObj> VStore<Event>::store_clustering;

void print_evt(d_Ref<Event> evt)
{
    cout << "Event Nr=" << evt->getEventNo() << endl;
}

int main(int argc, char *argv[])
{
    //
    // Persistent event collection tests.
    //
    HepDbApplication app;
    app.init();
    app.startUpdate();
    ooHandle(ooDBObj) db_h = app.db("Sequences");
    ooHandle(ooContObj) ooc_h;
    ooHandle(ooContObj) eventCont = app.container("Events");
    ooc_h= app.container("Stores");
    ooDelete(eventCont);
    ooDelete(ooc_h);
    VStore<Event>::store_clustering = app.container("Stores",1);
    eventCont = app.container("Events");
    app.commit();
    cout << "created db and container" << endl;

    app.startUpdate();
    h_seq<Event> seq("dirks","vector");
    app.commit();
    cout << "created store object" << endl;

    app.startUpdate();
    HepRef(Event) evt;
    for (int i=0; i<5000; i++)
    {
        // create a new event, use the clustering hint of the sequence
        evt = new(seq.clustering()) Event(i);
        // store the new object ref in the sequence (only needed for ref collections)
        seq.push_back(evt);
    }
    app.commit();
    return 0;
}
The naming facility provided by HepODBMS allows a name to be associated with any persistent object that is stored in the database. A name is simply a text string that is associated with an object. Of course, it is not intended that every object in the database be named: this would not be efficient and would have a significant overhead. Naming is most useful for defining "entry points" into the database. For example, a collection of events could be given a name, such as "Higgs candidates". As a flat naming scheme is often inadequate, and would clearly not be useful in a multi-user system, HepODBMS provides a hierarchical naming scheme, similar to that of a Unix filesystem.
The HepNamingTree class provides the following Unix-like methods:
The usage of these methods is shown in the following example. The HepDbApplication class automatically places the application in the "directory" corresponding to the current username (for example, /usr/dirkd) to avoid possible conflicts between the naming trees of different users. Naturally, it is possible to navigate to any point in the naming tree.
typedef h_seq<Event> EventCol;
// initialize DB session
HepDbApplication app;
app.init("fdBootName"); // implicit cd /usr/$USER/
// move to test-beam
app.naming.changeDirectory("test-beam");
evtCol = EventCol::findByName("inputEvents");
EventCol::iterator it;
for (it = evtCol.begin(); it != evtCol.end(); it++)
{
    // do something with the event referenced by *it
}
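The behaviour of a hierarchical, Unix-like naming scheme can be illustrated with a small in-memory model. The sketch below is not the HepNamingTree class and does not use the database: `ToyNamingTree`, its methods and the use of plain integers as object identifiers are assumptions made purely for illustration.

```cpp
#include <map>
#include <string>

// Toy in-memory model of a Unix-like naming tree. It is NOT the
// HepNamingTree class; all names here are illustrative assumptions.
class ToyNamingTree {
public:
    // Start in the "home directory" of the given user, e.g. /usr/dirkd.
    explicit ToyNamingTree(const std::string& user)
        : cwd_("/usr/" + user) {}

    // Relative paths are resolved against the current directory.
    std::string resolve(const std::string& path) const {
        if (!path.empty() && path[0] == '/') return path;
        return cwd_ + "/" + path;
    }

    void changeDirectory(const std::string& path) { cwd_ = resolve(path); }

    // Associate a name with an object identifier (a plain integer here).
    void nameObject(const std::string& name, long object_id) {
        names_[resolve(name)] = object_id;
    }

    // Returns the identifier, or -1 if no object has that name.
    long findByName(const std::string& name) const {
        auto it = names_.find(resolve(name));
        return it == names_.end() ? -1 : it->second;
    }

    const std::string& currentDirectory() const { return cwd_; }

private:
    std::string cwd_;
    std::map<std::string, long> names_;
};
```

In this model, a name such as "inputEvents" registered from /usr/dirkd/test-beam is found again from that directory or by its full path, but not from another user's directory, which is the property that avoids clashes between users.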
The basic features of the BaBar conditions database were described in the RD45 Status Report submitted to the LCB in April 1998 [1]. Essentially, calibration objects are inserted into the database with a specified validity range (valid from, valid to), and then retrieved by validity instant. Multiple calibrations may exist for a given instant; default calibrations for such an instant may be set, or the user may choose a calibration explicitly.

Figure 19 - Multiple Calibration Objects
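The retrieval rule described above can be made concrete with a small model. The following sketch is not the BaBar or HepODBMS conditions database interface: the `Calibration` record, the `ToyConditions` class, the use of integer instants and the half-open validity interval are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of validity-interval lookup. Not the BaBar or HepODBMS
// conditions DB API; every name here is an illustrative assumption.
struct Calibration {
    long valid_from;      // first instant at which the object is valid
    long valid_to;        // first instant at which it is no longer valid
    std::string version;  // distinguishes alternative calibrations
    bool is_default;      // marks the default among the alternatives
    double value;
};

class ToyConditions {
public:
    void insert(const Calibration& c) {
        assert(c.valid_from < c.valid_to);
        store_.push_back(c);
    }

    // Retrieve by validity instant. An empty version string requests the
    // default calibration; otherwise the named version is returned.
    // Returns nullptr if nothing matches.
    const Calibration* find(long instant,
                            const std::string& version = "") const {
        for (const auto& c : store_) {
            bool covers = c.valid_from <= instant && instant < c.valid_to;
            bool wanted = version.empty() ? c.is_default
                                          : c.version == version;
            if (covers && wanted) return &c;
        }
        return nullptr;
    }

private:
    std::vector<Calibration> store_;
};
```

A linear scan is used only for brevity; the point of the sketch is the selection rule (validity instant first, then default or explicit version), not the storage layout.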
In order to introduce the conditions DB package into HepODBMS, dependencies on the BaBar environment and Rogue Wave Tools.h++ classes were removed and replaced, where appropriate, by the corresponding ObjectSpace classes. An example of such a class is HepTime, based on the ObjectSpace 64-bit time classes (thus avoiding Y2K-like problems until the year 32766 AD).
The current version of the conditions database is used by the NA45 experiment and is being evaluated by ATLAS, CMS and LHCb. The original version of the system is in production use at BaBar, where further enhancements have been made. It is foreseen that these enhancements be integrated into the HepODBMS version. Additional enhancements include the possible provision of a "global tag", covering calibrations from multiple sub-detectors. However, more discussions on the exact requirements are needed before design and implementation.
Production releases of HepODBMS and associated manuals have been made. The library will continue to be ported to new releases of Objectivity/DB and enhancements will be provided as required. The bulk of the library is expected to be rather stable for the immediate future, although further work on the conditions DB is expected, as described above.
The fourth milestone set at the April 1998 review of the RD45 project was as follows:
"Continue R&D, based on input and use cases from the LHC collaborations to produce results in time for the next versions of the collaborations' Computing Technical Proposals (end 1999)."
An initial list of R&D activities, proposed by the LCB referees, is given below:
We report below on a number of cases where Objectivity/DB has been used in the wide-area.
As part of the GIOD project, roughly 600,000 fully-simulated multi-jet QCD events were generated on the Caltech Exemplar system using the CMS simulation program, CMSIM. This production resulted in some 600GB of data, stored in ZEBRA FZ format in the Caltech HPSS system. Some 400,000 events were shipped on tape to CERN and reconstructed on 10 PCSF nodes in parallel, using the CMSOO program. The resultant data were stored in an Objectivity/DB federation of around 32GB in total. The individual database files, each some 200MB in size, were transferred back to Caltech using ftp and attached to a local federation using ooattachdb. The steps involved are shown in Figure 20 below. The data rate achieved for the trans-atlantic ftp was approximately 11GB/day or ~1TB/year. Scaling to a 622Mbps link, approximately 1PB/year should be achievable.
This test did not make direct use of Objectivity/DB support for WAN usage, such as writing remotely to the Caltech federation from the PCSF nodes at CERN, or the oochangedb utility to move databases from one node to another. Nevertheless, it demonstrates a pragmatic solution to the use of Objectivity/DB in the wide area, re-using standard tools and minimising risk.

Figure 20 - GIOD Data Flow
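The relation between daily volume and sustained link speed used in this kind of estimate can be written out explicitly. The helper functions below are an illustrative sketch only: decimal units (1 GB = 10^9 bytes) are assumed, and the number of transfer days per year is left as a free parameter rather than taken from the report.

```cpp
#include <cassert>

// Illustrative arithmetic only; decimal units (1 GB = 1e9 bytes) and the
// number of transfer days per year are assumptions, not report figures.
constexpr double kSecondsPerDay = 86400.0;

// Sustained rate in Mbit/s equivalent to a daily volume in GB.
double gb_per_day_to_mbps(double gb_per_day) {
    return gb_per_day * 1e9 * 8.0 / kSecondsPerDay / 1e6;
}

// Daily volume in GB that a link of the given speed carries when full.
double mbps_to_gb_per_day(double mbps) {
    return mbps * 1e6 / 8.0 * kSecondsPerDay / 1e9;
}

// Yearly volume in TB for a daily volume in GB and a number of active days.
double tb_per_year(double gb_per_day, double active_days) {
    assert(active_days >= 0.0 && active_days <= 366.0);
    return gb_per_day * active_days / 1e3;
}
```

Under these assumptions, 11GB/day corresponds to a sustained rate of about 1Mbit/s, so that scaling to a fully used 622Mbps link represents an increase of several hundred in transferred volume.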
An Objectivity/DB test-bed has been established at KEK, as part of the MONARC test-bed activities. Using this facility, a test of Objectivity/DB Data Replication (DRO) has been made, including performance comparisons between terrestrial and satellite based 2Mbps links. The tests included two scenarios:
These tests show that the basic DRO protocol functions correctly in the WAN, but that rather extensive hand-shaking occurs. In addition, the packet size, which appears to be the page size for data transfers, and 4172 bytes for "system" transfers, is not optimal for networks with long round-trip times.
In the framework of the MONARC project, a testbed working group has been set up. The goals of the working group include:
The sites currently participating in the MONARC testbed working group and the associated resources are shown in Table 3 below.
| Site | Resources |
| CERN | SUN Enterprise 450 (4*400MHz CPUs, 512MB memory, 4 UltraSCSI channels, 10*18GB disks). Use of mass storage management (HPSS) facility is being planned. |
| Caltech | HP Exemplar SPP 2000 (256 CPUs, 64 GByte memory); HPSS (600 TB tape + 500 GB disk cache); HP Kayak PC (450 MHz, 128 MB memory, 20 GB disk, ATM); HP C200 (200 MHz CPU, 128 MB memory, 10 GB disk); Sun SparcStation 20 (80 GB disk); Sun Enterprise 250 (dual 450 MHz CPUs, 256 MB memory¹); Micron Millennia PC (450 MHz CPU, 128 MB memory, 20 GBytes disk); ~1 TB RAID FibreChannel disk (to be attached to the Enterprise 250¹). ¹ Shortly to be ordered. |
| CNAF | SUN UltraSparc 5, 18 GB disk |
| FNAL | ES450 Sun Server (dual CPUs), 100 GB disk + access to a STK Silo |
| Genova | SUN UltraSparc 5, 18 GB disk |
| KEK | SUN UltraSparc, 100 GB disk |
| Milano | SUN UltraSparc 5, 18 GB disk. Access to non-dedicated facilities is available at CILEA: to a SUN system similar to the dedicated one and to the HP Exemplar SPP 2000 of the Centre, for agreed tests. |
| Padova | SUN UltraSparc 5, 117 GB disk + SUN Sparc 20, 20 GB disk |
| Roma | SUN UltraSparc 5, 27 GB disk |
| Tufts | Pentium II 300 MHz PC, 12 GB disk (+ Pentium-II 400 MHz PC, 22 GB in July) |
Table 3 - MONARC Test-bed Systems
The ODMG standard allows for a "clustering hint" that can be specified when a new instance of a persistent-capable object is created. This allows an application to request that the new object be stored physically close to an existing object, typically on the same or an adjacent database page. As disk and network I/O typically takes place in chunks that are larger than the size of individual objects, I/O can be reduced by taking advantage of such clustering. For example, if objects A and B are stored on the same page, a request to access object A will automatically bring object B into the client cache. A subsequent attempt to access object B will be satisfied from memory, thus avoiding an explicit I/O for object B.
Support for object clustering is provided directly by Objectivity/DB and further extended via the HepODBMS layer.
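The effect of clustering on I/O can be illustrated by counting the distinct pages that a job must read. The following sketch is a toy model under stated assumptions: objects are identified by their physical position, a fixed number of objects fits on each page, and each distinct page costs one I/O. The function name `pages_read` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy model: objects are numbered by their physical position and a fixed
// number of objects fits on one page. Names and sizes are assumptions.
std::size_t pages_read(const std::vector<std::size_t>& positions,
                       std::size_t objects_per_page) {
    assert(objects_per_page > 0);
    std::set<std::size_t> pages;
    for (std::size_t pos : positions) {
        pages.insert(pos / objects_per_page);  // page holding this object
    }
    return pages.size();  // each distinct page costs one I/O
}
```

With four objects per page, reading four objects stored contiguously touches a single page, whereas reading four objects scattered across the file touches four pages.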
In the HEP environment, data analysis is often increasingly selective: subsequent analyses often access smaller and smaller subsets of the data, resulting in fragmented access. To avoid such fragmentation, data were typically distilled into smaller and smaller subsets, implying redundant data copies that soon became out of date. This problem is shown schematically in Figure 21 below, where different jobs read increasingly small subsets of events.

Figure 21 - Data Clustering and Increasing Selectivity
To improve the clustering for jobs that access only a small fraction of the data, one can simply copy or move the accessed events so that they are contiguous, as shown in Figure 22.

Figure 22 - Reclustering Operation
Such reclustering has been studied in the CMS collaboration [38], where data are dynamically reclustered based on observed access patterns.

Figure 23 - Job time for accessing clustered versus unclustered data
Clustering and re-clustering have also been studied within the ATLAS collaboration [37], where a so-called Hamming algorithm has been developed, named after its use of the Hamming distance between two bit-vectors. Using this algorithm, data are clustered according to multiple access patterns, without any data duplication. By controlling the order in which data are accessed (the iteration order), this algorithm reduces the number of disk seeks to almost the theoretical limit. Performance is maintained for 15-40 access patterns, depending on the overall selectivity. For larger numbers of access patterns, the use of data duplication becomes attractive.

Figure 24 - Disk Re-clustering using the Hamming Algorithm
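For reference, the distance measure on which this approach is based can be written in a few lines. The sketch below shows only the Hamming distance between two access bit-vectors; it is not the ATLAS clustering algorithm itself. The encoding (one bit per access pattern, set when that pattern reads the event) and the names `AccessVector` and `hamming_distance` are assumptions made for illustration.

```cpp
#include <bitset>
#include <cstddef>

// One bit per access pattern: bit i is set if pattern i reads the event.
// The width of 40 follows the upper figure quoted in the text; the type
// and function names are illustrative assumptions.
constexpr std::size_t kPatterns = 40;
using AccessVector = std::bitset<kPatterns>;

// Hamming distance: the number of patterns on which two events differ.
std::size_t hamming_distance(const AccessVector& a, const AccessVector& b) {
    return (a ^ b).count();
}
```

Two events with a small distance are read by nearly the same set of access patterns, which makes them natural candidates for being stored close together.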
Tape clustering has been studied as part of the HENP Grand Challenge Project in the US. In this case, data are clustered on tape according to the distribution of events in a multi-dimensional space. This type of clustering has also been studied in the NA48 collaboration, and is suitable for cases when the number of dimensions is relatively small.

Figure 25 - Distribution of Events in a multi-dimensional Space
Other tape-related clustering issues that have been studied in CMS include filtering and clustering of data chunks cached from tape (see Figure 26).

Figure 26 - Cache Filtering and Chunk Reclustering
In conclusion, it appears that good clustering and re-clustering will be important to achieve and maintain good I/O performance. The effectiveness of any of the re-clustering strategies studied above is strongly coupled to user access patterns, which are not yet well known, and to the performance and characteristics of the storage devices that will be deployed.
Although the architecture of Objectivity/DB theoretically permits federations in excess of 1EB (1000PB), it is not currently feasible to store such a large quantity of data on disk, as is assumed by the basic Objectivity/DB architecture. Hence, a mechanism whereby inactive databases can be stored offline, e.g. in a Mass Storage System (MSS), is required. An interface between Objectivity/DB and HPSS was agreed at a joint meeting between representatives of the major HEP laboratories, Objectivity and the HPSS consortium in May 1997. The implementation of the interface has been performed as a collaborative effort between Objectivity, SLAC and CERN.
A prototype of an interface between Objectivity/DB and HPSS was presented in a previous RD45 Status Report to the LCB [1]. This prototype interface suffered from performance problems, due to the mismatch between HPSS, which is tuned to perform well on large data transfers, and Objectivity/DB, where small data transfers are used. To circumvent these problems, it was proposed that the original interface, which used direct calls to the HPSS client API, be replaced by a more traditional staging system. Although such a system would not exploit the disk-cache management capabilities of HPSS, it would enable the performance problems of the prototype system to be avoided.
The interface between the two systems is shown schematically in Figure 27, and consists of 3 components.
The use of the clean I/O interface enables alternative MSSs to be used, as has been demonstrated at INFN/Rome. This is important not just for smaller sites, which cannot necessarily afford or do not need the functionality and complexity of a system such as HPSS, but also to allow a smooth migration path to alternative systems in the future.

Figure 27 - Objectivity/DB - HPSS Interface
When an attempt is made to access a database that is already present in the stage pool (see Figure 28), the operation proceeds normally, as if the standard AMS were used. However, if the database in question is tape-resident, a stage request is generated and the open operation is blocked until the database is staged to disk. In addition, a free-space daemon ensures that sufficient disk space remains available, migrating files to tape, using a least-recently-used algorithm, as required.

Figure 28 - Staging Interface
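The staging behaviour described above can be modelled with a small least-recently-used pool. The sketch below is a toy model only: it counts whole databases rather than bytes, performs eviction synchronously instead of in a separate daemon, and uses the hypothetical class name `ToyStagePool`. It does not reproduce the actual AMS or HPSS interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Toy stage pool with least-recently-used eviction. It counts whole
// databases rather than bytes; all names are illustrative assumptions.
class ToyStagePool {
public:
    explicit ToyStagePool(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Open a database. Returns true if it was already on disk, false if
    // a stage request from tape was needed first.
    bool open(const std::string& db) {
        auto it = index_.find(db);
        if (it != index_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // mark as recent
            return true;
        }
        if (lru_.size() == capacity_) {   // free space: migrate the
            index_.erase(lru_.back());    // least recently used file
            lru_.pop_back();
            ++migrations_;
        }
        lru_.push_front(db);
        index_[db] = lru_.begin();
        ++stage_requests_;
        return false;
    }

    bool on_disk(const std::string& db) const {
        return index_.count(db) != 0;
    }
    std::size_t stage_requests() const { return stage_requests_; }
    std::size_t migrations() const { return migrations_; }

private:
    std::size_t capacity_;
    std::size_t stage_requests_ = 0;
    std::size_t migrations_ = 0;
    std::list<std::string> lru_;  // front = most recently used
    std::unordered_map<std::string,
                       std::list<std::string>::iterator> index_;
};
```

For a pool holding two databases, opening A, B, A and then C stages three databases and migrates B, since A was used more recently than B when space was needed for C.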
The interface between Objectivity/DB and HPSS is used in production at both CERN and SLAC, but with slightly different configurations. At SLAC, PFTP is used to move the data between HPSS and the stage cache (see Figure 29 and Figure 30), whereas at CERN, the SHIFT RFIO package is used. At CERN, over 1TB of data have been stored using the interface (primarily the ATLAS 1TB milestone, described in section 4.3), whereas at SLAC over 1TB of production data have been stored.

Figure 29 - SLAC Configuration with Remote Tape Access

Figure 30 - SLAC Configuration with Local Tape Access

Figure 31 - CERN Configuration
At the time of writing, both CERN and SLAC use Objectivity/DB V5.1.x. This version does not contain the official production release of the interface. In particular, the V5.1 AMS will block until an I/O operation is completed, which is clearly a problem in the case of offline databases, as the operation may take several minutes to complete. Although it is possible to increase the RPC timeout between client and server, a much cleaner solution will be provided in Objectivity/DB V5.2.
In this version, the AMS will be able to tell the client to retry after a specified time interval, thus avoiding timeouts in the case of lengthy operations. This mechanism is shown schematically in Figure 32.

Figure 32 - AMS Defer Protocol
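The client side of such a retry scheme can be sketched as a simple loop. The code below is a toy model and not the Objectivity/DB V5.2 protocol: the `Reply` structure, the function `open_with_defer` and the injected `request` and `wait` callbacks are all assumptions introduced for illustration.

```cpp
#include <cassert>
#include <functional>

// Toy model of a "retry later" reply. The reply layout and all names
// are assumptions for illustration, not the Objectivity/DB V5.2 protocol.
struct Reply {
    bool ready;         // true: the database is open and usable
    int retry_after_s;  // if not ready: seconds to wait before retrying
};

// Ask the server until it reports ready or max_attempts is exhausted.
// `wait` is injected so that callers decide how to sleep.
// Returns the number of requests sent, or -1 if the limit was reached.
int open_with_defer(const std::function<Reply()>& request,
                    const std::function<void(int)>& wait,
                    int max_attempts) {
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        Reply r = request();
        if (r.ready) return attempt;
        // No request is outstanding while the client waits.
        if (attempt < max_attempts) wait(r.retry_after_s);
    }
    return -1;
}
```

Because each request returns immediately, no single call has to outlast the time needed to stage a database from tape, which is the property that avoids RPC timeouts.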
Clearly, further work will be required to integrate the production release of Objectivity/DB V5.2 when it is released. In addition, the numerous other enhancements designed by SLAC and scheduled for this release will have to be tested and put into production. However, these activities are clearly more related to production than to R&D. Hence, the issue of Mass Storage Integration is now considered one of production.
Although it was originally foreseen that a given experiment would use a single Objectivity/DB federation, it soon became clear that multiple federations would be preferable. The revised scheme involved a single reference federation containing only the schema, a production federation containing the schema and data, as well as multiple developer federations, created by cloning the reference federation and attaching copies of the production databases as required. Experience with such a scheme has shown that the use of multiple federations can solve a number of problems, described in detail below.
Multiple consistent federations are a solution to the following problems:
These issues could also be addressed by enhancements to Objectivity/DB, such as support for private schema and data, partial backup of a federation and so forth. It is hoped that these issues will be addressed with time, reducing the need for what are essentially work-arounds.
Examples of the above cases are given below.
It would clearly be highly undesirable for the online data acquisition system to be blocked by a single read lock taken by a user who had opened a database browser. To avoid such problems, which would occur frequently during periods of intensive offline analysis, a simple solution is to deploy two federations. In such a scenario, the "offline" federation would be created as a copy of the "online" federation and both would run independently with separate lock servers and catalogs. On a periodic basis, databases would be copied from the online federation to the offline one, or attached to both federations. In the latter situation, it is clearly necessary that updates are only performed from one side, or preferably not at all.
The use of multiple decoupled federations is greatly simplified if the data flow is uni-directional. A further simplification is if databases are populated and then become readonly. This is typically not the case for databases containing metadata, which typically have to be recopied from the source to target federation.
Another frequently encountered situation is the need to share the bulk data amongst different users, each of whom requires additional private or semi-private databases and schema. Such a case can be handled by using multiple cloned federations, where each user who needs this capability starts with a clone of the production federation, i.e. a copy of the complete catalogue and schema. The production databases are visible to both the original production federation and the clone, as they are referenced by both catalogues. The clones could even use the same lockserver and federation ID as the main production database. A user could then add new schema to their private clone, and new databases and instances of their private classes (or the production classes). Naturally, these changes would not be visible to the production federation.
Such a scheme would also permit databases to be shared between different users, as is shown in Figure 33. Clearly, such a scheme would have to be used with care, and would require the separation of users' schema into different named schema. It is also somewhat inefficient, as it results in multiple copies of the master federation catalogue, with rather small differences, resulting from the end-user schema and databases.

Figure 33 - Using Multiple Cloned Federations
The recommended tools for making a copy of an Objectivity/DB federated database are oocopyfd or ooinstallfd. The use of ooattachdb is not recommended, except for the case when the database ID should be changed on attach.
In the typical HEP environment, a database backup normally applies to a small subset of the federation: it makes no sense to stage in and recopy to tape read-only databases resident in HPSS. The important data to back up are essentially meta-data (the schema and catalogue) and "registry" information, such as the naming tree, collections (of references, not data!) and so forth. As the current Objectivity/DB backup tools (oobackup/oorestore) do not support partial backup, an alternative solution has to be found. A simple technique is to make a partial copy of the federation, perhaps to a special disk area, and then dump this copy to tape. A Perl script to perform such an operation is provided in the HepODBMS tree.
The use of multiple federations provides a pragmatic solution to some limitations in the current version of Objectivity/DB. The techniques described above are used in production by a number of experiments, including both BaBar and CMS. White papers describing these techniques, together with appropriate scripts and/or other tools and documentation, are in preparation.
Extensive work on the R&D items requested by the LCB has been carried out. Much of this work has been performed within various experiments and/or the MONARC project, underlining the increasing involvement of the collaborations in these activities. Where appropriate, recommendations, tools and associated documentation have been or will be developed and will be made available as part of RD45's production activities.
The BaBar experiment at SLAC is designed to perform high-precision studies of the decays of B mesons, which are produced in e+e- collisions at the PEP-II accelerator. Some 650 scientists from 85 institutes in 10 countries are members of the collaboration. The first physics data were taken on May 26th 1999. The overall characteristics of the experiment are shown in Table 4.
| Characteristic | Size |
|---|---|
| Number of Detector Sub-systems | 7 |
| Number of electronic channels | ~250,000 |
| Raw Event Size | ~32Kbytes |
| DAQ to Level 3 Trigger | 50MB/sec (2000Hz) |
| Level 3 to Reconstruction | 2.5MB/sec (100Hz) |
| Reconstruction | 7.5MB/sec (100Hz) |
| Event Rate | 10^9 events/year |
| Storage Requirements (real + simulated data) | ~300TB/year |
Table 4 - BaBar Characteristics
The overall requirements on the database system are as follows:
In addition, the system must be capable of handling changing requirements, such as new physics goals, modifications to the detector hardware and so forth.
The overall database design consists of a number of domains, such as the event store, conditions, etc., which are logically independent, even if physically coupled through the use of the underlying technology, namely Objectivity/DB. The use of independent domains was important in that it allowed parallel development of the various domains, which were then brought together fairly late on in the development cycle. The EventStore domain is given the responsibility of being the primary transaction manager. Client-side multi-threading is not exploited, as it was not supported by the database at the time that the overall design was frozen. On the other hand, multiple database contexts (ooContext) are used to handle "mini-transactions".
The event structure and the average size of each component are shown in Table 5.
| Data Type | Name | Size per event (KB) |
|---|---|---|
| Simulated Data | SIM | ~100 |
| Truth | TRU | ~40 |
| Raw Data | RAW | ~30 |
| Reconstructed Data | REC | ~100 |
| Event Summary Data | ESB | ~20 |
| Analysis Object Data | AOD | ~2 |
| Event Selection Tag | TAG | ~0.2 (200 bytes/event) |
Table 5 - Data Types and Placement Regions
In order to optimise access, data are clustered according to the event component as shown in Figure 34. The clustering algorithm automatically rolls over to a new container / database / filesystem as required, and transparent navigation to the event data is provided by the appropriate headers. Named event collections, which may contain pointers to events and/or other collections, are provided using a Unix-like naming tree.

Figure 34 - Event Structure and Placement
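The Unix-like naming tree of collections can be pictured with a small sketch. This is an illustration only, not the BaBar implementation: the class and function names are hypothetical, and event references are represented by plain integers rather than database object references.

```python
class Collection:
    """A named collection holding event references and/or sub-collections."""

    def __init__(self, name):
        self.name = name
        self.events = []      # references to events (plain ids here), not event data
        self.children = {}    # nested collections, keyed by name

    def child(self, name, create=False):
        """Return the named sub-collection, optionally creating it."""
        if name not in self.children:
            if not create:
                raise KeyError(f"no such collection: {name}")
            self.children[name] = Collection(name)
        return self.children[name]

    def iter_events(self):
        """Yield the references held here, then those of all nested collections."""
        yield from self.events
        for sub in self.children.values():
            yield from sub.iter_events()


def resolve(root, path, create=False):
    """Walk a '/'-separated path from the root of the naming tree."""
    node = root
    for part in path.strip("/").split("/"):
        if part:
            node = node.child(part, create=create)
    return node
```

Because a collection may hold other collections, iterating over `/groups` in this sketch visits every event referenced anywhere beneath it.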
Reconstruction and analysis use the same processing framework (see Figure 35). An input module creates a skeleton event from the input source. Different input modules are used to cater for different sources. User modules may be grouped together into sequences and multiple paths are supported. A filter module can terminate the processing of a given path, which automatically initiates the processing of the next path. User modules see only those events that are passed down the path in question. Output modules provide an abstraction of an output stream, which corresponds to a set of databases or a named collection, such as /groups/multihadron/taus/theTauSample.

Figure 35 - BaBar Processing Framework
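The path and filter behaviour described above can be sketched as follows. This is a simplified illustration and not the BaBar framework itself: modules are modelled as plain callables, events as dictionaries, and the helper names (`run_paths`, `require_tracks`, `label`) are hypothetical.

```python
def run_paths(event, paths):
    """Pass one event down each path in turn.

    `paths` is a list of (name, modules) pairs; each module is a callable
    taking the event. A module that returns False acts as a filter: it ends
    the current path and processing continues with the next path.
    Returns the names of the paths that ran to completion.
    """
    completed = []
    for name, modules in paths:
        for module in modules:
            if module(event) is False:
                break          # filter rejected the event on this path only
        else:
            completed.append(name)
    return completed


def require_tracks(minimum):
    """Hypothetical filter module: keep events with at least `minimum` tracks."""
    def module(event):
        return event["ntracks"] >= minimum
    return module


def label(tag):
    """Hypothetical user module: record that it saw the event."""
    def module(event):
        event.setdefault("seen_by", []).append(tag)
    return module
```

A user module placed after a filter only ever sees the events that the filter lets through, which is the property the text relies on.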
In the BaBar model, user code deals only with transient objects. Input modules are used to create an empty transient event and loader modules convert parts of the persistent event to the transient representation. In a similar manner, convertors are used to generate persistent data from the transient information on output. This decoupling hides the database implementation from the user, which would in principle allow for a replacement of the underlying store. It also allows the data to be represented differently in the two worlds: for example, small transient objects can be grouped into a single persistent object to minimise the storage overhead associated with such objects.
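As an illustration of grouping small transient objects into a single persistent object, consider the following sketch. The class names and fields are hypothetical and no database is involved; the point is only that the two representations may differ while converting losslessly in both directions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TransientHit:
    """Small transient object, as seen by user code (hypothetical)."""
    channel: int
    energy: float


@dataclass
class PersistentHitBlock:
    """One persistent object holding many hits as parallel arrays."""
    channels: List[int]
    energies: List[float]


def to_persistent(hits):
    """Output convertor: pack many small hits into a single block."""
    return PersistentHitBlock(
        channels=[h.channel for h in hits],
        energies=[h.energy for h in hits],
    )


def to_transient(block):
    """Loader: rebuild the individual transient hits from the block."""
    return [TransientHit(c, e) for c, e in zip(block.channels, block.energies)]
```

Storing one block rather than one object per hit means that any fixed per-object overhead is paid once per block.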
During the development phase, full releases are built every 2 weeks. Both Digital Unix and Sun Solaris are currently supported; Linux will shortly be added to the list of supported platforms, which previously also included HP/UX and IBM AIX. Between the fortnightly full releases, 1-2 bug fix rebuilds and nightly "compile only" rebuilds occur. For each build, a new federation is created, with the schema built from scratch. The schema building is performed on a single platform and then exported to the others. Developers and users import and initialise test federations from the exported release federation for a given release. Code is then developed against these test federations. Schema evolution is permitted during this phase and schema changes become official with the next release build. The source code, managed using CVS, is organised into some 550 packages with a total of 5800 classes. Packages are tagged by a version identifier and releases are defined as being a set of tagged packages.
In the production phase, a reference federation, containing only schema, is used. Once the schema associated with a given release have been validated, these release schema become the new reference schema. In addition to the reference federation, there are a number of production federations, containing both schema and data. The production schema are upgraded, currently on a weekly basis, from the reference schema.
It was initially intended that the schema evolution feature provided by Objectivity/DB would be used. However, as this feature normally requires that existing executables be rebuilt, which creates problems in reproducing existing analyses, it was decided to move to explicit schema migration, where the schema are versioned "by hand". The transient-persistent mapping layer is used to perform the appropriate translation.
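One way to picture schema that are versioned "by hand" is a mapping layer that selects a reader according to a version number stored with each persistent record. The sketch below is an assumption about how such a layer might look, not the BaBar code; the record layout, the field names and the two versions are invented for illustration.

```python
READERS = {}


def reader(version):
    """Register a function that translates one persistent schema version."""
    def register(func):
        READERS[version] = func
        return func
    return register


@reader(1)
def read_v1(record):
    # Hypothetical version 1: momentum stored in MeV under the key "p".
    return {"momentum_gev": record["p"] / 1000.0}


@reader(2)
def read_v2(record):
    # Hypothetical version 2: momentum already stored in GeV.
    return {"momentum_gev": record["p_gev"]}


def load_track(record):
    """Translate a persistent record of any known version to the transient form."""
    version = record["schema_version"]
    try:
        convert = READERS[version]
    except KeyError:
        raise ValueError(f"no reader for schema version {version}") from None
    return convert(record)
```

Existing executables keep working with old data because the old reader stays registered; only the mapping layer needs to learn about a new version.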
Regional centres for BaBar exist in France (IN2P3), Italy (INFN) and the UK (RAL), with SLAC acting as a regional centre for North America. Bulk data transfer between these sites is performed using database files. For data exchange between developers, small event samples are copied, which is more efficient than exchanging complete databases. A large suite of tools, based on Perl and C++, to support the export of collections, domains and so forth is being developed. Although the Objectivity/DB FTO and DRO features are not currently used, it is intended to re-evaluate these tools at a later stage.
There is a clear need to provide some level of protection to the data. This is provided both within a federation (e.g. to protect against accidental corruption or deletion) and at the level of the administration tools, such as oodeletefd. Within a federation, authorization levels are used, supporting the concept of system, group, and user. The tools themselves are wrapped to prevent accidental misuse, although this is not ideal as the original utilities can still be found. A longer term solution is being provided as part of the joint developments between SLAC and Objectivity, which will include security hooks on both client and server sides, allowing a site to plug in its own authorisation mechanism.
In order to generate access statistics, a post-processor is used to insert code into that generated by the DDL processor. This allows statistics to be obtained on which databases and objects are accessed. It is possible to enable or disable this feature at run-time, and no measurable overhead is incurred when it is disabled.
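The idea of inserted accounting code that can be switched on and off at run-time can be sketched as below. This is an analogy in Python, using a decorator where the text describes a post-processor acting on DDL-generated code; the class, the decorator and the accessor signature are all hypothetical.

```python
from collections import Counter
from functools import wraps


class AccessStats:
    """Counts accesses per database; does nothing unless enabled."""

    def __init__(self):
        self.enabled = False          # can be toggled at run-time
        self.by_database = Counter()

    def record(self, database):
        if self.enabled:
            self.by_database[database] += 1


def instrumented(stats):
    """Wrap an accessor taking (database, object_id) so each call is counted."""
    def decorate(accessor):
        @wraps(accessor)
        def wrapper(database, object_id):
            stats.record(database)    # a single flag test when disabled
            return accessor(database, object_id)
        return wrapper
    return decorate
```

When the feature is disabled the added cost per access in this sketch is one attribute test, which is the kind of overhead one would expect to be hard to measure.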
The SLAC configuration, shown in Figure 36, is primarily based on Solaris, although Linux systems are planned. Two SMPs are provided for software development and physics analysis, where 64 CPUs will shortly be available for the latter. Eight datamovers providing the disk cache and HPSS interface are currently used, together with a 200 node processing farm and 100 node online system. The disk caches on the datamovers consist of various regions, including:

Figure 36 - SLAC Hardware Configuration
Multiple production federations are currently deployed, including:
Data is moved between production federations using a mixture of data distribution (physical copying of databases between federations using a DBID allocation scheme) and shadowing, where databases are attached to multiple federations once they are read-only.
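The report does not describe the DBID allocation scheme itself. One plausible form, shown here purely as an assumption, is to give each federation a disjoint range of database IDs so that a database can be copied or attached elsewhere without an ID clash. The federation names and the default ID limits in the sketch are illustrative only.

```python
def allocate_ranges(federations, first_id=1, last_id=65535):
    """Split [first_id, last_id] into equal, non-overlapping DBID ranges.

    The default limits are an assumption, not a statement about the product.
    """
    if not federations:
        raise ValueError("at least one federation is required")
    size = (last_id - first_id + 1) // len(federations)
    if size == 0:
        raise ValueError("more federations than available IDs")
    ranges = {}
    for index, name in enumerate(federations):
        start = first_id + index * size
        ranges[name] = range(start, start + size)
    return ranges


def owner_of(dbid, ranges):
    """Return the federation whose range contains `dbid`, or None."""
    for name, ids in ranges.items():
        if dbid in ids:
            return name
    return None
```

With disjoint ranges, the ID alone identifies the federation in which a database was created.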
At the time of the RD45 workshop in July 1999, BaBar production (event reconstruction) was running at the rate of some 10-15 events/s (with peaks of 50Hz) on 60-70 nodes (up to 150 nodes observed). The target is to reach 100Hz (the rate at which events are acquired from the detector) by February 2000 on some 200 nodes, without sacrificing robustness.
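A rough feel for the size of the gap can be obtained from the figures quoted above. The calculation below is a back-of-the-envelope sketch that assumes identical nodes and pairs the extremes of the quoted ranges; it is not a measurement.

```python
def per_node_rate(events_per_second, nodes):
    """Average reconstruction rate per node, in events/s."""
    return events_per_second / nodes


# July 1999: 10-15 events/s on 60-70 nodes.
current_low = per_node_rate(10, 70)     # pessimistic pairing, about 0.14 events/s
current_high = per_node_rate(15, 60)    # optimistic pairing, 0.25 events/s

# Target: 100 Hz on some 200 nodes.
target = per_node_rate(100, 200)        # 0.5 events/s

# Required improvement in per-node throughput under these assumptions.
speedup_min = target / current_high
speedup_max = target / current_low
```

Under these assumptions the per-node rate has to improve by a factor of roughly 2 to 3.5.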
We describe below on-going work on performance optimisation of the reconstruction farm.
In addition to this work, and perhaps more important in the eyes of the end-user, the performance of the analysis environment needs to be improved. This latter activity was only just beginning at the time of writing and hence cannot be described in this report. However, details will be distributed via the normal RD45 mailing lists as soon as they become available.
Performance optimisation is clearly an on-going process. Since the start of BaBar data taking in May, a number of successful steps have been carried out, as follows:

Figure 37 - BaBar Job Startup Time
Additional optimisations that have been performed include the following:
As a result of these steps, made in a single week after the appropriate hardware had been installed, the overall throughput was improved by some 224% (see Figure 38).
No saturation in performance is seen up to 110 nodes, although at the time of writing (15th September) a small degradation (less than 10% at 150 nodes) is still seen if more nodes are used.
Figure 38 - BaBar Reconstruction Throughput
A number of other optimisations are in progress, namely:
Possible further steps include:
| Statistic | Value |
|---|---|
| Number of Sites Using Objectivity/DB | >15 (USA, UK, France, Italy) |
| People who have signed license agreement | ~500 |
| People who created a test federation | ~260 |
| Simultaneous users (oolockmon output) | 80-90 |
| Developers (created or modified a persistent class) | ~30 (10 "experts") |
| Number of Persistent Classes | 450 |
Table 6 - BaBar Database Usage Statistics
The BaBar database system, based upon Objectivity/DB and HPSS, is now in production and has already written several tens of TB of data. The basic concepts of an ODBMS have been shown to be a good match to the problem, although a number of issues, primarily related to performance, scalability and robustness, still need to be addressed. A schedule is in place to address these concerns and other problems that will no doubt arise as the data volume and usage builds up. At the time of writing, significant improvements in the overall performance of the reconstruction federation had been achieved, although additional enhancements are still required. Attempts to optimise the performance of the analysis federation were only just starting and hence cannot be covered in this report. However, it is expected that some of the techniques used on the reconstruction federation will also be of benefit on the analysis side. Progress on both reconstruction and analysis federations will be closely monitored and reported on via the RD45 mailing lists and Web pages.
The H1 experiment at DESY has recently implemented a tag database built on Objectivity/DB. In the current Fortran-based system, event selection is performed using index files, based on a 32-bit classification word. The tag database, on the other hand, contains:
The basic aim of moving to a tag database is to speed up data access by providing improved event selection possibilities. Interfaces to the tag database are provided for C++, Java and Fortran, including an Objectivity/DB Data Interface Module (DIM) for Java Analysis Studio (JAS).
Further details regarding the H1 tag database can be found via
http://www-h1.desy.de/~benisch/talks/rd45_990712.ps.gz.
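The difference between selection on a 32-bit classification word and selection on a tag database can be sketched as follows. The contents of the H1 tag database are not reproduced here, so the attribute name `q2` and the record layout are hypothetical; the sketch only contrasts a fixed bit test with an arbitrary predicate on per-event attributes.

```python
from dataclasses import dataclass, field


@dataclass
class TagEntry:
    event_id: int
    classification: int                              # 32-bit classification word
    attributes: dict = field(default_factory=dict)   # hypothetical tag variables


def select_by_bits(entries, mask):
    """Index-file style selection: keep events with every bit of `mask` set."""
    return [e.event_id for e in entries if (e.classification & mask) == mask]


def select_by_tag(entries, predicate):
    """Tag-database style selection: keep events whose attributes satisfy `predicate`."""
    return [e.event_id for e in entries if predicate(e.attributes)]
```

A bit mask can only express combinations of classes decided in advance, whereas a predicate on tag attributes can express a cut chosen at analysis time.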
The ZEUS experiment has built a similar system to the above, replacing their existing ADAMO-based system [23]. The Objectivity/DB-based system ZES has been in operation since late 1997. Information stored in ZES includes:
The ZES federated database currently consists of some 350 databases, each of which is between 200 and 400 MB in size. Each database contains about 10 run objects, each of which in turn consists of some 10K event objects. For each event, the DAQ information, some 250 physics and detector variables and a pointer to the full event in ADAMO are stored.
| Year | ZES version | # events | Data Volume | Per Event |
|---|---|---|---|---|
| 1995 | V1.0 | 7.9 x 10^6 | 12.8GB | 0.7KB |
| 1996 | V2.0 | 20 x 10^6 | 48.3GB | 0.9KB |
| 1997 | V2.0 | 30.4 x 10^6 | | |
| 1998 | V2.0 | 9.5 x 10^6 | 10.6GB | 1.1KB |
| Total | | 77.8 x 10^6 | 71.7GB | 0.9KB |
Figure 39 - ZES Storage Requirements
ZES is integrated into the standard ZEUS analysis environment, for which there are typically between 0 and 100 simultaneous users (normally around 20). Typically, a user reads between 10^2 and 10^3 runs, corresponding to some 10M events.
Since the time of the last LCB review, a series of topical "white papers" has been produced. The main purpose of these papers is to provide information in a concise and easily digestible form in a more timely manner than is possible via LCB status reports.
A number of these documents concern guidelines and recommendations for production deployment of Objectivity/DB. Others document work on the use of Java agents and a database administration tool written in that language. The bulk of the remainder is concerned with the on-going RD45 risk analysis and issues arising.
A list of titles is given below. In addition, the revised risk analysis report is reproduced in its entirety in chapter 11.
The following is a direct copy of the RD45 "Risk Analysis Update" - CERN/RD45/99/05.
This document is essentially an annotated version of "Risk Analysis and Management", CERN/RD45/98/08. In that document, proposed actions were written in red. Comments on or updates to these actions appear in green, as this text.
In this document we attempt to identify the key risks involved in the strategy for HEP data management that is currently proposed by the RD45 collaboration. In each of the key areas, we make proposals for future work aimed at minimising, or at least developing a better understanding of, the corresponding risk factor.
We emphasise that risk analysis and management is an on-going effort and the issues described in this document build on earlier RD45 work, as presented to the LCB, LCRB and DRDC.
In addition, this document is intended to serve as the basis for discussion. The action items listed below are merely suggestions; the concrete list of actions will be drawn up as a result of the discussions held with the experiments and the LCB.
It is clear that the choice of computing model can have a significant impact on the viability of a given data management strategy. As these models evolve, we suggest that their feasibility in terms of the then-current RD45 strategy is reviewed. A series of white papers (of which this document is a member) is being produced by RD45; these propose guidelines for various issues concerning the deployment of a data management system based upon Objectivity/DB. These guidelines are deliberately conservative and avoid the use of features that we believe are immature, with which we have insufficient experience, or where further testing is required. It is expected that these recommendations will evolve with time, but the adoption of proven technology is strongly recommended.
An example of such an issue would be the use of collaboration-wide Objectivity/DB federations in a computing model. As a federation is a tightly-coupled entity, any computing model based upon the deployment of a single federation across multiple sites would necessarily require very close collaboration between all institutes involved on operating system, compiler and Objectivity/DB versions.
Work on LHC Computing Models should be closely coordinated with that undertaken by RD45. The risks associated with each model should be evaluated and the risk factor should be included as part of the overall assessment of the viability of the model.
Close links between the MONARC project and RD45 are maintained. RD45 presentations at MONARC meetings can be viewed at
http://www.cern.ch/MONARC/docs/rd45_presentations.html.
The solution currently proposed to the needs of the LHC experiments in terms of data management is based upon an ODBMS coupled to a Mass Storage System (MSS). In this document, we only consider the risks associated with the ODBMS layer, although clearly the risks associated with the MSS layer need to be evaluated. Alternative solutions to an ODBMS, such as the use of an RDBMS, an Object-Relational solution, "light-weight" or other persistency mechanisms, or conventional file-based I/O have a number of significant restrictions that made the choice of an ODBMS more appropriate. However, in terms of a fall-back solution, one of these alternatives might turn out to be viable. For example, Ardent Software has recently announced an ODMG-compliant C++ wrapper for a number of RDBMS products, including ORACLE, Sybase and Informix. There is a good chance that at least one of these products, or at least similar technology, will continue to exist in 10-20 years. Would such a wrapper offer the functionality that is required? Can the underlying technology offer sufficient performance and scalability to meet the needs of the LHC experiments? What are the chances that the wrapper or equivalent software will continue to exist, even if the underlying RDBMS is still available?
RD45 should evaluate the functionality, performance and scalability of ODMG-compliant C++ wrappers to ORACLE with a view to developing a potential fall-back solution.
Based on information obtained from ORACLE at a data management workshop at SLAC, it appears unlikely that the RDBMS vendors will attempt to address the PB region until there is sufficient commercial demand. This is not expected to occur in time for the 2001/2 choice of a solution for the LHC startup in 2005. In addition, the solutions that are currently available (typically C++ wrappers) are provided by small companies with no long-term guarantee of survival. Hence it is felt that no active work in this area is appropriate at this time, although the evolution should clearly be followed.
It is clear that the database market will continue to evolve. Existing products will provide additional functionality and new ones will emerge. Other database paradigms may be developed, perhaps even more suitable to HEP than the ODBMS model.
RD45 should continue to monitor and evaluate developments in database products and alternative database solutions. This includes obtaining relevant industry reports, as well as making product evaluations.
This continues to be part of RD45's activities. To this end, RD45 intends to participate actively in the 1st European Workshop on Object Databases, whose theme is central to this discussion.
The ODMG standards for ODBMS products are implemented, at least partially, by a number of database vendors. The ODMG is aligning itself more closely with the OMG, and may well merge with the OMG, e.g. become a sub-group, in the future. At the same time, it is taking steps towards formal ISO approval of the standards that it has developed. It is therefore likely to remain at least the de-facto standard in this area, if not become the de-jure one.
RD45 should continue to monitor the activities of the ODMG and continue to push for vendor-compliance.
The ODMG appears to be diverging from the pure and simple definition of Object Database standards. Currently, their main activities focus on Java, with C++ and Smalltalk work appearing to be of somewhat lower priority. In 1998, the group replaced "Database" with "Data" in its name, suggesting that they were addressing implementations other than ODBMSs, such as interfaces to RDBMSs. A new revision of the standard (ODMG V3) is planned for later in 1999. However, increased vendor compliance, particularly as regards the C++ binding, is expected to proceed slowly.
With the partial exception of Versant, all currently known alternatives share a common drawback with respect to Objectivity/DB, namely the lack of the federation concept. Conventional file-based approaches, be they based on sequential or random access, have the further disadvantage that they require at least one, if not two, additional layers. These layers provide a file (database) catalogue and event directory. Existing experience in HEP suggests that several man-years are required to develop and maintain each of these layers, whilst still not offering the functionality offered by a fully distributed federation.
An estimation of the manpower involved in developing and maintaining the additional layers required over file-based or multiple independent database approaches should be made. An evaluation of the loss of functionality with respect to a fully-distributed federation should be included.
No such estimate has yet been made directly by RD45. It is felt that experience at LEP, and that gained by experiments such as CDF and D0 and others who have adopted similar strategies, will provide sufficient information on these approaches.
Of course, the best guarantee that a given technology will continue to exist is demand. This does not guarantee that a specific implementation will always be available, but that equivalent functionality will be supported. A key area where we currently have atypical requirements is that of scale: few applications require scalability into the PB region with correspondingly high data rates. On the other hand, storage requirements are rapidly increasing everywhere. Numerous industry analyses predict that databases of the order of 1PB will be relatively commonplace in the early years of the next millennium. Some predictions suggest that they will be as large as 10PB. Although the storage capacity that will be supported by commodity databases is hard to predict, it is almost certain that VLDBs will require distributed databases with relatively tight coupling, e.g. consistent schema, inter-database references. This is an area of particular importance to us, and one where we can predict a demand that goes far beyond the realm of HEP.
A revised version of the report "Object Databases and Mass Storage Systems: the Prognosis", CERN/LHCC 96-17, should be produced to help assess the likelihood that the requirements of the LHC experiments in the area of data management will be met.
A revised report is currently being prepared by the "PASTA" technology tracking team. The PASTA team covers essentially all of the storage-related issues discussed in the Prognosis, with the exception of ODBMSs. An update concerning ODBMSs is available in "Persistent Object Manager Choices", CERN/RD45/99/01 [4].
There are clearly risks involved with the choice of "supplier" of a given solution. This is true regardless of whether the solution is developed in-house or acquired commercially or otherwise, particularly given the extremely long time-scales of the LHC. It is impossible to absolutely guarantee the survival of a given vendor over some 15-20 years, just as it is impossible to guarantee the survival of a home-grown solution over such a long period of time. Risks involved with the choice of a vendor include not only its long term survival potential, but also its focus on a given market segment. In both of these cases, it is important to understand if the vendor considers the market segment to be of importance, and if it is of sufficient size as to guarantee at least the medium-term survival of at least one solution satisfying the requirements of the market segment.
Over the past few years, Objectivity has refocussed its strategy on the high-end, concentrating on areas where its product is architecturally most appropriate. These include markets where there is a demand for scalability and performance, probably the two issues of highest priority to HEP. It currently dominates the scientific and technical market, in areas ranging from High Energy Physics to Astrophysics, Geophysics, medical physics etc., and has a strong presence in the telecommunications market. There is no doubt that CERN has had a significant impact on this change of focus, and on Objectivity's success in the scientific market place. Their success in the telecommunications market should also be of interest to us: here there is a strong requirement for distribution, performance and reliability. There is also a significant amount of money: current estimates predict that the global satellite-based telecommunications market will be worth some $40 billion by the year 2010. Objectivity's involvement in the IRIDIUM project, and likely involvement in other such efforts, is likely to provide them with significant revenue, a crucial component in ensuring long-term viability.
RD45 should develop a very clear understanding of Objectivity's marketing strategy and financial model. RD45 should continue to be pro-active in helping to define the future direction of Objectivity, by participating in Developers Conferences and Technical Fora. Visits should be made to a number of key Objectivity customers to understand their strategies for risk management. Suggested customers include Motorola/Iridium, Nortel, Fisher-Rosemount and the various Astrophysics projects. The possibility of collaborating with other Objectivity customers, including an assessment of benefits, should be evaluated.
Objectivity believes strongly in the message preached in Geoffrey Moore's "Crossing the Chasm" book [6]. As such, they believe that it is necessary to concentrate on, and dominate, one particular market, which will then enable them to move on to new markets and thus expand. During 1998, they were consistently profitable and doubled the number of new customers with respect to the previous year. The evolution and stability of the company will continue to be closely monitored.
Given that it is impossible to guarantee the survival of any vendor over the timescales required for the LHC, a more pragmatic approach should be applied. We need to be aware of the issues involved in a potential migration to a different product, including the costs and timescales involved. Based on the experience gained from past migrations in HEP, and evaluations of alternative products, such as O2, ObjectStore, POET and Versant, it is clear that such a migration would require a significant effort and would have an impact on both the Computing and Object models of an experiment. The current fall-back solution to Objectivity/DB, for technical reasons, is considered to be Versant. This product also supports a distributed database architecture, although it is in a number of ways less suitable for our purposes than Objectivity/DB. However, at the time of writing, Versant's financial situation is somewhat unhealthy and the product cannot be considered a viable long-term alternative: there is no clear reason to believe that Versant has a better chance of survival than Objectivity. What is more relevant is whether at least one company that provides a viable solution will continue to survive and what the implications of adopting an alternative solution are.
RD45 should develop a clear understanding of the issues related to the use of an alternative ODBMS product, including issues relating to the lack of support for the federated database concept of Objectivity/DB.
It is felt that the concept of a federation is a fundamental and important distinction between a "catalog of files/DBs" approach and an "ODBMS" approach. At a technical level, Versant remains the only realistic alternative ODBMS to Objectivity/DB. However, the company has suffered poor financial results recently, which unfortunately is probably as indicative of the ODBMS market as a whole as it is of Versant as a company.
The next 1-2 years are likely to be a critical period for the ODBMS market and Objectivity in particular. Can the company grow from its current size to have a true international presence? Can it survive the transition from private to public company? It is likely that a new CEO will be appointed on a similar timescale; how will this affect the company's stability? What if the company were to be acquired by another vendor, particularly one not interested in the current technology?
These are all questions that are hard to answer and clearly the situation needs to be followed closely. However, given the importance of the product to CERN, we should consider how we could maximise Objectivity's chances of survival and minimise our risks. As the past has shown, CERN's use of Objectivity/DB has had a positive effect on the company as a whole. We should therefore consider which activities are both in the interests of the organisation and could have a beneficial effect on Objectivity, such as participation in relevant conferences (IEEE Mass Storage Symposia, VLDB, OOPSLA etc.).
RD45 should continue to participate in relevant conferences and make presentations on our experience with ODBMS technology.
Presentations have been accepted at the 16th IEEE Mass Storage Symposium and the 1999 SIGMOD conference and have been submitted to VLDB 99. Participation is also planned at the 1st European Workshop on Object Databases.
It is clear that such steps cannot guarantee the survival of the company, and we should prepare for the worst-case scenario. As such, we should ensure that we have access to the source code and build procedures in case the company fails or is acquired.
CERN should add an ESCROW-like clause to the contract with Objectivity that guarantees access to the source code and build procedures. This procedure should be verified at the time of each major release of the product. It should be possible to invoke this clause even if the company continued to exist, but was unable to support the product in a suitable manner for the HEP community.
Attempts to agree on a revised contract with Objectivity were not conclusive: pending delivery of the outstanding enhancement requests, any discussion of additional licenses, and hence contractual changes, is on hold. It now appears optimal that any significant increase in the number of licenses be closely tied to the choice of production system, foreseen for around 2001. Nevertheless, a small number of licenses, e.g. for Linux or the HPSS interface, may still be required in the interim.
Even with access to the source code and build procedures, the issue of long-term support needs to be considered. Given that several HEP laboratories are using this technology, we should consider the possibility of sharing resources to provide medium-long term support, or of contracting the support out to a 3rd party.
RD45 should develop strategies whereby Objectivity/DB could be supported should the above clause need to be invoked. The cost of such support should be estimated, whether provided in-house, as a collaborative effort with other HEP laboratories and/or Objectivity customers, or via a 3rd party.
Without access to the source code, such an estimate is essentially impossible to make. Nevertheless, it is believed that the current support team in IT/ASD would be able to support and perhaps even develop Objectivity/DB if given access to the code base. It is our firm recommendation that any decision to use Objectivity/DB for the production period of the LHC be tied to a source code license or, at the very least, a "working-ESCROW". Such an agreement would ensure that CERN personnel were allowed to build each major release on e.g. Linux and NT to verify that the material held in ESCROW was usable and complete.
We have seen that a given release can be used for something like 1-2 years, before compiler and operating system changes require a new version. The above strategies should be sufficient to ensure that we are able to use the product for at least one complete change in operating system and compiler versions and hence provide time for a migration strategy to be developed and implemented. It is certainly not impossible that the product could be supported throughout the entire lifecycle of the LHC. Such a scenario would certainly be preferable to developing a product from scratch and almost certainly offer advantages over migrating to an alternative product. Parts of the product that are of little or no interest to CERN or the HEP community could be dropped, such as the SmallTalk binding.
An alternative option would be to initiate a joint project with Objectivity and other HEP laboratories. This could allow us early access to the source code and provide a framework in which enhancements of interest to HEP were implemented using HEP resources.
Another option suggested by Objectivity would be for CERN (HEP) to take a stake in Objectivity and thus have a vote on the Board.
Ways of increasing the chances of Objectivity's survival as a company and/or product should be investigated. Proposals should be made whereby we can be reasonably confident that the software will continue to be available for the lifetime of the LHC, even if supported in-house or via some other mechanism.
In 1999-2000, the Motorola IRIDIUM project goes live, and several hundred TB of data should be collected in Objectivity/DB in various HEP experiments worldwide. In addition, numerous presentations of CERN and RD45-related work are planned for conferences such as the IEEE Mass Storage Symposium, VLDB, SIGMOD and so forth. Beyond these activities, there appears to be little that we can do to increase Objectivity's chances of survival.
However, probably the most realistic approach to ensuring long-term access to the Objectivity/DB software is to obtain a source code license. This would permit us to provide better support than using a binary license, offer greatly enhanced protection against failure or take-over of the company and provide a much better starting point for an eventual "HEP solution" than zero. The cost involved in such a license, which is likely to be significant, should be weighed against the extra protection and possible manpower savings in case of problems.
A source code license of Objectivity/DB should be investigated around the time of a decision to use Objectivity/DB for the production phase of the LHC, i.e. around 2001.
As described above, such an option continues to be strongly recommended by RD45.
It is likely that CERN will complete the port of Objectivity/DB to Linux and is already scheduled to help develop tests for a new test framework currently under construction. Hence, we are likely to gain much greater insight into the porting, build and test processes for Objectivity/DB, which would significantly increase our chances of providing medium-long term support in-house.
Over the past 3-4 years, considerable experience in the use of an ODBMS and the issues related to large-scale object data management has been gained. A final fall-back alternative, should no suitable commercial solution exist in the long term, would be to design and implement a HEP-specific system, based upon the experience of the last few years and that we hope to gain from the production deployment of Objectivity/DB. The costs of developing and supporting such a system, particularly in the long term, are likely to be very significant. Given the importance of such a system to the research goals of the laboratory, it would be extremely important to first develop a clear set of requirements and to develop the system, should this be considered to be desirable, with reliability and "cost-of-ownership" as very high priorities. Such an approach could not address the needs of current experiments, but could perhaps be an alternative for a medium to long-term backup strategy.
An estimation of the manpower required to design, implement and support a HEP data management system capable of satisfying the key requirements of the Computing Models of the LHC experiments should be made. The possibility of joint collaboration with other laboratories should be considered.
Industry estimates of the effort required to develop a full ODBMS are typically of the order of 100-150 man-years. Such effort is clearly beyond the resources currently available in the IT/ASD group. However, the GEANT-4 collaboration has clearly shown that world-wide collaboration between HEP institutes is possible, and that highly complex pieces of software can be developed in a fully distributed manner. In addition, the considerable experience already gained in ODBMS technology would be a very important factor in both the design and implementation of a possible HEP data management system. Before any realistic estimate can be made, it is clearly necessary that a revised list of requirements is prepared. However, such a list is required in any case. It will be needed to help develop acceptance tests for an ODBMS (or equivalent product), to understand and better formulate our needs for further enhancements to Objectivity/DB and as part of the on-going work on LHC computing models. A first step in this direction could be the HEP ODBMS Testbed [7], discussed in another RD45 note.
Given the recommendations to establish an ESCROW agreement and eventually a source license, an estimate should also be made of the manpower needed to support and extend a solution based upon the Objectivity/DB code base.
Risk analysis and management is an on-going activity that is fundamental to the goals of RD45. The possible steps to evaluate and minimise the risks involved that are outlined in this report require a non-negligible investment, but are considered to be of highest priority.
We have listed above a number of areas where further investigations could be performed. We believe that, given the appropriate resources, a strategy can be developed whereby we can be assured that the required functionality remains available as long as is required and that such a strategy should be affordable both in terms of manpower and direct costs.
Given the approval of the LCB of the steps outlined in this document and allocation of the necessary resources, we would propose to implement the above recommendations and report back on the results to the LCB.
The RD45 reports listed below (e.g. CERN/RD45/xxxx) may be obtained via http://wwwinfo.cern.ch/asd/rd45/reports.htm.
Based upon our experience with Objectivity/DB, a number of important enhancement requests have been identified. We list below the main requests and their current status. Smaller issues such as minor bug fixes are not covered in this report, but are closely followed with Objectivity.
As previously reported, Objectivity have provided STL-based collection classes, using the ObjectSpace implementation, since V5.0 of the product. The issue of providing ODMG-compliant, STL-based persistent-capable collection classes is therefore considered closed.
Questions related to the interoperability of the ObjectSpace and vendor-supplied standard libraries are being followed with ObjectSpace directly.
The use of PC hardware has increased dramatically in HEP over the past few years. Indeed, there are proposals to focus on such hardware, running either Windows or Linux operating systems, for the medium-term future. Commercial software, such as Objectivity/DB, has been available for some time for the PC platform, but for Windows NT/95 only. Since around the time of CHEP 97, the need for Linux versions of the various commercial packages that are part of the overall LHC++ environment became evident. An initial port of Objectivity/DB to the Linux operating system was made available in mid-1998. About this time, the preferred compiler for Linux changed from g++ to egcs, and a port for this compiler was requested. This version was delayed due to limitations in the compiler that prevented code generated by Objectivity's DDL processor from compiling. These problems were isolated by CERN in early 1999, and a fix to the compiler was obtained. A full release of Objectivity/DB V5.1 was made available in April 1999 and the CERN license agreement was extended to cover this platform as well (for both C++ and Java bindings) in May 1999. This request has thus been satisfied.
An interface between Objectivity/DB and HPSS was formally requested in May 1997. A first "proof-of-concept" prototype was delivered in November of that year and tested at CERN and SLAC, as well as other laboratories, from early 1998. The interface provided is not tied to HPSS: its use with other Mass Storage Systems has been successfully demonstrated. The current interface has been used in production at both CERN and SLAC, with data volumes of 1TB at CERN and many TB at SLAC. Due to a limitation in the most recent production release of Objectivity/DB, V5.1.2, an attempt to access a tape-resident database on a given dataserver will block all other requests to that node until the initial request is satisfied. This limitation will be removed in the forthcoming V5.2 release, but the issue cannot be considered closed until this release has been received and fully tested.
As part of the work to extend Objectivity/DB to interface to MSS, a number of additional enhancements were proposed by SLAC. These features, which are fully described in a SLAC document, are all scheduled for the V5.2 release of Objectivity/DB. Included in these enhancements are:
Although the 64-bit OID of Objectivity/DB is theoretically large enough to cope with all LHC event data, practical constraints, in particular the file size, limit federations to around 500TB-1TB. In the current architecture, 2^16 databases (files) are permitted per federation. By using databases in the 10-20GB range, federations up to 1PB would be possible. To create federations up to 100PB, changes to the architecture are required. One possible change would be to allow containers, rather than databases, to be mapped to files. This would effectively permit 2^32 files. Other possible implementations include a longer OID, allowing multiple federations to be operated on as part of a single transaction, and so forth. No firm commitment regarding the implementation schedule or technical details of the eventual VLDB solution has yet been obtained from Objectivity. However, they are now receiving requests for such a capability from customers outside HEP and hence expect to provide such a facility in the medium term. Meanwhile, it is clear that both BaBar and COMPASS will hit the current limits in the 2000-2001 timeframe. Given the lead times in Objectivity releases, it is not unlikely that a stop-gap solution, such as the use of multiple federations, will be required.
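The file-count arithmetic behind these limits can be checked with a short calculation. The following is a minimal sketch, not part of any Objectivity/DB tool: the function names are hypothetical and decimal units (1GB = 10^9 bytes) are assumed.

```python
# Illustrative arithmetic only. Decimal units are an assumption.
GB = 10**9
TB = 10**12
PB = 10**15


def federation_capacity(num_files: int, file_size_bytes: int) -> int:
    """Upper bound on federation size when each unit maps to exactly one file."""
    return num_files * file_size_bytes


def file_size_needed(target_bytes: int, num_files: int) -> float:
    """Average file size required to reach a target federation size."""
    return target_bytes / num_files


current_files = 2**16   # databases (files) per federation today
proposed_files = 2**32  # files if containers, rather than databases, map to files

for size_gb in (10, 20):
    capacity = federation_capacity(current_files, size_gb * GB)
    print(f"2^16 files x {size_gb} GB -> {capacity / TB:.0f} TB")

for label, files in (("2^16", current_files), ("2^32", proposed_files)):
    needed = file_size_needed(100 * PB, files)
    print(f"100 PB over {label} files -> {needed / GB:.3f} GB per file")
```

With 2^16 files, a 100PB federation would need files well above 1TB each, which is why a change such as mapping containers to files is discussed.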
Enhancements to the way in which the schema for persistent C++ classes are handled are required such that it is easy and transparent to develop applications across multiple federations. In other words, the developer should be able to build an application using a test federation, and not the production federation of a given experiment. This would require, for example, that no type numbers are hard-coded into the header files produced by the DDL processor. An acceptable solution would be to adopt a similar mechanism to that employed in the Java binding, where the type number is determined at run-time. Such changes should be compatible with currently supported features, such as support for named schema, for classes of the same name in different named schema, and for schema evolution.
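The idea of resolving type numbers at run-time rather than compiling them into generated headers can be illustrated as follows. This is a sketch under stated assumptions: `Federation`, `Binding` and the numbering scheme are hypothetical and do not represent the Objectivity/DB API or the actual mechanism of its Java binding.

```python
class Federation:
    """Hypothetical federation schema: maps (schema name, class name) to a type number."""

    def __init__(self, name, first_type_number):
        self.name = name
        self._next = first_type_number
        self._types = {}

    def register(self, class_name, schema="default"):
        # Assign the next free type number the first time a class is seen.
        key = (schema, class_name)
        if key not in self._types:
            self._types[key] = self._next
            self._next += 1
        return self._types[key]

    def type_number(self, class_name, schema="default"):
        return self._types[(schema, class_name)]


class Binding:
    """Looks up type numbers by name at run-time and caches them per federation."""

    def __init__(self, federation):
        self._federation = federation
        self._cache = {}

    def type_number(self, cls, schema="default"):
        key = (schema, cls.__name__)
        if key not in self._cache:
            self._cache[key] = self._federation.type_number(cls.__name__, schema)
        return self._cache[key]


class Track:
    """Application class; it carries no hard-coded type number."""


production = Federation("production", first_type_number=1000)
production.register("Event")
production.register("Track")
production.register("Track", schema="calo")  # same class name, different named schema

test = Federation("test", first_type_number=5000)
test.register("Track")

# The same application class resolves to different numbers in each federation.
print(Binding(production).type_number(Track))
print(Binding(test).type_number(Track))
```

Because the application refers to classes only by name, the same build can be pointed at a test or a production federation without regenerating headers.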
In the current version of Objectivity/DB, access control, based on client credentials, is not supported. It is a requirement that such support be added to a future version of Objectivity/DB. Such access control must work consistently across the entire federation, be supported by both language bindings and tools and support both rôle-based (e.g. DBA) and user-based activities. Given the difficulty of implementing a consistent authentication scheme on all relevant nodes in a federation, exits, e.g. at database open time, where site-specific code may be called would be a valid, if not preferred, solution.
The "SLAC extensions", originally foreseen for Objectivity/DB V5.2, are expected to provide sufficient hooks as to permit a site-specific implementation of an access control mechanism. However, this item clearly cannot be closed until the V5.2 release has been made and the functionality has been evaluated.
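An exit called at database open time, where site-specific code decides whether access is granted, might look as follows. This is only a sketch: `DataServer`, `Credentials`, `site_policy` and the database naming convention are hypothetical, and the actual hooks offered by the V5.2 extensions may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Credentials:
    user: str
    roles: frozenset = frozenset()


# Signature of the exit: (credentials, database name, mode) -> allowed?
OpenExit = Callable[[Credentials, str, str], bool]


class AccessDenied(Exception):
    pass


class DataServer:
    """Hypothetical server that calls a site-supplied exit before opening a database."""

    def __init__(self, open_exit: Optional[OpenExit] = None):
        self._open_exit = open_exit

    def open_database(self, creds: Credentials, db_name: str, mode: str = "read") -> str:
        if self._open_exit is not None and not self._open_exit(creds, db_name, mode):
            raise AccessDenied(f"{creds.user} may not open {db_name} for {mode}")
        return f"{db_name}:{mode}"


def site_policy(creds: Credentials, db_name: str, mode: str) -> bool:
    """Example site-specific rule set combining role-based and user-based checks."""
    if "DBA" in creds.roles:
        return True                      # role-based: DBAs may do anything
    if mode == "read":
        return True                      # everyone may read
    return db_name.startswith(f"user_{creds.user}_")  # user-based: own databases only


server = DataServer(open_exit=site_policy)
dba = Credentials("admin", frozenset({"DBA"}))
alice = Credentials("alice")

print(server.open_database(dba, "production_events", "update"))
print(server.open_database(alice, "user_alice_ntuple", "update"))
```

Keeping the policy in a site-supplied callable avoids the need for a single authentication scheme on every node of the federation.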
The C++ binding of Objectivity/DB is not fully ODMG-compliant in a number of areas. For example, the ODMG specifies methods d_activate() and d_deactivate(), which are called when an object enters or leaves scope. It is a requirement that fully ODMG-compliant bindings be provided for all of the languages of interest to HEP (C++, Java), although vendor extensions, for the purpose of performance, are acceptable if clearly marked as such. The Objectivity/DB documentation and training material should be based on the corresponding ODMG language binding.
Unfortunately, improvements in the ODMG compliance of Objectivity/DB are not foreseen for at least the next two releases. Thus, we should assume the current binding and develop whatever work-arounds are required.
A number of important new features, such as STL-based collections and support for the Linux operating system, have already been provided. Release V5.2 of Objectivity/DB, currently targeted for end-September 1999, should satisfy the most time-critical of the remaining enhancement requests, with the exception of VLDB support. Obtaining a satisfactory solution to these issues remains high on the priority list of RD45.
Beyond V5.2 and VLDB, the main enhancements expected in the medium term concern production-level reliability and performance tuning, primarily resulting from the experience of experiments such as BaBar. Objectivity have understood the importance of ensuring the success of BaBar and are reacting accordingly.
In addition to the work-items related to the LCB milestones and recommendations, the following activities are worthy of note.
As in previous years, a number of RD45 workshops have been held at CERN. These workshops have been attended by approximately 50 people, typically including representatives from Brookhaven, CERN, DESY, FNAL, IN2P3, KEK, LBL, SLAC as well as experiments based at these sites. In addition, people working on other large ODBMS projects, not necessarily based on Objectivity/DB, have also attended. The most recent workshops were held in July 1999 and October 1998, the latter being transmitted live using "Web-casting". This technique proved successful and it is planned that future workshops be videoed in this manner. The most recent workshop spanned four full days, covering topics such as Mass Storage Systems, their integration with Objectivity/DB, site and experiment reports, status of enhancement requests, Objectivity/DB futures, as well as progress on the LCB milestones.
The next RD45 workshop is scheduled for the week prior to CHEP 2000, i.e. January 31 to February 4, in B160 1-009 at CERN.
There are clearly areas of overlap between the RD45 and MONARC projects. Members of each project also participate in the other and a series of joint meetings has been held. The initial meetings were devoted to an overview of RD45 activities and ODBMS features. More recently, there have been presentations on MONARC activities at RD45 workshops, showing a bi-directional flow of information.
As in previous years, an Objectivity/DB Developers' Conference was held in Santa Clara in May 1999. Although a rather brief meeting (one and a half days of presentations from Objectivity users and Objectivity themselves), it provides an excellent opportunity to meet Objectivity developers, as well as other users including those from HEP. At this year's meeting, given the continued delays in the release of Objectivity V5.2, little of what was presented was new. However, Objectivity continue to state that their market is that of scalable, distributed systems, including areas such as Telecoms and Science, and that they continue to be profitable.
Objectivity initiated a user meeting devoted to their European customers in October 1998, the first meeting being held at CERN. At this meeting, each site was asked to provide a short list of enhancements to Objectivity/DB. The highest priority items, listed in descending order, are given below. Only items voted for by more than one customer are listed.
With the possible exception of Unicode support, all of these enhancements are of importance to the HEP community. As these items have been requested by multiple customers, typically from different fields, there is a better chance that they will be implemented than if they were of specialist interest.
The next European Technical Forum will be held on October 28-29 1999 at the European Southern Observatory.
The 1999 SIGMOD conference on Management of Data was held in June in Philadelphia. An invited tutorial on the management of PB databases was given at this conference, helping to spread awareness of the work of RD45. However, with the exception of some follow-on work to the SHORE project, namely PREDATOR, there appears to be no public-domain ODBMS under active development. Similarly, the main interest appears to lie with RDBMSs and "object extensions".
At the June 1999 European Conference on Object Oriented Programming, a specialist workshop on ODBMSs was held. The aim of the workshop was to bring together researchers working in the field of object oriented databases and to discuss the work that is going on. As there are relatively few papers on ODBMS issues presented at conferences these days, a topic for discussion was whether it was felt that all research in this area has been completed, where the future lay and so forth. A presentation on RD45 was made at the workshop, including a summary of the current risk analysis. Unfortunately, the general conclusions were that the overall market is indeed small and not likely to grow significantly in the foreseeable future. However, on the positive side, it seems that the main technical issues are indeed considered solved: one simply has to refer to the appropriate research paper and implement the algorithm in question.
The 1998 CERN School of Computing [35] included a track on Petabyte Storage. This included lectures on HPSS, Objectivity/DB and their combined use in HEP. The 1999 school will include a track on Software Building using LHC++, which will contain lectures on Data Storage and Access in HEP, based upon HepODBMS and Objectivity/DB. As well as being included in the proceedings of the respective schools, the material from these courses is available through the Web, including the associated exercises (with solutions!) and will be incorporated in the IT tutorial series at CERN.
The existing licenses for Objectivity/DB, covering both C++ and Java bindings, have been extended to include the Linux operating system (in addition to DEC, HP, IBM, Sun, SGI and NT). Objectivity have confirmed in writing that the HPSS interface, including enhancements to the AMS to support multiple threads, will be made available at no charge under the current contract.
A number of new products are expected in the late-1999/2000 timeframe. These include so-called "Active Schema", an option that provides ODMG-compliant run-time schema access, and Rational Rose integration, permitting the generation of DDL from Rose and reverse engineering from existing DDL files. It is likely that licenses for these options will need to be acquired. Discussions with Objectivity are continuing on these issues.
In preparation for production services, negotiations started in early 1998 concerning adequate support and on-site consultancy. This resulted in a number of important changes, including direct access (via e-mail) to Objectivity/DB developers. In addition, an Objectivity consultant for the HEP market was identified. This consultant, who had previously worked at LBL on BaBar, made a number of visits to CERN, organised to coincide with important COMPASS and NA45 test runs.
Significant pressure was also brought to bear on Objectivity regarding reliable release dates for their product, together with advance information about the exact contents of a given release. Objectivity has introduced an internal bulletin on product status, which is forwarded to CERN on a weekly basis. Previously, Objectivity/DB was developed on Solaris and NT and released for these platforms before the work on porting to other systems commenced. This inevitably led to delays and incompatibilities. Recently, they have moved to nightly builds on all platforms. Hence, at "code-freeze" time, the system builds on all supported platforms. This move is welcomed as it is likely to result in a significant improvement in the roll-out schedule as well as reducing the probability of cross-platform incompatibilities.
The code-freeze date for V5.2 was end-August 1999, with a target release date of end-September. Given the changes described above, we can be reasonably confident that V5.2 will be available at CERN in the October timeframe. As there are a number of significant changes in this release, it is not unlikely that "point" bug-fix releases (e.g. V5.2.1) will be made. However, it is reasonable to assume that V5.2.x will be used for production in 2000.
It is too early to confirm that these changes will indeed result in more reliable schedules and better support. However, they do indicate that Objectivity is reacting to our concerns, which in itself should be considered positive.
During the past year, the use of relational databases (RDBMS) together with object technology continued to increase. Techniques for mapping object models to relational ones are now well understood, and there are numerous packages that assist in automating this process. In the immediate future, the use of an RDBMS together with a mapping layer does not appear to be a viable solution to the needs of the LHC experiments, as the underlying databases are not yet capable of scaling to the multi-PB region. At a workshop on large-scale data management held at SLAC in October 1998, a representative of ORACLE stated that they had no plans to address these data volumes until there was sufficient commercial demand, although object-relational technology is central to their plans, as witnessed by the ORACLE 8i release. Thus, although we can be confident that (O)RDBMS products will be capable of handling such data volumes at some stage during the lifetime of the LHC, it is unlikely that this will happen early enough as to affect the choice for the initial production phase. Nevertheless, as it now appears likely that the mainstream DBMS technology will be RDBMS with sufficient object-extensions as to satisfy business needs [39], the use of this technology should be considered in the medium to long term.
It is now clear that previous predictions concerning the growth of the ODBMS market were overly optimistic. There is no evidence of significant growth: the market appears rather constant at around $100M. This is no doubt at least partly attributable to the fact that all RDBMS vendors offer some sort of "object solution". Indeed, some analysts attribute the perceived down-turn in the ODBMS market to people waiting to see what products such as Oracle 8i offer in terms of OO support. The ODBMS vendors claim that the market has picked up again in recent quarters and is now growing at a healthy rate. Whether this indicates a long-term trend or is just a fluctuation is not clear at this time.
It is generally accepted that for suitable data models and access patterns, ODBMSs do have a clear and measurable performance advantage over RDBMS products. Unfortunately, this advantage is most critical for large volumes of data. As the amounts of data that can be conveniently handled with RDBMSs continue to grow, coupled with improvements in storage and greater CPU power, the corner of the market available to ODBMSs is perhaps even shrinking. Thus, it now appears most likely that ODBMS technology will continue as a niche market, but will not make serious inroads into the RDBMS market.
The Versant Object Database has been considered the primary commercial alternative to Objectivity/DB for some time. Performance and scalability measurements of this product have been reported in previous RD45 status reports and an evaluation of the issues relating to porting existing applications has been made. Unfortunately, the company has fared rather poorly over the past year or so, although the most recent results show a return to profitability. The number of employees has been cut dramatically and at least one of the key developers has left the company. As a result, Versant is no longer considered to be a viable long-term alternative to Objectivity/DB, although this of course could change with time.
The O2 ODBMS [19], one of the systems considered in the early days of the RD45 project, is probably closest to the ODMG standard in its OQL and C++ bindings. Certain features of the product, such as its limited scalability and support for heterogeneity, meant that it was not considered a viable candidate for handling LHC event data. However, it was a well-known ODBMS product, used particularly widely in the University environment. Recent reports from O2 customers suggest that the system will no longer be developed or maintained. Although O2 was acquired by UniData, now part of the Ardent software company, it no longer appears to be marketed by this company, and the previous O2 Website responds with "o2tech.com is no longer hosted on this server."
Objectivity continues to prepare for a public flotation, focussing strongly on quarterly results. They have been consistently profitable over several quarters, which has allowed the firm to expand some 20% in personnel. Should this trend continue, they could expect to reach the size of a company such as NAG or TGS (some 150-200 people) prior to the startup of the LHC. Such an expansion, including a strong local support team in Europe, would be necessary if adequate resources for product development and support are to be made available.
Much has been made of the Iridium satellite telephone project in the past years, which uses Objectivity/DB as a component of the overall software. As is now well known, Iridium failed to adapt to the dramatic change in market of the mobile phone and hence has not been the financial success that was predicted. However, this is in no part due to Objectivity/DB itself. Objectivity continues to perform well in the telecoms market as a whole, which is perhaps the market most likely to provide long term survivability of the company.
The ODBMS market has not (yet?) taken off as predicted: only one vendor (Objectivity) stresses its ODBMS product in its Web pages, while all others refer to a wider range of applications. A degree of rationalisation has occurred, with existing vendors focussing on different areas of the market. It is expected that the number of vendors will continue to decrease over the next 1-2 years, resulting in a small number of products that are each dominant in different sectors. A large fraction of the activities of all DBMS vendors, both ODBMS and RDBMS, is related to Java. If this trend continues, it is not unlikely that the C++ side will suffer correspondingly. The emergence of a new ODBMS product, particularly one capable of scaling to multiple PB, is considered rather unlikely on the timescale of the 2001 decision on the ODBMS for the LHC experiments. On the other hand, it now appears likely that ORDBMS technology (basically extensions to existing RDBMS products) will dominate and that these products will eventually address the data volumes expected at the LHC. Thus, the potential use of these products should also be considered in the longer term.
The CMS Computing Technical Proposal [27] states (section 3.2, page 22):
"If the ODBMS industry flourishes it is very likely that by 2005 CMS will be able to obtain products, embodying thousands of man-years of work, that are well matched to its worldwide data management and access needs. The cost of such products to CMS will be equivalent to at most a few man-years. We believe that the ODBMS industry and the corresponding market are likely to flourish. However, if this is not the case, a decision will have to be made in approximately the year 2000 to devote some tens of man-years of effort to the development of a less satisfactory data management system for the LHC experiments."
Based on the conclusions of the risk analysis (see section 11 above) and given the current state of the ODBMS market, an estimate of the effort required to produce an alternative to Objectivity/DB is clearly required. It should be stressed that the intent at this stage is not to implement such an alternative, but to obtain a better understanding of what could be produced if required and the resource implications. A necessary first step in producing such estimates is a list of requirements. Such a list was produced in the early stages of the RD45 project [10], but clearly needs to be revised in the light of the experience gained during the past years. A first attempt at such a revised list is presented below. We then discuss the scope of a prototype system that attempts to meet these requirements, present the main architectural features of the system, then describe the current status and possible future directions.
The following draft requirements are based upon the experience gained in using ODBMS technology over the past 5 years. To position these requirements, we first describe a number of use cases typical of the HEP problem domain and then discuss in general terms the overall features that a system needs to provide. Finally, we provide a preliminary list of requirements for the main components of a potential system.
Use cases that need to be catered for include:
Although these use cases are by no means exhaustive, they do demonstrate the wide range of needs that must be catered for. In each of these three cases (which are strongly coupled, if the end user is to be provided the possibility to access any part of the event store) there are different priorities. For example, in the case of data acquisition and online reconstruction, priorities include scalability to sufficiently large data volumes and excellent throughput (close to that of the underlying filesystems). The main priority of a physics working group is somewhat different: probably reliability and consistency are of greatest concern. On the other hand, an individual physicist requires features such as ease of use, facilities for exporting subsets of the data, the ability to create private classes and instances thereof and so forth. These rather different needs must all be carefully considered if a single system is to be used efficiently for all such purposes.
Sharing data and attribute definitions is particularly important if one is to ensure consistent results across multiple users or analysis groups. In addition, the use of shared data offers the potential of better resource usage: there is less overhead due to multiple data copies and a higher chance of finding useful data in the staging pool or memory cache. This in turn translates to fewer data transfers to and from tape and over the network. Enabling such shared access is an important design decision that needs to be made as early as possible: it is far from easy to convert single-user systems into multi-user ones at a late stage.
An early goal of the RD45 project was to find a solution capable of handling persistent data of all types. In other words, the system should be capable of handling the production data of an experiment (raw data, reconstructed data, analysis data, event tags etc.), statistical data (histograms etc.), detector-oriented data (calibration data etc.), as well as associated meta-data. It is assumed that the data store should meet the requirements of all of these domains.
The bulk of the data in an HEP experiment is composed of data that is write-once and then read-only. These data, the raw data, are never modified. The reconstructed data, which also represents a significant data volume, is gradually rendered obsolete by more recent versions, perhaps produced with more recent calibrations, better algorithms or simply as the result of a bug fix. On the other hand, there remains an important fraction of the data that is frequently updated and for which concurrent access must be supported. Much of this data can be classified as meta-data: database schema, calibration data, tags and so forth. In the past, experiments have developed solutions for handling these updates without resorting to database technology. Examples of such systems include DBL3, FATMEN, HEPDB [28] and OPCAL. However, all of these packages serialised updates through a single server, a highly non-scalable solution. Alternative approaches include separate systems for data and meta-data. However, since the boundary between the two is somewhat arbitrary and since "navigation" between "data" and "meta-data" objects is essential (for example, raw data to calibration), the advantages of a unified approach are clear.
In a large, distributed storage system, the problems of data consistency are significant. Problems that must be handled include:
Handling such situations is often performed "by hand" in today's experiments, a highly labour-intensive and unrewarding task. It is clearly highly desirable that the data store handle such cases consistently and, where possible, transparently.
The principal requirements that must be met by the storage manager are as follows:
Meta-data is usually defined as being "data describing data", that is, data that makes other data usable. Examples of meta-data include the database schema, detector calibration data, catalogue and naming information, descriptions of event collections and so forth.
Based upon our experience with a commercial tool such as Objectivity/DB, we can identify the following additional requirements, that is, those that are not currently satisfied by the above tool. As such, the following list is clearly far from exhaustive.
The system should be able to handle schema and catalogue entries from several thousand users, without interfering with the production schema and catalogue.
The system should check for consistency between the classes used by the application, the schema stored in the system and the persistent objects residing in the store.
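One way such a consistency check could be approached is sketched below: the class layout seen by the application is reduced to a fingerprint and compared with the layout recorded in the store. This is only an illustration under stated assumptions. The function names `schema_fingerprint` and `check_consistency`, the representation of a class as a list of (member name, type name) pairs, and the use of a SHA-256 hash are all hypothetical and are not taken from the prototype. The sketch covers only the comparison between application classes and stored schema, not the persistent objects themselves.

```python
import hashlib


def schema_fingerprint(class_name, members):
    """Hash a class layout given as (member name, type name) pairs in declaration order."""
    text = class_name + ";" + ";".join(f"{name}:{type_name}" for name, type_name in members)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_consistency(app_schema, stored_schema):
    """Return the sorted names of application classes that are missing from,
    or differ from, the schema recorded in the store (hypothetical layout)."""
    mismatched = []
    for name, members in app_schema.items():
        stored = stored_schema.get(name)
        if stored is None:
            mismatched.append(name)
        elif schema_fingerprint(name, stored) != schema_fingerprint(name, members):
            mismatched.append(name)
    return sorted(mismatched)
```

A client could run such a check when it opens the store and refuse to read or write objects of any class that is reported as mismatched.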
The system should cater not only for "production mode", where production schema are relatively stable, but also for the continuous development mode, where schema changes are very frequent.
It is assumed that the most important language binding, at least for the immediate future, is C++. However, it is clear that other languages, in particular Java, will also play an important rôle and hence a strategy for adding new language bindings must be defined from the start. The question of interoperability of the different language bindings, in particular at the level of accessing data stored from other supported languages, must also be carefully considered.
For the C++ binding, important requirements include:
The key requirements can thus be summarised as follows:
As described above, the scope of the prototype is to directly address the conclusions of the risk analysis requested by the LCB, namely to investigate the manpower requirements should it be decided that an alternative data management solution is required. It should be stressed that there is no intention to produce a product at this stage and that continued production services based upon Objectivity/DB remain a high priority.
In addition, it is clear that a small prototype could be used for other purposes, such as prototyping new ideas: for example, the effect of read-ahead, the impact of using different page sizes, and so forth.
Although at the start of the RD45 project it was estimated that a full ODBMS implementation required something like 150 man-years to bring to market, these estimates were based upon 1980s technology. Since that time, numerous ODBMS products have been brought to market, an important existence proof. It is also important to note that ODBMSs are no longer considered a research topic: the important issues are already well researched and documented in the literature. Finally, together with the wealth of experience that has been gained within the RD45 project, it is felt that the CMS estimate (see section 15.1 above) of some tens of man-years is more realistic for an implementation that was started today. This is not inconsistent with the above estimates: in addition to the improvements in technology that have occurred over the past ten years, such as to the C++ language itself, it is assumed that not all features of a full commercial system would be required. For example, it is currently assumed that features such as OQL and ODBC would not be required, at least in an initial version, and that the Smalltalk language binding can be safely ignored. Other features, such as synchronous replication, could perhaps be implemented in a simpler manner, taking into account the features of HEP event data.
To develop a system capable of meeting the draft requirements listed in section 15.2 above in time for the "2001 ODBMS selection", that is, by the end of 2000, would require a significant amount of effort in the intervening period. However, the development of a system in time for the ramp up to the LHC, with a fully functional, tested and documented system ready by mid-2003, would appear more feasible. It is nevertheless clear that much more detailed manpower studies, which can only come out of further prototyping, are required before any decision can be taken in this area. It is our intent to write detailed white papers on the key architectural components listed below, to allow extensive discussion of the ideas that are being studied, together with the preparation of realistic manpower estimates, by the end of 1999. These issues would then form a key element of the RD45 workshop that is scheduled for January 31 to February 4, 2000.
It is clear that any new system should avoid the scalability problems of existing systems. This is somewhat easier for a "clean-sheet" design than for a deployed system, as there are no legacy constraints. Not only does the scalability of the storage hierarchy need to be considered, but also issues related to the handling of meta-data, i.e. the schema and catalogue.
Although HEP data models are complex, with many associations, these associations are typically rather local, e.g. within a given event. More generally, the data consists of weakly-coupled groups of tightly-coupled objects. Although a general purpose system cannot afford to make any assumptions about the data models that will be used, it would be possible to exploit this fact to simplify and optimise the system. Similarly, it is well known that the bulk of HEP data is read-only, potentially leading to significant simplifications. Can these features be exploited to result in a reduction of the implementation and support effort, whilst still offering sufficient flexibility? A possible approach would be to mark data as read-only in the catalogue. This would avoid the locking overhead for such data, improving the overall scalability of the system and support for multiple concurrent clients.
There appears to be a great deal of scope for exploiting the features of HEP data and ways of working to simplify the overall system, whilst retaining sufficient flexibility to meet the overall requirements from the different application domains. The ideas presented above are clearly far from complete, but represent some initial thoughts in this area. Further discussions are clearly required and will be organised through RD45 meetings, mini-workshops and discussions with the experiments.
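The idea of marking data as read-only in the catalogue can be illustrated with a minimal sketch. The names `CatalogueEntry` and `lock_needed`, the two access modes, and the rule that every access to updatable data takes a lock are assumptions made for this illustration only. They do not describe the prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    """Hypothetical catalogue record for one database file."""
    path: str
    read_only: bool


def lock_needed(entry, mode):
    """Decide whether a client must contact the lock server before opening a file.

    Files marked read-only in the catalogue can never change, so readers skip
    locking altogether; any attempt to update such a file is rejected.
    """
    if mode not in ("read", "update"):
        raise ValueError(f"unknown access mode: {mode}")
    if entry.read_only:
        if mode == "update":
            raise PermissionError(f"{entry.path} is marked read-only in the catalogue")
        return False
    return True
```

Under these assumptions, the bulk of the event data would generate no lock traffic at all, leaving the lock server to deal only with the small, frequently updated fraction such as calibration data and tags.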
The overall architecture that is currently being prototyped is based upon the following storage hierarchy:
In order to satisfy requirements ranging from those of individual end-users to the global requirements of the production event store, the smallest usable database would consist of a single file. Scaling up, one could work with a set of files defined by a file-system directory, with optional schema, as shown in Figure 40.

Figure 40 - A Set of User Files
Finally, one could construct a production federation out of multiple domains, as shown in Figure 41. Given the fact that all OIDs are relative, it would be trivial to move files or domains. It would also be possible to attach files to multiple domains or domains to multiple databases in read-only mode.
The above architecture is clearly both flexible and scalable and addresses a number of issues that are seen today. For example, by segmenting the catalogue and schema, the chance of locking the entire federation is greatly reduced. It also simplifies issues such as backup, data exchange and replication, which can work on a single domain, which is by definition logically coherent.

Figure 41 - A Production Federation Consisting of Multiple Domains
The main design choices of the current prototype are listed below, with more detailed descriptions in the following sections.
Commercial databases are implemented using a wide range of architectures. Some systems, such as Objectivity/DB, use a "fat-client" / "thin-server" model, where the bulk of the intelligence resides on the client side. Other systems, such as Versant, are built upon a more server-centric model. These two approaches can be classified as:
The page-server approach, being that adopted by Objectivity/DB, is well understood in the HEP community, and is that used in the current prototype.
Another important choice that needs to be made concerns the mapping of the OID. Again, Objectivity/DB and Versant implement different approaches: the former using a physical OID, the latter a logical OID. A physical OID leads to more predictable access times and better scalability, at the cost of less flexibility when it comes to re-clustering. A logical OID allows objects to be moved without changing their identity, but typically requires a redirection table, which does not scale well to large stores. As described in section 15.4 above, an interesting combination could be the use of physical OIDs within domains, allowing the possibility of logical OIDs across domains. The current prototype, however, implements only physical OIDs.
The OID mapping that is used consists of four fields: the domain, file, page and slot. In the current prototype, the domain and page fields occupy 32 bits each, whereas the file and slot fields occupy 16 bits each. These allocations are parameterised and further study is required to determine the optimum values.
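As a minimal sketch of the four-field layout described above, the following Python fragment packs and unpacks an OID using the field widths quoted for the current prototype. The order of the fields within the packed value, the names `OID`, `pack_oid` and `unpack_oid`, and the choice of a single integer as the packed form are assumptions made for illustration; the text does not specify the prototype's actual encoding.

```python
from typing import NamedTuple

# Field widths in bits, as quoted in the text (parameterised in the prototype).
DOMAIN_BITS, FILE_BITS, PAGE_BITS, SLOT_BITS = 32, 16, 32, 16


class OID(NamedTuple):
    domain: int
    file: int
    page: int
    slot: int


def pack_oid(oid):
    """Pack the four fields into one integer; the field order is an assumption."""
    widths = (DOMAIN_BITS, FILE_BITS, PAGE_BITS, SLOT_BITS)
    packed = 0
    for value, width in zip(oid, widths):
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        packed = (packed << width) | value
    return packed


def unpack_oid(packed):
    """Inverse of pack_oid: peel the fields off from the least significant end."""
    fields = []
    for width in (SLOT_BITS, PAGE_BITS, FILE_BITS, DOMAIN_BITS):
        fields.append(packed & ((1 << width) - 1))
        packed >>= width
    slot, page, file, domain = fields
    return OID(domain, file, page, slot)
```

With these widths an OID occupies 96 bits in total. Changing the four constants is all that is needed to experiment with other allocations.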

Figure 42 - OID Mapping
A variety of different approaches are used by DBMSs for handling transactions and recovery. One such technique is the use of log or journal files, where undo or redo records are stored. An alternative mechanism is the use of shadow pages, where the new data are written directly to the database, but accessed indirectly through a page map. Modifications or new data are written to free physical pages and only become visible at commit time, when the new page map is written to the database. Shadow paging is well adapted to page-server environments, is robust to system crashes, and provides a simple clean-up mechanism. Objectivity/DB implements a form of shadow paging, where the page map changes are logged in journal files. More recent forms of shadow paging, such as that implemented in the current prototype, obviate the need for journal files and exploit sequential writes (via appropriate logical-to-physical page mapping) to optimise performance.
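The shadow-paging mechanism described above can be sketched in a few lines. This is an in-memory toy under stated assumptions, not the prototype's implementation: the class names `ShadowPagedStore` and `Transaction` are hypothetical, physical pages are simply appended to a list (standing in for sequential writes to free pages), and there is no concurrency control or reclamation of superseded pages.

```python
class ShadowPagedStore:
    """Toy page store: logical pages are reached through a page map."""

    def __init__(self):
        self.physical = []   # physical pages, written sequentially
        self.page_map = {}   # committed map: logical page -> physical index

    def read(self, logical):
        """Read the committed version of a logical page."""
        return self.physical[self.page_map[logical]]

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        # Private copy of the page map; the committed map is left untouched.
        self.shadow_map = dict(store.page_map)

    def write(self, logical, data):
        # New data always goes to a fresh physical page, never in place.
        self.store.physical.append(data)
        self.shadow_map[logical] = len(self.store.physical) - 1

    def read(self, logical):
        """Read through the transaction's own map, so it sees its own writes."""
        return self.store.physical[self.shadow_map[logical]]

    def commit(self):
        # Installing the new page map is the single step that publishes changes.
        self.store.page_map = self.shadow_map

    def abort(self):
        # Dropping the shadow map is the whole clean-up; old pages were never touched.
        self.shadow_map = dict(self.store.page_map)
```

Because the old pages are never overwritten, a crash before `commit` leaves the committed page map and its pages intact, which is why no separate journal is needed in this scheme.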
A lockserver is required to support concurrent access from multiple clients. In the current prototype, locking is performed at the level of a file, with a queue of requestors in the case of locked resources. In an attempt to offer more scalability, the lockserver operates on the level of a given dataserver, although it is currently implemented as a separate process. As is the case with the dataserver (pageserver), the lockserver is rather straightforward to implement, although detailed performance, reliability and scalability tests need to be performed.
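File-level locking with a queue of requestors can be illustrated as follows. The sketch is a single-process model only: the class name `FileLockServer`, its `request` and `release` methods, and the restriction to exclusive locks are assumptions for illustration, and the network protocol, time-outs and crash recovery that a real lock server needs are left out.

```python
from collections import deque


class FileLockServer:
    """Toy lock server: one exclusive lock per file, FIFO queue of waiters."""

    def __init__(self):
        self._holder = {}    # file name -> client currently holding the lock
        self._waiting = {}   # file name -> queue of clients waiting for it

    def request(self, client, file_name):
        """Grant the lock if the file is free, otherwise queue the client.

        Returns True if the lock was granted immediately.
        """
        holder = self._holder.get(file_name)
        if holder is None:
            self._holder[file_name] = client
            return True
        if holder == client:
            return True  # client already holds this lock
        self._waiting.setdefault(file_name, deque()).append(client)
        return False

    def release(self, client, file_name):
        """Release the lock and pass it to the next queued client, if any.

        Returns the new holder, or None if the file is now unlocked.
        """
        if self._holder.get(file_name) != client:
            raise ValueError(f"{client} does not hold the lock on {file_name}")
        queue = self._waiting.get(file_name)
        if queue:
            next_client = queue.popleft()
            self._holder[file_name] = next_client
            return next_client
        del self._holder[file_name]
        return None
```

Running one such server per data server, as the text suggests, keeps each lock table small and avoids a single federation-wide point of contention.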
In order for the system to create or reload C++ objects in the address space of the client application, the schema of the classes has to be defined. Using the schema, the database has to perform the following tasks:
Three distinct approaches can be taken to handle the schema definition of persistent classes:
This approach is that followed by the SHORE persistent object manager, which uses the SDL language for defining schema. It is also used by relational databases, where SQL is used to define schema. An issue that must be addressed when using this technique is that of consistency between the C++ application and the database schema.
An example of such an approach is Objectivity/DB.
In this case, one can either extract the schema using a modified C++ parser, as above, or else obtain the schema directly from the compiled program.
The current prototype implementation uses the last option, namely to extract the schema from debugging information. A standard format, used by the Sun and egcs compilers, is currently supported. Alternatively, a modified egcs front-end could also be used.
This technique supports run-time reflection for C++ classes, as follows:
The current C++ language binding supports all main language features, such as virtual functions and templates. As opposed to the C++ binding of Objectivity/DB, no macros are needed: templates are used, as defined by the ODMG standard. Again, in contrast to Objectivity/DB, there is no need for automatic code generation: the functionality of refs, associations and iterators is all handled directly via templates. No language extensions are required, facilitating the integration of third-party tools, such as Rational Rose.
The binding is based upon that defined by the ODMG, although not all features (e.g. indices, bi-directional associations, lock propagation) are yet available. However, the list of classes implemented includes:
The binding also solves a number of problems present in that of Objectivity/DB, namely:
The prototype is implemented in standard C++ and requires a compiler, such as egcs, that supports:
The current prototype is capable of handling the HTL histogram package. In order to build HTL with the prototype, rather than Objectivity/DB, only two changes are required:
These changes are shown in the code fragment below.

Figure 43 - HTL Built with Alternative Persistent Object Manager
Using Python/tk, a simple graphical browser has been built, as shown in Figure 44 below. This browser shows the class inheritance and member data of the persistent HTL objects that have been stored.

Figure 44 - Python Browser
As stated above, there is no intention to develop a product at this stage. The investigations that have been performed have been strictly limited to the on-going risk analysis, in order to enable estimates of the manpower that would be required to develop a full alternative system to be made. However, it is intended that the system be developed to the stage that the various components (object manager, schema handler, lock server, data server etc.) can be used together and have sufficient functionality to store HTL histograms and event tags. It is felt that the requirements of histograms and event tags are sufficiently complex as to enable useful estimates to be made. The target date for such a version is October 1999.
The prototype alternative persistent object manager described above has been implemented as a background activity, as part of the on-going risk analysis of the RD45 project, over the past 6-9 months. Although far from being a complete product (for which the exact requirements have not yet been finalised), it does suggest that an alternative solution, consistent with the overall strategy of RD45, could indeed be built, should it be determined that such a step is necessary. Such a system is likely to require a minimum of 10 man-years, and perhaps as much as the "tens of man-years" foreseen by the CMS collaboration in 1996.
Today's estimates of the effort to bring a full commercial system to market are around 50 man-years. A non-commercial system, in particular one designed to meet the needs and exploit the characteristics of HEP data, would almost certainly require less effort. It is likely that effort could be found within the LHC experiments and others, such as BaBar. A minimum of 1 FTE per experiment could be expected. Coupled with a small team at CERN, 5-10 FTEs does not look impossible. It would still be very tight to produce a complete system by 2001: indeed, experience suggests that any large software project requires at least three years for a stable system to be produced. However, a minimal system by 2001 and a complete solution by 2003 would seem to be feasible.
The possibility of collaborative development with users outside of the HEP community should also be considered. There are certainly other areas of physics that have similar needs, e.g. astrophysics, plasma physics etc. It is also possible that universities and research institutes, such as INRIA, would be prepared to provide effort.
These issues clearly require much further study and discussion. However, the first results are both positive and reassuring: it seems likely that an alternative solution could be built, if required, and that the manpower requirements are not off-scale.
In direct response to the risk analysis requested by the LCB, the RD45 collaboration has developed a prototype alternative persistent object manager, with a view to estimating the effort required to build a production version of such a system, should the need arise. Initial results are encouraging and suggest that a complete system is indeed feasible. Further work in this area clearly requires careful discussion. However, it is important to stress that at least one realistic alternative to Objectivity/DB should be available on the timescale of 2001; this is considered particularly important given the continued small size of the ODBMS market.
In the context of RD45, CERN has associate membership of the Object Management Group (OMG) and is a reviewer member of the Object Database Management Group (ODMG). CERN is also represented in the IEEE Computer Society Executive Committee on Mass Storage, the body to which the various standards sub-groups (SSSWG) report.
Regular attendance at the IEEE Mass Storage Symposia has been continued: two papers were presented at the 16th Symposium, held in March 1999 in San Diego. These symposia provide up-to-date information on storage hardware and software and allow contacts with other sites to be established. On the other hand, the IEEE SSSWG, which recently dropped the existing set of draft standards for a new list of somewhat different scope, is not felt to be of direct importance to HEP. Even if draft standards are completed on target, it is not clear if conforming and hence interoperable products will be developed early enough to affect the initial production phase of the LHC.
In contrast to previous years, no ODMG meetings were attended. The reasons for this are twofold:
It is expected that a new revision of the ODMG standard, V3.0, will be released during 1999 or early 2000. However, given that vendor compliance to the standard does not seem to be improving, regular attendance at the meetings no longer appears appropriate. Nevertheless, as is also the case with the OMG and IEEE, continued membership is still recommended, if only for timely access to information on the database (or object / storage as appropriate) industry / market.
Although many of the initial questions concerning the potential use of an ODBMS and MSS as the basis of a HEP data store have been answered and much additional information will also be gained from the production experience of experiments such as BaBar, a number of important outstanding issues remain. These are primarily related to the on-going risk analysis and to the LHCC milestone regarding a choice of ODBMS in 2001. Whilst it is clear that production support for Objectivity/DB-based services must have high priority, we cannot afford to ignore the conclusions of the risk analysis, nor can we delay in preparing for the 2001 ODBMS selection. Thus, we believe that the future activities of the project should include:
Support of Objectivity/DB as part of a full production database service is now considered essential by several experiments. A significant amount of important data is already stored in Objectivity/DB, including event and calibration data, construction data produced by CMS/Cristal and soon simulation data.
In addition, the following items are considered of particularly high priority:
It is felt that the presence of a framework for making the decision of a database system in 2001 (the timeframe set by the LHCC) is crucial. Such a decision should be based not only on technical issues and practical experience, but also support and resource requirements, licensing issues (if applicable), risk analysis and so forth.
In addition, it is felt important that the current "white paper" series be maintained and extended, that future meetings and workshops be held and that contacts with industry and other ODBMS users be maintained. In short, those activities traditionally carried out by RD45 that did not explicitly address a particular LCB milestone or recommendation should continue.
In other words, the proposed R&D activities would include:
The potential use of a mainstream ORDBMS product, such as ORACLE 8i or a later release, should also be considered, if HEP is to benefit from a widely-used technology.
In addition to offering production services, we believe that the RD45 project should address the issues identified by the on-going risk analysis requested by the LCB, and should assist in the preparation for the choice of ODBMS for the LHC experiments that is scheduled for 2001. These issues are reflected in the proposed milestones, presented in section 18 below.
The RD45 project was approved in 1995 to investigate and propose solutions to the problems of handling the persistent objects of the LHC experiments: event data, calibration data, histograms and so forth. Strong emphasis has been placed on the potential use of standards-conforming, widely-used (commodity) solutions. At an early stage of the project, a potential solution, based upon an Object Database Management Group (ODMG)-compliant Object Database (ODBMS), coupled with a Mass Storage System (MSS) built according to the IEEE Computer Society's Reference Model for Mass Storage Systems, was identified. This potential solution has been the primary focus of our activities, although we have continued to monitor and evaluate alternatives. The preferred components of this solution are built on top of Objectivity/DB and HPSS, coupled with a small quantity of HEP-specific code.
Although numerous experiments have used Objectivity/DB over the past few years, a major milestone was achieved during the last year as the BaBar experiment entered production. The experience of BaBar, where some 100 TB of data are expected per year, will clearly be extremely important in understanding whether the specific systems identified above will be capable, perhaps in a future release, of handling the event data of the LHC experiments. Although it is too early to draw definitive conclusions from their experience, the fact that the initial production period has been successful is an important step in validating the overall approach.
Coupled with the experience of running experiments such as BaBar is the on-going risk analysis. Given the state of the ODBMS market, and indeed the lack of commodity solutions in the MSS area as well, it appears likely that alternative, presumably "home-grown", solutions also need to be considered and the effort to develop and support such systems carefully evaluated.
We list below the milestones and recommendations from previous reviews of the RD45 project.
In addition, the project is asked to include the following activities in its work-plan:
RD45 (P59) should be approved for an initial period of one year. The following milestones should be reached by the end of the first year.
It should be noted that the milestones concentrate on event data. Studies or prototypes based on other HEP data should not be excluded, especially if they are valuable to gain experience in the initial months.
ADAMO - a system, developed in the ALEPH collaboration, based on the Entity-Relationship (ER) model.
ADSM - A storage management product from IBM
AFS - the Andrew (distributed) filesystem
CASE - Computer Aided Software Engineering
CORBA - the Common Object Request Broker Architecture, from the OMG
CORE - Centrally Operated Risc Environment
CWN - Column-wise Ntuple
CTP - Computing Technical Proposal
DFS - the OSF/DCE distributed filesystem, based upon AFS
DMIG - the Data Management Interface Group
EDMS - Engineering Data Management System
GB - 10^9 bytes
HPSS - High Performance Storage System - a high-end mass storage system developed by a consortium consisting of end-user sites and commercial companies
IEEE - the Institute of Electrical and Electronics Engineers
KB - 2^10 (1024) bytes - normally referred to as 10^3 bytes
LCB - LHC Computing Board
LCRB - LHC Computing Review Board
LIGHT - Life Cycle Global Hypertext
MB - 10^6 bytes
MSS - a Mass Storage System
NFS - the Network Filesystem, developed by Sun
ODBMS - an Object Database Management System
ODMG - the Object Database Management Group, a group of database vendors and users that develop standards of ODBMSs
OID - Object Identifier
OMG - the Object Management Group
OOFS - the Objectivity/DB Open FileSystem
OQL - the Object Query Language defined by the ODMG
ORB - an Object Request Broker
OSM - Open Storage Manager: a commercial MSS
PAW - the Physics Analysis Workstation
PETASERVE - a MSS based upon OSM
PB - 10^15 bytes
RWN - Row-wise Ntuple
SHORE - Scalable Heterogeneous Object REpository
SQL - Structured Query Language: the language used for issuing queries against databases
SSSWG - the Storage System Standards Working Group
STL - the Standard Template Library: part of the draft C++ standard albeit in a modified form
TB - 10^12 bytes
TOOLS.H++ - a former de-facto standard container/collection class library, largely made redundant by the collections provided in the standard C++ library
VLDB - Very Large Database
VLM - Very Large Memory
VMLDB - Very Many Large Databases
XBSA - the draft X/Open Backup Services Application Program Interface