EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH
CERN/LHCC 97-9
LCB/RD45
15 April 1997
We present an analysis of the performance and
scalability characteristics of a HEP event store based upon an
Object Database Management System (ODBMS), namely
Objectivity/DB. The event store has been populated both by physics
reconstruction programs (NA45) and by detector simulations (GEANT-4).
The performance and scalability measurements therefore include
both the production and analysis phases of classical physics processing.
The performance has been compared both with
traditional HEP tools, such as PAW/PIAF and HBOOK Ntuples, and
with the raw performance characteristics of the storage
devices and systems in question.
This report has been produced in response
to the 3rd milestone set by the LCRB for the second
year of the RD45 project, namely:
"Make an evaluation of the effectiveness
of an ODBMS and MSS as the query and access method for physics
analysis. The evaluation should include performance comparisons
with PAW and Ntuples."
TABLE OF CONTENTS
1. Executive Summary 1
2. Introduction 1
3. HEP Data Management 2
4. The Scale of the Problem 3
5. The RD45 Model 4
6. Limits and Scalability of the Objectivity/DB Architecture 5
6.1 Client Cache Capacity 5
6.2 Large VArrays 6
6.3 Very Large Numbers of Associations 7
6.4 Number of Containers per DB, Size of Containers and Databases 8
6.5 Number of Databases per Federation 8
6.6 Maximum Federation size 10
6.7 Very Large Collections 10
6.8 Re-clustering and the Effect on Existing Collections 11
6.9 Handling Multiple Containers and/or Databases 11
6.10 Conclusions 12
7. Schema Evolution Performance 13
7.1 Survey of General Performance Issues 13
7.2 Code Examples 14
7.2.1 Test Bed 15
7.3 Comparison of Different Conversion Methods 15
7.4 Influence of Size of Objects on Conversion Performance 17
7.5 Performance Issues With Numerous Schema Changes 18
7.6 Summary 19
7.7 Conclusions 19
8. Data Replication Performance 19
8.1 Performance Tests Results 20
8.1.1 "Replicated" Container Performance 21
8.2 Summary 21
9. Object Versioning Performance 22
9.1 Conclusions 22
10. Data Compression and Impact on Performance 23
11. Raw Performance Measurements 23
11.1 Hard Disk Read Performance 23
11.1.1 Sequential Read 24
11.1.2 Random Read 25
11.1.3 Selective Read 26
11.1.4 Other Hard Disks 30
11.1.5 Striped Hard Disk Arrays 31
11.1.6 Trends in hard disk read performance 31
11.2 Sequential I/O performance through Objectivity 32
11.2.1 Influence of the Storage Efficiency on the Database I/O Performance 32
11.3 Read/Write Performance of Test-Bed System 36
12. Data Selection Performance 36
12.1 Current Analysis Practice 36
12.1.1 Analysis based on the full event data 36
12.1.2 Analysis based on Ntuples 37
13. The Ntuple Data and Storage Model 38
14. The Objectivity/DB Data and Storage Model 40
15. Ntuple Generation Versus ODBMSs 40
16. Possible ODBMS Scenarios 41
17. The Tag Database 42
17.1 Introduction 42
17.2 The Tag Hierarchy 42
17.3 Clustering Options 43
17.4 Comparisons with PAW and Ntuples 43
17.4.1 Full Tag Comparisons 44
17.4.2 Reduced Tag 45
17.4.3 Comparison with larger data amounts 45
18. Using Indices for Physics Queries 46
18.1 B-Tree Indices 46
18.1.1 Queries Using Indices 48
18.2 Data Search Performance 49
19. Performance and Scalability Enhancement Requests 54
19.1 Mass Storage Interface 54
19.2 Paged VArray 54
19.3 Support for Very Large Federations 54
19.4 STL-Compatible Collection Classes 55
19.5 Conclusions from Requirements 55
20. Conclusions 55
21. Appendices 57
22. Comparison between RW and CW Ntuple Performance in PAW 57
23. Example User Function From NA45 60
24. Example of User Function in C++ 65
25. Glossary 68
26. References 69

Executive Summary
In response to the milestones set by the last LCRB review of the
RD45 project, we have made an evaluation of the effectiveness
of using an ODBMS as the query and access method for physics analysis,
together with performance comparisons with existing systems, namely
PAW and HBOOK Column-wise Ntuples (CWN).
It is our conclusion that:
In addition, we believe that the architecture of the current Objectivity/DB implementation offers sufficient scalability such that it could be used as a building block for a multi-PB HEP object store. However, there are a number of enhancements, such as a flexible mass storage interface, extended object identifier and paged VArray, which are highly desirable. We are working with the company to ensure that the requested improvements are made in a timely manner.
This report has been produced in response to the third milestone
set at the March 1996 review of the RD45 project by the LCRB,
namely:
Make an evaluation of the effectiveness of an ODBMS and MSS
as the query and access method for physics analysis. The evaluation
should include performance comparisons with PAW and Ntuples.
This document should be read in conjunction with the March 1997
RD45 status report to the LCB [2], together with the supporting
documents produced for the work relating to milestones 1 [4] and
2 [5]. This document assumes a working knowledge of object-oriented
methods and object data management, as described in [10].
Below, we give detailed performance and scalability measurements
based on a test-bed using Objectivity/DB to store and manage production
physics data. The test-bed has been designed to easily support
1TB of disk-space with tens of TB of tertiary storage. To date,
however, we have only measured the performance using disk storage,
but foresee the addition of a mass storage system (MSS) in a further
phase of the project. Other possible extensions include investigations
of new architectures, including cache-coherent MPPs with very
large memories (VLM).
In addition to measurements relating directly to the milestone
listed above, we also include
Traditionally, data management in HEP has been limited by the
technologies available. The data volumes involved have always
exceeded what could affordably be stored on random access media,
and hence extensive use of slow, sequential media, such as magnetic
tape, has been made, together with extraction of increasingly
small subsets, which have been used for various analyses.
In addition to the use of sequential media, the data access and
management software has typically been developed in-house. This
is normally attributed to the lack of appropriate alternatives,
but was at least in part due to the highly non-standard environment
used in most HEP experiments.
For the future, by adopting widely used technologies, there is
the hope that the bulk of the hardware and software required for
HEP data management will be commodity items. By definition, such
components should be widely used, well supported, and also affordable,
as the development costs will have been amortized over the large
user base. This allows us to rethink our approach to data management
and, hopefully, implement a solution offering much more functionality,
whilst requiring similar, or perhaps even less, manpower to build
and run.
It is important to stress that the fundamental differentiating
factors concerning HEP data processing are the sheer volume of
data - the major LHC experiments will generate about 1 PB (10^15
bytes) of data per year, and also the data rates that must
be supported - up to 1.6GB/second for the ALICE data acquisition
system. In addition, the HEP discipline is fully distributed -
it must be possible to provide peer access to the data to any
physicist from essentially any location in the world (although
this should not be taken to imply data copies).
On the other hand, HEP data processing has a number of simplifying
characteristics:
As shown in the table below, three of the LHC experiments will
produce raw data at a rate of around 1 PB per year. Including
all data, both real and simulated, and integrating over 20 years
of running for ALICE, ATLAS, CMS and LHC-B, we can expect around
100 PB of data in total. Although the data from the various experiments
is likely to be stored in separate logical databases, this number
gives an indication of the scale of the problem - 100 PB, 10^17
bytes or 0.1 EB (Exabyte). This data volume is the largest of
any known project in the time period involved, several times larger
than that of e.g. NASA's "Mission to Planet Earth",
which plans to store some 15PB by 2015.
[Table: LHC data volumes accumulated over 1 second, 1 minute, 1 hour, 1 day, 1 week, 1 month, 1 year (100 days) and in TOTAL LHC (15 yrs). Column headers and cell values are missing.]
One of the key issues over the coming years will be to investigate
how the currently proposed solution, based upon Objectivity/DB,
can be expected to scale to such enormous data volumes, and to
identify areas that need to be improved. However, preliminary
investigations of both the scalability and performance of Objectivity/DB
(see section 6 on page 5 and section 17.4 on page 43) have been
most encouraging.
To date, work has been limited to small physical databases of around 2 GB each, grouped into logical databases (a federated database, in Objectivity terminology) of a few tens of GB. Clearly, a demonstration of a much larger federation is required, and the size of this federation must grow with time, perhaps reaching as much as a few hundred TB 2-3 years before the start of LHC data taking.
As can be seen from the table above, the experiments at the LHC
will require solutions capable of scaling far beyond the PB region.
Although a solution capable of handling a few PB per experiment
will be sufficient at the start of data taking, the solution adopted
must be capable of scaling by at least one if not two orders of
magnitude beyond this. Predicting technology so far into the future
is an extremely error-prone activity. For these reasons, and as
described in CERN/LHC 96-17 "Object Databases and Mass Storage
Systems: the Prognosis" [3], we have deliberately based the
RD45 model on technology that either exists today, or can be expected
in the next 2-3 years. In reality, we expect that technology will
continue to evolve, but we have nevertheless listed the areas
of concern, together with milestones for various prototypes and
also fall-back solutions.
We reiterate the basic elements of the RD45 model below. This
model assumes that a totally integrated, object-oriented solution
is required. Although today's solutions are far from transparently
integrated, future solutions must be capable of handling much
larger volumes of data and higher data rates, whilst demanding
no more and preferably even less man-power to run. Clear savings
can be made from a solution that offers transparent navigation
from any object in the entire logical store to any other - this
is hard to achieve in a scheme built out of separate components,
often based on different technologies, as is the case with today's
solutions.
This report contains numerous performance measurements of the
above model. These measurements need to be continued beyond the
data volume that we have achieved. In particular, we would suggest:
In other words, building a hundredth-scale rather than thousandth-scale model of the LHC production phase. Closer to LHC data-taking, this test-bed should be extended further still, to perhaps one tenth scale.
As part of the risk analysis, discussed in the RD45 Status Report [2], we have tested the limits of the Objectivity/DB architecture, and investigated its scalability.
The Objectivity/DB architecture implements a so-called "fat-client",
that requires that all the data the program is accessing during
a transaction is kept in the client cache. In fact, in Objectivity/DB
there are two separate data caches designed to handle objects
of different sizes. Base objects that are smaller than a single
page and objects such as containers and databases are kept in
the base cache, whereas objects larger than a single page, so-called
"large objects", are kept in a separate, dynamic cache.
The base cache consists of a number of pages that have the same
size as the database pages - a federated database constant. When
an application starts executing, an initial number of pages is
allocated for the cache. This cache may later be extended, by
the addition of extra pages, until the upper limit is reached.
If there are no free pages in the cache, no new objects can be
accessed - some objects would have to be closed, either explicitly,
or implicitly at transaction commit time, e.g. at the end of processing
the current event. In the general case, lack of free pages in
the client cache would lead to transaction failure, unless the
application included the appropriate error handling code. However,
this is not expected to be a problem for HEP applications, as
the number of active objects, even in the extreme case of processing
raw data, will not be constrained by the client cache size.
One could also consider defining the upper cache limit to be very
high. However, as the base cache never shrinks back even if some
pages are freed, the process image size in memory will quickly
increase up to the given limit. This may be a problem on clients
with smaller amounts of memory.
The cache pages correspond directly to database pages. In the
worst case, when only a single object on a given database page
is of interest, there will be only one opened object on the page(s)
in question. Clearly, efficient clustering will minimise the number
of unused objects that are brought into memory. However, this
should not be overlooked when calculating the expected cache size.
No limit on the number of objects that can be stored in the base
cache has been observed - the only limit appears to be the size
of the cache itself.
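The base cache behaviour described above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not Objectivity/DB code: the class `BaseCacheModel` and its member names are hypothetical, the cache is assumed to grow one page at a time, and each opened object is assumed to pin one page (the worst case discussed above).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the base cache (not the Objectivity/DB API).
// Pages are allocated on demand up to an upper limit; the allocation
// never shrinks, even when pages are freed.
class BaseCacheModel {
public:
    BaseCacheModel(std::size_t initialPages, std::size_t maxPages)
        : allocated_(initialPages), max_(maxPages), inUse_(0) {
        assert(initialPages <= maxPages);
    }

    // Pin one page for a newly opened object. Returns false when no free
    // page exists and the upper limit has been reached; in the real system
    // this is where the transaction would fail without error handling.
    bool openPage() {
        if (inUse_ == allocated_) {
            if (allocated_ == max_) return false;  // upper limit reached
            ++allocated_;                          // extend the cache by one page
        }
        ++inUse_;
        return true;
    }

    // Closing objects (explicitly or at commit) frees the pages, but the
    // allocation, and hence the process image, stays at its high-water mark.
    void closeAll() { inUse_ = 0; }

    std::size_t allocatedPages() const { return allocated_; }
    std::size_t pagesInUse() const { return inUse_; }

private:
    std::size_t allocated_;
    std::size_t max_;
    std::size_t inUse_;
};
```

In this model, a client that once touched the maximum number of pages keeps that allocation after `closeAll()`, which is why a very high upper limit can be a problem on clients with little memory.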
The second, dynamic, cache has no limit on object size, providing that the system limit on process memory size is not exceeded.
Version 4.0.2 of Objectivity/DB has a soft limit for the dynamic
cache size. "Soft" means that the cache can grow beyond
the limit, but when some objects are closed and the space occupied
by them is freed, the cache will shrink back to this soft limit.
This limit is, by default, equal to the maximum size of the base
cache, with the possibility to be resized later from within the
application.
We have observed that the maximum number of opened large objects is equal to the limit of pages in the base cache, even though large objects are stored in a completely separate cache. Such a connection is not understood and is being discussed with Objectivity. We have also requested enhancements such that each cache may be managed independently.
Objectivity/DB provides a variable length array (VArray) that
can be resized at run-time. The indexing type is uint32, which
means that a VArray can hold up to 2^32 entries. This
number is large enough to satisfy known HEP requirements. However,
the effective limit in the size of a VArray comes from the fact
that the entire VArray must be read into memory before an operation
can be performed on it. Simple operations such as accessing a
value of an entry require just enough memory to read in the VArray.
On the other hand, to increase the size of an existing VArray,
the current implementation requires enough contiguous space to
hold the VArray after the resize as well as the original
copy. Fortunately, the client cache size is not a restriction
on the size of VArrays, as large VArrays are stored in the dynamic
cache that is specifically reserved for large objects.
We have performed tests to understand the effective limits on
the size of VArrays on a machine with a process size limit of
128MB. The first approach was to create a VArray with the maximum
possible initial size. We were able to create a VArray that contained
15 million 64-bit entries, i.e. a total of 120MB. The other approach
was to create a small VArray and expand it incrementally. In this
way we could create a VArray with only 3.5 million entries (28MB),
which is about one quarter of the size from the first test. The
difference is probably the result of copying and also of the transaction
mechanism.
The conclusion from the tests is that the current implementation of VArrays works reasonably well only for objects of about 10MB or less. Beyond this limit the performance deteriorates, firstly due to a significant amount of I/O, then because of the large memory requirements. The recommended approach is to create VArrays with a large initial size and to expand them rarely, in big steps.
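The memory cost of resizing can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, sizes use decimal megabytes as in the text, and it ignores every other use of process memory (cache, transaction bookkeeping, the program itself).

```cpp
#include <cassert>
#include <cstdint>

// Bytes occupied by a VArray holding `entries` elements of `entryBytes` each.
std::uint64_t varrayBytes(std::uint64_t entries, std::uint64_t entryBytes) {
    assert(entryBytes > 0);
    return entries * entryBytes;
}

// Peak memory during a resize: the original copy and the resized copy
// must both be held in memory at the same time.
std::uint64_t resizePeakBytes(std::uint64_t oldEntries,
                              std::uint64_t newEntries,
                              std::uint64_t entryBytes) {
    return varrayBytes(oldEntries, entryBytes) +
           varrayBytes(newEntries, entryBytes);
}
```

For example, `varrayBytes(15000000, 8)` gives the 120MB of the first test. On this simple model, an incrementally grown VArray could reach at most about half of the process size limit; the 28MB actually observed is lower still, consistent with the additional copying and transaction overhead mentioned above.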
There is a possible risk of running out of memory when expanding
big VArrays. Existing VArrays may also be inaccessible on some
clients with modest amounts of memory.
The preferred solution to the mentioned problems would be for
Objectivity to provide "paged" VArrays, that would
load into memory only accessed pages and would not require contiguous
memory space.
Support for such VArrays, termed LVArrays in Objectivity/DB parlance, is scheduled for release in a future version of Objectivity/DB, expected to enter beta-test during 1997.
In Objectivity/DB the associations between objects are implemented using VArrays of object references (ooRefs). Because of this they inherit all the behavior and features of VArrays. The important difference is that there is no direct control of the initial size of the association table and of its resizing policy.
During our tests we found that the initial size of an association table is 4 entries. When the table is full, its length is increased by 40%. The growth factor is always the same, so it is possible to calculate when the next resize will happen.
For example, at a certain point an association table can hold
3800516 associations. Once it is full, adding a single new
association requires the table to expand to 5320723 entries.
At the time of the expansion our test program had process size
of over 150MB. The next resize was not possible due to lack of
memory.
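The growth rule can be written down directly. The following is a minimal sketch, not Objectivity/DB code: the function names are hypothetical, and rounding up to the next integer is an assumption, chosen because it reproduces the step from 3800516 to 5320723 entries quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Next capacity of an association table that grows by 40% when full.
// Rounding up is an assumption (see the lead-in).
std::uint64_t nextCapacity(std::uint64_t current) {
    assert(current > 0);
    return (current * 14 + 9) / 10;  // ceil(current * 1.4) in integer arithmetic
}

// Number of resizes needed to grow from the initial 4 entries
// to a capacity of at least `wanted` entries.
unsigned resizesNeeded(std::uint64_t wanted) {
    std::uint64_t capacity = 4;
    unsigned resizes = 0;
    while (capacity < wanted) {
        capacity = nextCapacity(capacity);
        ++resizes;
    }
    return resizes;
}
```

Because each resize needs room for both the old and the new table, the memory required grows with every step, which is why the resize following 5320723 entries could not be completed on the test machine.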
Some object models, such as the current prototype for the ALICE
raw data, require very large numbers of associations. However,
in the case of individual events, we expect that the number of
associations that will be required will be of the order of 10-100,
or at the most 1000. As one database page of the default size
8 KB can hold about 650 associations, this amount is not considered
a risk area. In extreme situations one association table of size
12MB can contain 1 million associations and the applications should
be able to handle it.
Association tables suffer from the same problems as VArrays. Implementing "paged" VArrays would also have a significant impact on the number of associations that can be stored in one object.
The number of containers per physical database is limited to 32K
(2^15 - 1). Although we have no obvious requirement
for a greater number of containers per database, we have nevertheless
tested building a database with such a large number of containers.
The current Objectivity/DB architecture limits the maximum container
size to a multiple of the database page size. Using database pages
of 8K, the maximum container size is 2^29 bytes, or
0.5GB. Using the maximum database page size, containers
are limited to 4GB. As a physical limitation, this is not considered
to be a significant problem, although there is a clear need for
logical containers, which group together multiple physical containers.
Theoretically, the maximum size of a database is 128TB, provided that it is stored on a true 64-bit filesystem. However, the current release of Objectivity/DB (4.0.2) is not able to handle databases bigger than 2GB. This problem has been reported to Objectivity and we are awaiting a patch to fix it.
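The limits quoted in this section follow from a few constants, as the sketch below shows. It is illustrative only and rests on two inferences that are not stated explicitly in the text: that a container holds at most 2^16 pages (2^29 bytes divided by 8K pages), and that the maximum page size is therefore 64K (4GB divided by 2^16 pages). All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption inferred from the text: 2^29 bytes per container with
// 8K pages implies 2^16 pages per container.
const std::uint64_t kPagesPerContainer = 1ULL << 16;

// From the text: at most 2^15 - 1 containers per physical database.
const std::uint64_t kContainersPerDatabase = (1ULL << 15) - 1;

// Maximum container size for a given database page size (in bytes).
std::uint64_t maxContainerBytes(std::uint64_t pageBytes) {
    assert(pageBytes > 0);
    return kPagesPerContainer * pageBytes;
}

// Theoretical maximum database size for a given page size (in bytes).
std::uint64_t maxDatabaseBytes(std::uint64_t pageBytes) {
    return kContainersPerDatabase * maxContainerBytes(pageBytes);
}
```

With 8K pages `maxContainerBytes` gives 2^29 bytes (0.5GB); with 64K pages it gives 4GB per container and just under 2^47 bytes (128TB) per database, matching the figures above.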
In the Objectivity/DB architecture, a single logical database
is composed of many physical databases. Currently, each
physical database is mapped to a file and the logical database
is termed a federated database.
The current RD45 model, using multiple physical databases limited
to some 100GB, requires that we will use many thousands of physical
databases, spread across multiple servers. Indeed, the current
64-bit OID used by Objectivity/DB implies a maximum size of a
federated database of only 6.5PB (2^16 - 1 databases
of 100GB each). Although this would be adequate for a single year's
data, it is not sufficient to store all the data produced over
the lifetime of an LHC experiment. Using one federation per year
would be an acceptable, but undesirable, fallback solution.
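The federation limit can be checked with the same kind of arithmetic. This is a sketch with hypothetical names, using decimal units (1GB = 10^9 bytes) as the text does.

```cpp
#include <cassert>
#include <cstdint>

const std::uint64_t kGB = 1000ULL * 1000 * 1000;  // decimal gigabyte
const std::uint64_t kPB = 1000ULL * 1000 * kGB;   // decimal petabyte

// From the text: the current 64-bit OID allows 2^16 - 1 databases
// per federation.
const std::uint64_t kMaxDatabases = (1ULL << 16) - 1;

// Maximum federation size for a given physical database size (in bytes).
std::uint64_t maxFederationBytes(std::uint64_t databaseBytes) {
    assert(databaseBytes > 0);
    return kMaxDatabases * databaseBytes;
}

// Whole years of data taking that fit into one federation
// at a given yearly data volume.
std::uint64_t yearsPerFederation(std::uint64_t databaseBytes,
                                 std::uint64_t bytesPerYear) {
    assert(bytesPerYear > 0);
    return maxFederationBytes(databaseBytes) / bytesPerYear;
}
```

With 100GB databases, `maxFederationBytes(100 * kGB)` evaluates to 6.5535 PB, the 6.5PB figure quoted above. The function makes explicit that the limit scales linearly with the physical database size, so raising it without larger databases requires a change to the OID or to the logical-to-physical mapping.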
As it would seem unwise to plan on a physical database size of
much more than 100GB, there is a clear requirement to increase
the current 64-bit OID, or at least change the mapping from logical
to physical model to circumvent this problem. This has been raised
as a requirement with Objectivity.
To understand whether such large numbers of databases can really be handled by the current architecture, we have used the existing test-bed to try to build a federation containing the maximum number of empty databases. This allows us to understand issues relating to the number of databases, without requiring a massive storage system in which to place them.
During the tests we have created a federation containing about
13,000 databases. We were able to make the following observations:
Both things make it difficult to create larger numbers of databases.
We have reported the problem to Objectivity and we are still investigating
the reasons for such database behaviour.
In general, we estimate that at present a federation of 50TB could be created with Objectivity/DB 4.0.2.
During tests we have created a database using the ALICE prototype data
model, with a maximum size of 50GB. Using a compressed file system
on Windows NT and storing empty objects, we were able to create
a 500GB database.
After solving the problems with database size and creation time,
the maximum federation size would be 6.5PB (assuming the RD45
data model with a physical database size of 100GB).
Changes to the logical to physical mapping in the Objectivity/DB OID are planned for a future release of Objectivity/DB, scheduled to enter beta-test during 1997. At the same time, support for multi-file databases will be provided, e.g. permitting each container to be stored in a separate file. It is our understanding that these changes do not increase the overall size of the OID, but would nevertheless permit federations much larger than 6.5PB. We intend to participate in the field-test program to understand if these changes are sufficient to meet our overall requirement for federations up to 100PB with no arbitrary constraints on the physical partitioning of the data.
Some physics channels at the LHC are estimated to include very
large numbers of events - perhaps 10^9 or even more.
It is highly unlikely that collections are a viable approach for
managing such large numbers of objects, and an approach based
on containment, i.e. where all events corresponding to a certain
channel would be stored in a given (set of) containers and/or
databases, is more appropriate.
Solutions to this problem include "collections of collections",
e.g. where a given physics channel is divided into multiple collections,
each corresponding to a data taking period, or direct support
from the database, in a manner that does not require that the
entire collection is loaded into the client cache.
Further investigation of the handling of very large collections needs to be made.
The ODMG does not define the implementation of an OID, and various
different strategies are taken by the vendors. That followed by
Objectivity is to use an OID that currently has a direct physical
mapping. This has significant advantages in terms of performance
over logical OID implementations, which typically imply at least
one additional level of indirection, but implies that object re-clustering
is likely to render existing collections invalid. If bi-directional
associations are used between the collections and the objects,
then re-clustering can be performed at any time without rendering
these collections invalid. However, if uni-directional associations
are used, which is expected to be the case for user collections,
then re-clustering will render such collections invalid.
A number of scenarios exist which minimize, or even hide, the effect of re-clustering on private collections. For example, a validity stamp could be used to determine automatically whether collections were still valid, and even update the collections if required. However, it is clear that further investigation is required to fully understand the issues involved.
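One way to picture the validity stamp idea is sketched below. This is purely illustrative and not part of Objectivity/DB: the types `ContainerInfo` and `UserCollection`, and the notion of a per-container clustering stamp, are assumptions introduced for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical: each re-clustering of a container increments its stamp.
struct ContainerInfo {
    std::uint32_t clusteringStamp = 0;
    void recluster() { ++clusteringStamp; }
};

// Hypothetical private collection holding uni-directional references.
// It remembers the stamp under which its references were recorded.
struct UserCollection {
    std::vector<std::uint64_t> objectRefs;  // stand-in for physical OIDs
    std::uint32_t recordedStamp = 0;

    // Replace the references and record the current stamp.
    void refresh(const ContainerInfo& c, std::vector<std::uint64_t> newRefs) {
        objectRefs = std::move(newRefs);
        recordedStamp = c.clusteringStamp;
        assert(isValid(c));
    }

    // The collection is valid only if no re-clustering has happened
    // since its references were recorded.
    bool isValid(const ContainerInfo& c) const {
        return recordedStamp == c.clusteringStamp;
    }
};
```

A framework could call `isValid` whenever a collection is opened and, on failure, either reject the collection or rebuild its references and call `refresh`. How the new references would be obtained after re-clustering is exactly the open issue noted above.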
Independent of the current limits on container and database sizes,
there is a clear requirement for a facility whereby containers
and databases can be limited to a given size, with new containers/databases
created automatically as required. In addition, facilities to
iterate over the multiple containers/databases must be provided.
Such a facility could be provided either by the database vendor
or by HEP-specific application code. The preferred solution would
be for the vendor to provide such libraries, although it is highly
unlikely that such implementation-specific areas will ever be
standardised, and hence the usual caveats concerning vendor-specific
features apply.
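The facility requested here could look roughly like the following sketch, whether supplied by the vendor or written as HEP-specific code. It is not Objectivity/DB code: `LogicalContainer` is a hypothetical name, a `std::vector<int>` stands in for a physical container, and the size limit is expressed as an object count rather than in bytes to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "logical container" that spreads objects over several
// size-limited physical containers and iterates over all of them.
class LogicalContainer {
public:
    explicit LogicalContainer(std::size_t maxObjectsPerContainer)
        : limit_(maxObjectsPerContainer) {
        assert(limit_ > 0);
    }

    // Append an object, creating a new physical container automatically
    // when the current one is full.
    void add(int object) {
        if (parts_.empty() || parts_.back().size() == limit_) {
            parts_.push_back(std::vector<int>());
        }
        parts_.back().push_back(object);
    }

    std::size_t physicalContainers() const { return parts_.size(); }

    // Visit every object across all physical containers, in insertion order.
    template <typename Visitor>
    void forEach(Visitor visit) const {
        for (std::size_t p = 0; p < parts_.size(); ++p)
            for (std::size_t i = 0; i < parts_[p].size(); ++i)
                visit(parts_[p][i]);
    }

private:
    std::size_t limit_;
    std::vector<std::vector<int> > parts_;
};
```

The point of the design is that application code sees a single container and a single iteration, while the physical partitioning stays an implementation detail that can change with the limits of the underlying database.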
Opening containers and databases requires a lock per container/database.
At present, the number of simultaneous locks that the lockserver
can handle is around 23500. With a lot of containers open it
is possible to reach this limit, but during testing of our prototype
databases fewer than 100 locks per client were needed.
The lockserver table size will be increased in future releases
of Objectivity/DB.
We have tested the limits restricting the creation of, and access to,
objects in Objectivity/DB. In some areas the limits imposed by
the architecture of the federation (maximum number of databases) have
been found too restrictive. In other places the limiting factor
was the memory available to the client. In the case of databases,
problems have been encountered and reported back to Objectivity.
The areas where improvements are required are:
Our requirements in these areas have been fed back to Objectivity
in the form of product enhancement requests. Specifically, we
have asked for:
It is our understanding that all of these problems will be addressed
during the coming year. We will nevertheless continue to follow
up with Objectivity on these issues, and report back to the LCB
on the results of further investigations in these areas.
Schema evolution, and its implementation in Objectivity/DB, is described in detail in [5].
Below, we report on the performance of the Objectivity/DB implementation.
To understand the implication of schema evolution on data access
performance, we have measured the access time to small, medium
and large objects, both before and after schema evolution has
occurred, in different scenarios:
In all cases, we have measured the access time to objects using
all three modes supported by Objectivity/DB, namely: immediate,
deferred and on-demand.
In immediate mode, all affected objects in the entire federation
are converted immediately. In on-demand mode, objects are
converted within a specific scope, e.g. within a given container
or database. Finally, deferred conversion converts objects
as they are accessed. These modes may be mixed - for example,
one may start by converting objects using deferred mode, and later,
at a convenient moment, convert all remaining objects in defined
containers or databases. Clearly, flexible conversion mechanisms
are required when large volumes of data are involved - it would
be unacceptable to have to wait for the entire multi-PB federation
to be converted for each schema change.
The results presented below were obtained on a 90MHz Pentium PC with 64MB of memory and a 1GB hard disk, running Windows/NT 4.0 and Objectivity/DB 4.0.1. Given the relatively poor performance of this configuration, the absolute values quoted below should be given less emphasis than the relative values.
The impact of schema changes on access performance to "affected" objects is significant only during object conversion itself. Access times to such objects before the schema change, and after the necessary changes have been stored in the database, do not differ significantly, assuming that the schema changes do not result in major differences in object size. Object conversion itself is a relatively expensive process, especially if a large amount of data is to be converted inside one transaction:
The example below shows how schema evolution looks to an upgrade
application. In immediate mode, one simply opens the federated
database inside an upgrade application and requests the upgrade
to be performed.
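ooTrans trans ;
ooHandle(ooFDObj) fdH ;
trans.upgrade() ;                  // declare this to be an upgrade application
trans.start() ;
fdH.open("TstFD", oocUpdate) ;     // open the federated database for update
fdH.upgradeObjects() ;             // request the upgrade of all affected objects
trans.commit() ;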
In on-demand mode, one requests the conversion within a specific
scope.
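ooTrans trans ;
ooHandle(ooFDObj) fdH ;            // federated database handle
ooHandle(ooDBObj) dbH ;            // database handle
ooHandle(ooContObj) contH ;        // container handle
trans.start() ;
fdH.open("TstFD", oocUpdate) ;
dbH.open(fdH, "tstDB", oocUpdate) ;
contH.open(dbH, "tstC", oocUpdate) ;
contH.convertObjects() ;           // convert within the container only
// or dbH.convertObjects() ;       // ... within the database
// or fdH.convertObjects() ;       // ... within the whole federation
trans.commit() ;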
In the most trivial scenario, where all objects are stored in a one-way
list in the same container, only a few lines of code are needed
to enable conversion of all objects using deferred mode. The majority
of the work is locating the objects, rather than the act of conversion
itself.
ooTrans trans ;
trans.start() ;
ooRef(simpleList) node = head ;
while( nodes-- && (node = node->next()) )
    node.update() ;                // explicit update() triggers the conversion
trans.commit() ;
It is important to note that the update() function must be called explicitly on every object that is to be converted; just accessing an object in an update transaction is not enough.
During the tests described below, all "affected" objects,
i.e. objects of the class which is changed during schema evolution,
were kept in the same container, thus converting objects in one
container should give the same results as converting the entire
database or even federation. Regardless of the conversion procedure
or scope, e.g. container, database or a federation, the same number
of objects was involved.
Objects were kept in a one-way list, thus simplifying the conversion task. Similarly, selective conversion (e.g. every second object, every third, etc.) was easy to accomplish.
Conversion of the same amount of data using different conversion
tools gives similar results - the differences are within a range
of +/-10%. The best conversion performance (if objects are not
too small) is given by deferred conversion (all objects converted
in one transaction). However, the commit time for deferred mode
conversion is worse than for other methods, thus no one technique
emerges as the fastest.
The numbers presented in the table below are averages over many
measurements for the conversion of 10MB of objects, each object
being 500 bytes in size. In these tests, we have simply inverted
two data members, as this is a persistent change that does not
alter the size of the affected objects. Other tests with the same
amount of data but different object size (100 bytes and 1 KB)
show the same general behaviour, although the actual figures observed
are different. More details are given in section 7.4 on page 17.
| | | | | | | |
| read access, no schema evolution (s.e.) | 0.25 | 0.05 | 0.06 | 27 | 20 | 3 |
| update access, no s.e. | 0.40 | 0.07 | 0.09 | 1469 | 103 | 190 |
| read access after s.e., before conversion | 0.39 | 0.18 | 0.06 | 50 | 40 | 7 |
| 1st update access after s.e. (deferred conversion, persistent storage) | 0.54 | 0.22 | 0.10 | 2308 | 141 | 381 |
| read accesses after conv. has been stored | 0.25 | 0.05 | 0.06 | 30 | 23 | 3 |
| update accesses after conv. has been stored | 0.40 | 0.07 | 0.08 | 1402 | 103 | 168 |
| immediate conversion after s.e. | 0.64 | 0.28 | 0.10 | 1693 | 160 | 380 |
| on-demand conv. after s.e. (on container) | 0.58 | 0.28 | 0.10 | 1848 | 135 | 376 |
| on-demand conv. after s.e. (on database) | 0.60 | 0.28 | 0.10 | 1452 | 155 | 316 |
| on-demand conv. after s.e. (on federation) | 0.61 | 0.28 | 0.11 | 1587 | 135 | 391 |
|
| |||||
| read access, no s.e. | 5.0 | 1.0 | 1.2 | 0.03 | 0.02 | 0.003 |
| update access, no s.e. | 8.0 | 1.4 | 1.8 | 1.47 | 0.10 | 0.19 |
| read access after s.e., before conversion | 7.8 | 3.6 | 1.2 | 0.05 | 0.04 | 0.007 |
| 1st update access after s.e. (deferred conversion, persistent storage) | 10.8 | 4.4 | 2.0 | 2.31 | 0.14 | 0.38 |
| read accesses after conv. has been stored | 5.0 | 1.0 | 1.2 | 0.03 | 0.02 | 0.003 |
| update accesses after conv. has been stored | 8.0 | 1.4 | 1.6 | 1.40 | 0.10 | 0.17 |
| immediate conversion after s.e. | 12.8 | 5.6 | 2.0 | 1.69 | 0.16 | 0.38 |
| on-demand conv. after s.e. (on container) | 11.6 | 5.6 | 2.0 | 1.85 | 0.14 | 0.38 |
| on-demand conv. after s.e. (on database) | 12.0 | 5.6 | 2.0 | 1.45 | 0.16 | 0.32 |
| on-demand conv. after s.e. (on federation) | 12.2 | 5.6 | 2.2 | 1.59 | 0.14 | 0.39 |
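The two tables above appear to be related by simple arithmetic. The sketch below checks this under assumptions that are not stated in the surviving text: that the first table gives times in milliseconds (per object in the first three columns, per commit in the last three), that the second table gives totals in seconds, and that 10MB means 10,000,000 bytes. The function names are illustrative only.

```python
# Assumptions (not stated in the text): 10 MB = 10_000_000 bytes,
# first table in milliseconds, second table in seconds.
TOTAL_BYTES = 10_000_000
OBJECT_SIZE = 500
N_OBJECTS = TOTAL_BYTES // OBJECT_SIZE  # number of 500-byte objects in 10 MB


def total_seconds(ms_per_object, n_objects=N_OBJECTS):
    """Scale a per-object time in milliseconds to a total time in seconds."""
    return ms_per_object * n_objects / 1000.0


def commit_seconds(commit_ms):
    """Convert a commit time from milliseconds to seconds."""
    return commit_ms / 1000.0


# Row "update access, no s.e." of the first table: 0.40, 0.07, 0.09 and 1469.
print([round(total_seconds(t), 1) for t in (0.40, 0.07, 0.09)])
print(round(commit_seconds(1469), 2))
```

Under these assumptions the values printed for this row agree with the corresponding row of the second table.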
If the conversion results have already been stored persistently,
subsequent update accesses show essentially the same performance
as prior to schema changes.
Attempts to perform immediate-mode or on-demand conversion on
a database where no schema evolution has been performed take
essentially no time, showing that the software correctly detects
that there are no outstanding conversions to be made. Running
the same tool in the case where schema evolution has been performed,
but already stored persistently, involves scanning through all
objects checking the schema evolution history. Thus performance
is then significantly slower than in the case where no evolution
has occurred, but still much better than when conversion has to
be performed.
Access performance to objects (read or update, compared respectively)
before schema evolution is exactly the same as after schema evolution
and after conversion results have been stored persistently. Before
the storage, conversion is performed "on the fly" (read
access), thus performance degrades slightly.
Mixed-mode conversion has also been tested, e.g. a combination of deferred and immediate mode. Strangely, the real time taken to commit a transaction appears to grow as the number of objects converted in the transaction decreases. This is also being followed up with Objectivity.
Object size appears to affect conversion time differently for the various conversion modes. The results from immediate or on-demand conversion are shown below.
Here we observe that, for a constant volume of data, overall conversion
time improves as the total number of objects decreases. In other
words, converting a few large objects is quicker than many smaller
ones.
Tests of deferred conversion show that performance goes down rapidly
for very small objects.
|
| |||||
| read access before schema evolution | 0.05 | 0.04 | 0.005 | 240 | 230 | 3 |
| read access after schema evolution | 0.63 | 0.58 | 0.007 | 19944 | 19027 | 0 |
| read access after conversion | 0.06 | 0.04 | 0.005 | 245 | 231 | 0 |
| RATIO after s.e. / before s.e. | 11.78 | 15.57 | 1.58 | 82 (!!) | 83 (!!) | 0 |
| update access before s.e. | 0.07 | 0.05 | 0.005 | 1492 | 310 | 224 |
| 1st update (deferred conversion) | 0.20 | 0.16 | 0.010 | 1161 | 330 | 191 |
| update access after conversion | 0.06 | 0.04 | 0.006 | 1612 | 320 | 191 |
| RATIO 1st update / update before s.e | 2.97 | 3.42 | 1.89 | 0.78 | 1.06 | 0.85 |
| immediate, on-demand conversion | 0.26 | 0.23 | 0.009 | 1613 | 148 | 365 |
These figures show that read access suffers most from schema evolution.
The commit time for a read transaction appears to be extremely
long, which is not understood. The time needed to read
objects, while performing conversion "on the fly",
is about 12 times longer than when no conversion is required.
Only the system time is relatively stable. Update access (deferred
conversion with persistent storage) performs much better than
read access. Committing update transactions is affected neither
by schema evolution nor by object conversion - the same is true
also with larger objects.
Other tools (immediate or on-demand) perform a little slower than
deferred conversion, but the performance is still reasonable (see
the table above).
The same tests made for objects of 100 bytes in size show much smaller effects. Committing a read transaction is (only) 4 times slower, and reading objects takes about 3 times as long as for the same objects without conversion. For objects of size 500 or 1000 bytes, the differences with and without conversion are only a few tens of percent.
Consider the scenario in which person A performs only one small
schema change (e.g. changes the order of two members), while
person B makes some tens of changes (which do not affect object
size). Our measurements show that the process of converting objects
will be shorter for person B than for person A! User time is about
1.6 times longer, system time is faster, and real time seems to be about
1.1 times slower. This is not understood and will be checked with Objectivity.
Objectivity/DB allows multiple schema changes to be made without requiring that the affected objects are actually converted between successive changes. The complete history of changes to the schema is stored, thus during conversion, successive changes are performed in the right order. The time taken to perform multiple conversions is approximately the same as that for a single conversion.
Our tests confirm that the schema evolution conversion tools provided
by Objectivity/DB perform as expected. However, there are a number
of observations which are not yet understood, which we list below:
In the current Objectivity/DB version there is no tool for checking whether conversion results have been stored persistently. If a tool accesses objects after schema evolution but before conversion, deferred conversion is performed automatically. If the tool opens a read-mode transaction, the conversion results are not stored persistently in the database, as is to be expected. We have asked Objectivity for a way of checking that changes have indeed been stored persistently.
At the time of writing, the above measurements have not been fully understood, but will be pursued further with Objectivity.
Data replication, and its implementation in Objectivity/DB, is described in detail in [5].
Below, we report on the performance of the Objectivity/DB implementation.
In a read-mostly environment, as is the case in HEP, access to
replicated databases should be at least as good, if not significantly
better, than to non-replicated databases, particularly when access
can be shared across multiple database servers. However, write
operations are likely to be penalised. Even in the best case scenario,
when writing to two servers with identical performance and connectivity,
some small overhead is to be expected. If wide-area replication
is used, particularly over low-bandwidth networks, it is possible
that significant performance penalties will be involved, as the
data is written to all available database images at commit time.
Thus, the commit will wait for the slowest machine involved in
the transaction. Updates for images that are not visible at transaction
commit time are written to logfiles, for transmission to the remote
server when it becomes available again.
As above, the measurements were made using Intel PCs. The local
image was stored on a 90 MHz Pentium PC, with the remote images
on two 150 MHz Pentium PRO PCs using Objectivity/DB, Objectivity/FTO
and Objectivity/DRO version 4.0.1, a pre-release of the officially
supported version, 4.0.2.
Unfortunately, tests between more than three machines (one local and two remote) could not be performed by the time this document was finalised: the Objectivity Fault Tolerant Option and Data Replication Option were available only for Windows NT, and only three NT platforms were available for our tests. These options have since been released for Unix platforms, and we plan to repeat our tests with more systems and across heterogeneous architectures.
Our conclusions are based on observation of real time (real time
should not scale with the number of images if operations are not
performed serially). User and especially system time have to scale
with a growing number of images: adding another image always means
unavoidable additional network communication.
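The reasoning about real time can be made concrete with a toy model. This is only a sketch: the per-image times used in the usage note are hypothetical and are not measurements from this report. If images are updated in parallel, the real time is bounded by the slowest image; if they are updated serially, the real times add up.

```python
def modelled_real_time(image_times, parallel=True):
    """Return the modelled real (wall-clock) time for an operation on all images.

    image_times: per-image times in seconds (hypothetical inputs).
    parallel:    True  -> images are updated concurrently (slowest image wins),
                 False -> images are updated one after another (times add up).
    """
    if not image_times:
        return 0.0
    return max(image_times) if parallel else sum(image_times)
```

For example, with hypothetical times of 0.1 s for a local image and 0.4 s for each of two remote images, the parallel model gives 0.4 s and the serial model about 0.9 s. Comparing measured real times with these two limits is the test applied in the observations that follow.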
Opening a transaction
Replicating data over the network seems to have no influence on the time
taken to open either a read or an update transaction; regardless of how many
images exist, the performance is unchanged. This means that Objectivity
does not check whether remote images are available at the
beginning of a transaction (even in the case of an update transaction), leaving
the problem of obtaining the required quorum until the time when changes
are to be committed.
Creating objects
Objects were created in three different scenarios:
Although for some measurements the performance obtained for scenario 3 was worse than the sum of the performances obtained for scenarios 1 and 2, for many others there was little or no difference in performance between scenarios 2 and 3. This indicates that Objectivity does try to perform the operation in parallel, creating objects at the same time in the local and remote images. Better results (per object) are obtained for smaller objects.
Adding another remote image makes performance roughly twice as bad;
only in very few measurements was a slightly better result obtained.
The connections between the local image and both remote
images should be equivalent, so better performance could have been expected,
e.g. the performance of the slowest image.
Committing an update transaction
Committing changes behaves like creating objects. Maintaining
a local image together with a remote one does not mean that the
time taken will be close to the sum of the times needed for maintaining
the images separately, but adding another remote image seems
(unfortunately) to give a sum. Again, the same measurements show that
operations are not performed serially, but the results are very close
to those expected when data is processed one image after another.
Reading objects
The existence of a remote image significantly influences the access time
to objects.
Reading objects from a database that has two images (one local and one remote) always gives much worse results than reading data explicitly from a local image (only one image exists, kept on the local machine). Reading performance is the same as (or sometimes even worse than!) that of reading data explicitly from a remote image (only one image exists, kept on the remote machine). Adding another remote image does not influence read performance at all: real, user and system times are all exactly the same.
The Objectivity/DB Data Replication Option gives the possibility
not only to replicate an entire database, but also provides management
tools at the container level, including the possibility of changing
control of a container from one partition to another. The process
of passing control of a container to another partition means that
the container still remains under logical control of its database,
but physically is stored in the remote partition. If the "replicated"
container is the only container in the database, the amount
of data stored in it equals the amount of data stored inside the database.
Tests of both read and write performance to such a container show no significant difference whether the container is controlled locally or remotely.
Data replication is a new feature of Objectivity/DB version 4,
which was released in early 1997. This first release is already
quite powerful, although there are a number of areas where enhancements
have been requested, as described in [5]. In addition, there are
a number of measurements that are not yet understood, including:
Further tests of data replication are required, including tests between more than three machines, with different network configurations, and across a WAN.
Object versioning, and its implementation in Objectivity/DB, is described in detail in [5].
Below, we report on the performance of the Objectivity/DB implementation.
As described in [5], versioned objects in Objectivity/DB are accessed
through a so-called genealogy object. Thus, one would expect
a small performance overhead, corresponding to this one level
of indirection. In principle, the performance penalty should be
identical to that involved in opening an object and traversing
an association to a second.
We have measured the overhead of accessing versioned objects both
as a function of the number of versions and as a function of object
size.
As can be seen from the table below, object versioning does not
behave as would be expected. In particular, the number and size
of the VArrays that are created looks highly anomalous. Pending
an explanation from Objectivity of this behaviour, which we believe
to be a bug in the current implementation, we have not made any
further investigation of the performance of object versioning.
For every version one VArray is created, to hold the associations
to subsequent versions. These associations, accessed through nextVers[],
are defined as to-many. In addition, extra VArray(s) for the
associated genealogy object may need to be created, which has
associations to all versions of an object via allVers[].
| 2 | 3 | 56 | 0 | 0 |
| 10 | 11 | 122 | 0 | 0 |
| 100 | 101 | 769 | 0 | 0 |
| 600 | 601 | 4254 | 0 | 0 |
| 650 | 601 | 4254 | 50 | 10152 |
| 1000 | 601 | 4254 | 400 | 10152 |
This behaviour is clearly anomalous, and is being investigated with Objectivity. We note that the same behaviour is seen regardless of object size - we have tested versioning for objects of 12 bytes, 2KB and 50KB in size. Attempts to implement versioning "by hand" result in the creation of the expected size of VArrays, confirming the hypothesis that the implementation is flawed. The same behaviour is seen both with the pre-release of Objectivity/DB version 4, and the official release, 4.0.2.
The Windows NT filesystem (NTFS) performs optional file compression
on a file, directory or drive basis. This compression typically
shows a 50% saving for text files, and some 40% for executables.
In the case of an Objectivity/DB database, in a "best-case" scenario
where only zeroes are written, a database that would take 2GB
uncompressed can be stored in as little as 50MB! Similarly, we
were able to create a federated database that logically contained
400GB of data, although only 25GB of disk space was required.
To test the effectiveness and performance of NTFS compression
on physics data, we have copied the NA45 Ntuple, used for the
performance measurements described in section 17.4 on page 43,
from the CERN CS-2 to both compressed and non-compressed NT filesystems
and compared the space requirements and impact on performance.
| | Uncompressed | Compressed |
| Space requirement - full tag | 327MB | 210MB |
| Space requirement - reduced tag | 22.2MB | 16.2MB |
Ideally, the performance overhead introduced by the ODBMS software should be less than a few per cent. In other words, one should be able to read and write data at approximately the speed of the underlying storage system, although this will typically be an upper limit for best-case scenarios. As is the case with all existing ODBMS products, Objectivity/DB uses the standard filesystem in which to store databases - each database appears to the operating system as a normal file. This means that standard techniques, such as parallel filesystems, file caching etc., translate directly into improved database performance. The maximum throughput obtainable would thus depend only on the hardware resources made available.
We will describe the results of various raw performance
measurements which were made on hard disks. We measured the performance
for three kinds of read patterns: sequential read, random read,
and selective read. These measurements were made to find out what
gains can be expected by switching from the traditional sequential-read
based analysis methods to selective-read analysis methods (section
11.1.3 on page 26), in the case that the hard disk speed is the
limiting factor.
The results below are also important for the interpretation of
some of the higher-level performance measurements which were made
(section 17.4 onwards).
Performance measurements were done with 40 MB files, created on
a near-empty filesystem in a single pass, on the following configuration:
All measurements were made with a small C++ program which used the basic UNIX filesystem calls (open, read, lseek) in the same way as they are used by the SunOS Objectivity/DB implementation. The small C++ program did no significant processing, other than the processing needed to generate the read pattern. We ensured that the OS filesystem cache did not contain any of the data to be read. Therefore, what we measured was the raw performance of the hard disk, accessed through the operating system but without the benefit of OS-level caching.
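The access pattern of such a test program can be sketched with the same three calls. The sketch below is in Python rather than C++ and is not the program used for the measurements; the helper names and the test-file layout are illustrative. Unlike the real measurements, it makes no attempt to keep the data out of the OS filesystem cache.

```python
import os
import tempfile


def make_page_file(n_pages, page_size):
    """Create a temporary file of n_pages pages; page n is filled with byte n % 256."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for n in range(n_pages):
            f.write(bytes([n % 256]) * page_size)
        return f.name


def read_pages(path, page_size, page_numbers):
    """Read the given pages with open/lseek/read and return them as a list of bytes.

    Passing range(n_pages) gives a sequential read, a shuffled list a random
    read, and an ascending subset a selective read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        pages = []
        for n in page_numbers:
            os.lseek(fd, n * page_size, os.SEEK_SET)
            # For a regular file a single read normally returns the full page.
            pages.append(os.read(fd, page_size))
        return pages
    finally:
        os.close(fd)
```

Timing `read_pages` for the three kinds of page lists reproduces the structure of the measurements, although meaningful numbers require files much larger than the cache or a way of flushing it.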
To find out about a possible performance factor dependent on the
choice of the Objectivity page size, we measured the sequential
read speed for different (simulated) Objectivity page sizes. Observing
the result for a sequential read in the figure below, we can see
that the choice of page size does not affect the speed.
Sequential read measurements further showed that the bandwidth
with which data is read varies depending on the location of the
data on the hard disk. We created 60 files which together spanned
slightly more than one disk in the array, and measured sequential
read data rates from 4.5 MB/s up to 6.5 MB/s for individual files,
with a mean value of 5.5 MB/s.
The explanation is that with current hard disk technology, the rotation speed of the disk is constant, while the tracks near the outside of the disk contain more data than the tracks near the inside. The limiting factor to the amount of data on the disk is the number of bytes which can be stored per square millimetre on the disk platter. The amount of data which can be put in a track is therefore dependent on the physical length of the track.
We measured the performance for a completely random page read
pattern for different (simulated) Objectivity page sizes. Observing
the result in Figure 11-1, we see that the random speed is much
lower than the sequential read speed. For example, it is about
a factor 7.5 lower for 8 KB pages. For small page sizes, the random
read time is completely dominated by the hard disk seek time.
A more useful measure of random read speed is therefore the number
of pages per second, as plotted in Figure 11-2. This figure also
shows that for very large page sizes, the gap between sequential
and random reading will close.
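The shape of these curves can be illustrated with a simple model in which every random page read costs a fixed positioning overhead plus the transfer time of the page. This is a sketch under assumptions: the overhead of 9.2 ms is not a measured value but was chosen so that the model reproduces the factor of about 7.5 for 8 KB pages at the mean sequential rate of 5.5 MB/s quoted above, and 1 MB is taken as 1024 KB.

```python
SEQ_MB_PER_S = 5.5  # mean sequential rate quoted in the text
SEEK_MS = 9.2       # assumed positioning overhead per random page read


def transfer_ms(page_kb, seq_mb_per_s=SEQ_MB_PER_S):
    """Time in ms to transfer one page at the sequential rate."""
    return page_kb / 1024.0 / seq_mb_per_s * 1000.0


def random_pages_per_second(page_kb, seek_ms=SEEK_MS):
    """Modelled number of randomly located pages read per second."""
    return 1000.0 / (seek_ms + transfer_ms(page_kb))


def sequential_to_random_gap(page_kb, seek_ms=SEEK_MS):
    """Modelled ratio of sequential to random read bandwidth."""
    t = transfer_ms(page_kb)
    return (seek_ms + t) / t


for kb in (2, 8, 32, 1024):
    print(kb, round(random_pages_per_second(kb)), round(sequential_to_random_gap(kb), 1))
```

In this model the number of pages per second is almost constant for small pages, because the overhead dominates, and the gap between sequential and random reading approaches 1 as the page size grows.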
In a selective read, a collection of pages is traversed sequentially,
but only some of the pages are actually read in. This is illustrated
by Figure 11-3, where only the dark pages are read. We have measured
the performance of selective reads with the following properties.
An important metric for a selective read is the page selectivity,
which is the percentage of pages in the collection which are read.
If there are multiple objects per page, the page selectivity differs
from the object selectivity, the selectivity of the (physics)
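query which determines which objects to read. The relation between
the page selectivity Spg, the object selectivity
Sobj and the number of objects per page Npg
is (assuming that the selected objects are distributed independently over the pages):
Spg = 1 - (1 - Sobj)^Npg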
This relation is illustrated in Table 11-1. For the
implementation in section 17.4.2 on page 45, which has 6 full
tag objects on each 8 KB page, an object selectivity of 2% translates
to a page selectivity of 11%, and an object selectivity of 20%
translates to a page selectivity of 74%.
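The figures quoted above can be checked numerically. The sketch below assumes the relation Spg = 1 - (1 - Sobj)^Npg, which holds when the selected objects are spread independently over the pages: a page is skipped only if none of its Npg objects is selected. The function name is illustrative.

```python
def page_selectivity(object_selectivity, objects_per_page):
    """Fraction of pages holding at least one selected object.

    Assumes each object is selected independently with probability
    object_selectivity, so a page is skipped with probability
    (1 - object_selectivity) ** objects_per_page.
    """
    return 1.0 - (1.0 - object_selectivity) ** objects_per_page


for s_obj in (0.02, 0.20):
    print(s_obj, round(page_selectivity(s_obj, 6), 2))
```

With 6 objects per page this gives page selectivities of about 11% and 74% for object selectivities of 2% and 20%, in agreement with the text.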
We have measured the performance of selective reading for different
selectivities on 8 KB pages. Observing the result in Figure 11-4,
we see that the bandwidth decreases rapidly when the reading
becomes more selective. The curve eventually levels out at the
bandwidth value for a completely random read. To find out in which
cases selective reading outperforms a normal sequential read over
all data, we can calculate the speedup factor for the selective
read optimisation. With a sequential read speed of 5.4 MB/s, we
have
The result of this calculation is plotted in Figure 11-5. We can
conclude that for this disk and this page size, selective reading
is only interesting as an optimisation technique if the page selectivity
is smaller than 15%. This 15% figure explains the somewhat disappointing
performance results for the 'cold' case of the reduced tag implementation
of section 17.4.2 on page 45. Note that the measurement in that
section was done on another system with another type of disks.
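One way of expressing the speedup factor is sketched below. It is an assumption about the form of the calculation, not a transcription of the formula used for Figure 11-5. It assumes that the selective-read bandwidth counts only the bytes of the pages actually read. Reading a fraction S of a collection of size V at bandwidth B_sel(S) then takes S * V / B_sel(S), against V / B_seq for a full sequential read, so the speedup is B_sel(S) / (S * B_seq).

```python
def selective_read_speedup(page_selectivity, selective_mb_per_s, sequential_mb_per_s=5.4):
    """Speedup of a selective read over a sequential read of all pages.

    page_selectivity:    fraction of pages read, 0 < S <= 1
    selective_mb_per_s:  bandwidth measured for the selective read, counting
                         only the pages actually read (assumption)
    sequential_mb_per_s: sequential read speed (5.4 MB/s in the text)
    """
    if not 0.0 < page_selectivity <= 1.0:
        raise ValueError("page_selectivity must be in (0, 1]")
    return selective_mb_per_s / (page_selectivity * sequential_mb_per_s)
```

Under this definition, a selective read at 15% page selectivity breaks even (speedup 1) when its bandwidth is 0.15 * 5.4 = 0.81 MB/s; a speedup above 1 requires a higher bandwidth at that selectivity.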
We have also measured the speedup for different page sizes. Observing
the results in Figure 11-6, we note that 16
KB pages yield about the same speedup curve. For 32 KB pages,
selective reading will pay off if the page selectivity is smaller
than 30%. Note however that, for the same object selectivity and
object size, a page selectivity of 15% on 16 KB pages generally
translates to a page selectivity of 30% on 32 KB pages.
The hard disks we did our measurements on were manufactured in
1996. Some comparative measurements were performed on different
hard disks, which were manufactured around 1994. On these hard
disks, selective reading on 8 KB pages will speed things up if
the page selectivity is below 10%.
A surprising result on these disks was that selective reading
with a page selectivity in the range 20% - 70% would slow things
down: it had a speedup factor which was noticeably less than 1.
As the worst case slowdown, selective reading of 40% of the pages
took 160% of the time of a sequential read of all pages. The `left
to right' reading of a specially constructed non-random page set
even took 550% of the time of a sequential read of all pages.
This result is probably due to a shortcoming in the read-ahead
caching logic of these hard disks: the disks failed to enable
read-ahead caching during some parts of the selective read, so
that the throughput was essentially reduced to that of a completely
random read.
Though these results show that one has to be careful with using
old disks, we expect that most current disks, and probably all
future disks, will be able to properly recognize the sequential
nature of a selective read, so that the speedup factor will never
be noticeably less than 1. Object database applications, which
will often generate selective read patterns, will become more
common. We therefore expect that all hard disks will soon include
the caching logic necessary to detect selective reading and apply
read-ahead caching to it. Also, the price-per-bit for RAM chips
is improving faster than the price-per-bit for hard disk platters.
Therefore, one can expect that hard disk cache controllers will
have increasingly larger caches to work with.
Striping is usually done to improve sequential reading, but if
the striping factor is above the page size, striping can improve
performance for random and selective reads as well, because multiple
hard disks can seek in parallel. Note however that to exploit
this parallelism, one needs to also parallelise the program which
does the reading.
Though striping can speed up random and selective reads, it will
speed up sequential reads by the same factor. Striping will
therefore not change the selectivity value at which selective
reading becomes attractive as an optimisation.
To get an optimal price/performance ratio for systems which do
a lot of random or selective reading and writing, it is better
to install a disk array with many small hard disks than a disk
array with a few large hard disks. For the most part, random and
selective read performance is determined by the number of independent
disks (seek mechanisms) one has available in the disk array. Seek
mechanisms in current mainstream hard disks all have about the
same speed characteristics, no matter what the disk capacity is,
and the hard disk price is mostly linear with the disk capacity.
The speed of random and selective reading is dominated by the
seek time of the hard disk, and seek times are improving only
slowly. The seek time is dominated by two mechanical factors:
the disk rotation speed and the speed with which the read/write
arm of the disk can be moved. The speed of sequential reading
is largely dominated by the data density on the medium, and this
speed is improving more rapidly than the seek time.
We therefore expect the speed gap between sequential reading and
random reading for equal page sizes to widen in future. For reading
8 KB pages, the gap may widen from a factor 7.5 for our measurements
now to a factor 20-30 in 2005. This widening also has an impact
on the selectivity threshold at which selective reading becomes
attractive as an optimisation device. The current threshold of
15% for 8 KB pages could change to 5% in 2005. Of course, some
new technology development may completely change the gap between
random and selective reading. Note however that the commodity/desktop
market, which is expected to drive innovation, is largely dominated
by sequential reading.
One important access pattern for HEP applications is reading or writing in a "sequential" way to the event store. The term "sequential" writing means in this context that one (or a few) containers in the database are continuously extended by inserting new objects at the end of the container. This pattern is dominant during data recording or reconstruction and during a physical reclustering of data. The term sequential reading in a database is used in the following for reading objects in the same sequence in which they are physically stored in the container.
The sequential read access pattern is dominant for selections
from all objects in a given database or container using Objectivity's
scan method (e.g. a primary selection against the tag database).
It should be noted that for sequential database access (and only
for sequential database access!) one could expect to reach an
I/O rate comparable to the rate delivered to a program that performs
sequential file access.
In order to optimise the performance of object I/O using an ODBMS, it is essential to control the storage size that those objects will need in the database and the storage overhead introduced by the database. Since Objectivity's I/O is done in units of pages (a typical page size is 8 KB) rather than on single objects, the database I/O is limited by the page transfer rate that can be achieved through the filesystem or network.
User applications are normally interested in the rate of object I/Os, which may be significantly lower if the database pages are not densely filled with object data.
The following table summarises the main contributions
to the effective object size and storage overhead on disk. For
small objects the contribution of 14 bytes from the persistent
base class (ooObj) may introduce a significant storage overhead
and thereby reduce the I/O performance. Consider the following
extreme example: For persistent objects containing a single float
attribute (4 bytes), rounding up to the next 8-byte boundary yields
8 bytes. Adding the overhead of 14 bytes from ooObj one obtains
the storage size on disk of 22 bytes. The fraction of the float
attributes in the total data transferred by the database is thus
about 18 %. Consequently, the best-case retrieval rate for this
attribute can only reach 18 % of the disk transfer rate. It should
be noted that for some applications the additional functionality
offered by the database for individual objects like indexing,
associations and versioning might still make this approach a sensible
implementation. In most cases it will be more efficient with respect
to performance and storage efficiency to make small objects persistent
by containment in a persistent VArray.
| Size of base class | ||
| Embedded attributes | ||
| Alignment of data members enforced by compiler / CPU | ||
| 4 Bytes embedded
| Embedded Varray is implemented as external variable length storage object | |
| Inline 1-to-1 Association | ||
n * 8 bytes | Inline 1-to-n Association | |
n * 12 bytes | Non-inline Associations | |
| Remaining free space on page | Object Distribution on page | |
| Round attribute size up to next 8 bytes boundary | Object Alignment in "slots" |
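The single-float example can be reproduced with a few lines of arithmetic. The sketch below uses only the two contributions named in the text (rounding the attributes up to an 8-byte boundary and adding 14 bytes for ooObj); other contributions from the table, such as associations and free space on the page, are ignored. The function names are illustrative.

```python
OOOBJ_OVERHEAD = 14  # bytes contributed by the persistent base class (from the text)
ALIGNMENT = 8        # attribute size is rounded up to the next 8-byte boundary


def stored_size(attribute_bytes):
    """On-disk size of one object: aligned attributes plus base class overhead."""
    aligned = -(-attribute_bytes // ALIGNMENT) * ALIGNMENT  # ceiling to a multiple of 8
    return aligned + OOOBJ_OVERHEAD


def payload_fraction(attribute_bytes):
    """Fraction of the transferred bytes that is actual attribute data."""
    return attribute_bytes / stored_size(attribute_bytes)


print(stored_size(4), round(payload_fraction(4), 2))
```

For a single 4-byte float this gives a stored size of 22 bytes and a payload fraction of about 18%, as stated above. For larger objects the fixed overhead quickly becomes negligible, e.g. about 97% for 500 bytes of attributes.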
To study the influence of the object size on the storage consumption on disk we have generated multiple databases, each containing objects of only one fixed size. For each database a total amount of 200 MB object attributes has been written.
We define the storage efficiency as
Eff = (N * Object Size) / (DB File Size).
The measured storage efficiency as function of the object size
is shown in Figure 11-7.
The plot shows that for objects smaller than 1kB or larger than 51kB Objectivity/DB achieves a storage efficiency of more than 90%. For larger objects (> 1MB) the efficiency is larger than 99%. We consider the result for large objects to be promising since the dominating contribution to the event data will come from raw data objects, which probably will be stored as large VArray objects.
When the size of stored objects is similar to the page size of the database, the remaining space on a page with one object cannot be re-used for another object. This overhead caused by the distribution of objects on pages can be significant. For objects of 4kB and 8kB, the storage efficiency drops to nearly 50%. In the first case every 4kB object has to go onto an empty page since the remaining space from the 8kB page is slightly smaller than the next 4kB object. In the 8kB case the 8kB object as a "large" object has to be split onto two pages. The second of those pages - even though nearly empty - will not be reused to store parts of the next large object.
It should be pointed out that both extreme cases are expected
to occur only rarely in realistic applications. Since the object
sizes in a particular container are normally distributed, the
remaining space on a page will often be re-used by the next smaller
object. Still this example shows that one should carefully check
the relation between object and page sizes.
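The two dips in storage efficiency can be reproduced with a toy packing model. This is a sketch under stated assumptions and not a description of the actual Objectivity/DB page layout: the per-page bookkeeping of 32 bytes is a hypothetical value, the per-object overhead is the 14 bytes quoted earlier, objects that fit on a page are never split, and pages used by a large object are not shared with other objects.

```python
import math

PAGE_SIZE = 8192      # bytes, the typical page size quoted in the text
PAGE_HEADER = 32      # hypothetical per-page bookkeeping, an assumption
OBJECT_OVERHEAD = 14  # bytes per object from the persistent base class


def slot_size(object_bytes):
    """Object size rounded up to an 8-byte boundary plus the per-object overhead."""
    return -(-object_bytes // 8) * 8 + OBJECT_OVERHEAD


def storage_efficiency(object_bytes, page_size=PAGE_SIZE):
    """Modelled Eff = (N * object size) / (file size) for equal-sized objects."""
    usable = page_size - PAGE_HEADER
    slot = slot_size(object_bytes)
    if slot <= usable:
        per_page = usable // slot  # whole objects per page, the rest is wasted
        return per_page * object_bytes / page_size
    pages = math.ceil(slot / usable)  # large object spread over whole pages
    return object_bytes / (pages * page_size)


for size in (4096, 8192, 1024 * 1024):
    print(size, round(storage_efficiency(size), 3))
```

In this model both 4kB and 8kB objects reach an efficiency of 50%, for the reasons given above, while 1MB objects exceed 99%. The model is not intended to reproduce the measured curve for small objects.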
The figure above shows the sequential write performance obtained
for the same set of databases. The top line shows the rate at which
pages are transferred by the database; the lower line shows the effective
rate at which object attributes are transferred. The measured rates
are compared to a filesystem transfer rate of 6 MB/s
measured for sequential writing into a file from C.
For object sizes larger than 4kB the database operated at a page rate of more than 4 MB/s (more than 70% of the throughput obtained for sequential I/O). The achieved page rate is not very dependent on the object size. The effective rate at which application objects are transferred shows the influence of the storage efficiency in the region around 8kB where object and page size are comparable. The effective data rate reaches only 50% of the page rate in this region.
For smaller objects (<8kB) only a relatively low transfer rate
of 1MB/s (~17% of the filesystem performance) has been achieved.
Since also the page rate is much smaller than for larger objects,
storage overhead effects can not explain this. A more detailed
investigation on this topic is underway.
Read/write performance of up to 100MB/second has been measured
on a Digital Alpha 4100 server. These figures exceed the initial
goal of 90MB/second. To achieve these results, the following configuration
was used:
Read/write performance through the filesystem layers, including software RAID etc., shows a degradation of less than 2%.
One of the key issues related to understanding the effectiveness
of an ODBMS as data store for physics analysis is how efficiently
physics objects can be selected from the data store and accessed
by a particular analysis program.
Currently most analysis tasks in high-energy physics fall into
one of two distinct classes separated by their input data type.
These mostly non-interactive programs are implemented by compiled FORTRAN and freely access any data item from the experiment event data hierarchy. Typically they read sequentially through the full input data set or do a more sparse read based on a fixed event classification scheme. Normal analysis practice is either to copy a selected subset of full events to another disk file or to copy a subset of all data items into a so-called Ntuple for further analysis.
These are often interactive programs that process a small amount of data
stored in a special file format called "Ntuple". The
introduction of another copy of the data, together with the restriction
of the data format to a simple table of attributes (typically
one row per event), served two main purposes:
It should be noted that both traditional analysis scenarios are still mainly based on sequential access, because the I/O subsystems used do not provide any more advanced access methods like indices or hash tables.
ODBMS products provide a wide range of more sophisticated techniques
for selecting data from large input. In the case of Objectivity/DB,
these include
An initial understanding of the performance and scalability characteristics
of these techniques is essential input to the design and implementation
of an object store that is optimised for query performance.
We report below on generic performance measurements that have been made in these areas.
The original PAW Ntuple was essentially a table of floats
- only one data type was supported, a maximum of 512 columns was
permitted, and the number of rows was largely determined by the
size of the /PAWC/ common block and also limitations imposed by
the Zebra RZ system, which until recently (and still by default)
used 16 bits for fields such as the maximum number of records
and the maximum directory size. Data was stored in an Ntuple by
first filling a vector of length equal to the number of columns
in the Ntuple, which was then passed in the argument list to the
filling routine. A minimum of two data copies was required prior
to writing the data to persistent storage (disk), when further
data copies could take place, depending on issues such as the
data format ("native" or "exchange"), the
number of rows, output buffer size and so on.
When reading such an Ntuple, the entire row had to be read, even if only a few columns from the row were selected (in fact, many rows are fetched from disk, corresponding to the size of the Zebra bank used to store the data in the Zebra RZ file, itself determined by the parameters given at Ntuple "booking" time). These "original" Ntuples are now referred to as "Row-wise" or RWN since all data items from one row are physically clustered on the disk file.
Since the Ntuple data was completely detached from the main event data, it had to physically contain all data items used to perform event selections and all data items visualised for the selected events. Clustering selection data together with visualisation data in this way degrades performance for queries that select only a small fraction of the events, since the visualisation data is then only rarely accessed.
More recently, an improved Ntuple was introduced. Not only did
this overcome many of the limitations imposed by the RWN, such
as permitting data types other than floats, variable length
blocks etc., but also a more flexible storage model was introduced.
The new Ntuple format clusters all data items of a given column
on disk and therefore allows only (buffers containing) the selected
columns to be read back. It is hence named the "Column-wise Ntuple" (CWN).
The potential performance gain derived from this new storage model
relates directly to the fraction of the total number of columns
that are read - if only a small fraction of the columns are selected,
then a large gain will result. If virtually all of the columns
are required, then a correspondingly smaller improvement, or none,
will be seen. However, the implementation implies that a selected
row must always be read in its entirety. A more granular implementation,
whereby the required attributes are only read for those events
(or rows) passing the selection criteria may offer better performance,
depending how the actual clustering is performed.
Statistics show that typical PAW queries access only 20% of the columns in an Ntuple, hence CWN should in theory result in a speed-up by a factor of 5. However, the measured performance of CWN appears to be limited not only by I/O time but also by the more complex implementation of its data handling. As soon as more than 10% of all CWN variables are used, the benefit of reduced I/O with respect to RWN is completely used up and the CWN performs worse than the RWN.
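The theoretical factor quoted above follows from a pure I/O-volume argument, which can be sketched as follows. This is a minimal model, not a description of the HBOOK implementation: the function names are hypothetical, every value is assumed to have the same size, and buffer granularity and CPU costs (which the measurements show to be significant) are ignored.

```cpp
#include <cassert>
#include <cstddef>

// Bytes that must be read to scan a row-wise Ntuple: every row is
// fetched in full, whichever columns the query actually uses.
std::size_t rowWiseBytes(std::size_t rows, std::size_t columns,
                         std::size_t bytesPerValue) {
    return rows * columns * bytesPerValue;
}

// Bytes read when the same data is stored column-wise and the query
// touches only `usedColumns` of the columns.
std::size_t columnWiseBytes(std::size_t rows, std::size_t usedColumns,
                            std::size_t bytesPerValue) {
    return rows * usedColumns * bytesPerValue;
}

// Ideal, I/O-only speed-up of column-wise over row-wise storage.
double idealSpeedUp(std::size_t columns, std::size_t usedColumns) {
    return static_cast<double>(columns) / static_cast<double>(usedColumns);
}
```

For a query that uses 20 out of 100 columns, idealSpeedUp(100, 20) gives the factor of 5 mentioned in the text. The measured break-even point at about 10% of the columns indicates that this model is too optimistic for the real CWN.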
It should also be noted that the low fraction of used Ntuple columns
might merely reflect one of the known weaknesses of the current
Ntuples. Creating an Ntuple is typically a lengthy process, requiring
an ad-hoc batch job which processes a large subset of the data.
If it is discovered that more information than is present in the
Ntuple is required, or if one or more columns needs to be recalculated,
then this lengthy process must be repeated. Hence, the observation
that only 20% of the columns are referenced in typical queries
may simply indicate that users are trying to minimise
the number of times that the Ntuple must be recreated, and store
extra information "just in case".
This technique is simply an elementary form of data clustering.
By minimising I/Os, performance is improved - the objective behind
efficient data clustering. As Ntuples are stored in ZEBRA RZ files,
the level of clustering that is possible is constrained to that
supported by RZ itself. RZ allocates disk storage in units of
records, which are allocated sequentially from the free pool.
This implies time-based clustering - data that is added to an
existing RZ file will be stored on the same or adjacent
records only if no other data has been added to
the file since the original records were written. Although the
Ntuple implementation attempts to circumvent this problem, by
maintaining large buffers for the different Ntuple columns, the
data is necessarily fragmented, except in the atypical situation
of very small Ntuples.
In addition, the Ntuple stores both the information that is used
for queries and the information that will be analysed -
in principle, the selection of events can be based upon a small
subset of the event characteristics and should not force a common
clustering strategy for the data used for selection and the data
that is to be e.g. histogrammed. Using an Ntuple from NA45 for
comparison, we examine below the benefits of separating the data
used for queries from that needed for analysis.
It is our opinion that the analysis framework should not impose
a particular data model or format, and that converting data to
such a format is a major inconvenience which should be avoided
in future systems. This is particularly important given the volume
of data involved in an LHC experiment - redundant copies must
be avoided at all costs.
In summary, the storage models used by both RWN and CWN were directly dependent on the data model, which in turn was inflexible and imposed artificial constraints.
In an ODBMS, the data and storage models are largely de-coupled.
The data (or rather object) model is essentially that of the
user, within the limitations imposed by performance. The basic
guidelines, as summarised in milestone 1, are as follows:
Data clustering and re-clustering techniques provided by an ODBMS
are far more powerful than those offered by a CWN, although they
are based on objects and not object components. That is, a clustering
directive applies to the entire object - it is not possible to
store one data member of a given object instance in container
X and another in container Y. (If such a decomposition is required,
a different object model is probably called for.) Each instance
of an object may be given a separate clustering directive. Alternatively,
the directive may be class-based, event-based, or use some other
appropriate strategy. Although the clustering directive can be
used to write data essentially sequentially, it is clearly far
more flexible than the record-based implementation imposed by
ZEBRA RZ. Using appropriately sized containers, additional objects
can always be located close to existing ones, and a fragmented
database/file can be avoided.
In contrast, the clustering implemented by the CWN is fixed, and
is directly coupled to the data model, which requires that the
user describe their data as a (set of) Fortran COMMON block(s).
To examine the effect of different clustering techniques possible
with an ODBMS, we have implemented those enforced by both the
RWN and CWN, and experimented with the more flexible capabilities
mentioned above. Comparisons are made below primarily for NA45
event data, but also for simulated data from GEANT-4.
Object Databases also support reclustering of data, whereby persistent objects are moved, for example, to different containers, or even different databases. The extent to which such reclustering is fully transparent depends on the architecture of the ODBMS in question - extra levels of indirection offer more flexibility, but less performance. We report below on the reclustering techniques that are available with Objectivity/DB, and their impact on different access methods.
In conventional HEP data processing, Ntuple generation is a separate
step, often repeated many times. Ntuple generation essentially
consists of scanning through large numbers of events and selecting
only the needed data. In fact, this step cannot be avoided even
if all of the data from all events are selected,
simply due to the design of PAW, typically used to perform interactive
analysis of Ntuples.
A further drawback of the Ntuple approach is that the entire input dataset needs to be reprocessed if additional data are required.
With an ODBMS, however, such a step is not required - analysis and visualisation can proceed directly from the ODBMS itself, although it may be advisable to copy the data in many cases, simply to recluster a sparse data sample. However, in the case of such reclustering, one can still access any other component of any event, with a performance penalty that is significantly lower than performing the entire reclustering exercise, or Ntuple generation, from scratch. We report below on the costs of accessing such data in various scenarios.
Rather than extract data into specific streams, as is typically
performed today, it is expected that popular subsets of the event
data will be accessed via named persistent collections.
A possible scenario might be the generation of a persistent event
collection using a boolean predicate function, as shown in the
example below. This can be considered to be an extension of the
widely used event directory concept, with the advantage that these
features are supported directly by the database, and can thus
be used in a consistent manner with other data management techniques,
rather than as an add-on.
Rather than issue a true query, the user would typically access
such a named event collection and then refine it further. This
may result in another persistent collection, or a purely transient
collection, as shown below.
Predicates themselves may be defined in a variety of manners,
e.g.
Predicates may themselves be persistent objects, stored in the database, and associated to other persistent objects, such as resulting event collections, as required. Predicates may be combined using AND and OR operators. In the case of persistent predicates, object database features such as object versioning may be used. An example application, using versioned persistent predicates, is described in [5].
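The combination of predicates and the refinement of an existing collection can be illustrated with a small transient sketch. It uses plain C++ containers rather than Objectivity/DB classes, and the Event structure, the function names and the representation of a collection as a list of positions are all assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical, minimal event summary; a real event model is far richer.
struct Event {
    int   nTracks;
    float energy;
};

using Predicate = std::function<bool(const Event&)>;

// AND combination of two predicates.
Predicate both(Predicate a, Predicate b) {
    return [a, b](const Event& e) { return a(e) && b(e); };
}

// OR combination of two predicates.
Predicate either(Predicate a, Predicate b) {
    return [a, b](const Event& e) { return a(e) || b(e); };
}

// Build a collection (positions in `events`) of all events satisfying `p`.
std::vector<std::size_t> selectEvents(const std::vector<Event>& events,
                                      const Predicate& p) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < events.size(); ++i)
        if (p(events[i])) out.push_back(i);
    return out;
}

// Refine an existing collection instead of rescanning all events.
std::vector<std::size_t> refineCollection(const std::vector<Event>& events,
                                          const std::vector<std::size_t>& collection,
                                          const Predicate& p) {
    std::vector<std::size_t> out;
    for (std::size_t pos : collection)
        if (p(events[pos])) out.push_back(pos);
    return out;
}
```

A user would typically call refineCollection on a shared, named collection rather than selectEvents on the full event sample. In a persistent implementation the predicate and the resulting collection would be database objects, which is not modelled here.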
One proposed technique for enhancing query performance is the so-called tag database. A tag database, which is logically part of the experiment's federated database, contains the key attributes on which queries are performed. In contrast to Ntuples, which contain both the fields on which queries are made and the attributes to be visualised, the tag database need not contain attributes that are typically not used in queries. (It is, in fact, possible for the tag objects to contain attributes that are typically not used in queries. Similarly, queries may be issued against any attribute in the federation, and not just the subset in the tag. However, such queries will obviously not benefit from the clustering of the tag objects in the tag database.) As the tag objects are managed by the database, they benefit from the full ODBMS functionality. For example, one can profit from schema evolution facilities to add additional data to existing tags. Similarly, one may version tag objects, or, most importantly, use data replication to ensure that a local image of the tag database is available, thus reducing the network bandwidth required to perform a query, increasing reliability and reducing the load on central machines.
Rather than agree on a single tag for all working groups, a more flexible alternative would be to use a hierarchy of tags. This avoids the problem of creating a tag sufficiently flexible, and therefore large, as to satisfy all potential users. In addition, it helps to circumvent problems related to reclustering - the users' collections are always based upon standard collections with bi-directional, and hence automatically updated, associations to the event data.
For example, one might implement the following tag hierarchy:
In such a scenario, users would typically start from their private tags, accessing the workgroup and finally experiment tags only when a match was found at the previous level.
Various clustering operations are available for tag objects. For
example, the clustering may be performed:
The former case would typically be implemented using a single
tag object per event, each attribute being a data member of the
tag object. As the tag is a true object, (bi-directional) associations
to e.g. the corresponding event data are directly supported by
the ODBMS. In addition, access to individual attributes is highly
efficient. The main disadvantage of such a technique is that clustering
cannot be optimised for queries that use only a few attributes.
However, using techniques such as those described above, such
drawbacks can be easily circumvented.
Clustering by attribute could be used analogously to the column-wise
Ntuple. A possible ODBMS implementation would be to store the
individual attributes in separate VArrays. However, it would also
be necessary to maintain a parallel set of VArrays to handle the
associations from the tags to corresponding event data. Such an
implementation would permit efficient queries on a small number
of attributes, but would clearly add additional complication,
and would not exploit some valuable capabilities of the ODBMS.
Below, we describe performance comparisons between PAW and Ntuples
and simple TagDB implementations. The approach has been to perform
one to one comparisons between Ntuples and TagDB implementations
using Objectivity/DB. To this end, we have used a standard NA45/CERES
Ntuple from their 1995 production data. The same analysis has
been performed using both PAW + Ntuples and Objectivity/DB, under
a variety of different cache conditions. The benchmark environment
used in both cases was identical, using the following:
The NA45 Ntuple used contains 302 columns (all floats) and some 21K rows, giving a total size of around 25MB. This is seen to be a somewhat typical size for Ntuples today, although they are often combined into larger logical units using the PAW chaining facility.
The main time during the PAW-based analysis is spent in a single command, namely:
ntuple/loop [ntuple] ana.f
ana.f is a single, compiled Fortran function that performs
all selection cuts and histogramming. In the current analysis, some
15 columns are typically used to make selections, although this
is expected to rise so that eventually 80 columns are used.
The original Fortran source is shown in section 23 on page 60
and the translated C++ version is shown in section 24 on page
65. Note that both versions suffer from the naming restrictions
of Ntuple variables - a pure C++ implementation would obviously
use longer, but more meaningful, names.
In the case of the TagDB, the selection code is as follows:
    tagItr.scan(tagCont, oocRead);
    Timer t("simple scan");
    long total = 0, matched = 0;
    t.Start();
    while (tagItr.next()) {
        total++;
        // the Match function accesses the attributes
        // used for histogramming
        if (tagItr->Match())
            matched++;
    }
    t.Stop();
The time shown is that spent on performing the event selection, including the time to access the attributes used in the selection and those attributes used for the histogramming. The time spent in filling and displaying histograms has not been measured, as this is independent of the database.
The first tests were based upon an implementation whereby each
row in the NA45 Ntuple was converted to a separate "tag"
object, i.e. an object with one attribute corresponding to each
column in the Ntuple. This means that no traversals are required
to access additional objects to perform the query or to fill the
histograms. All the tags were stored in a single container in
an Objectivity/DB database, and clustered according to insertion
time. A compiled, user-written selection function was used in
both cases. The time taken to compile and load these functions,
and the time taken to fill the histograms, has been subtracted
from the values shown.
| Configuration | Time | Comments |
| PAW + RWN | 11.3s cold, 2.5s hot | Cold: first pass, 0% cache efficiency. Hot: second pass, 100% cache efficiency |
| PAW + CWN | 16.4s cold, 2.6s hot | Converted using htonew |
| TagDB | 6.2s cold, 1.4s hot | |
A slightly more complex implementation is one in which the data used for the selection is stored separately from that used in the analysis of the selected events. In this case, two objects are used - one containing the 15 "columns" used in the selections and the other containing the remaining data. These objects were clustered separately and were stored in different databases (files) in the same federation. A traversal from the reduced tag object to the full event data is made only when an event is selected.
It should be noted that the separate clustering of tag and event data results in better clustering of the tag objects, since the amount of unneeded data read per page is reduced.
If an event is selected and the event data hierarchy is accessed,
this separate clustering also means that in most cases another
page, containing the event object, has to be transferred. Whether
this implementation performs better or worse than the Full Tag
implementation therefore also depends on the selectivity of the
event selection. The following table shows the results for three
cases ranging from no selection at all to 20% selected events
- the selectivity resulting from the NA45 selection cuts. The
reduced tag implementation clearly shows the expected performance
gain if no events are selected. In this case no traversals to
other event objects are made at all, and only those attributes contained
in the tag are accessible. A cold query on the reduced tag is
more than four times faster than on the full tag. This performance
gain is reduced to roughly equal performance once events are selected:
at 2% selectivity the cold query takes 5.5s, and at 20% selectivity 7.5s,
compared with 6.2s for the full tag.
| Tag Implementation | Time | Comments |
| Full Tag | 6.2s cold 1.4s hot | No traversal |
| Reduced Tag - 0% selectivity | 1.3s cold 1.0s hot | No traversal |
| Reduced Tag - 2% selectivity | 5.5s cold 1.2s hot | One traversal |
| Reduced Tag - 20% selectivity | 7.5s cold 1.5s hot | One traversal |
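The access pattern of the reduced tag implementation can be sketched as follows. This is a transient illustration only: the structures and names are hypothetical, a single float stands in for the 15 selection columns, and a vector position stands in for the database association from the tag to the event data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the remaining event data, clustered separately.
struct FullEvent {
    std::vector<float> data;
};

// Stand-in for the reduced tag.
struct ReducedTag {
    float       cutVariable;  // represents the selection columns
    std::size_t eventIndex;   // represents the association to the event
};

struct ScanResult {
    std::size_t scanned;      // tags read during the scan
    std::size_t traversals;   // tag-to-event traversals performed
    double      sum;          // toy "analysis" result
};

// Scan all tags; follow the association only for selected events.
ScanResult scanReducedTags(const std::vector<ReducedTag>& tags,
                           const std::vector<FullEvent>& events,
                           float threshold) {
    ScanResult r{0, 0, 0.0};
    for (const ReducedTag& tag : tags) {
        ++r.scanned;
        if (tag.cutVariable > threshold) {
            const FullEvent& ev = events[tag.eventIndex];  // the traversal
            ++r.traversals;
            if (!ev.data.empty()) r.sum += ev.data[0];
        }
    }
    return r;
}
```

The number of traversals grows with the selectivity, and in the database each traversal may cost an additional page transfer. This is the effect visible in the cold timings of the table above.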
For the first tests we have tried to keep our benchmark as close as possible to the Ntuple analysis done by NA45. This constraint also limited the total number of events (25k) and the total amount of data to some 25MB.
In a second step we have artificially enlarged our dataset by
creating multiple copies of the same events. For PAW we copied
the Ntuple files into new files. For the ODBMS we have copied
the event and tag objects within the database. Within the available
filesystem space, we managed to increase the size of the analysis
data by more than a factor of 32 representing an Ntuple of nearly
1GB size.
Although the query performance achieved with the TagDB was found to be very promising, we did not yet fully exploit the more advanced access methods provided by an ODBMS. Up to now, the technique we used to select events was still based on a sequential scan through all tag objects in the TagDB. The time needed to perform this operation on a single processor is expected to scale linearly with the number of objects in the input collection. A much weaker dependency on the input collection size, up to very large input collections (more than 10^9 events), is expected when indices on selective analysis attributes are used.
Indices are a way to cluster one or more object attributes, together with a reference to the original object, on special index pages. Within one index page the attribute/reference pairs are ordered by the value of the attribute. All index pages form an n-ary tree (n being the number of attribute/reference pairs fitting on one database page). The index implementation used by Objectivity/DB is based on the B+-Tree algorithms, which keep a tree reasonably balanced even for a degenerate distribution of attribute values [comer paper]. The main advantage of a B-Tree index is that it allows all objects for which the given attribute lies in a particular value range to be found with far fewer I/Os than a sequential search. The number of page I/Os needed to find the first matching object is determined only by the height of the B-Tree. Since this height increases only logarithmically with the number of objects in the tree, a tree-based search will scale much better than a sequential search as the number of objects increases.
Once the first matching object is found, the ordering of pages in the tree is used to find only other matching objects.
Objectivity/DB indices as of V3.8 use a B-tree with short OIDs,
which are 4 bytes rather than the usual 8 bytes. This implies
that they can only refer to objects within the same container.
Objectivity/DB V4 provides also database-wide and federation-wide
indices with the trade-off of increased storage overhead for the
object reference.
The storage consumption of a single index entry is given by the size of the attribute together with 4 byte overhead of the object reference. Thus, a single 8KB database page can store some 1000 object references ordered by a single 32 bit field.
Assuming that all B-Tree pages are completely filled, this allows the first matching object among up to 10^9 objects to be found by reading only 4 pages. The following figure shows the number of page I/Os needed to select with a given selectivity from 10^9 input objects, each of about 8kB in size. For the sequential search the amount of transferred data is independent of the selectivity, since each object has to be retrieved for the check against the selection condition. In the extreme case of selecting all objects (selectivity 100%), the B-Tree based selection needs a few more page I/Os, since the B-Tree pages have to be transferred in addition to the pages containing the objects. As the same figure shows, for smaller selectivities the B-Tree based search transfers much less data than the sequential search, since apart from a few key pages only matching objects are accessed.
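The page counts above can be reproduced with a simple idealised model, sketched below. It assumes completely filled index pages, one object per data page (objects of about page size), no caching, and matching index entries that are contiguous in the leaf level. The function names are hypothetical and the model is not taken from the Objectivity/DB implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Index entries per page: each entry holds the key plus an object reference.
std::uint64_t entriesPerPage(std::uint64_t pageBytes, std::uint64_t keyBytes,
                             std::uint64_t refBytes) {
    return pageBytes / (keyBytes + refBytes);
}

// Number of index levels needed to address `objects` entries when
// every index page is completely filled. Requires fanOut >= 2.
unsigned treeHeight(std::uint64_t objects, std::uint64_t fanOut) {
    assert(fanOut >= 2);
    unsigned levels = 1;
    std::uint64_t reachable = fanOut;
    while (reachable < objects) {
        reachable *= fanOut;
        ++levels;
    }
    return levels;
}

// A sequential scan reads every object page once.
std::uint64_t sequentialPageReads(std::uint64_t objects) {
    return objects;
}

// An indexed selection reads the upper index levels once, the leaf
// pages holding the matching entries, and one page per matching object.
std::uint64_t indexedPageReads(std::uint64_t objects, double selectivity,
                               std::uint64_t fanOut) {
    const std::uint64_t matches = static_cast<std::uint64_t>(
        std::llround(selectivity * static_cast<double>(objects)));
    std::uint64_t leafPages = (matches + fanOut - 1) / fanOut;
    if (leafPages == 0) leafPages = 1;  // at least one leaf is inspected
    const std::uint64_t upperLevels = treeHeight(objects, fanOut) - 1;
    return upperLevels + leafPages + matches;
}
```

With 8KB pages and 4-byte keys and references, entriesPerPage gives 1024 and treeHeight gives 3 levels for 10^9 objects. One further read for the page holding the object itself is consistent with the 4 page reads quoted in the text.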
One side effect of accessing all matching objects ordered by an
attribute becomes obvious when objects smaller than one page are
accessed. In this case, the sequential scan accesses each object
in the order in which it was written into the database and thereby accesses
all pages containing objects exactly once. In the B-Tree case the objects
are fetched from the index in the order defined by the attribute
used to index. Since the client side cache may not be large enough
to retain all object pages until a second matching object on the
same page is accessed, this may lead to multiple I/Os to the same
page. As a result the B-Tree access for relatively high selectivity
(larger than 20%) turns out to be less efficient than sequential
access.
Query performance is a key issue of our tests. We do not use an automatic query optimiser of the kind found in relational databases, which exploits the semantics of the model and the fixed sets of operators, storage structures, and implementation techniques. We optimise our queries using a procedural approach: we define "how" we want to search the objects and not "what", as in declarative approaches.
Before evaluating complex queries we have started doing tests of lower complexity in order to find out best search techniques and how to improve them.
We have considered three main techniques in Objectivity to make our first comparative analysis:
- sequential search: navigating through all objects using iterators and finding the ones which match the search condition. This implies a sequential search through all objects in the order in which they have been stored.
- hashed search: based on keyed objects. Objects are declared as keyed objects at the time of their creation, which implies the parallel creation of an entry in a hash table.
- indexed search: objects can be indexed at any time, and indexing can be
applied at the level of containers or derived classes. Objects
are accessed through keys ordered in a B-tree.
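The three access patterns can be illustrated with standard C++ containers. This is only an analogy under stated assumptions: std::unordered_multimap stands in for the hash table of keyed objects and std::multimap (an ordered tree, though not a B-tree) stands in for the index. None of the names below belong to the Objectivity/DB API.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <unordered_map>
#include <vector>

struct Item {
    int key;
};

using HashIndex = std::unordered_multimap<int, std::size_t>;
using TreeIndex = std::multimap<int, std::size_t>;

// Sequential search: visit every item in storage order.
std::vector<std::size_t> sequentialSearch(const std::vector<Item>& items, int key) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < items.size(); ++i)
        if (items[i].key == key) hits.push_back(i);
    return hits;
}

// Hashed search: the table is filled as the items are created.
HashIndex buildHashIndex(const std::vector<Item>& items) {
    HashIndex h;
    for (std::size_t i = 0; i < items.size(); ++i) h.emplace(items[i].key, i);
    return h;
}

std::vector<std::size_t> hashedSearch(const HashIndex& h, int key) {
    std::vector<std::size_t> hits;
    auto range = h.equal_range(key);
    for (auto it = range.first; it != range.second; ++it) hits.push_back(it->second);
    return hits;
}

// Indexed search: an ordered index can be built at any time and
// also supports range queries, returned in key order.
TreeIndex buildTreeIndex(const std::vector<Item>& items) {
    TreeIndex t;
    for (std::size_t i = 0; i < items.size(); ++i) t.emplace(items[i].key, i);
    return t;
}

std::vector<std::size_t> rangeSearch(const TreeIndex& t, int low, int high) {
    std::vector<std::size_t> hits;
    for (auto it = t.lower_bound(low); it != t.upper_bound(high); ++it)
        hits.push_back(it->second);
    return hits;
}
```

Note that rangeSearch returns positions in key order rather than storage order, which is the property that leads to repeated page accesses for small objects at high selectivity.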
The tests were run on a Pentium
Pro personal workstation (200 MHz) with 160 MB memory, Windows NT 4.0,
Visual C++ compiler version 4.2, and Objectivity 4.0. The Objectivity
settings relevant to the benchmark are a page size of 8KB and
a cache size of 7000 pages. We use a very simple object with one integer
attribute and one method; the object size on disk is 22 bytes
(14 bytes object overhead + 8 bytes for the integer
attribute, rounded up to 8 bytes).
Search optimisation consists of reducing the time needed to access
a certain number of objects. Tests are based on a comparative
analysis between sequential search and indexed search. By sequential
search we mean that all objects are accessed directly one after
the other with a next operation and not through an index
or hash table entry. We work with selectivities from 0% to 100%
(all objects fulfil the query). The query used is a simple selection
of objects which have an attribute equal to a certain value. We
do not consider hashed search because it requires objects to be created
as keyed objects, and the time needed for their creation was very
large and not suitable for rapidly growing databases. Indexes, on the
other hand, allow indexing of derived classes, and an index can
be created and deleted at any time.
As mentioned before, the tests are based on queries using indexes and
on sequential search. Figure 18-3 plots the elapsed
real time needed to scan a certain number of objects against the
total number of objects scanned. It shows that searching
directly in C++ is almost two times faster than using the
predicate language; for this reason, future tests scan using
C++ directly.
The selectivity percentage reflects different scenarios in which we can compare the two search mechanisms. Figure 18-4 shows that search times using indexes scale better than those using iterators when the selectivity is lower than 20%. For higher selectivities, sequential processing should be used, because pages are loaded only once when objects are read in the same order as they are stored. Objectivity implements a kind of B-tree which performs well for random accesses made by specifying a key, but not for a sequential search that uses the next operation to process all records in key order. Before carrying out this selectivity test, we ran preliminary tests to find the best cache size for indexing; for our small objects the optimal cache size was 7000 pages. An optimal cache size reduces the number of disk accesses if it can keep all or most of the pages of the B-tree in memory.
The container size increases rapidly when indexes are used, because the B-tree has to be maintained in extra pages. In our tests the index was stored in the same container as the objects, but in large, growing databases indexes should be stored in separate containers. At the end of this section we explain how the index overhead in a container can be calculated. Note that large container growth factors are useful when the database is expected to grow rapidly, but on the other hand not all pages allocated in the last growth step may be used. We have also noticed that an attempt to grow a container by a large factor when indexes are present produces a page overflow error, even if not all pages from the last growth step are used. We consider that, for large container growth factors, the number of pages should stop at the limit that allows the maximum use of the container (in case not all pages were used in the last growth step).
We have analysed where the time is spent when using indexes, in each
of the steps of cases A and B listed below and shown
graphically in Figure 18-5. The largest share of the time is spent when committing, because the indexes must be updated (commit1 of case B and commit2 of case A). When objects are created in an already indexed container, they are not indexed until the current transaction commits. This means that a search performed before committing will not find the objects created in that transaction in the index. Creation and search times are the same in both cases. The total times are not the same, because case A must create an index whilst case B already has one. If we modified indexed keys between the two commits, we would have to use the oocSensitive indexing mode so that the indexes are updated when the iterator is initialised with scan to start the search, and the search time would then increase. The
evaluation of the cost of creating indexes is based on the cost
of creating objects in a container that is or is not already indexed. In
cases A and B we indicate the mode in which indexes are updated,
which is specified when the transaction is opened. There are three modes in Objectivity: oocInsensitive,
oocSensitive and oocExplicitUpdate. With oocInsensitive
mode, indexes are updated at commit time. With oocSensitive
mode the updates, if any, of the index are done when scan
is called and before the index entries are returned, or at
commit time if no scans are done. oocExplicitUpdate
gives explicit control over changes to indexed objects
during a transaction. For our tests we have used the first two
modes. Figure 18-5 shows the timings for each of the
steps in case A and case B.
CASE A: Initially the container is completely empty
open transaction
create: create objects of type X clustered in a container
commit1: commit the transaction
open transaction with index mode oocInsensitive
index: create an index having as a key one attribute of objects of type X
search: initialize with scan the iterator that will navigate through the B-tree and search for the objects
commit2: commit the transaction updating the indexes
CASE B: Initially the container contains already indexed objects
open transaction with index mode oocInsensitive
index: check if an index already exists in the container
create: if index already exists then create objects of type X clustered in the container
commit1: commit transaction updating indexes
open transaction with index mode oocInsensitive
search: initialize with scan the iterator that will navigate through the B-tree and search for the objects
commit2: commit transaction
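The behaviour described for the oocInsensitive mode, in which newly created objects become visible through the index only after the commit, can be modelled with a toy class. The class and its methods are purely hypothetical and do not use or reproduce the Objectivity/DB API. They only mimic the visibility rule, and the cost of the index update is concentrated in commit() as in the measurements.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Toy model of a container whose index is updated only at commit time.
class IndexedContainer {
public:
    // Create an object in the current transaction; the index is not touched.
    void create(int key) {
        objects_.push_back(key);
        pending_.push_back(objects_.size() - 1);
    }

    // Commit: add index entries for everything created since the last commit.
    void commit() {
        for (std::size_t pos : pending_) index_.emplace(objects_[pos], pos);
        pending_.clear();
    }

    // Index-based search: sees only objects indexed by a previous commit.
    std::size_t countViaIndex(int key) const {
        return index_.count(key);
    }

    // Sequential scan: sees every object, committed or not.
    std::size_t countViaScan(int key) const {
        std::size_t n = 0;
        for (int k : objects_)
            if (k == key) ++n;
        return n;
    }

private:
    std::vector<int> objects_;               // stored objects (key only)
    std::vector<std::size_t> pending_;       // created but not yet indexed
    std::multimap<int, std::size_t> index_;  // ordered index: key -> position
};
```

Creating objects and calling countViaIndex before commit() returns zero for the new keys, while countViaScan already finds them. After commit() both agree.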
The previous tests were done using the best cache size. The best
cache size for indexing is one into which all or almost all of the
B-tree fits. A cache that is too small causes
continuous page faults and consequently continuous writes to disk,
which slow the system down considerably. The cache
size can be set inside the program with the ooinit function,
and can be chosen so that most of the B-tree fits in the cache.
For example, for an index with one 4-byte integer key and a given
number of objects, we can calculate how many pages are needed to
store the whole B-tree, knowing that 4 bytes for the OID must be
added to the key and that the page size is 8k. Supposing that
we fill the container up to its limit, we can obtain the number of
pages occupied by the index using the following equation,
(size of objects . number of objects) + size of index key . number of objects
_____________________________________________ ______________________________________________ = 64k pages
size page
size page
and determine which would be the best cache.
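The estimate above can be worked through numerically. The following sketch assumes, as in the text, an 8k (8192-byte) page, a container limit of 64k pages and 4 bytes of OID per index entry. The function and type names are ours, and B-tree interior nodes, page headers and partially filled pages are ignored, so the result is a lower bound rather than a measured figure.

```cpp
#include <cstdint>

// Assumptions taken from the text: 8k pages, 64k pages per container,
// 4 bytes of OID added to each index key.
const std::uint64_t kPageSize = 8192;
const std::uint64_t kMaxPages = 65536;
const std::uint64_t kOidBytes = 4;

struct IndexEstimate {
    std::uint64_t objects;     // objects that fit in a full container
    std::uint64_t dataPages;   // pages occupied by the objects
    std::uint64_t indexPages;  // pages occupied by the index entries
};

// Solve (objectBytes * N + entryBytes * N) / pageSize = 64k pages for N,
// then split the pages between data and index.
inline IndexEstimate estimateFullContainer(std::uint64_t objectBytes,
                                           std::uint64_t keyBytes) {
    const std::uint64_t entryBytes = keyBytes + kOidBytes;
    IndexEstimate e;
    e.objects = (kMaxPages * kPageSize) / (objectBytes + entryBytes);
    // Round up: a partially used page still occupies a whole page.
    e.dataPages = (e.objects * objectBytes + kPageSize - 1) / kPageSize;
    e.indexPages = (e.objects * entryBytes + kPageSize - 1) / kPageSize;
    return e;
}
```

For example, with a purely illustrative object size of 56 bytes, estimateFullContainer(56, 4) gives 8388608 objects, 57344 data pages and 8192 index pages, i.e. an index of 64MB that the cache would have to hold.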
In this chapter we have analysed different aspects of index usage. Considering that most queries based on Ntuples have a selectivity of at most about 20%, the index search technique seems to be the best suited. We have paid particular attention to two important aspects: the extra disk and memory space needed by indexes, and the long commit times that they consequently generate.
As a by-product of the work described in this report, we have
made a number of enhancement requests and bug reports to Objectivity.
All of these requests/reports have been logged, and we will monitor
progress on these issues through the regular RD45 workshops and
report back on key issues to the LCB.
The most important enhancement requests that we have identified
are:
Although not directly related to performance and scalability,
we also requested with high priority support for STL-compliant
collection classes.
We summarise the progress on these key issues below.
An Objectivity engineer attended the HPSS course that was held at CERN during October 1996. As a result of this course, a clean interface between Objectivity/DB and HPSS was identified, namely to modify the Objectivity/DB "AMS" page-server to use the HPSS API, rather than direct filesystem calls. Objectivity have identified the resources to make the required changes and SLAC have agreed to provide access to an HPSS system, plus appropriate local resources and expertise, so that the necessary code can be developed and tested. A meeting is scheduled for May 1997 between Objectivity, representatives of the HPSS consortium and interested HEP parties, including representatives from Caltech, CERN and SLAC to perform detailed planning for the development and delivery of the appropriate software.
Objectivity plan to support a paged-VArray in a future version of Objectivity/DB. Although it is unclear whether this will be supported in the next or a subsequent release, a beta version is expected during the coming year, with full product support during 1998 at the latest.
The current version of Objectivity/DB uses a simple mapping for
the object identifier (OID) to physical storage. This is based
upon four 16-bit quantities, making up the
64-bit OID. This division, which was made for performance
reasons, is not optimal and changes are planned which would permit
much larger databases, without extending the overall OID. In parallel,
support for multi-file databases will be added, again increasing
the maximum federation size.
Using containers with an average size of 1GB, a database would
be limited to 32TB and a federation to 2EB (1EB = 10^18 bytes)
- an order of magnitude greater than we require.
Again, these changes are expected in a beta version during the coming year.
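The current OID layout and the planned limits can be made concrete with a short sketch. The text only states that the 64-bit OID is made up of four 16-bit quantities; the field names (database, container, page, slot), their bit order, and the function names below are our assumptions. The limits are read with binary prefixes, which is also an assumption.

```cpp
#include <cstdint>

// Assumed field names and order; the source only says "four 16-bit quantities".
struct Oid {
    std::uint16_t database;
    std::uint16_t container;
    std::uint16_t page;
    std::uint16_t slot;
};

// Pack the four 16-bit fields into one 64-bit value.
inline std::uint64_t packOid(const Oid& o) {
    return (static_cast<std::uint64_t>(o.database) << 48) |
           (static_cast<std::uint64_t>(o.container) << 32) |
           (static_cast<std::uint64_t>(o.page) << 16) |
           static_cast<std::uint64_t>(o.slot);
}

// Recover the four fields from the packed value.
inline Oid unpackOid(std::uint64_t v) {
    Oid o;
    o.database = static_cast<std::uint16_t>(v >> 48);
    o.container = static_cast<std::uint16_t>(v >> 32);
    o.page = static_cast<std::uint16_t>(v >> 16);
    o.slot = static_cast<std::uint16_t>(v);
    return o;
}

// Planned limits quoted in the text, read with binary prefixes (assumption).
const std::uint64_t kGB = 1ULL << 30;
const std::uint64_t kTB = 1ULL << 40;
const std::uint64_t kEB = 1ULL << 60;

// 32TB per database with 1GB containers.
inline std::uint64_t containersPerDatabase() { return (32 * kTB) / kGB; }

// 2EB per federation with 32TB databases.
inline std::uint64_t databasesPerFederation() { return (2 * kEB) / (32 * kTB); }
```

Under these assumptions the quoted limits correspond to 32768 containers of 1GB per database and 65536 databases of 32TB per federation.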
Objectivity has recently reached an agreement with ObjectSpace whereby Objectivity will deliver both transient and persistent collection classes based on ObjectSpace's implementation of the STL. This work is targeted for the next release of Objectivity/DB, expected to enter beta-test in the coming months.
Although there are clearly additional outstanding requirements and bug reports, it is pleasing to note progress on the four key issues that have been identified. We will continue to work with the vendor to ensure that additional requirements are speedily addressed.
We have investigated and reported on the effectiveness of using
an ODBMS and MSS as the query and access method for physics analysis
and presented comparisons with existing solutions, namely PAW
and Ntuples, as well as the limits imposed by the hardware employed.
We have also measured the scalability of this combination according
to a number of important criteria. Finally, we have presented
the possible impact of the use of relatively new technologies,
such as very large memories and extensive use of data caching.
The performance of an ODBMS-based system is comparable to that
of today's systems, even without extensive optimisation. Similar
conclusions have been reached by a number of independent groups,
including ZEUS and KEK.
The scalability of the architecture of Objectivity/DB has been
measured and the positive results of this work give extra confidence
in an ODBMS-based solution.
It is our conclusion that a data management solution based upon
Objectivity/DB and HPSS continues to be the most viable alternative
for LHC era experiments and also pre-LHC experiments, such as
BaBar, who have chosen an object-oriented approach.
The current test-bed was limited to 100GB in size, some 3 orders of magnitude smaller than the disk farms expected for the production phase of LHC. Further tests, including both tertiary storage managed by a mass storage system and significantly larger disk pools, will need to be made over the coming years.
In this section we present a comparison between the performance
of Row Wise Ntuples (RWN) and Column Wise Ntuples (CWN) during
the first uncached access to the data. After the first access,
variables are cached in memory and the performance is the same.
This comparison has been performed by Olivier Couet, who is
currently responsible for the PAW maintenance.
The test has been done on two chains (one with RWN and another
with CWN) of 9 HBOOK files. In total this represented 81060 events,
each having 302 simple floating point variables. The data were
exactly the same.
A series of loops over the whole chain has been done with various
COMIS functions. Each function used more variables than the previous
one. The results are summarized in the following plot:
We can see that CWN are more efficient only if a small fraction
of the variables is used (< 10%). Above 10% of variables used,
the CWN access time increases whereas the RWN access time remains
stable. When few variables are used (less than 3%) the speed-up
provided by the CWN is important, but it decreases quickly when
the number of variables used increases.
CWN should be used when typed variables are mandatory (character, integer, etc.) or when array variables are needed.
CWN should NOT be used to speed up the data analysis (on the
first pass) when more than 10% of the total number of variables
are used in one query.
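The two storage layouts can be compared with a back-of-the-envelope model of the data volume touched on the first pass. The sketch below is ours and makes two assumptions that are not stated in the test description: each variable occupies 4 bytes, and a column-wise read touches only the columns actually used, with no per-column overhead.

```cpp
#include <cstdint>

// Assumption: each "simple floating point variable" occupies 4 bytes.
const std::uint64_t kBytesPerVariable = 4;

// Row-wise Ntuple: all variables of an event are stored together,
// so one pass over the chain reads every variable of every event.
inline std::uint64_t rowWiseBytes(std::uint64_t events,
                                  std::uint64_t totalVariables) {
    return events * totalVariables * kBytesPerVariable;
}

// Column-wise Ntuple: each variable is stored in its own column,
// so one pass reads only the columns used by the query.
inline std::uint64_t columnWiseBytes(std::uint64_t events,
                                     std::uint64_t usedVariables) {
    return events * usedVariables * kBytesPerVariable;
}
```

With 81060 events and 302 variables, a row-wise pass touches 97920480 bytes whatever the query, while a column-wise pass using 30 variables (about 10%) touches 9727200 bytes. Since the measurements show the CWN advantage disappearing at about this point, costs that this model leaves out, presumably the handling of many separate column buffers, must dominate there.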
On the following plot, we can see (last point of the curves) the time spent if all the Ntuple variables are used (note that the X axis is now on a log scale).
REAL FUNCTION apsel(zopt)
REAL
+run ,burst ,event ,tburst ,mdmult ,sidcmult,
+xvertex ,yvertex ,zvertex ,zvtxfit ,chi2vtx ,ntrvtx ,
+gainr1 ,gainr2 ,npadsr1 ,npadsr2 ,nhitsr1 ,nhitsr2 ,
+nhitss1 ,nhitss2 ,nringsr1,nringsr2,ntracks ,npairs ,
+nhitspc ,pairtype,pairmass,pairopen,pairrap ,paireta ,
+pairpt ,maskt1 ,dthsrt1 ,dphisrt1,dthr12t1,dphr12t1,
+typet1 ,pt1 ,ptt1 ,thsdt1 ,phisdt1 ,drs12t1 ,
+dphs12t1,etat1 ,maskt2 ,dthsrt2 ,dphisrt2,dthr12t2,
+dphr12t2,typet2 ,pt2 ,ptt2 ,thsdt2 ,phisdt2 ,
+drs12t2 ,dphs12t2,etat2 ,thr1t1 ,phir1t1 ,xr1t1 ,
+yr1t1 ,radr1t1 ,nhitr1t1,chi2r1t1,varr1t1 ,actr1t1 ,
+kolr1t1 ,npadr1t1,sumr1t1 ,dnh1r1t1,dnh2r1t1,dx1r1t1 ,
+dy1r1t1 ,dx2r1t1 ,dy2r1t1 ,dch2r1t1,dvarr1t1,dko1r1t1,
+dac1r1t1,dko2r1t1,dac2r1t1,fxr1t1 ,fyr1t1 ,frr1t1 ,
+ha1r1t1 ,ha2r1t1 ,thr1t2 ,phir1t2 ,xr1t2 ,yr1t2 ,
+radr1t2 ,nhitr1t2,chi2r1t2,varr1t2 ,actr1t2 ,kolr1t2 ,
+npadr1t2,sumr1t2 ,dnh1r1t2,dnh2r1t2,dx1r1t2 ,dy1r1t2 ,
+dx2r1t2 ,dy2r1t2 ,dch2r1t2,dvarr1t2,dkolr1t2,dactr1t2,
+dko2r1t2,dac2r1t2,fxr1t2 ,fyr1t2 ,frr1t2 ,ha1r1t2 ,
+ha2r1t2 ,thnr1t1 ,phinr1t1,xnr1t1 ,ynr1t1 ,radnr1t1,
+nhnr1t1 ,ch2nr1t1,varnr1t1,actnr1t1,kolnr1t1,npnr1t1 ,
+sumnr1t1,ha1nr1t1,ha2nr1t1,msknr1t1,dtsnr1t1,dpsnr1t1,
+thnr1t2 ,phinr1t2,xnr1t2 ,ynr1t2 ,radnr1t2,nhnr1t2 ,
+ch2nr1t2,varnr1t2,actnr1t2,kolnr1t2,npnr1t2 ,sumnr1t2,
+ha1rr1t2,ha2nr1t2,msknr1t2,dtsnr1t2,dpsnr1t2,thr2t1 ,
+phir2t1 ,xr2t1 ,yr2t1 ,radr2t1 ,nhitr2t1,chi2r2t1,
+varr2t1 ,actr2t1 ,kolr2t1 ,npadr2t1,sumr2t1 ,fxr2t1 ,
+fyr2t1 ,frr2t1 ,ha1r2t1 ,ha2r2t1 ,thr2t2 ,phir2t2 ,
+xr2t2 ,yr2t2 ,radr2t2 ,nhitr2t2,chi2r2t2,varr2t2 ,
+actr2t2 ,kolr2t2 ,npadr2t2,sumr2t2 ,fxr2t2 ,fyr2t2 ,
+frr2t2 ,ha1r2t2 ,ha2r2t2 ,thnr2t1 ,phinr2t1,xnr2t1 ,
+ynr2t1 ,radnr2t1,nhnr2t1 ,ch2nr2t1,varnr2t1,actnr2t1,
+kolnr2t1,npnr2t1 ,sumnr2t1,ha1nr2t1,ha2nr2t1,thnr2t2 ,
+phinr2t2,xnr2t2 ,ynr2t2 ,radnr2t2,nhnr2t2 ,ch2nr2t2,
+varnr2t2,actnr2t2,kolnr2t2,npnr2t2 ,sumnr2t2,ha1nr2t2,
+ha2nr2t2,rs1t1 ,thes1t1 ,phis1t1 ,amps1t1 ,nas1t1 ,
+anos1t1 ,tbs1t1 ,rs1t2 ,thes1t2 ,phis1t2 ,amps1t2 ,
+nas1t2 ,anos1t2 ,tbs1t2 ,rns1t1 ,thens1t1,phins1t1,
+ampns1t1,nans1t1 ,anons1t1,tbns1t1 ,rns1t2 ,thens1t2,
+phins1t2,ampns1t2,nans1t2 ,anons1t2,tbns1t2 ,rs2t1 ,
+thes2t1 ,phis2t1 ,amps2t1 ,nas2t1 ,anos2t1 ,tbs2t1 ,
+rs2t2 ,thes2t2 ,phis2t2 ,amps2t2 ,nas2t2 ,anos2t2 ,
+tbs2t2 ,rns2t1 ,thens2t1,phins2t1,ampns2t1,nans2t1 ,
+anons2t1,tbns2t1 ,rns2t2 ,thens2t2,phins2t2,ampns2t2,
+nans2t2 ,anons2t2,tbns2t2 ,dxpt1 ,dypt1 ,amppt1 ,
+xpt1 ,ypt1 ,zpt1 ,thept1 ,phipt1 ,dxpt2 ,
+dypt2 ,amppt2 ,xpt2 ,ypt2 ,zpt2 ,thept2 ,
+phipt2 ,dxnpt1 ,dynpt1 ,ampnpt1 ,xnpt1 ,ynpt1 ,
+znpt1 ,thenpt1 ,phinpt1 ,dxnpt2 ,dynpt2 ,ampnpt2 ,
+xnpt2 ,ynpt2 ,znpt2 ,thenpt2 ,phinpt2 ,indext1 ,
+indext2 ,checksum
*
LOGICAL CHAIN
CHARACTER*128 CFILE
*
COMMON /PAWCHN/ CHAIN, NCHEVT, ICHEVT
COMMON /PAWCHC/ CFILE
*
COMMON/PAWIDN/IDNEVT,OBS(13),
+run ,burst ,event ,tburst ,mdmult ,sidcmult,
+xvertex ,yvertex ,zvertex ,zvtxfit ,chi2vtx ,ntrvtx ,
+gainr1 ,gainr2 ,npadsr1 ,npadsr2 ,nhitsr1 ,nhitsr2 ,
+nhitss1 ,nhitss2 ,nringsr1,nringsr2,ntracks ,npairs ,
+nhitspc ,pairtype,pairmass,pairopen,pairrap ,paireta ,
+pairpt ,maskt1 ,dthsrt1 ,dphisrt1,dthr12t1,dphr12t1,
+typet1 ,pt1 ,ptt1 ,thsdt1 ,phisdt1 ,drs12t1 ,
+dphs12t1,etat1 ,maskt2 ,dthsrt2 ,dphisrt2,dthr12t2,
+dphr12t2,typet2 ,pt2 ,ptt2 ,thsdt2 ,phisdt2 ,
+drs12t2 ,dphs12t2,etat2 ,thr1t1 ,phir1t1 ,xr1t1 ,
+yr1t1 ,radr1t1 ,nhitr1t1,chi2r1t1,varr1t1 ,actr1t1 ,
+kolr1t1 ,npadr1t1,sumr1t1 ,dnh1r1t1,dnh2r1t1,dx1r1t1 ,
+dy1r1t1 ,dx2r1t1 ,dy2r1t1 ,dch2r1t1,dvarr1t1,dko1r1t1,
+dac1r1t1,dko2r1t1,dac2r1t1,fxr1t1 ,fyr1t1 ,frr1t1 ,
+ha1r1t1 ,ha2r1t1 ,thr1t2 ,phir1t2 ,xr1t2 ,yr1t2 ,
+radr1t2 ,nhitr1t2,chi2r1t2,varr1t2 ,actr1t2 ,kolr1t2 ,
+npadr1t2,sumr1t2 ,dnh1r1t2,dnh2r1t2,dx1r1t2 ,dy1r1t2 ,
+dx2r1t2 ,dy2r1t2 ,dch2r1t2,dvarr1t2,dkolr1t2,dactr1t2,
+dko2r1t2,dac2r1t2,fxr1t2 ,fyr1t2 ,frr1t2 ,ha1r1t2 ,
+ha2r1t2 ,thnr1t1 ,phinr1t1,xnr1t1 ,ynr1t1 ,radnr1t1,
+nhnr1t1 ,ch2nr1t1,varnr1t1,actnr1t1,kolnr1t1,npnr1t1 ,
+sumnr1t1,ha1nr1t1,ha2nr1t1,msknr1t1,dtsnr1t1,dpsnr1t1,
+thnr1t2 ,phinr1t2,xnr1t2 ,ynr1t2 ,radnr1t2,nhnr1t2 ,
+ch2nr1t2,varnr1t2,actnr1t2,kolnr1t2,npnr1t2 ,sumnr1t2,
+ha1rr1t2,ha2nr1t2,msknr1t2,dtsnr1t2,dpsnr1t2,thr2t1 ,
+phir2t1 ,xr2t1 ,yr2t1 ,radr2t1 ,nhitr2t1,chi2r2t1,
+varr2t1 ,actr2t1 ,kolr2t1 ,npadr2t1,sumr2t1 ,fxr2t1 ,
+fyr2t1 ,frr2t1 ,ha1r2t1 ,ha2r2t1 ,thr2t2 ,phir2t2 ,
+xr2t2 ,yr2t2 ,radr2t2 ,nhitr2t2,chi2r2t2,varr2t2 ,
+actr2t2 ,kolr2t2 ,npadr2t2,sumr2t2 ,fxr2t2 ,fyr2t2 ,
+frr2t2 ,ha1r2t2 ,ha2r2t2 ,thnr2t1 ,phinr2t1,xnr2t1 ,
+ynr2t1 ,radnr2t1,nhnr2t1 ,ch2nr2t1,varnr2t1,actnr2t1,
+kolnr2t1,npnr2t1 ,sumnr2t1,ha1nr2t1,ha2nr2t1,thnr2t2 ,
+phinr2t2,xnr2t2 ,ynr2t2 ,radnr2t2,nhnr2t2 ,ch2nr2t2,
+varnr2t2,actnr2t2,kolnr2t2,npnr2t2 ,sumnr2t2,ha1nr2t2,
+ha2nr2t2,rs1t1 ,thes1t1 ,phis1t1 ,amps1t1 ,nas1t1 ,
+anos1t1 ,tbs1t1 ,rs1t2 ,thes1t2 ,phis1t2 ,amps1t2 ,
+nas1t2 ,anos1t2 ,tbs1t2 ,rns1t1 ,thens1t1,phins1t1,
+ampns1t1,nans1t1 ,anons1t1,tbns1t1 ,rns1t2 ,thens1t2,
+phins1t2,ampns1t2,nans1t2 ,anons1t2,tbns1t2 ,rs2t1 ,
+thes2t1 ,phis2t1 ,amps2t1 ,nas2t1 ,anos2t1 ,tbs2t1 ,
+rs2t2 ,thes2t2 ,phis2t2 ,amps2t2 ,nas2t2 ,anos2t2 ,
+tbs2t2 ,rns2t1 ,thens2t1,phins2t1,ampns2t1,nans2t1 ,
+anons2t1,tbns2t1 ,rns2t2 ,thens2t2,phins2t2,ampns2t2,
+nans2t2 ,anons2t2,tbns2t2 ,dxpt1 ,dypt1 ,amppt1 ,
+xpt1 ,ypt1 ,zpt1 ,thept1 ,phipt1 ,dxpt2 ,
+dypt2 ,amppt2 ,xpt2 ,ypt2 ,zpt2 ,thept2 ,
+phipt2 ,dxnpt1 ,dynpt1 ,ampnpt1 ,xnpt1 ,ynpt1 ,
+znpt1 ,thenpt1 ,phinpt1 ,dxnpt2 ,dynpt2 ,ampnpt2 ,
+xnpt2 ,ynpt2 ,znpt2 ,thenpt2 ,phinpt2 ,indext1 ,
+indext2 ,checksum
*
logical like
integer zopt
twopi = 6.283185
* call hfill (1002,pairmass,0.,1.)
apsel=0.
*
* cuts:
*
c mult1 = 250
c mult2 = 350
c if (sidcmult.gt.mult1) return
c if (sidcmult.le.mult1 .or. mult.gt.mult2) return
c if (sidcmult.le.mult2) return
if (zopt.eq.1) then
dedxcut1 = 700.
dedxcut2 = 1250.
else
dedxcut1 = 500.
dedxcut2 = 700.
endif
* sharpen dedx-cut
fscal = 6./7.
dedxcut1 = fscal*dedxcut1
dedxcut2 = fscal*dedxcut2
hit2cut = 7
hit1cut = 6
if (nhitr2t1.lt.hit2cut) return
if (nhitr2t2.lt.hit2cut) return
if (nhitr1t1.lt.hit1cut) return
if (nhitr1t2.lt.hit1cut) return
like = (pairtype.ne.0)
*
* no v-track
*
*
* ptcut
*
ptcut = .200
if (ptt1.lt.ptcut .or. ptt2.lt.ptcut) return
*
* ptcut only
*
if (like) then
* call hfill (1002,pairmass,0.,1.)
if (amps1t1.lt.dedxcut1 .or. amps2t1.lt.dedxcut2) then
if (amps1t2.lt.dedxcut1 .or. amps2t2.lt.dedxcut2) then
c call hfill (3102,pairmass,0.,1.)
endif
endif
else
c call hfill(1001,pairmass,0.,1.)
if (amps1t1.lt.dedxcut1 .or. amps2t1.lt.dedxcut2) then
if (amps1t2.lt.dedxcut1 .or. amps2t2.lt.dedxcut2) then
c call hfill(3101,pairmass,0.,1.)
endif
endif
endif
c-- only clean open/close pairs with enough mass....
if (like) then
c call hfill(1102,pairmass,0.,1.)
else
c call hfill(1101,pairmass,0.,1.)
endif
c-- veto on correlated double dedx in both SiDCs
if (amps1t1.gt.dedxcut1 .and. amps2t1.gt.dedxcut2) return
if (amps1t2.gt.dedxcut1 .and. amps2t2.gt.dedxcut2) return
c-- harder pt-cut
ptcut = .200
if (ptt1.lt.ptcut .or. ptt2.lt.ptcut) return
if (like) then
c call hfill(1112,pairmass,0.,1.)
else
c call hfill(1111,pairmass,0.,1.)
endif
* the famous 'last cut':
c-- acceptance
thmin = 150.
thmax = 240.
if (thsdt1.lt.thmin .or. thsdt1.gt.thmax) return
if (thsdt2.lt.thmin .or. thsdt2.gt.thmax) return
c-- hard single dedx cut in both SiDC
c if ( de1t2.lt.400 .or. de1t2.gt.dedxcut1 ) return
c if ( de1t2.lt.400 .or. de1t2.gt.dedxcut1 ) return
c if ( de2t1.lt.200 .or. de2t1.gt.dedxcut2 ) return
c if ( de2t1.lt.200 .or. de2t1.gt.dedxcut2 ) return
c-- match r1-r2 (first track == best match) rather insensitive, optimum at 0.9
c scal=1.2
c if (abs(dthr1) .gt. scal*2.5) return
c-- match sidc1-sidc2 (theta and phi)
c-- cut if second sidc-2 hit is real and too close...
c-- match sidc-rich1 (theta and phi)
c if (abs(th1t1-sth11) .gt. 4.) return
c dphi = ph1t1-sph11
c if (dphi .gt. 3.) dphi = dphi - twopi
c if (abs(dphi) .gt. 0.02) return
c if (abs(th1t2-sth12) .gt. 5.) return
c dphi = ph1t2-sph12
c if (dphi .gt. 3.) dphi = dphi - twopi
c if (abs(dphi) .gt. 0.02) return
* let neighbouring rings veto
c rccut = 8.
c if (r1d.lt.rccut .or. r2d.lt.rccut) return
* sum amplitude cut
c psumcut = 6000.
c if (psum1.gt.psumcut.or.psum2.gt.psumcut) return
* upper momentum cut
c pcut = 7.
c if (pt1.gt.pcut .or. pt2.gt.pcut) return
c-- opening angle
if (pairopen .lt. 35) return
if (like) then
c call hfill(2112,pairmass,0.,1.)
else
c call hfill(2111,pairmass,0.,1.)
endif
c-- opening angle
if (pairopen .lt. 35) return
c-- mass
if (pairmass.lt.0.200) return
c-- upper mass cut for comparison
if (pairmass.gt.1.5) return
if (like) then
c call hfill(4212,pairmass,0.,1.)
else
c call hfill(4211,pairmass,0.,1.)
endif
apsel=1.
END
#include "CeresFullTag.h"
inline void hfill(int id, float val, float weight, float whatever)
{}
HepBoolean CeresFullTag::Match()
{
HepBoolean like;
const float twopi = 6.283185;
float dedxcut1,dedxcut2;
//
// cuts:
//
const int zopt=1; // dd
if (zopt == 1)
{
dedxcut1 = 700;
dedxcut2 = 1250;
}
else
{
dedxcut1 = 500;
dedxcut2 = 700;
}
// sharpen dedx-cut
const float fscal = 6./7;
dedxcut1 = fscal*dedxcut1;
dedxcut2 = fscal*dedxcut2;
const float hit2cut = 7;
const float hit1cut = 6;
if (nhitr2t1 < hit2cut) return HepFalse;
if (nhitr2t2 < hit2cut) return HepFalse;
if (nhitr1t1 < hit1cut) return HepFalse;
if (nhitr1t2 < hit1cut) return HepFalse;
like = (pairtype != 0) ? HepTrue : HepFalse;
//
// ptcut
//
float ptcut = .200;
if (ptt1 < ptcut || ptt2 < ptcut) return HepFalse;
//
// ptcut only
//
if (like)
{
hfill (1002,pairmass,0.,1.);
if ((amps1t1 < dedxcut1 || amps2t1<dedxcut2) && (amps1t2<dedxcut1 || amps2t2<dedxcut2))
hfill (3102,pairmass,0.,1.);
}
else
{
hfill(1001,pairmass,0.,1.);
if ((amps1t1<dedxcut1 || amps2t1<dedxcut2) && (amps1t2<dedxcut1 || amps2t2<dedxcut2))
hfill(3101,pairmass,0.,1.);
}
// only clean open/close pairs with enough mass....
if (like)
hfill(1102,pairmass,0.,1.);
else
hfill(1101,pairmass,0.,1.);
// veto on correlated double dedx in both SiDCs
if (amps1t1>dedxcut1 && amps2t1>dedxcut2) return HepFalse;
if (amps1t2>dedxcut1 && amps2t2>dedxcut2)
return HepFalse;
// harder pt-cut
ptcut = .200;
if (ptt1<ptcut || ptt2<ptcut) return HepFalse;
if (like)
hfill(1112,pairmass,0.,1.);
else
hfill(1111,pairmass,0.,1.);
// the famous 'last cut':
// acceptance
const float thmin = 150;
const float thmax = 240;
if (thsdt1<thmin || thsdt1>thmax) return HepFalse;
if (thsdt2<thmin || thsdt2>thmax) return HepFalse;
// opening angle
if (pairopen < 35) return HepFalse;
if (like)
hfill(2112,pairmass,0.,1.);
else
hfill(2111,pairmass,0.,1.);
// opening angle
if (pairopen < 35) return HepFalse;
// mass
if (pairmass<0.200) return HepFalse;
// upper mass cut for comparison
if (pairmass > 1.5) return HepFalse;
if (like)
hfill(4212,pairmass,0.,1.);
else
hfill(4211,pairmass,0.,1.);
return HepTrue;
}
ADSM - A storage management product from IBM
AFS - the Andrew (distributed) filesystem
CORBA - the Common Object Request Broker Architecture, from the OMG
CORE - Centrally Operated Risc Environment
DFS - the OSF/DCE distributed filesystem, based upon AFS
DMIG - the Data Management Interface Group
GB - 10^9 bytes
HPSS - High Performance Storage System - a high-end mass storage system developed by a consortium consisting of end-user sites and commercial companies
KB - 2^10 (1024) bytes - normally referred to as 10^3 bytes
IEEE - the Institute of Electrical and Electronics Engineers
MB - 10^6 bytes
MSS - a Mass Storage System
NFS - the Network Filesystem, developed by Sun
ODBMS - an Object Database Management System
ODMG - the Object Database Management Group, who develop standards for ODBMSes
OMG - the Object Management Group
OQL - the Object Query Language defined by the ODMG
ORB - an Object Request Broker
OSM - Open Storage Manager: a commercial MSS
PB - 10^15 bytes
SQL - Structured Query Language: the language used for issuing queries against databases
SSSWG - the Storage System Standards Working Group
TB - 10^12 bytes
VLDB - Very Large Database
VLM - Very Large Memory
VMLDB - Very Many Large Databases
XBSA - the draft X/Open Backup Services Application Program Interface