CERN/RD45/99/03

4 February, 1999

Revision 0

VLDB Support in Objectivity/DB

Requirements for Multi-File Database Support in Objectivity/DB

  1. Introduction

Experiments at CERN and other High Energy Physics (HEP) laboratories are beginning to use Objectivity/DB for production purposes. The volumes of data that these experiments plan to store in Objectivity/DB range from a few hundred TB per year to a few PB per year. In the short term, Objectivity/DB needs to support federations of several PB; in the longer term, federations of some 100PB should be supported.

Although the current architecture supports federations of the above size on paper, it requires the use of very large files, which are ruled out for practical reasons, as explained below. We therefore require changes to the architecture to permit large federations without the introduction of arbitrary and/or impractical constraints. As explained in more detail below, and as has been discussed with Objectivity on a number of occasions, a production solution to this problem is required by the end of 1998 at the latest. Given that very little time remains for the design and implementation of a solution, we are prepared to consider a stop-gap solution, provided that it can be agreed by the experiments and that a smooth migration path to a more adequate solution is established.

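Changes to permit very large federations could be achieved in a number of ways, including:

  1. Changing the OID mapping, whilst keeping a 64 bit OID,
  2. Increasing the size of the OID,
  3. Permitting databases to span multiple files.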

It has previously been argued that changes to the OID mapping, whilst keeping a 64 bit OID, would not offer sufficient flexibility to meet our requirements. Increasing the size of the OID, whilst still a potential solution, would have a number of drawbacks, such as increasing the overhead for persistent objects.

In the remainder of this note we discuss only the third option.

  2. Filesize Limitations

Although 64-bit filesystems are now widely available, thus removing the previous limit of 2GB per file(system), there are a number of practical reasons for limiting the maximum filesize. These include:

In general, we prefer to avoid spreading files across multiple tape (or other) volumes. With the devices in use at CERN today, this imposes a limit of 10GB (for IBM Magstar drives) or some 50GB (STK Redwood drives). Other storage devices, such as Quantum DLT, impose similar restrictions. Although somewhat specialist devices, such as the Ampex drive, will offer much larger tape volumes, there is little indication that the tape capacity for more mainstream products will increase significantly in the foreseeable future.

A similar restriction comes from the time taken to move a file. As a rule of thumb, this should take no longer than 10^3 seconds, and preferably 10^2 (or less!). Assuming a transfer speed of 10MB/second, which is at the upper end of what is practical today, this limits files to 10GB. Such a filesize is compatible with, if somewhat more restrictive than, that imposed by the tape volume size.
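The rule of thumb above is simple arithmetic, sketched here for illustration. The function name and the use of decimal units (1MB = 10^6 bytes, 1GB = 10^9 bytes) are assumptions made for this example and do not come from any Objectivity/DB interface.

```python
# Illustrative only: largest file that can be moved within a time budget.
# Decimal units are assumed (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).
MB = 10**6
GB = 10**9


def max_file_size_bytes(transfer_rate_mb_per_s, max_transfer_seconds):
    """Return the largest file size (bytes) transferable in the given time."""
    return transfer_rate_mb_per_s * MB * max_transfer_seconds


if __name__ == "__main__":
    # 10 MB/s for 10^3 seconds (upper bound) and 10^2 seconds (preferred).
    for seconds in (10**3, 10**2):
        size = max_file_size_bytes(10, seconds)
        print(f"{seconds} s at 10 MB/s -> {size // GB} GB")
```

With the figures quoted in the text, the 10^3 second bound corresponds to 10GB files, and the preferred 10^2 second bound to 1GB files.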

  3. Federation Size

Using the above-mentioned 10GB files, an Objectivity/DB federation can store a total of about 650TB. In practice, this is unlikely to be achieved, since it would imply using all possible database IDs and assumes that all files are of the maximum size. Either would be far from optimal, and would cause significant problems if the federation were to be extended at a later date (e.g. consider the clustering implications). A more realistic scenario, e.g. using files of 1-5GB and not all database IDs, would lead to a maximum practical federation of some 50-100TB. This is insufficient to meet the requirements of experiments starting production data taking in 1999, and falls far short of what is required for the LHC experiments.
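The capacity figures can be checked with a short calculation. The sketch below assumes that the database field of the 64 bit OID is 16 bits wide, giving 65536 possible database IDs; this is consistent with the ~650TB figure quoted for 10GB files but is stated here as an assumption. The "realistic" inputs (2GB files, half of the IDs in use) are arbitrary values chosen for illustration within the ranges mentioned in the text.

```python
# Illustrative only: federation capacity with one file per database.
# ASSUMPTION: 16-bit database ID field, i.e. 2**16 possible databases.
DB_ID_BITS = 16
MAX_DATABASES = 2**DB_ID_BITS

GB = 10**9
TB = 10**12


def federation_capacity_bytes(file_size_bytes, n_databases=MAX_DATABASES):
    """Total capacity when every database is a single file of the given size."""
    return file_size_bytes * n_databases


if __name__ == "__main__":
    # Theoretical maximum: every database ID used, every file at 10 GB.
    best_case = federation_capacity_bytes(10 * GB)
    print(f"all IDs, 10 GB files: {best_case / TB:.0f} TB")

    # Hypothetical realistic case: 2 GB files, half of the IDs in use.
    realistic = federation_capacity_bytes(2 * GB, MAX_DATABASES // 2)
    print(f"half the IDs, 2 GB files: {realistic / TB:.0f} TB")
```

Under these assumptions the theoretical maximum is roughly 650TB, while the illustrative realistic case falls within the 50-100TB range given above.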

  4. Expected Data Volumes

The following table shows the expected data volumes for a number of experiments starting in 1999. Each of these experiments is expected to run for approximately 10 years, resulting in an overall requirement for a capacity of some 5PB/federation. All of these experiments will start taking data around April - May 1999.

In the longer term, the 4 LHC experiments (ATLAS, CMS, ALICE and LHCb) are each expected to take roughly 1PB of data per year, and run for some 15-20 years.

    Experiment   Laboratory   Volume/year   1999           2000
    BaBar        SLAC         ~300TB
    COMPASS      CERN         ~360TB        See footnote   360TB
    STAR         Brookhaven   ~50-300TB     ~1TB           ~3TB

  5. Possible Implementations

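Possible implementations that have been discussed include:

  1. Permitting containers to map to files,
  2. Splitting large database files into multiple smaller files.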

Clearly, the first option is much more attractive and flexible. This could allow new containers either to be created in the default file for that database, or in a new file, in much the same way that new databases are currently created, or perhaps even in the same file as an existing container. For data import/export and for efficient caching from the mass storage system, such a scheme is expected to have significant advantages – only the needed files (containers) would be cached or exported, resulting in greater efficiency. In addition, assuming that the necessary changes were part of the standard Objectivity/DB product, the corresponding files could be read on any system with the standard product installed.

Splitting large files, on the other hand, has the advantage of being quick to implement: it may even be possible to implement it without any changes to Objectivity/DB itself, e.g. by making the changes in the customer-supplied layer below oofs. Splitting files at a fixed maximum size, e.g. 2GB, could be implemented fairly trivially. However, container export would not be possible: the entire DB would have to be exported. This may not be an important restriction, e.g. if multi-file DBs are primarily used for the raw data. For the interface to the MSS, one would have the option of:

  1. Recalling all fragments of a file on first open,
  2. Recalling fragments on demand.

The latter option would require the possibility to send a "retry in n seconds" message to the client on read, as well as on open.
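To illustrate why splitting at a fixed maximum size is considered fairly trivial, the sketch below shows the offset arithmetic that a layer below the database could use. It is not the oofs or AMS interface: the function names, the fragment naming scheme and the choice of 2GB = 2 * 2^30 bytes are all hypothetical and chosen for this example only. The second function shows which fragments a given read would touch, which is the information needed if fragments are recalled on demand rather than all at first open.

```python
# Illustrative only: mapping a logical database file onto fixed-size fragments.
# ASSUMPTION: fragments of 2 GB, taken here as 2 * 2**30 bytes.
FRAGMENT_SIZE = 2 * 2**30


def locate(offset, fragment_size=FRAGMENT_SIZE):
    """Map a logical byte offset to (fragment index, offset within fragment)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, fragment_size)


def fragments_for_read(offset, length, fragment_size=FRAGMENT_SIZE):
    """Return the range of fragment indices touched by a read request."""
    if length <= 0:
        return range(0)
    first = offset // fragment_size
    last = (offset + length - 1) // fragment_size
    return range(first, last + 1)


def fragment_name(db_file, index):
    """Hypothetical naming scheme for the physical fragment files."""
    return f"{db_file}.{index:04d}"


if __name__ == "__main__":
    # A read that straddles the boundary between fragments 0 and 1.
    needed = fragments_for_read(FRAGMENT_SIZE - 1, 2)
    print([fragment_name("rawdata.DB", i) for i in needed])
```

With recall on demand, only the fragments returned by such a lookup would need to be staged from the MSS before the read could complete.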

In addition to the simple option of splitting at a fixed size, it could be possible, e.g. using the opaque information interface, to ensure that containers were always created in new files, although the problem of extending containers would also have to be addressed.

  6. Limitations

Drawbacks of the above scenario include the likely requirement to use some form of AMS, even to access local data, which is not currently possible. Were the changes to be implemented at the customer level, this would require the modified AMS to be distributed to all required sites. These issues are certainly manageable.

More serious, however, is the fact that simply splitting files does not really satisfy the requirements. This is particularly dramatic if the long-term requirements are considered, e.g. for federations up to 100PB or so, but is also true for the short term. For example, unless only those files that are required can be conveniently identified and cached, the time taken to stage all files belonging to an individual database will greatly exceed 10^3 seconds. Assuming an effective data transfer rate of 10MB/second, the time taken to read a 1TB database (the size needed to reach federations of 65PB) would be around 10^5 seconds, approximately one whole day, and would require approximately 100 tape mounts!
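The staging estimate can be reproduced as follows. The sketch assumes decimal units and, for the tape mount count, that each 10GB fragment resides on its own tape volume; that assumption is made here to match the figure of approximately 100 mounts and is not stated explicitly in the text. The function names are illustrative.

```python
# Illustrative only: time and tape mounts needed to stage a whole database.
# Decimal units are assumed (1 MB = 10**6, 1 GB = 10**9, 1 TB = 10**12 bytes).
MB = 10**6
GB = 10**9
TB = 10**12


def staging_time_seconds(db_bytes, rate_bytes_per_s):
    """Time to read the whole database sequentially at the given rate."""
    return db_bytes / rate_bytes_per_s


def tape_mounts(db_bytes, fragment_bytes):
    """Number of mounts, ASSUMING one fragment per tape volume (ceiling division)."""
    return -(-db_bytes // fragment_bytes)


if __name__ == "__main__":
    seconds = staging_time_seconds(1 * TB, 10 * MB)
    print(f"staging time: {seconds:.0f} s ({seconds / 3600:.1f} hours)")
    print(f"tape mounts:  {tape_mounts(1 * TB, 10 * GB)}")
```

At 10MB/second a 1TB database takes 10^5 seconds to stage, a little over one day, which is two orders of magnitude beyond the 10^3 second rule of thumb.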

  7. Conclusions

Splitting databases into multiple files at the level of the pluggable filesystem is probably a viable stop-gap solution, although the full implications have to be studied further. To meet the requirement for federations of (say) 65PB in size, logical databases of 1TB in size would be needed. In other words, each database would have to be split into something like 100 – 1000 fragments. Although this is at least conceptually feasible, it is clear that such an approach cannot be considered a viable long-term solution, as it is unable to satisfy our requirements.

Given the current architecture of Objectivity/DB, the strongly preferred solution is still that of permitting containers to map to files. Should it be impossible to deliver such a solution on the time-scale required, as appears likely, a fall-back solution, based upon splitting of database files in the customer-supplied layer below oofs, could be considered.

In parallel to implementing or providing the necessary hooks for the stop-gap solution, we request that Objectivity proceed with a design permitting container-file mapping, such that an implementation can be provided no later than the V6 release, presumed to be officially released on all platforms no later than end-1999. Such an implementation should conform to our overall requirement that changes must not invalidate existing federations. Indeed, existing federations should be usable without the need to run any upgrade application. The exception would be that of "split" databases, for which a tool would be needed to convert such databases to e.g. one container/file, preserving existing OIDs and hence e.g. collections. Such a facility is considered by the experiments to be at least highly desirable, if not mandatory.