Fusion datasets can be fragmented across globally distributed experimental facilities, each with its own data storage and access systems. This complicates both the aggregation of multi-machine datasets and access to data from remote high-performance computing (HPC) facilities where analysis may need to run.
This talk presents recent efforts to investigate two open-source tools for distributing data between multiple experimental facilities and multiple HPC analysis clusters. The key advantages we set out to explore were (i) abstracting site-specific data access routines behind a unified interface, (ii) strategies for minimising latency between globally distributed locations, and (iii) scalability beyond manual, ad-hoc data copy operations.
The Pelican Platform[1] and CVMFS[2] are both data distribution services that provide consolidated access interfaces and use caches and data mirrors to reduce latency when accessing multiple, globally distributed data sources. Pelican is a data federation platform that aims to unify access to different kinds of storage APIs (S3, POSIX, HTTP) through adaptor services; it uses XRootD[3] for data transfer and includes features for user authentication and authorisation. CVMFS is a CERN-developed data distribution technology widely used in High Energy Physics. Initially focused on sharing open software, CVMFS uses the HTTP protocol for data transfer, has a convenient virtual filesystem interface, and implements an aggressive local caching strategy to improve performance for certain data access patterns. Overall, Pelican is more focused on large-scale data federation, while CVMFS has advantages for transparent access to fine-grained data in read-heavy applications.
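As a rough illustration of why local caching helps read-heavy access patterns, the following is a minimal sketch of a read-through cache in Python. It does not use the Pelican or CVMFS APIs and does not reflect how either tool is implemented; the class `ReadThroughCache`, the function `fake_remote_fetch`, and the key `shot/12345/signal` are all hypothetical names introduced for this example, and the "remote" fetch is a local stand-in rather than a network transfer.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class ReadThroughCache:
    """Hypothetical read-through cache: fetch remotely once, then serve locally."""

    def __init__(self, fetch: Callable[[str], bytes], cache_dir: Path) -> None:
        self.fetch = fetch            # assumed site-specific fetch routine
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.remote_reads = 0         # counts how often the remote was contacted

    def _local_path(self, key: str) -> Path:
        # Hash the key so that any key maps to a safe, flat file name.
        return self.cache_dir / hashlib.sha256(key.encode()).hexdigest()

    def read(self, key: str) -> bytes:
        local = self._local_path(key)
        if local.exists():
            # Cache hit: no remote round trip is needed.
            return local.read_bytes()
        # Cache miss: fetch from the remote source and keep a local copy.
        self.remote_reads += 1
        data = self.fetch(key)
        local.write_bytes(data)
        return data


def fake_remote_fetch(key: str) -> bytes:
    """Stand-in for a slow remote transfer (assumption: no real network I/O)."""
    return f"payload for {key}".encode()


def demo() -> int:
    """Read the same key twice and return the number of remote fetches made."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = ReadThroughCache(fake_remote_fetch, Path(tmp) / "cache")
        first = cache.read("shot/12345/signal")
        second = cache.read("shot/12345/signal")
        assert first == second
        return cache.remote_reads


if __name__ == "__main__":
    print("remote fetches:", demo())
```

In this sketch the analysis code only calls `read`, so the site-specific access routine is hidden behind one interface, and a repeated read of the same key is served from local disk instead of the remote source.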
We compare the performance of these tools against traditional methods such as manual SCP (Secure Copy Protocol) transfers, and highlight the additional features they provide that can particularly benefit typical fusion analysis workflows.
[1] Pelican Platform, https://
Repository
https://