Skip to the content.

Google Summer of Code 2021: Project Report

CernVM-FS preload capability

Contributor: Rohit Topi
Organization: CERN-HSF
Member organization: CERN
Mentors: Jakob Blomer, Enrico Bocchi


Background

CernVM-FS (CVMFS) is a service for fast and reliable software distribution on a global scale. It is capable of delivering the scientific software used by the High Energy Physics (HEP) community to tens of thousands of client nodes worldwide. Data is organized in repositories that are mounted as a POSIX read-only file system by the clients. Files and metadata are downloaded on-demand by means of HTTP requests and take advantage of several layers of caches.

Project description

The current per-file granularity of distribution and caching leads to poor performance especially in the case of cold caches. In some cases, however, it is known upfront that all files of a certain set are required if any of the files is accessed. The loading of interdependent files and libraries required for certain applications can be improved. The ability to load all required files together would improve performance in the case of cold caches.
The goal of this project is to introduce the concept of bundles and related support, which would improve the startup performance of applications.

Project development

Summary

Bundles are the new content type introduced in the repository. A bundle has the binary format of the object packs used by the gateway as a form to concatenate multiple objects in a single BLOB. A bundle specification file called .cvmfsbundles describes the content to be included in each bundle. This file will thus, allow the repository owners to create file bundles.
To keep information on the bundles and the files belonging to bundles, the catalog table is extended so that an optional bundle id can be stored for each regular file. A new file catalog table stores the information about individual bundles.
When the client has a cache miss for a file that is part of a bundle, it can then decide based on the cache contents whether to download the individual requested file or the entire bundle, into the local cache.
The development towards the project was done through Github pull requests. The corresponding CERN JIRA issue is CVM-1886. The overall work was divided into four PRs.

Development

PR #2736 | Add support for .cvfmsbundles file | [Status: merged]

A bundle specification file, .cvmfsbundles, allows the creation of such bundles on the server-side. This file is created and maintained by the repository owners. During publication, the bundles are created and stored based on the contents of this file. Implemented functionality to:

PR #2741 | Introduce bundles content type in the repository | [Status: under review]

To support the creation of bundles:

PR #2742 | Extend catalog table to support bundle files | [Status: under review]

To support keeping track of bundles and the files belonging to the bundles:

PR #2743 | Add client-side support for bundle files | [Status: under review]

On the client-side, initial access to a file initiates the fetch and stores it in the local cache. Access to the file with a valid bundleid will lead to the prefetch of the corresponding bundle if it doesn’t already exist in the local cache. The changes made to support this were:

Acknowledgment

I am extremely grateful to my mentors Jakob Blomer and Enrico Bocchi for guiding me and helping me out at various stages throughout the project. Sincere gratitude to Google and CERN-HSF for offering me this great opportunity.