cdffs

A file system interface that lets users work with CDF (Cognite Data Fusion) Files through fsspec-compatible Python packages.

fsspec provides an abstract file system interface for working with local and cloud storage. Based on the protocol name in the path (for example, s3 or abfs), fsspec routes each incoming request to the storage-specific implementation and sends the response back to the upstream package, which can then work with the requested data.
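The protocol-based dispatch described above can be pictured with a toy registry. This is only a sketch of the idea, not fsspec's internal code: the names `REGISTRY`, `register`, `split_protocol`, and `resolve` are hypothetical, and the factories return plain labels instead of real file system objects.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry mapping a protocol name to a file system factory.
REGISTRY: Dict[str, Callable[[], str]] = {}


def register(protocol: str, factory: Callable[[], str]) -> None:
    """Associate a protocol name (for example "s3") with an implementation."""
    REGISTRY[protocol] = factory


def split_protocol(url: str) -> Tuple[str, str]:
    """Split "proto://rest" into ("proto", "rest"); default to "file"."""
    if "://" in url:
        protocol, rest = url.split("://", 1)
        return protocol, rest
    return "file", url


def resolve(url: str) -> Tuple[str, str]:
    """Return (implementation label, path) for the protocol found in url."""
    protocol, path = split_protocol(url)
    if protocol not in REGISTRY:
        raise ValueError(f"no file system registered for protocol {protocol!r}")
    return REGISTRY[protocol](), path


# Stand-in factories; a real registry would hand back file system objects.
register("s3", lambda: "S3 implementation")
register("cdffs", lambda: "CDF Files implementation")
```

Calling `resolve("cdffs://pandas/test_data.csv")` in this sketch selects the entry registered under `cdffs` and passes the remainder of the path on to it.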

High-level flow from various packages.

[Figure: _images/cdffs_short_version.png]

Path translation

CDF Files has two layers: a metadata layer and a blob storage layer. Every read or write request issued from a Python package first hits the metadata layer; the file contents are then uploaded to or downloaded from the underlying blob storage.

Users can use the cdffs protocol and path in fsspec-compatible Python packages just as they would use abfs or s3. However, when data is read or written, the path is translated into several metadata fields in CDF Files.

Example:

import pandas as pd

# client_cnf is a ClientConfig used to authenticate requests to CDF (defined elsewhere).
df = pd.read_csv("cdffs://pandas/test_data.csv", storage_options={"connection_config": client_cnf})

  • cdffs - protocol name - fsspec uses this protocol name to decide which file system implementation or package to use. In this case, it is the cognite.cdffs package.

  • /pandas/ - directory prefix - cdffs treats the directory prefix as the root directory.

  • test_data.csv - name - cdffs uses the file name as both the external_id and the name. When multiple part files or chunk files are expected to be generated, it uses the file name together with all subsequent child directory and file names as the external_id.

A few more examples of how paths are translated to metadata fields are listed below.

Path translation examples.

File Path: cdffs://test_data/test.csv

Structure:

test_data
└── test.csv

0 directories, 1 file

FileMetadata.directory: /test_data
FileMetadata.external_id: test.csv
FileMetadata.name: test.csv

File Path: cdffs://zarr_tests/sample.zarr

Structure:

zarr_tests
└── sample.zarr
    ├── .zattrs
    ├── .zgroup
    ├── .zmetadata
    ├── x
    │   ├── .zarray
    │   ├── .zattrs
    │   └── 0
    └── y
        ├── .zarray
        ├── .zattrs
        └── 0

3 directories, 9 files

FileMetadata.directory: /zarr_tests (for every file)

FileMetadata.external_id    FileMetadata.name
sample.zarr/.zattrs         .zattrs
sample.zarr/.zgroup         .zgroup
sample.zarr/.zmetadata      .zmetadata
sample.zarr/x/.zarray       .zarray
sample.zarr/x/.zattrs       .zattrs
sample.zarr/x/0             0
sample.zarr/y/.zarray       .zarray
sample.zarr/y/.zattrs       .zattrs
sample.zarr/y/0             0
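
The translation shown in the examples above can be approximated with a few lines of string handling. This is a minimal sketch, not the cdffs implementation: the `TranslatedPath` type and the `translate` function are hypothetical, and the sketch assumes that the first path component is always the directory prefix and that everything after it forms the external_id.

```python
from typing import NamedTuple


class TranslatedPath(NamedTuple):
    """Hypothetical container for the three metadata fields."""

    directory: str
    external_id: str
    name: str


def translate(path: str) -> TranslatedPath:
    """Split a cdffs path into directory, external_id, and name.

    Assumption: the first path component is the directory prefix and the
    remaining components form the external_id, as in the examples above.
    """
    protocol, sep, rest = path.partition("://")
    if not sep or protocol != "cdffs":
        raise ValueError(f"not a cdffs path: {path!r}")
    parts = [p for p in rest.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected a directory prefix and a file name: {path!r}")
    directory = "/" + parts[0]           # e.g. "/zarr_tests"
    external_id = "/".join(parts[1:])    # e.g. "sample.zarr/x/.zarray"
    name = parts[-1]                     # e.g. ".zarray"
    return TranslatedPath(directory, external_id, name)
```

Paths with deeper directory prefixes are outside what this sketch handles; how cdffs decides where the directory ends in that case is not covered by the examples here.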

Caching

Three different caching techniques are used to improve overall performance.

  • Path caching

    The metadata layer of CDF Files is eventually consistent: a read-after-write request might yield unexpected results for a very short time. To prevent such issues where read-after-write consistency is essential (especially when working with zarr files through the xarray/zarr packages), all external_ids (constructed from file paths) used to write data are cached, and the cached external_ids are included when file read or list requests are issued.

  • Directory list caching

    Upstream packages might list the same directory multiple times within short intervals, and each request is translated to a call to the list endpoint in cdffs. To avoid hitting the list endpoint repeatedly for the same directory or external_id prefix, all file paths are cached with an expiry time (60 seconds by default), and list requests are served from the cache until the cached results expire.

  • File contents caching

    The file contents are read once and cached using the allbytes caching from fsspec to improve read performance and to work around the limitations on performing range queries. Users cannot choose their preferred cache_type when working with CDF Files.
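
To make the first two techniques more concrete, here is a toy model that combines a path cache with an expiring directory list cache. It is a sketch under stated assumptions, not the cdffs code: the `ListingCache` class, its method names, and the injected `fetch` and `clock` callables are all hypothetical, and `fetch` merely stands in for the remote list endpoint.

```python
import time
from typing import Callable, Dict, List, Set, Tuple


class ListingCache:
    """Toy model of the path cache and the directory list cache."""

    def __init__(
        self,
        fetch: Callable[[str], List[str]],
        expiry_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch              # stand-in for the remote list endpoint
        self._expiry = expiry_seconds    # lifetime of a cached listing
        self._clock = clock              # injectable so expiry can be tested
        self._listings: Dict[str, Tuple[float, List[str]]] = {}
        self._written: Set[str] = set()  # paths written through this instance

    def record_write(self, path: str) -> None:
        """Remember a written path so later listings include it."""
        self._written.add(path)

    def ls(self, prefix: str) -> List[str]:
        """List paths under prefix, reusing a cached listing until it expires."""
        now = self._clock()
        cached = self._listings.get(prefix)
        if cached is None or now - cached[0] >= self._expiry:
            cached = (now, self._fetch(prefix))
            self._listings[prefix] = cached
        # Merge in written paths that the remote listing may not show yet.
        recent = {p for p in self._written if p.startswith(prefix)}
        return sorted(set(cached[1]) | recent)
```

In this model, repeated `ls` calls for the same prefix reach `fetch` only once per expiry window, while paths passed to `record_write` appear in listings immediately even if the remote side has not caught up.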

Additional Configurations

Supported configurations when working with cdffs.

cdffs-specific configurations

Parameter name - Mandatory/Optional - Description

  • connection_config - Mandatory - Client config used to authenticate requests to CDF. Refer: ClientConfig

  • file_metadata - Optional but highly recommended - Metadata information to add to the files. Refer: FileMetadata

  • cdf_list_expiry_time - Optional - Directory list cache expiry time. Default is 60 seconds.

  • max_download_retries - Optional - Maximum number of download retries before giving up. Default is 5.

  • download_retries - Optional - Flag to enable or disable download retries. Default is True.

  • upload_strategy - Optional - Selects the file upload strategy. Possible values: [azure, google, inmemory]. Default is inmemory.
      azure: uses multipart upload, expecting CDF in Azure.
      google: uses multipart upload, expecting CDF in Google.
      inmemory: the default strategy; the entire file is cached and uploaded to CDF in a single call.
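
As a rough illustration of how these parameters could be assembled, the sketch below merges user overrides into the documented defaults and validates upload_strategy. The `build_storage_options` helper and the `DEFAULTS` mapping are hypothetical and not part of cdffs. It also assumes that every parameter is passed through storage_options alongside connection_config; the example earlier in this document only shows connection_config being passed that way.

```python
from typing import Any, Dict

# Defaults as documented for the parameters listed above.
DEFAULTS: Dict[str, Any] = {
    "cdf_list_expiry_time": 60,      # seconds
    "max_download_retries": 5,
    "download_retries": True,
    "upload_strategy": "inmemory",
}
UPLOAD_STRATEGIES = {"azure", "google", "inmemory"}


def build_storage_options(connection_config: Any, **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides into the documented defaults."""
    if connection_config is None:
        raise ValueError("connection_config is mandatory")
    # file_metadata has no default, so it is only allowed as an override.
    unknown = set(overrides) - set(DEFAULTS) - {"file_metadata"}
    if unknown:
        raise ValueError(f"unsupported options: {sorted(unknown)}")
    options = {"connection_config": connection_config, **DEFAULTS, **overrides}
    if options["upload_strategy"] not in UPLOAD_STRATEGIES:
        raise ValueError(f"unknown upload_strategy: {options['upload_strategy']!r}")
    return options
```

Under that assumption, the resulting dictionary would be handed to the upstream package in place of the literal storage_options mapping, for example with `upload_strategy="azure"` when CDF runs in Azure.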