# Create Kerchunk catalog for CMEMS 3D


## Set up credentials on GFTS buckets

Credentials are stored in the `gfts` profile in your `~/.aws/credentials`. This file is generated automatically on GFTS JupyterHub.

You can view them with `cat ~/.aws/credentials`.

- access keys are in profile named `gfts`
- endpoint_url is `https://s3.gra.perf.cloud.ovh.net`
- region_name is `gra`

You should have read and write permissions on the bucket, but not delete permission.
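As a quick sanity check before running the rest of the notebook, the sketch below verifies that a credentials file contains a `gfts` profile with both access keys. It assumes the standard AWS INI layout with `aws_access_key_id` and `aws_secret_access_key` entries; the helper name `profile_has_keys` is made up for this example.

```python
import configparser
from pathlib import Path


def profile_has_keys(credentials_path, profile="gfts"):
    """Return True if `profile` defines both AWS-style access keys."""
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file is skipped silently
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


# Check the default location used on GFTS JupyterHub.
print(profile_has_keys(Path.home() / ".aws" / "credentials"))
```

The function only checks that the keys are present. It does not print them and does not test whether the bucket accepts them.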

## Rclone to copy/sync data from Copernicus Marine to GFTS bucket

### Set up Rclone config file

`Rclone` is configured automatically on GFTS JupyterHub. You can view the Rclone config file with `cat ~/.config/rclone/rclone.conf`.

```
[cmarine]
type = s3
provider = Other
endpoint = https://s3.waw3-1.cloudferro.com
acl = public-read

[gfts]
type = s3
provider = Other
env_auth = true
region = gra
endpoint = https://s3.gra.perf.cloud.ovh.net
```

## Copy CMEMS files with rclone

We copy the CMEMS files we need into our bucket because they are not yet available on DestinE DEDL.

- 3D
```
./rclone copy cmarine:mdl-native-10/native/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-3D_PT1H-m_202211 gfts:gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-3D_PT1H-m_202211/
./rclone copy cmarine:mdl-native-13/native/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309 gfts:gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/
```
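The two commands above differ only in the source bucket, product and dataset names, so they can be generated rather than typed. The sketch below builds the argument list for the IBI copy and adds rclone's `--dry-run` flag, which reports what would be transferred without copying anything. The helper `rclone_copy_command` is hypothetical, and the sketch assumes `rclone` is on the `PATH` rather than invoked as `./rclone`.

```python
import shlex


def rclone_copy_command(src, dst, dry_run=True):
    """Build an `rclone copy` argument list; --dry-run only lists planned transfers."""
    cmd = ["rclone", "copy"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [src, dst]
    return cmd


product = "IBI_ANALYSISFORECAST_PHY_005_001"
dataset = "cmems_mod_ibi_phy_anfc_0.027deg-3D_PT1H-m_202211"
cmd = rclone_copy_command(
    f"cmarine:mdl-native-10/native/{product}/{dataset}",
    f"gfts:gfts-reference-data/{product}/{dataset}/",
)
print(shlex.join(cmd))  # paste into a terminal, or pass `cmd` to subprocess.run
```

Checking the printed command first makes a mistyped remote name easy to spot before any data is moved.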

## Set up credentials

The notebook uses the same `gfts` profile in `~/.aws/credentials` described in the first section: endpoint_url `https://s3.gra.perf.cloud.ovh.net`, region_name `gra`, with read and write (but not delete) permissions on the bucket. Uncomment the next cell to print the file.

In [1]:
# !cat ~/.aws/credentials

In [1]:
import s3fs
import xarray as xr
from pathlib import Path

import ujson
from kerchunk.combine import MultiZarrToZarr
from kerchunk import df
from kerchunk.hdf import SingleHdf5ToZarr
import fsspec

In [2]:
s3 = s3fs.S3FileSystem(
    anon=False,
    profile="gfts",
    client_kwargs={
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    },
)

## Create catalog for NWSHELF ANALYSIS FORECAST data

In [3]:
bucket_name = "gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309"
s3.ls(bucket_name)

['gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2024']

In [4]:
s3path = "s3://gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/*/*/*.nc"

In [5]:
remote_files = s3.glob(s3path)
remote_files

['gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023/07/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230702_20230702_R20230703_HC01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023/07/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230703_20230703_R20230704_HC01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023/07/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230704_20230704_R20230705_HC01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023/07/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230705_20230705_R20230706_HC01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023/07/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230706_20230706_R20230707_HC01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/c

In [6]:
fs = fsspec.filesystem(
    "s3",
    anon=False,
    profile="gfts",
    client_kwargs={
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    },
)

In [7]:
fs_files = fs.glob(s3path)

In [8]:
fs_files

['gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023/07/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230702_20230702_R20230703_HC01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023/07/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230703_20230703_R20230704_HC01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023/07/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230704_20230704_R20230705_HC01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023/07/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230705_20230705_R20230706_HC01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/2023/07/CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230706_20230706_R20230707_HC01.nc',
 'gfts-reference-data/NWSHELF_ANALYSISFORECAST_PHY_004_013/c
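Because the combined catalog concatenates the files along `time`, it can help to confirm that the listing has no missing days before generating references. The sketch below reads the first 8-digit block of each file name as the field date (an assumption based on the names in the listing above) and reports any gaps. The helper names `field_date` and `missing_days` are made up for this example.

```python
import re
from datetime import date, timedelta

# Matches the "_YYYYMMDD_YYYYMMDD_RYYYYMMDD_" part of a CMEMS file name.
DATE_RE = re.compile(r"_(\d{8})_\d{8}_R\d{8}_")


def field_date(path):
    """Extract the first date in the file name as a datetime.date."""
    match = DATE_RE.search(path)
    if match is None:
        raise ValueError(f"no date found in {path!r}")
    stamp = match.group(1)
    return date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))


def missing_days(paths):
    """Return the days between the first and last file that have no file."""
    present = {field_date(p) for p in paths}
    if not present:
        return []
    first, last = min(present), max(present)
    span = (last - first).days
    candidates = (first + timedelta(days=i) for i in range(span + 1))
    return [day for day in candidates if day not in present]
```

In the notebook this would be called as `missing_days(fs_files)`; an empty list means the daily series is continuous.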

In [9]:
fs2 = fsspec.filesystem("")  # local file system to save final jsons to

so = dict(
    mode="rb", anon=True, default_fill_cache=False, default_cache_type="first"
)  # args to fs.open()
# default_fill_cache=False avoids caching data between file chunks, which lowers memory usage.


def gen_json(fs, fs2, so, file_url):
    name = file_url.split("/")[-1].split(".")[0]
    outf = f"{name}.json"  # file name to save json to
    if not Path(outf).is_file():
        with fs.open(file_url, **so) as infile:
            h5chunks = SingleHdf5ToZarr(infile, file_url, inline_threshold=300)
            # inline_threshold sets the size (in bytes) below which binary blocks are included directly in the output
            # a higher inline threshold can result in a larger json file but faster loading time
            print(outf)
            with fs2.open(outf, "wb") as f:
                f.write(ujson.dumps(h5chunks.translate()).encode())
    else:
        print("File ", outf, " already exists")
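To see the effect of `inline_threshold`, it can be useful to count how many entries of a reference set are stored inline and how many point to byte ranges in the remote file. The sketch below assumes the version 1 Kerchunk reference layout, where `refs` maps each key either to a string (Zarr metadata or inline data) or to a list of the form `[url, offset, length]`. The helper `summarize_refs` and the tiny example dictionary are made up for illustration.

```python
def summarize_refs(reference):
    """Count metadata, inline and remote entries in a Kerchunk reference dict."""
    refs = reference.get("refs", reference)
    counts = {"metadata": 0, "inline": 0, "remote": 0}
    for key, value in refs.items():
        if key.rsplit("/", 1)[-1].startswith(".z"):
            counts["metadata"] += 1  # .zgroup, .zarray, .zattrs
        elif isinstance(value, (list, tuple)):
            counts["remote"] += 1  # [url, offset, length] byte range
        else:
            counts["inline"] += 1  # data embedded in the JSON itself
    return counts


# Hand-written example, not taken from a real CMEMS file.
example = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "depth/.zarray": "{}",
        "depth/0": "base64:AAAA",
        "thetao/.zarray": "{}",
        "thetao/0.0.0.0": ["s3://bucket/file.nc", 1024, 4096],
    },
}
print(summarize_refs(example))
```

In the notebook the same function could be applied to `h5chunks.translate()` or to a JSON file loaded with `ujson.load`.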

In [10]:
%%time
for file in fs_files:
    gen_json(fs, fs2, so, file)

CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230702_20230702_R20230703_HC01.json
File  CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230703_20230703_R20230704_HC01.json  already exists
File  CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230704_20230704_R20230705_HC01.json  already exists
File  CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230705_20230705_R20230706_HC01.json  already exists
File  CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230706_20230706_R20230707_HC01.json  already exists
File  CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230707_20230707_R20230708_HC01.json  already exists
File  CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230708_20230708_R20230709_HC01.json  already exists
File  CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230709_20230709_R20230710_HC01.json  already exists
File  CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230710_20230710_R20230711_HC01.json  already exists
File  CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230711_20230711_R20230712_HC01.json  already exists
File  CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230712_20230712_R20230713_HC01.json  alr

In [11]:
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "CMEMS_v6r1_NWS_PHY_NRT_NL_01hav3D_20230702_20230702_R20230703_HC01.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": False},
        },
    },
)
print(ds)

<xarray.Dataset> Size: 20GB
Dimensions:    (depth: 50, latitude: 551, longitude: 936, time: 24)
Coordinates:
  * depth      (depth) float32 200B 0.494 1.541 2.646 ... 5.275e+03 5.728e+03
  * latitude   (latitude) float32 2kB 46.0 46.03 46.06 ... 61.23 61.25 61.28
  * longitude  (longitude) float32 4kB -16.0 -15.97 -15.94 ... 9.921 9.949 9.977
  * time       (time) datetime64[ns] 192B 2023-07-02T00:30:00 ... 2023-07-02T...
Data variables:
    so         (time, depth, latitude, longitude) float64 5GB ...
    thetao     (time, depth, latitude, longitude) float64 5GB ...
    uo         (time, depth, latitude, longitude) float64 5GB ...
    vo         (time, depth, latitude, longitude) float64 5GB ...
Attributes: (12/13)
    Conventions:     CF-1.8
    comment:         
    contact:         https://marine.copernicus.eu/contact
    domain_name:     NWS36
    field_date:      20230702
    field_type:      mean
    ...              ...
    forecast_type:   hindcast
    institution:     Nologin

In [12]:
json_list = fs2.glob("CMEMS_*3D*.json")

mzz = MultiZarrToZarr(
    json_list,
    remote_protocol="s3",
    remote_options={"anon": False},
    concat_dims=["time"],
    identical_dims=["latitude", "longitude"],
)

d = mzz.translate()

In [14]:
# Do not write json because it is too large & slow
# with fs2.open('CMEMS_v6r1_NWS_PHY_NRT_NL_3D_combined.json', 'wb') as f:
#    f.write(ujson.dumps(d).encode())

In [13]:
df.refs_to_dataframe(d, "CMEMS_v6r1_NWS_PHY_NRT_NL_3D_combined.parq")

### Copy the Parquet reference store into the remote bucket with rclone

In [17]:
#!rclone copy CMEMS_v6r1_NWS_PHY_NRT_NL_01hav_AN_3D_combined.json gfts:gfts-reference-data/

In [14]:
!rclone sync CMEMS_v6r1_NWS_PHY_NRT_NL_3D_combined.parq gfts:gfts-reference-data/CMEMS_v6r1_NWS_PHY_NRT_NL_3D_combined.parq

## Create catalog for IBI ANALYSIS FORECAST data

In [16]:
bucket_name = "gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-3D_PT1H-m_202211"
s3.ls(bucket_name)

['gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-3D_PT1H-m_202211/2024']

In [17]:
s3path = "s3://gfts-reference-data/IBI_ANALYSISFORECAST_PHY_005_001/cmems_mod_ibi_phy_anfc_0.027deg-3D_PT1H-m_202211/*/*/*.nc"

In [18]:
fs_files = fs.glob(s3path)

In [19]:
fs2 = fsspec.filesystem("")  # local file system to save final jsons to
so = dict(
    mode="rb", anon=True, default_fill_cache=False, default_cache_type="first"
)  # args to fs.open()
# default_fill_cache=False avoids caching data between file chunks, which lowers memory usage.

In [20]:
%%time
for file in fs_files:
    gen_json(fs, fs2, so, file)

CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240515_20240515_R20240516_HC01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240516_20240516_R20240517_HC01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240517_20240517_R20240518_HC01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240518_20240518_R20240519_HC01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240519_20240519_R20240520_HC01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240520_20240520_R20240521_HC01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240521_20240521_R20240522_HC01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240522_20240522_R20240523_HC01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240523_20240523_R20240523_FC01.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240524_20240524_R20240523_FC02.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240525_20240525_R20240523_FC03.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240526_20240526_R20240523_FC04.json
CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240527_20240527_R20240523_FC05.json
CPU times: user 11.8 s, sys: 2.66 s, total: 14.5 s
Wall time: 1m
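The IBI listing mixes files ending in `HC01` with files ending in `FC01` to `FC05`. The sketch below groups file names by that suffix so the two kinds can be inspected or combined separately. Reading `HC` as hindcast and `FC` as forecast is an assumption based on the file names and on the `forecast_type: hindcast` attribute shown earlier for an `HC01` file. The helper `group_by_kind` is made up for this example.

```python
import re
from collections import defaultdict

# Suffix such as "_HC01.json" or "_FC03.nc" at the end of the file name.
KIND_RE = re.compile(r"_(HC|FC)\d{2}\.(?:nc|json)$")


def group_by_kind(names):
    """Group file names by their HC/FC suffix; others go under 'unknown'."""
    groups = defaultdict(list)
    for name in names:
        match = KIND_RE.search(name)
        kind = match.group(1) if match else "unknown"
        groups[kind].append(name)
    return dict(groups)


names = [
    "CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240522_20240522_R20240523_HC01.json",
    "CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240523_20240523_R20240523_FC01.json",
    "CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_20240524_20240524_R20240523_FC02.json",
]
groups = group_by_kind(names)
print({kind: len(files) for kind, files in groups.items()})
```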

In [21]:
json_list = fs2.glob("CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_*_*_*_*.json")

mzz = MultiZarrToZarr(
    json_list,
    remote_protocol="s3",
    remote_options={"anon": False},
    concat_dims=["time"],
    identical_dims=["latitude", "longitude"],
)

d = mzz.translate()

### Save into parquet

In [22]:
df.refs_to_dataframe(d, "CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_combined.parq")

### Copy to our bucket

In [23]:
!rclone sync CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_combined.parq gfts:gfts-reference-data/CMEMS_v6r1_IBI_PHY_NRT_NL_01hav3D_combined.parq