Create a Reference Model with virtualizarr#
In this tutorial, we illustrate how to assemble a reference model from .nc files located in a remote S3 bucket.
We will use the library virtualizarr, which allows us to conveniently concatenate and save the final dataset in the parquet format without loading the data in memory (via a function called open_virtual_dataset()).
Here is a summary of the main steps for building a reference model:
1. Identify where the raw data of your reference model is stored.
2. Copy the data to the (S3) bucket.
3. Unify the data into a single dataset and store it locally in a parq/ folder (i.e., parquet format).
4. Optionally, upload the dataset to the remote bucket to share the model.
We will illustrate this guide with an example using the Ocean Physics Analysis and Forecast model for the European North West Shelf.
We will assemble the following data from 2022 to 2023:
- Sea surface height above geoid, SSH (variable zos).
- Sea Water Potential Temperature (variable thetao).
- Bathymetry (sea bottom depth, variable deptho).
- Sea-land mask (variable mask).
1.a. Find the raw data#
First, we need to find where the raw data of the targeted reference model is stored.
Once you have identified the model you want to use, use this web page to find the URL of the data. Enter/search the name of your model in the "Collections" input. Once you have submitted your search, the available datasets will be listed, along with their respective time intervals. In our case, we look for the hourly mean fields (and the additional static data: bathymetry and land-sea mask), which are located here:
We suggest using the native data, whose URL can be copy-pasted from the "Native dataset" panel in the "Assets" section (see screenshot below).
Note: if you don’t work remotely, feel free to download the files locally! For our illustration though, we assume otherwise.
1.b. Fetch the data#
We fetch the data with rclone. Ensure that you have adequate credentials (check the file ~/.config/rclone/rclone.conf with cat)*.
*For GFTS users, this should be set up automatically during your onboarding.
Adapt and execute the following command:
rclone copy cmarine:SRC_URL gfts:DEST_URL
where SRC_URL is the URL copy-pasted earlier and DEST_URL is the path where you want to store the data on the GFTS bucket, for each URL.
In our example, this corresponds to:
rclone copy cmarine:mdl-native-13/native/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202411 gfts:gfts-reference-data/demo/NWSHELF_ANALYSISFORECAST_PHY_004_013/
rclone copy cmarine:mdl-native-13/native/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309 gfts:gfts-reference-data/demo/NWSHELF_ANALYSISFORECAST_PHY_004_013/
rclone copy cmarine:mdl-native-13/native/NWSHELF_ANALYSISFORECAST_PHY_004_013/cmems_mod_nws_phy_anfc_0.027deg-3D_static_202411/NWS-MFC_004_013_mask_bathy.nc gfts:gfts-reference-data/demo/NWSHELF_ANALYSISFORECAST_PHY_004_013/
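You can then verify the transfer with rclone ls gfts:DEST_URL (or rclone lsd gfts:DEST_URL to list only directories), which lists what is now stored under the destination path.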
2. Assembling The Model#
Preliminaries#
The idea is simple and consists of the following steps, for each path/data source:
1. Open virtually (i.e., without loading data in memory) all the .nc files.
2. Depending on your application, post-process the datasets (e.g., dropping or renaming variables or coordinates).
3. Concatenate them along time.
Then, we merge all the concatenated data and store the resulting dataset with kerchunk in the parquet format (usually, under a *.parq/ folder).
Below, we first present how to set up the notebook (installing the package, defining different variables, etc.). Next, we show how to generally perform the steps above. Finally, we apply them to our example.
Setup#
As mentioned earlier, we rely on the virtualizarr library and its function open_virtual_dataset for opening the remote datasets.
First, install virtualizarr with pip and import the function open_virtual_dataset:
!pip install virtualizarr
Requirement already satisfied: virtualizarr in /srv/conda/envs/notebook/lib/python3.12/site-packages (1.2.0)
Requirement already satisfied: xarray>=2024.10.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from virtualizarr) (2024.11.0)
Requirement already satisfied: numpy>=2.0.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from virtualizarr) (2.1.3)
Requirement already satisfied: packaging in /srv/conda/envs/notebook/lib/python3.12/site-packages (from virtualizarr) (24.1)
Requirement already satisfied: universal-pathlib in /srv/conda/envs/notebook/lib/python3.12/site-packages (from virtualizarr) (0.2.5)
Requirement already satisfied: numcodecs in /srv/conda/envs/notebook/lib/python3.12/site-packages (from virtualizarr) (0.12.1)
Requirement already satisfied: ujson in /srv/conda/envs/notebook/lib/python3.12/site-packages (from virtualizarr) (5.10.0)
Requirement already satisfied: pandas>=2.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from xarray>=2024.10.0->virtualizarr) (2.2.2)
Requirement already satisfied: fsspec!=2024.3.1,>=2022.1.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from universal-pathlib->virtualizarr) (2024.6.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from pandas>=2.1->xarray>=2024.10.0->virtualizarr) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from pandas>=2.1->xarray>=2024.10.0->virtualizarr) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from pandas>=2.1->xarray>=2024.10.0->virtualizarr) (2024.1)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas>=2.1->xarray>=2024.10.0->virtualizarr) (1.16.0)
Import the open_virtual_dataset function and configure the remote access to the S3 bucket.
from virtualizarr import open_virtual_dataset
import s3fs
import xarray as xr
s3 = s3fs.S3FileSystem(
    anon=False,
    profile="gfts",
    client_kwargs={
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    },
)
storage_options = {
    "anon": False,
    "client_kwargs": {
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net/",
        "region_name": "gra",
    },
}  # for opening remote virtual datasets with virtualizarr
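As an optional sanity check, you can list a few objects to confirm that the credentials and endpoint work; the prefix below is the demo path used in this guide, so adapt it to your own bucket:
print(s3.ls("gfts-reference-data/demo")[:5])  # optional: confirm S3 access by listing a few objects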
How-To: General Procedure#
Open virtually the .nc files given a remote root folder#
root = "path_of_your_data_on_s3/" # update w.r.t your use case
s3path = f"{root}*.nc" # update w.r.t your use case
remote_files = sorted(s3.glob(s3path)) # remind to sort (time-wise) the files!
virtual_datasets = [
open_virtual_dataset(
"s3://" + filepath,
reader_options={"storage_options": storage_options},
indexes={}, # we will concatenate the datasets along time later
decode_times=True,
# indicate here the variables/coords you want to load in memory
# by doing so they will be **saved** in the final `.parq/` folder
loadable_variables=["depth", "longitude", "latitude"]
)
for filepath in remote_files
]
Concatenating the vds along time#
vds = xr.concat(
    virtual_datasets,
    dim="time",
    compat="override",
    coords="minimal",
    data_vars="minimal",
    combine_attrs="drop_conflicts",
)
Repeat the procedure for each root path (or data source).
Now, assuming you end up with, for instance, vds1 and vds2, here is how to merge them before saving the result:
import xarray as xr
merged_vds = xr.merge([vds1, vds2])
merged_vds.virtualize.to_kerchunk("output.parq", format="parquet") # update w.r.t your application
NB: if you only have one source (i.e., a single vds), skip the merging step and directly run vds.virtualize.to_kerchunk("output.parq", format="parquet").
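To check the result, the saved reference can be opened back lazily with xarray (this is also what we do in the last section of this guide; the filename matches the to_kerchunk call above):
import xarray as xr
# open the kerchunk/parquet reference lazily; chunks={} turns the variables into dask arrays
ds = xr.open_dataset("output.parq", engine="kerchunk", chunks={})
print(ds)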
Speeding up the computation with dask#
The opening of the remote files can be sped up with dask as follows.
First, create a local cluster:
from distributed import LocalCluster
import dask
cluster = LocalCluster()
client = cluster.get_client()
client
Then, wrap the calls to open_virtual_dataset with dask.delayed(...) and launch the computation with dask.compute().
...
virtual_datasets = [
    dask.delayed(open_virtual_dataset)(
        "s3://" + filepath,
        ...
    )
    for filepath in remote_files
]
vds = dask.compute(*virtual_datasets)
...
Not only does it speed up the computation, but it also lets you follow the progress (very helpful when there are many .nc files to read).
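As an alternative to dask.compute, here is a sketch that submits the delayed tasks through the distributed client and displays a progress bar (progress, client.compute and client.gather are standard distributed functions):
from distributed import progress
# submit the delayed open_virtual_dataset calls to the local cluster
futures = client.compute(virtual_datasets)
progress(futures)  # live progress bar
vds = client.gather(futures)  # collect the opened virtual datasets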
Application: Example#
We finish this how-to guide by applying the procedure above to our example.
Let's first define variables to store the root paths of each data source:
# data sources
root = "gfts-reference-data/demo/NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013/"
static_root = root + "cmems_mod_nws_phy_anfc_0.027deg-3D_static_202411/"
ssh_root = root + "cmems_mod_nws_phy_anfc_0.027deg-2D_PT1H-m_202411/"
temp_root = root + "cmems_mod_nws_phy_anfc_0.027deg-3D_PT1H-m_202309/"
# creates a cluster to speed up the computation
from distributed import LocalCluster
import dask
cluster = LocalCluster()
client = cluster.get_client()
client
/srv/conda/envs/notebook/lib/python3.12/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 35019 instead
warnings.warn(
/srv/conda/envs/notebook/lib/python3.12/site-packages/distributed/client.py:1617: VersionMismatchWarning: Mismatched versions found
+---------+--------+-----------+---------+
| Package | Client | Scheduler | Workers |
+---------+--------+-----------+---------+
| numpy | 2.1.3 | 2.1.3 | 2.0.2 |
+---------+--------+-----------+---------+
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
(Output: a Dask LocalCluster client with 3 workers, 6 threads and 24.00 GiB total memory; dashboard at http://127.0.0.1:35019/status.)
Static data (mask and deptho)#
# locates the `.nc` file
s3path = f"{static_root}*_mask_bathy.nc"
remote_files = s3.glob(s3path)
assert len(remote_files) == 1
print(f"found {len(remote_files)} .nc files.")
print(remote_files)
found 1 .nc files.
['gfts-reference-data/NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013/MetO-NWS-PHY-001-013-STATIC_202112/NWS-MFC_004_013_mask_bathy.nc']
# opens the virtual dataset while dropping unused variables
vds = (
    open_virtual_dataset(
        "s3://" + remote_files[0],
        reader_options={"storage_options": storage_options},
        indexes={},
        decode_times=True,
        loadable_variables=["depth", "longitude", "latitude"],
    )
    .drop_vars(["lat", "lon", "deptho_lev_interp"])
    .rename_dims({"latitude": "lat", "longitude": "lon"})
)
vds
<xarray.Dataset> Size: 162MB Dimensions: (lat: 1240, lon: 958, depth: 33) Coordinates: * depth (depth) float32 132B 0.0 3.0 5.0 10.0 ... 2e+03 3e+03 4e+03 5e+03 * latitude (lat) float32 5kB 46.0 46.01 46.03 46.04 ... 62.72 62.73 62.74 * longitude (lon) float32 4kB -16.0 -15.97 -15.94 -15.91 ... 12.94 12.97 13.0 Dimensions without coordinates: lat, lon Data variables: deptho (lat, lon) float32 5MB ManifestArray<shape=(1240, 958), dtype=... mask (depth, lat, lon) float32 157MB ManifestArray<shape=(33, 1240,... Attributes: Conventions: CF-1.7 contact: servicedesk.cmems@mercator-ocean.eu creation_date: 2021-10-26T06:41:01Z credit: E.U. Copernicus Marine Service Information (CMEMS) forcing_data_source: ECMWF Global Atmospheric Model (HRES); UKMO NATL12;... history: See source and creation_date attributes institution: UK Met Office licence: http://marine.copernicus.eu/services-portfolio/serv... netcdf-version-id: netCDF-4 product: NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013 references: http://marine.copernicus.eu/ source: PS-OS 44, AMM-FOAM 1.5 km (tidal) NEMO v3.6_WAVEWAT...
Explanations:
We want to work with lat and lon coordinates. However, not all the datasets use the same names (longitude vs lon). So, along with deptho_lev_interp, we remove lat and lon (if any) and then rename latitude and longitude to lat and lon, respectively.
static_vds = vds.rename({"latitude": "lat", "longitude": "lon"})
static_vds
/tmp/ipykernel_374/3887474104.py:1: UserWarning: rename 'latitude' to 'lat' does not create an index anymore. Try using swap_dims instead or use set_index after rename to create an indexed coordinate.
static_vds = vds.rename({"latitude": "lat", "longitude": "lon"})
/tmp/ipykernel_374/3887474104.py:1: UserWarning: rename 'longitude' to 'lon' does not create an index anymore. Try using swap_dims instead or use set_index after rename to create an indexed coordinate.
static_vds = vds.rename({"latitude": "lat", "longitude": "lon"})
<xarray.Dataset> Size: 162MB Dimensions: (lat: 1240, lon: 958, depth: 33) Coordinates: * depth (depth) float32 132B 0.0 3.0 5.0 10.0 ... 2e+03 3e+03 4e+03 5e+03 * lat (lat) float32 5kB 46.0 46.01 46.03 46.04 ... 62.7 62.72 62.73 62.74 * lon (lon) float32 4kB -16.0 -15.97 -15.94 -15.91 ... 12.94 12.97 13.0 Data variables: deptho (lat, lon) float32 5MB ManifestArray<shape=(1240, 958), dtype=fl... mask (depth, lat, lon) float32 157MB ManifestArray<shape=(33, 1240, 9... Attributes: Conventions: CF-1.7 contact: servicedesk.cmems@mercator-ocean.eu creation_date: 2021-10-26T06:41:01Z credit: E.U. Copernicus Marine Service Information (CMEMS) forcing_data_source: ECMWF Global Atmospheric Model (HRES); UKMO NATL12;... history: See source and creation_date attributes institution: UK Met Office licence: http://marine.copernicus.eu/services-portfolio/serv... netcdf-version-id: netCDF-4 product: NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013 references: http://marine.copernicus.eu/ source: PS-OS 44, AMM-FOAM 1.5 km (tidal) NEMO v3.6_WAVEWAT...
Sea Surface Height (zos)#
remote_files = sorted(
    s3.glob(f"{ssh_root}2022/*/*.nc") + s3.glob(f"{ssh_root}2023/*/*.nc")
)
print(f"found {len(remote_files)} .nc files.")
found 743 .nc files.
zos_virtual_datasets = [
    dask.delayed(open_virtual_dataset)(
        "s3://" + filepath,
        reader_options={"storage_options": storage_options},
        indexes={},
        decode_times=True,
        loadable_variables=["time", "lat", "lon"],
    )
    for filepath in remote_files
]
zvds = dask.compute(*zos_virtual_datasets)
zos_vds = xr.concat(
    zvds,
    dim="time",
    compat="override",
    coords="minimal",
    data_vars="minimal",
    combine_attrs="drop_conflicts",
)
zos_vds
<xarray.Dataset> Size: 42GB Dimensions: (time: 17832, lat: 1240, lon: 958) Coordinates: * time (time) datetime64[ns] 143kB 2022-01-01T01:00:00 ... 2024-01-14 * lat (lat) float32 5kB 46.0 46.01 46.03 46.04 ... 62.7 62.72 62.73 62.74 * lon (lon) float32 4kB -16.0 -15.97 -15.94 -15.91 ... 12.94 12.97 13.0 Data variables: zos (time, lat, lon) int16 42GB ManifestArray<shape=(17832, 1240, 95... Attributes: Conventions: CF-1.7 contact: servicedesk.cmems@mercator-ocean.eu credit: E.U. Copernicus Marine Service Information (CMEMS) forcing_data_source: ECMWF Global Atmospheric Model (HRES); CMEMS-MERCAT... history: See source and creation_date attributes institution: UK Met Office licence: http://marine.copernicus.eu/services-portfolio/serv... netcdf-version-id: netCDF-4 product: NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013 references: http://marine.copernicus.eu/ source: PS-OS 45, AMM-FOAM 1.5 km (tidal) NEMO v3.6_WAVEWAT... title: hourly-instantaneous SSH (2D)
Sea Water Potential Temperature (thetao)#
thetao_virtual_datasets = [
    dask.delayed(open_virtual_dataset)(
        "s3://" + filepath,
        reader_options={"storage_options": storage_options},
        indexes={},
        decode_times=True,
        loadable_variables=[
            "time",
            "lat",
            "lon",
            "depth",
        ],  # note that we also load (and thus save) the depths
    )
    for filepath in sorted(
        s3.glob(f"{temp_root}2022/*/*.nc") + s3.glob(f"{temp_root}2023/*/*.nc")
    )
]
tvds = dask.compute(*thetao_virtual_datasets)
thetao_vds = xr.concat(
    tvds,
    dim="time",
    compat="override",
    coords="minimal",
    data_vars="minimal",
    combine_attrs="drop_conflicts",
)
thetao_vds
<xarray.Dataset> Size: 1TB Dimensions: (time: 17832, depth: 33, lat: 1240, lon: 958) Coordinates: * time (time) datetime64[ns] 143kB 2022-01-01T01:00:00 ... 2024-01-14 * lat (lat) float32 5kB 46.0 46.01 46.03 46.04 ... 62.7 62.72 62.73 62.74 * lon (lon) float32 4kB -16.0 -15.97 -15.94 -15.91 ... 12.94 12.97 13.0 * depth (depth) float32 132B 0.0 3.0 5.0 10.0 ... 2e+03 3e+03 4e+03 5e+03 Data variables: thetao (time, depth, lat, lon) int16 1TB ManifestArray<shape=(17832, 33... Attributes: Conventions: CF-1.7 contact: servicedesk.cmems@mercator-ocean.eu credit: E.U. Copernicus Marine Service Information (CMEMS) forcing_data_source: ECMWF Global Atmospheric Model (HRES); CMEMS-MERCAT... history: See source and creation_date attributes institution: UK Met Office licence: http://marine.copernicus.eu/services-portfolio/serv... netcdf-version-id: netCDF-4 product: NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013 references: http://marine.copernicus.eu/ source: PS-OS 45, AMM-FOAM 1.5 km (tidal) NEMO v3.6_WAVEWAT... title: hourly-instantaneous potential temperature (3D)
Merging the three data sources#
merged_vds = xr.merge([static_vds, zos_vds, thetao_vds])
merged_vds.virtualize.to_kerchunk("combined_final.parq", format="parquet")
merged_vds
<xarray.Dataset> Size: 1TB Dimensions: (lat: 1240, lon: 958, depth: 33, time: 17832) Coordinates: * depth (depth) float32 132B 0.0 3.0 5.0 10.0 ... 2e+03 3e+03 4e+03 5e+03 * lat (lat) float32 5kB 46.0 46.01 46.03 46.04 ... 62.7 62.72 62.73 62.74 * lon (lon) float32 4kB -16.0 -15.97 -15.94 -15.91 ... 12.94 12.97 13.0 * time (time) datetime64[ns] 143kB 2022-01-01T01:00:00 ... 2024-01-14 Data variables: deptho (lat, lon) float32 5MB ManifestArray<shape=(1240, 958), dtype=fl... mask (depth, lat, lon) float32 157MB ManifestArray<shape=(33, 1240, 9... zos (time, lat, lon) int16 42GB ManifestArray<shape=(17832, 1240, 95... thetao (time, depth, lat, lon) int16 1TB ManifestArray<shape=(17832, 33... Attributes: (12/13) Conventions: CF-1.7 contact: servicedesk.cmems@mercator-ocean.eu creation_date: 2021-10-26T06:41:01Z credit: E.U. Copernicus Marine Service Information (CMEMS) forcing_data_source: ECMWF Global Atmospheric Model (HRES); UKMO NATL12;... history: See source and creation_date attributes ... ... licence: http://marine.copernicus.eu/services-portfolio/serv... netcdf-version-id: netCDF-4 product: NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013 references: http://marine.copernicus.eu/ source: PS-OS 44, AMM-FOAM 1.5 km (tidal) NEMO v3.6_WAVEWAT... coordinates: depth lat lon time
Option: copy the model to S3#
As before, we can use rclone to send the model to the bucket:
rclone copy combined_final.parq gfts:gfts-reference-data/demo/NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013/
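Other users can then open the reference directly from S3. Below is a minimal sketch using an fsspec reference filesystem; it assumes the parquet reference uploaded above and reuses the storage_options defined earlier, and the exact option names may vary with your fsspec/kerchunk versions:
import fsspec
import xarray as xr
# point a "reference" filesystem at the parquet reference stored on S3
fs = fsspec.filesystem(
    "reference",
    fo="s3://gfts-reference-data/demo/NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013/combined_final.parq",
    remote_protocol="s3",
    remote_options=storage_options,  # access to the referenced .nc chunks
    target_protocol="s3",
    target_options=storage_options,  # access to the reference itself
    lazy=True,
)
ds = xr.open_dataset(
    fs.get_mapper(), engine="zarr", backend_kwargs={"consolidated": False}, chunks={}
)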
Fixing a virtualizarr bug and quick data checking#
In the following, we show how to fix a shortcoming of virtualizarr which causes the first depth value to equal nan upon opening the dataset.
We conclude with some ways to quickly check that your model is complete.
import xarray as xr
# not needed if already done
from distributed import LocalCluster
cluster = LocalCluster()
client = cluster.get_client()
client
/srv/conda/envs/notebook/lib/python3.12/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 39135 instead
warnings.warn(
/srv/conda/envs/notebook/lib/python3.12/site-packages/distributed/client.py:1617: VersionMismatchWarning: Mismatched versions found
+---------+--------+-----------+---------+
| Package | Client | Scheduler | Workers |
+---------+--------+-----------+---------+
| numpy | 2.1.3 | 2.1.3 | 2.0.2 |
+---------+--------+-----------+---------+
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
(Output: a Dask LocalCluster client with 3 workers, 6 threads and 24.00 GiB total memory; dashboard at http://127.0.0.1:39135/status.)
combined_ds = xr.open_dataset(
    "combined_final.parq", engine="kerchunk", chunks={}
)  # chunks implicitly calls `dask`
combined_ds.coords["depth"].values[0] = 0.0  # fix the first depth value (read as nan)
combined_ds
<xarray.Dataset> Size: 3TB Dimensions: (depth: 33, lat: 1240, lon: 958, time: 8760) Coordinates: * depth (depth) float32 132B 0.0 3.0 5.0 10.0 ... 2e+03 3e+03 4e+03 5e+03 * lat (lat) float32 5kB 46.0 46.01 46.03 46.04 ... 62.7 62.72 62.73 62.74 * lon (lon) float32 4kB -16.0 -15.97 -15.94 -15.91 ... 12.94 12.97 13.0 * time (time) datetime64[ns] 70kB 2022-01-01T01:00:00 ... 2023-01-01 Data variables: deptho (lat, lon) float32 5MB dask.array<chunksize=(620, 479), meta=np.ndarray> mask (depth, lat, lon) float32 157MB dask.array<chunksize=(33, 1240, 958), meta=np.ndarray> mdt (lat, lon) float32 5MB dask.array<chunksize=(1240, 958), meta=np.ndarray> thetao (time, depth, lat, lon) float64 3TB dask.array<chunksize=(1, 11, 414, 320), meta=np.ndarray> zos (time, lat, lon) float64 83GB dask.array<chunksize=(1, 1240, 958), meta=np.ndarray> Attributes: Conventions: CF-1.7 contact: servicedesk.cmems@mercator-ocean.eu creation_date: 2021-10-26T06:41:01Z credit: E.U. Copernicus Marine Service Information (CMEMS) forcing_data_source: ECMWF Global Atmospheric Model (HRES); UKMO NATL12;... history: See source and creation_date attributes institution: UK Met Office licence: http://marine.copernicus.eu/services-portfolio/serv... netcdf-version-id: netCDF-4 product: NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013 references: http://marine.copernicus.eu/ source: PS-OS 44, AMM-FOAM 1.5 km (tidal) NEMO v3.6_WAVEWAT...
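You can verify the fix with a quick assertion:
# confirm that no depth value is nan after the fix
assert not bool(combined_ds["depth"].isnull().any())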
Data checking#
combined_ds.zos.mean(dim=["lon", "lat"]).plot()
[<matplotlib.lines.Line2D at 0x7f4e1eff10d0>]
![Time series of the spatially averaged zos](_images/b72f366a7f8ed750c6bd9ee67ac48993b7d90c1aecd9695a9bf3cb982c324445.png)
Counting the data#
combined_ds.sel(time="2022-01") # 743 values along time
<xarray.Dataset> Size: 240GB Dimensions: (depth: 33, lat: 1240, lon: 958, time: 743) Coordinates: * depth (depth) float32 132B 0.0 3.0 5.0 10.0 ... 2e+03 3e+03 4e+03 5e+03 * lat (lat) float32 5kB 46.0 46.01 46.03 46.04 ... 62.7 62.72 62.73 62.74 * lon (lon) float32 4kB -16.0 -15.97 -15.94 -15.91 ... 12.94 12.97 13.0 * time (time) datetime64[ns] 6kB 2022-01-01T01:00:00 ... 2022-01-31T23:... Data variables: deptho (lat, lon) float32 5MB dask.array<chunksize=(620, 479), meta=np.ndarray> mask (depth, lat, lon) float32 157MB dask.array<chunksize=(33, 1240, 958), meta=np.ndarray> mdt (lat, lon) float32 5MB dask.array<chunksize=(1240, 958), meta=np.ndarray> thetao (time, depth, lat, lon) float64 233GB dask.array<chunksize=(1, 11, 414, 320), meta=np.ndarray> zos (time, lat, lon) float64 7GB dask.array<chunksize=(1, 1240, 958), meta=np.ndarray> Attributes: Conventions: CF-1.7 contact: servicedesk.cmems@mercator-ocean.eu creation_date: 2021-10-26T06:41:01Z credit: E.U. Copernicus Marine Service Information (CMEMS) forcing_data_source: ECMWF Global Atmospheric Model (HRES); UKMO NATL12;... history: See source and creation_date attributes institution: UK Met Office licence: http://marine.copernicus.eu/services-portfolio/serv... netcdf-version-id: netCDF-4 product: NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013 references: http://marine.copernicus.eu/ source: PS-OS 44, AMM-FOAM 1.5 km (tidal) NEMO v3.6_WAVEWAT...
# averages over lon and lat and compares the number of values to the one above
count = (
    combined_ds.zos.sel(time="2022-01").mean(dim=["lon", "lat"]).count().compute()
)  # 743 too
count
<xarray.DataArray 'zos' ()> Size: 8B np.int64(743)
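A complementary check (a small sketch) is to compare the length of the time axis against the number of hourly timestamps expected for the selected period:
import pandas as pd
# January 2022: hourly data starting at 01:00 on the 1st gives 743 timestamps
expected = len(pd.date_range("2022-01-01 01:00", "2022-01-31 23:00", freq="h"))
print(expected, combined_ds.sel(time="2022-01").sizes["time"])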