Tutorial: processing several tags with kbatch_papermill#

In this tutorial, we cover how to process your biologging data, hereafter referred to as tags.

Here, we will illustrate the case where you want to run the fish tracking estimation model implemented by pangeo-fish in its entirety, for all your tags.

As a biologist, you might wonder how to run the estimation model on all these data…

This tutorial aims to clarify this point, by providing you a way to automatically scale this processing!

The overall idea is the following:

  1. First, you write a Jupyter notebook that performs all the operations you want for a single tag (using functions from pangeo-fish). This can include computing data, plotting some of the results along the way, checking the data as the cells run, etc.

  2. Then, thanks to a launcher notebook that you will learn to write in this tutorial, you run all the tags at once, with whatever HPC resources you have access to.

Note that this workflow is compatible with any type of computation. That being said, please be aware of the following limitations (and pitfalls):

  1. “Inside” the HPC, the notebooks are run in containers, whose local storage is lost once the notebook has been executed. As such, if the notebook saves some data, make sure to send it somewhere, typically to an S3 bucket (see the sketch after this list).

  2. Since the executed notebooks are retrieved only after their execution, interactive plots won’t be shown. Therefore, we recommend either saving them (as HTML files, for example) or adding cells to the notebook that render a static equivalent of each plot.
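To illustrate the first point, here is a minimal sketch of how the per-tag notebook could push a locally written file to an S3 bucket before the container is torn down (the bucket, prefix and file names below are placeholders):

import s3fs

# placeholder paths: adapt the bucket/prefix and file names to your own setup
s3 = s3fs.S3FileSystem(anon=False)
s3.put("results/trajectory.html", "s3://my-bucket/my-user/A19124/trajectory.html")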

Technologies behind the launcher notebook#

The workflow covered in this tutorial, the launcher notebook, relies on kbatch_papermill, a package that lets you parametrize notebooks and run them as jobs.

Specifically, kbatch_papermill is built on top of papermill, a Python library that enables parameterized execution of Jupyter notebooks. kbatch_papermill provides a convenient API for running the aforementioned notebooks as jobs on your cluster.
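To give you a concrete idea of what papermill does, here is what a direct, local call looks like (the paths are placeholders); kbatch_papermill essentially wraps such a call and submits it to your cluster as a job:

import papermill as pm

# run the notebook locally, injecting `tag_name` into its "parameters" cell
pm.execute_notebook(
    "notebooks/pangeo-fish.ipynb",  # input notebook
    "executed/pangeo-fish_A19124.ipynb",  # where the executed copy is written
    parameters={"tag_name": "A19124"},
)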

In fact, kbatch_papermill has been primarily designed for this use-case!

To summarize, in this tutorial notebook, you will learn how to write a launcher notebook. Here, the routine we will set up is the generation of fish location estimations for your tags.

Before we set out to do anything, let’s import all the required Python packages:

import json
import os
import re
import s3fs

from pathlib import Path
from tqdm.notebook import tqdm

from kbatch_papermill import kbatch_papermill, print_job_status

Main inputs for launcher notebook#

a. Details#

In a nutshell, the launcher notebook consists of submitting jobs to your HPC resources, where each job executes a parametrized notebook.

The function used to submit the jobs is kbatch_papermill.kbatch_papermill, which requires:

  1. Information about the notebook:

    • code_dir: path to the folder containing the notebook. The folder will be copied alongside the notebook itself, so that any file needed for your tasks is available.

    • notebook: the path to the notebook itself, relative to code_dir.

  2. Information about the remote storage of the notebook:

    • s3_dest: the URI where the executed notebook will be saved

NB: kbatch_papermill does have more parameters that you might be interested in using once you have gained more experience. In this tutorial, we will define them for you!
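To fix ideas, here is a minimal sketch of such a call, with placeholder paths (the complete call used in this tutorial is shown further below):

from kbatch_papermill import kbatch_papermill

# minimal sketch with placeholder values; see the full call later in this tutorial
job_id = kbatch_papermill(
    code_dir="pangeo-fish",  # folder copied alongside the notebook
    notebook="notebooks/pangeo-fish.ipynb",  # path to the notebook, relative to `code_dir`
    s3_dest="s3://my-bucket/my-user/nbs/example.ipynb",  # where the executed notebook is saved
    parameters={"tag_name": "A19124"},  # values injected into the notebook
)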

b. Application#

First, clone the pangeo-fish repository to get the notebook:

# in a new terminal
git clone https://github.com/pangeo-fish/pangeo-fish.git pangeo-fish
# input/local variables
code_dir = Path.home() / "pangeo-fish"
notebook = "notebooks/pangeo-fish.ipynb"

# where to store the result of this tutorial
s3_dest = "s3://gfts-ifremer/kbatch_papermill/"
# additional variables
user_name = os.getenv("JUPYTERHUB_USER")
storage_options = {
    "anon": False,
    "client_kwargs": {
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    },
}
# append your username to `s3_dest`
s3_dest += user_name
s3_nb_dest = (
    f"{s3_dest}/nbs"  # the notebooks will be stored in a dedicated directory "nbs"
)
# remote accessor
s3 = s3fs.S3FileSystem(anon=False)
s3.mkdir(s3_nb_dest, exist_ok=True)

print("Remote storage root:", s3_dest)
print("Remote storage root for the notebooks:", s3_nb_dest)

Additionally, let’s define a folder where we will save:

  1. The metadata of the jobs we run (in a .json file, jobs.json)

  2. The remotely stored notebooks, fetched once they have been executed

local_output = Path("notebook_launcher_tutorial")
local_output.mkdir(exist_ok=True)
job_dict = {}

Parametrizing and launching notebooks#

a. Details#

In this tutorial, we simply run the example notebook included in the pangeo-fish repository, with different tag names.

To do so, we need to change the variable tag_name of the notebook, since it corresponds to the name of the tag to process. For instance, let’s run it for the tags A19124, A18831 and A18832.
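For reference, papermill injects these values into the cell of the notebook that is tagged parameters. A hypothetical sketch of what such a cell may look like in the example notebook (the default values below are placeholders):

# cell tagged "parameters" in the example notebook (hypothetical defaults)
tag_name = "A19124"  # name of the tag to process; overridden at launch time
scratch_root = "s3://gfts-ifremer/kbatch_papermill/"  # remote root where results are written
storage_options = {"anon": False}  # options to access the remote storage
ref_url = "s3://..."  # URL to the reference data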

b. Application#

parameters = {
    "storage_options": storage_options,
    "scratch_root": s3_dest,  # in the notebook, the remote root is defined with the variable `scratch_root`
    # URL to the reference data
    "ref_url": "s3://gfts-reference-data/NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013/combined_2022_to_2024.parq/",
}
tag_list = ["A19124", "A18831", "A18832"]
for tag_name in tqdm(tag_list, desc="Processing tags"):
    try:
        # removes from the tag name any characters (such as "_") that conflict with Kubernetes naming rules
        safe_tag_name = re.sub(r"[^a-z0-9-]", "", tag_name.lower())
        # parameters (with `tag_name`)
        params = parameters | {"tag_name": tag_name}
        # where to store the notebook remotely
        s3_nb_path = f"{s3_nb_dest}/{tag_name}.ipynb"

        job_id = kbatch_papermill(
            # input info
            code_dir=code_dir,
            notebook=notebook,
            # output info
            s3_dest=s3_nb_path,
            parameters=params,
            # additional parameters (not explained here)
            job_name=f"tuto-{safe_tag_name}",  # name of the job (here, w.r.t the name of the tag)
            s3_code_dir=f"gfts-ifremer/kbatch/{user_name}",  # where to zip and dump the code for the container
            profile_name="big60",  # specification of the container's hardware
        )
        print(
            f'Notebook for the tag "{tag_name}" has been launched as the job "{job_id}"!'
        )

        # we keep the remote paths of the launched jobs
        job_dict[job_id] = s3_nb_path
    except Exception as e:
        print(f"Error for {tag_name}: {e.__class__.__name__}: {e}")
        raise
# saves the jobs' metadata in the local folder
dict_path = local_output / "jobs.json"
with dict_path.open("w") as file:
    json.dump(job_dict, file)

You can monitor the status of the jobs with the following cell:

print_job_status()

When the jobs are finished, you can fetch the notebooks locally with the remote accessor s3:

s3.get(f"{s3_nb_dest}/*", local_output, recursive=True)

As for the results of each notebook (or tag), they are stored next to s3_nb_dest, under s3_dest. You can explore them with the ls function of s3:

s3.ls(s3_dest)

Further Readings#

Extended Code Explanation#

This section aims to provide you with additional explanations of the code.

It targets users who want to gain better knowledge about the kbatch_papermill function.

The last parameters of kbatch_papermill() that are not covered above are the following:

  • s3_code_dir

  • profile_name

There is little to add about s3_code_dir. It defines the path to an S3 directory in which the files under code_dir are zipped into a .zip archive that is later sent to the Kubernetes containers. Upon execution of the containers, the .zip files are removed.

As for profile_name, it defines the specification of the container’s hardware resources. To see the profiles available on your HPC, open a terminal and run the following command:

kbatch profiles