Tutorial: processing several tags with kbatch_papermill#
In this tutorial, we cover the processing of your biologging data, hereafter referred to as tags.
Here, we illustrate the case where you want to run the fish tracking estimation model implemented by pangeo-fish in its entirety, for all your tags.
As a biologist, you might wonder how to run the estimation model on all these data…
This tutorial aims to clarify this point, by providing you with a way to automatically scale this processing!
The overall idea is the following:
First, you write a Jupyter notebook that performs all the operations you want for a tag (using some functions of pangeo-fish). It can compute data, plot some of the results along the way, check the data as the cells go on, etc. Then, thanks to a launcher notebook that we are going to learn to write here, you run all the tags at once, with whatever HPC resources you have access to.
Note that this workflow is compatible with any type of computation. That being said, please be aware of the following limitations (and pitfalls):
- “Inside” the HPC, the notebooks are run in containers, whose local storage is lost once the notebook has been executed. As such, if the latter saves some data, make sure to send it somewhere (typically, to an S3 bucket; see the sketch right after this list).
- Since the executed notebooks are retrieved after their execution, any interactive plot won’t be shown. Therefore, we recommend either saving them (as HTML files, for example) or adding cells in the notebook that statically plot an equivalent figure.
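For instance, a minimal sketch of what the end of a processed notebook could do to persist its outputs (the bucket and file names below are purely illustrative):
import s3fs

# hypothetical final cell of the *processed* notebook: persist outputs to S3
# before the container (and its local storage) disappears
s3 = s3fs.S3FileSystem(anon=False)
remote_root = "s3://my-bucket/tags/A19124"  # illustrative destination

# upload a result file written locally earlier in the notebook
s3.put("results.nc", f"{remote_root}/results.nc")

# an interactive plot exported as HTML (e.g. with `holoviews.save(plot, "plot.html")`)
# can be uploaded the same way:
# s3.put("plot.html", f"{remote_root}/plot.html")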
Technologies behind the launcher notebook#
The workflow covered in this tutorial, the launcher notebook, relies on kbatch_papermill, a package that lets you parametrize notebooks and run them as jobs.
Specifically, kbatch_papermill is built on top of papermill, a Python library that enables parameterized execution of Jupyter notebooks.
kbatch_papermill provides a convenient API for running the aforementioned notebooks as jobs on your cluster.
In fact, kbatch_papermill has been primarily designed for this use case!
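To give you an idea of what papermill does on its own, here is a minimal, standalone sketch (the notebook file names are hypothetical):
import papermill as pm

# executes the input notebook with `tag_name` injected into its "parameters" cell,
# and writes the executed copy to the output path
pm.execute_notebook(
    "pangeo-fish.ipynb",  # input notebook (hypothetical name)
    "pangeo-fish_A19124.ipynb",  # executed copy written by papermill
    parameters={"tag_name": "A19124"},
)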
To summarize, in this tutorial notebook, you will learn how to write a launcher notebook. Here, the routine we will set up is the generation of fish location estimations for your tags.
Before we set out to do anything, let’s import all the required Python packages:
import json
import os
import re
from pathlib import Path

import s3fs
from kbatch_papermill import kbatch_papermill, print_job_status
from tqdm.notebook import tqdm
Main inputs for launcher notebook#
a. Details#
In a nutshell, the launcher notebook consists of submitting jobs to your HPC resources, where each job executes a parametrized notebook.
The function used to submit the jobs is kbatch_papermill.kbatch_papermill, which requires:
- Information about the notebook:
  - code_dir: the path to the folder containing the notebook. The folder is copied alongside the notebook itself, giving you access to any file necessary for your tasks.
  - notebook: the path to the notebook itself, relative to code_dir.
- Information about the remote storage of the notebook:
  - s3_dest: the URI where the executed notebook will be saved.
NB: kbatch_papermill does have more parameters, which you might be interested in using once you have gained more experience. In this tutorial, we will define them for you! A minimal example call is sketched below.
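To give you a feeling of the API before the real application that follows, a bare-bones call might look as follows (all paths and bucket names are placeholders):
from kbatch_papermill import kbatch_papermill

# minimal submission sketch: only the parameters described above
job_id = kbatch_papermill(
    code_dir="/home/jovyan/pangeo-fish",  # folder copied for the job (placeholder)
    notebook="notebooks/pangeo-fish.ipynb",  # path relative to `code_dir`
    s3_dest="s3://my-bucket/nbs/output.ipynb",  # where the executed notebook is saved (placeholder)
    parameters={"tag_name": "A19124"},  # papermill parameters for this run
)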
b. Application#
First, clone the pangeo-fish repository to get the notebook:
# in a new terminal
git clone https://github.com/pangeo-fish/pangeo-fish.git pangeo-fish
# input/local variables
code_dir = Path.home() / "pangeo-fish"
notebook = "notebooks/pangeo-fish.ipynb"
# where to store the result of this tutorial
s3_dest = "s3://gfts-ifremer/kbatch_papermill/"
# additional variables
user_name = os.getenv("JUPYTERHUB_USER")
storage_options = {
    "anon": False,
    "client_kwargs": {
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
        "region_name": "gra",
    },
}
# appends your username to `s3_dest`
s3_dest += user_name
s3_nb_dest = (
    f"{s3_dest}/nbs"  # the notebooks will be stored in a dedicated directory "nbs"
)
# remote accessor
s3 = s3fs.S3FileSystem(anon=False)
s3.mkdir(s3_nb_dest, exist_ok=True)
print("Remote storage root:", s3_dest)
print("Remote storage root for the notebooks:", s3_nb_dest)
Additionally, let’s define a folder where we will save:
- the metadata of what we run (in a .json file, jobs.json),
- the remotely stored notebooks, fetched once they are executed.
local_output = Path("notebook_launcher_tutorial")
local_output.mkdir(exist_ok=True)
job_dict = {}
Parametrizing and launching notebooks#
a. Details#
In this tutorial, we simply run the example notebook included in the pangeo-fish repository, with different tag names.
To do so, we need to change the notebook’s variable tag_name, since it corresponds to the name of the tag to process.
For instance, let’s run it for the tags A19124, A18831 and A18832.
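As a reminder of how this works on the notebook side: papermill injects the submitted values right after the cell tagged parameters, so the processed notebook only needs to define defaults there. A sketch of such a cell (the default values are purely illustrative):
# cell tagged "parameters" in the processed notebook;
# papermill overrides these defaults with the submitted values
tag_name = "A19124"  # name of the tag to process
scratch_root = "s3://my-bucket"  # remote root for the outputs (placeholder)
storage_options = {"anon": False}  # fsspec options for the remote storage
ref_url = "s3://my-bucket/reference.parq"  # reference data (placeholder)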
b. Application#
parameters = {
    "storage_options": storage_options,
    "scratch_root": s3_dest,  # in the notebook, the remote root is defined with the variable `scratch_root`
    # URL to the reference data
    "ref_url": "s3://gfts-reference-data/NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013/combined_2022_to_2024.parq/",
}
tag_list = ["A19124", "A18831", "A18832"]
for tag_name in tqdm(tag_list, desc="Processing tags"):
    try:
        # removes from the tag name any characters (such as "_") that conflict with Kubernetes
        safe_tag_name = re.sub(r"[^a-z0-9-]", "", tag_name.lower())
        # parameters (with `tag_name`)
        params = parameters | {"tag_name": tag_name}
        # where to store the notebook remotely
        s3_nb_path = f"{s3_nb_dest}/{tag_name}.ipynb"
        job_id = kbatch_papermill(
            # input info
            code_dir=code_dir,
            notebook=notebook,
            # output info
            s3_dest=s3_nb_path,
            parameters=params,
            # additional parameters (not explained here)
            job_name=f"tuto-{safe_tag_name}",  # name of the job (here, w.r.t. the name of the tag)
            s3_code_dir=f"gfts-ifremer/kbatch/{user_name}",  # where to zip and dump the code for the container
            profile_name="big60",  # specification of the container's hardware
        )
        print(
            f'Notebook for the tag "{tag_name}" has been launched as the job "{job_id}"!'
        )
        # we keep the remote paths of the launched jobs
        job_dict[job_id] = s3_nb_path
    except Exception as e:
        print(f"Error for {tag_name}: {e.__class__.__name__}: {e}")
        raise

# saves the jobs' metadata in the local folder
dict_path = local_output / "jobs.json"
with dict_path.open("w") as file:
    json.dump(job_dict, file)
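Since the metadata is saved to jobs.json, you can also reload it in a later session (reusing json and local_output defined above) instead of keeping this notebook open:
# optional: reload the jobs' metadata in a later session
with (local_output / "jobs.json").open() as file:
    job_dict = json.load(file)

for job_id, s3_nb_path in job_dict.items():
    print(job_id, "->", s3_nb_path)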
You can monitor the status of the jobs with the following cell:
print_job_status()
When the jobs are finished, you can fetch the notebooks locally with the remote accessor s3:
s3.get(f"{s3_nb_dest}/*", local_output, recursive=True)
As for the results of each notebook (or tag), they are stored next to s3_nb_dest, under s3_dest.
You can explore them with the ls function of s3:
s3.ls(s3_dest)
Further Readings#
- For more information about the results, please check pangeo-fish’s tutorial.
- To learn how to parameterize Jupyter notebooks, see the papermill documentation.
Extended Code Explanation#
This section aims to provide you with additional explanations of the code.
It targets users who want to gain a better knowledge of the kbatch_papermill function.
The last parameters of kbatch_papermill() that are not covered above are the following:
- s3_code_dir
- profile_name
There is little to add for s3_code_dir.
It defines the path to a remote directory in which the files under code_dir are zipped into a .zip archive that is later sent to the Kubernetes containers.
Once the containers have been executed, the .zip files are removed.
As for profile_name, it defines the specification of the container’s hardware resources.
To see the available profiles of your HPC, open a terminal and run the following command:
kbatch profiles