Tag Data Preparation#

This tutorial aims to guide you through how we prepare raw tag data.

The main goal of this preparation is to ensure that the time stamps are expressed in UTC.

First, we detail what we mean by “raw data”. Then, we present what the processing consists of. Finally, we apply it to an example.

Raw Data Description#

We expect the raw data of a given tag id to be located in a specific folder of the GFTS bucket. This folder should contain 4 .csv files:

  • id.csv contains the recorded data as a table with the columns Date Time Stamp, Pressure and Temp.

  • acoustic.csv contains the acoustic detections of the fish (for instance, by receiver stations). If there is no detection, the file is an empty table.

    Otherwise, we expect the columns date_time, deployment_id, deploy_longitude and deploy_latitude.

  • metadata.csv contains information about the tag. It can be any tabular data.

  • tagging_events_id.csv contains the times and positions of the release and recapture events.

    We expect a 2 by 4 tabular file, with the columns event*, time, longitude and latitude. The event* column is the index, whose values describe the events, e.g., “release” and “fish death”. The recapture information can contain NaN longitude and latitude values. An illustrative layout is sketched after this list.
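For illustration, here is a minimal, hypothetical tagging events file (the content and values are made up) loaded with pandas to show the expected layout:

import io

import pandas as pd

# Hypothetical raw content of a tagging_events_id.csv file (made-up values).
# Note the event* index column and the missing recapture position (NaN).
raw_events = """event*,time,longitude,latitude
release,2014-05-21 22:00:00,5.5369,47.966
fish death,2014-06-02 06:00:00,,
"""
df = pd.read_csv(io.StringIO(raw_events), index_col="event*")
print(df)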

Processing Description#

In the following, we present the 4 functions that process each of the files mentioned above. Here is a brief description of their objectives (the UTC conversion they share is sketched after the list):

  • For id.csv: the time stamps are converted to UTC time and the columns are renamed to temperature and pressure.

  • For acoustic.csv: the time stamps are converted to UTC time.

  • For metadata.csv: the file is loaded as a DataFrame and then exported as a .json file.

  • For tagging_events_id.csv: the time stamps are converted to UTC and the columns are renamed to event_name, longitude and latitude.
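At their core, these conversions rely on pandas’ tz_localize and tz_convert. A minimal sketch with made-up time stamps:

import pandas as pd

# Naive time stamps recorded at UTC+2: "Etc/GMT-2" denotes UTC+2,
# since the sign of the Etc/GMT* zones is inverted (see the links below).
naive = pd.DatetimeIndex(["2014-05-21 22:00:00", "2014-05-21 22:01:30"])
print(naive.tz_localize("Etc/GMT-2").tz_convert("UTC"))
# DatetimeIndex(['2014-05-21 20:00:00+00:00', '2014-05-21 20:01:30+00:00'], ...)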

Implementation#

from s3fs.core import S3FileSystem
import pandas as pd
import io
## counter-intuitive specifications! See:
# https://en.wikipedia.org/wiki/Tz_database#Area
# https://pvlib-python.readthedocs.io/en/stable/user_guide/timetimezones.html#fixed-offsets
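# In short: "Etc/GMT-2" denotes UTC+2, not UTC-2 (the sign is inverted).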


def process_dst(
    file_path: str, s3: S3FileSystem, time_zone="Etc/GMT-2", time_col_index=0
):
    """
    Process a `.csv` file containing the recorded time series of a tagged fish.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :param time_zone: The time zone corresponding to the GMT offset within the time stamps, defaults to "Etc/GMT-2".
    :type time_zone: str
    :param time_col_index: Index of the time column, defaults to 0.
    :type time_col_index: int
    :return: The processed DataFrame.
    :rtype: pd.DataFrame
    """
    with s3.open(file_path, "rb") as f:
        # assigns a new column "time" as the index
        df = (
            pd.read_csv(f)
            .assign(
                time=lambda df: pd.to_datetime(
                    df.iloc[:, time_col_index], dayfirst=True
                )
            )
            .set_index("time")
        )
        # localizes the naive time stamps to time_zone and converts them to UTC,
        # keeping only the pressure and temperature columns
        df = df.tz_localize(time_zone).tz_convert("UTC").iloc[:, 1:3]
        df.columns = ["pressure", "temperature"]
    return df


def process_tagging_event(file_path: str, s3: S3FileSystem, time_zone="Etc/GMT-2"):
    """
    Process a `.csv` file containing the tagging events (release and recapture) of a tagged fish.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :param time_zone: The time zone corresponding to the GMT offset within the time stamps, defaults to "Etc/GMT-2".
    :type time_zone: str
    :return: The processed DataFrame.
    :rtype: pd.DataFrame
    """
    # NOTE: the time stamps are assumed to be given as UTC+2.
    # TODO: accept a friendlier time zone (e.g., "Europe/Paris") and compute the GMT offset.
    with s3.open(file_path, "r") as f:
        lines = f.readlines()
    # strips the surrounding quotation marks and the tab characters from each line
    cleaned_lines = [line.strip().strip('"').replace("\t", "") for line in lines]

    # assigns a new column "time" as the index
    df = (
        pd.read_csv(io.StringIO("\n".join(cleaned_lines)))
        .assign(time=lambda df: pd.to_datetime(df["time"]))
        .set_index("time")
    )
    # discards the time zone parsed from the file (if any), relocalizes the time stamps to time_zone and converts them to UTC
    df = df.tz_convert(None).tz_localize(time_zone).tz_convert("UTC")
    df.columns = ["event_name", "longitude", "latitude"]
    return df


def process_metadata(file_path: str, s3: S3FileSystem):
    """
    Open a `.csv` file located in an S3 bucket with `pandas` as a `DataFrame`.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :return: A DataFrame.
    :rtype: pd.DataFrame
    """
    with s3.open(file_path, "rb") as f:
        df = pd.read_csv(f)
    return df


def process_acoustic_data(
    file_path: str, s3: S3FileSystem, time_zone="UTC", time_col_index=0
):
    """
    Process a `.csv` file containing a time series of point detections of a tagged fish by acoustic receivers.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :param time_zone: The time zone corresponding to the GMT offset within the time stamps, defaults to "UTC".
    :type time_zone: str
    :param time_col_index: Index of the time column, defaults to 0.
    :type time_col_index: int
    :return: The processed DataFrame.
    :rtype: pd.DataFrame
    """
    # TODO: confirm that **the acoustic data is given in UTC**.
    with s3.open(file_path, "rb") as f:
        # assigns a new column "time" as the index
        df = (
            pd.read_csv(f)
            .assign(
                time=lambda df: pd.to_datetime(
                    df.iloc[:, time_col_index], format="ISO8601"
                )
            )
            .set_index("time")
        )
        # localizes the naive time stamps to time_zone and converts them to UTC
        df = df.tz_localize(time_zone).tz_convert("UTC")
        df.drop(["date_time"], axis="columns", inplace=True)
    return df

Example#

First, let’s define the variables needed to access the S3 bucket.

import s3fs
import pandas as pd  # noqa: F811
import io  # noqa: F811

storage_options = {
    "anon": False,
    "profile": "gfts",
    "client_kwargs": {
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net/",
        "region_name": "gra",
    },
}
remote_dir = "gfts-ifremer/tag_data_demo/"
s3 = s3fs.S3FileSystem(**storage_options)

The raw files are located here:

s3.ls(remote_dir)
['gfts-ifremer/tag_data_demo/A123456.csv',
 'gfts-ifremer/tag_data_demo/acoustic.csv',
 'gfts-ifremer/tag_data_demo/metadata.csv',
 'gfts-ifremer/tag_data_demo/tagging_events_A123456.csv']
from pathlib import Path

local_dir = "tag_data_demo"
output_path = Path(local_dir)
output_path.mkdir(exist_ok=True)
time_zone = "Etc/GMT-2"
date_format = "%Y-%m-%dT%H:%M:%SZ"
device_id = "A123456"

Then, let’s process each file and store the results locally in the output directory (tag_data_demo):

acoustic_df = process_acoustic_data(remote_dir + "acoustic.csv", s3, time_zone="UTC")
acoustic_df.to_csv(output_path / "acoustic.csv", date_format=date_format)

event_tags_df = process_tagging_event(
    remote_dir + f"tagging_events_{device_id}.csv", s3, time_zone
)
event_tags_df.to_csv(output_path / "tagging_events.csv", date_format=date_format)


dst_df = process_dst(remote_dir + f"{device_id}.csv", s3, time_zone)
dst_df.to_csv(output_path / "dst.csv", date_format=date_format)


md_df = process_metadata(remote_dir + "metadata.csv", s3)
md_df.to_json(output_path / "metadata.json")

Biologging data#

dst_df.head(10)
                           pressure  temperature
time
2014-05-21 22:00:00+00:00  1.751477    17.514350
2014-05-21 22:01:30+00:00  1.477457    17.898020
2014-05-21 22:03:00+00:00  1.741089    19.238910
2014-05-21 22:04:30+00:00  1.833988    18.834639
2014-05-21 22:06:00+00:00  1.567610    18.458077
2014-05-21 22:07:30+00:00  1.207911    18.109961
2014-05-21 22:09:00+00:00  1.435166    17.014766
2014-05-21 22:10:30+00:00  1.869589    16.976774
2014-05-21 22:12:00+00:00  1.855380    16.861258
2014-05-21 22:13:30+00:00  1.995521    16.243335

Acoustic Detections#

acoustic_df.head(2)
                           deployment_id  deploy_longitude  deploy_latitude
time
2014-05-22 09:40:30+00:00             10           -2.6812          46.1433
2014-05-22 09:46:08+00:00             42            5.7369          47.6660

Tag Events DataFrame#

event_tags_df
                           event_name  longitude  latitude
time
2014-05-21 22:00:00+00:00     release     5.5369    47.966
2014-06-02 06:00:00+00:00  fish_death        NaN       NaN

Tag Information#

md_df
  pit_tag_number acoustic_tag_id scientific_name common_name       project
0        A123456          MAZ-42     Lorem ipsum        démo  how-to-guide
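Finally, as a quick sanity check (hypothetical, not part of the original workflow), the exported files can be read back to confirm that the time stamps parse as UTC:

import pandas as pd

# Re-reads the exported DST file; utc=True parses the trailing "Z" as UTC.
check = pd.read_csv("tag_data_demo/dst.csv")
check["time"] = pd.to_datetime(check["time"], utc=True)
print(check.set_index("time").index.tz)  # UTC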