Tag Data Preparation#
This tutorial aims to guide you through how we prepare raw tag data.
The main goal of this preparation is to ensure that the time stamps are expressed in UTC.
First, we will detail what we mean by "raw data". Then, we present what the processing consists of. Finally, we apply it to an example.
Raw Data Description#
We expect the raw data of a given tag `id` to be located in a specific folder in the GFTS bucket. The latter should contain 4 `.csv` files:

- `{id}.csv` contains the recorded data as a table with the columns Date Time Stamp, Pressure and Temp.
- `acoustic.csv` contains the acoustic detections of the fish (for instance, by stations). In case of no detection, the file is an empty table. Otherwise, we expect the columns date_time, deployment_id, deploy_longitude and deploy_latitude.
- `metadata.csv` contains information about the tag. It can be any tabular data.
- `tagging_events_{id}.csv` contains the times and positions of the release and recapture events. We expect a 2 by 4 tabular file with the columns event, time, longitude and latitude. The event column is the index, whose values describe the events, e.g., "release" and "fish death". The recapture information can contain NaN longitude and latitude values.
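As an illustration, here is how a minimal tagging events file could look once parsed with `pandas`. The contents below are hypothetical values made up for the sake of the example:

```python
import io

import pandas as pd

# Hypothetical contents of a tagging_events_{id}.csv file (illustrative values only).
raw = """event,time,longitude,latitude
release,2014-05-21 22:00:00,5.5369,47.966
fish_death,2014-06-02 06:00:00,,
"""

# The "event" column serves as the index; the recapture row may carry
# NaN coordinates when the position is unknown.
events = pd.read_csv(io.StringIO(raw), index_col="event")
print(events)
```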
Processing Description#
In the following, we present the 4 functions that process each file mentioned above. Here is a brief description of their objectives:

- For `{id}.csv`: the time stamps are converted to UTC and the columns are renamed to temperature and pressure.
- For `acoustic.csv`: the time stamps are converted to UTC.
- For `metadata.csv`: the file is loaded as a `DataFrame` and then exported as a `.json` file.
- For `tagging_events_{id}.csv`: the time stamps are converted to UTC and the columns are renamed to event_name, longitude and latitude.
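A word of caution about the `Etc/GMT±` time zones used below: their sign is inverted with respect to the usual UTC offset notation, so `"Etc/GMT-2"` actually denotes UTC+2. A quick sanity check with `pandas` (the time stamp is chosen for illustration):

```python
import pandas as pd

# "Etc/GMT-2" means UTC+02:00: the sign is inverted in the tz database.
ts = pd.Timestamp("2014-05-22 00:00").tz_localize("Etc/GMT-2")
print(ts.tz_convert("UTC"))  # 2014-05-21 22:00:00+00:00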
Implementation#
from s3fs.core import S3FileSystem
import pandas as pd
import io
## counter-intuitive specifications! See:
# https://en.wikipedia.org/wiki/Tz_database#Area
# https://pvlib-python.readthedocs.io/en/stable/user_guide/timetimezones.html#fixed-offsets
def process_dst(
    file_path: str, s3: S3FileSystem, time_zone="Etc/GMT-2", time_col_index=0
):
    """
    Process a `.csv` file containing the recorded time series of a tagged fish.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :param time_zone: The time zone corresponding to the GMT offset within the time stamps, defaults to "Etc/GMT-2".
    :type time_zone: str
    :param time_col_index: Index of the time column, defaults to 0.
    :type time_col_index: int
    :return: The processed DataFrame.
    :rtype: pd.DataFrame
    """
    with s3.open(file_path, "rb") as f:
        # assigns a new column "time" as the index
        df = (
            pd.read_csv(f)
            .assign(
                time=lambda df: pd.to_datetime(
                    df.iloc[:, time_col_index], dayfirst=True
                )
            )
            .set_index("time")
        )
    # localizes the time stamps to time_zone, converts them to UTC
    # and keeps the pressure and temperature columns
    df = df.tz_localize(time_zone).tz_convert("UTC").iloc[:, 1:3]
    df.columns = ["pressure", "temperature"]
    return df
def process_tagging_event(file_path: str, s3: S3FileSystem, time_zone="Etc/GMT-2"):
    """
    Process a `.csv` file containing the tagging events (release and recapture) of a tagged fish.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :param time_zone: The time zone corresponding to the GMT offset within the time stamps, defaults to "Etc/GMT-2".
    :type time_zone: str
    :return: The processed DataFrame.
    :rtype: pd.DataFrame
    """
    # NOTE: according to Mathieu, the times are already given as UTC+2.
    # TODO: input a more friendly time zone such as Europe/Paris and compute the GMT shift
    with s3.open(file_path, "r") as f:
        lines = f.readlines()
    # strips surrounding quotes and tab characters from each line
    cleaned_lines = []
    for line in lines:
        cleaned_line = line.strip().strip('"').replace("\t", "")
        cleaned_lines.append(cleaned_line)
    # assigns a new column "time" as the index
    df = (
        pd.read_csv(io.StringIO("\n".join(cleaned_lines)))
        .assign(time=lambda df: pd.to_datetime(df["time"]))
        .set_index("time")
    )
    # removes any time zone assumption with None, relocalizes to time_zone and converts to UTC
    df = df.tz_convert(None).tz_localize(time_zone).tz_convert("UTC")
    df.columns = ["event_name", "longitude", "latitude"]
    return df
def process_metadata(file_path: str, s3: S3FileSystem):
    """
    Open a `.csv` file located in an S3 bucket with `pandas` as a `DataFrame`.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :return: A DataFrame.
    :rtype: pd.DataFrame
    """
    with s3.open(file_path, "rb") as f:
        df = pd.read_csv(f)
    return df
def process_acoustic_data(
    file_path: str, s3: S3FileSystem, time_zone="UTC", time_col_index=0
):
    """
    Process a `.csv` file containing a time series of discrete detections of a tagged fish by acoustic receivers.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :param time_zone: The time zone corresponding to the GMT offset within the time stamps, defaults to "UTC".
    :type time_zone: str
    :param time_col_index: Index of the time column, defaults to 0.
    :type time_col_index: int
    :return: The processed DataFrame.
    :rtype: pd.DataFrame
    """
    # TODO: check again with Mathieu that **the acoustic data is assumed to be given in UTC**.
    with s3.open(file_path, "rb") as f:
        # assigns a new column "time" as the index
        df = (
            pd.read_csv(f)
            .assign(
                time=lambda df: pd.to_datetime(
                    df.iloc[:, time_col_index], format="ISO8601"
                )
            )
            .set_index("time")
        )
    # localizes the time stamps to time_zone and converts them to UTC
    df = df.tz_localize(time_zone).tz_convert("UTC")
    df.drop(["date_time"], axis="columns", inplace=True)
    return df
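Note that the functions above only access the bucket through `s3.open`, so any `fsspec` filesystem exposing an `open` method can stand in for the S3 bucket, e.g. an in-memory one for local testing. Below is a minimal sketch (file name and values are made up) that replays the steps of `process_dst` without the S3 dependency:

```python
import fsspec
import pandas as pd

fs = fsspec.filesystem("memory")

# writes a tiny DST-like file (illustrative values) into the in-memory filesystem
with fs.open("demo/A123456.csv", "w") as f:
    f.write("Date Time Stamp,Pressure,Temp\n21/05/2014 22:00:00,1.75,17.51\n")

# replays the same steps as process_dst: parse, localize to UTC+2, convert to UTC
with fs.open("demo/A123456.csv", "rb") as f:
    df = (
        pd.read_csv(f)
        .assign(time=lambda df: pd.to_datetime(df.iloc[:, 0], dayfirst=True))
        .set_index("time")
        .tz_localize("Etc/GMT-2")
        .tz_convert("UTC")
    )
print(df.index[0])  # 2014-05-21 20:00:00+00:00
```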
Example#
First, let’s define the variables needed to access the S3 bucket.
import s3fs
import pandas as pd # noqa: F811
import io # noqa: F811
storage_options = {
    "anon": False,
    "profile": "gfts",
    "client_kwargs": {
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net/",
        "region_name": "gra",
    },
}
remote_dir = "gfts-ifremer/tag_data_demo/"
s3 = s3fs.S3FileSystem(**storage_options)
The raw files are located here:
s3.ls(remote_dir)
['gfts-ifremer/tag_data_demo/A123456.csv',
'gfts-ifremer/tag_data_demo/acoustic.csv',
'gfts-ifremer/tag_data_demo/metadata.csv',
'gfts-ifremer/tag_data_demo/tagging_events_A123456.csv']
from pathlib import Path
local_dir = "tag_data_demo"
output_path = Path(local_dir)
output_path.mkdir(exist_ok=True)
time_zone = "Etc/GMT-2"
date_format = "%Y-%m-%dT%H:%M:%SZ"
device_id = "A123456"
Then, let’s process each file and store the results locally in the output directory (`tag_data_demo`).
acoustic_df = process_acoustic_data(remote_dir + "acoustic.csv", s3, time_zone="UTC")
acoustic_df.to_csv(output_path / "acoustic.csv", date_format=date_format)
event_tags_df = process_tagging_event(
remote_dir + f"tagging_events_{device_id}.csv", s3, time_zone
)
event_tags_df.to_csv(output_path / "tagging_events.csv", date_format=date_format)
dst_df = process_dst(remote_dir + f"{device_id}.csv", s3, time_zone)
dst_df.to_csv(output_path / "dst.csv", date_format=date_format)
md_df = process_metadata(remote_dir + "metadata.csv", s3)
md_df.to_json(output_path / "metadata.json")
Biologging data#
dst_df.head(10)
| time | pressure | temperature |
|---|---|---|
| 2014-05-21 22:00:00+00:00 | 1.751477 | 17.514350 |
| 2014-05-21 22:01:30+00:00 | 1.477457 | 17.898020 |
| 2014-05-21 22:03:00+00:00 | 1.741089 | 19.238910 |
| 2014-05-21 22:04:30+00:00 | 1.833988 | 18.834639 |
| 2014-05-21 22:06:00+00:00 | 1.567610 | 18.458077 |
| 2014-05-21 22:07:30+00:00 | 1.207911 | 18.109961 |
| 2014-05-21 22:09:00+00:00 | 1.435166 | 17.014766 |
| 2014-05-21 22:10:30+00:00 | 1.869589 | 16.976774 |
| 2014-05-21 22:12:00+00:00 | 1.855380 | 16.861258 |
| 2014-05-21 22:13:30+00:00 | 1.995521 | 16.243335 |
Acoustic Detections#
acoustic_df.head(2)
| time | deployment_id | deploy_longitude | deploy_latitude |
|---|---|---|---|
| 2014-05-22 09:40:30+00:00 | 10 | -2.6812 | 46.1433 |
| 2014-05-22 09:46:08+00:00 | 42 | 5.7369 | 47.6660 |
Tag Events DataFrame#
event_tags_df
| time | event_name | longitude | latitude |
|---|---|---|---|
| 2014-05-21 22:00:00+00:00 | release | 5.5369 | 47.966 |
| 2014-06-02 06:00:00+00:00 | fish_death | NaN | NaN |
Tag Information#
md_df
| | pit_tag_number | acoustic_tag_id | scientific_name | common_name | project |
|---|---|---|---|---|---|
| 0 | A123456 | MAZ-42 | Lorem ipsum | démo | how-to-guide |