Tag Data Preparation#
This tutorial aims to guide you through how we prepare raw tag data.
The main goal of this preparation is to ensure that the time stamps are expressed in UTC.
First, we will detail what we mean by "raw data". Then, we present what the processing consists of. Finally, we apply it to an example.
Raw Data Description#
We expect the raw data of a given tag `id` to be located in a specific folder in the GFTS bucket. The latter should contain 4 `.csv` files:

- `{id}.csv` contains the recorded data as a table with the columns Date Time Stamp, Pressure and Temp.
- `acoustic.csv` contains the acoustic detections of the fish (for instance, by stations). In case of no detection, the file is an empty table. Otherwise, we expect the columns date_time, deployment_id, deploy_longitude and deploy_latitude.
- `metadata.csv` contains information about the tag. It can be any tabular data.
- `tagging_events_{id}.csv` contains the times and positions of the release and recapture events. We expect a 2 by 4 tabular file with the columns event, time, longitude and latitude. The event column is the index, whose values describe the events, e.g., "release" and "fish death". The recapture information can contain NaN longitude and latitude values.
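As an illustration, here is how a minimal tagging events file could look once parsed with `pandas`. The contents below are hypothetical values made up for the sake of the example:

```python
import io

import pandas as pd

# Hypothetical contents of a tagging_events_{id}.csv file (illustrative values only).
raw = """event,time,longitude,latitude
release,2014-05-21 22:00:00,5.5369,47.966
fish_death,2014-06-02 06:00:00,,
"""

# The "event" column serves as the index; the recapture row may carry
# NaN coordinates when the position is unknown.
events = pd.read_csv(io.StringIO(raw), index_col="event")
print(events)
```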
Processing Description#
In the following, we present the 4 functions that process each file mentioned above. Here is a brief description of their objectives:

- For `{id}.csv`: the time stamps are converted to UTC and the columns are renamed to temperature and pressure.
- For `acoustic.csv`: the time stamps are converted to UTC.
- For `metadata.csv`: the file is loaded as a `DataFrame` and then exported as a `.json` file.
- For `tagging_events_{id}.csv`: the time stamps are converted to UTC and the columns are renamed to event_name, longitude and latitude.
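A word of caution about the `Etc/GMT±` time zones used below: their sign is inverted with respect to the usual UTC offset notation, so `"Etc/GMT-2"` actually denotes UTC+2. A quick sanity check with `pandas` (the time stamp is chosen for illustration):

```python
import pandas as pd

# "Etc/GMT-2" means UTC+02:00: the sign is inverted in the tz database.
ts = pd.Timestamp("2014-05-22 00:00").tz_localize("Etc/GMT-2")
print(ts.tz_convert("UTC"))  # 2014-05-21 22:00:00+00:00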
Implementation#
from s3fs.core import S3FileSystem
import pandas as pd
import io
## counter-intuitive specifications! See:
# https://en.wikipedia.org/wiki/Tz_database#Area
# https://pvlib-python.readthedocs.io/en/stable/user_guide/timetimezones.html#fixed-offsets
def process_dst(
    file_path: str, s3: S3FileSystem, time_zone="Etc/GMT-2", time_col_index=0
):
    """
    Process a `.csv` file containing the recorded time series of a tagged fish.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :param time_zone: The time zone corresponding to the GMT offset within the time stamps, defaults to "Etc/GMT-2".
    :type time_zone: str
    :param time_col_index: Index of the time column, defaults to 0.
    :type time_col_index: int
    :return: The processed DataFrame.
    :rtype: pd.DataFrame
    """
    with s3.open(file_path, "rb") as f:
        # assigns a new column "time" as the index
        df = (
            pd.read_csv(f)
            .assign(
                time=lambda df: pd.to_datetime(
                    df.iloc[:, time_col_index], dayfirst=True
                )
            )
            .set_index("time")
        )
    # localizes the time stamps to time_zone, converts them to UTC
    # and keeps the pressure and temperature columns
    df = df.tz_localize(time_zone).tz_convert("UTC").iloc[:, 1:3]
    df.columns = ["pressure", "temperature"]
    return df
def process_tagging_event(file_path: str, s3: S3FileSystem, time_zone="Etc/GMT-2"):
    """
    Process a `.csv` file containing the tagging events (release and recapture) of a tagged fish.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :param time_zone: The time zone corresponding to the GMT offset within the time stamps, defaults to "Etc/GMT-2".
    :type time_zone: str
    :return: The processed DataFrame.
    :rtype: pd.DataFrame
    """
    # NOTE: according to Mathieu, the times are already given as UTC+2.
    # TODO: input a more friendly time zone such as Europe/Paris and compute the GMT shift
    with s3.open(file_path, "r") as f:
        lines = f.readlines()
    # strips surrounding quotes and tab characters from each line
    cleaned_lines = []
    for line in lines:
        cleaned_line = line.strip().strip('"').replace("\t", "")
        cleaned_lines.append(cleaned_line)
    # assigns a new column "time" as the index
    df = (
        pd.read_csv(io.StringIO("\n".join(cleaned_lines)))
        .assign(time=lambda df: pd.to_datetime(df["time"]))
        .set_index("time")
    )
    # removes any time zone assumption with None, relocalizes to time_zone and converts to UTC
    df = df.tz_convert(None).tz_localize(time_zone).tz_convert("UTC")
    df.columns = ["event_name", "longitude", "latitude"]
    return df
def process_metadata(file_path: str, s3: S3FileSystem):
    """
    Open a `.csv` file located in an S3 bucket with `pandas` as a `DataFrame`.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :return: A DataFrame.
    :rtype: pd.DataFrame
    """
    with s3.open(file_path, "rb") as f:
        df = pd.read_csv(f)
    return df
def process_acoustic_data(
    file_path: str, s3: S3FileSystem, time_zone="UTC", time_col_index=0
):
    """
    Process a `.csv` file containing a time series of discrete detections of a tagged fish by acoustic receivers.

    :param file_path: Path of the file.
    :type file_path: str
    :param s3: The file system of the S3 bucket.
    :type s3: S3FileSystem
    :param time_zone: The time zone corresponding to the GMT offset within the time stamps, defaults to "UTC".
    :type time_zone: str
    :param time_col_index: Index of the time column, defaults to 0.
    :type time_col_index: int
    :return: The processed DataFrame.
    :rtype: pd.DataFrame
    """
    # TODO: check again with Mathieu that **the acoustic data is assumed to be given in UTC**.
    with s3.open(file_path, "rb") as f:
        # assigns a new column "time" as the index
        df = (
            pd.read_csv(f)
            .assign(
                time=lambda df: pd.to_datetime(
                    df.iloc[:, time_col_index], format="ISO8601"
                )
            )
            .set_index("time")
        )
    # localizes the time stamps to time_zone and converts them to UTC
    df = df.tz_localize(time_zone).tz_convert("UTC")
    df.drop(["date_time"], axis="columns", inplace=True)
    return df
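Note that the functions above only access the bucket through `s3.open`, so any `fsspec` filesystem exposing an `open` method can stand in for the S3 bucket, e.g. an in-memory one for local testing. Below is a minimal sketch (file name and values are made up) that replays the steps of `process_dst` without the S3 dependency:

```python
import fsspec
import pandas as pd

fs = fsspec.filesystem("memory")

# writes a tiny DST-like file (illustrative values) into the in-memory filesystem
with fs.open("demo/A123456.csv", "w") as f:
    f.write("Date Time Stamp,Pressure,Temp\n21/05/2014 22:00:00,1.75,17.51\n")

# replays the same steps as process_dst: parse, localize to UTC+2, convert to UTC
with fs.open("demo/A123456.csv", "rb") as f:
    df = (
        pd.read_csv(f)
        .assign(time=lambda df: pd.to_datetime(df.iloc[:, 0], dayfirst=True))
        .set_index("time")
        .tz_localize("Etc/GMT-2")
        .tz_convert("UTC")
    )
print(df.index[0])  # 2014-05-21 20:00:00+00:00
```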
Example#
First, let’s define the variables needed to access the S3 bucket.
import s3fs
import pandas as pd # noqa: F811
import io # noqa: F811
storage_options = {
    "anon": False,
    "profile": "gfts",
    "client_kwargs": {
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net/",
        "region_name": "gra",
    },
}
remote_dir = "gfts-ifremer/tag_data_demo/"
s3 = s3fs.S3FileSystem(**storage_options)
The raw files are located here:
s3.ls(remote_dir)
['gfts-ifremer/tag_data_demo/A123456.csv',
'gfts-ifremer/tag_data_demo/acoustic.csv',
'gfts-ifremer/tag_data_demo/metadata.csv',
'gfts-ifremer/tag_data_demo/tagging_events_A123456.csv']
from pathlib import Path
local_dir = "tag_data_demo"
output_path = Path(local_dir)
output_path.mkdir(exist_ok=True)
time_zone = "Etc/GMT-2"
date_format = "%Y-%m-%dT%H:%M:%SZ"
device_id = "A123456"
Then, let’s process each file and store the results locally in the output directory (`tag_data_demo`).
acoustic_df = process_acoustic_data(remote_dir + "acoustic.csv", s3, time_zone="UTC")
acoustic_df.to_csv(output_path / "acoustic.csv", date_format=date_format)
event_tags_df = process_tagging_event(
remote_dir + f"tagging_events_{device_id}.csv", s3, time_zone
)
event_tags_df.to_csv(output_path / "tagging_events.csv", date_format=date_format)
dst_df = process_dst(remote_dir + f"{device_id}.csv", s3, time_zone)
dst_df.to_csv(output_path / "dst.csv", date_format=date_format)
md_df = process_metadata(remote_dir + "metadata.csv", s3)
md_df.to_json(output_path / "metadata.json")
Biologging data#
dst_df.head(10)
| time | pressure | temperature |
|---|---|---|
| 2014-05-21 22:00:00+00:00 | 1.751477 | 17.514350 |
| 2014-05-21 22:01:30+00:00 | 1.477457 | 17.898020 |
| 2014-05-21 22:03:00+00:00 | 1.741089 | 19.238910 |
| 2014-05-21 22:04:30+00:00 | 1.833988 | 18.834639 |
| 2014-05-21 22:06:00+00:00 | 1.567610 | 18.458077 |
| 2014-05-21 22:07:30+00:00 | 1.207911 | 18.109961 |
| 2014-05-21 22:09:00+00:00 | 1.435166 | 17.014766 |
| 2014-05-21 22:10:30+00:00 | 1.869589 | 16.976774 |
| 2014-05-21 22:12:00+00:00 | 1.855380 | 16.861258 |
| 2014-05-21 22:13:30+00:00 | 1.995521 | 16.243335 |
Acoustic Detections#
acoustic_df.head(2)
| time | deployment_id | deploy_longitude | deploy_latitude |
|---|---|---|---|
| 2014-05-22 09:40:30+00:00 | 10 | -2.6812 | 46.1433 |
| 2014-05-22 09:46:08+00:00 | 42 | 5.7369 | 47.6660 |
Tag Events DataFrame#
event_tags_df
| time | event_name | longitude | latitude |
|---|---|---|---|
| 2014-05-21 22:00:00+00:00 | release | 5.5369 | 47.966 |
| 2014-06-02 06:00:00+00:00 | fish_death | NaN | NaN |
Tag Information#
md_df
| | pit_tag_number | acoustic_tag_id | scientific_name | common_name | project |
|---|---|---|---|---|---|
| 0 | A123456 | MAZ-42 | Lorem ipsum | démo | how-to-guide |