Loading¶
In order to load the data from its sources (provider CSV or NetCDF files, or already processed files), one can use DataSource objects. A data source is initialized with all the necessary information on providers, file locations and variables; calling the load_all
method is then enough to load the data and return a storer.
Loading from provider data¶
When loading from a provider, at least the following arguments must be given to the DataSource (a sketch of such a configuration is shown after this list):
- The name of the provider: provider_name
- The format of the data: data_format
- The directory containing the files to load: dirin
- The category of the data ('float' or 'in_situ'): data_category
- The pattern of the names of the files to read (a FileNamePattern): files_pattern
- The SourceVariableSet object containing all variables: variable_ensemble
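Putting these arguments together, a minimal DataSource configuration could look like the sketch below. The exact import location, as well as the pattern and variables objects (a pre-set FileNamePattern and a pre-set SourceVariableSet), are assumptions made for illustration.
import bgc_data_processing as bgc_dp

dsource = bgc_dp.DataSource(  # hypothetical import location for DataSource
    provider_name="GLODAP_2022",
    data_format="csv",  # illustrative value for the data format
    dirin="path/to/provider/files",
    data_category="in_situ",
    files_pattern=pattern,  # pre-set FileNamePattern (assumed to exist)
    variable_ensemble=variables,  # pre-set SourceVariableSet (assumed to exist)
)
The resulting dsource object is the one used later on this page to call load_all and retrieve a storer.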
Under the hood, these DataSource objects instantiate loaders, all deriving from BaseLoader. All loaders are file-format specific, meaning that at least one specific subclass of BaseLoader exists for every supported data format.
The loader for CSV files uses the additional read_params argument to pass specific arguments to pandas.read_csv:
from bgc_data_processing.core.loaders.csv_loaders import CSVLoader
loader = CSVLoader(
provider_name="GLODAP_2022",
category="in_situ",
exclude=[],
variables=variables, # (1)
read_params={
"low_memory": False,
"index_col": False,
"na_values": -9999,
},
)
- Pre-set SourceVariableSet object
The loader for NetCDF files works in the same way:
from bgc_data_processing.loaders.netcdf_loaders import NetCDFLoader
loader = NetCDFLoader(
provider_name="ARGO",
category="float",
exclude=[],
variables=variables, # (1)
)
- Pre-set SourceVariableSet object
The loader for AB files uses the additional grid_basename argument to get the regional grid for the AB files. AB files are handled using a library developed by the Nansen Environmental and Remote Sensing Center. This library is included in the src folder and therefore will not be improved or impacted by any change that could occur in the original repository.
from bgc_data_processing.loaders.abfile_loaders import ABFileLoader
loader = ABFileLoader(
provider_name="HYCOM",
category="float",
exclude=[],
variables=variables, # (1)
grid_basename="regional.grid"
)
- Pre-set SourceVariableSet object
Once the data source is set, it is simple to get the corresponding storer:
storer = dsource.load_all()
Loading from already processed files¶
It is also possible to load data from previously saved files using the read_files
function:
from pathlib import Path
import bgc_data_processing as bgc_dp
storer = bgc_dp.io.read_files(
filepath = [Path("file1.txt"), Path("file2.txt")], # (1)!
providers_column_label = "PROVIDER", # (2)!
expocode_column_label = "EXPOCODE", # (3)!
date_column_label = "DATE", # (4)!
year_column_label = "YEAR", # (5)!
month_column_label = "MONTH", # (6)!
day_column_label = "DAY", # (7)!
hour_column_label = "HOUR", # (8)!
latitude_column_label = "LATITUDE", # (9)!
longitude_column_label = "LONGITUDE", # (10)!
depth_column_label = "DEPH", # (11)!
category = "in_situ", # (12)!
unit_row_index = 1, # (13)!
delim_whitespace = True, # (14)!
)
- List of the filepaths of the files to load
- Name of the column containing the provider information in all files
- Name of the column containing the expocode information in all files
- Name of the column containing the date information in all files
- Name of the column containing the year information in all files
- Name of the column containing the month information in all files
- Name of the column containing the day information in all files
- Name of the column containing the hour information in all files
- Name of the column containing the latitude information in all files
- Name of the column containing the longitude information in all files
- Name of the column containing the depth information in all files
- Category of all files (otherwise they could not be aggregated in a single storer)
- Index of the unit row
- Whether to delimit values based on whitespace
Storers¶
Once the data is in a Storer, it is easy to save it to a file using the save_storer
function:
from pathlib import Path

import bgc_data_processing as bgc_dp

bgc_dp.io.save_storer(storer, Path("filepath/to/save/in.txt"))
It is also possible to slice the DataFrame based on the dates of the rows using the .slice_on_dates
method. This returns a Slice object, a child class of Storer which only stores the indexes of the DataFrame slice and not the slice itself (to reduce the amount of memory used):
import pandas as pd
import datetime as dt
drng = pd.Series(
{
"start_date": dt.datetime(2000,1,1),
"end_date": dt.datetime(2020,1,1),
}
)
slicer = storer.slice_on_dates(drng) # (1)
- storer is a pre-set Storer object.
Slice objects can be saved in the same way as any Storer:
from pathlib import Path

import bgc_data_processing as bgc_dp

bgc_dp.io.save_storer(slicer, Path("filepath/to/save/in.txt"))