Variables¶
Variable objects store the metadata of data variables.
Among the information they hold are:
- variable name in the file: alias
- variable name to use to manipulate the object: name
- name to write in the dataframe: label
- unit name to display in the dataframe: unit
- data type to use for this particular data: var_type
Additionally, the object also contains information on the transformations to apply to the data:
- correction functions to apply to the column (for example, to change its unit)
- flag information used to keep only the data with a 'good' flag
Different types of variables exist:
TemplateVar: pre-created variable which can then be turned into an ExistingVar or a NotExistingVar, depending on the variables present in the dataset.
import bgc_data_processing as bgc_dp
template = bgc_dp.variables.TemplateVar(
name = "LATITUDE",
unit = "[deg_N]",
var_type = float,
name_format = "%-12s",
value_format = "%12.6f",
)
Usecase of TemplateVar
When loading data from different sources, it is recommended to use a TemplateVar to define all variables once, and then properly instantiate the variable for each source using the .not_in_file and .in_file_as methods.
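As an illustration of that workflow, here is a minimal sketch reusing the LATITUDE template above; the column alias and the two hypothetical sources are made up for the example:
import bgc_data_processing as bgc_dp

# Template defined once, shared by every data source.
template_lat = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)

# Source A exposes the variable under the column 'Latitude' (no flag filtering).
lat_source_a = template_lat.in_file_as(("Latitude", None, None))
# Source B does not provide the variable at all.
lat_source_b = template_lat.not_in_file()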
NotExistingVar: variable which is known not to exist in the dataset. If needed, the corresponding column in the dataframe can be filled later; otherwise, it will remain as NaN.
They can be created from a TemplateVar (recommended):
import bgc_data_processing as bgc_dp
template = bgc_dp.variables.TemplateVar(
name = "LATITUDE",
unit = "[deg_N]",
var_type = float,
name_format = "%-12s",
value_format = "%12.6f",
)
notexisting = template.not_in_file()
or they can be created from scratch:
import bgc_data_processing as bgc_dp
notexisting = bgc_dp.variables.NotExistingVar(
name = "LATITUDE",
unit = "[deg_N]",
var_type = float,
name_format = "%-12s",
value_format = "%12.6f",
)
ExistingVar: variable which is expected to be found in the dataset under a certain alias. These objects also come with methods to define correction functions and flag filtering options.
To use these variables properly, one must define the aliases (the names of the variable in the dataset) for the variable. A variable can be given any number of aliases, but the order of the aliases is important since it defines their relative priority (the first having the highest priority). When loading the dataset, the first alias found will be used to load the variable from the dataset.
They can be created from a TemplateVar (recommended):
import bgc_data_processing as bgc_dp
template = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
existing = template.in_file_as(
    ("latitude", "latitude_flag", [1]), # (1)
    ("latitude2", None, None), # (2)
)
- Use column "latitude" from source, only keep rows where the flag column (name "latitude_flag") value is 1.
- No flag filtering for the second alias.
or they can be created from scratch:
import bgc_data_processing as bgc_dp
existing = bgc_dp.variables.ExistingVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
).in_file_as(
    ("latitude", "latitude_flag", [1]),
    ("latitude2", None, None),
)
ParsedVar: variable partially reconstructed from a csv file saved with a StorerSaver. They can be created from scratch, but it is usually pointless to use them manually.
FeatureVar: variable which results from a feature. A feature variable is made out of operations over other variables. For example, the CPHL (chlorophyll) variable can be built from DIAC (diatoms) and FLAC (flagellates):
import numpy as np
import bgc_data_processing as bgc_dp
feature_var = bgc_dp.variables.FeatureVar(
    feature = bgc_dp.features.ChlorophyllFromDiatomFlagellate(
        diatom_variable=DIATOM_VAR, # (1)!
        flagellate_variable=FLAGELLATE_VAR, # (2)!
        var_name = "CPHL",
        var_unit = "[mg/m3]",
        var_type = float,
        var_default = np.nan,
        var_name_format = "%-10s",
        var_value_format = "%10.3f",
    )
)
- Pre-defined ExistingVar for diatom concentration.
- Pre-defined ExistingVar for flagellate concentration.
Calling the feature's is_loadable on a list of input variables returns True if the list contains all the variables necessary to create the feature. Then, using the insert_in_storer of the FeatureVar.feature property makes it possible to insert the FeatureVar into a storer containing all the required variables.
Note that no variable is created by the DataSource. For example, if the 'DATE' variable is required in the loader's routine, then the variable must exist in the SourceVariableSet provided when initializing the object.
Corrections¶
It is possible to specify correction functions to apply to an ExistingVar in order to perform minor corrections. This can be done using the .correct_with method. The function given to the method is then applied to the column once the data is loaded.
import bgc_data_processing as bgc_dp
template = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
existing = template.in_file_as(
    ("latitude", "latitude_flag", [1]),
    ("latitude2", None, None),
).correct_with(
    lambda x: 2 * x # (1)
)
- Correction function that doubles the value of the variable in all rows.
Removing rows when variables are NaN¶
It is possible to configure ExistingVar and NotExistingVar objects so that rows are removed when the variable is NaN, or when a specific group of variables are all NaN.
The first behaviour is obtained with the .remove_when_nan method: whenever the value associated with the variable returned by this method is NaN, the row is deleted.
import bgc_data_processing as bgc_dp
template = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
existing = template.in_file_as(
    ("latitude", "latitude_flag", [1]),
    ("latitude2", None, None),
).remove_when_nan() # (1)
- If the latitude value is NaN, the row is dropped.
The second behaviour is obtained with the .remove_when_all_nan method: a row is deleted only when all the variables flagged with this method are NaN.
import bgc_data_processing as bgc_dp
template_lat = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
template_lon = bgc_dp.variables.TemplateVar(
    name = "LONGITUDE",
    unit = "[deg_E]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
existing_lat = template_lat.in_file_as(
    ("latitude", "latitude_flag", [1]),
).remove_when_all_nan() # (1)
existing_lon = template_lon.in_file_as(
    ("longitude", "longitude_flag", [1]),
).remove_when_all_nan() # (2)
- If both the latitude and longitude values are NaN, the row is dropped.
- If both the latitude and longitude values are NaN, the row is dropped.
Variables Sets¶
All variables can then be stored in a VariableSet object so that loaders can easily interact with them.
from bgc_data_processing.core.variables.vars import TemplateVar
from bgc_data_processing.core.variables.sets import VariableSet
template_lat = TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
template_lon = TemplateVar(
    name = "LONGITUDE",
    unit = "[deg_E]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
existing_lat = template_lat.in_file_as(
    ("latitude", "latitude_flag", [1]),
)
existing_lon = template_lon.in_file_as(
    ("longitude", "longitude_flag", [1]),
)
variables_set = VariableSet(
    latitude=existing_lat,
    longitude=existing_lon,
)
Default variables¶
By default, some variables are already defined in config/variables.toml (from config/default/variables.toml) as TemplateVar. These variables are the most common ones for this project, and the templates can be used to instantiate an ExistingVar or a NotExistingVar depending on the source dataset.
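As a sketch, assuming the parsed templates are exposed as a mapping of TemplateVar objects (the DEFAULT_TEMPLATES name and its keys are hypothetical; refer to the configuration-loading code for the actual access point):
# Hypothetical mapping holding the templates parsed from config/variables.toml;
# the real name and keys may differ.
latitude_template = DEFAULT_TEMPLATES["latitude"]

# Instantiate it as an ExistingVar for a source providing the column...
latitude_var = latitude_template.in_file_as(("Latitude", None, None))
# ...or as a NotExistingVar for a source that does not provide it.
provider_var = DEFAULT_TEMPLATES["provider"].not_in_file()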
One variable definition example can be found here:
# Lines starting with '#? ' are used to verify variables' types
# Type hints lines are structured the following way:
# Variable keys: possible types: additional comment
[provider]
#? provider.NAME: str: variable name
NAME = "PROVIDER"
#? provider.UNIT: str: variable unit
UNIT = "[]"
#? provider.TYPE: str: variable type (among ['int', 'float', 'str', 'datetime64[ns]'])
TYPE = "str"
#? provider.DEFAULT: str | int | float: default variable value if nan or not existing
DEFAULT = nan
To add a new variable, one simply has to add a new set of rows following the pattern of the already defined variables, creating for example the variable var1:
[var1]
#? var1.NAME: str: variable name
NAME="VAR1"
#? var1.UNIT: str: variable unit
UNIT="[]"
#? var1.TYPE: str: variable type (among ['int', 'float', 'str', 'datetime64[ns]'])
TYPE="str"
#? var1.DEFAULT: str | int | float: default variable value if nan or not existing
DEFAULT=nan
#? var1.NAME_FORMAT: str: format to use to save the name and unit of the variable
NAME_FORMAT="%-15s"
#? var1.VALUE_FORMAT: str: format to use to save the values of the variable
VALUE_FORMAT="%15s"
The lines starting with #? allow type hinting for the variables, ensuring that the correct value type is provided.