Variables

Variable objects store the metadata of data variables.
Each object holds the following information:

variable name in file: alias
variable name to use to manipulate the object: name
name to write in the dataframe: label
unit name to display in the dataframe: unit
the data type to use for this particular data: var_type

In addition, the object holds the transformations to apply to the data:

correction functions to apply to the column (for example, to change its unit)
flag information used to keep only the data with a 'good' flag

Several types of variables exist:

TemplateVar
NotExistingVar
ExistingVar
ParsedVar
FeatureVar

A TemplateVar is a pre-created variable which can then be turned into an ExistingVar or a NotExistingVar depending on the variables in the dataset.

import bgc_data_processing as bgc_dp

template = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)

Use case of TemplateVar

When loading data from different sources, it is recommended to use TemplateVar to define all variables and then instantiate each variable for each source using the .not_in_file and .in_file_as methods.
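The name_format and value_format strings used in these examples look like printf-style format strings. The sketch below only illustrates what such strings produce with Python's % operator; that the library applies them this way is an assumption, not something taken from its code.

```python
# Assumption: name_format and value_format are applied with Python's % operator.
name_format = "%-12s"    # left-aligned text, padded to 12 characters
value_format = "%12.6f"  # right-aligned float, 12 characters wide, 6 decimals

header = name_format % "LATITUDE"
unit = name_format % "[deg_N]"
value = value_format % 45.5

print(repr(header))  # 'LATITUDE    '
print(repr(unit))    # '[deg_N]     '
print(repr(value))   # '   45.500000'
```

Using the same width for the name and value formats keeps the header and the values aligned in a fixed-width text output.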

A NotExistingVar is a variable known not to exist in the dataset. The corresponding column in the dataframe can be filled later if needed; otherwise it remains NaN.

They can be created from a TemplateVar (recommended):

import bgc_data_processing as bgc_dp

template = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
notexisting = template.not_in_file()

or they can be created from scratch:

import bgc_data_processing as bgc_dp

notexisting = bgc_dp.variables.NotExistingVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)

An ExistingVar is a variable expected to be found in the dataset under a certain alias. These objects also come with methods to define correction functions and flag filtering options.
To use these variables properly, one must define the aliases (the names of the variable in the dataset) for the variable. Any number of aliases can be given, but their order is important since it defines their relative priority (the first one has the highest priority). When loading the dataset, the first alias found is used to load the variable.

They can be created from a TemplateVar (recommended):

import bgc_data_processing as bgc_dp

template = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
existing = template.in_file_as(
    ("latitude","latitude_flag", [1]),  # (1)
    ("latitude2",None,None),            # (2)
)

(1) Use column "latitude" from the source and only keep rows where the value of the flag column (named "latitude_flag") is 1.
(2) No flag filtering for the second alias.

or they can be created from scratch:

import bgc_data_processing as bgc_dp

existing = bgc_dp.variables.ExistingVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
).in_file_as(
    ("latitude","latitude_flag", [1]),
    ("latitude2",None,None),
)
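The alias priority and flag filtering described for ExistingVar can be sketched in plain Python. This is a minimal illustration, not the library's implementation: pick_alias and load_column are hypothetical helpers, and rows are modelled as simple dictionaries rather than a dataframe.

```python
from typing import Any, Dict, List, Optional, Set, Tuple

# Hypothetical alias triplet: (column name, flag column name, accepted flag values).
Alias = Tuple[str, Optional[str], Optional[List[Any]]]


def pick_alias(aliases: List[Alias], columns: Set[str]) -> Optional[Alias]:
    """Return the first alias whose column exists in the dataset, else None."""
    for alias in aliases:
        if alias[0] in columns:
            return alias
    return None


def load_column(rows: List[Dict[str, Any]], aliases: List[Alias]) -> List[Any]:
    """Read the variable from the highest-priority alias, applying its flag filter."""
    columns = set(rows[0]) if rows else set()
    alias = pick_alias(aliases, columns)
    if alias is None:
        return []
    name, flag_name, good_flags = alias
    if flag_name is None or good_flags is None:
        # No flag filtering for this alias: keep every row.
        return [row[name] for row in rows]
    # Keep only the rows whose flag value is accepted.
    return [row[name] for row in rows if row[flag_name] in good_flags]


aliases = [("latitude", "latitude_flag", [1]), ("latitude2", None, None)]
rows = [
    {"latitude": 10.0, "latitude_flag": 1, "latitude2": 10.1},
    {"latitude": 20.0, "latitude_flag": 4, "latitude2": 20.1},
]
values = load_column(rows, aliases)                  # "latitude" wins, flag 4 is rejected
fallback = load_column([{"latitude2": 5.0}], aliases)  # only the second alias exists
```

In this sketch the second alias is only consulted when the first one is absent from the dataset, which mirrors the priority rule stated above.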

A ParsedVar is a variable partially reconstructed from a CSV file saved with a StorerSaver.

They can be created from scratch, but there is usually no need to use them manually.

A FeatureVar is a variable that results from a feature: it is built from operations over other variables.

For example, the CPHL (chlorophyll) variable is made from DIAC (diatoms) and FLAC (flagellates):

import numpy as np

import bgc_data_processing as bgc_dp

feature_var = bgc_dp.variables.FeatureVar(
    feature = bgc_dp.features.ChlorophyllFromDiatomFlagellate(
        diatom_variable=DIATOM_VAR,                             # (1)
        flagellate_variable=FLAGELLATE_VAR,                     # (2)
        var_name = "CPHL",
        var_unit = "[mg/m3]",
        var_type = float,
        var_default = np.nan,
        var_name_format = "%-10s",
        var_value_format = "%10.3f",
    )
)

(1) Pre-defined ExistingVar for diatom concentration
(2) Pre-defined ExistingVar for flagellate concentration

Calling is_loadable on the feature returns True if the input list of variables contains all the variables necessary to create the feature.

Then, the insert_in_storer method of the FeatureVar.feature property makes it possible to insert the FeatureVar into a storer containing all the required variables.

Note that no variable is created by the DataSource. For example, if the 'DATE' variable is required in the loader's routine, then the variable must exist in the SourceVariableSet provided when initializing the object.
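The loadability check described above amounts to testing that every required variable is available. The sketch below works on variable names only; the is_loadable function shown here is a hypothetical stand-in, and the library's own method may take variable objects and have a different signature.

```python
from typing import Iterable


def is_loadable(required: Iterable[str], available: Iterable[str]) -> bool:
    """Return True if every required variable name is among the available ones."""
    return set(required).issubset(set(available))


# Names taken from the CPHL example: the feature needs DIAC and FLAC.
required = ["DIAC", "FLAC"]
complete = is_loadable(required, ["DATE", "DIAC", "FLAC"])
incomplete = is_loadable(required, ["DATE", "DIAC"])
```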

Corrections

Correction functions can be attached to an ExistingVar to apply minor corrections to its data. This is done with the .correct_with method: the function given to the method is applied to the column once the data is loaded.

import bgc_data_processing as bgc_dp

template = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
existing = template.in_file_as(
    ("latitude","latitude_flag", [1]),
    ("latitude2",None,None),
).correct_with(
    lambda x : 2*x                      # (1)
)

(1) Correction function that doubles the value of the variable in every row.
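To make the effect of a correction function concrete, here is a minimal sketch of applying one to a loaded column. The apply_correction helper is hypothetical and works on a plain list; how the library actually applies the function to the dataframe column is not shown in this page.

```python
from typing import Callable, List


def apply_correction(
    values: List[float],
    correction: Callable[[float], float],
) -> List[float]:
    """Apply the correction function to every value of a column."""
    return [correction(value) for value in values]


# Same correction as in the example above: double every value.
doubled = apply_correction([1.0, 2.5], lambda x: 2 * x)

# A unit change, the use case mentioned earlier: kilometres to metres.
in_meters = apply_correction([0.5, 2.0], lambda km: km * 1000)
```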

Removing rows when variables are NaN

Settings can be specified on ExistingVar and NotExistingVar objects to remove the rows where the variable is NaN, or where a group of specific variables are all NaN.

When a particular variable is NaN
When many variables are NaN

To drop the rows where a particular variable is NaN, use the .remove_when_nan method. Any row where the variable returned by this method is NaN is then deleted.

import bgc_data_processing as bgc_dp

template = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
existing = template.in_file_as(
    ("latitude","latitude_flag", [1]),
    ("latitude2",None,None),
).remove_when_nan()                     # (1)

(1) If the latitude value is NaN, the row is dropped.

To drop the rows where several variables are all NaN, use the .remove_when_all_nan method on each of them. A row is then deleted only when all the variables returned by this method are NaN.

import bgc_data_processing as bgc_dp

template_lat = bgc_dp.variables.TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
template_lon = bgc_dp.variables.TemplateVar(
    name = "LONGITUDE",
    unit = "[deg_E]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
existing_lat = template_lat.in_file_as(
    ("latitude","latitude_flag", [1])
).remove_when_all_nan()                     # (1)
existing_lon = template_lon.in_file_as(
    ("longitude","longitude_flag", [1])
).remove_when_all_nan()                     # (2)

(1) If both latitude and longitude values are NaN, the row is dropped.
(2) If both latitude and longitude values are NaN, the row is dropped.
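The difference between the two settings can be sketched without the library. In the hypothetical drop_rows helper below, any_nan lists the variables set up with .remove_when_nan and all_nan lists those set up with .remove_when_all_nan; the column names and the dictionary-based rows are assumptions made for the illustration.

```python
import math
from typing import Dict, List


def drop_rows(
    rows: List[Dict[str, float]],
    any_nan: List[str],
    all_nan: List[str],
) -> List[Dict[str, float]]:
    """Drop rows where one 'any_nan' variable is NaN or all 'all_nan' variables are NaN."""
    kept = []
    for row in rows:
        # remove_when_nan behaviour: a single NaN is enough to drop the row.
        if any(math.isnan(row[name]) for name in any_nan):
            continue
        # remove_when_all_nan behaviour: every listed variable must be NaN.
        if all_nan and all(math.isnan(row[name]) for name in all_nan):
            continue
        kept.append(row)
    return kept


nan = float("nan")
rows = [
    {"LATITUDE": 1.0, "LONGITUDE": 2.0, "DEPTH": 5.0},  # kept
    {"LATITUDE": nan, "LONGITUDE": 2.0, "DEPTH": 5.0},  # kept: longitude is known
    {"LATITUDE": nan, "LONGITUDE": nan, "DEPTH": 5.0},  # dropped: both are NaN
    {"LATITUDE": 1.0, "LONGITUDE": 2.0, "DEPTH": nan},  # dropped: depth is NaN
]
kept = drop_rows(rows, any_nan=["DEPTH"], all_nan=["LATITUDE", "LONGITUDE"])
```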

Variable Sets

All variables can then be stored in a VariableSet object so that loaders can easily interact with them.

from bgc_data_processing.core.variables.vars import TemplateVar
from bgc_data_processing.core.variables.sets import VariableSet

template_lat = TemplateVar(
    name = "LATITUDE",
    unit = "[deg_N]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
template_lon = TemplateVar(
    name = "LONGITUDE",
    unit = "[deg_E]",
    var_type = float,
    name_format = "%-12s",
    value_format = "%12.6f",
)
existing_lat = template_lat.in_file_as(
    ("latitude","latitude_flag", [1])
)
existing_lon = template_lon.in_file_as(
    ("longitude","longitude_flag", [1])
)
variable_set = VariableSet(
    latitude=existing_lat,
    longitude=existing_lon,
)

Default variables

By default, some variables are already defined as TemplateVar in config/variables.toml (in config/default/variables.toml). These variables are the most common ones for this project, and the templates can be used to instantiate an ExistingVar or a NotExistingVar depending on the source dataset.

One variable definition example can be found here:

# Lines starting with '#? ' are used to verify variables' types
# Type hints lines are structured the following way:
# Variable keys: possible types: additional comment

[provider]
#? provider.NAME: str: variable name
NAME = "PROVIDER"
#? provider.UNIT: str: variable unit
UNIT = "[]"
#? provider.TYPE: str: variable type (among ['int', 'float', 'str', 'datetime64[ns]'])
TYPE = "str"
#? provider.DEFAULT: str | int | float: default variable value if nan or not existing
DEFAULT = nan

To add a new variable, create and edit a new set of rows following the pattern of the already defined variables, for example to create the variable var1:

[var1]
#? var1.NAME: str: variable name
NAME="VAR1"
#? var1.UNIT: str: variable unit
UNIT="[]"
#? var1.TYPE: str: variable type (among ['int', 'float', 'str', 'datetime64[ns]'])
TYPE="str"
#? var1.DEFAULT: str | int | float: default variable value if nan or not existing
DEFAULT=nan
#? var1.NAME_FORMAT: str: format to use to save the name and unit of the variable
NAME_FORMAT="%-15s"
#? var1.VALUE_FORMAT: str: format to use to save the values of the variable
VALUE_FORMAT="%15s"

The lines starting with #? provide type hints for the variables, to ensure that values of the correct type are entered.
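To illustrate the "key: possible types: comment" structure of these hint lines, the sketch below extracts them with a regular expression. It is not the project's own parser: HINT_PATTERN and parse_hints are hypothetical names introduced for this example.

```python
import re
from typing import Dict, List

# Matches lines such as "#? var1.TYPE: str: variable type".
HINT_PATTERN = re.compile(r"^#\? (?P<key>[\w.]+): (?P<types>[^:]+): (?P<comment>.*)$")


def parse_hints(text: str) -> Dict[str, List[str]]:
    """Map each hinted key to the list of type names allowed for it."""
    hints: Dict[str, List[str]] = {}
    for line in text.splitlines():
        match = HINT_PATTERN.match(line.strip())
        if match:
            # Alternative types are separated by "|", as in "str | int | float".
            hints[match["key"]] = [name.strip() for name in match["types"].split("|")]
    return hints


sample = """\
[var1]
#? var1.NAME: str: variable name
NAME="VAR1"
#? var1.DEFAULT: str | int | float: default variable value if nan or not existing
DEFAULT=nan
"""
hints = parse_hints(sample)
```

Lines that do not start with "#? " (table headers, key/value pairs, ordinary comments) are simply ignored by this sketch.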