Advanced usage
(1) Custom YAML configurations
The YAML configuration file provides the flexibility for users to specify their own feature subsets as well as incorporate new time-series features and feature-computing functions. Below, the general structure of a pyhctsa YAML file is specified. In general, a YAML file follows a systematic nested structure. If a field is not applicable (e.g., no dependencies), it can either be left empty or excluded from the configuration altogether.
module:
  function_name:
    base_name: function_name
    dependencies:
    configs:
      - {param1: 1.0, param2: 2.0, zscore: True, abs: True}
    legacy_name: MD_polvar
    ordered_args: ['param1', 'param2']
In plain English, the YAML configuration instructs pyhctsa to:
1. Pre-process the time-series data by first z-score normalising and then taking the absolute value.
2. Go to `pyhctsa.operations.<module>` and evaluate the function `<function_name>` on the data with param1 set to 1.0 and param2 set to 2.0.
3. Construct a unique identifier for the output as `<base_name>_<param1>_<param2>`, using the parameter values in the order given by ordered_args.
Field Descriptions
module: The top-level key (e.g., distribution) must exactly match a module in pyhctsa.operations (case-sensitive). If a function is not nested within the specified module, pyhctsa will skip its computation.
function_name: The name of the function as defined in the module (case-sensitive), e.g., add_noise for pyhctsa.operations.distribution.add_noise.
base_name: The operation name used in feature labels. Users may set this freely.
dependencies: Any additional external dependencies required by the feature-computing function, beyond standard Python scientific libraries. May be left empty or omitted.
configs: A list of parameter sets, where each entry defines a unique configuration. See below for a detailed explanation.
legacy_name: Maps functions from the MATLAB HCTSA to equivalent pyhctsa implementations. May be left empty or omitted if not applicable.
ordered_args: Specifies the order in which parameters from each configuration entry are injected into the feature function. Also used to construct the time-series feature identifiers.
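The lookup rule for module and function_name can be illustrated with a short sketch. This is not pyhctsa's actual loader: `resolve_operation` is a hypothetical helper, and the standard-library `json` package stands in for `pyhctsa.operations` so that the snippet runs without pyhctsa installed.

```python
import importlib


def resolve_operation(package, module, function_name):
    """Return package.module.function_name, or None if either name is missing.

    Hypothetical helper: it mimics the 'skip if not found' rule described
    above, not pyhctsa's real loading code.
    """
    try:
        mod = importlib.import_module(f"{package}.{module}")
    except ImportError:
        return None  # unknown module: nothing to compute
    # Attribute lookup is case-sensitive; a missing name yields None.
    return getattr(mod, function_name, None)


# 'json' stands in for 'pyhctsa.operations' purely for illustration.
found = resolve_operation("json", "decoder", "JSONDecoder")
wrong_case = resolve_operation("json", "decoder", "jsondecoder")
wrong_module = resolve_operation("json", "not_a_module", "JSONDecoder")
```

Only the first call resolves to a callable; the other two return None, which corresponds to an entry whose computation would be skipped.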
Configs
In pyhctsa YAML files, configs follow a specific convention. It is important to adhere to these conventions when specifying new functions or creating custom subsets to ensure the output is as expected. When specifying a configuration, function-specific parameters are always followed by pre-processing arguments. Consider a generic function:
def feature_function(x, param1, param2):
    feature = ...  # placeholder: compute a value from x, param1 and param2
    return feature
Here, the function accepts a raw time-series input (as an array), x, and two
parameters param1 and param2. In this example, both parameters take scalar
values, but in general, parameters can be of type str, bool, array,
list, int, or float.
The following YAML is specified:
module name:
  feature_function:
    base_name: function_name
    dependencies:
    configs:
      - {param1: 1.0, param2: 1.0}
      - {param1: 2.0, param2: 2.0}
    legacy_name:
    ordered_args: ['param1', 'param2']
In this example, two separate configurations are specified: one where param1 = 1.0
and param2 = 1.0, and another where param1 = 2.0 and param2 = 2.0. Accordingly,
feature_function will be evaluated twice, once per configuration.
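The one-evaluation-per-entry behaviour can be sketched in plain Python. The body of `feature_function` below is a toy stand-in chosen for illustration, and the loop is an assumption about how configs and ordered_args fit together, not pyhctsa's internal code.

```python
from statistics import mean


def feature_function(x, param1, param2):
    # Toy feature for illustration only: a scaled, shifted mean.
    return param1 * mean(x) + param2


configs = [
    {"param1": 1.0, "param2": 1.0},
    {"param1": 2.0, "param2": 2.0},
]
ordered_args = ["param1", "param2"]

x = [1.0, 2.0, 3.0]
results = []
for config in configs:
    # Pull the parameters out in the order given by ordered_args.
    args = [config[name] for name in ordered_args]
    results.append(feature_function(x, *args))
```

Each entry in configs produces exactly one call, so `results` ends up with two values.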
Pre-processing arguments
By default, functions are evaluated on raw (unprocessed) time-series data. In many
cases, it is preferable (or necessary) to compute features in a way that is invariant
to the absolute scale of the data. This can be achieved by pre-processing the time
series prior to feature computation. The two pre-processing options available in
pyhctsa are:
z-score: Normalise the data by subtracting the mean and dividing by the standard deviation. See the z_score function documentation.
absolute (abs) value: Take the absolute value of the time series.
Warning
When both zscore and abs are set to True, pyhctsa always z-scores the data first and then applies the absolute value operation.
Either or both of these pre-processing options can be enabled in the YAML:
module name:
  feature_function:
    base_name: function_name
    dependencies:
    configs:
      - {param1: 1.0, param2: 1.0, zscore: True}
      - {param1: 2.0, param2: 2.0, zscore: True, abs: True}
    legacy_name:
    ordered_args: ['param1', 'param2']
In this example, pyhctsa will first z-score the time-series data before computing feature_function for the first configuration, while for the second configuration, it will first z-score the data, then take the absolute value. Unless otherwise specified in the configuration, zscore=False and abs=False by default.
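As a rough sketch of the pre-processing order, the snippet below separates the pre-processing flags from the function parameters and applies z-scoring before the absolute value. The helpers `split_config` and `preprocess` are hypothetical, and the use of the sample standard deviation is an assumption; the z_score function in pyhctsa may normalise differently.

```python
from statistics import mean, stdev

PREPROCESSING_KEYS = ("zscore", "abs")


def preprocess(x, zscore=False, abs_value=False):
    """Apply the optional pre-processing steps: z-score first, then abs."""
    if zscore:
        mu, sigma = mean(x), stdev(x)  # assumption: sample std (ddof=1)
        x = [(v - mu) / sigma for v in x]
    if abs_value:
        x = [abs(v) for v in x]
    return x


def split_config(config):
    """Separate function parameters from pre-processing flags."""
    params = {k: v for k, v in config.items() if k not in PREPROCESSING_KEYS}
    # Flags missing from the entry default to False.
    flags = {k: config.get(k, False) for k in PREPROCESSING_KEYS}
    return params, flags


config = {"param1": 2.0, "param2": 2.0, "zscore": True, "abs": True}
params, flags = split_config(config)
x = preprocess([1.0, 2.0, 3.0], zscore=flags["zscore"], abs_value=flags["abs"])
```

Here `params` holds only param1 and param2, which are what the feature function would receive, while `x` is the z-scored and then rectified series.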
To make the YAML structure example more concrete, consider the pol_var function:
medical:
  pol_var:
    base_name: pol_var
    dependencies:
    configs:
      - {d: 1.0, D: 3, zscore: True}
      - {d: 1.0, D: 4, zscore: True}
      - {d: 1.0, D: 5, zscore: True}
      - {d: 1.0, D: 6, zscore: True}
    legacy_name: MD_polvar
    ordered_args: ['d', 'D']
As per the API, the pol_var function accepts two parameters: d and D. pyhctsa will evaluate four unique configurations,
one for each value of D. In every configuration, the time series is z-scored before being passed to the pol_var function.
Each evaluation of pol_var returns a single scalar value, so the four configurations yield
four time-series features:
pol_var_1_3
pol_var_1_4
pol_var_1_5
pol_var_1_6
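The naming pattern in the list above can be reproduced with a few lines of Python. `feature_id` is a hypothetical helper, and the compact number formatting (1.0 printed as 1) is an assumption inferred from the names shown, not taken from pyhctsa's source.

```python
def feature_id(base_name, config, ordered_args):
    """Join base_name with the ordered parameter values, e.g. pol_var_1_3."""
    parts = [base_name]
    for name in ordered_args:
        value = config[name]
        # Assumption: numbers are printed compactly, so 1.0 becomes '1'.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            parts.append(format(value, "g"))
        else:
            parts.append(str(value))
    return "_".join(parts)


configs = [{"d": 1.0, "D": D, "zscore": True} for D in (3, 4, 5, 6)]
names = [feature_id("pol_var", c, ["d", "D"]) for c in configs]
```

Only the keys listed in ordered_args contribute to the identifier, which is why zscore does not appear in any of the names.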
Once a YAML configuration has been created (i.e., as a .yaml file), it can be loaded into the FeatureCalculator as follows:
from pyhctsa.calculator import FeatureCalculator
calc = FeatureCalculator(config_path='<path_to_custom_yaml>.yaml')
calc.extract(data)
(2) Feature filtering
When constructing custom feature subsets, users may wish to control which time-series features are returned by an operation.
Consider the binary_stats function located in the symbolic module (see the binary_stats documentation for details).
The default configuration for this function in pyhctsa is:
symbolic:
  binary_stats:
    base_name: binary_stats
    dependencies:
    configs:
      - {binary_method: 'mean', zscore: True}
    legacy_name: SB_BinaryStats
    ordered_args: ['binary_method']
We can then run the FeatureCalculator with this default configuration:
from pyhctsa.calculator import FeatureCalculator
from pyhctsa.utils import get_dataset
data = get_dataset(which='e1000')[0]
calc = FeatureCalculator(config_path='<path_to_yaml>')
res = calc.extract(data)
print(f'Number of time-series features: {res.shape[1]}')
With these default settings, the binary_stats function returns 18 different time-series features:
'binary_stats_mean.pupstat2',
'binary_stats_mean.pstretch1',
'binary_stats_mean.longstretch0',
'binary_stats_mean.longstretch0norm',
'binary_stats_mean.meanstretch0',
'binary_stats_mean.meanstretch0norm',
'binary_stats_mean.stdstretch0',
'binary_stats_mean.stdstretch0norm',
'binary_stats_mean.longstretch1',
'binary_stats_mean.longstretch1norm',
'binary_stats_mean.meanstretch1',
'binary_stats_mean.meanstretch1norm',
'binary_stats_mean.stdstretch1',
'binary_stats_mean.stdstretch1norm',
'binary_stats_mean.meanstretchdiff',
'binary_stats_mean.stdstretchdiff',
'binary_stats_mean.diff21stretch1',
'binary_stats_mean.diff21stretch0'
If we would like to only return a specific time-series feature (or group of features), we can specify this in the configuration YAML using the select key. For example, to isolate the feature pupstat2, we can specify the following:
symbolic:
  binary_stats:
    base_name: binary_stats
    dependencies:
    configs:
      - {binary_method: 'mean', zscore: True, select: 'pupstat2'}
    legacy_name: SB_BinaryStats
    ordered_args: ['binary_method']
Running the FeatureCalculator with this configuration:
calc = FeatureCalculator(config_path='<path_to_new_yaml>')
res = calc.extract(data)
print(f'Number of time-series features: {res.shape[1]}')
Now, we can see that only a single time-series feature is returned:
'binary_stats_mean.pupstat2'
Alternatively, if we would like to specify which features to discard, we can use the exclude key:
symbolic:
  binary_stats:
    # same as before
    configs:
      - {binary_method: 'mean', zscore: True, exclude: 'pupstat2'}
    # same as before
Here, all time-series features except the one specified in the configuration (pupstat2) will be returned.
To keep/discard multiple features, we can specify each in a list of strings as follows:
symbolic:
  binary_stats:
    # same as before
    configs:
      - {binary_method: 'mean', zscore: True, select: ['pupstat2', 'pstretch1', 'meanstretchdiff']}
    # same as before
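The effect of select and exclude can be pictured as a filter over the named outputs of an operation. The sketch below is illustrative only: `filter_outputs` is a hypothetical helper, the output values are made up, and how pyhctsa behaves when both keys are given in one entry is not covered here.

```python
def filter_outputs(outputs, select=None, exclude=None):
    """Keep or drop named outputs; each key accepts a string or a list."""
    def as_set(names):
        if names is None:
            return None
        return {names} if isinstance(names, str) else set(names)

    keep, drop = as_set(select), as_set(exclude)
    return {
        name: value
        for name, value in outputs.items()
        if (keep is None or name in keep) and (drop is None or name not in drop)
    }


# Made-up values standing in for a few binary_stats outputs.
outputs = {"pupstat2": 0.5, "pstretch1": 0.1, "meanstretchdiff": -0.2, "longstretch0": 7}

only_one = filter_outputs(outputs, select="pupstat2")
all_but_one = filter_outputs(outputs, exclude="pupstat2")
several = filter_outputs(outputs, select=["pupstat2", "pstretch1", "meanstretchdiff"])
```

A single string and a one-element list are treated the same way, mirroring the two forms of the select key shown in the YAML examples.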