api_24sea.datasignals.core#

The core module for the api_24sea.datasignals subpackage

Classes#

API

Accessor for working with data signals coming from the 24SEA API.

AsyncAPI

Async version of the API class. Get data from 24SEA API /datasignals asynchronously.

Functions#

to_category_value(→ pandas.DataFrame)

Categorize the data based on the metrics overview.

to_star_schema(→ Optional[Union[Dict[str, Any], ...)

Transforms the data and metrics_overview into a star schema format for analytical purposes.

Module Contents#

class API#

Accessor for working with data signals coming from the 24SEA API.

property authenticated: bool#

Whether the client is authenticated

property metrics_overview: pandas.DataFrame | None#

Get the metrics overview DataFrame.

authenticate(username: str, password: str, permissions_overview: pandas.DataFrame | None = None)#

Authenticate with username/password

get_metrics(site: str | None = None, locations: str | List[str] | None = None, metrics: str | List[str] | None = None, headers: Dict[str, str] | None = None) List[Dict[str, str | None]] | None#

Get the metrics names for a site, provided the following parameters.

Parameters#

site: Optional[str]

The site name. If None, the queryable metrics for all sites will be returned, and the locations and metrics parameters will be ignored.

locations: Optional[Union[str, List[str]]]

The locations for which to get the metrics. If None, all locations will be considered.

metrics: Optional[Union[str, List[str]]]

The metrics to get. They can be specified as regular expressions. If None, all metrics will be considered.

For example:

  • metrics=["^ACC", "^DEM"] will return all the metrics that start with ACC or DEM,

  • metrics=["windspeed$", "winddirection$"] will return all the metrics that end with windspeed or winddirection,

  • and metrics=[".*WF_A01.*", ".*WF_A02.*"] will return all the metrics that contain WF_A01 or WF_A02.
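To make the pattern semantics concrete, the sketch below filters a list of made-up metric names with Python's re module. It assumes the patterns are applied with search semantics (a match anywhere in the name unless anchored with ^ or $). The metric names and the filter_metrics helper are hypothetical and are not part of api_24sea.

```python
import re
from typing import List

# Hypothetical metric names, for illustration only.
METRIC_NAMES = [
    "ACC_WF_A01_X",
    "DEM_WF_A02_Y",
    "mean_WF_A01_windspeed",
    "mean_WF_A02_winddirection",
    "std_WF_A03_windspeed",
]


def filter_metrics(names: List[str], patterns: List[str]) -> List[str]:
    """Keep the names matched by at least one pattern (re.search semantics)."""
    compiled = [re.compile(p) for p in patterns]
    return [n for n in names if any(c.search(n) for c in compiled)]


# Names starting with ACC or DEM.
starts = filter_metrics(METRIC_NAMES, ["^ACC", "^DEM"])
# Names ending with windspeed or winddirection.
ends = filter_metrics(METRIC_NAMES, ["windspeed$", "winddirection$"])
# Names containing WF_A01 or WF_A02.
contains = filter_metrics(METRIC_NAMES, [".*WF_A01.*", ".*WF_A02.*"])
```

Under search semantics the leading and trailing .* in the last pattern are redundant; "WF_A01" alone would select the same names.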

Returns#

Optional[List[Dict[str, Optional[str]]]]

The metrics names for the given site, locations and metrics.

Note

This class method is legacy because it does not add functionality to the DataSignals pandas accessor.

selected_metrics(data: pandas.DataFrame) pandas.DataFrame#

Return the selected metrics for the query.

get_data(sites: List | str | None, locations: List | str | None, metrics: List | str, start_timestamp: str | datetime.datetime, end_timestamp: str | datetime.datetime, as_dict: bool = False, as_star_schema: bool = False, outer_join_on_timestamp: bool = True, headers: Dict[str, str] | None = None, data: pandas.DataFrame | None = None, timeout: int = 3600, threads: int | None = None, location: List | str | None = None, force_cache_miss: bool = False, method: str = 'POST') pandas.DataFrame | Dict[str, Dict[str, pandas.DataFrame] | Dict[str, Any]] | List[Any | str] | None#

Get the data signals from the 24SEA API.

Parameters#

sites: Optional[Union[List, str]]

The site name or List of site names. If None, the site will be inferred from the metrics.

locations: Optional[Union[List, str]]

The location name or List of location names. If None, the location will be inferred from the metrics.

metrics: Union[List, str]

The metric name or List of metric names. It must be provided. The names do not have to be complete; a part of the metric name is enough. For example, if the metric name is "mean_WF_A01_windspeed", the user can equivalently provide sites="wf", locations="a01", metrics="mean windspeed".
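The partial-name lookup can be pictured with the short sketch below. It assumes a simple rule: a query matches when every whitespace-separated token occurs in the full metric name, ignoring case. This rule and the matches_partial helper are illustrative assumptions; the matching performed by the library may differ.

```python
def matches_partial(metric_name: str, query: str) -> bool:
    """True if every whitespace-separated token of `query` occurs in `metric_name`.

    Assumed rule for illustration only, compared case-insensitively.
    """
    name = metric_name.lower()
    return all(token in name for token in query.lower().split())


full_name = "mean_WF_A01_windspeed"
hits = [
    matches_partial(full_name, "wf"),              # site part
    matches_partial(full_name, "a01"),             # location part
    matches_partial(full_name, "mean windspeed"),  # statistic and metric part
    matches_partial(full_name, "std windspeed"),   # different statistic
]
```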

start_timestamp: Union[str, datetime.datetime]

The start timestamp for the query. It must be an ISO 8601 string, e.g., "2021-01-01T00:00:00Z", or a datetime object.

end_timestamp: Union[str, datetime.datetime]

The end timestamp for the query. It must be an ISO 8601 string, e.g., "2021-01-01T00:00:00Z", or a datetime object.
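As a small aid, the sketch below builds ISO 8601 strings in the "…Z" form shown above from timezone-aware datetime objects, using only the standard library. The to_iso_z helper is hypothetical; since the parameters also accept datetime objects, this conversion is optional.

```python
from datetime import datetime, timedelta, timezone


def to_iso_z(moment: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC string ending in 'Z'."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A one-day query window starting on 1 January 2021 (UTC).
start = datetime(2021, 1, 1, tzinfo=timezone.utc)
end = start + timedelta(days=1)

start_timestamp = to_iso_z(start)
end_timestamp = to_iso_z(end)
```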

as_dict: bool, optional

If True, the data will be returned as a list of dictionaries. Default is False.

as_star_schema: bool, optional

If True, the data will be returned in a star schema format. Default is False.

outer_join_on_timestamp: bool, optional

If True, the data will be returned as a full DataFrame joined on the timestamp: it will not contain the site and location columns, and the timestamp column will contain unique values. If False, the data will be returned as a block-diagonal DataFrame that contains the site and location columns; the timestamp column will not contain unique values, since each timestamp is repeated for each site and location. Default is True.
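The difference between the two layouts can be illustrated in plain Python with made-up values. This is only a sketch of the resulting shapes, not the library's implementation; the per-location records and variable names are hypothetical.

```python
# Hypothetical per-location results, keyed by (site, location).
per_location = {
    ("windfarm", "WFA01"): [
        {"timestamp": "2021-01-01T00:00:00Z", "mean_WF_A01_windspeed": 7.1},
        {"timestamp": "2021-01-01T00:10:00Z", "mean_WF_A01_windspeed": 7.4},
    ],
    ("windfarm", "WFA02"): [
        {"timestamp": "2021-01-01T00:00:00Z", "mean_WF_A02_windspeed": 6.8},
        {"timestamp": "2021-01-01T00:10:00Z", "mean_WF_A02_windspeed": 6.9},
    ],
}

# Block-diagonal layout: rows are stacked per location, so each timestamp
# repeats and the site and location columns are kept.
block_diagonal = [
    {"site": site, "location": location, **row}
    for (site, location), rows in per_location.items()
    for row in rows
]

# Outer join on timestamp: one row per timestamp, one column per metric,
# and no site or location columns.
joined = {}
for rows in per_location.values():
    for row in rows:
        joined.setdefault(row["timestamp"], {}).update(row)
outer_joined = [joined[ts] for ts in sorted(joined)]
```

In the stacked layout, a row for one location simply has no entry for the metrics of the other locations, which is what gives the table its block-diagonal shape.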

headers: Optional[Union[Dict[str, str]]], optional

The headers to pass to the request. If None, the default headers {"accept": "application/json"} will be used. Default is None.

data: Optional[pd.DataFrame], optional

The DataFrame to update with the data signals. If None, a new DataFrame will be created. Default is None.

timeout: int, optional

The timeout for the request in seconds. Default is 3600.

threads: int, optional

The number of threads to use for the request. If None, it will be set to the number of CPU cores. Default is None.

location: Optional[Union[List, str]]

The location name or List of location names. This is a legacy parameter, and it is deprecated. Please use the locations parameter instead.

force_cache_miss: bool, optional

Whether to force a cache miss on the backend data endpoint. Default is False.

method: str, optional

HTTP method to use for the backend request. Default is "POST".

Returns#

Union[pd.DataFrame, Dict[str, Dict[str, pd.DataFrame]]]
  • The DataFrame containing the data signals, or

  • A dictionary containing the data signals divided by location, or

  • A dictionary containing the data signals in star schema format.

get_stats(sites: List | str | None, locations: List | str | None, metrics: List | str, start_timestamp: str | datetime.datetime, end_timestamp: str | datetime.datetime, as_dict: bool = False, headers: Dict[str, str] | None = None, timeout: int = 3600, threads: int | None = None, location: List | str | None = None, method: str = 'GET') pandas.DataFrame | Dict[str, Dict[str, pandas.DataFrame] | Dict[str, Any]]#

Get the metrics statistics (MAX, MIN, AVG) for the specified time range.

Parameters#

sites: Optional[Union[List, str]]

The sites to filter the data.

locations: Optional[Union[List, str]]

The locations to filter the data.

metrics: Union[List, str]

The metrics to retrieve.

start_timestamp: Union[str, datetime.datetime]

The start timestamp for the data retrieval.

end_timestamp: Union[str, datetime.datetime]

The end timestamp for the data retrieval.

as_dict: bool

Whether to return the data as a dictionary.

headers: Optional[Union[Dict[str, str]]]

Headers to include in the request.

timeout: int

The timeout for the request.

threads: Optional[int]

The number of threads to use for the request.

location: Optional[Union[List, str]]

The location name or List of location names. This is a legacy parameter, and it is deprecated. Please use the locations parameter instead.

method: str, optional

HTTP method to use for the backend request. Default is "GET".

Returns#

Union[DataFrame, Dict[str, Union[Dict[str, DataFrame], Dict[str, Any]]]]

The retrieved data.

get_null_timestamps(sites: List | str | None, locations: List | str | None, metrics: List | str, start_timestamp: str | datetime.datetime, end_timestamp: str | datetime.datetime, as_dict: bool = False, headers: Dict[str, str] | None = None, timeout: int = 3600, threads: int | None = None, location: List | str | None = None, method: str = 'GET') pandas.DataFrame | Dict[str, Dict[str, pandas.DataFrame] | Dict[str, Any]]#

Get the list of timestamps at which the selected metrics have null values in the specified time range.

Parameters#

sites: Optional[Union[List, str]]

The sites to filter the data.

locations: Optional[Union[List, str]]

The locations to filter the data.

metrics: Union[List, str]

The metrics to retrieve.

start_timestamp: Union[str, datetime.datetime]

The start timestamp for the data retrieval.

end_timestamp: Union[str, datetime.datetime]

The end timestamp for the data retrieval.

as_dict: bool

Whether to return the data as a dictionary.

headers: Optional[Union[Dict[str, str]]]

Headers to include in the request.

timeout: int

The timeout for the request.

threads: Optional[int]

The number of threads to use for the request.

location: Optional[Union[List, str]]

The location name or List of location names. This is a legacy parameter, and it is deprecated. Please use the locations parameter instead.

method: str, optional

HTTP method to use for the backend request. Default is "GET".

Returns#

Union[DataFrame, Dict[str, Union[Dict[str, DataFrame], Dict[str, Any]]]]

The retrieved data.

get_availability(sites: List | str | None, locations: List | str | None, metrics: List | str, start_timestamp: str | datetime.datetime, end_timestamp: str | datetime.datetime, granularity: str | int, sampling_interval_seconds: int | None = None, as_dict: bool = False, headers: Dict[str, str] | None = None, timeout: int = 3600, threads: int | None = None, location: List | str | None = None, method: str = 'GET') pandas.DataFrame | Dict[str, Dict[str, pandas.DataFrame] | Dict[str, Any]]#

Get the data availability of the selected metrics for the specified time range, aggregated at the given granularity.

Parameters#

sites: Optional[Union[List, str]]

The sites to filter the data.

locations: Optional[Union[List, str]]

The locations to filter the data.

metrics: Union[List, str]

The metrics to retrieve.

start_timestamp: Union[str, datetime.datetime]

The start timestamp for the data retrieval.

end_timestamp: Union[str, datetime.datetime]

The end timestamp for the data retrieval.

granularity: Union[str, int]

The granularity of the data, can be a string, or an integer number of seconds. String values are restricted to “day”, “week”, “calendarmonth”, “30days”, or “365days”. If “calendarmonth” is used, the availability will refer to the specific calendar month (e.g. January 2023), and not to a rolling period of 30 days.

sampling_interval_seconds: Optional[int]

The sampling interval in seconds. If None, the default value is used, which is 600 seconds (10 minutes).
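The sketch below shows how the granularity values described above translate into a window length, and how many samples a fully available window would hold at a given sampling interval. The notion of "expected samples" is an assumption made for illustration; this reference does not state how the backend computes availability. The granularity_seconds and expected_samples helpers are hypothetical.

```python
import calendar
from typing import Optional, Union

SECONDS_PER_DAY = 86_400

# Fixed-length string granularities named in the parameter description.
FIXED_SECONDS = {
    "day": SECONDS_PER_DAY,
    "week": 7 * SECONDS_PER_DAY,
    "30days": 30 * SECONDS_PER_DAY,
    "365days": 365 * SECONDS_PER_DAY,
}


def granularity_seconds(
    granularity: Union[str, int],
    year: Optional[int] = None,
    month: Optional[int] = None,
) -> int:
    """Return the window length in seconds for a granularity value."""
    if isinstance(granularity, int):
        if granularity <= 0:
            raise ValueError("an integer granularity must be positive")
        return granularity
    if granularity == "calendarmonth":
        if year is None or month is None:
            raise ValueError("calendarmonth needs a specific year and month")
        # A calendar month has 28 to 31 days, unlike the rolling "30days".
        return calendar.monthrange(year, month)[1] * SECONDS_PER_DAY
    try:
        return FIXED_SECONDS[granularity]
    except KeyError:
        raise ValueError(f"unsupported granularity: {granularity!r}") from None


def expected_samples(
    granularity: Union[str, int],
    sampling_interval_seconds: Optional[int] = None,
    **month_args: int,
) -> int:
    """Samples in a fully available window (assumed definition)."""
    interval = 600 if sampling_interval_seconds is None else sampling_interval_seconds
    return granularity_seconds(granularity, **month_args) // interval


per_day = expected_samples("day")
per_january_2023 = expected_samples("calendarmonth", year=2023, month=1)
per_30_days = expected_samples("30days")
```

With the default 600-second interval, January 2023 holds 31 days of samples while "30days" always holds 30, which is the distinction the granularity description draws.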

as_dict: bool

Whether to return the data as a dictionary.

headers: Optional[Union[Dict[str, str]]]

Headers to include in the request.

timeout: int

The timeout for the request.

threads: Optional[int]

The number of threads to use for the request.

location: Optional[Union[List, str]]

The location name or List of location names. This is a legacy parameter, and it is deprecated. Please use the locations parameter instead.

method: str, optional

HTTP method to use for the backend request. Default is "GET".

Returns#

Union[DataFrame, Dict[str, Union[Dict[str, DataFrame], Dict[str, Any]]]]

The retrieved data.

get_oldest_timestamp(sites: str | List[str], locations: List[str] | str | None, method: str = 'GET') pandas.DataFrame#

Get oldest timestamp for one or multiple sites (sync).

get_stats_predefined_intervals(sites: List | str | None, locations: List | str | None, metrics: List | str, as_dict: bool = False, headers: Dict[str, str] | None = None, timeout: int = 3600, threads: int | None = None, method: str = 'GET') Dict[str, Any]#
Run get_stats for predefined intervals:
  • all_time: (datetime.min -> datetime.max)

  • last_year: (now-365d -> now)

  • last_month: (now-30d -> now)
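The three windows listed above can be written out with the standard library as follows. This is a sketch of the interval arithmetic only; the predefined_intervals helper is hypothetical, and the use of a naive local "now" is an assumption, since the reference does not say which clock or timezone the method uses.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


def predefined_intervals(
    now: Optional[datetime] = None,
) -> Dict[str, Tuple[datetime, datetime]]:
    """Return the (start, end) pair for each predefined interval."""
    now = datetime.now() if now is None else now
    return {
        "all_time": (datetime.min, datetime.max),
        "last_year": (now - timedelta(days=365), now),
        "last_month": (now - timedelta(days=30), now),
    }


# A fixed reference time keeps the example reproducible.
intervals = predefined_intervals(datetime(2023, 6, 30))
```

Note that last_year and last_month are rolling windows of 365 and 30 days, not calendar periods.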

class AsyncAPI#

Async version of the API class. Get data from 24SEA API /datasignals asynchronously.

async get_metrics_overview() pandas.DataFrame | None#

Asynchronously get metrics overview, authenticating if needed

async get_metrics(site: str | None = None, locations: str | List[str] | None = None, metrics: str | List[str] | None = None, headers: Dict[str, str] | None = None) Any#

Get the metrics names for a site asynchronously.

async get_data(sites: List | str | None, locations: List | str | None, metrics: List | str, start_timestamp: str | datetime.datetime, end_timestamp: str | datetime.datetime, as_dict: bool = False, as_star_schema: bool = False, outer_join_on_timestamp: bool = True, headers: Dict[str, str] | None = None, data: pandas.DataFrame | None = None, max_retries: int = 5, timeout: int = 1800, location: List | str | None = None, force_cache_miss: bool = False, method: str = 'POST') pandas.DataFrame | Dict[str, Dict[str, pandas.DataFrame] | Dict[str, Any]] | List[Any | str] | None#

Get the data signals from the 24SEA API asynchronously.

Asynchronous version of API.get_data(), with the same parameters and return type. The only difference is that this method is asynchronous and returns a coroutine, so it must be awaited to get the actual data. Moreover, in case of any request failure, instead of returning a DataFrame with the successfully retrieved data, it returns a list of error messages.
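The calling pattern this implies is sketched below with a stand-in coroutine, so that it runs without credentials or network access. fake_get_data is hypothetical and only mimics the behaviour described above (data on success, a list of error messages on failure); it is not the real AsyncAPI.get_data. The check assumes as_dict is left at False, since a list could otherwise be a legitimate result.

```python
import asyncio
from typing import Any, Tuple


async def fake_get_data(fail: bool) -> Any:
    """Hypothetical stand-in for AsyncAPI.get_data (not the real method)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if fail:
        return ["request 1 failed: timeout"]  # list of error messages
    return {"timestamp": ["2021-01-01T00:00:00Z"], "mean_WF_A01_windspeed": [7.1]}


def classify(result: Any) -> Tuple[str, Any]:
    """Separate error lists from data before any further processing."""
    if isinstance(result, list):
        return ("errors", result)
    return ("data", result)


async def main() -> Tuple[Tuple[str, Any], Tuple[str, Any]]:
    # The coroutine must be awaited to obtain the actual result.
    ok = await fake_get_data(fail=False)
    bad = await fake_get_data(fail=True)
    return classify(ok), classify(bad)


ok_outcome, bad_outcome = asyncio.run(main())
```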

Parameters#

sites: Optional[Union[List, str]]

The site name or List of site names. If None, the site will be inferred from the metrics.

locations: Optional[Union[List, str]]

The location name or List of location names. If None, the location will be inferred from the metrics.

metrics: Union[List, str]

The metric name or List of metric names. It must be provided. The names do not have to be complete; a part of the metric name is enough. For example, if the metric name is "mean_WF_A01_windspeed", the user can equivalently provide sites="wf", locations="a01", metrics="mean windspeed".

start_timestamp: Union[str, datetime.datetime]

The start timestamp for the query. It must be an ISO 8601 string, e.g., "2021-01-01T00:00:00Z", or a datetime object.

end_timestamp: Union[str, datetime.datetime]

The end timestamp for the query. It must be an ISO 8601 string, e.g., "2021-01-01T00:00:00Z", or a datetime object.

as_dict: bool, optional

If True, the data will be returned as a list of dictionaries. Default is False.

as_star_schema: bool, optional

If True, the data will be returned in a star schema format. Default is False.

outer_join_on_timestamp: bool, optional

If True, the data will be returned as a full DataFrame joined on the timestamp: it will not contain the site and location columns, and the timestamp column will contain unique values. If False, the data will be returned as a block-diagonal DataFrame that contains the site and location columns; the timestamp column will not contain unique values, since each timestamp is repeated for each site and location. Default is True.

headers: Optional[Union[Dict[str, str]]], optional

The headers to pass to the request. If None, the default headers {"accept": "application/json"} will be used. Default is None.

data: Optional[pd.DataFrame], optional

The DataFrame to update with the data signals. If None, a new DataFrame will be created. Default is None.

timeout: int, optional

The timeout for the request in seconds. Default is 1800.

threads: int, optional

The number of threads to use for the request. Default is the number of CPU cores. If None, it will be set to the number of CPU cores.

location: Optional[Union[List, str]]

The location name or List of location names. This is a legacy parameter, and it is deprecated. Please use the locations parameter instead.

force_cache_miss: bool, optional

Whether to force a cache miss on the backend data endpoint. Default is False.

method: str, optional

HTTP method to use for the backend request. Default is "POST".

Returns#

Union[pd.DataFrame, Dict[str, Dict[str, pd.DataFrame]]]

Coroutine that returns either:

  • The DataFrame containing the data signals, or

  • A dictionary containing the data signals divided by location, or

  • A dictionary containing the data signals in star schema format.

async get_oldest_timestamp(sites: str | List[str], locations: List[str] | str | None, method: str = 'GET') pandas.DataFrame#

Get oldest timestamp for one or multiple sites (async).

Parameters#

sites: Union[str, List[str]]

The site(s) to retrieve the oldest timestamp for.

locations: Optional[Union[List[str], str]]

The location(s) to retrieve the oldest timestamp for.

Returns#

pd.DataFrame

A DataFrame containing the oldest timestamp for the specified site(s) and location(s).

async get_stats(sites: List | str | None, locations: List | str | None, metrics: List | str, start_timestamp: str | datetime.datetime, end_timestamp: str | datetime.datetime, as_dict: bool = False, headers: Dict[str, str] | None = None, timeout: int = 3600, location: List | str | None = None, method: str = 'GET') pandas.DataFrame | Dict[str, Dict[str, pandas.DataFrame] | Dict[str, Any]]#

Get the metrics statistics (MAX, MIN, AVG) for the specified time range asynchronously.

Parameters#

sites: Optional[Union[List, str]]

The sites to filter the data.

locations: Optional[Union[List, str]]

The locations to filter the data.

metrics: Union[List, str]

The metrics to retrieve.

start_timestamp: Union[str, datetime.datetime]

The start timestamp for the data retrieval.

end_timestamp: Union[str, datetime.datetime]

The end timestamp for the data retrieval.

as_dict: bool

Whether to return the data as a dictionary.

headers: Optional[Union[Dict[str, str]]]

Headers to include in the request.

timeout: int

The timeout for the request.

location: Optional[Union[List, str]]

The location name or List of location names. This is a legacy parameter, and it is deprecated. Please use the locations parameter instead.

Returns#

Union[DataFrame, Dict[str, Union[Dict[str, DataFrame], Dict[str, Any]]]]

The retrieved data.

async get_null_timestamps(sites: List | str | None, locations: List | str | None, metrics: List | str, start_timestamp: str | datetime.datetime, end_timestamp: str | datetime.datetime, as_dict: bool = False, headers: Dict[str, str] | None = None, timeout: int = 3600, location: List | str | None = None, method: str = 'GET') pandas.DataFrame | Dict[str, Dict[str, pandas.DataFrame] | Dict[str, Any]]#

Get the list of timestamps at which the selected metrics have null values in the specified time range asynchronously.

Parameters#

sites: Optional[Union[List, str]]

The sites to filter the data.

locations: Optional[Union[List, str]]

The locations to filter the data.

metrics: Union[List, str]

The metrics to retrieve.

start_timestamp: Union[str, datetime.datetime]

The start timestamp for the data retrieval.

end_timestamp: Union[str, datetime.datetime]

The end timestamp for the data retrieval.

as_dict: bool

Whether to return the data as a dictionary.

headers: Optional[Union[Dict[str, str]]]

Headers to include in the request.

timeout: int

The timeout for the request.

location: Optional[Union[List, str]]

The location name or List of location names. This is a legacy parameter, and it is deprecated. Please use the locations parameter instead.

Returns#

Union[DataFrame, Dict[str, Union[Dict[str, DataFrame], Dict[str, Any]]]]

The retrieved data.

async get_availability(sites: List | str | None, locations: List | str | None, metrics: List | str, start_timestamp: str | datetime.datetime, end_timestamp: str | datetime.datetime, granularity: str | int, sampling_interval_seconds: int | None = None, as_dict: bool = False, headers: Dict[str, str] | None = None, timeout: int = 3600, location: List | str | None = None, method: str = 'GET') pandas.DataFrame | Dict[str, Dict[str, pandas.DataFrame] | Dict[str, Any]]#

Get the data availability of the selected metrics for the specified time range, aggregated at the given granularity, asynchronously.

Parameters#

sites: Optional[Union[List, str]]

The sites to filter the data.

locations: Optional[Union[List, str]]

The locations to filter the data.

metrics: Union[List, str]

The metrics to retrieve.

start_timestamp: Union[str, datetime.datetime]

The start timestamp for the data retrieval.

end_timestamp: Union[str, datetime.datetime]

The end timestamp for the data retrieval.

granularity: Union[str, int]

The granularity of the data, can be a string, or an integer number of seconds. String values are restricted to “day”, “week”, “calendarmonth”, “30days”, or “365days”. If “calendarmonth” is used, the availability will refer to the specific calendar month (e.g. January 2023), and not to a rolling period of 30 days.

sampling_interval_seconds: Optional[int]

The sampling interval in seconds. If None, the default value is used, which is 600 seconds (10 minutes).

as_dict: bool

Whether to return the data as a dictionary.

headers: Optional[Union[Dict[str, str]]]

Headers to include in the request.

timeout: int

The timeout for the request.

threads: Optional[int]

The number of threads to use for the request.

location: Optional[Union[List, str]]

The location name or List of location names. This is a legacy parameter, and it is deprecated. Please use the locations parameter instead.

Returns#

Union[DataFrame, Dict[str, Union[Dict[str, DataFrame], Dict[str, Any]]]]

The retrieved data.

async get_stats_predefined_intervals(sites: List | str | None, locations: List | str | None, metrics: List | str, as_dict: bool = False, headers: Dict[str, str] | None = None, timeout: int = 3600, method: str = 'GET') Dict[str, Any]#

Async version of get_stats_predefined_intervals.

to_category_value(data: pandas.DataFrame | Dict[str, Dict[str, pandas.DataFrame]], metrics_overview: pandas.DataFrame, keep_stat_in_metric_name: bool = False) pandas.DataFrame#

Categorize the data based on the metrics overview.

Parameters#

data: Union[pd.DataFrame, Dict[str, Dict[str, pd.DataFrame]]]

The data to be categorized. It can be either a DataFrame or a dictionary of DataFrames.

metrics_overview: pd.DataFrame

A DataFrame containing the information about the metrics.

keep_stat_in_metric_name: bool, optional

Whether to keep the statistic in the metric name. Default is False.

Returns#

pd.DataFrame

The data in category-value format, based on the metrics overview.

Notes#

The function performs the following steps:

  1. Transforms the data dictionary into a DataFrame if necessary.
  2. Resets the index and converts the timestamp column to datetime.
  3. Melts the data to long format.
  4. Merges the melted data with the metrics overview DataFrame.
  5. Renames columns for consistency.
  6. Extracts latitude and heading information from the metric names.
  7. Extracts sub-assembly information from the metric names.
  8. Reorders the columns.
  9. Optionally appends the statistic to the metric name.
  10. Drops the rows where the metric name is “index”, “site” or “location”.
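The melt and extraction steps can be sketched in plain Python as follows. This is a simplified illustration, not the function's implementation: it assumes that the LAT and DEG tokens in a metric name are followed by digits (as in LAT005 and DEG000), and the melt and extract_number helpers are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Wide-format input: one column per metric, plus a timestamp column.
wide = {
    "timestamp": ["2021-01-01", "2021-01-02"],
    "mean_WF_A01_TP_SG_LAT005_DEG000": [1.0, 1.1],
    "mean_WF_A02_TP_SG_LAT005_DEG000": [2.0, 2.1],
}


def extract_number(metric_name: str, prefix: str) -> Optional[float]:
    """Return the digits following `prefix` (e.g. 'LAT005' -> 5.0), or None."""
    found = re.search(prefix + r"(\d+)", metric_name)
    return float(found.group(1)) if found else None


def melt(data: Dict[str, list]) -> List[dict]:
    """Turn wide data into one row per (timestamp, metric) pair."""
    rows = []
    for name, values in data.items():
        if name == "timestamp":
            continue
        for timestamp, value in zip(data["timestamp"], values):
            rows.append(
                {
                    "timestamp": timestamp,
                    "full_metric_name": name,
                    "value": value,
                    "lat": extract_number(name, "LAT"),
                    "heading": extract_number(name, "DEG"),
                }
            )
    return rows


long_rows = melt(wide)
```

Each wide column becomes one row per timestamp, so two metrics over two timestamps yield four rows.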

Example#

>>> import pandas as pd
>>> from api_24sea.datasignals.core import to_category_value
>>> data = {
...     'timestamp': ['2021-01-01', '2021-01-02'],
...     'mean_WF_A01_TP_SG_LAT005_DEG000': [1.0, 1.1],
...     'mean_WF_A02_TP_SG_LAT005_DEG000': [2.0, 2.1]
... }
>>> metrics_overview = pd.DataFrame({
...     'metric': ['mean_WF_A01_TP_SG_LAT005_DEG000',
...                'mean_WF_A02_TP_SG_LAT005_DEG000'],
...     'short_hand': ['TP_SG_LAT005_DEG000', 'TP_SG_LAT005_DEG000'],
...     'statistic': ['mean', 'mean'],
...     'unit': ['unit', 'unit'],
...     'site': ['WindFarm', 'WindFarm'],
...     'location': ['WFA01', 'WFA02'],
...     'data_group': ['SG', 'SG'],
...     'site_id': ['WF', 'WF'],
...     'location_id': ['A01', 'A02']
... })
>>> categorized = to_category_value(data, metrics_overview)
>>> categorized
+------------+--------------------------------+-------+------+-----------+---------------------+---------+-------------+-----+---------+-----------+----------+--------------+
| timestamp  | full_metric_name               | value | unit | statistic | short_hand          | site_id | location_id | lat | heading | site      | location | metric_group |
+============+================================+=======+======+===========+=====================+=========+=============+=====+=========+===========+==========+==============+
| 2021-01-01 | mean_WF_A01_TP_SG_LAT005_DEG000| 1.0   | unit | mean      | TP_SG_LAT005_DEG000 | WF      | A01         | 5.0 | 0.0     | WindFarm  | WFA01    | SG           |
+------------+--------------------------------+-------+------+-----------+---------------------+---------+-------------+-----+---------+-----------+----------+--------------+
| 2021-01-02 | mean_WF_A01_TP_SG_LAT005_DEG000| 1.1   | unit | mean      | TP_SG_LAT005_DEG000 | WF      | A01         | 5.0 | 0.0     | WindFarm  | WFA01    | SG           |
+------------+--------------------------------+-------+------+-----------+---------------------+---------+-------------+-----+---------+-----------+----------+--------------+
| 2021-01-01 | mean_WF_A02_TP_SG_LAT005_DEG000| 2.0   | unit | mean      | TP_SG_LAT005_DEG000 | WF      | A02         | 5.0 | 0.0     | WindFarm  | WFA02    | SG           |
+------------+--------------------------------+-------+------+-----------+---------------------+---------+-------------+-----+---------+-----------+----------+--------------+
| 2021-01-02 | mean_WF_A02_TP_SG_LAT005_DEG000| 2.1   | unit | mean      | TP_SG_LAT005_DEG000 | WF      | A02         | 5.0 | 0.0     | WindFarm  | WFA02    | SG           |
+------------+--------------------------------+-------+------+-----------+---------------------+---------+-------------+-----+---------+-----------+----------+--------------+
to_star_schema(data: pandas.DataFrame | Dict[str, List[Dict[str, Any]]], metrics_overview: pandas.DataFrame | None = None, as_dict: bool = False, convert_object_columns_to_string: bool = False, _username: str | None = None, _password: str | None = None) Dict[str, Any] | pandas.DataFrame | None#

Transforms the data and metrics_overview into a star schema format for analytical purposes.

Parameters#

data: Union[pd.DataFrame, Dict[str, list[dict[str, Any]]]]

A DataFrame or dictionary representing the raw data. The keys are column names, and the values are lists of data. Must include a “timestamp” column or have indices that can be converted to timestamps.

metrics_overview: pd.DataFrame

A DataFrame containing metadata for metrics, including the following required columns:

  • ‘metric’: The metric names (must match column names in data).
  • ‘short_hand’: Short descriptive names for the metrics.
  • ‘description’: Detailed descriptions of the metrics.
  • ‘statistic’: Aggregation or statistical operation (e.g., mean, std).
  • ‘unit_str’: The units for the metrics.
  • ‘location’: Location identifiers.
  • ‘site’: Windfarm identifiers.
  • ‘data_group’: Grouping of data (e.g., “scada”).

as_dict: bool, optional

If True, the data will be returned as a dictionary. Default is False.

convert_object_columns_to_string: bool, optional

If True, convert object columns in the DataFrame to string. This is useful when importing the DataFrame into a database, so that the ‘value’ column can be stored as a float, since the non-float values will be stored as NULL. Default is False.

_username: Optional[str]

The username for authentication. If None, the username will be inferred from the environment variables.

_password: Optional[str]

The password for authentication. If None, the password will be inferred from the environment variables.

Returns#

dict[str, pd.DataFrame]

A dictionary containing the following tables:

  • ‘FactData’: The fact table linking metrics to timestamps,
    locations, metric IDs, and their values as columns.
  • ‘FactPivotData’: The fact table in pivot format, i.e. containing
    timestamp, location, and “statistic” + “short_hand” metric names
    as columns. This pivoted format is the one generally used by
    BI tools and databases such as InfluxDB.
  • ‘DimMetric’: Dimension table for metrics, including metric ID,
    short name, description, statistic, and unit.
  • ‘DimWindFarm’: Dimension table for wind farms, including
    locations and sites.
  • ‘DimCalendar’: Dimension table for time, including date parts
    (year, month, day, hour, minute).
  • ‘DimDataGroup’: Dimension table for data groups.
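To show how fact and dimension tables relate, the sketch below derives a small metric dimension, a calendar dimension, and a fact table from wide data in plain Python. It is a simplified illustration under assumed column names (metric_id, metric, value); the tables returned by to_star_schema contain more columns and may be keyed differently.

```python
from datetime import datetime

data = {
    "timestamp": ["2020-01-01T00:00:00Z", "2020-01-01T00:10:00Z"],
    "mean_WF_A01_winddirection": [257.445, 262.03],
    "std_WF_A01_windspeed": [1.5165, 1.7966],
}

# Metric dimension: one row per metric with a surrogate key
# (hypothetical column names).
metric_names = [name for name in data if name != "timestamp"]
dim_metric = [
    {"metric_id": i, "metric": name} for i, name in enumerate(metric_names, start=1)
]
metric_ids = {row["metric"]: row["metric_id"] for row in dim_metric}


def calendar_row(timestamp: str) -> dict:
    """Split a timestamp into the date parts kept by a calendar dimension."""
    moment = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return {
        "timestamp": timestamp,
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "hour": moment.hour,
        "minute": moment.minute,
    }


dim_calendar = [calendar_row(ts) for ts in data["timestamp"]]

# Fact table: one row per (timestamp, metric) pair, referring to the
# metric dimension through metric_id.
fact_data = [
    {"timestamp": ts, "metric_id": metric_ids[name], "value": value}
    for name in metric_names
    for ts, value in zip(data["timestamp"], data[name])
]
```

Keeping descriptive attributes in the dimension tables means the fact table only carries keys and values, which is what makes the layout convenient for BI tools.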

Raises#

ValueError

If required columns are missing in data or metrics_overview.

KeyError

If the metric column in metrics_overview contains values not present in data.

Example#

>>> import pandas as pd
>>> from api_24sea.datasignals.core import to_star_schema
>>> data = {
...     'timestamp': ['2020-01-01T00:00:00Z', '2020-01-01T00:10:00Z'],
...     'mean_WF_A01_winddirection': [257.445, 262.03],
...     'std_WF_A01_windspeed': [1.5165, 1.7966]
... }
>>> metrics_overview = pd.DataFrame({
...     'metric': ['mean_WF_A01_winddirection', 'std_WF_A01_windspeed'],
...     'short_hand': ['winddirection', 'windspeed'],
...     'description': ['Wind direction', 'Wind speed'],
...     'statistic': ['mean', 'std'],
...     'unit_str': ['°', 'm/s'],
...     'location': ['WFA01', 'WFMA4'],
...     'site': ['windfarm', 'windfarm'],
...     'data_group': ['scada', 'scada']
... })
>>> result = to_star_schema(data, metrics_overview)
>>> for key, df in result.items():
...     print(f"{key}: {df.to_markdown()}")