Preprocessing Module

This module contains functions for preprocessing raw price time series before any model training. They are used to build a dataset suitable for deep learning with long short-term memory (LSTM) networks.

Author: Oliver Boom
GitHub alias: OliverJBoom

Foresight.preprocessing.clean_data(df, n_std=20)[source]

Removes outliers that lie more than a chosen number of standard deviations from the mean.

Such values are most likely data entry errors, so they are replaced by forward filling.

Parameters:
  • df (pd.DataFrame) – A time series
  • n_std (int) – The number of standard deviations from the mean beyond which a value is treated as an outlier
Returns:

The cleaned time series

Return type:

pd.DataFrame
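
The outlier rule described above can be sketched with pandas as follows. This is a minimal illustration rather than the module's actual implementation: the function name `clean_data_sketch` is hypothetical, and it assumes the mean and standard deviation are computed per column over the whole series.

```python
import pandas as pd


def clean_data_sketch(df: pd.DataFrame, n_std: float = 20) -> pd.DataFrame:
    """Mask values far from the column mean, then forward fill them.

    Assumption: statistics are computed per column over the full series.
    """
    deviation = (df - df.mean()).abs()
    # Boolean DataFrame marking values beyond n_std standard deviations
    outliers = deviation > n_std * df.std()
    # Replace outliers with NaN, then carry the last valid value forward
    return df.mask(outliers).ffill()


prices = pd.DataFrame(
    {"price": [100.0, 101.0, 99.0, 100.0, 102.0, 5000.0,
               101.0, 100.0, 99.0, 101.0, 100.0]}
)
cleaned = clean_data_sketch(prices, n_std=2)
```

In this toy series the spurious 5000.0 entry is replaced by the preceding value. A small `n_std` is used only because the sample is short; the default of 20 flags only extreme values.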

Foresight.preprocessing.clean_dict_gen(universe_dict, verbose=True)[source]

Generates a dictionary of cleaned DataFrames.

Parameters:
  • universe_dict (dict) – The dictionary of time series
  • verbose (bool) – Whether to display the included instruments
Returns:

The cleaned dictionary of time series

Return type:

dict

Foresight.preprocessing.column_rename(universe_dict)[source]

Appends the name of the instrument to the column names, to help keep track of the instruments in the full dataset.

Parameters: universe_dict (dict) – The dictionary of time series
Returns: The dictionary of time series
Return type: dict

Foresight.preprocessing.dimension_reduce(data_X, n_dim, verbose=True)[source]

Performs PCA to reduce the dimensionality of the data.

Parameters:
  • data_X (np.array) – The dataset to perform reduction on
  • n_dim (int) – Number of dimensions to reduce to
  • verbose (bool) – Whether to display the explained variance
Returns:

The reduced dataset

Return type:

np.array
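
As an illustration of the reduction, the sketch below performs PCA with NumPy's singular value decomposition. This is a swapped-in technique chosen to keep the example self-contained; the source does not say which PCA implementation the module uses, and the name `pca_reduce_sketch` is hypothetical.

```python
import numpy as np


def pca_reduce_sketch(data_X: np.ndarray, n_dim: int):
    """Project data_X onto its first n_dim principal components."""
    centered = data_X - data_X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    reduced = centered @ Vt[:n_dim].T
    return reduced, explained_ratio[:n_dim]


rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
# Six columns built from two underlying factors plus a little noise
data_X = factors @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(200, 6))
reduced, ratio = pca_reduce_sketch(data_X, n_dim=2)
```

Because the six columns are driven by two factors, two components should retain almost all of the variance, which is the quantity the `verbose` flag is described as displaying.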

Foresight.preprocessing.dimension_selector(data_X, thresh=0.98, verbose=True)[source]

Calculates the number of dimensions required to reach a threshold level of variance.

Performs a PCA reduction to an increasing number of dimensions and calculates the total variance retained by each reduction. The first number of dimensions whose retained variance is above the threshold is returned.

Parameters:
  • data_X (np.array) – The dataset to perform reduction on
  • thresh (float) – The amount of variance that must be contained in the reduced dataset
  • verbose (bool) – Whether to display the number of dimensions
Returns:

The column dimensionality required to contain the threshold variance

Return type:

int
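
The selection rule can be sketched as below. Instead of repeating the PCA for every candidate dimensionality, this version computes one decomposition and walks the cumulative explained variance, which gives the same answer. The name `dimension_selector_sketch` is hypothetical, and treating a total equal to the threshold as sufficient is an assumption.

```python
import numpy as np


def dimension_selector_sketch(data_X: np.ndarray, thresh: float = 0.98) -> int:
    """Return the smallest number of components reaching thresh variance."""
    centered = data_X - data_X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / np.sum(variance)
    for n_dim, total in enumerate(cumulative, start=1):
        if total >= thresh:
            return n_dim
    # Guard against floating point round-off on the final cumulative value
    return data_X.shape[1]


rng = np.random.default_rng(1)
# Three independent columns with variances of roughly 100, 1 and 0.01
data_X = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])
n_default = dimension_selector_sketch(data_X)
n_strict = dimension_selector_sketch(data_X, thresh=0.999)
```

With these variances the first component carries about 99% of the total, so the default threshold needs one dimension and the stricter threshold needs two.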

Foresight.preprocessing.feature_spawn(df)[source]

Takes a time series and spawns several new features that explicitly detail information about the series.

For each column in the input DataFrame, the spawned DataFrame contains the following features:

  • Exponentially weighted moving averages with half-lives of 1 day, 1 week, 1 month, 1 quarter, 6 months and 1 year
  • Rolling volatility over windows of 1 week, 1 month and 1 quarter
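
A rough sketch of how such features might be generated with pandas is shown here. The conversion of each period into trading days, the column suffixes, and the reading of rolling vol as a rolling standard deviation of the column itself are all assumptions made for illustration; `feature_spawn_sketch` is a hypothetical name.

```python
import pandas as pd

# Assumed trading-day equivalents of the periods named in the text
HALF_LIVES = {"1d": 1, "1w": 5, "1m": 22, "1q": 66, "6m": 132, "1y": 260}
VOL_WINDOWS = {"1w": 5, "1m": 22, "1q": 66}


def feature_spawn_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Spawn EWMA and rolling volatility features for every column."""
    features = {}
    for col in df.columns:
        for label, half_life in HALF_LIVES.items():
            features[f"{col}_ewma_{label}"] = df[col].ewm(halflife=half_life).mean()
        for label, window in VOL_WINDOWS.items():
            # Rolling standard deviation; the first window - 1 rows are NaN
            features[f"{col}_vol_{label}"] = df[col].rolling(window).std()
    return pd.DataFrame(features, index=df.index)


prices = pd.DataFrame({"price": [100.0 + 0.5 * i for i in range(100)]})
spawned = feature_spawn_sketch(prices)
```

Each input column yields nine feature columns: six moving averages and three volatility measures.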
Parameters: df (pd.DataFrame) – The dataset of independent variables
Returns: The DataFrame containing spawned features
Return type: pd.DataFrame

Foresight.preprocessing.generate_dataset(universe_dict, price_only=True, lg_only=False)[source]

Generates the full dataset.

Parameters:
  • universe_dict (dict) – The dictionary of time series
  • lag (int) – The lag in days between series
  • lg_only (bool) – Whether to return a dataset of log returns only
  • price_only (bool) – Whether to return a dataset of raw prices only
Returns:

The full dataset of time series

Return type:

pd.DataFrame

Foresight.preprocessing.generate_lg_return(df_full, lag=1)[source]

Creates the log return series for each column in the DataFrame and returns the full dataset with log returns.

Parameters:
  • df_full (pd.DataFrame) – The time series
  • lag (int) – The lag between the series (in days)
Returns:

The DataFrame of time series with log returns

Return type:

pd.DataFrame

Foresight.preprocessing.log_returns(series, lag=1)[source]

Calculates the log returns between close prices. A constant lag is used across the whole series; for example, a lag of one gives day-to-day log returns.

Parameters:
  • series (np.array) – Prices to calculate the log returns on
  • lag (int) – The lag between the series (in days)
Returns:

The series of log returns

Return type:

np.array
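
For a price P at time t, the log return with a given lag is ln(P[t] / P[t - lag]). A minimal NumPy sketch is given below; the name `log_returns_sketch` is hypothetical, and returning an array that is `lag` entries shorter than the input is an assumption, since the source does not say how the leading entries are handled.

```python
import numpy as np


def log_returns_sketch(series, lag: int = 1) -> np.ndarray:
    """Log return between each price and the price lag steps earlier."""
    series = np.asarray(series, dtype=float)
    # Output has len(series) - lag entries (assumed convention)
    return np.log(series[lag:] / series[:-lag])


prices = np.array([100.0, 110.0, 121.0])
daily = log_returns_sketch(prices, lag=1)
two_day = log_returns_sketch(prices, lag=2)
```

Log returns are additive over time, so the two-day return here equals the sum of the two daily returns.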

Foresight.preprocessing.price_rename(universe_dict)[source]

Renames the value column of each DataFrame to price. The values are the market closing prices of the time series.

Parameters: universe_dict (dict) – The dictionary of time series
Returns: The dictionary of renamed time series
Return type: dict

Foresight.preprocessing.slice_series(data_X, data_y, series_len, dataset_pct=1.0)[source]

Slices the training and target time series into rolling windows.

Turns each time series into a sequence of overlapping windows, each displaced one step forward from the previous one. Each window has an accompanying target value.

The result is an array of windows (whose depth equals the number of instruments in the dataset), with each entry in this array having a corresponding target in the data_y array.

The resulting data_X array shape: [number of rolling windows, length of each series, number of instruments]

The resulting data_y array shape: [number of rolling windows, number of instruments]

Parameters:
  • data_X (np.array) – The dataset of time series
  • data_y (np.array) – The target dataset of time series
  • series_len (int) – The length of each time series window
  • dataset_pct (float) – The fraction of the full dataset to include (1.0 includes all of it)
Returns:

Return type:
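
The windowing described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the target for each window is taken to be the row of data_y immediately after the window, dataset_pct is applied to the number of windows, and the name `slice_series_sketch` is hypothetical.

```python
import numpy as np


def slice_series_sketch(data_X, data_y, series_len, dataset_pct=1.0):
    """Build rolling windows of data_X with one target row each.

    Assumes at least one window fits; np.stack fails on an empty list.
    """
    n_windows = int((len(data_X) - series_len) * dataset_pct)
    # Window i covers rows i .. i + series_len - 1
    X_windows = np.stack([data_X[i:i + series_len] for i in range(n_windows)])
    # Assumed target: the row directly after each window
    y_targets = np.stack([data_y[i + series_len] for i in range(n_windows)])
    return X_windows, y_targets


data_X = np.arange(30, dtype=float).reshape(10, 3)  # 10 steps, 3 instruments
data_y = data_X + 0.5
X_windows, y_targets = slice_series_sketch(data_X, data_y, series_len=4)
```

Ten time steps with a window length of four give six windows, so the outputs have shapes (6, 4, 3) and (6, 3), matching the layouts listed above.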

Foresight.preprocessing.truncate_window_length(universe_dict)[source]

Truncates all of the DataFrames so that they cover the same range of dates.
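
One way to align the date ranges is to keep only the interval that every series covers, as in the sketch below. It assumes each DataFrame has a sorted DatetimeIndex; the name `truncate_window_sketch` and the instrument keys are hypothetical.

```python
import pandas as pd


def truncate_window_sketch(universe_dict: dict) -> dict:
    """Trim every DataFrame to the date range shared by all of them."""
    # Latest start date and earliest end date across the universe
    start = max(df.index.min() for df in universe_dict.values())
    end = min(df.index.max() for df in universe_dict.values())
    # Label slicing on a sorted DatetimeIndex includes both endpoints
    return {name: df.loc[start:end] for name, df in universe_dict.items()}


universe = {
    "copper": pd.DataFrame(
        {"price": range(10)}, index=pd.date_range("2020-01-01", periods=10)
    ),
    "zinc": pd.DataFrame(
        {"price": range(11)}, index=pd.date_range("2020-01-05", periods=11)
    ),
}
truncated = truncate_window_sketch(universe)
```

Both series in the result run from 2020-01-05 to 2020-01-10, the only dates they have in common.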

Parameters: universe_dict (dict) – The dictionary of time series
Returns: The dictionary of truncated time series
Return type: dict

Foresight.preprocessing.universe_select(path, commodity_name, custom_list=None)[source]

Selects the financial time series relevant for the commodities selected.

Parameters:
  • path (string) – Path to the folder containing the CSV files
  • commodity_name (string) – The name of the metal(s) being inspected
  • custom_list (list) – The names of the CSV files to include in the dataset
Returns:

The time series relevant to the commodities

Return type:

dict