Preprocessing Module
This module contains functions for preprocessing raw price time series prior to any model training. They are used to create a dataset suitable for deep learning with long short-term memory (LSTM) networks.
Author: Oliver Boom
GitHub alias: OliverJBoom

Foresight.preprocessing.clean_data(df, n_std=20)
Removes any outliers that lie more than a chosen number of standard deviations from the mean.
These values are most likely incorrectly entered data, so they are replaced by forward filling.
Parameters:  df (pd.DataFrame) – A time series
 n_std (int) – The number of standard deviations from the mean
Returns: The cleaned time series
Return type: pd.DataFrame
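
The following is a minimal sketch of the behaviour described for clean_data, not the module's actual implementation. The function name clean_outliers and the sample data are hypothetical, and it assumes the mean and standard deviation are computed per column over the whole series.

```python
import pandas as pd


def clean_outliers(df: pd.DataFrame, n_std: float = 20) -> pd.DataFrame:
    """Mask values further than n_std standard deviations from the column mean, then forward fill."""
    mean = df.mean()
    std = df.std()
    # True where a value lies outside mean +/- n_std * std for its column
    is_outlier = (df - mean).abs() > n_std * std
    # Replace outliers with NaN, then carry the last valid value forward
    return df.mask(is_outlier).ffill()


# Hypothetical price series with one obviously wrong entry (5000.0)
prices = pd.DataFrame(
    {"price": [100.0, 101.0, 99.0, 100.0, 5000.0, 102.0, 98.0, 100.0, 101.0, 99.0]}
)
cleaned = clean_outliers(prices, n_std=2)
```

A small n_std is used here only because the sample is short; a single outlier in ten points cannot be more than about 2.85 sample standard deviations from the mean.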

Foresight.preprocessing.clean_dict_gen(universe_dict, verbose=True)
Generates a dictionary of cleaned DataFrames.
Parameters:  universe_dict (dict) – The dictionary of time series
 verbose (bool) – Whether to display the included instruments
Returns: The cleaned dictionary of time series
Return type: dict

Foresight.preprocessing.column_rename(universe_dict)
Appends the name of the instrument to the column names, to help keep track of the instruments in the full dataset.
Parameters: universe_dict (dict) – The dictionary of time series
Returns: The dictionary of time series
Return type: dict

Foresight.preprocessing.dimension_reduce(data_X, n_dim, verbose=True)
Performs PCA to reduce the dimensionality of the data.
Parameters:  data_X (np.array) – The dataset to perform reduction on
 n_dim (int) – Number of dimensions to reduce to
 verbose (bool) – Whether to display the explained variance
Returns: The reduced dataset
Return type: np.array
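
As an illustration of what a PCA reduction does, here is a sketch using a singular value decomposition in NumPy. It is not necessarily how dimension_reduce is implemented; the name pca_reduce and the random data are hypothetical.

```python
import numpy as np


def pca_reduce(data_X: np.ndarray, n_dim: int) -> np.ndarray:
    """Project the rows of data_X onto the first n_dim principal components."""
    centred = data_X - data_X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_dim].T


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))  # hypothetical dataset: 200 samples, 5 features
reduced = pca_reduce(data, n_dim=2)
```

The reduced array keeps one row per sample and has n_dim columns.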

Foresight.preprocessing.dimension_selector(data_X, thresh=0.98, verbose=True)
Calculates the number of dimensions required to reach a threshold level of variance.
Completes a PCA reduction to an increasing number of dimensions and calculates the total explained variance achieved for each reduction. Once the explained variance of a reduction is above the threshold, that number of dimensions is returned.
Parameters:  data_X (np.array) – The dataset to perform reduction on
 thresh (float) – The amount of variance that must be contained in the reduced dataset
 verbose (bool) – Whether to display the number of dimensions
Returns: The column dimensionality required to contain the threshold variance
Return type: int
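
The selection rule can be sketched as follows. This version computes the explained variance ratios from a single decomposition rather than repeating the reduction for each candidate dimension, so it is an equivalent shortcut under that assumption and not the module's code. The name select_n_dim and the synthetic data are hypothetical, and the threshold is treated as "at or above".

```python
import numpy as np


def select_n_dim(data_X: np.ndarray, thresh: float = 0.98) -> int:
    """Smallest number of principal components whose cumulative explained variance reaches thresh."""
    centred = data_X - data_X.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(explained)
    # Index of the first cumulative ratio at or above thresh, converted to a count
    n_dim = int(np.searchsorted(cumulative, thresh)) + 1
    # Guard against floating point round-off when thresh is very close to 1
    return min(n_dim, len(cumulative))


# Hypothetical data: six columns driven by two latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
data = latent @ mixing + 1e-3 * rng.normal(size=(500, 6))
n_dim = select_n_dim(data, thresh=0.98)
```

Because only two factors drive the six columns, at most two dimensions are needed to pass the threshold.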

Foresight.preprocessing.feature_spawn(df)
Takes a time series and spawns several new features that explicitly detail information about the series.
The returned DataFrame contains the following features for each column in the input DataFrame:
 Exponentially weighted moving averages with various half-lives:
 1 day, 1 week, 1 month, 1 quarter, 6 months, 1 year
 Rolling volatility over different window sizes:
 1 week, 1 month, 1 quarter
Parameters: df (pd.DataFrame) – The dataset of independent variables
Returns: The DataFrame containing spawned features
Return type: pd.DataFrame
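
A rough sketch of how such features could be generated with pandas is shown here. The trading-day equivalents for each period, the column naming scheme, and the choice to compute volatility directly on the input columns are all assumptions made for illustration; the module may differ on each point.

```python
import numpy as np
import pandas as pd

# Assumed trading-day equivalents for the periods listed in the description
HALF_LIVES = {"1d": 1, "1w": 5, "1m": 22, "1q": 66, "6m": 132, "1y": 260}
VOL_WINDOWS = {"1w": 5, "1m": 22, "1q": 66}


def spawn_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build EWMA and rolling standard deviation features for every column of df."""
    features = {}
    for col in df.columns:
        for label, half_life in HALF_LIVES.items():
            features[f"{col}_ewma_{label}"] = df[col].ewm(halflife=half_life).mean()
        for label, window in VOL_WINDOWS.items():
            # The first window - 1 entries are NaN until the window is full
            features[f"{col}_vol_{label}"] = df[col].rolling(window).std()
    return pd.DataFrame(features, index=df.index)


rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(size=(100, 2)), columns=["a", "b"])  # hypothetical input
spawned = spawn_features(raw)
```

Each input column yields nine new columns: six moving averages and three rolling volatilities.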

Foresight.preprocessing.generate_dataset(universe_dict, price_only=True, lg_only=False)
Generates the full dataset.
Parameters:  universe_dict (dict) – The dictionary of time series
 lag (int) – The lag in days between series
 lg_only (bool) – Whether to return a dataset of log returns only
 price_only (bool) – Whether to return a dataset of raw prices only
Returns: The time series
Return type: pd.DataFrame

Foresight.preprocessing.generate_lg_return(df_full, lag=1)
Creates the log return series for each column in the DataFrame and returns the full dataset with log returns.
Parameters:  df_full (pd.DataFrame) – The time series
 lag (int) – The lag between the series (in days)
Returns: The DataFrame of time series with log returns
Return type: pd.DataFrame

Foresight.preprocessing.log_returns(series, lag=1)
Calculates the log returns between close prices separated by a constant lag, which is used across the whole series. For example, a lag of one gives the day-to-day log return.
Parameters:  series (np.array) – Prices to calculate the log returns on
 lag (int) – The lag between the series (in days)
Returns: The series of log returns
Return type: np.array
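
For a price series P and a lag k, the log return at time t is ln(P[t] / P[t - k]). A minimal sketch follows; the name log_returns_sketch is hypothetical, and this version returns len(series) - lag values, whereas the module's function may pad or align its output differently.

```python
import numpy as np


def log_returns_sketch(series, lag: int = 1) -> np.ndarray:
    """Log return between each price and the price lag steps earlier."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    series = np.asarray(series, dtype=float)
    # Divide each price by the price lag steps before it, then take the natural log
    return np.log(series[lag:] / series[:-lag])


prices = np.array([100.0, 110.0, 121.0])  # hypothetical close prices, +10% per step
daily = log_returns_sketch(prices, lag=1)
two_step = log_returns_sketch(prices, lag=2)
```

Log returns are additive over time, so the two-step return equals the sum of the two one-step returns.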

Foresight.preprocessing.price_rename(universe_dict)
Renames the value column of each DataFrame to price. The values are the market closing prices of the time series.
Parameters: universe_dict (dict) – The dictionary of time series
Returns: The dictionary of renamed time series
Return type: dict

Foresight.preprocessing.slice_series(data_X, data_y, series_len, dataset_pct=1.0)
Slices the train and target dataset time series.
Turns each time series into a sequence of rolling windows, with each window displaced one step forward from the previous one. Each window has an accompanying target value.
The result is an array of windows (with a depth equal to the number of instruments in the dataset), where each entry has a corresponding target in the data_y array.
The resulting data_X array shape: [number of rolling windows, length of each series, number of instruments]
The resulting data_y array shape: [number of rolling windows, number of instruments]
Parameters:  data_X (np.array) – The dataset of time series
 data_y (np.array) – The target dataset of time series
 series_len (int) – The length of each time series window
 dataset_pct (float) – The proportion of the full dataset to include, expressed as a fraction (default 1.0)
Returns: Return type:
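
The rolling-window slicing can be sketched as below. The name slice_windows is hypothetical, dataset_pct is omitted, and the choice of target (the data_y row immediately after each window) is an assumption, since the alignment between a window and its target is not stated here.

```python
import numpy as np


def slice_windows(data_X: np.ndarray, data_y: np.ndarray, series_len: int):
    """Cut [time, instruments] arrays into overlapping windows with one target row each."""
    n_windows = data_X.shape[0] - series_len
    # Each window starts one time step later than the previous one
    X = np.stack([data_X[i:i + series_len] for i in range(n_windows)])
    # Assumed alignment: the target is the row that follows the window
    y = np.stack([data_y[i + series_len] for i in range(n_windows)])
    return X, y


# Hypothetical data: 10 time steps, 3 instruments
series = np.arange(30, dtype=float).reshape(10, 3)
windows, targets = slice_windows(series, series, series_len=4)
```

With 10 time steps and a window length of 4, this gives windows of shape [6, 4, 3] and targets of shape [6, 3], matching the shape layout described for data_X and data_y.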

Foresight.preprocessing.truncate_window_length(universe_dict)
Truncates all of the DataFrames so that they cover the same date range.
Parameters: universe_dict (dict) – The dictionary of time series
Returns: The dictionary of truncated time series
Return type: dict
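
One way to truncate a dictionary of DataFrames to a shared date range is sketched here. It assumes each DataFrame has a sorted DatetimeIndex; the name truncate_to_common_dates and the instrument names are hypothetical, and the module may determine the common range differently.

```python
import pandas as pd


def truncate_to_common_dates(universe_dict: dict) -> dict:
    """Keep only the rows between the latest start date and the earliest end date."""
    start = max(df.index.min() for df in universe_dict.values())
    end = min(df.index.max() for df in universe_dict.values())
    # Label-based slicing on a sorted DatetimeIndex includes both endpoints
    return {name: df.loc[start:end] for name, df in universe_dict.items()}


# Hypothetical instruments with overlapping but different date ranges
universe = {
    "copper": pd.DataFrame({"price": range(10)}, index=pd.date_range("2020-01-01", periods=10)),
    "zinc": pd.DataFrame({"price": range(11)}, index=pd.date_range("2020-01-05", periods=11)),
}
truncated = truncate_to_common_dates(universe)
```

Both series now run from 2020-01-05 to 2020-01-10.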

Foresight.preprocessing.universe_select(path, commodity_name, custom_list=None)
Selects the financial time series relevant to the selected commodities.
Parameters:  path (string) – Path to the folder containing the CSV files
 commodity_name (string) – The name of the metal or metals being inspected
 custom_list (list) – The names of the CSV files to be included in the dataset
Returns: The time series relevant to the commodities
Return type: dict