Forecasting day-ahead spot electricity prices using deep neural networks with attention mechanism

This paper presents a novel approach to forecasting hourly day-ahead electricity prices. In recent years, many predictive models based on statistical methods and machine learning (deep learning) techniques have been proposed. However, the approach presented in this paper focuses on the problem of constructing a fair and unbiased model. In the considered case, unbiased means that the model can increase prediction accuracy and decrease categorical bias across different data clusters. For this purpose, a model combining techniques such as the long short-term memory (LSTM) recurrent neural network, an attention mechanism, and clustering is created. The proposed model’s main feature is that the attention weights for the LSTM hidden states are calculated with a context vector given for each sample individually as the center of the cluster to which the sample belongs. During training, the samples are iteratively (once per epoch) clustered based on the representation vectors given by the attention mechanism. In the empirical study, the proposed model was applied and evaluated on Nord Pool market data. To confirm that the model decreases categorical bias, the obtained results were compared with the results of similar LSTM models without the proposed attention mechanism.


INTRODUCTION
Since the early 1990s, energy markets have played increasingly important roles in power systems worldwide because of the deregulation process. Forecasting energy demand and day-ahead prices is a vital issue for all market participants. Accurate day-ahead price forecasting in the spot market helps power suppliers adjust their bidding strategies to achieve maximum benefit. On the other hand, consumers can derive a plan to maximize their utility using electricity purchased from the pool or use self-production to protect themselves against high prices [1].
Time series of electricity prices tend to have complex features such as nonstationarity, nonlinearity, and high volatility, making energy price forecasting difficult. One of the most widely used and most powerful groups of models is time series models. Weron [2][3][4] reviewed the approaches to modeling and forecasting day-ahead electricity prices. He also found that an approach in which each hour is forecasted separately gives better results than an approach in which forecasts are made for the whole day at once; however, both approaches are equally popular. Common statistical methods are: autoregressive (AR) and autoregressive with exogenous inputs (ARX) models [5], double seasonal Holt-Winters (DSHW) models [6], threshold ARX (TARX) models [7,8], autoregressive integrated moving average (ARIMA) models [9,10], semi/nonparametric models [5,11], generalized autoregressive conditional heteroscedasticity (GARCH)-based models [12][13][14], and dynamic regression (DR) and transfer function (TF) models [15]. Next to statistical models, computational intelligence techniques are widely used in electricity price forecasting due to their strong nonlinear modeling capabilities. Szkuta et al. [16] proposed a three-layered ANN with back-propagation for modeling and predicting Victorian electricity market data. Wang et al. [17] proposed a neural-network-based approach to predict system marginal prices, also considering weekends and public holidays as input. A cascaded neural network structure for market-clearing price prediction in the New England market was presented by Zhang et al. [18]. Over the last decade, several innovations have been introduced in the field of neural networks that have led to the development of deep learning. Forecasting electricity prices using deep learning techniques, e.g., deep recurrent neural networks, is also presented in many papers [19][20][21][22][23][24].
Electricity prices display a set of relatively unique attributes: a constant balance between production and consumption [25]; dependence of consumption on time, e.g., the hour of the day, day of the week, and time of the year; load and generation influenced by external weather conditions [25]; and the influence of neighboring markets [4]. Due to these characteristics, as shown in many studies, forecasting errors differ across different groups of data [22,26]. Natural data groups are those resulting from dividing the data by time, e.g., according to seasons, months, days of the week, or hours of the day. Other groups that are more difficult to identify are those resulting from dividing the data according to external factors, such as weather conditions (temperature or wind force) and fuel prices, e.g., natural gas, oil, and coal. In practice, each of these groups may be represented by a different number of samples in the dataset. It has also been shown that algorithms trained on biased data lead to algorithmic discrimination [27,28]. Recently, comparative tests have emerged to quantify discrimination [29,30], as well as datasets designed to evaluate these algorithms [31]. Therefore, this article takes on the challenge of integrating debiasing capabilities directly into the model through a training process that adjusts automatically and unattended to deficiencies in the training data. This approach comprises a comprehensive deep learning algorithm that simultaneously learns to forecast electricity prices for the next day and to cluster the training data in an unsupervised manner.

METHODS
Before proposing the deep learning algorithm for prediction and the training procedure, the problem specification, LSTM recurrent network, and attention mechanism are introduced.

Problem Statement
Consider the problem of predicting future values. Let $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ be a set of paired training data samples consisting of features (present and past values) $x^{(i)} \in \mathbb{R}^{m}$ and future values $y^{(i)} \in \mathbb{R}^{d}$. The aim is to find a functional mapping $f_\theta : x \mapsto y$, parameterized by $\theta$, that minimizes some loss $\mathcal{L}(\theta)$ over the given training dataset. In other words, we consider solving the following optimization problem:

$$\theta^{*} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\bigl(f_\theta(x^{(i)}), y^{(i)}\bigr).$$

For a new test sample $(x, y)$, the predictor should output $\hat{y} = f_\theta(x)$, where $\hat{y}$ is almost equal to $y$. Now, assume that each sample also has an associated latent vector $z \in \mathbb{R}^{k}$, which represents hidden features of the sample [32]. The notion of a biased predictor can be formalized as follows [33]: Definition 1. A predictor $f_\theta(x)$ is biased if its prediction changes after being exposed to additional sensitive feature inputs. This means that a predictor is fair with respect to a set of latent features $z$ if $f_\theta(x) = f_\theta(x, z)$.
A good example to understand this is the facial detection problem considered by Amini et al. [33]. When deciding whether an image contains a face or not, a person's skin color, gender, and even age are the primary latent variables and should not influence the classifier's decision. To ensure the reliability of the classifier with respect to different latent variables, the dataset should contain roughly uniform samples in the hidden space.
In other words, the training dataset should equally represent different categories over the latent space. Note that this is different from claiming that the dataset should be balanced over the classes. Moreover, this situation is even more natural in time series forecasting due to the lack of a division into classes. However, methods proposed in the literature [33][34][35] to generate training data that are more "fair" by resampling or generating new samples are difficult to apply to time series.

LSTM recurrent network
The LSTM was introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997 [36]. Unlike traditional recurrent neural networks, an LSTM network is well suited to learning from experience to identify and predict time series when there are very long time lags of unknown size. The main feature of the LSTM is the ability to remove or add information to the cell state, carefully regulated by three different structures called gates, namely the input, forget, and output gates. As shown in Figure 1, the state of each cell ($c_{t-1}$) passes through the LSTM cell to generate the state for the next step ($c_t$). Gates are a way to optionally let information flow along the state. They are composed of a sigmoid or tanh neural network layer and a pointwise multiplication operation.
The mathematical functions of the three gates are defined as:

$$i_t = \sigma\bigl(W_i [h_{t-1}, x_t] + b_i\bigr),$$
$$f_t = \sigma\bigl(W_f [h_{t-1}, x_t] + b_f\bigr),$$
$$o_t = \sigma\bigl(W_o [h_{t-1}, x_t] + b_o\bigr),$$
$$\tilde{c}_t = \tanh\bigl(W_c [h_{t-1}, x_t] + b_c\bigr),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$
$$h_t = o_t \odot \tanh(c_t),$$

where $i_t$ is the input gate, which controls how much information from the input ($x_t$) and the previous hidden state ($h_{t-1}$) is allowed to pass into the memory cell; $f_t$ is the forget gate, which controls how much information is forgotten before passing through the cell; $o_t$ is the output gate, which controls how much information from the current memory cell can be output to the hidden state; $c_t$ represents the cell state generated as an additional variable for the cell; $W$ denotes the weight matrices; and $b$ denotes the biases of each layer. The symbol $\odot$ represents the operation of pointwise multiplication [22]. Figure 1 shows the detailed structure within an LSTM cell [22].
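To make the gate equations concrete, the following is a minimal NumPy sketch of a single LSTM time step. The parameter names mirror the notation above; the dictionary-based parameter layout is an illustrative assumption, not the implementation used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W and b hold the parameters of the four layers."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state (pointwise ops)
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t
```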

Encoder-decoder with attention
Based on LSTM units, encoder-decoder networks [37] have become popular due to their success in machine translation. The main idea is to encode the source sentence as a fixed-length vector and use the decoder to generate a translation. One problem with encoder-decoder networks is that their performance deteriorates rapidly as the length of the input sequence increases [38]. In time series analysis, especially when working with high-frequency time series, this can be a concern. To resolve this issue, the attention-based encoder-decoder network [39] employs an attention mechanism to select parts of the hidden states across all time steps. Attention is a mechanism that provides a richer encoding of the source sequence from which to construct a context vector that the decoder can then use. The main difference between the encoder-decoder with the attention mechanism and the plain encoder-decoder model is that a different context vector is computed for every time step of the decoder. Let $h_j$, $j = 1, 2, \ldots, T$, be the hidden states of the encoder; then, the context vector $c_i$ is computed as a weighted sum of these hidden states:

$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j.$$

The weight $\alpha_{ij}$ of each hidden state $h_j$ is computed by

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j),$$

where $a$ is an alignment model that scores how well the inputs around position $j$ and the output at position $i$ match. The score is based on the previous hidden state $s_{i-1}$ of the decoder and the $j$-th hidden state of the input sentence. The alignment model can be a feedforward neural network that is jointly trained with all the other components of the system.
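The following is a hedged PyTorch sketch of this additive (Bahdanau-style) attention: a small MLP alignment model scores each encoder hidden state against a query vector (the previous decoder state, or, in our model, a context vector), and the context is the softmax-weighted sum of the hidden states. The class name and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim: int, attn_dim: int):
        super().__init__()
        # MLP alignment model a(query, h_j) with one tanh hidden layer.
        self.score_mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, query, hidden_states):
        # query: (batch, hidden_dim); hidden_states: (batch, T, hidden_dim)
        T = hidden_states.size(1)
        q = query.unsqueeze(1).expand(-1, T, -1)   # broadcast query over time
        e = self.score_mlp(torch.cat([q, hidden_states], dim=-1)).squeeze(-1)
        alpha = torch.softmax(e, dim=1)            # attention weights alpha_ij
        context = (alpha.unsqueeze(-1) * hidden_states).sum(dim=1)
        return context, alpha
```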

LSTM with attention for forecasting electricity prices
In this work, we propose to apply an LSTM deep neural network (LSTM-DNN) with a specific attention mechanism to predict day-ahead electricity prices. The architecture of the proposed model is shown in Figure 2.

Preprocessing and input/output

Figure 3 shows the high volatility of the electricity price in the SE1 region of the Nord Pool market. All prices are positive, and extremely high prices occasionally appear; these can be caused by shortages of power supply in the system. However, such extreme price values occur infrequently. For instance, for the SE1 region, during the seven years from 22 January 2013 to 31 December 2019, prices higher than 66 EUR/MWh (higher than the mean plus three sigmas) occurred less than 1% of the time. Therefore, to reduce the effect of abnormal events on the prediction performance, we replace the extreme prices with specific values: prices higher than the mean plus three sigmas are interpolated from their neighboring prices. After redefining the prices, we also transform the prices using the natural logarithm:

$$p'_{d,t} = \ln(p_{d,t}),$$

where $p_{d,t}$ is the electricity price on day $d$ at time step $t$.

As inputs, we can use various variables: historical prices or loads, weather conditions, holidays, the day of the week, oil prices, etc. Our research assumes that the price discounts everything, so all the factors mentioned above should already be reflected in the price. Hence, we use only historical prices as input. The actual price values on day $d$ are denoted as $\mathbf{p}_d = [p_{d,1}, p_{d,2}, \ldots, p_{d,T}]$ and the predicted values as $\hat{\mathbf{p}}_d = [\hat{p}_{d,1}, \hat{p}_{d,2}, \ldots, \hat{p}_{d,T}]$, where $\hat{p}_{d,t}$ is the predicted price at time step $t$. $T$ can be 24 for an hourly market (as in our case) and 48 for a half-hourly market. As input to the LSTM cell to predict the prices for day $d+1$, we use all the prices in the $L$ days before:

$$\mathbf{x}_{d+1} = [\mathbf{p}_{d-L+1}, \mathbf{p}_{d-L+2}, \ldots, \mathbf{p}_{d}].$$
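As a concrete illustration, the following sketch implements the preprocessing steps described above under the stated assumptions: extreme prices (above the mean plus three sigmas) are interpolated from their neighbors, prices are log-transformed, and inputs are built from the $L$ preceding days. The pandas-based helper and its names are illustrative, not the exact pipeline used in the paper.

```python
import numpy as np
import pandas as pd

def preprocess(prices: pd.Series, L: int = 21, T: int = 24):
    # Flag extreme prices (above mean + 3 sigma) and interpolate from neighbors.
    threshold = prices.mean() + 3 * prices.std()
    cleaned = prices.mask(prices > threshold)
    cleaned = cleaned.interpolate(method="linear", limit_direction="both")

    # Log transform: p'_{d,t} = ln(p_{d,t}); prices are assumed positive.
    log_prices = np.log(cleaned)

    # Reshape to (days, T); assumes the series covers whole days.
    daily = log_prices.to_numpy().reshape(-1, T)

    # Build (x, y) pairs: L past days of prices -> prices of the next day.
    X = np.stack([daily[d - L:d] for d in range(L, len(daily))])
    y = daily[L:]
    return X, y
```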

LSTM Encoder with attention and clustering
To generate a vector of latent features for each sample, each sample is projected into the feature space by feeding it through the LSTM encoder with the attention mechanism twice. The LSTM encoder is a simple LSTM model with a single hidden layer of LSTM units. This model returns the sequence of hidden states $h_t$ ($t = 1, 2, \ldots, T$), and it has one hyperparameter, the number of LSTM units, which determines the dimension of the hidden-state, cell-state, and gate vectors. The model used to compute the attention scores in the Attn block is a traditional multilayer perceptron (MLP) neural network with one hidden layer, whose number of neurons is a hyperparameter, and with the hyperbolic tangent (tanh) as the activation function. The input of the model is the concatenation of the hidden-state vectors $h_t$ and the context vector; the output is the vector of attention weights. In the first pass, each sample is passed through the attention mechanism with the context vector set to zero. Next, the clustering algorithm (k-means) is performed on the set of representation vectors $v$ generated in this way, with a given hyperparameter for the number of clusters. Then, in the second pass, each sample is passed through the attention mechanism with the context vector set to the center of the cluster to which the sample belongs. Finally, the representation vector generated in this way for each sample is passed to the predictive model.
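A hedged sketch of this two-pass encoding may help clarify the procedure: the first pass uses a zero context vector to obtain a representation vector for each sample, k-means clusters these vectors, and the second pass re-runs the attention with each sample's cluster center as its context. The function signature and the use of scikit-learn's KMeans are assumptions for illustration; the attention module follows the earlier sketch.

```python
import torch
from sklearn.cluster import KMeans

def encode_with_clusters(encoder, attention, X, num_clusters: int):
    # encoder: nn.LSTM(batch_first=True) returning (batch, T, hidden_dim)
    # attention: module mapping (context, hidden_states) -> (v, weights)
    hidden_states, _ = encoder(X)
    zero_ctx = hidden_states.new_zeros(X.size(0), hidden_states.size(-1))

    # Pass 1: representation vectors with a zero context vector.
    v_first, _ = attention(zero_ctx, hidden_states)

    # k-means is non-differentiable, so it runs on detached vectors
    # (re-computed once per epoch during training).
    km = KMeans(n_clusters=num_clusters, n_init=10)
    labels = km.fit_predict(v_first.detach().cpu().numpy())
    centers = torch.as_tensor(km.cluster_centers_, dtype=v_first.dtype)
    ctx = centers[torch.as_tensor(labels)]   # each sample's cluster center

    # Pass 2: representations conditioned on the cluster centers.
    v_final, _ = attention(ctx, hidden_states)
    return v_final
```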

Predictor
The model for predicting day-ahead prices is a simple MLP neural network with one hidden layer, with the rectified linear unit (ReLU) activation function in the hidden layer and a linear activation function in the output layer. The input of the model is the representation vector $v$, and the output is the vector of day-ahead prices that we intend to forecast. The model has one hyperparameter, which determines the number of neurons in the hidden layer.
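A minimal PyTorch sketch of such a predictor is given below; the class name and default sizes are illustrative.

```python
import torch.nn as nn

class Predictor(nn.Module):
    def __init__(self, repr_dim: int, hidden: int = 128, T: int = 24):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(repr_dim, hidden),
            nn.ReLU(),               # ReLU in the hidden layer
            nn.Linear(hidden, T),    # linear output: T day-ahead prices
        )

    def forward(self, v):
        return self.net(v)
```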

Training
In the proposed model, all parts of the model are jointly trained by minimizing the mean absolute percentage error (MAPE), defined as the average absolute difference between the actual value and the forecast value divided by the actual value:

$$\mathrm{MAPE} = \frac{100\%}{N T} \sum_{d=1}^{N} \sum_{t=1}^{T} \left| \frac{p_{d,t} - \hat{p}_{d,t}}{p_{d,t}} \right|,$$

where $p_{d,t}$ and $\hat{p}_{d,t}$ are the actual and forecast prices on day $d$ at time step $t$, respectively.
The mean absolute percentage error is chosen over the mean square error for a simple reason: because electricity prices have large spikes, the Euclidean norm would overemphasize the spiky prices. As the optimizer, we choose the Adam algorithm [40], a stochastic gradient descent method [41] that uses adaptive learning rates. With the Adam algorithm, the training procedure also uses early stopping [42], monitoring the error on the validation dataset to avoid overfitting.
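The following sketch illustrates this training setup under stated assumptions: a MAPE loss, the Adam optimizer, and patience-based early stopping on the validation error. The function names, the epsilon guard, and the exact early-stopping policy are illustrative; model(x) is assumed to return the predicted day-ahead prices.

```python
import copy
import torch

def mape_loss(y_hat, y, eps=1e-8):
    # Mean absolute percentage error; eps guards against division by zero.
    return (torch.abs(y - y_hat) / (torch.abs(y) + eps)).mean()

def train(model, train_loader, val_loader, epochs=200, patience=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state, wait = float("inf"), None, 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = mape_loss(model(x), y)
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            # Aggregate validation MAPE (sum of batch means, for comparison).
            val = sum(mape_loss(model(x), y).item() for x, y in val_loader)
        if val < best_val:                       # keep the best model so far
            best_val, wait = val, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            wait += 1
            if wait >= patience:                 # early stopping
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```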

EMPIRICAL STUDY
In this section, we perform an empirical study to evaluate the proposed model and analyze the results obtained by the various models. Our goal is to confirm that the proposed attention mechanism with clustering improves the accuracy of the forecasts and makes the model more unbiased. To do so, we evaluate three architectures of the model: • The simple vanilla LSTM model is the proposed model without the attention mechanism and clustering; the vector passed to the predictive model is set to the last hidden state of the LSTM, $v = h_T$. • The LSTM encoder with attention is the proposed model with attention but without clustering; the context vector is set to zero. • The proposed model, as described in the previous sections.

Data
For this research, we consider the public Nord Pool day-ahead market covering electricity prices from six countries divided into 14 regions, namely Sweden (SE1, SE2, SE3, and SE4), Finland (FI), Norway (Oslo, Kr.sand, Bergen, Molde, Tr.heim, and Tromsø), Estonia (EE), Latvia (LV), and Lithuania (LT), in the period from January 2013 to December 2019. The data are prepared using the preprocessing techniques described in Section Preprocessing and input/output, including the treatment of extremely high prices and the log-transformation of prices.
The data are divided into three sets: a training set (from 2013 to the end of 2017), a validation set (2018), and a test set (2019). There are 24 electricity prices per day. Hence, the training dataset comprises 602,808 data points to predict, while the validation and test datasets comprise 122,640 data points to predict each.

Hyperparameters
The hyperparameters to be chosen for the model are described above together with the architecture of the proposed model. To choose optimal values for the hyperparameters, we conducted a grid search over the tunable parameters. For the sake of conciseness, we present the results obtained for the optimal configuration of hyperparameters: $L = 21$ input days, 64 LSTM units, 128 hidden-layer neurons, and three different numbers of clusters, from {10, 20, 30}, to show the impact of the number of clusters.
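For illustration, such a grid search can be sketched as follows; the candidate values (other than the cluster counts {10, 20, 30} reported above) and the train_and_validate helper are hypothetical.

```python
from itertools import product

grid = {
    "L": [14, 21, 28],              # days of history (assumed candidates)
    "lstm_units": [32, 64, 128],    # LSTM hidden size (assumed candidates)
    "mlp_hidden": [64, 128, 256],   # hidden-layer size (assumed candidates)
    "num_clusters": [10, 20, 30],   # cluster counts reported in the paper
}

best_cfg, best_val = None, float("inf")
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    # train_and_validate is a hypothetical helper: it trains the model with
    # the given configuration and returns the validation MAPE.
    val_err = train_and_validate(cfg)
    if val_err < best_val:
        best_cfg, best_val = cfg, val_err
print(best_cfg, best_val)
```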

Results
To compare and analyze the predictive accuracy of the various models, we compute their MAPE on the test set. Each model was re-trained from scratch five times for added statistical robustness of the results. It is important to note that the predictors are not re-estimated when new data become available, i.e., the models were trained on data from 2013 to 2017, while the test data cover 2019. The obtained results are listed in Table 1 and shown in Figure 4. An example of forecasted prices is illustrated in Figure 5.
To demonstrate debiasing, we quantified the prediction performance on individual categories. Specifically, we considered different data groups resulting from dividing the data by region and by date and time (seasons, months, days of the week, hours, and peaks). From the results shown in Table 1, we can make several observations. As expected, the column "Overall" shows that adding the attention mechanism improved the predictions, and adding clustering improved them even more. There is also a tendency for increasing the number of clusters to improve the predictions. Moreover, for almost all divisions, the proposed model turned out to be the most unbiased predictor. As we can see, its standard deviation is the smallest and, in most cases, decreases as the number of clusters increases. The only exception is the division into hours, which may result from the fact that forecasts are made for the whole day, not individually for each hour. As shown in Figure 4, a greater debiasing force (an increasing number of clusters) improved the predictions for the "spring" category. This suggests that our model may debias with respect to a qualitative feature such as the season, which has a significant impact on its usefulness in improving the reliability of forecasting models. Contrary to the trend observed for spring days, the prediction errors in the "summer" category increase with an increasing number of clusters; we suspect that this may be related to other external factors. Additionally, the errors for the "autumn" and "winter" categories remained almost constant for both the biased and debiased models and were much better than those of the other categories. This suggests that our proposed model does not sacrifice performance on categories that already have high precision. As confirmed by Figure 4, the overall precision increased with increased debiasing power (an increasing number of clusters). Error bars (standard error of the mean) are shown in order to visualize the statistical significance of differences between the trained models. It is also worth noting that the differences in forecast quality between the categories are significant, confirming the need to develop methods that eliminate these issues.

CONCLUSION
In this paper, an LSTM deep neural network with an attention mechanism and clustering is devised for day-ahead electricity price forecasting, in which the context vector for each sample is given individually as the center of the cluster to which the sample belongs. By learning the latent variables in an unsupervised manner, we can scale this approach to large datasets without labeling a training set. We applied our proposed model to forecasting day-ahead electricity prices. Given a biased training dataset, our models show increased prediction accuracy and decreased categorical bias across various data categories compared with similar models without the proposed mechanisms. The next step in our research will be to include external factors (e.g., production, consumption, weather conditions, and oil prices) as input data and to extend the model with a decoder module based on the variational autoencoder model. These activities could contribute to achieving even better predictions and improve the learning of the latent structures in the data.