Introduction
Prerequisites
2.1 Import Libraries
2.2 Import Dataset
2.3 Preview Datasset
2.4 Dataset Structure
2.5 Time Series Preprocessing
Missing Values
Descriptive Statistics
Examine Stationarity
5.1 Definition of stationarity
5.2 Examine Stationarity Graphically
5.3 Examine Stationarity with Statistical Tests
5.3.1 ADF test
5.3.2 PP test
5.3.3 KPSS test
Identify Model
Build Time Series Model
Compare Models
Forecast Future Unemployment
9.1 ARIMA (0,2,1) forecasts
9.2 ARIMA (9,2,1) forecasts
Results

Introduction

[…editing…]

Background

Time series

Procedure

Short Answer

If you are in a hurry, I predicted that unemployment on Greece is expected to further reduce. It will range between 10% - 13% in February, 2023.

Prerequisites

Import Libraries

For this analysis we will need standard libraries for importing and processing my data, such as readr (Wickham, Hester, & Bryan, 2022) and dplyr (Wickham, François, Henry, Müller, & Vaughan, 2023). The kableExtra (Zhu, 2021) package was used to print the results in table format, while the flextable (Gohel & Skintzos, 2022) package was used to print the results of the Dickey-Fuller and KPSS tests.

Then, due to the nature of the data (time series) it was deemed necessary to use relevant libraries such as lubridate(Spinu, Grolemund, & Wickham, 2021), tseries(Trapletti & Hornik, 2022) & forecast(R. Hyndman et al., 2022) packages.

Finally, the ggplot2 (Wickham et al., 2024) package was used to create some visualizations, as well as an auxiliary package, ggtext (Wilke & Wiernik, 2022), for further formatting those.

Show the code

# General purpose R libraries
library(dplyr)
library(readr)
library(kableExtra)
library(flextable)


# Graphs
library(ggplot2)
library(ggtext) # Add support for HTML/CSS on ggplot

# Time Series 

library(lubridate)
library(tseries)
library(forecast)

# Other settings
options(digits=2) # print only 2 decimals

Import dataset

After loading the libraries I am able to use the commands of the readr package to import my data. My data is in .csv format, so I’ll use the read_csv() command (Wickham et al., 2022) to import them.

Additionally, I choose not to include EA-19 values (as I investigate Greece’s unemployment).

Show the code

unemployment <- read_csv("data/unemployment.csv") %>%
  select(LOCATION, TIME, Value) %>% filter(LOCATION != "EA19")

Preview Dataset

Show the code

#| label: tbl-preview-dataset
#| tbl-cap: "Preview Dataset (first 6 rows)"
#| 
preview_dataset = head(unemployment, 10)
kbl(preview_dataset, 
    align = 'c',
    booktabs = T,
    centering = T,
    valign = T) %>%
  kable_paper() %>%
  scroll_box(width = "600px", height = "250px")

LOCATION	TIME	Value
GRC	1998-04	11
GRC	1998-05	11
GRC	1998-06	11
GRC	1998-07	11
GRC	1998-08	11
GRC	1998-09	11
GRC	1998-10	11
GRC	1998-11	11
GRC	1998-12	11
GRC	1999-01	11

Dataset Structure

Our dataset is consisted by 3 variables (columns). More specifically, concerning my variables, are as follows :

Variable	Property	Description
`LOCATION`	qualitative (nominal)	Specific country’s statistics
`TIME`	qualitative (ordinal)	Month of the reported data
`Value`	quantitative (continuous)	Unemployment at the given Time and Country

Thus, my sample has 3 variables, of which 2 are qualitative and 1 is quantitative property.

Time Series Preprocessing

The TIME variable needs to be a Date variable which is not fulfilled on our case.

Show the code

sapply(unemployment, class) %>% kbl() %>% kable_styling(full_width = F, position = "center")

	x
LOCATION	character
TIME	character
Value	numeric

So, above I see that I have dates in a format “YYYY-MM” (Year - Month) and they are considered as characters. With the help of lubridate package I will convert my time series on a Date format.

Show the code

unemployment$TIME <- lubridate::ym(unemployment$TIME)

And let’s check again :

Show the code

sapply(unemployment, class) %>% kbl() %>% kable_styling(full_width = F, position = "center")

	x
LOCATION	character
TIME	Date
Value	numeric

And now I got the Date format. I am able to continue my analysis.

Missing Values

Show the code

sum(is.na(unemployment))

Good news! On this dataset there are 0 missing values, in total.

In case the dataset had missing values, I would first look at which variables those were.In a second phase, it might be necessary to replace the missing values.

Descriptive Statistics

The unemployment data for Greece refer to the period from April 1998 το August 2022. Regarding Europe we have data from January 2000 to August 2022.

The last 20 years have seen particularly large changes mainly in Greece, due to the domestic economic crisis. There is a noticeable change between September 2010 (10.1%) and September 2013 (28.1 %), in terms of unemployment in Greece. For the record, below you can find a table with the months that presented the highest unemployment in Greece. As you will notice the five highest values were observed in 2013.

Show the code

unemployment %>% dplyr::filter(LOCATION == "GRC") %>% select(TIME, Value) %>% as.data.frame() %>%
  arrange(Value) %>% tail(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")

	TIME	Value
289	2013-11-01	28
290	2013-04-01	28
291	2013-05-01	28
292	2013-07-01	28
293	2013-09-01	28

Accordingly, the highest unemployment rates observed at the European level are the following:

Show the code

unemployment %>% dplyr::filter(LOCATION == "EU27_2020") %>% select(TIME, Value)  %>%
  as.data.frame() %>% arrange(Value) %>% tail(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")

	TIME	Value
268	2013-06-01	12
269	2013-01-01	12
270	2013-02-01	12
271	2013-03-01	12
272	2013-04-01	12

The lowest unemployment ever recorded in Greece was in the period of 2008:

Show the code

unemployment %>% dplyr::filter(LOCATION == "GRC") %>% select(TIME, Value) %>% as.data.frame() %>%
  arrange(Value) %>% head(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")

TIME	Value
2008-05-01	7.4
2008-06-01	7.6
2008-07-01	7.6
2008-10-01	7.6
2008-01-01	7.7

On the other hand, at the moment in Europe we are at the lowest levels of the last 20 years, with unemployment hovering around 6%.

Show the code

unemployment %>% dplyr::filter(LOCATION == "EU27_2020") %>% select(TIME, Value) %>%
  as.data.frame() %>% arrange(Value) %>% head(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")

TIME	Value
2022-07-01	6.0
2022-08-01	6.0
2022-04-01	6.1
2022-05-01	6.1
2022-06-01	6.1

Finally, it is evident that in the case of Greece the changes are more intense than in Europe.

All of the above can be summarized by the following graph:

Show the code

grc_unemployment <- unemployment %>% filter(LOCATION == "GRC")

unemployment %>%
  ggplot(aes(x= TIME, y=Value, color = LOCATION)) +
  geom_line() + 
  geom_vline(aes(xintercept = as.POSIXct(as.Date("2010-11-03"))), color = "#000000", size = 1, linetype = "dashed")+
  scale_color_manual(values=c('black', 'dodgerblue2', 'orange')) +
  scale_x_date(date_labels = "%m-%Y") +
  theme_minimal()

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Examine stationarity

Definition of stationarity

An important concept in studying time series is stationarity. A time series is called stationary (“Applied Time Series Analysis,” n.d.) if:

E(X_t) = \text{constant}
Var(X_t) = \text{constant}
Cov(X_t, X_s) = \text{constant}

Examine Stationarity Graphically

It is apparent that there is a big variation on the values of unemployment. This time series is not stationary and the differencing is justified.

Show the code

plot(grc_unemployment$Value,type = "l")

Here we can see a big improvement in comparison with original data. I have some concerns about points close to 150 (mildly upwards trend) and 250 (outlier).

Show the code

grc_unemployment_diff1 <- diff(grc_unemployment$Value, differences = 1)

plot(grc_unemployment_diff1,type = "l")

Given the concerns of above, I made also a second difference plot. It seems to solve the problem on points close to 150.

Show the code

grc_unemployment_diff2<- diff(grc_unemployment$Value, differences = 2)

plot(grc_unemployment_diff2, type = "l")

Examine Stationarity with Statistical tests

The graphical interpretation of stationarity can be beneficial for a quick assessment on topic of stationarity. However it can be considered a subjective metric, which leads on a non consistent decision (someone may consider the second figure as stationary and some others not.

Thankfully, there are some statistical tests which can help us on our decisions. Some commonly used are :

Augmented Dickey-Fuller (ADF) test
Phillips- Perron (PP) test
Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test

Summary

Show the code

summary_stationarity_results <- data.frame(
                             "Type" = c("levels", "Diff(GRC)", "Diff2(GRC)"),
                             "ADF test" = c("Non Stationary", "Stationary", "Stationary"),
                              "PP test" = c("Non Stationary", "Stationary", "Stationary"),
                             "KPSS test" =c("Non Stationary", "Non Stationary", "Stationary")
)


summary_stationarity_results  %>% kbl() %>% kable_styling()

Type	ADF.test	PP.test	KPSS.test
levels	Non Stationary	Non Stationary	Non Stationary
Diff(GRC)	Stationary	Stationary	Non Stationary
Diff2(GRC)	Stationary	Stationary	Stationary

ADF test

\begin{array}{l} H_0 : \text{Time series is not stationary} \\ H_1 : \text{Alternatively} \end{array} \equiv \begin{array}{l} H_0 : \text{There is a unit root} \\ H_1 : \text{Alternatively} \end{array}

Show the code

adf.test(grc_unemployment$Value) %>% as_flextable()

statistic	p.value	parameter	method	alternative
-0.9	0.9485	6.0	Augmented Dickey-Fuller Test	stationary
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Συνεπώς, είναι προφανές από τα αποτελέσματα του στατιστικού ελέγχου Dickey Fuller ότι η χρονοσειρά μου δεν είναι στάσιμη. Θα πρέπει να εφαρμόσω τον έλεγχο Dickey-Fuller στις δοαφορές.

Show the code

adf.test(grc_unemployment_diff1) %>% as_flextable()

statistic	p.value	parameter	method	alternative
-3.5	0.0424 *	6.0	Augmented Dickey-Fuller Test	stationary
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

And finally the results for the second differences are :

Show the code

adf.test(grc_unemployment_diff2) %>% as_flextable()

Warning in adf.test(grc_unemployment_diff2): p-value smaller than printed
p-value

statistic	p.value	parameter	method	alternative
-9.5	0.0100 **	6.0	Augmented Dickey-Fuller Test	stationary
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

PP test

Show the code

pp.test(grc_unemployment$Value,) %>% as_flextable()

statistic	p.value	parameter	method	alternative
0.2	0.9900	5.0	Phillips-Perron Unit Root Test	stationary
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Show the code

pp.test(grc_unemployment_diff1) %>% as_flextable()

statistic	p.value	parameter	method	alternative
-279.8	0.0100 **	5.0	Phillips-Perron Unit Root Test	stationary
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Show the code

pp.test(grc_unemployment_diff2) %>% as_flextable()

statistic	p.value	parameter	method	alternative
-337.1	0.0100 **	5.0	Phillips-Perron Unit Root Test	stationary
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

KPSS test

\begin{array}{l} H_0 : \text{Time series is stationary} \\ H_1 : \text{Alternatively} \end{array} \equiv \begin{array}{l} H_0 : \text{There is not unit root} \\ H_1 : \text{Alternatively} \end{array}

Show the code

kpss.test(grc_unemployment$Value,) %>% as_flextable()

statistic	p.value	parameter	method
2.5	0.0100 **	5.0	KPSS Test for Level Stationarity
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Show the code

kpss.test(grc_unemployment_diff1) %>% as_flextable()

statistic	p.value	parameter	method
0.6	0.0231 *	5.0	KPSS Test for Level Stationarity
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Show the code

kpss.test(grc_unemployment_diff2) %>% as_flextable()

statistic	p.value	parameter	method
0.0	0.1000 .	5.0	KPSS Test for Level Stationarity
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Identify Model

Show the code

par(mfrow = c(1, 2))

acf(grc_unemployment$Value)
pacf(grc_unemployment$Value )

Show the code

par(mfrow = c(1, 2))

acf(grc_unemployment_diff1 )
pacf(grc_unemployment_diff1 )

Show the code

par(mfrow = c(1, 2))

acf(grc_unemployment_diff2 )
pacf(grc_unemployment_diff2 )

Build Time Series Model

Show the code

grc_unemployment = grc_unemployment%>% select(Value)
auto_model <- auto.arima(grc_unemployment, trace = T,seasonal = TRUE)


 Fitting models using approximations to speed things up...

 ARIMA(2,2,2)                    : 302
 ARIMA(0,2,0)                    : 474
 ARIMA(1,2,0)                    : 405
 ARIMA(0,2,1)                    : 298
 ARIMA(1,2,1)                    : 301
 ARIMA(0,2,2)                    : 300
 ARIMA(1,2,2)                    : 302

 Now re-fitting the best model(s) without approximations...

 ARIMA(0,2,1)                    : 297

 Best model: ARIMA(0,2,1)

Show the code

auto_model

Series: grc_unemployment 
ARIMA(0,2,1) 

Coefficients:
         ma1
      -0.908
s.e.   0.023

sigma^2 = 0.16:  log likelihood = -146
AIC=297   AICc=297   BIC=304

Show the code

arimaModel_1=arima(grc_unemployment, order=c(0,1,2))
arimaModel_1


Call:
arima(x = grc_unemployment, order = c(0, 1, 2))

Coefficients:
        ma1    ma2
      0.145  0.076
s.e.  0.059  0.057

sigma^2 estimated as 0.171:  log likelihood = -157,  aic = 320

Show the code

arimaModel_2=arima(grc_unemployment, order=c(1,1,2))
arimaModel_2


Call:
arima(x = grc_unemployment, order = c(1, 1, 2))

Coefficients:
        ar1     ma1    ma2
      0.971  -0.905  0.017
s.e.  0.019   0.067  0.063

sigma^2 estimated as 0.157:  log likelihood = -144,  aic = 296

Show the code

arimaModel_3=arima(grc_unemployment, order=c(9,2,1))
arimaModel_3


Call:
arima(x = grc_unemployment, order = c(9, 2, 1))

Coefficients:
        ar1    ar2    ar3    ar4    ar5    ar6     ar7     ar8     ar9    ma1
      -0.79  -0.71  -0.67  -0.69  -0.59  -0.38  -0.316  -0.446  -0.270  -0.15
s.e.   0.17   0.15   0.15   0.14   0.14   0.12   0.088   0.072   0.069   0.18

sigma^2 estimated as 0.139:  log likelihood = -127,  aic = 276

Compare Models

After building some ARIMA models, I should decide which is the best one to make my estimations. One metric to evaluate those models is AIC (Akaike Information Criterion). The lower the value of AIC, the better my model.

Show the code

accuracy_table <- data.frame(
                             "Name of Model" = c("Auto Model", "ARIMA Candidate #1", "ARIMA Candidate #2", "ARIMA Candidate #3"),
                             Model = c("ARIMA(0,2,1)", "ARIMA(0,1,2)", "ARIMA(1,1,2)", "ARIMA(9,2,1)"),
                             AIC =c(auto_model$aic, arimaModel_1$aic, arimaModel_2$aic, arimaModel_3$aic)
)


accuracy_table %>% kbl() %>% kable_styling()

Name.of.Model	Model	AIC
Auto Model	ARIMA(0,2,1)	297
ARIMA Candidate #1	ARIMA(0,1,2)	320
ARIMA Candidate #2	ARIMA(1,1,2)	296
ARIMA Candidate #3	ARIMA(9,2,1)	276

So, the best model is the Auto Model (ARIMA(9,2,1)), which has the lowest AIC value.

Checking best models

Forecast Future Unemployment

Previously, I identify which is the best model. Now, I will use this model in order to predict unemployment for the next 6 months. It should be recalled that the last available value was from August of 2022 (12.2%). Therefore, I will make a prediction for unemployment in Greece until February 2023.

ARIMA (0,2,1) forecasts

Show the code

forecast(auto_model,6) %>% autoplot()

Show the code

forecast(auto_model,6) %>% kbl() %>% kable_paper()

	Point Forecast	Lo 80	Hi 80	Lo 95	Hi 95
294	12	11.5	13	11.2	13
295	12	11.1	13	10.7	13
296	12	10.7	13	10.2	13
297	12	10.3	13	9.7	13
298	11	10.0	13	9.2	13
299	11	9.6	13	8.8	14

ARIMA (9,2,1) forecasts

Show the code

forecast(arimaModel_3,6) %>% autoplot()

Show the code

forecast(arimaModel_3,6) %>% kbl() %>% kable_paper()

	Point Forecast	Lo 80	Hi 80	Lo 95	Hi 95
294	12	11	12	11.1	13
295	12	11	13	10.9	13
296	12	11	13	10.7	13
297	12	11	13	10.1	13
298	12	10	13	9.8	13
299	12	10	13	9.4	14

Results

Given the diagram as well as the forecast table (of the best performing model, candidate #3), I conclude that a reduction in unemployment in Greece is expected in the next period of time (in the next six months). More specifically, Greece’s unemployment in February 2023 will range between 10% (9.4%) and 13% (14%) with an 80% (95%) probability (based on ARIMA(9,2,1) model).

Acknowledgements

Image by PublicDomainPictures from Pixabay

References

Applied time series analysis. (n.d.). Retrieved October 19, 2022, from https://online.stat.psu.edu/stat510/

Gohel, D., & Skintzos, P. (2022). Flextable: Functions for tabular reporting. Retrieved from https://ardata-fr.github.io/flextable-book/

Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. Retrieved from https://www.jstatsoft.org/v40/i03/

Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and practice. OTexts.

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 26(3), 1–22. https://doi.org/10.18637/jss.v027.i03

Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., Kuroptev, K., … Yasmeen, F. (2022). Forecast: Forecasting functions for time series and linear models. Retrieved from https://pkg.robjhyndman.com/forecast/

OECD. (2022). Unemployment rate (indicator). Retrieved October 14, 2022, from doi: 10.1787/52570002-en

R Core Team. (2021). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/

R Graph Gallery. (2018). Time series. URL Https://r-Graph-Gallery.com/Time-Series.html.

Spinu, V., Grolemund, G., & Wickham, H. (2021). Lubridate: Make dealing with dates a little easier. Retrieved from https://lubridate.tidyverse.org

Trapletti, A., & Hornik, K. (2022). Tseries: Time series analysis and computational finance. Retrieved from https://CRAN.R-project.org/package=tseries

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from https://ggplot2.tidyverse.org

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., … van den Brand, T. (2024). ggplot2: Create elegant data visualisations using the grammar of graphics. Retrieved from https://ggplot2.tidyverse.org

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). Dplyr: A grammar of data manipulation. Retrieved from https://dplyr.tidyverse.org

Wickham, H., Hester, J., & Bryan, J. (2022). Readr: Read rectangular text data. Retrieved from https://readr.tidyverse.org

Wilke, C. O., & Wiernik, B. M. (2022). Ggtext: Improved text rendering support for ggplot2. Retrieved from https://wilkelab.org/ggtext/

Zhu, H. (2021). kableExtra: Construct complex table with kable and pipe syntax. Retrieved from http://haozhu233.github.io/kableExtra/

Citation

BibTeX citation:

@online{2022,
  author = {, stesiam},
  title = {Forecasting {Unemployment} in {Greece}},
  date = {2022-10-22},
  url = {https://www.stesiam.com/english-posts/2022-10-22-Forecasting-Greek-Unemployment/},
  langid = {en}
}

For attribution, please cite this work as:

stesiam. (2022, October 22). Forecasting Unemployment in Greece. Retrieved from https://www.stesiam.com/english-posts/2022-10-22-Forecasting-Greek-Unemployment/

Table of Contents

Introduction

Background

Time series

Procedure

Short Answer

Prerequisites

Import Libraries

Import dataset

Preview Dataset

Dataset Structure

Time Series Preprocessing

Missing Values

Descriptive Statistics

Examine stationarity

Definition of stationarity

Examine Stationarity Graphically

Examine Stationarity with Statistical tests

Summary

ADF test

PP test

KPSS test

Identify Model

Build Time Series Model

Compare Models

Checking best models

Forecast Future Unemployment

ARIMA (0,2,1) forecasts

ARIMA (9,2,1) forecasts

Results

Acknowledgements

References

Citation