How to Think about Data, Data Science, and Time Series

What Are Data, Data Science, and Time Series?

Dimos Anagnostopoulos
Towards Data Science

--

Data is measurements of the world. We can capture data in many ways. We can capture data about different people's characteristics at one point in time, collecting a sample from a population (cross-sectional data), and find, for example, the distribution of heights, i.e. the average height of the individuals in the sample. We can also collect such samples at multiple points in time and see how the average height changes through time; this is panel data. Finally, we can focus on one individual instead of a population, collect their height measurements daily for many years, and track their height's evolution. This last sample is a time series.

The ideas developed to understand time series samples translate from a person's height to the evolution of various measurements of a vessel in the maritime sector. But what do we mean by "understanding" the time series? It means creating a function that captures how the data points change in time; in the presence of a certain pattern, that function would capture the pattern, and in that way we would be able to describe/summarize/abstract/generalize the time series process. For example, if we measure a child's height every year for the first 10 years of their life and find that it starts at 70 cm and grows by 7 cm every year to reach 140 cm at 10 years old, then that function would look like:

x(t) = b*x(t-1) + 7			(1)

x(t) is next year's observation and x(t-1) is the current year's observation; the transition from one year to the next is therefore defined by the function above (adding 7 cm to the previous value, with b = 1). In other words, the function captures the data generating process, and it is formally called a model. In the realm of data science, we use data to tell us what the models look like. This means that we do not know in advance that a child's height evolves by gaining 7 cm every year. We "train" an algorithm that finds this pattern and emits the model that informs us about it. So, this model is useful for informative/research reasons, but it is also important for predictive reasons: if you know that this is the way the child's height evolves, you can take today's height, add 7, and get a prediction for next year's height! As you can see, in the case of time series the observations of this one individual are correlated. If we were gathering cross-sectional data (the heights of many different individuals at one point in time) there would be no reason to assume that their heights are correlated in any way, assuming a randomly drawn sample from the population. If we were to pick a new individual from the population, knowing the last selected individual's height would not matter at all. This is in contrast with time series, where if we sample the same individual next year, knowing the last observation (their height this year) does indeed inform us about their probable height.
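To make this concrete, here is a minimal sketch in Python of "training" such a model: the numbers come from the example above, and the least-squares setup is just one simple way to recover the pattern from the data alone.

```python
import numpy as np

# The child's height: starts at 70 cm and grows 7 cm per year (ages 0 to 10)
heights = 70 + 7 * np.arange(11)                  # 70, 77, ..., 140

# "Train" a model of the form x(t) = b*x(t-1) + a by least squares
x_prev, x_curr = heights[:-1], heights[1:]
A = np.column_stack([x_prev, np.ones_like(x_prev)])
(b, a), *_ = np.linalg.lstsq(A, x_curr, rcond=None)
print(round(b, 2), round(a, 2))                   # b ~ 1.0, a ~ 7.0

# Use the trained model to predict next year's height
print(b * heights[-1] + a)                        # ~ 147 cm
```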

Autoregressive Model

Moving on to slightly more technical material, one quickly realizes that not all time series are the same. The individual's height can be seen as growing roughly linearly on a yearly basis, but the outside temperature in the port of Piraeus is clearly not linear! We do not see every year's temperature being 7 degrees higher than last year's! Here we see a seasonal pattern that repeats yearly. The function that models these temperatures is going to look a lot different from our first function. Assuming monthly observations, one could say it will look like this:

x(t) = b*x(t-12)			(2)

This means that this month's observation resembles the one 12 months ago, which makes sense; July's temperature is best predicted by last July's temperature rather than April's. Technically, this is called an autoregressive model. The informed reader will probably know what a regression is; it is the function/model we have been talking about, only for the case of cross-sectional data, where we find, for example, how a person's height changes depending on the parents' heights. But here, since we are within the time series space, height is regressed on itself at a previous point in time; hence auto-regression. As in the case of normal regression, we have a dependent variable that we try to predict and many independent variables/features, some of which will be able to explain the variation of the dependent variable. The variables that are indeed significant are kept in the final model/function, since the data told us that they are useful predictors. In the case of time series autoregressive models, the candidate variables to be included in the autoregression are all the previous lags of the person's height, and the model tells us which past periods/lags are significant in explaining the current observation. In the temperature example, the model told us that the significant autoregressive variable is the 12th lag, no matter what timestamp we are observing now; the current value always depends on the value 12 periods before it. The methodology for finding the significant auto-regressors is quite similar to the cross-sectional case, where we may start by exploring the correlations among different variables, keep some, and then run regressions on the most correlated ones to come up with the final feature selection of a model that best represents the data. In the case of time series, following the Box-Jenkins methodology, we use autocorrelation and partial autocorrelation plots to see the correlation of a time series' observation with the observations at various lags into the past.
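As a hedged sketch of this step in Python, statsmodels provides plot_acf and plot_pacf for exactly these plots; the AR(1) series generated here is purely illustrative, standing in for real observations.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Illustrative series: an AR(1) process standing in for real observations
rng = np.random.default_rng(0)
series = np.zeros(300)
for t in range(1, 300):
    series[t] = 0.7 * series[t - 1] + rng.normal()

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=24, ax=axes[0])    # raw correlation with each lag
plot_pacf(series, lags=24, ax=axes[1])   # correlation per lag, controlling for shorter lags
plt.show()
```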

Image by Author

In the partial autocorrelation graph above we see that this time series has the characteristic that the current value depends on the values 1 to 9 lags before it (above the blue significance threshold). Therefore, if we want to autoregress this variable (calling it height just for continuity) we would select 9 features, i.e. 9 lags of height, to best represent the data in the final model. A more experienced modeler knows that a parsimonious/simple model is always preferable, so they would select only the 1st lag as a feature, as it is highly correlated and carries far more information about the next value than the other lags. So they would choose only 1 feature, in a model that would look like this:

x(t) = b*x(t-1)				(3)

The autoregression would give us the value of b, i.e. the coefficient that tells us how much, controlling for all else, the last lag's value contributes to the current value. At this point it is interesting to explore this b value, as it hides a whole world of peculiarities behind it. If b is below 1 in absolute value, the process is stationary! This means that it reverts to a constant mean and does not wander off in a random direction. Model (1) is not stationary (it is non-stationary, a unit root process) since b is 1; if the person's height is equal to last year's height plus 7, last year's height is multiplied by b = 1. If we had a mean-reverting process, b would be less, for example 0.3, suggesting that if in one period we see a high value, the next period we should expect a value not as high. In general, stationary processes are easier to predict since they have a constant distribution throughout and revert to a constant mean.
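As a minimal sketch of estimating b in practice, one can use statsmodels' AutoReg on a synthetic mean-reverting series; the 0.3 coefficient below is an assumption chosen to match the example.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Synthetic mean-reverting AR(1) with b = 0.3 (an assumed value)
rng = np.random.default_rng(1)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.3 * x[t - 1] + rng.normal()

# Fit x(t) = const + b*x(t-1) and inspect the estimated b
res = AutoReg(x, lags=1).fit()
print(res.params)   # slope near 0.3, i.e. |b| < 1 -> stationary, mean-reverting
```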

Image by Author

This is an example of a stationary process with a constant mean around 12. If a value of 13 comes, the following values will tend to drift back toward 12.

Moving Average Model

We already saw that one way to model this process is with the use of autoregression, where we find which lags are significant in explaining the value of the current observation, but there are other ways to model it as well. The second way is the Moving Average process. Here, every time period's prediction is compared to the actual value that came up, defining the "error" value. Now, we can regress the current observation not on the lagged observation (as in autoregression, AR), but on the lagged error instead! Intuitively, this tells us, if an "error"/shock occurs in one period, for how many periods into the future its impact persists; which is the same as saying that the current value is explained by the error that many lags ago. Again we follow the same regression methodology where, out of all the lagged errors, we find which ones are significant and include them in the model. For example, when modeling the price of oil, if there is a coronavirus-induced drop in the price (like the one that happened in March 2020), how long does it take for the price to get back to normal? That is the number of MA lags. So now that we know the AR (autoregressive) and MA (moving average) models, combining them gives the ARMA model, and adding differencing to handle non-stationary series gives the ARIMA model! This is a model where the next value of a stationary time series depends on the values certain lags behind as well as the errors certain lags behind.
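As a rough sketch of the combined model in Python, statsmodels' ARIMA class takes order=(p, d, q) for the AR lags, the amount of differencing, and the MA lags; the series and orders below are illustrative assumptions.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stationary series with both AR and MA structure (ARMA(1,1))
rng = np.random.default_rng(2)
e = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + e[t] + 0.4 * e[t - 1]

# order=(1, 0, 1): one AR lag, no differencing, one MA lag
res = ARIMA(y, order=(1, 0, 1)).fit()
print(res.params)       # estimated AR and MA coefficients
print(res.forecast(5))  # predictions for the next 5 periods
```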

ARCH / GARCH

Moving on to other ways to model a time series, there are also the ARCH and GARCH models. ARCH stands for Autoregressive Conditional Heteroskedasticity (the G in GARCH adds "Generalized"), and these models are about modeling the volatility/variance/vertical spread of the time series. So now, instead of regressing the current value on a lagged value as in AR, we regress the current variance on the variance a certain number of lags ago. The goal is to detect how long a surge in volatility persists, or how exactly it repeats itself. This is useful because if we find that when the engine's temperature variance increases it keeps increasing for many more periods, we will worry more than if we knew that one spike in variance is generally followed by shrinking and a strong return to the constant mean.

σ(t) = b*σ(t-1)				(4)

This is how these models look: the standard deviation or variance of a period regressed on the previous period's standard deviation or variance.
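A minimal sketch of fitting such a volatility model, assuming the third-party arch package is available and using synthetic "returns" (period-to-period changes) as stand-in data:

```python
import numpy as np
from arch import arch_model

# Stand-in data: period-to-period changes ("returns") of some measurement
rng = np.random.default_rng(3)
returns = rng.normal(size=1000)

# GARCH(1,1): today's variance depends on yesterday's squared shock and variance
res = arch_model(returns, vol="GARCH", p=1, q=1).fit(disp="off")
print(res.params)   # omega, alpha[1], beta[1]; alpha+beta measures volatility persistence
```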

Regimes

Moving on, one has to understand that a time series' structure or underlying model can change over time. For example, in the example of the child's height, the evolution from 0 to 18 years old is different from that of 18 to 50 years old. In other words, there are 2 different "regimes" in the time series of heights that should be described by 2 different models. The first one has the 7 cm yearly increase and the second one is constant; each year's value is the same as the previous year's. In other time series we may find more regimes, and regimes that recur in the future:

Image by Author

Here, in the time series of the ship's speed, we clearly see 2 different regimes: one with a mean of 12 knots when the ship is moving, and the other with a mean of 0 knots when the ship is standing still at port. Clearly, the experienced modeler needs a different model for each state (traveling/not traveling). The even more experienced modeler could have a model of how long each state/regime lasts, so that their predictions oscillate between the 2 regimes in a timely manner. The well-known model for this type of series is the Markov Switching Autoregression, where the model switches from one autoregressive model to the other following a Markov process, i.e. a memoryless process where a transition probability matrix defines, given the current state/regime, the probability that the next state is Travel, P(travel), and the probability that it is Non-travel, P(non-travel). So it is a 2x2 transition matrix capturing how states switch depending on what state we are in. The problem is that a probability of 0.9 of staying in the same state/regime will always be 0.9, no matter how many periods you have already stayed in that regime. Therefore, a model of the number of hours spent in one regime can be more useful.
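As a hedged sketch, statsmodels ships Markov switching models; here a two-regime mean model is fitted to a synthetic speed series (the regime lengths and noise levels are assumptions made for illustration).

```python
import numpy as np
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression

# Synthetic speed series: stretches near 12 knots (sailing) and near 0 (in port)
rng = np.random.default_rng(4)
speed = np.concatenate([12 + rng.normal(0, 0.5, 100),
                        rng.normal(0, 0.5, 50),
                        12 + rng.normal(0, 0.5, 100)])

# Two regimes with different means; fitting also estimates the 2x2 transition matrix
res = MarkovRegression(speed, k_regimes=2).fit()
print(res.params)                                # regime means and transition probabilities
print(res.smoothed_marginal_probabilities[:5])   # per-period probability of each regime
```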

Structural Models / Vector Autoregression

Until now, we have examined cases where we predict the next value of the time series using only the time series itself. Of course, the values of one series do not depend only on its own past but on other series as well. For example, the engine's temperature can be predicted not only by the temperature in the previous period, but also by the speed of the vessel in the current or last period, the RPM in the last period, or the wind speed in the current or last period. This opens up a whole new world of possible correlations under the umbrella of multivariate autoregressions, or more advanced models like Vector Autoregressions (VAR). In the multivariate time series case the model would look like this:

x(t) = b*x(t-1) + c*y(t-1)			(5)

This shows that a variable at one point in time is regressed on itself 1 lag before and on another time series' lagged value. Within the world of multivariate time series, both of these variables are considered endogenous, i.e. variable y is not exogenous as in the normal regression case. That means that to model them we need 2 equations, one to show how x affects y and one to show how y affects x. These models are generally called structural models, or simultaneous equations models, in a time series setting.

X(t) = b*X(t-1) + c*Y(t-1)			(6)
Y(t) = b1*Y(t-1) + c1*X(t-1)			(7)

So, here we have 2 variables and 2 equations. The intuition is that if we were to model and predict the temperature of a vessel's engine in a multivariate setting, one could use the vessel's speed as an explanatory variable; the next temperature would depend on the current speed, because the faster the vessel goes the higher the RPM, which leads to increased temperature. On the other hand, the 2 variables may also be related in the opposite way, which needs to be captured in the second equation of the system: the higher the temperature, the more likely the captain of the vessel is to reduce speed so that the engine does not overheat.

In order to solve this system of equations, there are broadly 2 approaches: one that is based on domain expertise to "identify" the system and find the coefficients, and another, atheoretical one, the Vector Autoregression approach. In the first approach, the traditional "Cowles Commission" approach for identifying simultaneous equations/structural models, in order to identify the 2 different relationships between the 2 variables (2 equations) we need the help of 2 exogenous variables. If X in equation (6) is the vessel's temperature and Y the vessel's speed, we could enhance the system to look like this:

X(t) = b*X(t-1) + c*Y(t-1) + d*F(t-1)		(8)
Y(t) = b1*Y(t-1) + c1*X(t-1) + d1*G(t-1)	(9)

What we have added here is an exogenous factor in each equation that will help us solve these equations and find the coefficients b, b1, c, c1, d and d1 (we are searching for 6 unknowns in total).

To better understand what is going on here, let's introduce a supply and demand concept, where we model the demand for and the supply of motion from the ship's engine. The 2 axes are temperature and speed. The supplier of motion (the engine) asks for more temperature as the speed increases: an upward-sloping supply curve. On the other hand, the consumer of motion (the captain) asks for less speed as the temperature increases: a downward-sloping demand curve. Equation 8 is the supply curve, where an increase in speed (Y) in one period leads to an increase in temperature (X) in the next period, and equation 9 is the demand curve, where an increase in temperature in the current period leads to a drop in speed (the captain avoiding overheating).

Image by Author

Imagine we have a lot of speed and temperature data and we know from domain experts about the 2 opposing relationships between the 2 variables. How do we disentangle the data points that were due to one relationship from those that arose due to the other? In other words, how do we find the shapes of the demand and supply curves, and how do we find the coefficients in equations 8 and 9? With the use of the mentioned exogenous factors: to find the supply curve, we introduce factor G in the demand equation (9), which shifts the demand curve upwards, exposing the slope of the supply curve; to find the demand curve, we introduce factor F in the supply equation (8), which shifts the supply curve outwards and exposes the slope of the demand curve (i.e. the coefficients of the demand equation). An example of F could be sudden zero-wind conditions that shift the supply curve outwards, which makes sense because now, for the same engine temperature/engine power, you get more speed than in rough weather conditions. An example of G would be the adoption of a less conservative attitude by the captain, allowing for higher speeds even if temperatures are high.

In the second approach, solving the structural model using Vector Autoregression, we do not need any exogenous variables or restrictions to identify the system; we just accept all variables as endogenous, regress each on the lags of all of them, and proceed to use impulse response functions, Granger causality and forecasting to get accurate predictions of the variables of interest. If run on benchmark data, this Vector Autoregression can give us the coefficients of how much each variable and its lags contribute to each other time series under normal conditions; if then run on a period of "anomaly", it can highlight that the coefficients have changed, indicating that a variable no longer affects the dependent variable in the same way. This is a robust way of detecting whether a series is experiencing an anomaly.
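A minimal sketch of this workflow with statsmodels' VAR, on a synthetic temperature/speed system (the coefficients and lag choice are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Synthetic two-variable system: engine temperature and vessel speed
rng = np.random.default_rng(5)
n = 300
temp, spd = np.zeros(n), np.zeros(n)
for t in range(1, n):
    temp[t] = 0.6 * temp[t - 1] + 0.3 * spd[t - 1] + rng.normal()
    spd[t] = 0.5 * spd[t - 1] - 0.2 * temp[t - 1] + rng.normal()
df = pd.DataFrame({"temperature": temp, "speed": spd})

res = VAR(df).fit(maxlags=4, ic="aic")    # lag length chosen by information criterion
print(res.params)                         # how each lag of each series feeds each equation
res.irf(10).plot()                        # impulse responses, 10 periods ahead
print(res.test_causality("temperature", ["speed"]).summary())  # Granger causality
```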

Vector Error Correction Model

As an extension of the VAR model, one can leverage the concept of cointegration and build an Error Correction or Vector Error Correction Model (VECM). Cointegration occurs when 2 or more non-stationary variables, when regressed on one another, have a stationary residual time series. If 2 trending, non-stationary series have a "difference" (a linear combination) that is stationary, i.e. with a constant mean, then the 2 series are cointegrated. This means that in the long run the 2 series move together; the cointegrating relationship captures the long-run relationship. The VECM combines the short-run deviations together with the cointegrating relationship in one model. So one can imagine that we could have the long-run (benchmark) relationship of the variables in the cointegrating part of the VECM, and the short-term deviations (anomalies) captured in the remaining part of the model. That could be a novel way to use VECM.
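A hedged sketch with statsmodels' VECM, on two synthetic series built to share a common trend so that they cointegrate (the rank test settings and lag order are illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.vector_ar.vecm import VECM, select_coint_rank

# Two synthetic non-stationary series built to share a common trend
rng = np.random.default_rng(6)
trend = np.cumsum(rng.normal(size=400))   # shared random walk
df = pd.DataFrame({"a": trend + rng.normal(size=400),
                   "b": 0.5 * trend + rng.normal(size=400)})

rank = select_coint_rank(df, det_order=0, k_ar_diff=1)   # Johansen-based rank test
print(rank.rank)                                         # expect 1 cointegrating relation

res = VECM(df, k_ar_diff=1, coint_rank=1).fit()
print(res.beta)    # the long-run (cointegrating) relationship
print(res.alpha)   # speed of adjustment back toward it after short-run deviations
```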

Principal Component Analysis

Additionally, we could use Principal Component Analysis or Dynamic Factor models to find which time series co-move and can be represented by a single latent variable instead. The factor loadings, i.e. how strongly each variable loads onto every latent factor, would therefore suggest which variables are "correlated" under normal conditions.
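A minimal sketch with scikit-learn's PCA, on synthetic sensor series driven by one common latent factor (the sensor names and loadings are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Several synthetic sensor series driven by one common latent factor
rng = np.random.default_rng(7)
factor = np.cumsum(rng.normal(size=500))
X = pd.DataFrame({f"sensor_{i}": (0.5 + 0.2 * i) * factor + rng.normal(size=500)
                  for i in range(4)})

pca = PCA(n_components=1)
latent = pca.fit_transform(X)            # the single latent series
print(pca.components_)                   # loadings: how strongly each sensor follows it
print(pca.explained_variance_ratio_)     # share of co-movement one factor captures
```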

Correlations

Otherwise, a simpler approach is to find correlations among the time series (ignoring time as an explanatory dimension).
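For completeness, a minimal sketch of this with pandas (the column names and data are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data: three related vessel measurements
rng = np.random.default_rng(8)
df = pd.DataFrame({"speed": rng.normal(12, 1, 200),
                   "rpm": rng.normal(90, 5, 200)})
df["temperature"] = 40 + 0.3 * df["rpm"] + rng.normal(0, 1, 200)

print(df.corr())   # pairwise correlations, ignoring time ordering
```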
