Time Series Analysis
Last updated
by Selva Prabhakaran | Posted on February 13, 2019
A time series is a sequence of observations recorded at regular time intervals. This guide walks you through the process of analyzing the characteristics of a given time series in Python.
Time Series Analysis in Python – A Comprehensive Guide. Photo by Daniel Ferrandiz.
What is a Time Series?
How to import Time Series in Python?
What is panel data?
Visualizing a Time Series
Patterns in a Time Series
Additive and multiplicative Time Series
How to decompose a Time Series into its components?
Stationary and non-stationary Time Series
How to make a Time Series stationary?
How to test for stationarity?
What is the difference between white noise and a stationary series?
How to detrend a Time Series?
How to deseasonalize a Time Series?
How to test for seasonality of a Time Series?
How to treat missing values in a Time Series?
What are autocorrelation and partial autocorrelation functions?
How to compute partial autocorrelation function?
Lag Plots
How to estimate the forecastability of a Time Series?
Why and How to smoothen a Time Series?
How to use Granger Causality test to know if one Time Series is helpful in forecasting another?
What Next
A time series is a sequence of observations recorded at regular time intervals.
Depending on the frequency of observations, a time series may typically be hourly, daily, weekly, monthly, quarterly or annual. Sometimes, you might have second-wise and minute-wise time series as well, like the number of clicks and user visits every minute.
Why even analyze a time series?
Because it is the preparatory step before you develop a forecast of the series.
Besides, time series forecasting has enormous commercial significance because stuff that is important to a business like demand and sales, number of visitors to a website, stock price etc. are essentially time series data.
So what does analyzing a time series involve?
Time series analysis involves understanding various aspects about the inherent nature of the series so that you are better informed to create meaningful and accurate forecasts.
So how do you import time series data?
The data for a time series is typically stored in .csv files or other spreadsheet formats and contains two columns: the date and the measured value.
Let’s use read_csv() from the pandas package to read the time series dataset (a csv file on Australian Drug Sales) as a pandas dataframe. Adding the parse_dates=['date'] argument will make the date column get parsed as a date field.
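As a minimal sketch of the step above: the drug sales file itself is not reproduced here, so the snippet reads a tiny in-memory CSV with made-up numbers instead. Only the column names `date` and `value` come from the text; everything else is an assumption for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the csv file: a date column and a value column.
# The numbers are made up for illustration.
csv_text = """date,value
1991-07-01,3.5
1991-08-01,3.2
1991-09-01,3.3
1991-10-01,3.6
"""

# parse_dates tells pandas to convert the 'date' column to datetimes
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.dtypes)
print(df.head())
```

With a real file you would pass its path or URL to read_csv() in place of the StringIO object.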
Alternately, you can import it as a pandas Series with the date as index. You just need to specify the index_col argument in pd.read_csv() to do this.
Note, in the series, the ‘value’ column is placed higher than date to imply that it is a series.
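A small sketch of the index_col variant, again on made-up in-memory data rather than the actual file. Note that read_csv() hands back a one-column DataFrame indexed by date; selecting the column gives an actual pandas Series.

```python
import io
import pandas as pd

# Made-up data standing in for the csv file
csv_text = "date,value\n1991-07-01,3.5\n1991-08-01,3.2\n1991-09-01,3.3\n"

# index_col moves the parsed dates into the index
ser = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")
print(ser.head())

# Selecting the single column yields a pandas Series with a DatetimeIndex
values = ser["value"]
print(type(values).__name__)
```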
Panel data is also a time based dataset.
The difference is that, in addition to time series, it also contains one or more related variables that are measured for the same time periods.
Typically, the columns present in panel data contain explanatory variables that can be helpful in predicting the Y, provided those columns will be available at the future forecasting period.
An example of panel data is shown below.
Let’s use matplotlib to visualise the series.
Since all values are positive, you can show this on both sides of the Y axis to emphasize the growth.
Since it’s a monthly time series and follows a certain repetitive pattern every year, you can plot each year as a separate line in the same plot. This lets you compare the year-wise patterns side-by-side.
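The data preparation behind such a seasonal plot can be sketched as below. The series is synthetic (an assumed trend plus a yearly sine pattern), not the drug sales data; the idea is just to reshape the data so that each year becomes its own column.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (made-up): upward trend plus a yearly pattern
idx = pd.date_range("2000-01-01", periods=48, freq="MS")
t = np.arange(48)
values = 10 + 0.2 * t + 3 * np.sin(2 * np.pi * idx.month.to_numpy() / 12)
df = pd.DataFrame({"date": idx, "value": values})

# Split the date into year and month so the years can be compared
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# One row per month, one column per year
by_year = df.pivot(index="month", columns="year", values="value")
print(by_year.round(2))
```

If matplotlib is installed, calling `by_year.plot()` draws one line per year on a shared month axis.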
Seasonal Plot of a Time Series
There is a steep fall in drug sales every February, rising again in March, falling again in April and so on. Clearly, the pattern repeats within a given year, every year.
However, as years progress, the drug sales increase overall. You can visualize this trend and how it varies each year in a year-wise boxplot. Likewise, you can do a month-wise boxplot to visualize the monthly distributions.
Boxplot of Month-wise (Seasonal) and Year-wise (trend) Distribution
You can group the data at seasonal intervals and see how the values are distributed within a given year or month and how it compares over time.
The boxplots make the year-wise and month-wise distributions evident. Also, in a month-wise boxplot, the months of December and January clearly have higher drug sales, which can be attributed to the holiday discounts season.
So far, we have seen the similarities to identify the pattern. Now, how to find out any deviations from the usual pattern?
Any time series may be split into the following components: Base Level + Trend + Seasonality + Error
A trend is observed when there is an increasing or decreasing slope in the time series, whereas seasonality is observed when there is a distinct repeated pattern at regular intervals due to seasonal factors. It could be because of the month of the year, the day of the month, weekdays or even time of the day.
However, it is not mandatory that all time series must have a trend and/or seasonality. A time series may not have a distinct trend but have a seasonality. The opposite can also be true.
So, a time series may be imagined as a combination of the trend, seasonality and the error terms.
Another aspect to consider is the cyclic behaviour. It happens when the rise and fall pattern in the series does not happen in fixed calendar-based intervals. Care should be taken to not confuse ‘cyclic’ effect with ‘seasonal’ effect.
So, how to differentiate between a ‘cyclic’ and a ‘seasonal’ pattern?
If the patterns are not of fixed calendar based frequencies, then it is cyclic. That is because, unlike seasonality, cyclic effects are typically influenced by business and other socio-economic factors.
Depending on the nature of the trend and seasonality, a time series can be modeled as additive or multiplicative, wherein each observation in the series can be expressed as either a sum or a product of the components:
Additive time series: Value = Base Level + Trend + Seasonality + Error
Multiplicative Time Series: Value = Base Level x Trend x Seasonality x Error
You can do a classical decomposition of a time series by considering the series as an additive or multiplicative combination of the base level, trend, seasonal index and the residual.
The seasonal_decompose function in statsmodels implements this conveniently.
Decompose
Setting extrapolate_trend='freq' takes care of any missing values in the trend and residuals at the beginning and end of the series.
If you look at the residuals of the additive decomposition closely, they have some pattern left over. The multiplicative decomposition, however, looks quite random, which is good. So ideally, multiplicative decomposition should be preferred for this particular series.
The numerical output of the trend, seasonal and residual components is stored in the result_mul output itself. Let’s extract them and put them in a dataframe.
If you check, the product of the seas, trend and resid columns should exactly equal the actual_values.
Stationarity is a property of a time series. A stationary series is one where the values of the series are not a function of time.
That is, the statistical properties of the series like mean, variance and autocorrelation are constant over time. Autocorrelation of the series is nothing but the correlation of the series with its previous values, more on this coming up.
A stationary time series is devoid of seasonal effects as well.
So how to identify if a series is stationary or not? Let’s plot some examples to make it clear:
Stationary and Non-Stationary Time Series
So why does a stationary series matter? Why am I even talking about it?
I will come to that in a bit, but understand that it is possible to make nearly any time series stationary by applying a suitable transformation. Most statistical forecasting methods are designed to work on a stationary time series. The first step in the forecasting process is typically to do some transformation to convert a non-stationary series to stationary.
You can make a series stationary by:
Differencing the series (once or more)
Taking the log of the series
Taking the nth root of the series
A combination of the above
The most common and convenient method to stationarize the series is by differencing the series at least once until it becomes approximately stationary.
So what is differencing?
If Y_t is the value at time ‘t’, then the first difference of Y = Y_t - Y_t-1. In simpler terms, differencing the series is nothing but subtracting the previous value from the current value.
If the first difference doesn’t make a series stationary, you can go for the second differencing. And so on.
For example, consider the following series: [1, 5, 2, 12, 20]
First differencing gives: [5-1, 2-5, 12-2, 20-12] = [4, -3, 10, 8]
Second differencing gives: [-3-4, 10-(-3), 8-10] = [-7, 13, -2]
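The worked example can be checked with numpy, where `np.diff` subtracts each value from the one that follows it. Note that the middle term of the second difference is 10 - (-3) = 13.

```python
import numpy as np

y = np.array([1, 5, 2, 12, 20])

# First difference: Y_t - Y_t-1
first_diff = np.diff(y)

# Second difference: the difference of the first difference
second_diff = np.diff(y, n=2)

print(first_diff.tolist())   # [4, -3, 10, 8]
print(second_diff.tolist())  # [-7, 13, -2]
```

On a pandas Series, `series.diff()` does the same thing but keeps the index and leaves a NaN in the first position.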
Forecasting a stationary series is relatively easy and the forecasts are more reliable.
An important reason is that autoregressive forecasting models are essentially linear regression models that utilize the lag(s) of the series itself as predictors.
We know that linear regression works best if the predictors (X variables) are not correlated with each other. So, stationarizing the series solves this problem since it removes any persistent autocorrelation, thereby making the predictors (lags of the series) in the forecasting models nearly independent.
Now that we’ve established that stationarizing the series is important, how do you check if a given series is stationary or not?
The stationarity of a series can be established by looking at the plot of the series like we did earlier.
Another method is to split the series into 2 or more contiguous parts and compute summary statistics like the mean, variance and autocorrelation for each part. If the stats are quite different, then the series is not likely to be stationary.
Nevertheless, you need a method to quantitatively determine if a given series is stationary or not. This can be done using statistical tests called ‘Unit Root Tests’. There are multiple variations of this, where the tests check if a time series is non-stationary and possesses a unit root.
There are multiple implementations of Unit Root tests like:
Augmented Dickey Fuller test (ADF Test)
Kwiatkowski-Phillips-Schmidt-Shin – KPSS test (trend stationary)
Phillips Perron test (PP Test)
The most commonly used is the ADF test, where the null hypothesis is that the time series possesses a unit root and is non-stationary. So, if the P-Value in the ADF test is less than the significance level (0.05), you reject the null hypothesis.
The KPSS test, on the other hand, is used to test for trend stationarity. Its null hypothesis and P-Value interpretation are just the opposite of the ADF test: the null is that the series is stationary, so a P-Value below the significance level suggests non-stationarity. The below code implements these two tests using the statsmodels package in python.
Like a stationary series, white noise is also not a function of time, that is, its mean and variance do not change over time. But the difference is that white noise is completely random with a mean of 0.
In white noise there is no pattern whatsoever. If you consider the sound signals in an FM radio as a time series, the blank sound you hear between the channels is white noise.
Mathematically, a sequence of completely random numbers with mean zero is a white noise.
Detrending a time series means removing the trend component from it. But how to extract the trend? There are multiple approaches.
Subtract the line of best fit from the time series. The line of best fit may be obtained from a linear regression model with the time steps as the predictor. For more complex trends, you may want to use quadratic terms (x^2) in the model.
Subtract the trend component obtained from the time series decomposition we saw earlier.
Subtract the mean.
Apply a filter like the Baxter-King filter (statsmodels.tsa.filters.bk_filter.bkfilter) or the Hodrick-Prescott filter (statsmodels.tsa.filters.hp_filter.hpfilter) to remove the moving average trend lines or the cyclical components.
Let’s implement the first two methods.
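As a sketch of the first approach only (subtracting the line of best fit), the snippet below uses `np.polyfit` on a synthetic series with a made-up slope and noise level. For the second approach you would instead subtract the trend component returned by the decomposition.

```python
import numpy as np

# Synthetic series (made-up): linear trend plus noise
rng = np.random.default_rng(1)
t = np.arange(100)
y = 5.0 + 0.3 * t + rng.normal(scale=1.0, size=100)

# Fit a straight line with the time steps as the predictor
slope, intercept = np.polyfit(t, y, deg=1)
fitted_line = slope * t + intercept

# Detrend by subtracting the fitted line
detrended = y - fitted_line

print(f"estimated slope: {slope:.3f}")
print(f"mean of detrended series: {detrended.mean():.6f}")
```

Using `deg=2` in the fit would handle a quadratic trend in the same way.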
There are multiple approaches to deseasonalize a time series as well. One of them is to divide the series by the seasonal index.
If dividing by the seasonal index does not work well, try taking a log of the series and then do the deseasonalizing. You can later restore to the original scale by taking an exponential.
The common way is to plot the series and check for repeatable patterns in fixed time intervals. So, the types of seasonality are determined by the clock or the calendar:
Hour of day
Day of month
Weekly
Monthly
Yearly
However, if you want a more definitive inspection of the seasonality, use the Autocorrelation Function (ACF) plot. More on the ACF in the upcoming sections. But when there is a strong seasonal pattern, the ACF plot usually reveals definitive repeated spikes at the multiples of the seasonal window.
For example, the drug sales time series is a monthly series with patterns repeating every year. So, you can see spikes at the 12th, 24th, 36th.. lines.
I must caution you that in real world datasets such strong patterns are hardly noticed and can get distorted by noise, so you need a careful eye to capture these patterns.
Alternately, if you want a statistical test, the CHTest (Canova-Hansen test) can determine if seasonal differencing is required to stationarize the series.
Sometimes, your time series will have missing dates/times. That means the data was not captured or was not available for those periods. It could so happen that the measurement was zero on those days, in which case you may fill up those periods with zero.
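To see the spikes numerically, here is a small sketch on a synthetic monthly series with a 12-month pattern (made-up amplitude and noise). It uses `Series.autocorr`, which is the plain correlation between the series and its shifted copy, a close stand-in for the ACF values you would read off the plot.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (made-up): 12-month seasonal pattern plus noise
rng = np.random.default_rng(3)
idx = pd.date_range("2000-01-01", periods=120, freq="MS")
seasonal = 5 * np.sin(2 * np.pi * np.arange(120) / 12)
series = pd.Series(seasonal + rng.normal(scale=1.0, size=120), index=idx)

# Correlation of the series with itself at lags 1..36
acf_by_lag = {lag: series.autocorr(lag=lag) for lag in range(1, 37)}

# Peaks show up at multiples of 12, troughs half a season away
for lag in (6, 12, 18, 24, 36):
    print(lag, round(acf_by_lag[lag], 2))
```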
Secondly, when it comes to time series, you should typically NOT replace missing values with the mean of the series, especially if the series is not stationary. What you could do instead for a quick and dirty workaround is to forward-fill the previous value.
However, depending on the nature of the series, you will want to try out multiple approaches before concluding. Some effective alternatives for imputation are:
Backward Fill
Linear Interpolation
Quadratic interpolation
Mean of nearest neighbors
Mean of seasonal counterparts
To measure the imputation performance, I manually introduce missing values to the time series, impute it with above approaches and then measure the mean squared error of the imputed against the actual values.
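The evaluation procedure described above can be sketched as follows. The series is a synthetic smooth curve and the positions of the removed points are arbitrary; only three of the listed approaches are compared, since they need nothing beyond pandas.

```python
import numpy as np
import pandas as pd

# Synthetic smooth series (made-up) standing in for the real data
t = np.arange(200)
actual = pd.Series(np.sin(2 * np.pi * t / 50))

# Manually knock out a few known values
with_gaps = actual.copy()
missing_idx = [10, 40, 75, 120, 160]
with_gaps.iloc[missing_idx] = np.nan

# Impute with a few of the approaches
candidates = {
    "forward fill": with_gaps.ffill(),
    "backward fill": with_gaps.bfill(),
    "linear interpolation": with_gaps.interpolate(method="linear"),
}

# Mean squared error of the imputed values against the actual values
mse = {}
for name, filled in candidates.items():
    err = filled.iloc[missing_idx] - actual.iloc[missing_idx]
    mse[name] = float((err ** 2).mean())
    print(f"{name}: MSE = {mse[name]:.6f}")
```

On a smooth series like this one, interpolation should come out ahead of simply carrying a neighbouring value over; on a noisy series the ranking can differ, which is why it is worth measuring.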
You could also consider the following approaches depending on how accurate you want the imputations to be.
If you have explanatory variables, use a prediction model like the random forest or k-Nearest Neighbors to predict it.
If you have enough past observations, forecast the missing values.
If you have enough future observations, backcast the missing values.
Forecast of counterparts from previous cycles.
Autocorrelation is simply the correlation of a series with its own lags. If a series is significantly autocorrelated, that means, the previous values of the series (lags) may be helpful in predicting the current value.
Partial Autocorrelation also conveys similar information but it conveys the pure correlation of a series and its lag, excluding the correlation contributions from the intermediate lags.
So how to compute partial autocorrelation?
The partial autocorrelation of lag (k) of a series is the coefficient of that lag in the autoregression equation of Y. The autoregressive equation of Y is nothing but the linear regression of Y with its own lags as predictors.
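The definition above translates fairly directly into code. The helper below (a hypothetical name, `pacf_by_regression`) regresses the series on its first k lags with ordinary least squares and returns the coefficient of lag k. It is a sketch of the idea on a simulated AR(1) series, not a replacement for a library routine.

```python
import numpy as np

def pacf_by_regression(y, k):
    """Coefficient of lag k when y_t is regressed on lags 1..k plus a constant."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    target = y[k:]
    # Column j-1 holds lag j, i.e. y[t - j] for t = k .. n-1
    lags = np.column_stack([y[k - j: n - j] for j in range(1, k + 1)])
    X = np.column_stack([np.ones(n - k), lags])
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coefs[-1]  # last column is lag k

# Simulated AR(1) series with coefficient 0.7 (made-up example)
rng = np.random.default_rng(7)
e = rng.normal(size=5000)
y = np.zeros(5000)
for t in range(1, 5000):
    y[t] = 0.7 * y[t - 1] + e[t]

pacf_1 = pacf_by_regression(y, 1)
pacf_3 = pacf_by_regression(y, 3)
print(f"lag 1: {pacf_1:.3f}, lag 3: {pacf_3:.3f}")
```

For an AR(1) series, only the first lag carries a sizeable partial autocorrelation; the later lags should be close to zero. In practice, `statsmodels.tsa.stattools.pacf` computes these values for you.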
A Lag plot is a scatter plot of a time series against a lag of itself. It is normally used to check for autocorrelation. If there is any pattern existing in the series like the one you see below, the series is autocorrelated. If there is no such pattern, the series is likely to be random white noise.
In the example below on the Sunspots area time series, the plots get more and more scattered as the n_lag increases.
The more regular and repeatable patterns a time series has, the easier it is to forecast. The ‘Approximate Entropy’ can be used to quantify the regularity and unpredictability of fluctuations in a time series.
The higher the approximate entropy, the more difficult it is to forecast it.
A better alternative is the ‘Sample Entropy’.
Sample Entropy is similar to approximate entropy but is more consistent in estimating the complexity even for smaller time series. For example, a random time series with fewer data points can have a lower ‘approximate entropy’ than a more ‘regular’ time series, whereas, a longer random time series will have a higher ‘approximate entropy’.
Sample Entropy handles this problem nicely. See the demonstration below.
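Below is a compact numpy sketch of both measures, following their textbook definitions. The defaults `m=2` and `r = 0.2 * std` are common choices but are assumptions here, and the two test series (a sine wave and Gaussian noise) are made up for the comparison.

```python
import numpy as np

def _pairwise_chebyshev(windows):
    """Max absolute difference between every pair of windows."""
    return np.max(np.abs(windows[:, None, :] - windows[None, :, :]), axis=2)

def approximate_entropy(x, m=2, r=None):
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * np.std(x)

    def phi(length):
        w = np.array([x[i:i + length] for i in range(len(x) - length + 1)])
        # Fraction of windows within tolerance r (self-match included)
        c = np.mean(_pairwise_chebyshev(w) <= r, axis=1)
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)

def sample_entropy(x, m=2, r=None):
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * np.std(x)
    n = len(x)

    def count_matches(length):
        # Same number of templates (n - m) for both lengths
        w = np.array([x[i:i + length] for i in range(n - m)])
        # Matching pairs, excluding the self-matches on the diagonal
        return np.sum(_pairwise_chebyshev(w) <= r) - len(w)

    return -np.log(count_matches(m + 1) / count_matches(m))

rng = np.random.default_rng(0)
regular = np.sin(np.linspace(0, 20 * np.pi, 300))  # highly repeatable
random_series = rng.normal(size=300)               # no structure

print("ApEn  :", approximate_entropy(regular), approximate_entropy(random_series))
print("SampEn:", sample_entropy(regular), sample_entropy(random_series))
```

Both measures should come out clearly lower for the sine wave than for the noise. The pairwise distance matrix makes this quadratic in memory, so it is only meant for short series.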
Smoothening of a time series may be useful in:
Reducing the effect of noise in a signal to get a fair approximation of the noise-filtered series.
The smoothed version of series can be used as a feature to explain the original series itself.
Visualize the underlying trend better
So how to smoothen a series? Let’s discuss the following methods:
Take a moving average
Do a LOESS smoothing (Localized Regression)
Do a LOWESS smoothing (Locally Weighted Regression)
Moving average is nothing but the average of a rolling window of defined width. But you must choose the window width wisely, because a large window size will over-smooth the series. For example, a window size equal to the seasonal duration (ex: 12 for a month-wise series) will effectively nullify the seasonal effect.
LOESS, short for ‘LOcalized regrESSion’, fits multiple regressions in the local neighborhood of each point. It is implemented in the statsmodels package, where you can control the degree of smoothing using the frac argument, which specifies the fraction of nearby data points that should be considered to fit a regression model.
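A short sketch of both smoothers on a synthetic series (made-up trend, 12-step seasonality and noise). The window sizes and frac values are arbitrary picks for illustration; the smoothing function used is `lowess` from statsmodels.

```python
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic series (made-up): trend + 12-step seasonality + noise
rng = np.random.default_rng(5)
t = np.arange(120)
y = pd.Series(
    0.1 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=120)
)

# Moving averages: a narrow window and one equal to the seasonal duration
ma_3 = y.rolling(window=3, center=True).mean()
ma_12 = y.rolling(window=12, center=True).mean()

# Locally weighted regression with two different fractions of the data;
# lowess returns (x, fitted) pairs, so keep the second column
loess_15 = lowess(y.to_numpy(), t, frac=0.15)[:, 1]
loess_5 = lowess(y.to_numpy(), t, frac=0.05)[:, 1]

print("roughness (std of step-to-step change):")
print("  original:", round(float(y.diff().std()), 3))
print("  MA(12):  ", round(float(ma_12.diff().std()), 3))
print("  LOESS 15%:", round(float(np.std(np.diff(loess_15))), 3))
```

The 12-wide moving average should flatten the seasonal swings almost entirely, which is the over-smoothing effect mentioned above.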
Granger causality test is used to determine if one time series will be useful to forecast another.
How does Granger causality test work?
It is based on the idea that if X causes Y, then the forecast of Y based on previous values of Y AND the previous values of X should outperform the forecast of Y based on previous values of Y alone.
So, understand that Granger causality should not be used to test if a lag of Y causes Y. Instead, it is generally used on exogenous (not Y lag) variables only.
It is nicely implemented in the statsmodels package.
It accepts a 2D array with 2 columns as the main argument. The values (Y) are in the first column and the predictor (X) is in the second column.
The Null hypothesis is: the series in the second column does not Granger cause the series in the first. If the P-Values are less than a significance level (0.05) then you reject the null hypothesis and conclude that the said lag of X is indeed useful.
The second argument, maxlag, says up to how many lags should be included in the test.
In the above case, the P-Values are Zero for all tests. So the ‘month’ indeed can be used to forecast the Air Passengers.
That’s it for now. We started from the very basics and understood various characteristics of a time series. Once the analysis is done the next step is to begin forecasting.
In the next post, I will walk you through the in-depth process of building time series forecasting models using ARIMA. See you soon.
Reference : https://www.machinelearningplus.com/time-series/time-series-analysis-python/
For example, if Y_t is the current series and Y_t-1 is the lag 1 of Y, then the partial autocorrelation of lag 3 (Y_t-3) is the coefficient $\alpha_3$ of Y_t-3 in the following autoregression equation:
$$Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + \alpha_3 Y_{t-3}$$