2  Data Wrangling

Data wrangling involves cleaning and restructuring raw data into a more organized and usable form. The special characteristics of time series mean that it is helpful to store them in a dedicated data structure.

A time series comprises a series of measurements along with information about when those measurements were taken (the time index). The tsibble package implements a convenient data structure, the tsibble, for storing time series data. To illustrate how it works, we create a simple example as follows.

library(tsibble)

example1 <- tsibble(
  year = 2015:2019,             # (1)
  y = c(123, 39, 78, 52, 110),  # (2)
  index = year                  # (3)
)

str(example1)
# output:
# tbl_ts [5 × 2] (S3: tbl_ts/tbl_df/tbl/data.frame)
# ...

(1) Creates a vector of years.
(2) y is the vector of values in the time series.
(3) The index argument refers back to the year column that was just created.

Applying str() to the created object shows that it has class tbl_ts (tsibble), and that it also inherits from tbl_df (tibble) and the base R data.frame. This means that dplyr verbs (such as filter, select, and mutate) work as usual. However, a tsibble has three types of columns:

  1. Measurement variables. Note that there could be more than one type of measurement at each time point.
  2. The index variable: a single column denoting the time point of each measurement.
  3. Key variables: a set of columns whose unique combinations define a single time series.
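
The three column types can be inspected directly. The following is a minimal sketch, assuming the tsibble and dplyr packages are installed; the object example2 and its sales and costs columns are hypothetical and serve only as illustration.

```r
library(tsibble)
library(dplyr)

# Hypothetical toy data: one index column, two measurement columns, no key.
example2 <- tsibble(
  year  = 2015:2019,
  sales = c(123, 39, 78, 52, 110),
  costs = c(100, 45, 60, 50, 90),
  index = year
)

index_var(example2)      # name of the index column
key_vars(example2)       # names of the key columns (none here)
measured_vars(example2)  # names of the measurement columns

# dplyr verbs return a tsibble, so the time structure is preserved.
recent <- example2 |>
  filter(year >= 2017) |>
  mutate(profit = sales - costs)

is_tsibble(recent)
```

Because there is only one series in this example, no key is needed; keys become necessary once several series share the same time points.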

We discuss the second and third types in more detail.

2.1 The index variable

The index column is what introduces the temporal component of the data: it associates the measurements in each row with a specific time point. The time points should be stored in one of R's time classes, and could correspond to yearly, quarterly, monthly, weekly, daily or sub-daily intervals. Consider the examples below, noting that the annotation “[1Y]”, “[1M]”, etc. tells us the interval between observations of each time series.

aus_airpassengers
# A tsibble: 47 x 2 [1Y]
    Year Passengers
   <dbl>      <dbl>
 1  1970       7.32
 2  1971       7.33
 3  1972       7.80
 4  1973       9.38
 5  1974      10.7 
 6  1975      11.1 
 7  1976      10.9 
 8  1977      11.3 
 9  1978      12.1 
10  1979      13.0 
# ℹ 37 more rows

Total annual air passengers (in millions) including domestic and international aircraft passengers of air carriers registered in Australia.

us_employment
# A tsibble: 143,412 x 4 [1M]
# Key:       Series_ID [148]
      Month Series_ID     Title         Employed
      <mth> <chr>         <chr>            <dbl>
 1 1939 Jan CEU0500000001 Total Private    25338
 2 1939 Feb CEU0500000001 Total Private    25447
 3 1939 Mar CEU0500000001 Total Private    25833
 4 1939 Apr CEU0500000001 Total Private    25801
 5 1939 May CEU0500000001 Total Private    26113
 6 1939 Jun CEU0500000001 Total Private    26485
 7 1939 Jul CEU0500000001 Total Private    26481
 8 1939 Aug CEU0500000001 Total Private    26848
 9 1939 Sep CEU0500000001 Total Private    27468
10 1939 Oct CEU0500000001 Total Private    27830
# ℹ 143,402 more rows

US employment data from January 1939 to June 2019. Each ‘Series_ID’ represents a different sector of the economy.

aus_arrivals
# A tsibble: 508 x 3 [1Q]
# Key:       Origin [4]
   Quarter Origin Arrivals
     <qtr> <chr>     <int>
 1 1981 Q1 Japan     14763
 2 1981 Q2 Japan      9321
 3 1981 Q3 Japan     10166
 4 1981 Q4 Japan     19509
 5 1982 Q1 Japan     17117
 6 1982 Q2 Japan     10617
 7 1982 Q3 Japan     11737
 8 1982 Q4 Japan     20961
 9 1983 Q1 Japan     20671
10 1983 Q2 Japan     12235
# ℹ 498 more rows

Quarterly international arrivals to Australia from Japan, New Zealand, UK and the US. 1981Q1 - 2012Q3.

us_gasoline
# A tsibble: 1,355 x 2 [1W]
       Week Barrels
     <week>   <dbl>
 1 1991 W06    6.62
 2 1991 W07    6.43
 3 1991 W08    6.58
 4 1991 W09    7.22
 5 1991 W10    6.88
 6 1991 W11    6.95
 7 1991 W12    7.33
 8 1991 W13    6.78
 9 1991 W14    7.50
10 1991 W15    6.92
# ℹ 1,345 more rows

Million barrels per day, beginning Week 6, 1991, ending Week 3, 2017.

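The easiest way to create such a column is to use one of the convenience functions listed below. The functions for quarterly, monthly and weekly data come from the tsibble package, while those for daily and sub-daily data come from the lubridate package.

Functions for creating the index column
Frequency   Function                    Package
Annual      Use integers in R           (none needed)
Quarterly   yearquarter()               tsibble
Monthly     yearmonth()                 tsibble
Weekly      yearweek()                  tsibble
Daily       as_date(), ymd()            lubridate
Sub-daily   as_datetime(), ymd_hms()    lubridate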
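
As a rough illustration of these functions, the sketch below converts some raw character values into index columns. It assumes the tsibble and lubridate packages are installed (yearquarter(), yearmonth() and yearweek() are provided by tsibble); the input vectors and the object monthly are hypothetical.

```r
library(tsibble)
library(lubridate)

# Hypothetical raw inputs, as they might arrive from a text file.
month_chr <- c("2019 Jan", "2019 Feb", "2019 Mar")
date_chr  <- c("2019-01-15", "2019-04-15", "2019-07-15")

months   <- yearmonth(month_chr)            # tsibble: monthly index
quarters <- yearquarter(ymd(date_chr))      # tsibble: quarterly index from dates
weeks    <- yearweek(ymd(date_chr))         # tsibble: weekly index from dates
days     <- ymd(date_chr)                   # lubridate: daily index
stamps   <- ymd_hms("2019-01-15 09:30:00")  # lubridate: sub-daily index

# The interval of a tsibble is inferred from the class of its index.
monthly <- tsibble(month = months, y = c(5, 7, 6), index = month)

# Both packages export a function named interval(), so name the package.
tsibble::interval(monthly)
```

Using the matching time class matters: it is what allows the tsibble to report the interval between observations.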

2.2 Key variables

A single tsibble can store multiple time series. It makes sense to do this when the time points overlap and we wish to compare and contrast the different series. Consider the dataset us_employment, containing US employment data from January 1939 to June 2019.

us_employment
# Output:
# A tsibble: 143,412 x 4 [1M]
# Key:       Series_ID [148]
#       Month Series_ID     Title         Employed
#       <mth> <chr>         <chr>            <dbl>

The printed output states that the tsibble’s key is determined by a single column: Series_ID. Inspecting the metadata by running help(us_employment) tells us that each ‘Series_ID’ represents a different sector of the economy. There can of course be multiple key columns, such as in olympic_running, which records the fastest running time for each Olympic event.

olympic_running
# Output:
# A tsibble: 312 x 4 [4Y]
# Key:       Length, Sex [14]
#     Year Length Sex    Time
#    <int>  <int> <chr> <dbl>

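The number of unique combinations of values in the key columns is the number of time series in the tsibble. In olympic_running, the 14 combinations of Length and Sex give 14 time series.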
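
The relationship between key columns and the number of series can be checked on a small example. This is a sketch only, assuming the tsibble package is installed; the races data frame and its made-up Time values are hypothetical and merely mimic the layout of olympic_running.

```r
library(tsibble)

# Hypothetical data: 2 lengths x 2 sexes = 4 series, 3 time points each.
races <- expand.grid(
  Year   = c(2008L, 2012L, 2016L),
  Length = c(100L, 200L),
  Sex    = c("men", "women"),
  stringsAsFactors = FALSE
)
races$Time <- seq(10, by = 0.5, length.out = nrow(races))  # made-up times

# Two key columns: each Length/Sex combination identifies one series.
races_ts <- as_tsibble(races, index = Year, key = c(Length, Sex))

key_vars(races_ts)  # names of the key columns
n_keys(races_ts)    # number of distinct series
key_data(races_ts)  # one row per series
```

Each row of key_data() corresponds to one time series, so its row count matches n_keys().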

Note
  • There could be multiple measurement columns and multiple key columns in a tsibble.
  • There could be columns that are neither measurement nor key columns in a tsibble.
  • There must be exactly one index column in a tsibble.

2.3 Creating tsibbles

Time series data in the wild do not automatically occur as tsibbles. They are instead stored as csv files or in other formats, and hence have to be converted to tsibble format. The function as_tsibble() converts a tibble object into a tsibble. Its two main arguments, index and key, specify the index and key columns respectively. This often has to be preceded by several dplyr function calls on the tibble to make sure it has the appropriate format, for instance converting a character column of dates to a time class. For an example of how to do this, refer to Section 2.1 in Hyndman and Athanasopoulos (2018).
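
A typical conversion might look like the sketch below. It assumes the tsibble, dplyr and tibble packages are installed; the raw table, with its Month, Region and Sales columns, is hypothetical and stands in for the result of reading a csv file.

```r
library(tsibble)
library(dplyr)

# Hypothetical raw table, standing in for data read from a csv file.
raw <- tibble::tibble(
  Month  = rep(c("2019 Jan", "2019 Feb", "2019 Mar"), times = 2),
  Region = rep(c("North", "South"), each = 3),
  Sales  = c(10, 12, 11, 20, 19, 23)
)

sales_ts <- raw |>
  mutate(Month = yearmonth(Month)) |>      # character -> monthly time class
  as_tsibble(index = Month, key = Region)  # one series per Region

sales_ts
```

The mutate() step comes first because as_tsibble() relies on the class of the index column to work out the interval between observations.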