2

I need to extract a year corresponding to a time period and put it into a new column in a data frame. The tricky part is that the years I need to extract are not calendar years. My 'standardized' years start the 1st of July of a given calendar year and end up the 30th of June of the subsequent calendar year. So, if an event occurred anytime within this time period, the standardized year is the first calendar year in this time period involving two calendar years. For example, if events occurred the 1st of July 2019, the 25th of December 2019, and the 30th of June 2020, the 'standardized' year for all these three events is 2019 (i.e. the first year within the time period between 1st of Jul 2019 and 30th June 2020). How can I extract such standardized years in R and assigned it to a new column in the data frame?

My data file is very large, but as a simplified example, here are some events happening in specific dates

dat <- as.Date(c("2-Feb-18", "24-May-10", "30-Dec-19","1-Jul-20"),"%d-%b-%y")
dat <- as.data.frame(dat)
names(dat)[1] <- "events"
dat
      events
1 2018-02-02
2 2010-05-24
3 2019-12-30
4 2020-07-01

In this case the column I want to create 'standardized_year' should look like this

      events standardized_year
1 2018-02-02              2017
2 2010-05-24              2009
3 2019-12-30              2019
4 2020-07-01              2020

In the first row, the standardized year is 2017 because 2-Feb-18 is within 1st of Jul 2017 and 30th Jun 2018, so the first year is extracted within the 'standardized' year. The same criteria for all other values.

Is there a way to do this automatically in R for a large amount of events in a dataframe?

Any help would be much appreciated. Thanks

2 Answers2

2

You could extract the actual year from first four substrings, then compare if it is smaller than all dates of the standardized year starting in the actual year. The standardized year we get from sequences with ISOdate.

standardized_year <- sapply(dat$events, function(x) {
  x <- as.POSIXct(x)
  y <- as.numeric(substr(x, 1, 4))
  ifelse(all(x < seq(ISOdate(y, 7, 1, 0), ISOdate(y + 1, 6, 30, 0), "day")), y - 1, y)
})
dat <- cbind(dat, standardized_year)
dat
#       events standardized_year
# 1 2018-02-02              2017
# 2 2010-05-24              2009
# 3 2019-12-30              2019
# 4 2020-07-01              2020
jay.sf
  • 60,139
  • 8
  • 53
  • 110
0

I guess you can do something likes this:

library(dplyr)
library(lubridate)

events <- events %>% 
     mutate(events = dmy(events), 
           standardized_year = ifelse(month(events)>=7, year(events), year(events)-1))
     
José
  • 921
  • 14
  • 21