I am trying to use dateparser to parse dates with years earlier than 1000, i.e. years with fewer than four digits.
import dateparser
value = "july 900"
result = dateparser.parse(value)
result is None # True
At first I thought it was related to the problem mentioned here: Use datetime.strftime() on years before 1900? ("require year >= 1900"), because with certain inputs (like just 900) the result was the current day and month combined with the year 1900.
But after more trials with random dates and relative expressions, I noticed that dateparser can output dates earlier than 1000. I then figured out that if I zero-pad the year, the result is correct.
import dateparser
value = "july 0900"
result = dateparser.parse(value)
result is None # False
result # datetime.datetime(900, 7, 4, 0, 0)
I have found this in my search for a solution:
https://github.com/scrapinghub/dateparser/issues/410
but the final comment left me with more questions than answers, as I have failed to find a way to pass a custom parser to dateparser's internal use of dateutil.parser.
My current solution is to use a regex to look for three-digit year patterns, something similar to this: (.* +| *|.+[\/\-.]{1,})([1-9][0-9]{2,})( *| +.*|[\/\-.]{1,}.+) and zero-pad them in place.
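For reference, the padding step can be sketched roughly as below. This is a simplified stand-in for the regex above, not the exact pattern I use: the helper name pad_short_years is hypothetical, and it assumes that any standalone three-digit number in the input is a year, which will not hold for every input (e.g. "100 days ago").

```python
import re

# A run of exactly three digits that does not start with 0 and is not
# adjacent to other digits, so "1900" and "0900" are left untouched.
_THREE_DIGIT_YEAR = re.compile(r"(?<!\d)([1-9]\d{2})(?!\d)")


def pad_short_years(text: str) -> str:
    """Zero-pad standalone three-digit numbers to four digits (hypothetical helper)."""
    return _THREE_DIGIT_YEAR.sub(lambda m: m.group(1).zfill(4), text)


print(pad_short_years("july 900"))  # july 0900
print(pad_short_years("4/7/900"))   # 4/7/0900
```

The padded string is then passed to dateparser.parse() exactly as in the zero-padded example above.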
Is there a better way to do this?
EDIT:
Is there also an elegant solution for parsing dates before the common era (e.g. BC)? (It seems that the dateparser settings key SUPPORT_BEFORE_COMMON_ERA doesn't do much in this regard, and all the other settings seemed to be unrelated.)
This is for use on an archeological dating site.