I am a new user of Dask and RAPIDS.
An excerpt of my data (in CSV format):
Symbol,Date,Open,High,Low,Close,Volume
AADR,17-Oct-2017 09:00,57.47,58.3844,57.3645,58.3844,2094
AADR,17-Oct-2017 10:00,57.27,57.2856,57.25,57.27,627
AADR,17-Oct-2017 11:00,56.99,56.99,56.99,56.99,100
AADR,17-Oct-2017 12:00,56.98,57.05,56.98,57.05,200
AADR,17-Oct-2017 13:00,57.14,57.16,57.14,57.16,700
AADR,17-Oct-2017 14:00,57.13,57.13,57.13,57.13,100
AADR,17-Oct-2017 15:00,57.07,57.07,57.07,57.07,200
AAMC,17-Oct-2017 09:00,87,87,87,87,100
AAU,17-Oct-2017 09:00,1.1,1.13,1.0832,1.121,67790
AAU,17-Oct-2017 10:00,1.12,1.12,1.12,1.12,100
AAU,17-Oct-2017 11:00,1.125,1.125,1.125,1.125,200
AAU,17-Oct-2017 12:00,1.1332,1.15,1.1332,1.15,27439
AAU,17-Oct-2017 13:00,1.15,1.15,1.13,1.13,8200
AAU,17-Oct-2017 14:00,1.1467,1.1467,1.14,1.1467,1750
AAU,17-Oct-2017 15:00,1.1401,1.1493,1.1401,1.1493,4100
AAU,17-Oct-2017 16:00,1.13,1.13,1.13,1.13,100
ABE,17-Oct-2017 09:00,14.64,14.64,14.64,14.64,200
ABE,17-Oct-2017 10:00,14.67,14.67,14.66,14.66,1200
ABE,17-Oct-2017 11:00,14.65,14.65,14.65,14.65,600
ABE,17-Oct-2017 15:00,14.65,14.65,14.65,14.65,836
Note that the Date column is of type string.
I have some example stock-market timeseries data (i.e., DOHLCV) in CSV files, and I read them into a dask_cudf DataFrame (my dask.dataframe backend is cudf, so read_csv is a creation dispatcher that conveniently gives me a cudf-backed DataFrame).
import cudf
import dask_cudf
from dask import dataframe as dd

ddf = dd.read_csv('path/to/my/data/*.csv')
ddf
# output
<dask_cudf.DataFrame | 450 tasks | 450 npartitions>

# The test CSV data above can be reproduced with:
# import pandas as pd
# df = pd.read_clipboard(sep=",")
# cdf = cudf.from_pandas(df)
# ddf = dask_cudf.from_cudf(cdf, npartitions=2)
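For context, here is how I select the cudf backend (a minimal sketch; my understanding is that dask's dataframe.backend config option is what wires up the dispatch, but treat that as my assumption):
import dask

# With this set, dd.read_csv dispatches to the dask_cudf implementation,
# so the resulting collection is GPU-backed.
dask.config.set({"dataframe.backend": "cudf"})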
I then try to convert the datetime strings into real datetime objects (np.datetime64[ns], or whatever is equivalent in the cudf/dask world), but it fails with an error.
df["Date"] = dd.to_datetime(df["Date"], format="%d-%b-%Y %H:%M").head(5)
df.set_index("Date", inplace=True) # This failed with different error, will raise in a different SO thread.
# Following statement gives me same error.
# cudf.to_datetime(df["Date"], format="%d-%b-%Y %H:%M")
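For completeness, the lazy, per-partition route I would have expected to work looks roughly like this (a sketch of my understanding only; I have not verified that map_partitions dispatches to cudf.to_datetime cleanly):
# Hypothetical lazy variant: apply cudf.to_datetime to each cudf.Series
# partition; meta declares the resulting dtype without computing anything.
ddf["Date"] = ddf["Date"].map_partitions(
    cudf.to_datetime,
    format="%d-%b-%Y %H:%M",
    meta=("Date", "datetime64[ns]"),
)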
The full error log for the failing statements above is at the end of this post.
The error message seems to suggest that I'd need to compute the dask_cudf DataFrame, turning it into a real cudf object, and then proceed as I would in pandas:
gdf = ddf.compute()  # materialize everything as a single cudf.DataFrame
gdf["Date"] = cudf.to_datetime(gdf["Date"], format="%d-%b-%Y %H:%M")
gdf = gdf.set_index("Date")
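And if I then wanted to keep working lazily afterwards, I'd have to round-trip back to a dask collection (a sketch illustrating why this feels wasteful):
# Rebuild the lazy collection from the eagerly converted cudf.DataFrame,
# throwing away the laziness we had before the conversion.
ddf = dask_cudf.from_cudf(gdf, npartitions=ddf.npartitions)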
This obviously isn't ideal, and deferring work is exactly what Dask is for: we'd keep the computation lazy and only materialize the final result we need.
What is the dask/dask_cudf way to convert a string column to a datetime column? As far as I can tell, when the backend is pandas the conversion works smoothly and rarely causes problems.
Or is it that cudf, or the GPU world in general, is simply not meant to do much with types like datetime and string? (e.g., ideally the GPU is reserved for expensive numerical computations.)
My use case involves filtering on both string and datetime columns, so I need to set up the DataFrame with a proper datetime dtype.
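For a concrete (hypothetical) example of the kind of downstream query I'm after, once Date is a real datetime column:
# Hypothetical filter: AAU rows from the morning session only.
# Assumes ddf["Date"] has already been converted to datetime64[ns].
aau_morning = ddf[(ddf["Symbol"] == "AAU") & (ddf["Date"].dt.hour < 12)]
print(aau_morning.compute())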
Error Log
TypeError Traceback (most recent call last)
Cell In[52], line 1
----> 1 dd.to_datetime(df["Date"], format="%d-%b-%Y %H:%M").head(2)
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/dataframe/core.py:1268, in _Frame.head(self, n, npartitions, compute)
1266 # No need to warn if we're already looking at all partitions
1267 safe = npartitions != self.npartitions
-> 1268 return self._head(n=n, npartitions=npartitions, compute=compute, safe=safe)
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/dataframe/core.py:1302, in _Frame._head(self, n, npartitions, compute, safe)
1297 result = new_dd_object(
1298 graph, name, self._meta, [self.divisions[0], self.divisions[npartitions]]
1299 )
1301 if compute:
-> 1302 result = result.compute()
1303 return result
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/base.py:314, in DaskMethodsMixin.compute(self, **kwargs)
290 def compute(self, **kwargs):
291 """Compute this dask collection
292
293 This turns a lazy Dask collection into its in-memory equivalent.
(...)
312 dask.base.compute
313 """
--> 314 (result,) = compute(self, traverse=False, **kwargs)
315 return result
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/base.py:599, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
596 keys.append(x.__dask_keys__())
597 postcomputes.append(x.__dask_postcompute__())
--> 599 results = schedule(dsk, keys, **kwargs)
600 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/threaded.py:89, in get(dsk, keys, cache, num_workers, pool, **kwargs)
86 elif isinstance(pool, multiprocessing.pool.Pool):
87 pool = MultiprocessingPoolExecutor(pool)
---> 89 results = get_async(
90 pool.submit,
91 pool._max_workers,
92 dsk,
93 keys,
94 cache=cache,
95 get_id=_thread_get_id,
96 pack_exception=pack_exception,
97 **kwargs,
98 )
100 # Cleanup pools associated to dead threads
101 with pools_lock:
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/local.py:511, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
509 _execute_task(task, data) # Re-execute locally
510 else:
--> 511 raise_exception(exc, tb)
512 res, worker_id = loads(res_info)
513 state["cache"][key] = res
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/local.py:319, in reraise(exc, tb)
317 if exc.__traceback__ is not tb:
318 raise exc.with_traceback(tb)
--> 319 raise exc
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/local.py:224, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
222 try:
223 task, data = loads(task_info)
--> 224 result = _execute_task(task, data)
225 id = get_id()
226 result = dumps((result, id))
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
115 func, args = arg[0], arg[1:]
116 # Note: Don't assign the subtask results to a variable. numpy detects
117 # temporaries by their reference count and can execute certain
118 # operations in-place.
--> 119 return func(*(_execute_task(a, cache) for a in args))
120 elif not ishashable(arg):
121 return arg
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/optimization.py:990, in SubgraphCallable.__call__(self, *args)
988 if not len(args) == len(self.inkeys):
989 raise ValueError("Expected %d args, got %d" % (len(self.inkeys), len(args)))
--> 990 return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/core.py:149, in get(dsk, out, cache)
147 for key in toposort(dsk):
148 task = dsk[key]
--> 149 result = _execute_task(task, cache)
150 cache[key] = result
151 result = _execute_task(out, cache)
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
115 func, args = arg[0], arg[1:]
116 # Note: Don't assign the subtask results to a variable. numpy detects
117 # temporaries by their reference count and can execute certain
118 # operations in-place.
--> 119 return func(*(_execute_task(a, cache) for a in args))
120 elif not ishashable(arg):
121 return arg
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/utils.py:72, in apply(func, args, kwargs)
41 """Apply a function given its positional and keyword arguments.
42
43 Equivalent to ``func(*args, **kwargs)``
(...)
69 >>> dsk = {'task-name': task} # adds the task to a low level Dask task graph
70 """
71 if kwargs:
---> 72 return func(*args, **kwargs)
73 else:
74 return func(*args)
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/dask/dataframe/core.py:6821, in apply_and_enforce(*args, **kwargs)
6819 func = kwargs.pop("_func")
6820 meta = kwargs.pop("_meta")
-> 6821 df = func(*args, **kwargs)
6822 if is_dataframe_like(df) or is_series_like(df) or is_index_like(df):
6823 if not len(df):
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:1100, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
1098 result = _convert_and_box_cache(argc, cache_array)
1099 else:
-> 1100 result = convert_listlike(argc, format)
1101 else:
1102 result = convert_listlike(np.array([arg]), format)[0]
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:413, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
410 return idx
411 raise
--> 413 arg = ensure_object(arg)
414 require_iso8601 = False
416 if infer_datetime_format and format is None:
File pandas/_libs/algos_common_helper.pxi:33, in pandas._libs.algos.ensure_object()
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/cudf/core/frame.py:451, in Frame.__array__(self, dtype)
450 def __array__(self, dtype=None):
--> 451 raise TypeError(
452 "Implicit conversion to a host NumPy array via __array__ is not "
453 "allowed, To explicitly construct a GPU matrix, consider using "
454 ".to_cupy()\nTo explicitly construct a host matrix, consider "
455 "using .to_numpy()."
456 )
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().