I'm converting pyspark data frames to pandas data frames using toPandas(). However, because some Spark data types have no direct pandas equivalent, pandas casts certain columns, such as decimal fields, to object.
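Here's a minimal reproduction of my setup (the column names and values are made up for illustration; it assumes a local pyspark installation):

from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.types import DecimalType, StringType, StructField, StructType

spark = SparkSession.builder.master("local[1]").getOrCreate()

schema = StructType([
    StructField("price", DecimalType(10, 2)),  # comes back as object (Decimal values)
    StructField("label", StringType()),        # also comes back as object (str values)
])
sdf = spark.createDataFrame(
    [(Decimal("1.50"), "Red_Apple"), (Decimal("2.25"), "Orange")], schema
)

df = sdf.toPandas()
print(df.dtypes)  # price: object, label: object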
I'd like to run .str on my columns with actual strings, but can't seem to get it to work (without explicitly finding which columns to convert first).
I run into:
AttributeError: Can only use .str accessor with string values!
I've tried df.fillna(0) and df.infer_objects(), to no avail. I can't seem to get the objects to register as int64 or float64, so I can't do:
for col in df.columns:
    if df[col].dtype == object:  # np.object is just a deprecated alias for the builtin object
        # insert logic here
        pass
beforehand.
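Continuing the toy reproduction above, you can see why this check is useless here:

print(df.fillna(0).dtypes)        # still object; there were no NaNs to fill
print(df.infer_objects().dtypes)  # still object; Decimal values aren't soft-converted

for col in df.columns:
    print(col, df[col].dtype == object)  # True for BOTH price and label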
I also cannot use .str.contains to pre-filter, because even though the columns with numeric values have dtype object, calling .str on them errors out instead of returning False. (For reference, what I'm trying to do is: if a column in the data frame actually holds string values, run str.split() on it.)
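With the same toy frame:

df["label"].str.contains("_")  # works: [True, False]
df["price"].str.contains("_")  # AttributeError: Can only use .str accessor with string values!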
Any ideas?
Note: I am curious about an answer on the pandas side, without having to explicitly identify which columns actually hold strings beforehand. One possible fallback is to get the list of string columns on the pyspark side and pass those as the columns to run .str methods on, sketched below.
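Something like this is what I mean by the pyspark-side fallback (reusing sdf and df from the reproduction above):

from pyspark.sql.types import StringType

string_cols = [f.name for f in sdf.schema.fields if isinstance(f.dataType, StringType)]
for col in string_cols:
    df[col] = df[col].str.split("_")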
I also tried astype(str), but it won't work because some of the objects are arrays (lists). E.g. if I wanted to split on _ and a cell held the list ['Red_Apple', 'Orange'], astype(str) would stringify the whole list, so .str.split('_') would split the string representation of the list rather than the individual elements, which doesn't make sense. I only want to split genuine string columns, not stringify arrays and split those too.
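A quick demonstration of that failure mode:

import pandas as pd

s = pd.Series([["Red_Apple", "Orange"]])
print(s.astype(str).str.split("_").iloc[0])
# ["['Red", "Apple', 'Orange']"]  -- the whole list was stringified, then split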