I have two large Parquet DataFrames and I want to join them on userId.
What should I do to get good join performance?
Should I modify the code that writes those files in order to:
- partitionBy on the userId (very sparse).
- partitionBy on the first N chars of the userId (afaik, if the data are already partitioned on the same key, the join will occur with no shuffle).
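To make the second option concrete, here is a minimal plain-Python sketch of the idea I have in mind. It is not Spark code and says nothing about what Spark's planner actually does. The function names (`partition_by_prefix`, `co_partitioned_join`) and the sample rows are made up for illustration. The assumption is that if both datasets are split by the same userId prefix, each prefix bucket can be joined on its own without looking at any other bucket.

```python
from collections import defaultdict


def partition_by_prefix(rows, n):
    """Group (user_id, value) rows into buckets keyed by the first n chars of user_id."""
    buckets = defaultdict(list)
    for user_id, value in rows:
        buckets[user_id[:n]].append((user_id, value))
    return buckets


def co_partitioned_join(left, right, n):
    """Inner-join two row lists on user_id, one prefix bucket at a time."""
    left_parts = partition_by_prefix(left, n)
    right_parts = partition_by_prefix(right, n)
    joined = []
    # Only prefixes present on both sides can produce matches.
    for prefix in sorted(left_parts.keys() & right_parts.keys()):
        # Build a lookup from this bucket only; no other bucket is consulted.
        lookup = defaultdict(list)
        for user_id, value in right_parts[prefix]:
            lookup[user_id].append(value)
        for user_id, value in left_parts[prefix]:
            for other in lookup.get(user_id, []):
                joined.append((user_id, value, other))
    return joined


# Hypothetical sample data: (userId, payload)
users = [("ab12", "alice"), ("ab34", "bob"), ("cd56", "carol")]
events = [("ab12", "login"), ("cd56", "click"), ("ef78", "view")]

result = co_partitioned_join(users, events, n=2)
print(result)
```

With N = 2 the sample data falls into the buckets "ab", "cd" and "ef", and only "ab" and "cd" exist on both sides, so only those two are joined. What I don't know is whether Spark can take advantage of such a layout when the files were written with partitionBy.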
On the read side, is it better to use RDDs or DataFrames?