
I want to access the values of a particular column from a dataset that I've read from a CSV file. The dataset is stored in a PySpark RDD, which I want to convert into a DataFrame. I am using the code below:

from pyspark.sql import SQLContext
sqlc=SQLContext(sc)
df=sc.textFile(r'D:\Home\train.csv')
df=sqlc.createDataFrame(df)

but it shows the error:

Can not infer schema for type: <class 'str'>

First 2 rows of df are :

['"id","product_uid","product_title","search_term","relevance"',
 '2,100001,"Simpson Strong-Tie 12-Gauge Angle","angle bracket",3']

I think the first row is causing this problem. Moreover, I want to create a DataFrame that stores the values from the 2nd row to the last (not the first row, because it is the header). How can I achieve this? I've searched for it but could not find any solution. Thanks in advance.

ahajib
Ishan

1 Answer


To read a CSV file into a Spark DataFrame you should use spark-csv: https://github.com/databricks/spark-csv

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')

How to use spark-csv if you are running pyspark directly from the terminal: instead of calling

$SPARK_HOME/bin/pyspark

you have to use

$SPARK_HOME/bin/pyspark --packages com.databricks:spark-csv_2.11:1.4.0

and then use the code above.
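If you'd rather not pull in spark-csv at all, you can also drop the header line yourself and parse each row before calling `createDataFrame` (in PySpark that would be something like filtering out the first line of the RDD and mapping a split function over the rest). The snippet below is a minimal plain-Python sketch of that parsing logic, using the two sample rows from the question; it is an illustration of what the `header='true'` option does for you, not Spark code itself.

```python
import csv

# The two sample lines from the question: a header row plus one data row.
lines = [
    '"id","product_uid","product_title","search_term","relevance"',
    '2,100001,"Simpson Strong-Tie 12-Gauge Angle","angle bracket",3',
]

# csv.reader handles the quoted fields that a naive split(',') would break
# (e.g. a product title containing a comma).
rows = list(csv.reader(lines))
header, data = rows[0], rows[1:]  # drop the first row, keep the rest

# Turn each remaining row into a dict keyed by the header names; a list of
# dicts (or Rows) like this is something createDataFrame can infer a schema from.
records = [dict(zip(header, row)) for row in data]
print(records[0]["product_title"])  # Simpson Strong-Tie 12-Gauge Angle
```

In an actual Spark job, the equivalent would be to capture `header = rdd.first()`, then `rdd.filter(lambda line: line != header)` and map the same parse over what remains.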

If you are using ipython + findspark, you'll have to modify your PYSPARK_SUBMIT_ARGS (before starting ipython):

export PYSPARK_SUBMIT_ARGS="--master local[4] --packages com.databricks:spark-csv_2.11:1.4.0 pyspark-shell"
phi