Friday, January 4, 2019

Spark - Saving Dataframes


In my last blog post, we went through dataframes and some of their operations. We should note that we often need to save dataframes for further operations.

In this blog post, we will understand how to save dataframes in both the generic and the manual form.


Generic Load/Save function

The default data format used when loading and saving a data source is parquet. We can save a dataframe in parquet format simply by calling write.save() on it. Let us check the code for the same.

# Save the selected columns in parquet format in the default location
read_file.select("name","age").write.save("dataframe_save.parquet", format="parquet")
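The same generic API works for loading: read.load() reads parquet by default. As a quick sketch (assuming the directory saved above and the sqlcontext defined in the snippet further below):

# Load the saved parquet data back with the generic load function
# (parquet is the default, so no format argument is needed)
loaded = sqlcontext.read.load("dataframe_save.parquet")
loaded.show()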


We can change the default settings and save the dataframe in other formats, such as CSV.
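As a minimal sketch of how that looks with the full writer API, we can specify the format manually and pass standard DataFrameWriter options such as a header row and a save mode (the output directory name here is just an illustration):

# Manually specify the CSV format, write a header row, and overwrite any previous output
read_file.select("name","age").write.format("csv").option("header","true").mode("overwrite").save("dataframe_save_with_header")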

Let us start with the code that we wrote previously for Spark dataframes 2. The code is available in my github repository.

Code Snippet :

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Set up the Spark context and SQL context
conf = SparkConf().setAppName("dataframe")
sc = SparkContext(conf=conf)
sqlcontext = SQLContext(sc)

# Read the CSV file into a dataframe, treating the first row as a header
read_file = sqlcontext.read.csv('/home/hduser/sangam/test.csv', header='true')
read_file.show()

print("The number of rows in the file are ",read_file.count())
read_file.head(2)


# Below command describes the number of columns in the dataframe and their names.
print("number of columns and names of the columns", len(read_file.columns), read_file.columns)


# Provides summary statistics of the numerical columns available in the dataframe
read_file.describe().show()


#Provides the statistics of a particular column
read_file.describe('salary').show()


# Select specific columns from the dataframe
read_file.select('salary','age').show()

# Save the selected columns as CSV in the default location
read_file.select("name","age").write.save("dataframe_save.csv", format="csv")


After submitting the code, we get the output in the default location, in a directory called dataframe_save.csv.
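To verify the save, we can read the directory straight back; Spark treats all the part files inside it as a single CSV source. A quick sketch:

# Read the saved directory back to confirm the contents
check = sqlcontext.read.csv("dataframe_save.csv")
check.show()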

Once we enter the directory, we will find a file such as "part-00000-e90b751f-b7b9-4093-93b0-b014ef2012a8.csv".
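Spark writes one part file per partition of the dataframe, which is why the output is a directory rather than a single file. If a single output file is preferred, a minimal sketch is to coalesce the dataframe to one partition before saving (the directory name single_file_save.csv below is just an illustration):

# Coalesce to one partition so that only a single part file is written
read_file.select("name","age").coalesce(1).write.save("single_file_save.csv", format="csv")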


All the related code is available in my github repository: https://github.com/sangam92/Spark_tutorials

