Friday, January 4, 2019

Spark - Saving Dataframes


In my last blog post, we went through dataframes and some of their operations. We should note that we often need to save dataframes for further operations.

In this blog post, we will understand how to save dataframes in both the generic and the manual form.


Generic Load/Save function

The default data format used when loading and saving a data source is parquet. We can save a dataframe in parquet format simply by calling write.save() on it. Let us check the code for the same.

# Save the selected columns in parquet format in the default location
read_file.select("name","age").write.save("dataframe_save.parquet", format="parquet")
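The same generic API works for loading: read.load() reads parquet by default. As a quick sketch (assuming the directory saved above and the sqlcontext defined in the snippet further below):

# Load the saved parquet data back with the generic load function
# (parquet is the default, so no format argument is needed)
loaded = sqlcontext.read.load("dataframe_save.parquet")
loaded.show()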


We can change the default settings and save the dataframe in other formats, such as CSV.
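As a minimal sketch of how that looks with the full writer API, we can specify the format manually and pass standard DataFrameWriter options such as a header row and a save mode (the output directory name here is just an illustration):

# Manually specify the CSV format, write a header row, and overwrite any previous output
read_file.select("name","age").write.format("csv").option("header","true").mode("overwrite").save("dataframe_save_with_header")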

Let us start with the code that we wrote previously for Spark dataframes 2. The code is available in my github repository.

Code Snippet :

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Set up the Spark context and SQL context
conf = SparkConf().setAppName("dataframe")
sc = SparkContext(conf=conf)
sqlcontext = SQLContext(sc)

# Read the CSV file into a dataframe, treating the first row as a header
read_file = sqlcontext.read.csv('/home/hduser/sangam/test.csv', header='true')
read_file.show()

print("The number of rows in the file are ",read_file.count())
read_file.head(2)


# Below command describes the number of columns in the dataframe and their names.
print("number of columns and names of the columns", len(read_file.columns), read_file.columns)


# Provides summary statistics of the numerical columns available in the dataframe
read_file.describe().show()


#Provides the statistics of a particular column
read_file.describe('salary').show()


# Select specific columns from the dataframe
read_file.select('salary','age').show()

# Save the selected columns as CSV in the default location
read_file.select("name","age").write.save("dataframe_save.csv", format="csv")


After submitting the code, we get the output in the default location, in a directory called dataframe_save.csv.
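To verify the save, we can read the directory straight back; Spark treats all the part files inside it as a single CSV source. A quick sketch:

# Read the saved directory back to confirm the contents
check = sqlcontext.read.csv("dataframe_save.csv")
check.show()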

Once we enter the directory, we will find a file such as "part-00000-e90b751f-b7b9-4093-93b0-b014ef2012a8.csv".
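Spark writes one part file per partition of the dataframe, which is why the output is a directory rather than a single file. If a single output file is preferred, a minimal sketch is to coalesce the dataframe to one partition before saving (the directory name single_file_save.csv below is just an illustration):

# Coalesce to one partition so that only a single part file is written
read_file.select("name","age").coalesce(1).write.save("single_file_save.csv", format="csv")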


All the related code is available in my github repository: https://github.com/sangam92/Spark_tutorials

