Data is Future: Variance, Standard Deviation, Coefficient of Variation and Outliers

Saturday, April 28, 2018

Variance and standard deviation are two of the most common measures of dispersion of data. Both can be used to describe how much the data vary from their average value, or mean.
Variance is the average of the squared deviations from the mean, while the standard deviation is the square root of the variance.

Calculation of Variance and Standard Deviation:-

1.) Take the mean of all the data that we have. Suppose that we have a small dataset with the values 1, 2, 3, 4, 5.
Then the mean of this dataset is (1+2+3+4+5)/5 = 3.
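2.) Subtract the mean from each value in the dataset and then square the result. But why do we square it?
The example shows why. If we simply add up the deviations from the mean without squaring them:

$\sum_{i=1}^{n} (x_i - \bar{x}) = (-2) + (-1) + 0 + 1 + 2 = 0$

the positive and negative deviations cancel out, and what we get is 0.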
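
The cancellation is easy to check numerically. The following is a minimal sketch on the dataset above; the variable names are only illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
mean = data.mean()

# Deviation of every value from the mean: [-2, -1, 0, 1, 2]
deviations = data - mean

# Positive and negative deviations cancel each other out
sum_of_deviations = deviations.sum()

# Squaring removes the signs, so the total no longer collapses to zero
sum_of_squares = (deviations ** 2).sum()

print(sum_of_deviations)
print(sum_of_squares)
```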

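To overcome this problem, we work with the squared deviations and divide their sum by n.
Hence, the formula for the variance is:-

$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

where $\mu$ is the mean, and the calculation for our dataset will be:-

$\sigma^2 = \frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5} = \frac{10}{5} = 2$

However, the above calculation is for a population, not for a sample.
The formula for a sample divides by n - 1 instead of n:-

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

And the calculation for our dataset will be:-

$s^2 = \frac{10}{5 - 1} = 2.5$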

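3.) Take the square root of the variance.

So, the standard deviation of our dataset, treated as a population, will be:-

$\sigma = \sqrt{2} \approx 1.414$

The standard deviation for the sample dataset will be:-

$s = \sqrt{2.5} \approx 1.581$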
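
The three steps can also be followed directly in code. This is a rough sketch in plain Python that mirrors the hand calculation rather than a library implementation; all variable names are assumptions made for illustration.

```python
import math

data = [1, 2, 3, 4, 5]
n = len(data)

# Step 1: the mean
mean = sum(data) / n

# Step 2: squared deviations from the mean
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance divides by n, sample variance by n - 1
population_variance = sum(squared_deviations) / n
sample_variance = sum(squared_deviations) / (n - 1)

# Step 3: the standard deviation is the square root of the variance
population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

print(population_variance, sample_variance)
print(population_std, sample_std)
```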

From the above calculation of the variance and standard deviation, we can conclude the points below.

1.) The variance can be zero or greater than zero. It will be zero only when all the values in the dataset are the same.
2.) The sample variance (dividing by n - 1) will always be greater than the population variance (dividing by n) computed from the same data, unless both are zero.
3.) The variance is expressed in squared units, while all other measurements are in the original units. To overcome this, we take the square root of the variance, which is called the standard deviation.
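
These points can be spot-checked with Python's standard `statistics` module. The sketch below is only an illustration; the constant dataset `[7, 7, 7, 7]` is a made-up example, not taken from the post.

```python
import statistics

constant = [7, 7, 7, 7]   # hypothetical dataset where every value is the same
data = [1, 2, 3, 4, 5]

# Point 1: identical values give a variance of zero
zero_variance = statistics.pvariance(constant)

# Point 2: dividing by n - 1 gives a larger result than dividing by n
population_variance = statistics.pvariance(data)
sample_variance = statistics.variance(data)

# Point 3: the standard deviation is the square root of the variance,
# which brings the result back to the original units
sample_std = statistics.stdev(data)

print(zero_variance, population_variance, sample_variance, sample_std)
```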

What does the standard deviation signify?

1.) The higher the value of the standard deviation, the more variability there is in the data.
2.) A low standard deviation means that the data points are clustered closely around the mean.
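
To make this concrete, the sketch below compares two small datasets that share the same mean but differ in spread. Both datasets are hypothetical and chosen only for illustration.

```python
import numpy as np

# Two hypothetical datasets with the same mean (50) but different spread
tight = np.array([49, 50, 50, 51])
wide = np.array([10, 40, 60, 90])

tight_std = np.std(tight)
wide_std = np.std(wide)

print(tight.mean(), wide.mean())
print(tight_std, wide_std)
```

The means are identical, so only the standard deviation reveals that the second dataset is far more spread out.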

Coefficient of Variation :-

The variance and standard deviation calculated on the same dataset but in different units will have different values. To overcome this problem, the concept of the coefficient of variation came into the picture. It divides the standard deviation by the mean, so the units cancel out.
Example:- weights measured in ounces and in pounds have different variances and standard deviations for the same sample.

$CV = \frac{s}{\bar{x}}$

where,
s = standard deviation
x-bar = mean of the sample.
For our dataset, the coefficient of variation is $CV = 1.581 / 3 \approx 0.527$.
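
The ounces and pounds example can be sketched as follows. The helper `coefficient_of_variation` is a hypothetical function written for this illustration (it is not a NumPy function), and the dataset is assumed to be weights in pounds.

```python
import numpy as np

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / values.mean()

weights_lb = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # assumed to be in pounds
weights_oz = weights_lb * 16                        # 1 pound = 16 ounces

# The standard deviation changes with the unit...
std_lb = weights_lb.std(ddof=1)
std_oz = weights_oz.std(ddof=1)

# ...but the coefficient of variation does not
cv_lb = coefficient_of_variation(weights_lb)
cv_oz = coefficient_of_variation(weights_oz)

print(std_lb, std_oz)
print(cv_lb, cv_oz)
```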

Python implementation of the above dataset with the numpy library:-

import numpy as np
a = [1, 2, 3, 4, 5]
# np.var and np.std use the population formulas (divide by n) by default
print('The variance will be', np.var(a))
print('The standard deviation will be', np.std(a))
# output :-
# The variance will be 2.0
# The standard deviation will be 1.4142135623730951
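
The values above match the population formulas. For the sample versions, `np.var` and `np.std` accept a `ddof` ("delta degrees of freedom") argument, and the divisor becomes n - ddof. A minimal sketch on the same dataset:

```python
import numpy as np

a = [1, 2, 3, 4, 5]

# Default ddof=0: divide by n (population formula)
population_variance = np.var(a)

# ddof=1: divide by n - 1 (sample formula)
sample_variance = np.var(a, ddof=1)
sample_std = np.std(a, ddof=1)

print(population_variance, sample_variance, sample_std)
```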

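Note:- NumPy computes the standard deviation in the same precision as the input, so the result for single-precision (float32) data may be inaccurate. Passing a higher-precision dtype such as np.float64 to np.std can alleviate this.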
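
The effect can be seen with a large float32 array, along the lines of the example in the NumPy documentation for `numpy.std`. The array contents below are arbitrary and the exact size of the discrepancy may vary between NumPy versions, so no output is shown.

```python
import numpy as np

# Half the values are 1.0 and half are 0.1, stored in single precision
values = np.zeros((2, 512 * 512), dtype=np.float32)
values[0, :] = 1.0
values[1, :] = 0.1

# Computed in the precision of the input (float32)
std_single = np.std(values)

# Same data, but accumulated in double precision
std_double = np.std(values, dtype=np.float64)

print(std_single, std_double)
```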

Outliers:- A data point that differs markedly from the rest of the sample is called an outlier. It is normally considered to be a data point that comes from a different population or sample.

Why outlier detection is important:-

1.) Outliers distort the calculation of common statistics such as the mean.
2.) The data point may be erroneous.
3.) The data point may come from a different sample or population.

How to handle outliers?

1.) Trimmed mean.
2.) Interquartile range.
3.) Deletion of the outliers (at the statistician's discretion).
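
The three options above can be sketched as follows. The dataset is hypothetical, `trimmed_mean` is a helper written for this illustration, and the 1.5 * IQR fences are a common convention assumed here rather than something stated in the post.

```python
import numpy as np

# Hypothetical data: the earlier dataset plus one suspicious value
data = np.array([1, 2, 3, 4, 5, 100])

def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of points from each end."""
    ordered = np.sort(values)
    k = int(proportion * len(ordered))   # number of points cut per side
    return ordered[k:len(ordered) - k].mean()

# 1.) Trimmed mean: cut 20% of the points from each end before averaging
plain_mean = data.mean()
robust_mean = trimmed_mean(data, 0.2)

# 2.) Interquartile range: flag points outside the assumed 1.5 * IQR fences
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

# 3.) Deletion: keep only the points inside the fences
cleaned = data[(data >= lower_fence) & (data <= upper_fence)]

print(plain_mean, robust_mean)
print(outliers, cleaned)
```

The single extreme value pulls the plain mean far away from the bulk of the data, while the trimmed mean and the cleaned dataset stay close to the original values.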

Further Reading:- https://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html
