Sunday, December 31, 2017

A Tutorial on Measures of Central Tendency (Mean, Median, Mode)

As beginners in the field of data science, we should strengthen our foundation, and that requires some basic knowledge of statistics.

What is a Measure of Central Tendency?
The main idea behind computing a central tendency is to find a single value that is typical of a given set of values.

The three common measures of central tendency are the arithmetic mean, the median and the mode.

Mean:-

The arithmetic mean, or simply the mean, is more commonly known as the
average of a set of values. It is calculated by adding up all the values and dividing the sum by the number of values.

The mean of a population is denoted by the Greek letter mu (μ), while the mean of a sample
is typically denoted by a bar over the variable symbol x and pronounced "x-bar".

Now, let us look at an example:

Suppose we have the batting scores of Virat Kohli in his last 5 matches.

23, 134, 78, 03, 176

We need to calculate the mean of his scores. To do this, we add up the scores from all 5 matches and divide the sum by 5.


x-bar = (23 + 134 + 78 + 03 + 176)/5 = 414/5 = 82.8
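The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the variable names `scores`, `total` and `x_bar` are chosen here for illustration, and `statistics.mean` from the standard library is used as a cross-check.

```python
from statistics import mean

# Batting scores from the example above (03 is written as 3).
scores = [23, 134, 78, 3, 176]

total = sum(scores)          # add up all the values
x_bar = total / len(scores)  # divide by the number of values

print(total)         # 414
print(x_bar)         # 82.8
print(mean(scores))  # 82.8, the standard library gives the same result
```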

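The mean is the simplest measure of central tendency to compute. However, it is not the best measure for every data set. Because every value contributes to the sum, the mean is sensitive to outliers: a few extremely large or small values can pull it far away from the typical value.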

One way to lessen the influence of outliers is to calculate a trimmed mean. As the name implies, a trimmed mean is calculated by trimming, or discarding, a certain percentage of the extreme values in a distribution and then calculating the mean of the remaining values.
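A trimmed mean can be sketched in Python as follows. The function name `trimmed_mean`, the choice to cut the same proportion from each end, and the sample data (ten made-up values with one outlier, 500) are all assumptions made for illustration; other trimming conventions exist.

```python
def trimmed_mean(values, proportion):
    """Mean after discarding `proportion` of the values from EACH end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut per end
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion too large: nothing left to average")
    return sum(kept) / len(kept)


# Hypothetical data: nine ordinary values and one outlier.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 500]

plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 0.10)  # drop the lowest and highest 10%

print(plain)    # 60.4
print(trimmed)  # 11.75
```

With the single outlier removed (together with the smallest value), the result moves from 60.4 to 11.75, which is much closer to the bulk of the data.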


Median:- 

The median of a data set is the middle value when the values are ranked in ascending or descending order. If there are n values and n is odd, the median is formally defined as the ((n+1)/2)th value. If n = 9, the middle value is the ((9+1)/2)th, or fifth, value. If there is an even number of values, the median is the average of the two middle values. This is formally defined as the average of the (n/2)th and ((n/2)+1)th values.

The median is a better measure of central tendency than the mean for data that is asymmetrical or contains outliers. This is because the median is based on the ranks of the data points rather than their actual values: 50 percent of the data values in a distribution lie below the median and 50 percent lie above it, without regard to the actual values in question. Therefore it does not matter if the data set contains some extremely large or small values, because they will not affect the median more than less extreme values.
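The rank-based definition can be written out directly in Python. This is a minimal sketch: the function name `median_by_rank` is made up for illustration, the sixth score (50) and the outlier (1760) are hypothetical values, and `statistics.median` from the standard library is used only as a cross-check.

```python
from statistics import median


def median_by_rank(values):
    """Median using the (n+1)/2 rule for odd n and the two-middle rule for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n+1)/2)th value; subtract 1 because lists are 0-indexed.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n/2)th and ((n/2)+1)th values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_scores = [23, 134, 78, 3, 176]         # n = 5, third value after sorting
even_scores = [23, 134, 78, 3, 176, 50]    # n = 6, hypothetical sixth score
with_outlier = [23, 134, 78, 3, 1760]      # 176 replaced by an extreme value

print(median_by_rank(odd_scores))    # 78
print(median_by_rank(even_scores))   # 64.0
print(median_by_rank(with_outlier))  # 78, unchanged by the outlier
print(median(odd_scores))            # 78
```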

Mode:-

The mode is the most frequently occurring value in a given data set. It is most useful for describing categorical data.

Example :- 2,2,3,3,4,4,4,4,4,4,5,5,5

So, the mode of the above data set is 4, because 4 occurs more often than any other value.
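The counting behind this example can be reproduced with `collections.Counter` from the Python standard library. This is a minimal sketch; the variable names and the list of colours (included to show the categorical case) are hypothetical.

```python
from collections import Counter

# Data set from the example above.
data = [2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]

counts = Counter(data)
value, frequency = counts.most_common(1)[0]  # most frequent value and its count

print(value, frequency)  # 4 6

# The same idea works for categorical data (hypothetical example).
colours = ["red", "blue", "red", "green", "red", "blue"]
top_colour = Counter(colours).most_common(1)[0][0]
print(top_colour)  # red
```

Note that a data set can have more than one mode when several values tie for the highest count; this sketch reports only one of them.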

Please share your suggestions so that I can improve my tutorials.

Thanks :)

2 comments:

  1. When do we use different measure of central tendency?

    Can you please explain scenarios as where to use mean or median or mode?

    1. Thanks Harsh Vardhan, I will explain these two topics in my upcoming articles.
