Unfortunately, Python has no built-in function for weighted percentiles. One workaround, discussed in a Stack Overflow post, is to expand the list of data so that each value is repeated as many times as its weight. For example, suppose we have the following data:
score weight
5 2
4 3
2 4
8 1
We first expand the list of scores to {5, 5, 4, 4, 4, 2, 2, 2, 2, 8}, and then read the percentiles we need, such as the 10th or 50th percentile, from the expanded list.
This method has two limitations: (1) the weights must be integers; (2) the weights cannot be very large, because the length of the expanded list equals the sum of the weights. What if we want to calculate the weighted percentiles of a large dataset with very large, non-integer weights? In this article, I show an alternative method using pandas.
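The expansion approach can be sketched in a few lines with NumPy. This is only an illustration on the toy table above; the helper name `expanded_percentile` is made up for this example, and it picks the smallest value whose rank covers a share q of the expanded list (other percentile conventions interpolate between neighbours and can give slightly different answers).

```python
import numpy as np

scores = np.array([5, 4, 2, 8])
weights = np.array([2, 3, 4, 1])   # must be non-negative integers for this method

# Repeat each score as many times as its weight says.
expanded = np.repeat(scores, weights)

def expanded_percentile(values, q):
    """Smallest value whose rank covers a share q of the list (hypothetical helper)."""
    ordered = np.sort(values)
    k = max(int(np.ceil(q * len(ordered))) - 1, 0)
    return ordered[k]

p10 = expanded_percentile(expanded, 0.10)
p50 = expanded_percentile(expanded, 0.50)
print(expanded, p10, p50)
```

The expanded array has as many elements as the sum of the weights, which is why the approach becomes impractical once the weights are large.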
step1: given a percentile q (0<=q<=1), calculate p = q * sum of weights;
step2: sort the data by the column whose weighted percentile we want;
step3: starting from the first row of the sorted data, add up the weights row by row until the running sum reaches or exceeds p; the value in that row is the weighted percentile.
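To make the three steps concrete, here is a small walk-through on the score/weight table from the beginning of the post, for q = 0.5. The column names `score` and `weight` are just labels chosen for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"score": [5, 4, 2, 8], "weight": [2, 3, 4, 1]})

q = 0.5
p = q * df["weight"].sum()              # step 1: 0.5 * 10 = 5.0

ordered = df.sort_values("score")       # step 2: scores in order 2, 4, 5, 8
running = ordered["weight"].cumsum()    # step 3: running sums 4, 7, 9, 10

# The first sorted row whose running sum reaches p holds the weighted percentile.
median = ordered.loc[running >= p, "score"].iloc[0]
print(p, running.tolist(), median)
```

The running sum first reaches 5 at the score 4, which agrees with the median of the expanded list {2, 2, 2, 2, 4, 4, 4, 5, 5, 8}.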
We can define a function for the weighted percentile in Python, where the input x is a two-column DataFrame with the values in the first column and the weights in the second, and q is the percentile expressed as a fraction between 0 and 1.
# define a function for weighted quantiles. input: x, q
# x: two-column DataFrame, values in the first column, weights in the second
# q: percentile as a fraction, 0 <= q <= 1
def wquantile(x, q):
    # sort by the value column and work on plain arrays
    xsort = x.sort_values(x.columns[0])
    values = xsort[xsort.columns[0]].to_numpy()
    weights = xsort[xsort.columns[1]].to_numpy()
    p = q * weights.sum()
    # add up the weights row by row until the running sum reaches p
    pop = weights[0]
    i = 0
    while pop < p and i < len(values) - 1:
        i = i + 1
        pop = pop + weights[i]
    # return the value itself rather than a one-row Series
    return values[i]
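For large datasets, the row-by-row loop can be replaced by a cumulative sum and a binary search. The following is a sketch of that idea, not part of the original function: the name `wquantiles` and its signature are assumptions made for this example, and it assumes non-negative weights. It returns several percentiles in one call and works with non-integer weights.

```python
import numpy as np
import pandas as pd

def wquantiles(values, weights, qs):
    """Weighted quantiles for several levels at once (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # sort values and carry the weights along
    order = np.argsort(values, kind="stable")
    values, weights = values[order], weights[order]
    running = np.cumsum(weights)
    targets = np.asarray(qs, dtype=float) * running[-1]
    # index of the first running sum that is >= each target
    idx = np.searchsorted(running, targets, side="left")
    idx = np.minimum(idx, len(values) - 1)   # guard against rounding at q = 1
    return pd.Series(values[idx], index=qs)

# Same scores as before, now with non-integer weights.
result = wquantiles([5, 4, 2, 8], [2.5, 3.1, 4.4, 1.0], [0.1, 0.5, 0.9])
print(result)
```

Because the running sums are non-decreasing, `np.searchsorted` finds the same row the loop would stop at, but in logarithmic time per percentile.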
Let us take an example to test this function. I want to calculate the percentiles of household earnings, income and wealth, using data from the 2007 Survey of Consumer Finances. The variables that I need are listed in the table below:
Household income and wealth are directly available in the data, but household earnings must be calculated by adding labor wages and a proportion of business income. Household earnings, income, and wealth are defined as follows:
I first import the data with pandas. There are 322 variables, 22085 observations and 4417 households. The number of observations differs from the number of households because the public SCF file stores five records (implicates) per household, and each record carries its own weight.
Then I define the weighted percentile function and use it to calculate the weighted median (50th percentile) of household earnings, income, wealth, and non-housing wealth.
Javier Díaz-Giménez, Andy Glover, and José-Víctor Ríos-Rull, in "Facts on the Distributions of Earnings, Income, and Wealth in the United States: 2007 Update", studied the distribution of household earnings, income, and wealth, and report more detailed statistics and interesting findings.
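As an illustration of how one weight column can be reused across several variables, here is a minimal sketch on made-up rows. The values are toy numbers, not SCF data, and the column names (`earning`, `income`, `wealth`, `wgt`) and the helper `weighted_median` are placeholders chosen for this example.

```python
import pandas as pd

def weighted_median(frame, value_col, weight_col):
    """Weighted median of one column (hypothetical helper)."""
    # sort by the value, accumulate weights, take the first row reaching half the total
    ordered = frame.sort_values(value_col)
    running = ordered[weight_col].cumsum()
    half = 0.5 * running.iloc[-1]
    return ordered.loc[running >= half, value_col].iloc[0]

# Toy rows only; the real survey columns have different names and values.
toy = pd.DataFrame({
    "earning": [10.0, 40.0, 0.0, 25.0],
    "income":  [15.0, 55.0, 5.0, 30.0],
    "wealth":  [100.0, 900.0, -20.0, 50.0],
    "wgt":     [1200.5, 800.25, 1500.0, 499.25],
})

medians = {col: weighted_median(toy, col, "wgt")
           for col in ["earning", "income", "wealth"]}
print(medians)
```

Each variable is sorted separately, so the same household can sit at a different position in the earnings ordering than in the wealth ordering.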