groupby with agg function that refers to other groups (in pandas) [on hold]
Consider this dataset, that has the product purchases ('upc') of
households in different quarters
df_test = pd.DataFrame( {
'household_code': [1,1,1,1,2,2,2,2,3,3,3,3],
'quarter': [120,120,121,121,122,122,120,120,120,121,120,120,121,121],
'upc': ['a','b','b','c','b', 'c','a','b','d','c','a','b','b','a']
} );
df_test
household_code quarter upc
0 1 120 a
1 1 120 b
2 1 121 b
3 1 121 c
4 2 120 a
5 2 120 b
6 2 120 d
7 2 121 c
8 3 120 a
9 3 120 b
10 3 121 b
11 3 121 a
I want to get the % of intersection of products from one quarter to
another by household. For example, household 1 buys (a, b) in quarter 120
and (b, c) in quarter 121. The intersection is just b so the intersection
in quarter 121 for that household will be 0.5. (The result for quarter 120
is not defined, so I'll just put a zero or a NaN there) To be explicit,
the result should be:
household_code quarter perc_int
0 1 120 0.0
1 1 121 0.5
2 2 120 0.0
3 2 121 0.0
4 3 120 0.0
5 3 121 1.0
(Note that the real dataset has many more quarters).
The following piece of code does the trick, but is rather slow and very
ugly. I hope somebody will show me a more pythonic/pandotic way of doing
stuff.
# Function to select upcs from a given quarter
def set_upcs_in_q(df, quarter):
return set(df[df['quarter'] == quarter].upc )
def count_intersection(df = df_test):
'''
Fixing a household_code, goes through quarters and counts
the intersections
'''
quarter_list = np.array(list(set(df['quarter'] )))
this_period_set = set_upcs_in_q(df = df, quarter = quarter_list[0])
count_inters = np.zeros( (quarter_list.shape[0], ) )
for index in xrange(quarter_list.shape[0] - 1):
next_period_set = set_upcs_in_q(df = df, quarter =
quarter_list[index + 1])
intersection = this_period_set & next_period_set
count_inters[index + 1] = len(intersection)/len(this_period_set)
this_period_set = next_period_set
return count_inters
def iterate_over_households(df = df_test):
'''
Iterates through all household values and returns a new,
collapsed dataset
'''
# Create collapsed dataset
df_collapsed = df.groupby(['household_code', 'quarter']).first()
df_collapsed.reset_index(inplace = True)
del df_collapsed['upc']
df_collapsed['perc_int'] = 0.
#Get list of households
household_list = np.array(list(set(df['household_code'])))
for house in household_list:
count_list = count_intersection(df = df[df.household_code == house])
df_collapsed['perc_int'][df_collapsed.household_code == house] =
count_list
return df_collapsed
expected_result = iterate_over_households(df = df_test)
As always, thanks for your time!
EDIT: My intuition is that this can be cast by a suitable combination of
groupby and aggregate or transform with the function above. For example:
df_test.groupby('household_code').agg(count_intersection)
gives a starting point....
No comments:
Post a Comment