I am working in Pandas to create a series of data frames, each of which is an aggregated version of the previous one. (I have a statistic that has to be calculated recursively.) Let's assume that I have a series of variables that I'll be aggregating, saved in a dictionary called aggVars. The original data frame, df, has nested observations:
year area ind occ aggVars
2000 0001 001 001 ...
2000 0001 001 002 ...
2000 0001 002 001 ...
2000 0001 002 002 ...
2000 0002 001 001 ...
2000 0002 001 002 ...
2000 0002 002 001 ...
2000 0002 002 002 ...
2001 0001 001 001 ...
2001 0001 001 002 ...
.
.
.
I think you get the idea. Observations are nested in occupations, which are within industries, which are within areas, which are within years.
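To make the layout concrete, here is a minimal sketch of a toy data frame with the same nesting. The columns `emp` and `wage` and the contents of `aggVars` are made-up stand-ins for illustration only; the real variables and aggregation functions are not shown in this question.

```python
import itertools

import pandas as pd

# Hypothetical toy data: every combination of 2 years, 2 areas,
# 2 industries and 2 occupations (16 rows in total).
keys = list(itertools.product([2000, 2001],
                              ['0001', '0002'],
                              ['001', '002'],
                              ['001', '002']))
df = pd.DataFrame(keys, columns=['year', 'area', 'ind', 'occ'])

# 'emp' and 'wage' are invented columns standing in for the real data.
df['emp'] = range(1, len(df) + 1)
df['wage'] = 10.0

# Assumed shape of aggVars: column name -> aggregation function.
aggVars = {'emp': 'sum', 'wage': 'sum'}
```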
I can do this the dumb way:
# DUMB CODE BLOCK
df_job = df.groupby(['year', 'area', 'ind', 'occ'], as_index=False)
df_job = df_job.agg(aggVars)
df_work = df_job.groupby(['year', 'area', 'ind'], as_index=False)
df_work = df_work.agg(aggVars)
df_county = df_work.groupby(['year', 'area'], as_index=False)
df_county = df_county.agg(aggVars)
df_year = df_county.groupby(['year'], as_index=False)
df_year = df_year.agg(aggVars)
(I have intentionally used different names for the data frames, i.e., df_county instead of df_area, to reflect that the .groupby() variables in the real data do not map so neatly onto the hierarchical levels.)
I have tested this, and it works fine. But this is CLEARLY a stupid way to do this. This should be a loop of some sort. And here is where my troubles begin. I could specify a list of lists:
aggHierarchy = [['job', ['year', 'area', 'ind', 'occ']],
                ['work', ['year', 'area', 'ind']],
                ['county', ['year', 'area']],
                ['year', ['year']]]
And then loop over the list, something like this:
# BROKEN CODE BLOCK
old_df = 'df'
for level in aggHierarchy:
    new_df = 'df_%s' % level[0]
    new_df = old_df.groupby(level[1], as_index=False)
    new_df = new_df.agg(aggVars)
    old_df = new_df
The logic here would be to assign the new data frame's name based on the first sub-element of the aggHierarchy element, then group things using the second sub-element. But of course, this doesn't work. The loop I've written basically tries to assign a NAME for the new data frame using new_df = 'df_%s' % level[0], but all I've actually done there is create a string.
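A minimal sketch of why the loop fails, assuming it is run as written: a name stored in a string is only text, so calling `.groupby()` on it raises an `AttributeError`. The variable `message` is introduced here purely for illustration.

```python
old_df = 'df'  # a plain string, not the data frame named df

try:
    # str has no groupby method, so this raises AttributeError
    old_df.groupby(['year'], as_index=False)
except AttributeError as err:
    message = str(err)

print(message)
```

The same applies to `new_df = 'df_%s' % level[0]`: it binds `new_df` to a string, and the next line simply rebinds `new_df` to something else, so no variable called `df_job` is ever created.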
Furthermore, Stack Overflow is full of people pointing out that using lists to assign variable names in a loop is Considered Harmful. I get that--I can tell how janky this is. "Use a dictionary," I see people writing. But here's the thing: The aggregation of those data frames has to happen in a certain order, which (I believe) I can't specify with a dictionary. I am failing to grasp how I go from adding variable names to a dictionary, to calling them in some specified order in a loop.
Thus my question, which hopefully I've given enough background information to specify well: given a block of code like DUMB CODE BLOCK above, where I need to update variable names based on a list (or dictionary!) whose exact contents I might not know in advance...how can I create some sort of loop there?
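For reference, here is one possible shape such a loop could take, offered as a sketch under assumptions rather than a definitive answer. The order of aggregation is driven by the list `aggHierarchy`, and the results are stored in a dictionary keyed by name instead of in separately named variables. Dictionaries in Python 3.7 and later also preserve insertion order, so iterating over the results gives them back in the order they were built. The function name `aggregate_hierarchy`, the `results` dictionary, and the toy columns `emp` and `wage` are all hypothetical.

```python
import itertools

import pandas as pd


def aggregate_hierarchy(df, aggHierarchy, aggVars):
    """Aggregate df level by level, feeding each result into the next step."""
    results = {}
    current = df
    for name, group_cols in aggHierarchy:
        # Each level is computed from the previous level's output.
        current = current.groupby(group_cols, as_index=False).agg(aggVars)
        results['df_%s' % name] = current
    return results


# Hypothetical toy data with the same nesting as in the question.
keys = list(itertools.product([2000, 2001], ['0001', '0002'],
                              ['001', '002'], ['001', '002']))
df = pd.DataFrame(keys, columns=['year', 'area', 'ind', 'occ'])
df['emp'] = range(1, len(df) + 1)
df['wage'] = 10.0
aggVars = {'emp': 'sum', 'wage': 'sum'}  # assumed form of aggVars

aggHierarchy = [['job', ['year', 'area', 'ind', 'occ']],
                ['work', ['year', 'area', 'ind']],
                ['county', ['year', 'area']],
                ['year', ['year']]]

results = aggregate_hierarchy(df, aggHierarchy, aggVars)
df_county = results['df_county']  # look a level up by name when needed
```

Because the loop walks the list, the contents of `aggHierarchy` do not need to be known in advance; each data frame is retrieved afterwards with a key such as `results['df_year']`.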