
I want to estimate the average income for a location. I have nested data in the following way: A block is inside a neighborhood, which is inside a zipcode, which is inside a district, which is inside a region, which is inside a state.

I want to estimate the average income at the block level, but I don't have much data at that level. I have much more data at the state level, but the state average is not a good approximation of block-level income.

How would you deal with this problem? Are there any ways to incorporate the uncertainty of not having many data points at a block level? Are there any Bayesian frameworks that allow us to incorporate data of all levels? Is it possible that mixed models are able to do so?

If you explain a method, it would be great if you could also point to a Python package that implements it.

Thanks!

Brian Spiering
David Masip

2 Answers


I don't know whether this applies to your data, but if some kind of continuity assumption is realistic, you could move away from a categorical variable (block) to continuous variables (longitude and latitude). Then, if you have information on two neighboring blocks, you could interpolate between their values with, say, a spline.

The same idea can also feed a machine learning model, with predictors such as the average income of blocks within a distance x. If you have no data for nearby blocks, the state average might be the next best approximation.
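As a rough illustration of that predictor, here is a minimal sketch that averages the income of observed blocks within a given radius and falls back to the state average when none are close enough. The function names, the coordinates, the incomes, and the 1 km radius are all made up for the example; in practice the resulting value would be one feature among others in a model.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearby_income(lat, lon, blocks, radius_km, state_mean):
    """Mean income of observed blocks closer than radius_km, else state_mean."""
    near = [income for (blat, blon, income) in blocks
            if haversine_km(lat, lon, blat, blon) < radius_km]
    return sum(near) / len(near) if near else state_mean

# Hypothetical observations: (latitude, longitude, average block income)
blocks = [
    (40.000, -75.000, 50000.0),
    (40.005, -75.000, 60000.0),
    (41.000, -75.000, 90000.0),
]
state_mean = 70000.0  # hypothetical state-level average

# Two observed blocks lie within 1 km, so their mean is used.
estimate_near = nearby_income(40.002, -75.000, blocks, 1.0, state_mean)
# No observed block is nearby, so the state average is the fallback.
estimate_far = nearby_income(45.0, -80.0, blocks, 1.0, state_mean)
```

A hard radius cutoff is the simplest choice; weighting neighbors by inverse distance would be a natural refinement if the continuity assumption holds.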

Your state-level data can serve both as a predictor and as a validation check.

Also, plotting your data always helps get some kind of intuition.

NiklasvMoers

One option is to move to a more rigorous geographic information system (GIS) data structure.

For example, both plus codes and H3 are designed for nested location data. If your data is reformatted to either system, you can easily choose the level of precision at which to aggregate location data.
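To show what choosing the precision looks like, here is a minimal sketch that uses made-up hierarchical cell ids in which dropping trailing characters yields the enclosing, coarser cell. This is only a stand-in for a real grid system: with H3 you would obtain the parent cell through the library rather than by truncating strings, and the ids, incomes, and `aggregate` helper below are hypothetical.

```python
from collections import defaultdict

def aggregate(records, precision):
    """Mean income per cell after truncating each id to `precision` characters."""
    groups = defaultdict(list)
    for cell_id, income in records:
        # A shorter prefix identifies a larger enclosing cell.
        groups[cell_id[:precision]].append(income)
    return {cell: sum(values) / len(values) for cell, values in groups.items()}

# Hypothetical records: (hierarchical cell id, income)
records = [
    ("ABCD", 40000.0),
    ("ABCE", 60000.0),
    ("ABFG", 80000.0),
    ("XYZW", 100000.0),
]

fine = aggregate(records, 4)    # one cell per record, few observations each
coarse = aggregate(records, 2)  # "AB" pools three records, "XY" keeps one
```

Moving to a coarser precision trades spatial detail for more observations per cell, which is the same trade-off as falling back from block to neighborhood or zipcode.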

Brian Spiering