- Given a time-series of data collected from a single running process that takes 8 hours to complete:
| Minute | GB of Disk Space Used |
|---|---|
| 0 | 0 |
| 1 | 8 |
| 2 | 15 |
| 3 | 22 |
...Etc. It is sampled every minute for 8 hours. The number of GB used jumps up and down a lot and ends back at 0, like a big spiky mountain shape. I'm making a (large) assumption that running multiple instances of this process in parallel takes up proportionally more space, but that the instances don't slow each other down.
- I've also got a hard upper bound that the summed area graph (the total of all space usage at any given minute) can't go above.
My personal upper bound is 780GB (the amount of spare disk space). If I ran 53 processes in parallel starting at the same time, they would add up at minute 2 to (53 * 15) = 795GB (which is > 780), the disk would run out of space, and everything would crash. Or if I ran 36 in parallel, they would crash at minute 3 at (36 * 22) = 792GB. No crashing allowed!
- The goal is complete as many runs of the process as possible within the next 7 days, by "packing them in" and running them in some overlapping manner by deciding when each run of the process starts.
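To make the constraint concrete, here's a minimal sketch of the feasibility check. All names (`profile`, `total_usage`, `feasible`) are mine, and the profile values beyond minute 3 are made up for illustration:

```python
CAP_GB = 780  # spare disk space

def total_usage(profile, starts, horizon):
    """Summed GB per minute when runs begin at the given start minutes."""
    total = [0.0] * horizon
    for s in starts:
        for i, gb in enumerate(profile):
            if s + i < horizon:
                total[s + i] += gb
    return total

def feasible(profile, starts, horizon, cap=CAP_GB):
    """True if the summed usage never exceeds the cap at any minute."""
    return all(gb <= cap for gb in total_usage(profile, starts, horizon))

# Table values for minutes 0-3; the rest of the spiky mountain is hypothetical.
profile = [0, 8, 15, 22, 15, 8, 0]
print(feasible(profile, [0] * 53, 10))  # 53 * 15 = 795 > 780 at minute 2 -> False
```

A schedule is then just a list of start minutes that keeps `feasible` true over the whole 7-day horizon.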
First Attempt: Greedy
At first I thought it was a simple greedy problem: at every minute, ask "can I start a new process without the total of all running processes exceeding the limit at any point in the future?" If yes, great: start a new one. If not, wait a minute and try again. This leads to "start 3 processes immediately. Wait until they all finish. Then do it again for 7 days."
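The greedy idea above can be sketched like this (a sketch under my own naming assumptions, reusing a per-minute `profile` of one run's disk usage in GB):

```python
def greedy_schedule(profile, cap, horizon):
    """At each minute, keep starting runs while the cap holds at every future minute."""
    usage = [0.0] * (horizon + len(profile))  # summed GB per minute
    starts = []
    for t in range(horizon):
        # Would one more run starting at minute t ever exceed the cap?
        while all(usage[t + i] + gb <= cap for i, gb in enumerate(profile)):
            for i, gb in enumerate(profile):
                usage[t + i] += gb
            starts.append(t)
    return starts
```

With the toy profile `[0, 8, 15, 22, 15, 8, 0]` and a 780GB cap, this front-loads as many simultaneous starts at minute 0 as the peak allows, which is exactly the "start a batch, wait, repeat" behavior described.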
Second Attempt: recursive?
Done! Wait... that feels wrong, knapsack-packing style. What about "if you line up runs A and B just right, they make a valley you could fit C into"?
Which makes it feel like a recursive search problem. Is there an accepted way to optimize this sort of thing?
Third Attempt: Minimize the Maximum (like Tetris?)
My best guess at this point is a simple optimization strategy:
- Define a time bound (7 days)
- Place the next process start time at whatever point minimizes the "max space used on the drive, over all time".
- If there are ties, pick the soonest start time.
- Repeat the placement step until nothing more fits.
This would work out to:
- First process starts immediately.
- Second process starts before the first one finishes: wherever the downslope of the first one exceeds the upslope of the second one.
- Keep adding process start times until there isn't a place to slot in one more.
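The minimize-the-max loop described above might look like this (again a sketch with my own names; it brute-forces every candidate start minute per placement, so it's slow for a full 7-day horizon but shows the idea):

```python
def minimax_schedule(profile, cap, horizon):
    """Repeatedly place a run where it least raises the all-time peak usage."""
    usage = [0.0] * (horizon + len(profile))  # summed GB per minute
    starts = []
    while True:
        best_t, best_peak = None, None
        for t in range(horizon):
            candidate = usage[:]
            for i, gb in enumerate(profile):
                candidate[t + i] += gb
            peak = max(candidate)
            # Strict '<' plus increasing t means ties resolve to the soonest start.
            if peak <= cap and (best_peak is None or peak < best_peak):
                best_t, best_peak = t, peak
        if best_t is None:
            break  # no start time fits under the cap anymore
        for i, gb in enumerate(profile):
            usage[best_t + i] += gb
        starts.append(best_t)
    return starts
```

One design note: because each placement is committed permanently, this is still a greedy heuristic, just with a smarter scoring function than "does it fit", so it can interleave runs into each other's valleys without guaranteeing a globally optimal packing.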