
What process or logical steps do you take to predict a function from any dataset?

I don't want to predict the function using a specific dataset; I want to understand how you predict a function when approaching a new dataset.

For example, if you see [1,3,5,7,9], how would you determine which function to use to capture those data points? Then, how would you predict the function for a completely different dataset, such as [16, 202, 984, 1024, 1111]? What is common when predicting between those two different datasets?

Jam
  • What does it mean to "predict a function from a data set"? – Alex Kruckman Sep 30 '22 at 20:56
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Sep 30 '22 at 20:58
  • I'd like to help, but this sounds like a question for a different stack exchange website. Maybe Stack Overflow? – Accelerator Sep 30 '22 at 20:58
  • This is very vague. Usually there is some context behind data. That context ought to motivate a choice of function (or class of functions). If you really have no context, there are of course things to try, but it's hard to speak in total generality with no information. – lulu Sep 30 '22 at 21:00
  • There are infinitely many functions (uncountably infinitely many, even if just dealing with positive integers) that will match any finite data set, so some more context is needed. For example, do you want polynomial functions (of least degree), precalculus type functions with some additional "data fitting" properties, etc.? – Dave L. Renfro Sep 30 '22 at 21:10
  • This is a good question. You could use the concepts of "Interpolation equations", Least Squares among others. Also see: https://math.stackexchange.com/questions/11502/find-formula-from-values – NoChance Sep 30 '22 at 21:37
  • @Accelerator Although the question is very vague, it looks like it is about basic statistics, so I think this is the right forum. – jjagmath Sep 30 '22 at 22:27

2 Answers


To some extent, this is impossible in general. There are infinitely many valid formulas, and for any given finite dataset infinitely many of them match it. This is especially true since you haven't defined which operations you consider to be primitives.
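As a minimal sketch of this point, take the data [1, 3, 5, 7, 9] from the question and assume (this indexing is an assumption, not part of the question) that the values sit at n = 0, 1, 2, 3, 4. Adding any multiple of a polynomial that vanishes at all five indices gives a different function that still matches every data point. The names `simple`, `vanishing`, and `rival` are hypothetical helpers chosen for illustration.

```python
# Assumption: the data points are indexed n = 0, 1, 2, 3, 4.
DATA = [1, 3, 5, 7, 9]


def simple(n):
    """The obvious candidate: a(n) = 2n + 1."""
    return 2 * n + 1


def vanishing(n, count=len(DATA)):
    """Product n(n-1)...(n-count+1), which is zero at every index 0..count-1."""
    result = 1
    for k in range(count):
        result *= n - k
    return result


def rival(n, c):
    """A different function for each c that still matches every data point."""
    return simple(n) + c * vanishing(n)


for c in (0, 1, -3):
    fitted = [rival(n, c) for n in range(len(DATA))]
    # Every choice of c reproduces the data, yet each predicts a different next value.
    print(c, fitted == DATA, "next value:", rival(5, c))
```

Each value of `c` agrees with the data on the given indices and disagrees everywhere else, so the data alone cannot pick one of them.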

One could impose an ordering on which functions are preferred using the arithmetic hierarchy, with the shortest function that defines your dataset being the preferred one. However, finding that shortest function is effectively impossible algorithmically for all but the simplest cases, because doing so would require solving the halting problem.

johnnyb

This answer is about "determining a function" rather than "predicting a function".

For 1-dimensional data there are several methods, as shown below. In practice, the general form of the equation is often known but its parameters are not known exactly. For example, you may know that the data fits a linear curve of the form $y=mx$ but not know the value of $m$. In that case $m$ is found by solving a simple equation.

For sequences, there is the OEIS (On-Line Encyclopedia of Integer Sequences). This site is very valuable. There are other techniques such as:

0- Clever observation for the relationship between input and output.

1- Least Squares Method

2- Lagrange's Interpolation and other methods

3- Some other methods - 2

4- Curve Fitting

5- An interesting calculator that groups the following methods all in one place (linear regression, quadratic regression, cubic regression, power regression, ab-exponential regression, logarithmic regression, hyperbolic regression, exponential regression) is Function approximation with regression analysis.

6- See: Stack Exchange Similar Question.
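To make one of the listed methods concrete, here is a minimal sketch of Lagrange interpolation in exact rational arithmetic. It assumes the values are indexed n = 0, 1, 2, ... (an assumption for illustration), and `lagrange` is a hypothetical helper name, not a library function.

```python
from fractions import Fraction


def lagrange(points):
    """Return p(x) for the lowest-degree polynomial through the given points.

    `points` is a list of (x, y) pairs with distinct x values.
    """
    def p(x):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(points):
            # Basis term: equals yi at x = xi and 0 at every other node.
            term = Fraction(yi)
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (Fraction(x) - xj) / (xi - xj)
            total += term
        return total
    return p


# Assumption: both data sets are indexed n = 0, 1, 2, 3, 4.
odd = lagrange(list(enumerate([1, 3, 5, 7, 9])))
other = lagrange(list(enumerate([16, 202, 984, 1024, 1111])))

print(odd(5))      # extrapolated value of the first data set at n = 5
print(other(2))    # reproduces a given data point exactly
print(other(5))    # extrapolated value of the second data set at n = 5
```

The interpolating polynomial always passes through every given point, so a zero error on the data says nothing about how trustworthy its extrapolated values are.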

You need to determine which method gives you a function with the smallest error. Sometimes the function obtained will produce values that differ from the input values (though not for exact sequences)! If the error is not acceptable, you may have to change the method used. When a single method produces large errors, consider using more than one method for a set of inputs: for example, one could fit a line through the first two input values and a quadratic through the other three.

Also note that the same set of values could be generated by more than one function!

In your case, using $n = 0, 1, 2, 3, \ldots$, the sequence can be generated from the formula:

$a(n) = 2n + 1$

See: OEIS for specific case.
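One way to arrive at such a formula by "clever observation" is to look at successive differences. The sketch below assumes the values are indexed n = 0, 1, 2, ...; `differences` and `constant_difference_order` are hypothetical helper names. If the first differences are constant, the data is linear, with the common difference as slope and the first value as intercept.

```python
def differences(seq):
    """Return the list of consecutive differences of seq."""
    return [b - a for a, b in zip(seq, seq[1:])]


def constant_difference_order(seq):
    """Return the smallest k whose k-th differences are all equal.

    At least two entries must remain at that level, otherwise the
    "constant" is trivial and None is returned.
    """
    order = 0
    current = list(seq)
    while len(current) >= 2:
        if len(set(current)) == 1:
            return order
        current = differences(current)
        order += 1
    return None


seq = [1, 3, 5, 7, 9]
if constant_difference_order(seq) == 1:
    slope = differences(seq)[0]   # common first difference
    intercept = seq[0]            # value at n = 0
    print(f"a(n) = {slope}*n + {intercept}")

# For the second data set no level of differences is constant,
# so this observation gives no low-degree polynomial.
print(constant_difference_order([16, 202, 984, 1024, 1111]))
```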

It's important to also know that the function obtained may not be correct for values not already given. In your example, I assumed that the values obtained are 1, 3, 5, 7, 9, 11, but maybe the number following 11 in a real-world experiment is 11.5! So you need to consider this fact also. That is why not everyone can predict oil prices or stock values: external factors affect the simple time-versus-value curves we see all the time.

For the second set, there is no known sequence that readily helps. You could resort to the Least Square Calculator with this input: (0,16), (1,202), (2,984), (3,1024), (4,1111) to get the formula:

$a(n) = 301.2 n + 65$

Now, is this function good enough? It is not good: at $n = 2$ it gives 667.4, while the data value is 984. You need to find a different method.
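The fit and its quality can also be checked by hand. The following is a minimal sketch of an ordinary least-squares line fit using the closed-form formulas, assuming the five values are indexed n = 0, 1, 2, 3, 4; `fit_line` is a hypothetical helper name, not a library call.

```python
def fit_line(points):
    """Ordinary least-squares line y = slope * x + intercept."""
    count = len(points)
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    # Slope is the ratio of the x-y covariance sum to the x variance sum.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumption: the values are indexed n = 0, 1, 2, 3, 4.
points = list(enumerate([16, 202, 984, 1024, 1111]))
slope, intercept = fit_line(points)

# Residual = observed value minus fitted value at each n.
residuals = [y - (slope * x + intercept) for x, y in points]
sse = sum(r * r for r in residuals)

print("slope:", slope, "intercept:", intercept)
print("residuals:", residuals)
print("sum of squared errors:", sse)
```

Residuals that are large compared with the data values are the signal that a straight line is the wrong model for this data set.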

NoChance