I have a list of coordinate pairs. To the human eye, they form lines with a constant slope:
This is how I generated the image above:
import numpy as np
np.random.seed(42)
slope = 1.2 # all lines have the same slope
offsets = np.arange(10) # we will have 10 lines, each with different y-intercept
xslist=[]
yslist=[]
for offset in offsets:
    # each line will be described by a variable number of points:
    size = np.random.randint(low=50,high=100)
    # each line starts somewhere between -5 and -2 and ends between 2 and 5
    xs = np.random.uniform(low=np.random.uniform(-5,-2), high=np.random.uniform(2,5),size=size)
    # add some random offset and some random noise
    ys = slope * xs + offset + np.random.normal(loc=0, scale=0.1, size=1) + np.random.normal(loc=0, scale=0.01, size=size)
    xslist.append(xs)
    yslist.append(ys)
# bring all x and y points together into single arrays
xs = np.concatenate(xslist) # xs: array([-0.37261674, 0.58267626, -3.72592914 ...
ys = np.concatenate(yslist) # ys: array([-0.53638699, 0.61729781, -4.52132114,
# plot results
import matplotlib.pyplot as plt
plt.scatter(xs,ys)
I can generate lots of xs and ys. In my real-world scenario, I won't know which point belongs to which line, so I cannot simply separate the points into groups and apply least squares fitting to each group.
How could I, using machine learning or otherwise, build a function which takes xs and ys as input and returns an estimate of the common slope of the lines in an image like the one above?
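One possible direction, shown only as a minimal sketch and not as a known-good solution: if every line shares the slope s, the residuals ys - s * xs should collapse onto a few tight clusters (one per y-intercept), whereas a wrong s smears them out. The sketch below scans candidate slopes and keeps the one whose residual histogram has the lowest entropy. The name `estimate_slope_by_entropy`, the candidate grid and `bin_width` are all assumptions made for illustration, chosen to match the noise level of the mock data.

```python
import numpy as np

def estimate_slope_by_entropy(xs, ys, candidates, bin_width=0.05):
    """Hypothetical helper: return the candidate slope whose residuals
    ys - s * xs are concentrated in the fewest histogram bins."""
    best_slope, best_entropy = None, np.inf
    for s in candidates:
        residuals = ys - s * xs
        # Fixed-width bins, so entropies are comparable across candidates.
        bins = np.floor(residuals / bin_width).astype(np.int64)
        _, counts = np.unique(bins, return_counts=True)
        p = counts / counts.sum()
        entropy = -np.sum(p * np.log(p))
        if entropy < best_entropy:
            best_slope, best_entropy = s, entropy
    return best_slope

# Mock data: same recipe as the first example (10 lines, slope 1.2).
np.random.seed(42)
slope = 1.2
xslist, yslist = [], []
for offset in np.arange(10):
    size = np.random.randint(low=50, high=100)
    xs = np.random.uniform(low=np.random.uniform(-5, -2),
                           high=np.random.uniform(2, 5), size=size)
    ys = (slope * xs + offset
          + np.random.normal(loc=0, scale=0.1, size=1)
          + np.random.normal(loc=0, scale=0.01, size=size))
    xslist.append(xs)
    yslist.append(ys)
xs = np.concatenate(xslist)
ys = np.concatenate(yslist)

# Assumed search range and resolution: slopes in [-10, 10], step 0.01.
candidates = np.linspace(-10, 10, 2001)
estimate = estimate_slope_by_entropy(xs, ys, candidates)
print(estimate, slope)
```

The grid spacing limits the resolution of the estimate, and `bin_width` would presumably need retuning for noisier data or for lines that sit closer together.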
Why simple least squares fitting doesn't seem to work
Let's generate new data where the failure of least squares fitting is more obvious. Let's have a slope of 2.4 and y-intercepts between 0 and a few hundred.
Data generation:
import numpy as np
np.random.seed(42)
slope = 2.4
offsets = np.arange(0,500,100)
xslist=[]
yslist=[]
for offset in offsets:
    size = np.random.randint(low=50,high=100)
    xs = np.random.uniform(low=np.random.uniform(-5,-2), high=np.random.uniform(2,5),size=size)
    ys = slope * xs + offset + np.random.normal(loc=0, scale=0.1, size=1) + np.random.normal(loc=0, scale=0.01, size=size)
    xslist.append(xs)
    yslist.append(ys)
xs = np.concatenate(xslist)
ys = np.concatenate(yslist)
Least squares fitting of a line using np.polyfit():
a, b = np.polyfit(xs, ys, deg=1)
Note that I cannot fit to just one line, as I don't know which points belong to one line.
Plot results:
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.scatter(xs,ys)
line_x = np.arange(-5,5,0.01)
line_y = a*line_x + b
plt.plot(line_x,line_y,c='r',linewidth=10)
plt.gca().set_aspect(1/8)
The slope obtained by least squares fitting (i.e. the slope of the red line) is very different from the slope of the lines formed by the scattered dots. (Note that the x and y axes are scaled differently.)
Printing both a (our slope estimate) and slope (the true slope):
print(a)
print(slope)
gives:
4.295790412452058
2.4
This error is too large for my real-world application.
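For reference, the pooled fit is off because the y-intercepts differ by hundreds: with y = slope * x + c, pooled least squares estimates slope + cov(x, c) / var(x), so any chance correlation between a line's x values and its intercept leaks into the result. The sketch below reuses the generator above, where the groups are still known, to contrast per-line fits with the pooled fit. The names `per_line_slopes` and `pooled_slope` are introduced here for illustration only.

```python
import numpy as np

np.random.seed(42)
slope = 2.4
offsets = np.arange(0, 500, 100)
xslist, yslist = [], []
for offset in offsets:
    size = np.random.randint(low=50, high=100)
    xs = np.random.uniform(low=np.random.uniform(-5, -2),
                           high=np.random.uniform(2, 5), size=size)
    ys = (slope * xs + offset
          + np.random.normal(loc=0, scale=0.1, size=1)
          + np.random.normal(loc=0, scale=0.01, size=size))
    xslist.append(xs)
    yslist.append(ys)

# Per-line fits: only possible here because the generator kept the groups apart.
per_line_slopes = np.array([np.polyfit(x, y, deg=1)[0]
                            for x, y in zip(xslist, yslist)])

# Pooled fit: all that is available when the group labels are unknown.
pooled_slope = np.polyfit(np.concatenate(xslist),
                          np.concatenate(yslist), deg=1)[0]

print("per-line slopes:", per_line_slopes)
print("pooled slope:   ", pooled_slope)
```

This is only a sanity check on the mock data. It does not help with the real data, where the groups are unknown.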
Function to generate mock data
As requested in the comments, here is a function to generate data similar to the above examples:
def get_data(number_of_examples):
    np.random.seed(42)
    list_of_xs = []
    list_of_ys = []
    true_slopes = []
    for _ in range(number_of_examples):
        slope = np.random.uniform(low=-10, high=10)
        offsets = np.arange(0,
                            np.random.randint(low=20, high=200),
                            np.random.randint(low=1, high=10))
        xslist=[]
        yslist=[]
        for offset in offsets:
            size = np.random.randint(low=np.random.randint(low=40, high=60),
                                     high=np.random.randint(low=80, high=100))
            xs = np.random.uniform(low=np.random.uniform(-5,-2),
                                   high=np.random.uniform(2,5),size=size)
            ys = slope * xs + offset + \
                 np.random.normal(loc=0, scale=0.1, size=1) + \
                 np.random.normal(loc=0, scale=0.01, size=size)
            xslist.append(xs)
            yslist.append(ys)
        xs = np.concatenate(xslist)
        ys = np.concatenate(yslist)
        list_of_xs.append(xs)
        list_of_ys.append(ys)
        true_slopes.append(slope)
    return list_of_xs, list_of_ys, true_slopes
Try it, get 10 examples:
list_of_xs, list_of_ys, true_slopes = get_data(10)
Plot results (the slope of the red line is what I am trying to predict using the coordinates of the blue dots):
for xs, ys, true_slope in zip(list_of_xs, list_of_ys, true_slopes):
    plt.figure()
    plt.scatter(xs, ys)
    plt.plot(xs, xs*true_slope, c='r')
and so on.
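To compare candidate approaches on this mock data, something like the following could be used. It is only a sketch: `baseline_estimate` (pooled least squares, standing in for any function that maps xs and ys to a slope) and `score` are hypothetical names, not part of the code above, and absolute error is just one possible metric.

```python
import numpy as np

def get_data(number_of_examples):
    """Same generator as above, repeated so this block runs on its own."""
    np.random.seed(42)
    list_of_xs, list_of_ys, true_slopes = [], [], []
    for _ in range(number_of_examples):
        slope = np.random.uniform(low=-10, high=10)
        offsets = np.arange(0,
                            np.random.randint(low=20, high=200),
                            np.random.randint(low=1, high=10))
        xslist, yslist = [], []
        for offset in offsets:
            size = np.random.randint(low=np.random.randint(low=40, high=60),
                                     high=np.random.randint(low=80, high=100))
            xs = np.random.uniform(low=np.random.uniform(-5, -2),
                                   high=np.random.uniform(2, 5), size=size)
            ys = (slope * xs + offset
                  + np.random.normal(loc=0, scale=0.1, size=1)
                  + np.random.normal(loc=0, scale=0.01, size=size))
            xslist.append(xs)
            yslist.append(ys)
        list_of_xs.append(np.concatenate(xslist))
        list_of_ys.append(np.concatenate(yslist))
        true_slopes.append(slope)
    return list_of_xs, list_of_ys, true_slopes

def baseline_estimate(xs, ys):
    """Pooled least-squares slope; a placeholder for a better estimator."""
    return np.polyfit(xs, ys, deg=1)[0]

def score(estimator, number_of_examples=10):
    """Absolute slope error of `estimator` on each mock example."""
    list_of_xs, list_of_ys, true_slopes = get_data(number_of_examples)
    return np.array([abs(estimator(xs, ys) - true_slope)
                     for xs, ys, true_slope
                     in zip(list_of_xs, list_of_ys, true_slopes)])

errors = score(baseline_estimate)
print("mean abs error:", errors.mean())
print("max abs error: ", errors.max())
```

Because `get_data` reseeds the generator on every call, each estimator passed to `score` is evaluated on the same examples.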





