Finding the position where a picture was taken

Question

Assuming I have a photo where a few significant points are visible, and I can point them out on a map. How many points would I have to identify, in order to find the precise position? Would I also find the altitude?

A small example:

Points:

point   coordinates               elevation    pixel in photo (of 4032x3024)
red      -13.156197,-72.546324    2494.0 m     1229x1201 px
blue     -13.161653,-72.545699    2450.0 m     1050x1938 px
pink     -13.158724,-72.540743    1994.0 m     2156x2436 px
yellow   -13.166027,-72.542900    2394.0 m     1604x2545 px
green    -13.163450,-72.544715    2456.0 m     1208x2165 px

picture with annotated points:

map with points:

I think you would need 4 points for 3D triangulation (if altitude isn't an issue, you can get away with 3 points, but in this case altitude is an issue regardless of whether you actually need it). However, to be precise you would also need intimate knowledge of exactly how the lens of your camera works, in order to find the angles your points actually make with the camera. — Arthur, May 08 '19 at 14:29
Broadly speaking, if you know nothing about the camera’s intrinsics, then (ignoring lens distortion) you need at least six image-scene point correspondences to reconstruct the camera matrix $P$, for which you can then compute a QR decomposition to extract its position and attitude in world coordinates. Assumptions about the camera such as square pixels and no skew provide additional constraints on $P$ and reduce the number of known points needed. On the other hand, since this is noisy real-world data, accuracy is improved by including more point pairs. — amd, May 10 '19 at 23:21
Chapters 7 (and to some degree 8) in Hartley and Zisserman’s Multiple View Geometry In Computer Vision covers these reconstruction techniques in great detail. — amd, May 10 '19 at 23:23
I'd say a sensible assumption here is that vertical of the picture is aligned with real world vertical and that we want to find position on the map only. This greatly simplifies the problem making it essentially 2D — Radost Waszkiewicz, May 13 '19 at 21:41

score 2 · Answer 1 · answered May 17 '19 at 20:11

Taking a geo and camera reference systems as shown in the sketch, with an evident meaning of the various terms, we may consider that, prior to translation, the camera is rotated around the $z$ axis by an angle $\gamma$, and then by an angle $\alpha$ around the new axis $x'$.
The two angles are taken according to the right-hand rule, and will be normally small.

Therefore we have that the new unit vectors will have coordinates $$ \left( {{\bf i'}\,|\,{\bf j'}\,|\,{\bf k'}} \right) = {\bf R}_{\,{\bf x'}} (\alpha ){\bf R}_{\,{\bf z}} (\gamma )\left( {{\bf i}\,|\,{\bf j}\,|\,{\bf k}} \right) = {\bf R}_{\,{\bf z}} (\gamma ){\bf R}_{\,{\bf x}} (\alpha )\left( {{\bf i}\,|\,{\bf j}\,|\,{\bf k}} \right) $$ where $$ {\bf R}_{\,{\bf x}} (\alpha ) = \left( {\matrix{ 1 & 0 & 0 \cr 0 & {\cos \alpha } & { - \sin \alpha } \cr 0 & {\sin \alpha } & {\cos \alpha } \cr } } \right) \quad {\bf R}_{\,{\bf z}} (\gamma ) = \left( {\matrix{ {\cos \gamma } & { - \sin \gamma } & 0 \cr {\sin \gamma } & {\cos \gamma } & 0 \cr 0 & 0 & 1 \cr } } \right) $$

A (vertical) vector fixed in the geo-reference will tranform in the inverse way $$ \left( {\matrix{ {x'} \cr {y'} \cr {z'} \cr } } \right) = {\bf R}_{\,{\bf x}} ^{\, - \,{\bf 1}} (\alpha ){\bf R}_{\,{\bf z}} ^{\, - \,{\bf 1}} (\gamma )\left( {\matrix{ x \cr y \cr z \cr } } \right) = {\bf R}_{\,{\bf x}} ( - \alpha ){\bf R}_{\,{\bf z}} ( - \gamma ) \left( {\matrix{ x \cr y \cr z \cr } } \right) $$

We have then to add a translation vector $$ \left( {\matrix{ {x'} \cr {y'} \cr {z'} \cr } } \right) = {\bf R}_{\,{\bf x}} ( - \alpha ){\bf R}_{\,{\bf z}} ( - \gamma ) \left( {\left( {\matrix{ x \cr y \cr z \cr } } \right) - \left( {\matrix{ {s_{\,x} } \cr {s_{\,y} } \cr {s_{\,z} } \cr } } \right)} \right) $$ such that $$ \eqalign{ & \left( {\matrix{ 0 \cr { - f} \cr 0 \cr } } \right) = {\bf R}_{\,{\bf x}} ( - \alpha ){\bf R}_{\,{\bf z}} ( - \gamma ) \left( {\left( {\matrix{ {c_{\,x} } \cr {c_{\,y} } \cr {c_{\,z} } \cr } } \right) - \left( {\matrix{ {s_{\,x} } \cr {s_{\,y} } \cr {s_{\,z} } \cr } } \right)} \right)\quad \Rightarrow \cr & \Rightarrow \quad \left( {\matrix{ {s_{\,x} } \cr {s_{\,y} } \cr {s_{\,z} } \cr } } \right) = \left( {\matrix{ {c_{\,x} } \cr {c_{\,y} } \cr {c_{\,z} } \cr } } \right) + {\bf R}_{\,{\bf z}} (\gamma ){\bf R}_{\,{\bf x}} (\alpha ) \left( {\matrix{ 0 \cr f \cr 0 \cr } } \right)\quad \Rightarrow \cr & \Rightarrow \quad {\bf s} = {\bf c} + f\,{\bf n} \cr} $$

The line from $\bf p_k$ through $\bf c$ has equation $$ {\bf x} = {\bf p}_{\,k} + \lambda _{\,k} \left( {{\bf c} - {\bf p}_{\,k} } \right) $$ which in the camera reference translates into $$ \eqalign{ & {\bf x}' = {\bf R}_{\,{\bf x}} ( - \alpha ){\bf R}_{\,{\bf z}} ( - \gamma )\left( {{\bf x} - {\bf s}} \right) = \cr & = {\bf R}_{\,{\bf x}} ( - \alpha ){\bf R}_{\,{\bf z}} ( - \gamma )\left( {{\bf p}_{\,k} + \lambda _{\,k} \left( {{\bf c} - {\bf p}_{\,k} } \right) - {\bf s}} \right) = \cr & = {\bf R}_{\,{\bf x}} ( - \alpha ){\bf R}_{\,{\bf z}} ( - \gamma )\left( {\left( {\lambda _{\,k} - 1} \right)\left( {{\bf c} - {\bf p}_{\,k} } \right) - f\,{\bf n}} \right) = \cr & = \left( {\lambda _{\,k} - 1} \right){\bf R}_{\,{\bf x}} ( - \alpha ){\bf R}_{\,{\bf z}} ( - \gamma )\left( {{\bf c} - {\bf p}_{\,k} } \right) - \left( {0,f,0} \right)^{\,T} = \cr & = {\bf q}_{\,k} = \left( {x'_{\,k} ,0,z'_{\,k} } \right)^{\,T} \cr} $$ and $\lambda _{\,k} $ shall be such as to render the $y'_{\,k}$ coordinate null.

In conclusion, we have as unknown the two angles $\alpha, \, \gamma$, the three components of $\bf c$, and if it is unknown, $f$.
Each couple of corresponding points will provide two equations (net of of the introduction of the further unknown $\lambda _{\,k} $).
We need therefore at least three couples of corresponding points.

To counterbalance the error/indetermination in the coordinates the addition of further points will be beneficial and can be dealt by Least Square Error methods.

score 2 · Answer 2 · 2019-05-19T18:11:46.300

I calculate that the camera was at approximately $$\text{coordinates }(-13.1611, -72.5427),\ \text{elevation } 2515.16$$

This comes from a problem in 4 unknowns: 3 positions for the Cartesian coordinates $(P_i,P_j,P_k)$ of the camera, and 1 factor $r$ which relates camera angles to picture distances.

Specifically, I assume that if the camera is at $P$, then two points $X$ and $Y$ are separated in the picture by a distance in pixels of $${XY}_3 := r\, \tan(\angle XPY).$$ (The subscript 3 indicates that this is calculated from the 3-dimensional cordinates.) By the identity $\tan^2=-1+\sec^2$ we get $${XY}_3^2 = r^2\left(-1+\frac{1}{\cos^2(\angle XPY)}\right)$$ $${XY}_3^2 = r^2\left(-1+\frac{|P-X|^2|P-Y|^2}{((P-X)\cdot(P-Y))^2}\right)$$ We compare this with the actual 2-dimensional distance in pixels, which we call ${XY}_2$. We should have ${XY}_3={XY}_2$ for all pairs of points in the picture, and we can approximate this by minimizing $$\sum_{X,Y} ({XY}_3^2-{XY}_2^2)^2$$ starting from a guess that $P$ is the average of the points on the map, and $r$ is roughly the pixel-size of the picture.

Here is Mathematica code for the calculation, including a conversion between latitude-longitude-altitude and a Cartesian coordinate system based at the center of the earth.

red    = {-13.156197, -72.546324, 2494, 1229, 1201}
blue   = {-13.161653, -72.545699, 2450, 1050, 1938}
pink   = {-13.158724, -72.540743, 1994, 2156, 2436}
yellow = {-13.166027, -72.542900, 2394, 1604, 2545}
green  = {-13.163450, -72.544715, 2456, 1208, 2165}
data   = {red, blue, pink, yellow, green}

earthrad    = 2 10^7/Pi
phi[x_]    := x[[1]] Pi/180
theta[x_]  := x[[2]] Pi/180
meters[x_] := {Cos[theta[x]] Cos[phi[x]],
               Sin[theta[x]] Cos[phi[x]], 
               Sin[phi[x]]} (earthrad + x[[3]])

p               = {pi, pj, pk}
Norm2[x_]      := x.x
dist3d[x_, y_] := r^2 (-1 + Norm2[p - x] Norm2[p - y]/((p - x).(p - y))^2)
delta[x_, y_]  := (dist3d[meters[x], meters[y]] - Norm2[(x - y)[[4 ;; 5]]])^2

deltas = Sum[delta[data[[i]], data[[j]]], {i, 1, 4}, {j, i + 1, 5}];
avg    = meters[Mean[data]]
sol    = FindMinimum[deltas, {pi, avg[[1]]}, {pj, avg[[2]]}, {pk, avg[[3]]}, {r, 1000}]

tosphere[{x_, y_, z_}] := {ArcSin[z/Sqrt[x^2 + y^2 + z^2]] 180/Pi,
                           ArcTan[y/x] 180/Pi,
                           Sqrt[x^2 + y^2 + z^2] - earthrad}
tosphere[p /. sol[[2]]]

score 1 · Answer 3 · answered May 17 '19 at 21:49

If you have the camera and if you also have sufficient metadata about the photo, such as the zoom setting of the camera, you can deduce the angle subtended by any two points in the image relative to the actual camera position when the photo was taken. If necessary, you can determine the angle experimentally by taking additional photos of known scenes from known locations using the same camera settings.

Knowing the angle subtended by two points in the image, if you knew the plane in which the camera and the two real-world points all lay when the photo was taken, you could draw a circular arc in that plane and know the camera was somewhere on that arc. Since you do not know which plane all three points were in, however, you only know that the camera was on a surface generated by rotating the arc around the axis through the two known points.

Given three points, in theory the camera has to be simultaneously on three different surfaces that can be deduced from the three pairs of points and the angle subtended by each pair. This would usually limit you to a small number of possible locations, and you might be able to deduce the location from other clues. In practice, your data have measurement errors and you should use more points to estimate the camera location.

Four points would give you six subtended angles. Five points (as in your example) give you $\binom52=10$ subtended angles.

When you take that many angles into account you will find that due to inaccuracies in measurement and calculation, there is no exact solution. In the days of navigation with sextant and handheld instruments on a printed map you might find that three sightings gave you three possible locations. The triangle formed by those locations was called a “cocked hat” and it was assumed you were somewhere inside it. You would have a kind of three-dimensional analogue of a cocked hat, but with software I’m sure it’s possible to find a location that minimizes the errors.

Note that I assume you do all of this in three dimensions from start to finish. So the answer will automatically include knowledge of the altitude of the camera.

It probably pays to convert everything to Cartesian coordinates at the start and then back to latitude/longitude/altitude at the end. For the best accuracy, in theory you would plot all the points relative to a standard Earth geoid, but I don’t know how much difference that makes at the distance of an actual landscape photo.

Finding the position where a picture was taken

3 Answers3