For an unaligned 128-bit load, use:
movups xmm0, [v0]: move unaligned packed single-precision floating point; use it for float or double data. (movupd is 1 byte longer but never makes a performance difference.)
movdqu xmm0, [v0]: move unaligned double quadword
Even if the two quadwords are split across a cache-line boundary, that's normally the best choice for throughput. (On AMD CPUs, there can be a penalty when the load doesn't fit within an aligned 32-byte block of a cache line, not just at a 64-byte cache-line boundary. But on Intel, any misalignment within a 64-byte cache line is free.)
If your loads are feeding integer-SIMD instructions, you probably want movdqu, even though movups is 1 byte shorter in machine code. Some CPUs may care about "domain crossing" for different types of loads. For stores it doesn't matter; many compilers always use movups even for integer data.
See also How can I accurately benchmark unaligned access speed on x86_64 for more about the costs of unaligned loads (SIMD and otherwise).
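For example, a minimal NASM-style sketch of both load flavours (v0 is a placeholder label for wherever your 16 bytes actually live, not necessarily 16-byte aligned):

    movups  xmm0, [v0]        ; 16-byte unaligned load; fine for float or double data
    addps   xmm0, xmm1        ; FP math: movups keeps the load in the FP domain

    movdqu  xmm2, [v0]        ; same bytes, loaded for integer SIMD
    paddd   xmm2, xmm3        ; integer math: movdqu matches the vec-int domain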
If they weren't contiguous, your best bet is
movq xmm0, [v0]: move quadword
movhps xmm0, [v1]: move high packed single-precision floating point. (There's no integer equivalent; use this anyway. Never use movhpd: it's longer for no benefit, because no CPUs care about double vs. float shuffles.)
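A sketch of gathering the two separate qwords into one register (v0 and v1 are placeholder addresses for the two halves):

    movq    xmm0, [v0]        ; load the low 64 bits, zeroing the rest of the register
    movhps  xmm0, [v1]        ; merge the other qword into the high 64 bits
    ; xmm0 now holds both qwords as one 128-bit vector, for either FP or integer use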
Or on old x86 CPUs like Core 2, where movups was slow even when all 16 bytes came from within the same cache line, you might use
movq xmm0, [v0]: move quadword
movhps xmm0, [v0+8]: move high packed single-precision floating point
movhps is slightly more efficient than SSE4.1 pinsrq xmm0, [v1], 1 (2 uops which can't micro-fuse on Intel Sandybridge-family: 1 uop for the load ports, 1 for port 5). movhps is a single micro-fused uop, but it still needs the same back-end ports: load + shuffle.
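For comparison, a sketch of both versions side by side (v0/v1 placeholders again); they produce the same result, but the pinsrq version costs an extra front-end uop on Sandybridge-family:

    movq    xmm0, [v0]
    movhps  xmm0, [v1]        ; 1 micro-fused uop: load port + shuffle (port 5)

    movq    xmm1, [v0]
    pinsrq  xmm1, [v1], 1     ; SSE4.1, 64-bit mode only: 2 separate uops on Intel SnB-family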
See Agner Fog's x86 optimization guide; he has a chapter about SIMD with a big section on data movement. https://agner.org/optimize/ And see other links in https://stackoverflow.com/tags/x86/info.
To get the data back out, movups can work as a store, and so can movlps/movhps to scatter the qword halves. (But don't use movlps as a load: it merges into the old value of the register, creating a false dependency, unlike movq or movsd.)
movlps is 1 byte shorter than movq, but both can store the low 64 bits of an xmm register to memory. Compilers often ignore domain-crossing (vec-int vs. vec-fp) for stores, so you should too: generally use the SSE1 ...ps instructions when they're exactly equivalent for stores. (Not for reg-reg moves; Nehalem can have extra bypass latency when you use movaps to copy between integer-SIMD instructions like paddd, or vice versa.)
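A sketch of the store side (dst, d0, and d1 are placeholder destination addresses):

    movups  [dst], xmm0       ; one unaligned 16-byte store if the destination is contiguous

    movlps  [d0], xmm0        ; or scatter the halves: low qword to one place ...
    movhps  [d1], xmm0        ; ... high qword to another (fine as stores; no merging involved)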
In all cases AFAIK, no CPUs care about float vs. double for anything other than actual add/multiply instructions; there aren't CPUs with separate float and double bypass-forwarding domains. The ISA design leaves that option open, but in practice there's never a penalty for saving a byte by using movups or movaps to copy around a vector of double, or for using movlps instead of movlpd. Double-precision shuffles are sometimes useful, because unpcklpd is like punpcklqdq (interleave 64-bit elements), while unpcklps is like punpckldq (interleave 32-bit elements).
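A quick illustration of that shuffle point (register contents are hypothetical: a = {a0, a1, ...} in xmm0/xmm2, b = {b0, b1, ...} in xmm1/xmm3):

    unpcklpd xmm0, xmm1       ; -> {a0, b0}: interleave 64-bit halves, same data movement as punpcklqdq
    unpcklps xmm2, xmm3       ; -> {a0, b0, a1, b1}: interleave 32-bit elements, same as punpckldq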