Blanket statements blaming DMA for large buffer alignment restrictions are wrong.
Hardware DMA transfers are usually aligned on 4- or 8-byte boundaries, since the PCI bus physically transfers 32 or 64 bits at a time. Beyond that basic alignment, hardware DMA is designed to work with whatever address it is given.
However, the hardware deals with physical addresses, while the OS deals with virtual memory addresses (which are a protected-mode construct on x86 CPUs). This means that a buffer which is contiguous in process address space may not be contiguous in physical RAM. Unless care is taken to create physically contiguous buffers, the DMA transfer has to be broken up at VM page boundaries (typically 4K, possibly 2M with large pages).
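To make that concrete, here's a rough user-space sketch (not kernel code; the printing stands in for building a scatter-gather list) of why a virtually contiguous buffer has to be described to the DMA engine one page at a time, since each 4K page can sit at an unrelated physical address:

```c
/* Rough illustration only: split a virtually contiguous buffer into
 * page-sized chunks, the way a driver would when each page may map to
 * a different physical frame. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

static void split_into_page_chunks(const void *buf, size_t len)
{
    uintptr_t addr = (uintptr_t)buf;
    size_t remaining = len;

    while (remaining > 0) {
        /* Bytes left before the next page boundary. */
        size_t chunk = PAGE_SIZE - (addr % PAGE_SIZE);
        if (chunk > remaining)
            chunk = remaining;

        /* A real driver would look up the physical frame for this page
         * and append (phys_addr, chunk) to a scatter-gather list here. */
        printf("segment: virt=%#lx len=%zu\n", (unsigned long)addr, chunk);

        addr += chunk;
        remaining -= chunk;
    }
}

int main(void)
{
    char buf[10000];
    /* Deliberately unaligned start to show the short first segment. */
    split_into_page_chunks(buf + 100, sizeof(buf) - 100);
    return 0;
}
```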
As for buffers needing to be aligned to the disk sector size, that is simply untrue; the DMA hardware is completely oblivious to the physical sector size of the drive.
Under Linux 2.4, O_DIRECT required 4K alignment; under 2.6 this has been relaxed to 512B. In either case, it was probably a design decision to prevent single-sector updates from crossing VM page boundaries and therefore requiring split DMA transfers. (A 512B buffer placed at an arbitrary address has roughly a 1-in-8 chance (511/4096) of crossing a 4K page boundary; a 512B-aligned one never does.)
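For illustration, here's a minimal sketch of how a program typically satisfies those O_DIRECT rules on Linux, using posix_memalign to get an aligned buffer. The file name and buffer size are arbitrary, and error handling is abbreviated:

```c
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t buf_size = 4096;   /* one VM page = eight 512B sectors */
    void *buf = NULL;

    /* posix_memalign guarantees the requested alignment; 4096 satisfies
     * both the old 4K rule (2.4) and the relaxed 512B rule (2.6). */
    if (posix_memalign(&buf, 4096, buf_size) != 0)
        return 1;
    memset(buf, 0xAB, buf_size);

    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    /* With O_DIRECT the length and file offset must also meet the
     * alignment rule, not just the buffer address. */
    ssize_t n = write(fd, buf, buf_size);

    close(fd);
    free(buf);
    return n == (ssize_t)buf_size ? 0 : 1;
}
```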
So, while the OS rather than the hardware is to blame, we can see why page-aligned buffers are more efficient.
Edit: Of course, if we're writing large buffers anyway (100KB), then the number of VM page boundaries crossed will be practically the same whether we've aligned to 512B or not: a 100KB transfer spans about 25 pages, so it crosses 24 or 25 boundaries no matter where it starts.
So the main case being optimized by 512B alignment is single-sector transfers.
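A quick back-of-the-envelope check of that claim (the start addresses below are made-up examples, chosen only to show aligned vs. unaligned placement):

```c
/* Count how many 4K page boundaries a transfer crosses.  A 512B
 * transfer crosses 0 or 1 depending on alignment; a 100KB transfer
 * crosses ~25 either way. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

static size_t page_crossings(uintptr_t start, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first_page = start / PAGE_SIZE;
    uintptr_t last_page  = (start + len - 1) / PAGE_SIZE;
    return (size_t)(last_page - first_page);
}

int main(void)
{
    /* Single sector: a 512B-aligned start never crosses; a worst-case
     * unaligned start does. */
    printf("512B  @ 512B-aligned: %zu\n", page_crossings(4096 - 512, 512));
    printf("512B  @ unaligned   : %zu\n", page_crossings(4096 - 100, 512));

    /* 100KB: ~25 boundaries regardless of the start offset. */
    printf("100KB @ aligned     : %zu\n", page_crossings(0, 100 * 1024));
    printf("100KB @ unaligned   : %zu\n", page_crossings(300, 100 * 1024));
    return 0;
}
```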