This is a long shot, if you think the question is too localized, please do vote to close. I have searched on the caffe2 github repository, opened an issue asking the same question, opened another issue at the caffe2_ccp_tutorials repository because its author seems to understand it best, read the doxygen documentation on caffe2::Tensor and caffe2::CUDAContext,
and even gone through the caffe2 source code, and in specific the tensor.h, context_gpu.h and context_gpu.cc.
I understand that currently caffe2 does not allow copying device memory to a tensor. I am willing to expand the library and do a pull request in order to achieve this. My reason behind this is that I do all image pre-processing using cv::cuda::* methods which operate on device memory, and as such I think it is clearly a problem doing the pre-processing on the gpu, only to download the result back on the host, and then have it re-uploaded to the network from host to device.
Looking at the constructors of Tensor<Context> I can see that maybe only
template<class SrcContext , class ContextForCopy >
Tensor (const Tensor< SrcContext > &src, ContextForCopy *context)
might achieve what I want, but I have no idea how to set the <ContextForCopy> and then use it for construction.
Furthermore, I see that I can construct the Tensor with the correct dimensions, and then maybe using
template <typename T>
T* mutable_data()
I can assign/copy the data.
The data itself is stored in std::vector<cv::cuda::GpuMat, so I will have to iterate it, and then use either cuda::PtrStepSz or cuda::PtrStep to access the underlying device allocated data.
That is the same data that I need to copy/assign into the caffe2::Tensor<CUDAContext>.
I've been trying to find out how internally the Tensor<CPUContext> is copied to Tensor<CUDAContext> since I've seen examples of it, but I can't figure it out, although I think the method used is CopyFrom. The usual examples as already mentioned, copy from CPU to GPU:
TensorCPU tensor_cpu(...);
TensorCUDA tensor_cuda = workspace.CreateBlob("input")->GetMutable<TensorCUDA>();
tensor_cuda->ResizeLike(tensor_cpu);
tensor_cuda->ShareData(tensor_cpu);
I am quite suprised nobody has run into this task yet, and a brief search yields only one open issue where the author (@peterneher) is asking the same thing more or less.