python - How to slice a memmap efficiently?
I am currently working on a quite huge dataset that barely fits into memory, so I use np.memmap. At a certain point I have to split the dataset into training and test parts. I have found such a case when I want to slice an np.memmap using an index array (below you can find the code and memory allocations):
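For context, here is a minimal sketch of how such a file might be created (assumed setup, not shown in the original; the profiled snippets are Python 2, where 1221508/4 is integer division):

import numpy as np

# Assumed setup: create the roughly 1 GB file that the profiled functions map.
arr_size = (1221508 // 4, 430)   # // so this also runs under Python 3
big = np.memmap('big_mem_test.mmap', shape=arr_size,
                dtype=np.float64, mode='w+')
big.flush()
del big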
Line #    Mem usage    Increment   Line Contents
================================================
     7     29.340 MB     0.000 MB   def my_func2():
     8     29.340 MB     0.000 MB       arr_size = (1221508/4,430)
     9     29.379 MB     0.039 MB       big_mmap = np.memmap('big_mem_test.mmap',shape=arr_size, dtype=np.float64, mode='r')
    10     38.836 MB     9.457 MB       idx = range(arr_size[0])
    11   2042.605 MB  2003.770 MB       sub = big_mmap[idx,:]
    12   3046.766 MB  1004.160 MB       sub2 = big_mmap[idx,:]
    13   3046.766 MB     0.000 MB       return type(sub)
But if I take a continuous slice, I use this code instead:
Line #    Mem usage    Increment   Line Contents
================================================
    15     29.336 MB     0.000 MB   def my_func3():
    16     29.336 MB     0.000 MB       arr_size = (1221508/4,430)
    17     29.375 MB     0.039 MB       big_mmap = np.memmap('big_mem_test.mmap',shape=arr_size, dtype=np.float64, mode='r')
    18     29.457 MB     0.082 MB       sub = big_mmap[0:1221508/4,:]
    19     29.457 MB     0.000 MB       sub2 = big_mmap[0:1221508/4,:]
Notice that in the second example there is no memory allocation in lines 18 and 19, and the whole operation is a lot faster. In the first example, line 11 allocates memory for the whole big_mmap matrix, which is read during slicing. More surprisingly, line 12 allocates again; doing a few more such operations, you can run out of memory. Since the indexes I use when splitting the dataset are rather random and not continuous, I cannot use the big_mmap[start:end,:] notation.
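The tables above are line-by-line reports from the memory_profiler package. A minimal sketch of how such a report can be produced, assuming memory_profiler is installed:

# profile_memmap.py -- run with: python -m memory_profiler profile_memmap.py
import numpy as np
from memory_profiler import profile

@profile
def my_func3():
    arr_size = (1221508 // 4, 430)
    big_mmap = np.memmap('big_mem_test.mmap', shape=arr_size,
                         dtype=np.float64, mode='r')
    sub = big_mmap[0:arr_size[0], :]    # continuous slice: a cheap view
    sub2 = big_mmap[0:arr_size[0], :]
    return type(sub)

if __name__ == '__main__':
    my_func3()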
My questions are:

Is there any other method that would allow me to slice a memmap without reading the whole data into memory?
Why is the whole matrix read into memory when slicing with an index array (example one)?
Why is the data read and allocated again (first example, line 12)?
The double allocation you are seeing in your first example isn't due to memmap behaviour; rather, it is due to how __getitem__ is implemented in numpy's ndarray class. When an ndarray is indexed using a list (as in your first example), the data are copied from the source array. When it is indexed using a slice object, a view onto the source array is created (no data are copied). For example:
In [2]: x = np.arange(16).reshape((4,4))

In [3]: x
Out[3]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [4]: y = x[[0, 2], :]

In [5]: y[:, :] = 100

In [6]: x
Out[6]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
Here y is a copy of the data in x, so changing y had no effect on x. Now index the same array via slicing:
In [7]: z = x[::2, :]

In [8]: z[:, :] = 100

In [9]: x
Out[9]:
array([[100, 100, 100, 100],
       [  4,   5,   6,   7],
       [100, 100, 100, 100],
       [ 12,  13,  14,  15]])
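The same distinction applies to a memmap: a slice still refers to the file-backed buffer, while list indexing materialises an in-memory copy. A quick way to verify this, assuming big_mmap from the question and a numpy recent enough to have np.shares_memory (1.11+):

view = big_mmap[0:100, :]        # slice: a view, nothing is read or copied yet
fancy = big_mmap[[0, 2, 5], :]   # list index: the selected rows are copied

print(np.shares_memory(view, big_mmap))    # True  -- still file-backed
print(np.shares_memory(fancy, big_mmap))   # False -- an in-memory copy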
Regarding your first question, I'm not aware of any method that would let you create arbitrary slices spanning the entire array without reading the entire array into memory. Two options you might consider (in addition to HDF5/PyTables, as discussed elsewhere), with sketches after the list:
1. If you are accessing elements of your training and test sets sequentially (rather than operating on them as two entire arrays), write a small wrapper class whose __getitem__ method uses your index arrays to pull the appropriate sample from the memmap (i.e., training[i] returns big_mmap[training_ids[i]]); see the first sketch below.
2. Split your array into two separate files, each containing exclusively training or test values, and use two separate memmap objects; see the second sketch below.
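A minimal sketch of the first option, assuming big_mmap and an index array training_ids as in the question (the class name is illustrative):

import numpy as np

class MemmapSubset:
    """Lazy row access into a memmap; only one row is read per lookup."""
    def __init__(self, mmap, ids):
        self.mmap = mmap   # the underlying np.memmap
        self.ids = ids     # row indices that belong to this subset

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, i):
        # A scalar row index pages in just this row from disk, instead of
        # copying the whole array as fancy indexing with a list would.
        return self.mmap[self.ids[i], :]

training = MemmapSubset(big_mmap, training_ids)
sample = training[0]   # reads a single row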
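And a sketch of the second option, copying rows out one at a time so memory use stays bounded (file names are illustrative):

import numpy as np

n_cols = big_mmap.shape[1]
train_out = np.memmap('train.mmap', shape=(len(training_ids), n_cols),
                      dtype=np.float64, mode='w+')
for out_row, src_row in enumerate(training_ids):
    train_out[out_row, :] = big_mmap[src_row, :]   # one row in memory at a time
train_out.flush()
# Repeat with test_ids and 'test.mmap', then reopen both with mode='r'.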