python - How to slice memmap efficiently?


I am currently working on quite a huge dataset which barely fits into my memory, so I use np.memmap. At some point I have to split my dataset into training and test sets. I have found such a case when I want to slice np.memmap using an index array (below you can find the code and memory allocations):

Line #    Mem usage    Increment   Line Contents
================================================
     7    29.340 MB     0.000 MB   def my_func2():
     8    29.340 MB     0.000 MB       arr_size = (1221508/4,430)
     9    29.379 MB     0.039 MB       big_mmap = np.memmap('big_mem_test.mmap',shape=arr_size, dtype=np.float64, mode='r')
    10    38.836 MB     9.457 MB       idx = range(arr_size[0])
    11  2042.605 MB  2003.770 MB       sub = big_mmap[idx,:]
    12  3046.766 MB  1004.160 MB       sub2 = big_mmap[idx,:]
    13  3046.766 MB     0.000 MB       return type(sub)

But if I take a contiguous slice I would rather use this code:

Line #    Mem usage    Increment   Line Contents
================================================
    15    29.336 MB     0.000 MB   def my_func3():
    16    29.336 MB     0.000 MB       arr_size = (1221508/4,430)
    17    29.375 MB     0.039 MB       big_mmap = np.memmap('big_mem_test.mmap',shape=arr_size, dtype=np.float64, mode='r')
    18    29.457 MB     0.082 MB       sub = big_mmap[0:1221508/4,:]
    19    29.457 MB     0.000 MB       sub2 = big_mmap[0:1221508/4,:]

Notice that in the second example, in lines 18 and 19, there is no memory allocation and the whole operation is a lot faster.

In the first example, in line 11, there is an allocation: the whole big_mmap matrix is read during slicing. What is more surprising, in line 12 there is another allocation. Doing more such operations you can run out of memory.

When I split my data set the indexes are rather random and not contiguous, so I cannot use the big_mmap[start:end,:] notation.
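(For reference, a minimal sketch of how the listings above can be reproduced. The file-creation step is my own assumption, since the post only opens the file read-only; the shape and file name are taken from the listings, and the resulting file is roughly 1 GB.)

import numpy as np
from memory_profiler import profile

arr_size = (1221508 // 4, 430)   # same shape as in the listings above

# One-off setup (assumed): create the test file that the listings open read-only.
fp = np.memmap('big_mem_test.mmap', shape=arr_size, dtype=np.float64, mode='w+')
fp.flush()
del fp

@profile
def my_func2():
    big_mmap = np.memmap('big_mem_test.mmap', shape=arr_size, dtype=np.float64, mode='r')
    idx = range(arr_size[0])
    sub = big_mmap[idx, :]    # index-array ("fancy") indexing
    sub2 = big_mmap[idx, :]
    return type(sub)

if __name__ == '__main__':
    my_func2()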

My questions are:

Is there any other method which will allow me to slice a memmap without reading the whole data into memory?

Why is the whole matrix read into memory when slicing by index (example one)?

Why is the data read and allocated again (first example, line 12)?

The double allocation you are seeing in your first example isn't due to memmap behaviour; rather, it is due to how __getitem__ is implemented for numpy's ndarray class. When an ndarray is indexed using a list (as in your first example), the data are copied from the source array. When it is indexed using a slice object, a view is created into the source array (no data are copied). For example:

In [2]: x = np.arange(16).reshape((4,4))

In [3]: x
Out[3]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [4]: y = x[[0, 2], :]

In [5]: y[:, :] = 100

In [6]: x
Out[6]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

Here y is a copy of the data in x, so changing y had no effect on x. Now index the array via slicing:

In [7]: z = x[::2, :]

In [8]: z[:, :] = 100

In [9]: x
Out[9]:
array([[100, 100, 100, 100],
       [  4,   5,   6,   7],
       [100, 100, 100, 100],
       [ 12,  13,  14,  15]])
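The same distinction applies directly to a memmap: a slice stays backed by the file, while indexing with a list forces an in-memory copy. A quick way to check this (a small illustrative sketch; the file name and shape are made up, and np.shares_memory requires a reasonably recent NumPy):

import numpy as np

m = np.memmap('small_test.mmap', shape=(1000, 4), dtype=np.float64, mode='w+')

view = m[0:500, :]       # slice: a view, still backed by the mapped file
copy = m[[0, 2, 4], :]   # list index: the selected rows are copied into ordinary memory

print(np.shares_memory(view, m))   # True  -- no data were copied
print(np.shares_memory(copy, m))   # False -- fancy indexing made a copy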

Regarding your first question, I'm not aware of a method that will allow you to create arbitrary slices of the entire array without reading the entire array into memory. Two options you might consider (in addition to HDF5/PyTables, as discussed):

  1. If you will be accessing elements of your training and test sets sequentially (rather than operating on them as two entire arrays), you can write a small wrapper class whose __getitem__ method uses your index arrays to pull the appropriate sample from the memmap (i.e., training[i] returns big_mmap[training_ids[i]]); see the sketches after this list.

  2. Split your array into two separate files, which contain exclusively training or test values, and use two separate memmap objects.
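Neither sketch below is from the original answer; both are minimal illustrations of the two options, with made-up class and file names. Option 1, a lazy wrapper over the memmap:

import numpy as np

class MemmapSubset(object):
    """Sketch of option 1: pull single samples from the memmap via an index array."""
    def __init__(self, mmap_arr, indices):
        self.mmap_arr = mmap_arr
        self.indices = np.asarray(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # Only the requested row is touched; the rest of the file stays on disk.
        return self.mmap_arr[self.indices[i], :]

arr_size = (1221508 // 4, 430)
big_mmap = np.memmap('big_mem_test.mmap', shape=arr_size, dtype=np.float64, mode='r')

# Random, non-contiguous train/test split.
perm = np.random.permutation(arr_size[0])
training_ids, test_ids = perm[:200000], perm[200000:]

training = MemmapSubset(big_mmap, training_ids)
test = MemmapSubset(big_mmap, test_ids)

sample = training[0]   # training[i] returns big_mmap[training_ids[i]], one row at a time

And option 2, paying the copy cost once up front so that later access can use plain contiguous slices:

# Copy the training rows into their own file, row by row, so RAM use stays small.
train_file = np.memmap('train.mmap', shape=(len(training_ids), arr_size[1]),
                       dtype=np.float64, mode='w+')
for out_row, src_row in enumerate(training_ids):
    train_file[out_row, :] = big_mmap[src_row, :]
train_file.flush()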

