matlab - Can I read a gigantic text file with Parallel Computing?


I have multiple text files, each about 2 GB in size (approximately 70 million lines). I also have a quad-core machine and access to the Parallel Computing Toolbox.

Typically you might open a file and read lines like so:

    f = fopen('file.txt');
    l = fgets(f);
    while ischar(l)        % fgets returns -1 (not empty) at end of file
        % do something with l
        l = fgets(f);
    end
    fclose(f);

I wanted to distribute the "do something with l" step across my 4 cores, which of course requires the use of a parfor loop. But that would require me to "slurp" the 2 GB file (to borrow a Perl term) into MATLAB a priori, instead of processing it on the fly. I don't actually need l, just the result of the processing.

Is there a way to read lines out of a text file with parallel computing?

Edit: It's worth mentioning that I can find the exact number of lines ahead of time (!wc -l mygiantfile.txt).

Edit 2: The structure of the file is as follows:

15 1180 62444 e0e0 049c f3ec 104 

So 3 decimal numbers, 3 hex numbers, and 1 decimal number. Repeat for 70 million lines.
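For illustration, lines in this layout can be read a block at a time instead of all at once. The following is only a minimal sketch: the file name sample_lines.txt, the chunk size, and the "processing" (counting lines and summing the first hex column) are assumptions made up for the example. It relies on fscanf accepting %x for hexadecimal fields and a [7 chunk] size argument to limit how many lines are read per call.

```matlab
% Create a tiny sample file in the stated format (hypothetical file name).
fname = 'sample_lines.txt';
fid = fopen(fname, 'w');
for k = 1:5
    fprintf(fid, '15 1180 62444 e0e0 049c f3ec 104\n');
end
fclose(fid);

chunk  = 2;      % lines per read; a real run would use a much larger value
nlines = 0;      % running count of lines seen
hexsum = 0;      % placeholder "result of processing": sum of the 4th column

fid = fopen(fname, 'r');
while true
    % each column of vals holds the 7 numbers of one line
    vals = fscanf(fid, '%d %d %d %x %x %x %d', [7 chunk]);
    if isempty(vals)
        break;   % nothing left to read
    end
    nlines = nlines + size(vals, 2);
    hexsum = hexsum + sum(vals(4, :));
end
fclose(fid);
```

Only one chunk of lines is held in memory at a time, so the whole 2 GB file never needs to be loaded.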

As requested, I'm showing an example of memory-mapped files using the memmapfile class.

Since you didn't provide the exact format of the data file, I will create my own. The data I am creating is a table of n rows, each consisting of 4 columns:

  • the first is a double scalar value
  • the second is a single value
  • the third is a fixed-length string representing a uint32 in hex notation (e.g. D091BB44)
  • the fourth is a uint8 value

The code to generate the random data and write it to a binary file structured as described above:

    % random data
    n = 10;
    data = [...
        num2cell(rand(n,1)), ...
        num2cell(rand(n,1,'single')), ...
        cellstr(dec2hex(randi(intmax('uint32'), [n,1]),8)), ...
        num2cell(randi([0 255], [n,1], 'uint8')) ...
    ];

    % write binary file, one record per row
    fid = fopen('file.bin', 'wb');
    for i=1:n
        fwrite(fid, data{i,1}, 'double');
        fwrite(fid, data{i,2}, 'single');
        fwrite(fid, data{i,3}, 'char');
        fwrite(fid, data{i,4}, 'uint8');
    end
    fclose(fid);

Here is the resulting file viewed in a hex editor:

[Image: binary file viewed in a hex editor]

We can confirm the first record (note that my system uses little-endian byte ordering):

    >> num2hex(data{1,1})
    ans =
    3fd4d780d56f2ca6

    >> num2hex(data{1,2})
    ans =
    3ddd473e

    >> arrayfun(@dec2hex, double(data{1,3}), 'UniformOutput',false)
    ans =
        '46'    '35'    '36'    '32'    '37'    '35'    '32'    '46'

    >> dec2hex(data{1,4})
    ans =
    C0

Next, we open the file using memory-mapping:

    m = memmapfile('file.bin', 'Offset',0, 'Repeat',Inf, 'Writable',false, ...
        'Format',{
            'double', [1 1], 'd';
            'single', [1 1], 's';
            'uint8' , [1 8], 'h';      % since it doesn't directly support char
            'uint8' , [1 1], 'i'});

Now we can access the records as an ordinary structure array:

    >> rec = m.Data;      % 10x1 struct array

    >> rec(1)             % same as: data(1,:)
    ans =
        d: 0.3257
        s: 0.1080
        h: [70 53 54 50 55 53 50 70]
        i: 192

    >> rec(4).d           % same as: data{4,1}
    ans =
        0.5799

    >> char(rec(10).h)    % same as: data{10,3}
    ans =
    2B2F493F

The benefit is that for large data files, you can restrict the mapping to a "viewing window" covering a small subset of the records, and move this view along the file:

    % read the records 2 at a time
    numrec = 10;                       % total number of records
    lenrec = 8*1 + 4*1 + 1*8 + 1*1;    % length of each record in bytes
    numrecperview = 2;                 % how many records in a viewing window

    m.Repeat = numrecperview;
    for i=1:(numrec/numrecperview)
        % move the window along the file
        m.Offset = (i-1) * numrecperview*lenrec;

        % read the two records in this window:
        %for j=1:numrecperview, m.Data(j), end
        m.Data(1)
        m.Data(2)
    end

[Image: accessing a portion of the file using memory-mapping]
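Coming back to the original question, the same windowing idea could be combined with parfor, with each iteration mapping its own slice of the file. What follows is only a sketch under stated assumptions: the records are fixed-length (21 bytes, as in my made-up format), the file name demo_records.bin and the number of chunks are hypothetical, and summing the d field merely stands in for the real per-record processing.

```matlab
% Sketch only: demo_records.bin, numchunks and the per-record work
% (summing field d) are made up for illustration.
fname = 'demo_records.bin';
n = 10;
d = rand(n, 1);

% write n records using the 21-byte layout: double, single, 8 chars, uint8
fid = fopen(fname, 'w');
for i = 1:n
    fwrite(fid, d(i), 'double');
    fwrite(fid, single(i), 'single');
    fwrite(fid, dec2hex(i, 8), 'uint8');   % 8 ASCII characters, 1 byte each
    fwrite(fid, uint8(i), 'uint8');
end
fclose(fid);

fmt = {'double', [1 1], 'd'; 'single', [1 1], 's'; ...
       'uint8',  [1 8], 'h'; 'uint8',  [1 1], 'i'};
lenrec    = 8 + 4 + 8 + 1;                 % bytes per record
numchunks = 4;                             % e.g. one chunk per core
edges     = round(linspace(0, n, numchunks + 1));   % record boundaries
partial   = zeros(numchunks, 1);

parfor k = 1:numchunks
    count = edges(k+1) - edges(k);         % records in this chunk
    s = 0;
    if count > 0
        % each iteration maps only its own window of the file
        mm = memmapfile(fname, 'Offset', edges(k)*lenrec, ...
            'Repeat', count, 'Writable', false, 'Format', fmt);
        rec = mm.Data;
        s = sum([rec.d]);                  % placeholder for the real work
    end
    partial(k) = s;
end
total = sum(partial);
```

Because every record has the same length, the byte offset of any chunk can be computed directly, so no worker has to scan the file from the beginning. A text file with variable-length lines would not allow this without first converting it to fixed-length records or indexing the line offsets.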

