matlab - Can I read a gigantic text file with Parallel Computing?
I have multiple text files, each about 2 GB in size (approximately 70 million lines). I also have a quad-core machine and access to the Parallel Computing Toolbox.
Typically you might open a file and read lines like so:
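    f = fopen('file.txt');
    l = fgets(f);
    while ischar(l)   % fgets returns -1 (not an empty value) at end of file
        % do something with l
        l = fgets(f);
    end
    fclose(f);

I wanted to distribute the "do something with l" step across my 4 cores, but that of course requires the use of a parfor loop. That would require me to "slurp" the 2 GB file (to borrow a Perl term) into MATLAB a priori, instead of processing it on the fly. I don't actually need l, just the result of the processing.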
Is there a way to read lines out of a text file with parallel computing?
Edit: It's worth mentioning that I can find the exact number of lines ahead of time (!wc -l mygiantfile.txt).
Edit 2: The structure of the file is as follows:
    15 1180 62444 e0e0 049c f3ec 104

So that is 3 decimal numbers, 3 hex numbers, and 1 decimal number. Repeat for 70 million lines.
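For reference, a single line in this layout can be split into numbers with sscanf. This is only a minimal sketch: it assumes the fields are whitespace-separated exactly as in the sample line, and the variable names (txt, vals, decPart, hexPart) are made up for illustration.

```matlab
% One line in the layout from the question (copied from the sample above)
txt = '15 1180 62444 e0e0 049c f3ec 104';

% %d reads a decimal integer, %x reads a hexadecimal integer;
% sscanf returns all seven values as one double column vector
vals = sscanf(txt, '%d %d %d %x %x %x %d');

decPart = vals([1 2 3 7]).';   % the four decimal fields
hexPart = vals(4:6).';         % the three hex fields, converted to numbers

disp(decPart)
disp(hexPart)
```

The per-line "processing" from the question would then operate on vals rather than on the raw text.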
As requested, I'm showing an example of memory-mapped files using the memmapfile class.
Since you didn't provide the exact format of the data file, I will create my own. The data I am creating is a table of n rows, each consisting of 4 columns:
- the first is a double scalar value
- the second is a single value
- the third is a fixed-length string representing a uint32 in hex notation (e.g. D091BB44)
- the fourth column is a uint8 value
The code to generate the random data and write it to a binary file structured as described above:
    % random data
    n = 10;
    data = [...
        num2cell(rand(n,1)), ...
        num2cell(rand(n,1,'single')), ...
        cellstr(dec2hex(randi(intmax('uint32'), [n,1]),8)), ...
        num2cell(randi([0 255], [n,1], 'uint8')) ...
    ];

    % write to a binary file, one record at a time
    fid = fopen('file.bin', 'wb');
    for i=1:n
        fwrite(fid, data{i,1}, 'double');
        fwrite(fid, data{i,2}, 'single');
        fwrite(fid, data{i,3}, 'char');
        fwrite(fid, data{i,4}, 'uint8');
    end
    fclose(fid);

Here is the resulting file viewed in a hex editor:

[image: file.bin viewed in a hex editor]

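The same kind of byte-level view can be produced from within MATLAB. The following is a minimal sketch, not part of the original answer: it writes one record with fixed, made-up values (1, single(1), 'D091BB44', 192) to a scratch file, then reads the raw bytes back and prints them as hex. The names tmp, bytes and hexDump are hypothetical.

```matlab
% Write one record in the layout described above to a scratch file
tmp = [tempname '.bin'];            % hypothetical temporary file name
fid = fopen(tmp, 'w');
fwrite(fid, 1, 'double');           % 8 bytes
fwrite(fid, single(1), 'single');   % 4 bytes
fwrite(fid, 'D091BB44', 'char');    % 8 bytes, one per character
fwrite(fid, 192, 'uint8');          % 1 byte
fclose(fid);

% Read the raw bytes back, exactly as they are stored on disk
fid = fopen(tmp, 'r');
bytes = fread(fid, Inf, '*uint8');
fclose(fid);
delete(tmp);

% Show every byte as two hex digits, in file order
hexDump = dec2hex(bytes, 2);
disp(strjoin(cellstr(hexDump).', ' '))
fprintf('record length: %d bytes\n', numel(bytes));
```

On a little-endian machine the first eight bytes come out as 00 00 00 00 00 00 F0 3F, which is the byte-reversed form of what num2hex(1) prints (3ff0000000000000).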
We can confirm this for the first record (note that my system uses little-endian byte ordering):
    >> num2hex(data{1,1})
    ans =
    3fd4d780d56f2ca6

    >> num2hex(data{1,2})
    ans =
    3ddd473e

    >> arrayfun(@dec2hex, double(data{1,3}), 'UniformOutput',false)
    ans =
        '46'    '35'    '36'    '32'    '37'    '35'    '32'    '46'

    >> dec2hex(data{1,4})
    ans =
    C0

Next we open the file using memory-mapping:
    m = memmapfile('file.bin', 'Offset',0, 'Repeat',Inf, 'Writable',false, ...
        'Format',{
            'double', [1 1], 'd';
            'single', [1 1], 's';
            'uint8' , [1 8], 'h';    % since it doesn't directly support char
            'uint8' , [1 1], 'i'});

Now you can access the records as an ordinary structure array:
    >> rec = m.Data;      % 10x1 struct array

    >> rec(1)             % same as: data(1,:)
    ans =
        d: 0.3257
        s: 0.1080
        h: [70 53 54 50 55 53 50 70]
        i: 192

    >> rec(4).d           % same as: data{4,1}
    ans =
        0.5799

    >> char(rec(10).h)    % same as: data{10,3}
    ans =
    2B2F493F

The benefit of all this is that for large data files, you can restrict the mapping "viewing window" to a small subset of the records, and move this view along the file:
    % read the records two at a time
    numrec = 10;                       % total number of records
    lenrec = 8*1 + 4*1 + 1*8 + 1*1;    % length of each record in bytes
    numrecperview = 2;                 % how many records in a viewing window

    m.Repeat = numrecperview;
    for i=1:(numrec/numrecperview)
        % move the window along the file
        m.Offset = (i-1) * numrecperview*lenrec;

        % read the two records in this window:
        %for j=1:numrecperview, m.Data(j), end
        m.Data(1)
        m.Data(2)
    end
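Coming back to the original question, the windowed view might be combined with parfor by letting each iteration map its own window. The answer above does not do this, so treat the following as a sketch under stated assumptions: it uses the same 21-byte record layout, writes a small scratch file with made-up values, and uses a placeholder "processing" step (summing the d field). The names fname, fmt, partial and total are hypothetical. Without an open parallel pool the parfor loop simply runs serially.

```matlab
% Build a small scratch file of n records: double, single, 8 chars, uint8
fname = [tempname '.bin'];          % hypothetical temporary file name
n = 10;
fid = fopen(fname, 'w');
for i = 1:n
    fwrite(fid, i, 'double');
    fwrite(fid, single(i), 'single');
    fwrite(fid, 'D091BB44', 'char');
    fwrite(fid, i, 'uint8');
end
fclose(fid);

lenRec = 8 + 4 + 8 + 1;             % bytes per record
numRecPerView = 2;                  % records per window (assumed to divide n)
numViews = n / numRecPerView;
fmt = {'double', [1 1], 'd'; ...
       'single', [1 1], 's'; ...
       'uint8',  [1 8], 'h'; ...
       'uint8',  [1 1], 'i'};

partial = zeros(numViews, 1);
parfor v = 1:numViews
    % each iteration maps only its own window of the file
    m = memmapfile(fname, 'Offset',(v-1)*numRecPerView*lenRec, ...
        'Repeat',numRecPerView, 'Format',fmt, 'Writable',false);
    rec = m.Data;
    partial(v) = sum([rec.d]);      % placeholder for the real processing
end
total = sum(partial);
disp(total)
delete(fname);
```

Only the per-window results travel back from the workers, which matches the requirement in the question that the lines themselves are not needed after processing. Note that this relies on fixed-length binary records; a text file with variable-length lines cannot be split by byte offset this simply.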