Haskell: avoiding stack overflow in folds without sacrificing performance
The following piece of code experiences a stack overflow on large inputs:
{-# LANGUAGE DeriveDataTypeable, OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as L

genTweets :: L.ByteString -> L.ByteString
genTweets text | L.null text = ""
               | otherwise   = L.intercalate "\n\n" $ genTweets' $ L.words text
  where genTweets' txt = foldr p [] txt
          where p word [] = [word]
                p word words@(w:ws)
                  | L.length word + L.length w <= 139 = (word `L.append` " " `L.append` w):ws
                  | otherwise                         = word:words
I assume my predicate is building up a list of thunks, but I'm not sure why, or how to fix it.
The equivalent code using foldl' runs fine, but it takes forever, since it appends constantly, and it uses a ton of memory.
import Data.List (foldl')

genTweetsStrict :: L.ByteString -> L.ByteString
genTweetsStrict text | L.null text = ""
                     | otherwise   = L.intercalate "\n\n" $ genTweetsStrict' $ L.words text
  where genTweetsStrict' txt = foldl' p [] txt
          where p [] word = [word]
                p words word
                  | L.length word + L.length (last words) <= 139 =
                      init words ++ [last words `L.append` " " `L.append` word]
                  | otherwise = words ++ [word]
What is causing the first snippet to build up thunks, and how can it be avoided? Is it possible to write the second snippet so that it doesn't rely on (++)?
L.length word + L.length (last words) <= 139
This is the problem. On each iteration, you're traversing the accumulator list, and then
init words ++ [last words `L.append` " " `L.append` word]
appending at the end. This is going to take a long time (proportional to the length of the accumulator list), so the total work grows quadratically with the number of words. A better solution is to generate the output list lazily, interleaving processing with reading the input stream (you don't need to read the whole input before you can output the first 140-character tweet).
The following version of the program processes a relatively large file (/usr/share/dict/words) in under 1 second, while using O(1) space:
{-# LANGUAGE OverloadedStrings, BangPatterns #-}

module Main where

import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

genTweets :: L.ByteString -> L.ByteString
genTweets text | L.null text = ""
               | otherwise   = L.intercalate "\n\n" $ toTweets $ L.words text

-- Concatenate words into 139-character tweets.
toTweets :: [L.ByteString] -> [L.ByteString]
toTweets []     = []
toTweets [w]    = [w]
toTweets (w:ws) = go (L.length w, w) ws
  where
    -- Main loop. Notice how the output tweet (cur_str) is generated as soon as
    -- possible, enabling L.writeFile to consume it before the whole input is
    -- processed.
    go :: (Int64, L.ByteString) -> [L.ByteString] -> [L.ByteString]
    go (_cur_len, !cur_str) []     = [cur_str]
    go (!cur_len, !cur_str) (w:ws)
      | lw + cur_len <= 139 = go (cur_len + lw + 1, cur_str `L.append` " " `L.append` w) ws
      | otherwise           = cur_str : go (lw, w) ws
      where
        lw = L.length w

-- Notice the use of lazy I/O.
main :: IO ()
main = do
  dict <- L.readFile "/usr/share/dict/words"
  L.writeFile "tweets" (genTweets dict)
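As for the second question (avoiding (++)), here is a minimal sketch, not from the original answer: it keeps foldl' but carries the tweet under construction separately and collects finished tweets in reverse, so each word costs O(1) list work instead of traversing the accumulator with last, init, and (++). The name genTweetsAcc is made up for illustration, and unlike the lazy version above this still has to consume the whole input before producing any output.

{-# LANGUAGE OverloadedStrings #-}

import Data.List (foldl')
import qualified Data.ByteString.Lazy.Char8 as L

-- Hypothetical alternative (a sketch, not the answer's code): foldl' without (++).
genTweetsAcc :: L.ByteString -> L.ByteString
genTweetsAcc text
  | L.null text = ""
  | otherwise   = L.intercalate "\n\n" $ finish $ foldl' step (Nothing, []) (L.words text)
  where
    -- Accumulator: (tweet currently being built, finished tweets in reverse order).
    step (Nothing, done) word = (Just word, done)
    step (Just cur, done) word
      | L.length cur + L.length word <= 139 =
          (Just (cur `L.append` " " `L.append` word), done)
      | otherwise = (Just word, cur : done)

    -- Flush the last tweet and restore the original order with a single reverse.
    finish (Nothing, done)  = reverse done
    finish (Just cur, done) = reverse (cur : done)

The single reverse at the end is O(n) overall, versus the quadratic work done by repeatedly calling last, init, and (++) on the growing accumulator.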