With 10.07 we fixed bug 588288, which allowed us to set the maximum number of lines of each log file that will be parsed.
Initially we'd thought this would help us solve the memory issue when running the parser against the backlog of PPA access logs, but after trialling with logparser_max_parsed_lines set to 100, the PPA log parser still had to be killed as it consumed too much memory.
Going back to the code, there are a number of other improvements that could be made. One that stands out is that currently *all* log files with new lines to parse are opened during get_files_to_parse() and returned in a dict. This means that when we run the parser against the backlog of PPA access files, over 2600 files are opened at once. It would be better to use a generator instead, and to limit the number of lines processed across all files, rather than per file. A rough sketch follows below.
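
A minimal sketch of what that could look like, assuming a hypothetical parsed_lines_by_file mapping of file name to the byte offset already parsed (standing in for whatever bookkeeping the real parser does):

    import os

    def get_files_to_parse(log_directory, parsed_lines_by_file):
        """Lazily yield open file objects, one at a time.

        Files are opened only when the consumer asks for the next one,
        instead of all ~2600 at once.
        """
        for name in sorted(os.listdir(log_directory)):
            path = os.path.join(log_directory, name)
            position = parsed_lines_by_file.get(name, 0)
            if position >= os.path.getsize(path):
                continue  # Nothing new to parse in this file.
            fd = open(path)
            try:
                fd.seek(position)
                yield fd
            finally:
                fd.close()

    def parse_logs(files_to_parse, max_parsed_lines):
        """Consume the generator, enforcing a global line budget."""
        parsed = 0
        for fd in files_to_parse:
            for line in fd:
                # ... parse and record the hit here ...
                parsed += 1
                if parsed >= max_parsed_lines:
                    return  # Budget applies across all files, not per file.

Seeking before yielding keeps the consumer simple, and the finally block ensures each file is closed as soon as the consumer moves on (or bails out early).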
Note: there is also a comment related to the librarian logfile parser in the docstring at:

cronscripts/parse-librarian-apache-access-logs.py

which, applying it to the PPA log file parser, implies that we could additionally update the script to clear the Storm cache (store._cache.clear()) at some regular interval (such as after each file is processed). This will reduce the benefit of the cache of course, but will limit the amount of RAM Storm consumes during the process.
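
As a hedged sketch of that interval-based clearing, assuming store is the Storm store the parser already uses and parse_file is a hypothetical per-file parsing helper:

    def parse_files(files_to_parse, store, parse_file):
        """Parse each file, clearing Storm's alive-object cache between files."""
        for fd in files_to_parse:
            parse_file(fd)
            store.commit()        # Flush pending work before dropping the cache.
            store._cache.clear()  # Bound Storm's RAM use at the cost of cache hits.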