Lazy Parse A Stateful, Multiline Per Record Data Stream In Python?
Here's how one file looks:
BEGIN_META
stuff to discard
END_META
BEGIN_DB
header to discard

data I wish to extract
END_DB
I'd like to be able
Solution 1:
Something like this might work:
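import itertools

def chunks(it):
    it = iter(it)
    while True:
        # Skip ahead to the next BEGIN_DB line.
        rest = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        # Skip the header, which ends at the first blank line.
        rest = itertools.dropwhile(lambda x: x.strip(), rest)
        # Consume the blank line; stop cleanly when the input runs out.
        if next(rest, None) is None:
            return
        # Lazily yield the data lines up to (not including) END_DB.
        yield itertools.takewhile(lambda x: 'END_DB' not in x, rest)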
For example (note the blank line that separates each header from its data):
src = """
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

1data I
1wish to
1extract
END_DB
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

2data I
2wish to
2extract
END_DB
"""
src = iter(src.splitlines())
for chunk in chunks(src):
    for line in chunk:
        print(line.strip())
    print()
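Note that each chunk is a lazy view over the shared iterator, so it has to be consumed before the next chunk is requested. Where that is inconvenient, a plain generator that collects each record into a list is one alternative. The following is only a sketch: the name "records" is hypothetical, and it assumes, like the code above, that a blank line separates the header from the data and that a single record fits in memory.

```python
def records(lines):
    """Yield one list of data lines per BEGIN_DB/END_DB block."""
    lines = iter(lines)
    for line in lines:
        if 'BEGIN_DB' not in line:
            continue  # still inside META or between blocks
        # Skip the header: it ends at the first blank line (assumption).
        for line in lines:
            if not line.strip():
                break
        # Collect data lines until END_DB (or until the input ends).
        data = []
        for line in lines:
            if 'END_DB' in line:
                break
            data.append(line.strip())
        yield data


sample = """\
BEGIN_META
stuff to discard
END_META
BEGIN_DB
header to discard

data I
wish to extract
END_DB
"""

for record in records(sample.splitlines()):
    print(record)
```

The generator still reads its input one line at a time, so only the current record is held in memory, and the records can be stored or consumed in any order.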
Solution 2:
You can split the logic into small functions, which makes the control flow easier to follow and the code more modular and flexible. Try to avoid tracking the parser's position in a string variable such as
state = "some string"
If you later extend the module, you have to remember which values the variable "state" can take and what happens when it changes. You are not guaranteed to remember this, which can set you up for some hassles. Writing functions that mimic this behavior is cleaner and easier to implement.
import sys

def read_stdin():
    # Yield lines from standard input one at a time.
    for line in sys.stdin:
        yield line

def search_line_for_start_db(line):
    # A BEGIN_DB marker starts a new record.
    if "BEGIN_DB" in line:
        search_db_for_info()

def search_db_for_info():
    # Skip the header, which ends at the first blank line.
    for line in read_line:
        if not line.strip():
            break
    # Collect data lines until the END_DB marker.
    for line in read_line:
        if "END_DB" in line:
            break
        # Put your information somewhere
        raw_tables.append(line)

read_line = read_stdin()
raw_tables = []
# The loop ends once the stdin stream has been read completely.
for line in read_line:
    search_line_for_start_db(line)
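The same idea can be taken a step further by passing the line iterator into each function instead of relying on module-level variables, which makes the pieces testable without real standard input. This is a sketch rather than the answer's own code: the function names (skip_header, read_data, parse) are hypothetical, io.StringIO stands in for sys.stdin, and a blank line is assumed to end the header.

```python
import io


def skip_header(lines):
    """Advance the iterator past the header (ends at the first blank line)."""
    for line in lines:
        if not line.strip():
            return


def read_data(lines):
    """Collect data lines up to, but not including, the END_DB marker."""
    data = []
    for line in lines:
        if "END_DB" in line:
            break
        data.append(line.rstrip("\n"))
    return data


def parse(stream):
    """Return one list of data lines per BEGIN_DB/END_DB block in stream."""
    lines = iter(stream)
    tables = []
    for line in lines:
        if "BEGIN_DB" in line:
            skip_header(lines)
            tables.append(read_data(lines))
    return tables


# io.StringIO is a stand-in for sys.stdin in this illustration.
fake_stdin = io.StringIO(
    "BEGIN_META\nstuff\nEND_META\n"
    "BEGIN_DB\nheader\n\ndata I\nwish to extract\nEND_DB\n"
)
tables = parse(fake_stdin)
print(tables)
```

Each function corresponds to one phase of the parse, so the current "state" is simply whichever function is running, and no state variable is needed. Calling parse(sys.stdin) would run the same logic on real input.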