
Lazy Parse A Stateful, Multiline Per Record Data Stream In Python?

Here's how one file looks:

BEGIN_META
stuff to discard
END_META
BEGIN_DB
header to discard
data I wish to extract
END_DB

I'd like to be able

Solution 1:

Something like this might work:

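import itertools

def chunks(it):
    it = iter(it)
    while True:
        # Skip ahead to the next BEGIN_DB line.
        body = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        # Skip the header, which ends at the first blank line.
        body = itertools.dropwhile(lambda x: x.strip(), body)
        try:
            next(body)  # consume the blank line itself
        except StopIteration:
            return  # input exhausted (an uncaught StopIteration here is a RuntimeError on Python 3.7+)
        # Lazily yield this record's lines up to (not including) END_DB.
        yield itertools.takewhile(lambda x: 'END_DB' not in x, body)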

For example:

src = """
BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    1data I
    1wish to
    1extract
 END_DB


BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    2data I
    2wish to
    2extract
 END_DB
"""


src = iter(src.splitlines())
for chunk in chunks(src):
    for line in chunk:
        print(line.strip())
    print()
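
One thing to watch with this approach: every chunk is a lazy view onto the same underlying iterator, so a chunk should be fully consumed before you ask for the next one. Here is a minimal sketch of a wrapper that hands back one complete record at a time. It assumes the same layout as the example (header separated from the data by a blank line), and the name `records` is made up for illustration:

```python
import itertools

def chunks(lines):
    # Same idea as the answer: skip to BEGIN_DB, skip the header up to the
    # first blank line, then lazily yield the lines before END_DB.
    lines = iter(lines)
    while True:
        body = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, lines)
        body = itertools.dropwhile(lambda x: x.strip(), body)
        try:
            next(body)  # consume the blank line that ends the header
        except StopIteration:
            return  # no more records
        yield itertools.takewhile(lambda x: 'END_DB' not in x, body)

def records(lines):
    # Hypothetical helper: materialize each record as a list of stripped
    # lines, so only one record is held in memory at a time and callers
    # cannot interleave two chunks by accident.
    for chunk in chunks(lines):
        yield [line.strip() for line in chunk]

sample = """BEGIN_META
    stuff
END_META
BEGIN_DB
    header

    1data I
    1wish to
 END_DB
BEGIN_DB
    header

    2data I
 END_DB
"""

result = list(records(sample.splitlines()))
print(result)
```

The input is still read lazily; for a real file you could pass the open file object instead of `sample.splitlines()`.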

Solution 2:

Splitting the logic into small functions makes the parsing easier to follow and the code more modular and flexible. Try to stay away from tracking where you are in the file with something like

state = "some string"

The problem is that if you later want to add something to this module, you need to know which values the variable "state" can take and what happens when it changes. You're not guaranteed to remember this, and that can set you up for some hassles. Writing functions that mimic this behavior is cleaner and easier to implement.

import sys

def read_stdin():
    # Lazily yield lines from standard input.
    for line in sys.stdin:
        yield line

def search_line_for_start_db(line):
    # A BEGIN_DB marker hands control to the record reader.
    if "BEGIN_DB" in line:
        search_db_for_info()

def search_db_for_info():
    # Consume lines up to END_DB, keeping the non-blank ones.
    # As written this keeps the header lines as well.
    for new_line in read_line:
        if "END_DB" in new_line:
            break
        if new_line.strip():
            # Put your information somewhere
            raw_tables.append(new_line)

read_line = read_stdin()
raw_tables = []
while True:
    try:
        search_line_for_start_db(next(read_line))
    except StopIteration:  # Your stdin stream has finished being read
        break  # end your program
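
A variation on the same idea, as a rough sketch rather than a drop-in replacement: a single generator can hold the "state" implicitly in where the code is currently looping, so there is no state string and no global list. This assumes the header ends at the first blank line (that detail comes from the example data in Solution 1, not from the question itself), and `iter_records` is a made-up name:

```python
def iter_records(lines):
    # All three loops share one iterator, so each loop picks up where the
    # previous one stopped reading.
    lines = iter(lines)
    for line in lines:
        if "BEGIN_DB" not in line:
            continue  # outside a DB block: META content, blank lines, etc.
        # Inside a DB block: skip the header (assumed to end at a blank line).
        for line in lines:
            if not line.strip():
                break
        # Collect the data lines until END_DB.
        record = []
        for line in lines:
            if "END_DB" in line:
                break
            record.append(line.strip())
        yield record

sample_lines = [
    "BEGIN_META", "stuff", "END_META",
    "BEGIN_DB", "    header", "", "    a", "    b", "END_DB",
    "",
    "BEGIN_DB", "    h", "", "    c", "END_DB",
]
parsed = list(iter_records(sample_lines))
print(parsed)
```

To read from standard input you would call `iter_records(sys.stdin)` and loop over the result; records are produced one at a time as the stream is read. Note that a block cut off by end of input still yields whatever was collected so far.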
