
Build A Simple Parser That Is Able To Parse Different Date Formats Using Pyparse

I am building a simple parser that takes a query like the following: 'show fizi commits from 1/1/2010 to 11/2/2006' So far I have: class QueryParser(object): def parser(self, stmn

Solution 1:

A simple approach is to require that dates be quoted. Here is a rough example; you may need to adjust it to fit your current grammar:

from pyparsing import CaselessKeyword, quotedString, removeQuotes
from dateutil.parser import parse as parse_date

dp = (
    CaselessKeyword('from') + quotedString.setParseAction(removeQuotes)('from') +
    CaselessKeyword('to') + quotedString.setParseAction(removeQuotes)('to')
)

res = dp.parseString('from "jan 20" to "apr 5"')
from_date = parse_date(res['from'])
to_date = parse_date(res['to'])
# No year is given, so dateutil fills it in from the current date; run in 2015 this gives:
# from_date, to_date == (datetime.datetime(2015, 1, 20, 0, 0), datetime.datetime(2015, 4, 5, 0, 0))

Solution 2:

I suggest using something like sqlparse that already handles all the weird edge cases for you. It might be a better option in the long term, if you have to deal with more advanced cases.

EDIT: Why not just parse the date blocks as strings? Like so:

from pyparsing import CaselessKeyword, Word, Combine, Optional, alphas, nums

class QueryParser(object):

    def parser(self, stmnt):

        keywords = ["select", "from", "to", "show", "commits", "where",
                    "groupby", "order by", "and", "or"]

        [select, _from, _to, show, commits, where, groupby, orderby, _and, _or]\
            = [CaselessKeyword(word) for word in keywords]

        user = Word(alphas + ".")
        user2 = Combine(user + "'s")

        startdate = Word(alphas + nums + "/")
        enddate = Word(alphas + nums + "/")

        bnf = (
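            # Try user2 first: | takes the first alternative that matches, so
            # with user first the "'s" form could never be matched.
            (show | select) + (user2 | user).setResultsName("user") +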
            (commits).setResultsName("stats") +
            Optional(
                _from + startdate.setResultsName("start") +
                _to + enddate.setResultsName("end"))
            )

        a = bnf.parseString(stmnt)
        return a

This gives me something like:

In [3]: q.parser("show fizi commits from 1/1/2010 to 11/2/2006")
Out[3]: (['show', 'fizi', 'commits', 'from', '1/1/2010', 'to', '11/2/2006'], {'start': [('1/1/2010', 4)], 'end': [('11/2/2006', 6)], 'stats': [('commits', 2)], 'user': [('fizi', 1)]})

Then you can use libraries like delorean or arrow that try to deal intelligently with the date part - or just use regular old dateutil.
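
As a rough sketch of that last step, assuming the start and end values have already been pulled out of the parse results as plain strings (copied here from the output above), dateutil could convert them as follows. The variable names are illustrative only. Note that dateutil reads an ambiguous string such as 11/2/2006 month-first unless dayfirst=True is passed.

```python
from dateutil.parser import parse as parse_date

# Strings as they appear under 'start' and 'end' in the parse results above.
start_text = "1/1/2010"
end_text = "11/2/2006"

start = parse_date(start_text)

# Default interpretation is month-first: 2 November 2006.
end_month_first = parse_date(end_text)

# dayfirst=True switches to day-first: 11 February 2006.
end_day_first = parse_date(end_text, dayfirst=True)
```

Whichever convention you pick, it is probably best to apply it to both ends of the range.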


Solution 3:

You can make the pyparsing parser very lenient in what it matches, and then have a parse action do the more rigorous value checking. This is especially easy if your date strings are all non-whitespace characters.

For example, say we wanted to parse a month name, but for some reason did not want our parser expression to just use oneOf('January February March ...etc.'). We could put in a placeholder that parses a Word, a group of characters running up to the next ineligible character (whitespace or punctuation).

monthName = Word(alphas.upper(), alphas.lower())

So here our month starts with a capitalized letter, followed by 0 or more lowercase letters. Obviously this will match many non-month names, so we will add a parse action to do additional validation:

def validate_month(tokens):
    import calendar
    monthname = tokens[0]
    print "check if %s is a valid month name" % monthname
    if monthname not in calendar.month_name:
        raise ParseException(monthname + " is not a valid month name")

monthName.setParseAction(validate_month)

If we run these two statements:

print monthName.parseString("January")
print monthName.parseString("Foo")

we get

check if January is a valid month name
['January']
check if Foo is a valid month name
Traceback (most recent call last):
  File "dd.py", line 15, in <module>
    print monthName.parseString("Foo")
  File "c:\python27\lib\site-packages\pyparsing.py", line 1125, in parseString
    raise exc
pyparsing.ParseException: Foo is not a valid month name (at char 0), (line:1, col:1)

(Once you are done testing, you can remove the print statement from the middle of the parse action - I just included it to show that it was being called during the parsing process.)

If you can get away with a space-delimited date format, then you could write your parser as:

date = Word(nums,nums+'/-')

and then you could accept 1/1/2001, 29-10-1929 and so forth. Again, this will also match strings like 32237--/234//234/7, which is obviously not a valid date, so you could write a validating parse action to check the string's validity. In the parse action, you could implement your own validating logic, or call out to an external library. (Be wary of dates like '4/3/2013' if you are tolerant of different locales: conventions vary between month-first and day-first, so this string could mean either April 3rd or March 4th.) You can also have the parse action do the actual conversion for you, so that when you process the parsed tokens, the date will already be a Python datetime rather than a string.
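
A minimal sketch of such a validating and converting parse action might look like this. The DATE_FORMATS list and the to_datetime name are assumptions made for illustration, not part of pyparsing; the formats are tried in order, so an ambiguous string like 4/3/2013 is read month-first here.

```python
from datetime import datetime

from pyparsing import ParseException, Word, nums

# Assumed set of accepted formats, tried in order (month-first wins ties).
DATE_FORMATS = ("%m/%d/%Y", "%d/%m/%Y", "%d-%m-%Y")


def to_datetime(s, loc, tokens):
    """Parse action: swap the matched string for a datetime, or fail the parse."""
    text = tokens[0]
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # not this format, try the next one
    raise ParseException(s, loc, f"{text!r} is not a valid date")


# Lenient expression; the parse action does the strict checking.
date = Word(nums, nums + "/-").setParseAction(to_datetime)

parsed = date.parseString("29-10-1929")[0]  # datetime(1929, 10, 29, 0, 0)
```

With this in place, a string such as 32237--/234//234/7 still matches the Word expression but is rejected by the parse action with a ParseException.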

