Pyparsing: Extract Variable Length, Variable Content, Variable Whitespace Substring
I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always have the word Gleason and two numbers that add up to another numb
Solution 1:
Here is a sample to pull out the patient data and any matching Gleason data.
from pyparsing import Combine, Group, Optional, Word, nums

num = Word(nums)
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
# Comparing a string to a parser with == checks that the parser matches it
assert 'GLEASON 5+4=9' == gleason
assert 'GLEASON SCORE: 3 + 3 = 6' == gleason
patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
assert '01/02/11 S11-4444 20/111-22-3333' == patientData
partMatch = patientData("patientData") | gleason("gleason")

# data is the full text of the flat file, read in beforehand
lastPatientData = None
for match in partMatch.searchString(data):
    if match.patientData:
        # remember the most recent patient header
        lastPatientData = match
    elif match.gleason:
        if lastPatientData is None:
            # a Gleason score appeared before any patient header
            print("bad!")
            continue
        print("{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
            lastPatientData.patientData, match.gleason
        ))
Prints:
01/01/11: S11-55555 20/444-55-6666 Gleason(5+4=9)
01/02/11: S11-4444 20/111-22-3333 Gleason(3+3=6)
Solution 2:
Take a look at the SkipTo parse element in pyparsing. If you define a pyparsing expression for the num+num=num part, you should be able to use SkipTo to skip anything between "Gleason" and that. Roughly like this (untested pseudo-pyparsing, with num defined as Word(nums)):
score = num + "+" + num + "=" + num
Gleason = "Gleason" + SkipTo(score) + score
PyParsing by default skips whitespace anyway, and with SkipTo you can skip anything that doesn't match your desired format.
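A minimal runnable sketch of the SkipTo idea follows. The sample string is hypothetical (it is not taken from the asker's file), and the use of CaselessKeyword and Suppress is my own choice for illustration, not part of the answer above.

```python
from pyparsing import CaselessKeyword, SkipTo, Suppress, Word, nums

num = Word(nums)

# The "a + b = c" part; results names make the pieces easy to read back.
score = num("left") + Suppress("+") + num("right") + Suppress("=") + num("total")

# Caseless keyword so "Gleason", "GLEASON" and "gleason" all match.
# SkipTo(score) swallows whatever sits between the keyword and the score.
gleason = CaselessKeyword("gleason") + Suppress(SkipTo(score)) + score

# Hypothetical sample text, not from the original data file.
sample = "Adenocarcinoma, GLEASON SCORE:  3 + 4 = 7, involving both lobes."

found = [(r.left, r.right, r.total) for r in gleason.searchString(sample)]
print(found)
```

One caveat with this approach: an unbounded SkipTo keeps scanning until it finds a score, so if a write-up mentions "Gleason" without a score, the match may run on into a later record. It is worth checking the results against the patient headers.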
Solution 3:
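import re

# "+" must be escaped, \d+ allows totals like 10, and "score:" is optional
gleason = re.compile(r"gleason(?:score:)?(\d+\+\d+=\d+)")
scores = set()
for record in records:
    for line in record.lower().split("\n"):
        if "gleason" in line:
            # search, not match: the word need not start the line
            m = gleason.search(line.replace(" ", ""))
            if m:
                scores.add(m.group(1))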
Or something
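Since the question says the two numbers always add up to the third, a regex approach can also check that sum. Here is a small sketch under that assumption; the pattern, the function name find_gleason_scores, and the sample text are all hypothetical and not from the original post.

```python
import re

# Hypothetical pattern: tolerates arbitrary whitespace and an optional
# "SCORE:" label between the word and the numbers.
GLEASON_RE = re.compile(
    r"gleason\s*(?:score\s*:?)?\s*(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)",
    re.IGNORECASE,
)

def find_gleason_scores(text):
    """Return (left, right, total) tuples whose parts actually add up."""
    results = []
    for m in GLEASON_RE.finditer(text):
        left, right, total = (int(g) for g in m.groups())
        if left + right == total:  # drop matches whose arithmetic is off
            results.append((left, right, total))
    return results

# Hypothetical sample text, not from the original flat file.
sample = "GLEASON 5+4=9 ... Gleason score: 3 + 3 = 6 ... GLEASON 3+4=8"
print(find_gleason_scores(sample))
```

Using \s* in the pattern avoids stripping spaces from the line first, so the matched text can still be located in the original record if needed.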