How To Filter Specific Rows From A Huge CSV File Using A Python Script
Is there an efficient way in Python to load only specific rows from a huge CSV file into memory (for further processing) without straining memory? E.g.: Let's say I want to
Solution 1:
import csv
filter_countries = {'US': 1}
with open('data.tsv', newline='') as f_name:
    for line in csv.DictReader(f_name, delimiter='\t'):
        if line['country'] not in filter_countries:
            print(line)  # this row's country is outside the filter set
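Note that as written this prints the rows whose country is not in the filter set, i.e. it filters the listed countries out. If you instead want to keep only the listed countries, invert the membership test; a minimal variation on the same hypothetical data.tsv and country column:
import csv
filter_countries = {'US': 1}
with open('data.tsv', newline='') as f_name:
    for line in csv.DictReader(f_name, delimiter='\t'):
        if line['country'] in filter_countries:
            print(line)  # keep only rows for the listed countries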
Solution 2:
You still need to read every row in the file in order to check your condition. However, you don't need to load the whole file into memory, so you can stream it as follows:
import csv
with open('huge.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    for row in spamreader:
        if row[0] != '2015/03/01':
            continue  # skip rows whose date does not match
        # Process the matching row here
If you just need a list of the matched rows, it's faster and even simpler to use a list comprehension, as follows:
import csv
with open('huge.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    rows = [row for row in spamreader if row[0] == '2015/03/01']
Solution 3:
If the matching dates can appear anywhere in the file, you will have to parse the whole file:
import csv
def get_rows(k, fle):
    with open(fle, newline='') as f:
        next(f)  # skip the header line
        for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
            if row[0] == k:
                yield row  # yield matching rows lazily

for row in get_rows("2015/03/02", "in.txt"):
    print(row)
You could use the multiprocessing module to speed up the parsing by splitting the data into chunks. There are some ideas here.
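One way the multiprocessing idea could look, as a minimal sketch: split the raw lines into chunks, parse each chunk in a worker process, and collect the matching rows. It assumes the same space-delimited in.txt with a header row as in the generator above; the file name, target date, and chunk size are illustrative, not from the original answer.
import csv
from multiprocessing import Pool

TARGET = '2015/03/02'  # illustrative date key, as in the examples above

def match_chunk(lines):
    # Parse one chunk of raw lines and keep rows whose first column matches.
    reader = csv.reader(lines, delimiter=' ', skipinitialspace=True)
    return [row for row in reader if row and row[0] == TARGET]

def chunks(f, size=100000):
    # Yield lists of `size` raw lines from an open file object.
    chunk = []
    for line in f:
        chunk.append(line)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

if __name__ == '__main__':
    with open('in.txt') as f:
        next(f)  # skip the header line
        with Pool() as pool:
            # imap_unordered streams chunks to workers, so the whole
            # file is never held in memory at once
            matched = [row
                       for part in pool.imap_unordered(match_chunk, chunks(f))
                       for row in part]
        print(len(matched))
Whether this is actually faster depends on the cost of parsing versus the cost of shipping lines to the worker processes; for a simple equality test, the single-process generator above may well win.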