
How To Filter Specific Rows From A Huge Csv File Using Python Script

Is there an efficient way in Python to load only specific rows from a huge CSV file into memory (for further processing) without burdening memory? E.g., let's say I want to…

Solution 1:

import csv

filter_countries = {'US': 1}

with open('data.tsv', 'r', newline='') as f_name:
    # DictReader streams the file one row at a time
    for line in csv.DictReader(f_name, delimiter='\t'):
        # print every row whose country is NOT in the filter
        if line['country'] not in filter_countries:
            print(line)
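Note that csv.DictReader reads the file one row at a time, so only the current line is ever held in memory. The dict here is used purely for fast membership tests; a plain set such as {'US'} would read more naturally and work the same way.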

Solution 2:

You still need to scan every row in the file in order to check your condition. However, there is no need to load the whole file into memory, so you can stream it as follows:

import csv
with open('huge.csv', 'r', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    for row in spamreader:
        # skip everything except the date we are looking for
        if row[0] != '2015/03/01':
            continue

        # Process the matching row here

If you just need a list of the matching rows, it's faster and even simpler to use a list comprehension, as follows:

import csv
with open('huge.csv', 'r', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    rows = [row for row in spamreader if row[0] == '2015/03/01']
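Bear in mind that the comprehension materializes every matching row in memory at once; if the matches themselves are numerous, stick with the streaming loop above.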

Solution 3:

If the dates can appear anywhere, you will have to parse the whole file:

import csv

def get_rows(k, fle):
    with open(fle, newline='') as f:
        next(f)  # skip the header line
        for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
            if row[0] == k:
                yield row


for row in get_rows("2015/03/02", "in.txt"):
    print(row)
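Since get_rows is a generator, matches are produced one at a time, so memory use stays flat however large the input is. For example, you could stream the matches straight into a new file (out.txt here is just a hypothetical destination):

import csv

with open("out.txt", "w", newline="") as out:
    writer = csv.writer(out, delimiter=" ")
    # writerows consumes the generator lazily, row by row
    writer.writerows(get_rows("2015/03/02", "in.txt"))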

You could use multiprocessing to speed up the parsing by splitting the data into chunks, as sketched below.
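A minimal sketch of that idea, assuming the same space-delimited format and date column as in Solution 2; the chunk size of 100,000 lines is an arbitrary starting point:

import csv
import itertools
from multiprocessing import Pool

def filter_chunk(lines):
    # Parse one chunk of raw lines and keep only the matching rows.
    reader = csv.reader(lines, delimiter=' ', quotechar='"')
    return [row for row in reader if row and row[0] == '2015/03/01']

def chunked(f, size):
    # Yield successive lists of `size` raw lines from the open file.
    while True:
        chunk = list(itertools.islice(f, size))
        if not chunk:
            return
        yield chunk

if __name__ == '__main__':
    with open('huge.csv', newline='') as f, Pool() as pool:
        matched = []
        # imap keeps only a few chunks in flight at a time,
        # so memory use stays bounded
        for rows in pool.imap(filter_chunk, chunked(f, 100000)):
            matched.extend(rows)
    print(len(matched))

Note that each chunk of text is pickled over to a worker process, so the speedup only pays off when per-row parsing, rather than disk I/O, is the bottleneck.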
