
How Do I Read A Gzipped Parquet File From S3 Into Python Using Boto3?

I have a file called data.parquet.gzip in my S3 bucket, and I can't figure out what the problem is in reading it. Normally I've worked with StringIO, but I don't know how to fix it. I wa

Solution 1:

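The solution is quite straightforward: fetch the object with boto3, read its body into bytes, and hand those bytes to pandas through a BytesIO buffer.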

import boto3  # For read+push to S3 bucket
import pandas as pd  # Reading parquets
from io import BytesIO  # Converting bytes to bytes input file
import pyarrow  # Fast reading of parquets

# Set up your S3 client
# Ideally your Access Key and Secret Access Key are stored in a file already
# So you don't have to specify these parameters explicitly.
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_HERE,
                  aws_secret_access_key=SECRET_ACCESS_KEY_HERE)

# Fetch the object from S3; the response holds metadata and a streaming body
s3_response_object = s3.get_object(Bucket=BUCKET_NAME_HERE, Key=KEY_TO_GZIPPED_PARQUET_HERE)

# Read your file, i.e. convert it from a stream to bytes using .read()
object_content = s3_response_object['Body'].read()

# Wrap the bytes in BytesIO so pandas can treat them as a file
df = pd.read_parquet(BytesIO(object_content))
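
A note on the file name: a file such as data.parquet.gzip is often written with gzip as Parquet's internal compression codec (for example via pandas' to_parquet(..., compression='gzip')), in which case pd.read_parquet decompresses it transparently and no separate gzip step is needed. If the whole file was instead gzipped after writing, it has to be unwrapped first. The sketch below is one way to tell the two cases apart by inspecting magic bytes. The helper name to_parquet_buffer is hypothetical and not part of boto3 or pandas.

```python
import gzip
import io

PARQUET_MAGIC = b"PAR1"    # Parquet files begin and end with these 4 bytes
GZIP_MAGIC = b"\x1f\x8b"   # gzip streams begin with these 2 bytes


def to_parquet_buffer(raw: bytes) -> io.BytesIO:
    """Return a BytesIO of Parquet bytes, gunzipping first if needed.

    Hypothetical helper for illustration only.
    """
    if raw[:2] == GZIP_MAGIC:
        # The whole file was gzipped after it was written: unwrap it.
        raw = gzip.decompress(raw)
    if len(raw) < 8 or raw[:4] != PARQUET_MAGIC or raw[-4:] != PARQUET_MAGIC:
        raise ValueError("bytes do not look like a Parquet file")
    return io.BytesIO(raw)
```

With the snippet above, this could be used as pd.read_parquet(to_parquet_buffer(s3_response_object['Body'].read())).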

Solution 2:

If you are using an IDE on your laptop/PC to connect to AWS S3, you can refer to Corey's first solution:

import boto3
import pandas as pd
import io

s3 = boto3.resource(service_name='s3', region_name='XXXX',
                    aws_access_key_id='YYYY', aws_secret_access_key='ZZZZ')
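buffer = io.BytesIO()
# Named s3_object to avoid shadowing Python's built-in `object`
s3_object = s3.Object(bucket_name='bucket_name', key='path/to/your/file.parquet')
s3_object.download_fileobj(buffer)
buffer.seek(0)  # Rewind so the reader starts at the beginning of the data
df = pd.read_parquet(buffer)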
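
After download_fileobj writes into the buffer, the stream position sits at the end of the data. Some readers seek on their own, while others read from the current position and would see nothing, so calling buffer.seek(0) before parsing is a cheap safeguard. A minimal stdlib-only sketch of that behaviour, where the write call and payload are stand-ins for the real download:

```python
import io

payload = b"PAR1...payload...PAR1"  # placeholder bytes, not a real Parquet file

buffer = io.BytesIO()
buffer.write(payload)       # stand-in for download_fileobj(buffer)

at_end = buffer.read()      # position is at the end, so nothing comes back
buffer.seek(0)              # rewind to the start of the buffer
from_start = buffer.read()  # now the full payload is returned
```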

If you are using a Glue job, you can refer to Corey's second solution in the Glue script (pandas resolves s3:// paths through the s3fs package, so it needs to be available to the job):

df = pd.read_parquet(path='s3://bucket_name/path/to/your/file.parquet')

In case you want to read a .json file (using an IDE on your laptop/PC):

json_text = s3.Object(bucket_name='bucket_name',
                      key='path/to/your/file.json').get()['Body'].read().decode('utf-8')
# Wrap the text in StringIO; recent pandas versions deprecate passing raw JSON strings
df = pd.read_json(io.StringIO(json_text), lines=True)
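
The lines=True argument tells pandas to treat each line of the input as a separate JSON document (the JSON Lines layout). As a rough illustration of what that format looks like, here is a stdlib-only sketch that parses such text by hand; the sample records are made up for the example.

```python
import json

# Hypothetical JSON Lines text: one complete JSON object per line
text = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# Parse each non-empty line on its own, which is what lines=True implies
records = [json.loads(line) for line in text.splitlines() if line.strip()]
```

If the file is instead a single JSON array or object, lines=True would not apply and should be dropped.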
