How to iterate over files in an S3 bucket?

I have a large number of files (> 1000) stored in an S3 bucket, and I would like to iterate over them (for example in a for loop) to extract data from them using boto3.

However, I notice that according to http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects the list_objects() method of the Client class only lists up to 1000 objects:

In [1]: import boto3

In [2]: client = boto3.client('s3')

In [11]: apks = client.list_objects(Bucket='iper-apks')

In [16]: type(apks['Contents'])
Out[16]: list

In [17]: len(apks['Contents'])
Out[17]: 1000

      

However, I would like to list all objects, even if there are more than 1000 of them. How can I achieve this?



2 answers


As kurt-peek notes, boto3 has a Paginator class that lets you iterate over pages of S3 objects, and it can easily be wrapped to provide an iterator over the items within those pages:

import boto3


def iterate_bucket_items(bucket):
    """
    Generator that iterates over all objects in a given s3 bucket

    See http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2 
    for return data format
    :param bucket: name of s3 bucket
    :return: dict of metadata for an object
    """
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket)

    for page in page_iterator:
        # 'Contents' is absent from a page with no objects (e.g. an empty bucket)
        for item in page.get('Contents', []):
            yield item


for i in iterate_bucket_items(bucket='my_bucket'):
    print(i)

      

Which will output something like:



{u'ETag': '"a8a9ee11bd4766273ab4b54a0e97c589"',
 u'Key': '2017-06-01-10-17-57-EBDC490AD194E7BF',
 u'LastModified': datetime.datetime(2017, 6, 1, 10, 17, 58, tzinfo=tzutc()),
 u'Size': 242,
 u'StorageClass': 'STANDARD'}
{u'ETag': '"03be0b66e34cbc4c037729691cd5efab"',
 u'Key': '2017-06-01-10-28-58-732EB022229AACF7',
 u'LastModified': datetime.datetime(2017, 6, 1, 10, 28, 59, tzinfo=tzutc()),
 u'Size': 238,
 u'StorageClass': 'STANDARD'}
...

      

Please note that list_objects_v2 is recommended instead of list_objects: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

You can also do this at a lower level by calling list_objects_v2() directly and passing the NextContinuationToken value from each response as ContinuationToken in the next call, for as long as IsTruncated is true in the response.
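A minimal sketch of that lower-level loop might look like the following. The function name iterate_bucket_items_manually and the bucket name 'my_bucket' are hypothetical, and the client is passed in as an argument so the loop can be tried against any object that exposes list_objects_v2():

```python
def iterate_bucket_items_manually(client, bucket):
    """Yield every object in `bucket` by following continuation tokens by hand."""
    kwargs = {'Bucket': bucket}
    while True:
        response = client.list_objects_v2(**kwargs)
        # 'Contents' is missing when a response holds no objects
        for item in response.get('Contents', []):
            yield item
        # IsTruncated is True while more results remain to be fetched
        if not response.get('IsTruncated'):
            break
        # Resume the listing where this response stopped
        kwargs['ContinuationToken'] = response['NextContinuationToken']


# Usage (needs AWS credentials; 'my_bucket' is a placeholder):
#   import boto3
#   for item in iterate_bucket_items_manually(boto3.client('s3'), 'my_bucket'):
#       print(item['Key'])
```

In most cases the paginator is less code for the same result; the manual loop is mainly useful if you need to store the token and resume a listing later.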



I found out that boto3 has a Paginator class to handle truncated results. The following worked for me:

paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket='iper-apks')

      



after which I can iterate over page_iterator in a for loop.
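As a rough sketch of such a loop: each page is a dict shaped like a single list_objects() response, so the objects sit under its 'Contents' key. The helper name keys_from_pages is hypothetical, and it accepts any iterable of page dicts:

```python
def keys_from_pages(pages):
    """Yield the 'Key' of every object across an iterable of response pages."""
    for page in pages:
        # 'Contents' is absent from a page that holds no objects
        for item in page.get('Contents', []):
            yield item['Key']


# Usage with the page_iterator created above (needs AWS credentials):
#   for key in keys_from_pages(page_iterator):
#       print(key)
```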
