How paging works in the list_blobs function of the Google Cloud Storage Python client library

I want to get a list of all blobs in a Google Cloud Storage bucket using the Python client library.

According to the documentation, I have to use the function list_blobs(). The function seems to take two arguments, max_results and page_token, to achieve paging, and I'm not sure how to use them. Specifically, where do I get page_token from?

I expected list_blobs() to provide a page_token for use in subsequent calls, but I can't find any documentation on it.

max_results is also optional. What happens if I don't provide it? Is there a default limit? If so, what is it?


3 answers


list_blobs() does use paging, but you don't use page_token to achieve it.

How it works:

list_blobs() returns an iterator that iterates through all the results, paging behind the scenes. So simply doing the following will get you all the results, fetching pages as needed:

for blob in bucket.list_blobs():
    print(blob.name)

The documentation is incorrect / misleading:

As of 26/04/2017, this is what the docs say:

    page_token (str) – (Optional) Opaque marker for the next "page" of blobs. If not passed, will return the first page of blobs.

This implies that the result will be a single page of results, with page_token determining which page. That is not true. The result iterator iterates through multiple pages. What page_token actually determines is which page the iterator should START at. Leaving it out simply means the iterator starts at the first page.
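
To make that concrete, here is a minimal sketch of how a token from one listing could be used to start a later listing at that page. It assumes the same bucket object as above; saved_token is just an illustrative variable name:

# Run 1: fetch only the first page and remember where it stopped.
iterator = bucket.list_blobs()
first_page = next(iterator.pages)
names = [blob.name for blob in first_page]
saved_token = iterator.next_page_token  # opaque marker for the next page (None if there is no next page)

# Run 2 (possibly a different process): start iterating at that page instead of the first one.
resumed = bucket.list_blobs(page_token=saved_token)
for blob in resumed:
    print(blob.name)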

Good to know:

max_results limits the total number of results returned by the iterator.
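
For example (a small sketch, reusing the bucket object from above):

# Yields at most 10 blobs in total.
for blob in bucket.list_blobs(max_results=10):
    print(blob.name)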

The iterator does indeed expose the pages if you need them:

for page in bucket.list_blobs().pages:
    for blob in page:
        print(blob.name)


I'm just going to leave this here. I'm not sure if the library has changed in the two years since that answer was posted, but if you use a prefix, for blob in bucket.list_blobs() doesn't work correctly on its own. Getting blobs and getting prefixes are fundamentally different things, and using pages with a prefix is confusing.

I found a comment on a GitHub issue (linked in the code below). This works for me:

def list_gcs_directories(bucket, prefix):
    # from https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes
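
Usage looks something like this (the bucket and prefix values here are just placeholders):

# Collect the "subdirectory" prefixes directly under some/path/.
dirs = list_gcs_directories(bucket, prefix="some/path/")
for d in sorted(dirs):
    print(d)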

      



Another comment on the same issue suggests the following:

def get_prefixes(bucket):
    iterator = bucket.list_blobs(delimiter="/")
    # _get_next_page_response() is a private method and only fetches a single page.
    response = iterator._get_next_page_response()
    return response['prefixes']

Which only gives you the prefixes if all of your results fit on a single page.
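
For what it's worth, newer versions of the library also aggregate the prefixes on the iterator itself once it has been fully consumed. I believe something like the sketch below works, but treat it as an assumption to verify against your installed google-cloud-storage version:

def get_all_prefixes(bucket, prefix=None):
    # List with a delimiter so the service reports "subdirectory" prefixes.
    iterator = bucket.list_blobs(prefix=prefix, delimiter="/")
    # Consuming the iterator fetches every page and fills iterator.prefixes as it goes.
    for _ in iterator:
        pass
    return iterator.prefixes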


It was a little confusing, but I found the answer here:

https://googlecloudplatform.github.io/google-cloud-python/latest/iterators.html

You can iterate over the pages and pull out the items you want:

# assumes self.bucket is an already-constructed google.cloud.storage Bucket
iterator = self.bucket.list_blobs()

self.get_files = []
for page in iterator.pages:
    print('    Page number: %d' % (iterator.page_number,))
    print('  Items in page: %d' % (page.num_items,))
    # note: next(page) consumes the first item, so it will not reappear in the loop below
    print('     First item: %r' % (next(page),))
    print('Items remaining: %d' % (page.remaining,))
    print('Next page token: %s' % (iterator.next_page_token,))
    for f in page:
        self.get_files.append("gs://" + f.bucket.name + "/" + f.name)

print("Found %d results" % len(self.get_files))
