How does paging work in the list_blobs function of the Google Cloud Storage Python client library?
I want to get a list of all the blobs in a Google Cloud Storage bucket using the client library for Python. According to the documentation, I have to use the function list_blobs(). The function seems to take two arguments, max_results and page_token, to achieve paging, and I'm not sure how to use them.
Specifically, where do I get page_token from?
I expected list_blobs() to provide a page_token for use in subsequent calls, but I can't find any documentation on it.
max_results is also optional. What happens if I don't provide it? Is there a default limit? If so, what is it?
list_blobs() does use paging, but you do not use page_token to achieve it.
How it works:
list_blobs() works by returning an iterator that walks through all the results, paging behind the scenes. So simply doing this will get you all the results, fetching new pages as needed:
for blob in bucket.list_blobs():
    print(blob.name)
The documentation is incorrect / misleading:
As of 26/04/2017, this is what the docs say:
page_token (str) - (Optional) Opaque marker for the next "page" of blobs. If not passed, will return the first page of blobs.
This implies that the result will be a single page of results, with page_token determining which page. That is not true: the result iterator iterates through multiple pages. What page_token actually represents is which page the iterator should START at. If page_token is not passed, the iterator starts at the first page.
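If you really do want to pick up a listing where a previous call left off, one way (a sketch, assuming the current google-cloud-storage client and a made-up bucket name) is to read the iterator's next_page_token after consuming a page and pass it back as page_token in a later call:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")   # hypothetical bucket name

# Consume one page, then remember where the listing stopped.
iterator = bucket.list_blobs()
first_page = next(iterator.pages)     # a single page of blobs
for blob in first_page:
    print(blob.name)
token = iterator.next_page_token      # None when there are no more pages

# Later (even in a different process), resume from that point.
if token is not None:
    for blob in bucket.list_blobs(page_token=token):
        print(blob.name)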
Good to know:
max_results limits the total number of results returned by the iterator.
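For example, a minimal sketch (reusing a bucket object as above) that stops after at most 50 blobs, however many pages that takes:
blobs = bucket.list_blobs(max_results=50)
names = [blob.name for blob in blobs]  # never more than 50 names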
The iterator does indeed expose the pages, if you need them:
for page in bucket.list_blobs().pages:
    for blob in page:
        print(blob.name)
I'm just going to leave this here. I'm not sure whether the library has changed in the two years since the answer above was posted, but if you use a prefix, for blob in bucket.list_blobs() doesn't work correctly. It seems that getting blobs and getting prefixes are fundamentally different, and using prefixes with pages is confusing.
I found a comment on the GitHub issue (here). This works for me:
def list_gcs_directories(bucket, prefix):
    # From https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920
    # Collect the "directory" prefixes from every page of results.
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes
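For example, a call like this (the bucket and prefix names are made up) returns the set of "subdirectory" prefixes directly under data/:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")              # hypothetical bucket name
subdirs = list_gcs_directories(bucket, "data/")  # hypothetical prefix
print(subdirs)                                   # e.g. {'data/2019/', 'data/archive/'}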
Another comment on the same issue suggests the following:
def get_prefixes(bucket):
    iterator = bucket.list_blobs(delimiter="/")
    # _get_next_page_response() is a private helper that fetches only a single page.
    response = iterator._get_next_page_response()
    return response['prefixes']
That only gives you the prefixes if all of your results fit on a single page.
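If you need the prefixes from every page without calling private methods, one option is to drain the iterator and then read its prefixes attribute. This is a sketch that assumes recent versions of the library populate iterator.prefixes as pages are consumed; check it against your installed version:
iterator = bucket.list_blobs(delimiter="/")
for _ in iterator:        # force every page to be fetched; the blobs themselves are ignored here
    pass
print(iterator.prefixes)  # assumption: accumulated across all pages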
It was a little confusing, but I found the answer here:
https://googlecloudplatform.github.io/google-cloud-python/latest/iterators.html
You can iterate over the pages and pull out the items you want:
iterator = self.bucket.list_blobs()
self.get_files = []
for page in iterator.pages:
    print('    Page number: %d' % (iterator.page_number,))
    print('  Items in page: %d' % (page.num_items,))
    print('     First item: %r' % (next(page),))  # note: next() consumes this blob, so the loop below skips it
    print('Items remaining: %d' % (page.remaining,))
    print('Next page token: %s' % (iterator.next_page_token,))
    for f in page:
        self.get_files.append("gs://" + f.bucket.name + "/" + f.name)
print("Found %d results" % len(self.get_files))