How to get all URLs on a Wikipedia page
It looks like the Wikipedia API's definition of a "link" is different from a URL? I am trying to use the API to return all URLs on a specific wiki page.
I have been experimenting with this request, which I found on this page under generators and redirects.
I'm not sure what exactly confused you (it would help if you explained), but I'm pretty sure that request is not what you want. It lists the links (prop=links) on the pages that are linked (generator=links) from the "Title" page (titles=Title). It also lists only the first page of links on the first page of links (with a default limit of 10 links per request).
If you want to get all links on the Title page:
- Use only prop=links; you don't need a generator.
- Increase the limit to the maximum by adding pllimit=max (pl is the prefix for links).
- Use the value given in the query-continue element to fetch the second (and subsequent) pages of results.
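The pagination steps above can be sketched in Python with stubbed responses. Note this sketch assumes the modern JSON format, where the response carries a top-level "continue" object with a "plcontinue" token (older API versions used "query-continue"); the page id and titles in the stub data are illustrative, not real lookups.

```python
def collect_links(fetch):
    """Collect link titles across all API result pages.

    `fetch` stands in for an HTTP GET: it takes the continuation
    parameters for the next request and returns the parsed JSON dict.
    """
    titles = []
    cont = {}                      # continuation params for the next request
    while True:
        data = fetch(cont)
        for page in data["query"]["pages"].values():
            for link in page.get("links", []):
                titles.append(link["title"])
        if "continue" not in data:
            return titles          # no continuation token: we are done
        cont = {"plcontinue": data["continue"]["plcontinue"]}

# Stubbed responses shaped like the API's JSON (values are illustrative):
pages = [
    {"continue": {"plcontinue": "736|0|B", "continue": "||"},
     "query": {"pages": {"736": {"links": [{"title": "A"}]}}}},
    {"query": {"pages": {"736": {"links": [{"title": "B"}]}}}},
]
fake_fetch = lambda cont, it=iter(pages): next(it)
print(collect_links(fake_fetch))  # → ['A', 'B']
```

In a real script, `fetch` would merge `cont` into the request parameters and call the API; the loop structure is otherwise the same.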
So, the request for the first page will look like this:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max
And the second (and in this case the final) page adds the continuation value returned by the first response:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max&plcontinue=...
Another thing that might confuse you is that links only returns internal links (to other Wikipedia pages). To get external links, use prop=extlinks. You can also combine the two into one query:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links|extlinks
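As a sketch of how to pull both kinds of links out of a combined response: this assumes the default JSON format, where each extlinks entry stores its URL under the "*" key; the helper name and sample data are illustrative, not part of the API.

```python
def split_links(page):
    """Separate internal titles and external URLs from one page object
    returned by a prop=links|extlinks query."""
    internal = [l["title"] for l in page.get("links", [])]
    # each extlinks entry stores its URL under the "*" key in the
    # default (formatversion=1) JSON output
    external = [e["*"] for e in page.get("extlinks", [])]
    return internal, external

# Stubbed page object; a real response keys pages by numeric page id.
sample = {
    "links": [{"ns": 0, "title": "Physics"}],
    "extlinks": [{"*": "http://example.org/"}],
}
print(split_links(sample))  # → (['Physics'], ['http://example.org/'])
```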
Here's a Python solution that gets (and prints) all the pages that a particular page links to. It requests the maximum number of links in the first request, then checks whether the returned JSON object has a "continue" property. If so, it adds the "plcontinue" value to the params dictionary and makes another request. (The last page of results will not have this property.)
import requests

# one session so the TCP connection is reused across requests
session = requests.Session()
url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Albert Einstein",
    "prop": "links",
    "pllimit": "max",
}

response = session.get(url=url, params=params)
data = response.json()
pages = data["query"]["pages"]

pg_count = 1
page_titles = []

print("Page %d" % pg_count)
for key, val in pages.items():
    for link in val["links"]:
        print(link["title"])
        page_titles.append(link["title"])

# keep requesting while the response carries a continuation token;
# the last page of results has no "continue" property
while "continue" in data:
    params["plcontinue"] = data["continue"]["plcontinue"]
    response = session.get(url=url, params=params)
    data = response.json()
    pages = data["query"]["pages"]

    pg_count += 1
    print("\nPage %d" % pg_count)
    for key, val in pages.items():
        for link in val["links"]:
            print(link["title"])
            page_titles.append(link["title"])

print("%d titles found." % len(page_titles))
This code was adapted from the example in the MediaWiki API:Links documentation.