How to get all URLs on a Wikipedia page

It looks like what the Wikipedia API calls a link is different from a URL. I am trying to use the API to return all URLs on a specific wiki page.

I have been playing around with this request, which I found on this page under generators and redirects.



2 answers


I'm not sure exactly what confused you (it would help if you explained), but I'm pretty sure that request is not what you want. It lists the links (prop=links) on the pages that are linked (generator=links) from the "Title" page (titles=Title). It also returns only the first page of links for the first page of linked pages (with the default page size of 10).

If you want to get all links on the Title page:

  • Use only prop=links; you don't need a generator.
  • Increase the limit to the maximum by adding pllimit=max (pl is the prefix for links).
  • Use the value given in the query-continue element to get the second (and each following) page of results.

So, the request for the first page will look like this:

http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max



And the second (and in this case the final) page:

http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max&plcontinue=226160|0|Lieutenant_General
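The two request URLs above could also be assembled in code rather than by hand. This is only a rough sketch: build_links_url is a hypothetical helper name, and the https endpoint is an assumption (the URLs in this answer use http).

```python
from urllib.parse import urlencode

# Assumption: same endpoint as in the answer, but over https.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_links_url(title, plcontinue=None):
    """Return a prop=links query URL for `title` (hypothetical helper)."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "links",
        "pllimit": "max",
    }
    if plcontinue is not None:
        # Continuation value copied from the previous response.
        params["plcontinue"] = plcontinue
    # urlencode percent-encodes reserved characters such as "|".
    return API_URL + "?" + urlencode(params)


first = build_links_url("Title")
second = build_links_url("Title", plcontinue="226160|0|Lieutenant_General")
print(first)
print(second)
```

Letting urlencode do the escaping avoids pasting a raw "|" into the URL by hand.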

Another thing that might confuse you is that prop=links only returns internal links (to other Wikipedia pages). To get external links, use prop=extlinks. You can also combine the two into one query:

http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links|extlinks
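Once a combined response has been decoded from JSON, the two kinds of links can be separated roughly as below. This is a sketch under assumptions: extract_links is a hypothetical helper, the sample dictionary is hand-written rather than real API output, and it assumes the default JSON format, where each external URL sits under the "*" key.

```python
def extract_links(data):
    """Split a decoded action=query response into internal and external links."""
    internal, external = [], []
    for page in data.get("query", {}).get("pages", {}).values():
        # prop=links entries carry the target page title.
        for link in page.get("links", []):
            internal.append(link["title"])
        # Assumption: in the default JSON format the URL is stored under "*".
        for ext in page.get("extlinks", []):
            external.append(ext["*"])
    return internal, external


# Hand-written sample shaped like a format=json response (not real API output).
sample = {
    "query": {
        "pages": {
            "123": {
                "title": "Title",
                "links": [{"ns": 0, "title": "Example A"},
                          {"ns": 0, "title": "Example B"}],
                "extlinks": [{"*": "https://example.org/"}],
            }
        }
    }
}

internal, external = extract_links(sample)
print(internal)
print(external)
```

Note that each list is paged separately: extlinks has its own ellimit and elcontinue parameters, alongside pllimit and plcontinue for links.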



Here's a Python solution that gets (and prints) all the pages that a particular page links to. It requests the maximum number of links in the first request and then checks whether the returned JSON object has a "continue" property. If so, it adds the "plcontinue" value to the params dictionary and makes another request. (The last page of results will not have this property.)

import requests

session = requests.Session()

url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Albert Einstein",
    "prop": "links",
    "pllimit": "max"
}

response = session.get(url=url, params=params)
data = response.json()
pages = data["query"]["pages"]

pg_count = 1
page_titles = []

print("Page %d" % pg_count)
for key, val in pages.items():
    # A missing page has no "links" key, so fall back to an empty list.
    for link in val.get("links", []):
        print(link["title"])
        page_titles.append(link["title"])

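# The response contains "continue" for as long as more results remain.
while "continue" in data:
    # Pass the continuation value back with the next request.
    plcontinue = data["continue"]["plcontinue"]
    params["plcontinue"] = plcontinue

    response = session.get(url=url, params=params)
    data = response.json()
    pages = data["query"]["pages"]

    pg_count += 1

    print("\nPage %d" % pg_count)
    for key, val in pages.items():
        # A missing page has no "links" key, so fall back to an empty list.
        for link in val.get("links", []):
            print(link["title"])
            page_titles.append(link["title"])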

print("%d titles found." % len(page_titles))

      



This code was adapted from the code in the MediaWiki API example: links.
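The same continuation loop can also be written once as a generator, which removes the duplicated printing code. The sketch below is only one possible arrangement: iter_link_titles and fake_fetch are hypothetical names, and the network call is replaced by a stand-in function with canned responses so the paging logic can be followed without hitting the API.

```python
def iter_link_titles(fetch, params):
    """Yield link titles, following "continue" until the API stops sending it.

    `fetch` is any callable that takes a params dict and returns decoded JSON.
    """
    params = dict(params)  # work on a copy so the caller's dict is untouched
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            for link in page.get("links", []):
                yield link["title"]
        if "continue" not in data:
            break  # last page of results
        # Merge every continuation value into the next request.
        params.update(data["continue"])


def fake_fetch(params):
    """Stand-in for the network: returns two canned result pages."""
    if "plcontinue" not in params:
        return {
            "continue": {"plcontinue": "1|0|B", "continue": "||"},
            "query": {"pages": {"1": {"links": [{"title": "A"}]}}},
        }
    return {"query": {"pages": {"1": {"links": [{"title": "B"},
                                                {"title": "C"}]}}}}


titles = list(iter_link_titles(fake_fetch, {"action": "query", "prop": "links"}))
print(titles)
```

With the requests session from the answer above, fetch could be something like lambda p: session.get(url, params=p).json().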







