Extract number of video or audio files from Wikipedia article

I'm trying to extract the number of video or audio files present in a Wikipedia article. I searched for an API but couldn't find one for that.

I noticed that when using the API to retrieve images for a specific page, the sound file with the .ogg extension appears in the list of images.

http://ar.wikipedia.org/w/api.php?format=xml&action=parse&page=%D8%AD%D9%88%D8%AB%D9%8A%D9%88%D9%86&prop=images&redirects=

I don't know whether this behavior can be generalized. Can I use it to count video and audio files? Does anyone know another way to do this?


1 answer


Basically, the API handles all file types the same way, but you can retrieve the media type of each file and use it to filter for video and audio files.

To get the media type of a file, you must use prop=imageinfo (which is to be renamed to the more precise prop=fileinfo in future releases) for each file. Since prop=images can be used as a generator, you can get a list of files and their media types in one API call like this:

https://ar.wikipedia.org/w/api.php?action=query&generator=images&titles=%D8%AD%D9%88%D8%AB%D9%8A%D9%88%D9%86&redirects=&prop=imageinfo&iiprop=mediatype&continue=&format=xml

Here, images is used as a generator that returns a list of files, and that list is in turn passed to the imageinfo call.
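The same request can also be assembled programmatically. The following is only a minimal sketch: the function name build_media_query_url and its host and fmt defaults are illustrative assumptions, while the query parameters are the ones from the URL above (with format=json instead of format=xml, since JSON is usually easier to process in a script).

```python
from urllib.parse import unquote, urlencode


def build_media_query_url(title, host="ar.wikipedia.org", fmt="json"):
    """Build a query URL listing a page's files with their media types.

    The function name and the host/fmt defaults are illustrative choices.
    """
    params = {
        "action": "query",
        "generator": "images",   # list the files used on the page...
        "titles": title,
        "redirects": "",
        "prop": "imageinfo",     # ...and pass them on to imageinfo
        "iiprop": "mediatype",   # only ask for the media type
        "continue": "",
        "format": fmt,
    }
    # urlencode percent-encodes non-ASCII titles as UTF-8.
    return "https://" + host + "/w/api.php?" + urlencode(params)


# Page title taken from the percent-encoded URL in the question.
title = unquote("%D8%AD%D9%88%D8%AB%D9%8A%D9%88%D9%86")
url = build_media_query_url(title)
print(url)
```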

For each file, you will get something like this (shown here as JSON, i.e. with format=json):



"2014232": {
  "pageid": 2014232,
  "ns": 6,
  "title": "\u0645\u0644\u0641:06-Salame-Al Aadm 001.ogg",
  "imagerepository": "local",
  "imageinfo": [
    {
      "mediatype": "AUDIO"
    }
  ]
}

mediatype can be any of the following (copied and pasted from the manual):

UNKNOWN     // unknown format
BITMAP      // some bitmap image or image source (like psd, etc). Can't scale up.
DRAWING     // some vector drawing (SVG, WMF, PS, ...) or image source (oo-draw, etc). Can scale up.
AUDIO       // simple audio file (ogg, mp3, wav, midi, whatever)
VIDEO       // simple video file (ogg, mpg, etc; do not include formats here that may contain executable sections or scripts!)
MULTIMEDIA  // Scriptable Multimedia (flash, advanced video container formats, etc)
OFFICE      // Office Documents, Spreadsheets (office formats possibly containing applets, scripts, etc)
TEXT        // Plain text (possibly containing program code or scripts)
EXECUTABLE  // binary executable
ARCHIVE     // archive file (zip, tar, etc)

The default MIME type <=> media type mapping is available here, although it can be overridden on an individual wiki.
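Putting this together, counting the audio and video files comes down to tallying the mediatype values in the response. The sketch below assumes the response was requested with format=json and already parsed into a dict whose query.pages entries look like the sample above; count_media_types and the sample data are hypothetical, made up for illustration. It also assumes a single batch of results: the API returns files in batches, so a real script would have to repeat the request with the returned continue values until none are left.

```python
from collections import Counter


def count_media_types(response, wanted=("AUDIO", "VIDEO")):
    """Count files per media type in a parsed generator=images response.

    Hypothetical helper: expects the JSON layout shown in the sample above,
    i.e. response["query"]["pages"] keyed by page id.
    """
    pages = response.get("query", {}).get("pages", {})
    counts = Counter()
    for page in pages.values():
        infos = page.get("imageinfo", [])
        if not infos:
            continue  # no file information returned for this entry
        # Count each file once, based on its first imageinfo entry.
        mediatype = infos[0].get("mediatype")
        if mediatype in wanted:
            counts[mediatype] += 1
    return {name: counts[name] for name in wanted}


# Made-up sample response with one audio file and one bitmap image.
sample = {
    "query": {
        "pages": {
            "2014232": {
                "pageid": 2014232,
                "ns": 6,
                "title": "\u0645\u0644\u0641:06-Salame-Al Aadm 001.ogg",
                "imagerepository": "local",
                "imageinfo": [{"mediatype": "AUDIO"}],
            },
            "-1": {
                "ns": 6,
                "title": "File:Example.png",  # hypothetical file
                "imagerepository": "shared",
                "imageinfo": [{"mediatype": "BITMAP"}],
            },
        }
    }
}

result = count_media_types(sample)
print(result)  # {'AUDIO': 1, 'VIDEO': 0}
```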
