PHP: I want to create a page that fetches images from a forum thread. Is it doable? With CodeIgniter?

Suppose you have a forum (vBulletin) with a bunch of images. How easy would it be to build a page that visits a thread, goes through each of its pages, and sends the images to the user (via AJAX or whatever)? I am not asking about filtering (which is easy, of course).

Can it be done in a day? :)

I have a site that already uses CodeIgniter. Would using it make this even easier?

0




4 answers


Assuming this runs on the server, cURL plus regular expressions are your friends. And yes, it is doable in a day.



There are also open-source HTML parsers out there that can do this more cleanly.
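As a rough illustration of the parser route, here is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument) as a stand-in for a third-party parser. The function name extractImageUrls is made up for this example, and the HTML is assumed to have been fetched already (for instance with cURL).

```php
<?php
// Hypothetical helper: collect the src of every <img> in an HTML string.
function extractImageUrls(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Forum markup is rarely valid HTML, so collect parse warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $urls[] = $src;
        }
    }
    // Drop duplicates and reindex.
    return array_values(array_unique($urls));
}
```

Calling extractImageUrls($html) on a fetched page returns a plain array of URLs, which may still be relative and need resolving against the page address.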

+2




It depends where your scraping script is running.

If it runs on the same server as the forum software, you could access the database directly and look for image links there. I'm not familiar with vBulletin, but it probably offers an API or plugin interface that allows high-level database access. That would make it easier to query all posts in the thread.



If, however, your script runs on a different machine (in other words, it is independent of the forum software), it has to act as an HTTP client. It can fetch all pages of a thread (either automatically, by looking for a NEXT link on each page, or manually, by specifying all pages as parameters) and search the HTML source for image tags (<img .../>). You can then use a regular expression to extract the image URLs. Finally, the script can use these URLs to build another page that displays all of the images, or it can download them as a batch.
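The HTTP-client approach could look roughly like the sketch below. Everything named here is hypothetical: crawlThread is an invented function, $fetch stands for any callable that returns the HTML for a URL (a thin cURL wrapper, say), and the code assumes the pager marks its NEXT link with rel="next" and that links are absolute. Check the real forum markup before relying on either assumption.

```php
<?php
// Hypothetical sketch: follow "next" links and collect image URLs with regexes.
// $fetch is any callable that takes a URL and returns the page HTML.
function crawlThread(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $images = [];
    $seen = [];
    $url = $startUrl;

    // Stop on a missing next link, a repeated page, or the page limit.
    while ($url !== null && !isset($seen[$url]) && count($seen) < $maxPages) {
        $seen[$url] = true;
        $html = (string) $fetch($url);

        // Grab the src attribute of every <img> tag on the page.
        if (preg_match_all('/<img\b[^>]*\ssrc\s*=\s*["\']([^"\']+)["\']/i', $html, $m)) {
            foreach ($m[1] as $src) {
                // Array keys give cheap de-duplication.
                $images[html_entity_decode($src)] = true;
            }
        }

        // Assumption: the NEXT link carries rel="next", with rel before href.
        $url = null;
        if (preg_match('/<a\b[^>]*\srel\s*=\s*["\']next["\'][^>]*\shref\s*=\s*["\']([^"\']+)["\']/i', $html, $n)) {
            $url = html_entity_decode($n[1]);
        }
    }
    return array_keys($images);
}
```

Passing the fetcher in as a callable keeps the crawling logic separate from the transport, so it can be tried against saved HTML before pointing it at a live forum.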

In the second case, the script effectively acts as a spider, so it should respect things like robots.txt and robots meta tags.

0




Do not forget to throttle your requests. You don't want to overload the forum server by requesting many pages per second. The easiest way to do this is to sleep for X seconds between requests.
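A minimal sketch of that idea, assuming a hypothetical fetchThrottled function and a $fetch callable that does the actual HTTP request. The two-second default is only a placeholder, not a recommended value.

```php
<?php
// Hypothetical sketch: fetch URLs one at a time with a pause between requests.
function fetchThrottled(array $urls, callable $fetch, float $delaySeconds = 2.0): array
{
    $pages = [];
    $first = true;
    foreach ($urls as $url) {
        if (!$first) {
            // Pause before every request except the first one.
            usleep((int) round($delaySeconds * 1000000));
        }
        $first = false;
        $pages[$url] = $fetch($url);
    }
    return $pages;
}
```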

0




Yes, it should take maybe a day.

Since you already have a working CI setup, I would use that.

I would use the following approach:

1) Make a model in CI capable of:

  • logging in to vBulletin (images are often added as attachments, and you need to log in before you can download them). Use something like Snoopy.
  • collecting the URL of the "Last" button with preg_match(), parsing it with parse_url() and parse_str(), and generating links from page 1 to the last page
  • collecting the HTML from all generated links, still using Snoopy
  • finding all images in the HTML using preg_match_all()
  • downloading all images, still using Snoopy
  • moving each downloaded image from the tmp directory to another directory, renaming it imagename_01, imagename_02, etc. if an image with the same name already exists
  • storing the image name and its exact size in bytes in a DB table, so you can avoid downloading the same image more than once
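The link-generation step from the list above could be sketched as follows. The function name buildPageUrls is invented, and the code assumes the page number sits in a query parameter called "page" (as in showthread.php?t=123&page=17); verify that against the actual forum URLs. If the URL was lifted from an href attribute, run it through html_entity_decode() first so that &amp; becomes &.

```php
<?php
// Hypothetical helper: given the URL behind the "Last" button, build the URL
// of every page from 1 up to the last one.
// Assumption: the page number lives in a query parameter named "page".
function buildPageUrls(string $lastUrl, string $param = 'page'): array
{
    $parts = parse_url($lastUrl);
    if ($parts === false) {
        return [];
    }
    parse_str($parts['query'] ?? '', $query);
    $last = max(1, (int) ($query[$param] ?? 1));

    // Rebuild everything before the query string.
    $base = '';
    if (isset($parts['scheme'], $parts['host'])) {
        $base = $parts['scheme'] . '://' . $parts['host'];
        if (isset($parts['port'])) {
            $base .= ':' . $parts['port'];
        }
    }
    $base .= $parts['path'] ?? '';

    $urls = [];
    for ($page = 1; $page <= $last; $page++) {
        $query[$param] = $page;
        $urls[] = $base . '?' . http_build_query($query);
    }
    return $urls;
}
```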

2) Create a method in the controller that collects all images

3) Set up a cron job that collects images at regular intervals. wget -O /tmp/useless.html http://localhost/imageminer/collect should do nicely

4) Write some code that reads the images from the DB table and outputs nice HTML for the end user.
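For step 4, a bare-bones sketch might look like this. The renderGallery name and the 'path' and 'name' row keys are assumptions for illustration only; adapt them to whatever the table really stores.

```php
<?php
// Hypothetical sketch: turn rows from the image table into simple gallery HTML.
// Assumption: each row has 'path' and 'name' keys; adapt to the real schema.
function renderGallery(array $rows): string
{
    $html = "<ul class=\"gallery\">\n";
    foreach ($rows as $row) {
        // Escape values so stored names cannot inject markup.
        $src = htmlspecialchars($row['path'], ENT_QUOTES, 'UTF-8');
        $alt = htmlspecialchars($row['name'], ENT_QUOTES, 'UTF-8');
        $html .= "  <li><img src=\"{$src}\" alt=\"{$alt}\"></li>\n";
    }
    return $html . "</ul>\n";
}
```

In a CodeIgniter setup this markup would more naturally live in a view, with the controller passing in the rows from the model.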

0








