Scrape all Google search results for certain criteria?

I am working on my site mapper and I need to build a complete map of newegg.com.

I could try to scrape NE directly (which violates NE's policies), but they have a lot of products that don't show up in NE's own search and can only be found through a google.com search; and I need those links too.

Here is a search string that returns 16 million results: https://www.google.com/search?as_q=&as_epq=.com%2FProduct%2FProduct.aspx%3FItem%3D&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=newegg.com&as_occt=url&safe=off&tbs=&as_filetype=&as_rights=
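For reference, the same query can be assembled from its parameters; this is only a minimal PHP sketch of the non-empty parameters in the string above:

    <?php
    // The non-empty parameters of the advanced search string above.
    $params = [
        'as_epq'        => '.com/Product/Product.aspx?Item=', // exact phrase to match
        'as_qdr'        => 'all',                             // any date
        'as_sitesearch' => 'newegg.com',                      // restrict to this site
        'as_occt'       => 'url',                             // phrase must occur in the URL
        'safe'          => 'off',
    ];
    echo 'https://www.google.com/search?' . http_build_query($params), PHP_EOL;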

I want my scraper to go through all of the results and write out the hyperlink of each one. I can pull all the links out of the Google result pages, but Google has a 100-page limit per query (1,000 results), and again, Google is not happy with this approach. :)

I'm new to this; could you please advise or point me in the right direction? Are there any tools or methodologies that could help me achieve my goal?





3 answers


I'm new to this; could you please advise or point me in the right direction? Are there any tools or methodologies that could help me achieve my goal?

Google takes a lot of steps to prevent its pages from being crawled, and I'm not talking about merely asking you to respect their robots.txt. I don't agree with their ethics, nor their T&C, not even the "simplified" version they rolled out (but that's a separate issue).



If you want to be seen, you have to let Google crawl your pages; however, if you want to crawl Google, you have to jump through some major hoops! Namely, you need a bunch of proxies so you can get past the rate limiting and the 302s + captcha pages they serve up whenever they get suspicious of your "activity".

Despite Google's heavily convoluted T&C, I would not recommend that you violate it. However, if you absolutely need the data, you can get a big list of proxies, load them into a queue, and pull a proxy from the queue each time you want to fetch a page. If the proxy works, put it back in the queue; otherwise, discard it. You could even keep a failure counter for each proxy and drop it once it exceeds a certain number of failures.
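A minimal PHP sketch of that proxy queue, assuming a plain text file of proxies; the captcha check, the timeouts and the failure threshold are placeholders, not a tested recipe:

    <?php
    // Sketch of the proxy-queue idea: pull a proxy from a queue for every page
    // fetch, requeue it on success, count failures and drop a proxy once it has
    // failed too often.
    $proxies = new SplQueue();
    foreach (file('proxies.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $p) {
        $proxies->enqueue($p);                 // e.g. "1.2.3.4:8080"
    }
    $failures    = [];                         // proxy => failure count
    $maxFailures = 3;                          // arbitrary threshold

    function fetchThroughProxy(string $url, string $proxy): ?string
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_PROXY          => $proxy,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => false,   // a 302 here is usually the captcha redirect
            CURLOPT_CONNECTTIMEOUT => 10,
            CURLOPT_TIMEOUT        => 20,
            CURLOPT_USERAGENT      => 'Mozilla/5.0',
        ]);
        $body = curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        // Treat network errors, redirects and captcha pages as a failed attempt.
        if ($body === false || $code !== 200 || stripos($body, 'captcha') !== false) {
            return null;
        }
        return $body;
    }

    $urlsToFetch = [/* the result-page URLs you want to crawl */];

    foreach ($urlsToFetch as $url) {
        $html = null;
        while ($html === null && !$proxies->isEmpty()) {
            $proxy = $proxies->dequeue();
            $html  = fetchThroughProxy($url, $proxy);
            if ($html !== null) {
                $proxies->enqueue($proxy);     // proxy worked: back into the queue
            } else {
                $failures[$proxy] = ($failures[$proxy] ?? 0) + 1;
                if ($failures[$proxy] < $maxFailures) {
                    $proxies->enqueue($proxy); // give it another chance
                }                              // otherwise: drop it for good
            }
        }
        if ($html !== null) {
            // ... extract the newegg.com product links from $html here ...
        }
        sleep(5);                              // arbitrary politeness delay
    }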





I haven't tried it, but you could use the Google Custom Search API. Of course, it starts to cost money after 100 searches a day. I guess they have to run a business. :P
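If you try it, a Custom Search JSON API call is just an HTTPS GET against https://www.googleapis.com/customsearch/v1; the sketch below assumes you have already created an API key and a custom search engine (the cx value) that covers newegg.com, and both values are placeholders:

    <?php
    // Sketch: query the Google Custom Search JSON API and print the result links.
    // YOUR_API_KEY and YOUR_CX are placeholders; the free tier is limited to
    // 100 queries per day.
    $params = [
        'key' => 'YOUR_API_KEY',
        'cx'  => 'YOUR_CX',
        'q'   => 'site:newegg.com inurl:Product.aspx',
        'num' => 10,                     // max results per request
    ];
    $url  = 'https://www.googleapis.com/customsearch/v1?' . http_build_query($params);
    $json = file_get_contents($url);
    if ($json === false) {
        exit("request failed\n");
    }
    $data = json_decode($json, true);
    foreach ($data['items'] ?? [] as $item) {
        echo $item['link'], PHP_EOL;     // each item carries title, link, snippet, ...
    }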







It might be a little late, but I think it's worth mentioning that it is possible to scrape Google professionally without it causing problems.

As far as I know, scraping Google is not really a threat.
It's not advisable if you are inexperienced, but I am not aware of a single case of legal consequences, and I follow this topic closely.

Perhaps the biggest scraping case happened a few years ago, when Microsoft scraped Google to power Bing. Google was able to prove it by planting fake results that do not exist in the real world, and Bing suddenly picked them up.
Google named and shamed them; that's about all that happened, as far as I remember.

The API is rarely used in real life: it costs a lot of money even for a small number of results, and the free quota is quite small (40 requests per hour before you get blocked). Another disadvantage is that the API does not reflect the real search results; in your case that may matter less, but in most cases people want the real rankings.

Now, if you do not accept Google's TOS, or you ignore it (they did not care about your TOS when they scraped you at their startup), you can go another route.
Mimic a real user and get the data directly from the SERPs.
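To get the links out of a SERP once you have fetched it, you can parse the HTML with PHP's DOMDocument; the sketch below looks for the /url?q=... wrappers Google has historically put around result links, which is an assumption you will have to keep up to date as the markup changes:

    <?php
    // Sketch: extract outbound result links from a fetched Google SERP ($html).
    // The /url?q=<target>&... wrapper is an assumption based on older markup,
    // not a stable API.
    function extractResultLinks(string $html): array
    {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);                       // silence warnings about sloppy HTML
        $links = [];
        foreach ((new DOMXPath($doc))->query('//a[@href]') as $a) {
            $parts = parse_url($a->getAttribute('href')) ?: [];
            if (($parts['path'] ?? '') === '/url' && isset($parts['query'])) {
                parse_str($parts['query'], $q);       // q=<target>&sa=...&ved=...
                if (!empty($q['q'])) {
                    $links[] = $q['q'];               // the real target URL
                }
            }
        }
        return array_unique($links);
    }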

The key here is to send about 10 requests per hour (this can be increased to around 20) from each IP address (yes, you use more than one IP address). That amount has not caused problems with Google over the past years.
Use caching, databases and IP rotation management to avoid hitting it more often than necessary.
The IP addresses need to be clean, unshared and, if possible, without an abusive history.
The originally suggested proxy list would complicate this a lot, as you end up with unstable, unreliable IPs with questionable abusive use, sharing and history.
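A rough PHP sketch of that pacing, binding each request to one of your own addresses with cURL's CURLOPT_INTERFACE; the IP list is a placeholder, the 10-requests-per-hour budget comes from the paragraph above, and everything else (timings, round-robin choice, omitted caching) is an assumption:

    <?php
    // Sketch: keep each of your own IPs at roughly 10 Google requests per hour.
    // The IP list is a placeholder; caching of already-fetched pages and a
    // persistent database are left out for brevity.
    $ips           = ['203.0.113.10', '203.0.113.11', '203.0.113.12']; // placeholders
    $perIpPerHour  = 10;
    $delayPerIp    = 3600 / $perIpPerHour;        // seconds between uses of one IP
    $delayPerFetch = $delayPerIp / count($ips);   // overall pacing across the pool

    function fetchFromIp(string $url, string $ip): ?string
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_INTERFACE      => $ip,           // send the request from this address
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_USERAGENT      => 'Mozilla/5.0', // mimic a normal browser UA
            CURLOPT_TIMEOUT        => 20,
        ]);
        $body = curl_exec($ch);
        $ok   = curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
        curl_close($ch);
        return ($body !== false && $ok) ? $body : null;
    }

    $urls = [/* the SERP URLs you still need, after checking your cache/database */];
    foreach ($urls as $i => $url) {
        $ip   = $ips[$i % count($ips)];            // simple round-robin IP rotation
        $html = fetchFromIp($url, $ip);
        if ($html !== null) {
            // ... parse and store the result links here ...
        }
        sleep((int) ceil($delayPerFetch));         // stay within the hourly budget
    }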

There is an open-source PHP project at http://scraping.compunect.com that contains all the features you need to get started. I used it for my work, and it has now been running for several years without problems. It is a ready-made project mainly meant to be used as a customizable base for your own project, but it also runs standalone.

Also, PHP is not a bad choice: I was skeptical at first, but I have been running PHP (5) as a background process for two years without a single interruption.
The performance is good enough for a project like this, so I would give it a shot. Otherwise, PHP code is similar to C/Java, so you can see how things are done and reuse them in your own project.









