Why does file_get_contents return strange characters?

I am trying to parse http://www.desi-tashan.com/category/pakistan-tvs/aaj-tv/3-idiots/ with file_get_contents, but it returns very unusual characters and symbols.

Whereas when I parse http://www.desi-tashan.com/ it works fine. Can anyone tell me why this is happening?

Is there some encoding that I need to decode?

The page seems to be made with WordPress.

+1




3 answers


The content you are seeing is gzip-compressed.

You might be interested in looking at gzdecode or zlib_decode (note that zlib support in PHP is not enabled by default).

Your code might look like this:



$url = 'http://www.desi-tashan.com/category/pakistan-tvs/aaj-tv/3-idiots/';
$content = file_get_contents($url);
$decoded_content = gzdecode($content); // or zlib_decode($content);

      

Another solution here on Stack Overflow adds an Accept-Encoding HTTP header to the request, telling the server NOT to gzip the response.

However, that doesn't work for www.desi-tashan.com: the server ignores the Accept-Encoding header and always returns gzipped content.
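For reference, the header-based approach mentioned above could be sketched with a stream context. This is only a minimal sketch, not the code from the other answer: the helper name build_identity_context is made up for illustration, and "Accept-Encoding: identity" is the standard way to ask for an uncompressed response, which a misconfigured server such as this one may simply ignore.

```php
<?php
// Hypothetical helper (name invented for this sketch): builds a stream
// context that asks the server not to compress the response.
// The server is free to ignore the header.
function build_identity_context()
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "Accept-Encoding: identity\r\n",
        ],
    ]);
}

// Usage (performs a real HTTP request, so it is left commented out):
// $content = file_get_contents($url, false, build_identity_context());
```

If the response still comes back compressed, the header was ignored and the body has to be decoded anyway.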

+7




I've seen this happen on sites where the web server is misconfigured and sends back a compressed page whether or not the client indicates it can handle it. (The client indicates this with an Accept-Encoding header, which file_get_contents won't send.) This usually works in web browsers, as they either ask for the compressed page by default or handle a gzip response even if they didn't ask for it.

(By the way, if you are on a Unix-derived system, you can easily confirm that the return value is gzipped by saving it to a file and then running the file command on it. Or just look at the first couple of bytes of the result yourself: gzip data starts with the bytes 1F 8B.)
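The same magic-byte check can be done in PHP before deciding whether to decode. Here is a minimal sketch, assuming the zlib extension is available; the function name maybe_gunzip is made up for illustration.

```php
<?php
// Hypothetical helper (name invented for this sketch): decode the body
// only if it starts with the gzip magic bytes 1F 8B; otherwise return
// it unchanged.
function maybe_gunzip($body)
{
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // does not look like gzip data
    }
    $decoded = gzdecode($body);
    // gzdecode() returns false on corrupt input; fall back to the raw body.
    return $decoded === false ? $body : $decoded;
}
```

You would call it on the string returned by file_get_contents, after checking that the request itself did not return false.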

Instead of manually unzipping the content, I personally use the PHP curl library. You can configure it to request gzipped content, and if you do, it will transparently unzip the result for you:



$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://actualidad.rt.com/actualidad');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');  // send Accept-Encoding: gzip and decode the response
$content = curl_exec($ch);

      

This is more future-proof than manually decoding the result: if the web server is later configured properly to send plain text to clients that cannot handle gzip, this code will still request and decode the compressed version.

+4




You can simply use the JavaScript charAt method to get the character at a specific position in a string. Or, pretty clearly, just feed the function the filename and it will return the extension of the file you choose.

-1

