Reading PDF, character problems

Question

I am trying to use PurePDF to collect some information from inside a PDF file, but PurePDF cannot read the file.

Whenever PurePDF tries to read any PDF file, it reports that it cannot find the header. I stepped through it in the debugger and noticed that the string read from the ByteArray looks like Japanese characters! I tried changing the endianness of my PDF ByteArray before passing it to PurePDF, but that changed nothing.

The PDF file itself is fine, since I can see the "%PDF-" header whenever I open it as text, but for some reason ActionScript is getting the wrong characters, so PurePDF cannot work at all.

Any ideas?

Thanks.

Update: I'm not a ByteArray expert, but I decided to dig into PurePDF, stepped through its code in the debugger, and found that it used readInt() to get characters. I rewrote that part with readByte() and now it reads the PDF! I still have to check whether the rest of its functions work... Can someone with more low-level programming experience explain what might be going on? I don't think the project in SVN is broken.

This is the code I used; I think it is quite simple:

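// Imports belong at the top of the class file.
import flash.events.Event;
import flash.net.URLLoader;
import flash.net.URLLoaderDataFormat;
import flash.net.URLRequest;
import flash.utils.ByteArray;

// Load the PDF as raw bytes rather than as text.
private function loadPdf():void
{
    var loader:URLLoader = new URLLoader();
    loader.dataFormat = URLLoaderDataFormat.BINARY;
    loader.addEventListener(Event.COMPLETE, onLoadComplete);
    loader.load(new URLRequest(PDF_FILE));
}

// Hand the loaded bytes to PurePDF.
protected function onLoadComplete(event:Event):void
{
    var data:ByteArray = URLLoader(event.target).data as ByteArray;
    pdfReader = new PdfReader(data);
    pdfReader.readPdf();
}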

actionscript flash actionscript-3 bytearray

rsantos Feb 17 13 at 2:33 am

1 answer

VC.One · Answer 1 · 2013-02-18T07:57:28+0000

I haven't worked with PurePDF before, but I have used ByteArray to extract information from files. What exactly do you want from this PDF? Do you only want to extract text? Also, can you upload the PDF and post a link? It will be easier to help if we are looking at the same thing.

About the Japanese text... When you read a PDF file into a ByteArray, don't expect to find human-readable text easily, because most of the data describes the file structure and so on. The actual text and images in a PDF are placed inside objects called streams. So you typically locate a text stream and extract it into your own ByteArray. To display the text correctly, decode it with the encoding (UTF-8, UTF-16, etc.) indicated in the PDF data.

The link below explains PDF streams better ("/Length" becomes your ByteArray length, "/Filter" indicates the decoding type (the encoding, such as ASCII), etc.):
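On the readInt() versus readByte() point from the question's update, here is a minimal sketch of what may be happening. readInt() consumes four bytes at once and returns them as one 32-bit number, while readByte() and readUnsignedByte() consume a single byte. The sample header string is made up for illustration, and how PurePDF turns the integer into a character is an assumption here: if the value ends up in something like String.fromCharCode(), most of those bits cannot map to the ASCII character you expected, which would be consistent with the CJK-looking output.

```actionscript
import flash.utils.ByteArray;

// Illustrative data only: the first bytes of a typical PDF file.
var bytes:ByteArray = new ByteArray();
bytes.writeUTFBytes("%PDF-1.4");

// readInt() swallows four bytes (25 50 44 46) as a single number.
bytes.position = 0;
var asInt:int = bytes.readInt();
trace(asInt.toString(16)); // 25504446 with the default big-endian order

// Reading one byte at a time yields one ASCII character per read.
bytes.position = 0;
var header:String = "";
for (var i:int = 0; i < 5; i++)
{
    header += String.fromCharCode(bytes.readUnsignedByte());
}
trace(header); // %PDF-
```

Changing the endian property only reorders the four bytes inside each readInt() call; it still reads four bytes per character, which would explain why switching endianness did not help.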

http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/
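A rough sketch of locating a stream in the loaded ByteArray follows. The helper names indexOfAscii and firstStreamBytes are hypothetical (they are not part of PurePDF or Flash), and the approach is deliberately naive: it scans for the "stream" and "endstream" keywords instead of parsing the object dictionary, so it ignores /Length and may include a trailing line break.

```actionscript
import flash.utils.ByteArray;

// Hypothetical helper: index of the first occurrence of an ASCII marker
// at or after `from`, or -1 if it is not present.
function indexOfAscii(data:ByteArray, marker:String, from:int = 0):int
{
    var n:int = marker.length;
    for (var i:int = from; i + n <= data.length; i++)
    {
        var j:int = 0;
        while (j < n && data[i + j] == marker.charCodeAt(j)) j++;
        if (j == n) return i;
    }
    return -1;
}

// Hypothetical helper: copy the raw bytes between the first "stream"
// keyword and the following "endstream" keyword.
function firstStreamBytes(data:ByteArray, from:int = 0):ByteArray
{
    var start:int = indexOfAscii(data, "stream", from);
    if (start < 0) return null;
    start += 6;                                                  // skip the keyword itself
    if (start < data.length && data[start] == 0x0D) start++;     // optional CR
    if (start < data.length && data[start] == 0x0A) start++;     // LF
    var end:int = indexOfAscii(data, "endstream", start);
    if (end <= start) return null;                               // missing or empty stream
    var out:ByteArray = new ByteArray();
    out.writeBytes(data, start, end - start);
    out.position = 0;
    return out;
}
```

In practice many streams are compressed. If the dictionary lists /FlateDecode as the filter, calling uncompress() on the extracted ByteArray inside a try/catch is worth trying, since that filter uses zlib-style data; other filters need their own decoders.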

It all makes more sense if you open your PDF in a hex editor. Try the one below if you need one. You can then see where your stream positions are and tell AS3 to read from there:

http://www.hhdsoftware.com/free-hex-editor
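Once the hex editor has given you an offset and a length, reading from that position could look like the sketch below. The function name sliceBytes and the example numbers are placeholders, not values taken from any real file.

```actionscript
import flash.utils.ByteArray;

// Hypothetical helper: copy `length` bytes starting at `offset`.
function sliceBytes(data:ByteArray, offset:uint, length:uint):ByteArray
{
    // Guard against empty or out-of-range requests; a length of 0
    // would otherwise make readBytes() copy everything that remains.
    if (length == 0 || offset + length > data.length) return null;
    var out:ByteArray = new ByteArray();
    data.position = offset;
    data.readBytes(out, 0, length);
    out.position = 0;
    return out;
}

// Usage with made-up numbers read off a hex editor:
// var chunk:ByteArray = sliceBytes(pdfBytes, 0x1A2, 512);
```

If the slice holds uncompressed text, chunk.readMultiByte(chunk.length, "utf-8") is one way to turn it into a String, with the character set swapped for whatever the PDF actually uses.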

If there is still a problem, upload the PDF and tell me exactly what you are trying to extract from the document. I'll try to give more specific help (no promises, just trying to help). Peace.
