Encoding problem with the /Differences array in a PDF

The Type1 font encoding /Differences uses strings when mapping values; for example, the character 1 is encoded as "one". It is only used for numbers and special characters.

What is the standard way to use this encoding?

How do I decode a string from a PDF that uses encoding like this?

File link: http://www.filedropper.com/open


2 answers


Here's the /Differences array in your file (and frankly, you should have just linked the file itself, not a skeevy download page):

/Differences [
    24 /breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
    39 /quotesingle
    96 /grave
    128 /bullet/dagger/daggerdbl/ellipsis...
]

      

How it works: a font also has an encoding associated with it (for example, /MacRomanEncoding or /WinAnsiEncoding). In the case of a Type 1 font, an encoding is embedded in the font. Then, given a copy of that encoding, you apply the differences to it. Each number starts a run (your first is 24), so you set entries 24-31 inclusive to /breve, /caron, /circumflex, etc.
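The procedure above can be sketched in a few lines of Python. This is only an illustration: `apply_differences` is a made-up helper name, the base encoding is a placeholder filled with `.notdef` rather than a real font encoding, and the differences list is the beginning of the array shown above with the leading slashes dropped.

```python
def apply_differences(base_encoding, differences):
    """Return a copy of base_encoding with a /Differences array applied.

    base_encoding: list of 256 glyph names (leading slash dropped).
    differences:   flat list mixing ints (start codes) and glyph names.
    """
    encoding = list(base_encoding)   # work on a copy, keep the base intact
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item              # a number starts a new run
        else:
            if code is None:
                raise ValueError("glyph name appears before any code")
            encoding[code] = item    # a name fills the current code...
            code += 1                # ...and the next name goes one higher
    return encoding


# Placeholder base encoding (assumption): a real one comes from the font
# program or from a predefined encoding such as WinAnsiEncoding.
base = [".notdef"] * 256

differences = [
    24, "breve", "caron", "circumflex", "dotaccent",
        "hungarumlaut", "ogonek", "ring", "tilde",
    39, "quotesingle",
    96, "grave",
]

encoding = apply_differences(base, differences)
print(encoding[24], encoding[26], encoding[31], encoding[39], encoding[96])
```

Entries that the array does not mention keep whatever the base encoding had.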



Type 1 fonts have a dictionary named /CharStrings, which associates each glyph name with the data/code that will render it. If, for example, you get a character with code 26, you look it up in your encoding array (which should be a 256-element array for Type 1 fonts) and, with the differences applied, you get the name /circumflex. Then you look that name up in the CharStrings dictionary, pull out the glyph data and draw it. Any code that is not defined in the encoding must be set to /.notdef, which will then display a shape representing an undefined character (usually empty).
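As a rough sketch of that lookup, assuming the encoding is a 256-entry list of names and the CharStrings dictionary is modelled as a plain Python dict (the byte strings below are placeholders, not real glyph programs, and `glyph_for_code` is a hypothetical helper):

```python
def glyph_for_code(code, encoding, charstrings):
    """Map a character code to (glyph name, glyph data), using .notdef as fallback."""
    name = encoding[code] if 0 <= code < len(encoding) else ".notdef"
    if name not in charstrings:
        name = ".notdef"             # the font has no glyph with that name
    return name, charstrings[name]


encoding = [".notdef"] * 256
encoding[24:32] = ["breve", "caron", "circumflex", "dotaccent",
                   "hungarumlaut", "ogonek", "ring", "tilde"]

# Placeholder glyph data (assumption); a real font stores charstring programs here.
charstrings = {
    ".notdef": b"",
    "circumflex": b"<circumflex outline>",
}

print(glyph_for_code(26, encoding, charstrings))   # code 26 is /circumflex
print(glyph_for_code(200, encoding, charstrings))  # undefined code falls back to /.notdef
```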

Now, your problem is probably how to turn those glyph names into something more useful like Unicode?

If you look in Appendix D of the PDF specification, you will see a set of tables that define the character sets for the standard Latin encodings. From these you can build a lookup table that maps the standard Adobe glyph names to Unicode. Unfortunately, the tables in Appendix D are incomplete. Luckily, Adobe has a file that defines it all for you, the Adobe Glyph List. There is a link in that file that is now dead.
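A minimal sketch of such a lookup table follows. The dictionary is a tiny hand-picked excerpt, not the full Adobe Glyph List, and the names `AGL_EXCERPT`, `glyph_name_to_unicode` and `decode_bytes` are made up for this example. Codes whose glyph name is unknown are replaced by U+FFFD here, which is a choice of this sketch rather than anything the spec requires.

```python
# Small excerpt of glyph name -> Unicode mappings (the full list is much longer).
AGL_EXCERPT = {
    "breve": "\u02D8", "caron": "\u02C7", "circumflex": "\u02C6",
    "dotaccent": "\u02D9", "hungarumlaut": "\u02DD", "ogonek": "\u02DB",
    "ring": "\u02DA", "tilde": "\u02DC",
    "quotesingle": "\u0027", "grave": "\u0060",
    "bullet": "\u2022", "dagger": "\u2020", "daggerdbl": "\u2021",
    "ellipsis": "\u2026",
    "one": "1", "A": "A",
}


def glyph_name_to_unicode(name, table=AGL_EXCERPT):
    """Return the Unicode string for a glyph name, or None if it is unknown."""
    return table.get(name)


def decode_bytes(data, encoding, table=AGL_EXCERPT):
    """Decode character codes via encoding (code -> name) and table (name -> Unicode)."""
    out = []
    for code in data:
        ch = glyph_name_to_unicode(encoding[code], table)
        out.append(ch if ch is not None else "\uFFFD")
    return "".join(out)


encoding = [".notdef"] * 256
encoding[39] = "quotesingle"
encoding[128:132] = ["bullet", "dagger", "daggerdbl", "ellipsis"]

text = decode_bytes(bytes([128, 39, 131, 5]), encoding)
print(text)
```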


How do I decode a string from a PDF that uses encoding like this?

As the spec explains:

9.10.2 Mapping Character Codes to Unicode Values

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods:

  • If the font dictionary contains a ToUnicode CMap, use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font:

    a) Map the character code to a character name according to Table D.1 and the font's Differences array.

    b) Look up the character name in the Adobe Glyph List to obtain the corresponding Unicode value.

  • If the font is a composite font ... (not applicable in your case)

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents, in which case a conforming reader may choose a character code of their choosing.

(ISO 32000-1)

So first of all, you should look for a ToUnicode CMap.

If there is none (as is the case in your sample document), use the Encoding (predefined or with Differences).

And if the code still does not map to a properly named character, there is no way, according to the spec, to determine what the character code represents!

If the font is embedded, you may have a way out by parsing the embedded font program, which may include its own Unicode mapping.

Otherwise, you can start guessing (or delegate to OCR).
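The priority order can be sketched as below. All names are hypothetical: the ToUnicode CMap is modelled as a plain dict from code to string, the encoding as a 256-entry list of glyph names, and the glyph list as a dict from name to string. Falling through to the encoding when the CMap has no entry for a code is an assumption of this sketch, and `None` stands for "no mapping found, time to guess".

```python
def code_to_unicode(code, to_unicode=None, encoding=None, glyph_list=None):
    """Map a character code to Unicode, trying the methods in priority order."""
    # 1. A ToUnicode CMap, if the font dictionary has one.
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    # 2. Encoding (with Differences applied) -> glyph name -> glyph list.
    if encoding is not None and glyph_list is not None:
        name = encoding[code]
        if name in glyph_list:
            return glyph_list[name]
    # 3. Nothing worked: the caller has to inspect the font program or guess.
    return None


# Toy inputs (assumptions for illustration only).
to_unicode = {65: "A"}
encoding = [".notdef"] * 256
encoding[26] = "circumflex"
glyph_list = {"circumflex": "\u02C6"}

print(code_to_unicode(65, to_unicode, encoding, glyph_list))  # via the CMap
print(code_to_unicode(26, None, encoding, glyph_list))        # via encoding + glyph list
print(code_to_unicode(200, None, encoding, glyph_list))       # no mapping available
```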




But your assumption

It is only used for numbers and special characters.

is wrong. If you take a look at your sample document, the two fonts F25 and F26 used on its first page have this Differences array:

    0 /.notdef
    1 /dotaccent/fi/fl/fraction/hungarumlaut/Lslash/lslash/ogonek/ring
    10 /.notdef
    11 /breve/minus
    13 /.notdef
    14 /Zcaron/zcaron/caron/dotlessi/dotlessj/ff/ffi/ffl
    22 /.notdef
    30 /grave/quotesingle/space/exclam/quotedbl/numbersign/dollar/percent/ampersand/quoteright
       /parenleft/parenright/asterisk/plus/comma/hyphen/period/slash
       /zero/one/two/three/four/five/six/seven/eight/nine
       /colon/semicolon/less/equal/greater/question/at
       /A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/U/V/W/X/Y/Z
       /bracketleft/backslash/bracketright/asciicircum/underscore/quoteleft
       /a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z
       /braceleft/bar/braceright/asciitilde
    127 /.notdef
    130 /quotesinglbase/florin/quotedblbase/ellipsis/dagger/daggerdbl/circumflex/perthousand/Scaron/guilsinglleft/OE
    141 /.notdef
    147 /quotedblleft/quotedblright/bullet/endash/emdash/tilde/trademark/scaron/guilsinglright/oe
    157 /.notdef
    159 /Ydieresis
    160 /.notdef
    161 /exclamdown/cent/sterling/currency/yen/brokenbar/section/dieresis/copyright/ordfeminine
        /guillemotleft/logicalnot/hyphen/registered/macron/degree/plusminus/twosuperior/threesuperior
        /acute/mu/paragraph/periodcentered/cedilla/onesuperior/ordmasculine/guillemotright
        /onequarter/onehalf/threequarters/questiondown
        /Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla
        /Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex/Idieresis
        /Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis/multiply
        /Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn/germandbls
        /agrave/aacute/acircumflex/atilde/adieresis/aring/ae/ccedilla
        /egrave/eacute/ecircumflex/edieresis/igrave/iacute/icircumflex/idieresis
        /eth/ntilde/ograve/oacute/ocircumflex/otilde/odieresis/divide
        /oslash/ugrave/uacute/ucircumflex/udieresis/yacute/thorn/ydieresis

which contains mappings for the normal uppercase /A .. /Z and lowercase /a .. /z characters.




By the way,

The Type1 font encoding /Differences uses strings when mapping values; for example, the character 1 is encoded as "one".

is not strictly correct: the '/' characters are part of the corresponding values, e.g. /one, and as PDF objects these are not Strings but Names.
