Regex to capture the smallest group

I am trying to grab the ID of a PDF Page object that looks like this:

4 0 obj
<<
/Type /Page /
...
>>
endobj

      

The identifier is the number in "ID 0 obj". The problem is that my file has multiple objects, so the following pattern grabs everything from the first object declaration to the first occurrence of a Page object:

preg_match_all("/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s", $input_lines, $output_array);

      

Here is an example of my file. If you try it, you will see that several objects include the word "Page":

%PDF-1.3
%¦¦¦¦

1 0 obj
<<
/Type /Catalog /AcroForm << /Fields [12 0 R 13 0 R] /NeedAppearances false  /SigFlags 3 /Version /1.7 /Pages 3 0 R /Names << >> /ViewerPreferences << /Direction /L2R >> /PageLayout /SinglePage /PageMode /UseNone /OpenAction [0 0 R /FitH null] /DR << /Font << /F1 14 0 R >> >> /DA (/F1 0 Tf 0 g) /Q 0 >> /Perms << /DocMDP 11 0 R >>
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj

2 0 obj
<<
/Type /Outlines
/Count 0
>>
endobj

3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources <<
/Font <<
/F1 9 0 R
>>
/ProcSet 8 0 R
>>
/MediaBox [0 0 612.0000 792.0000]
/Contents 5 0 R
>>
endobj

5 0 obj
<< /Length 1074 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( A Simple PDF File ) Tj
ET
BT
/F1 0010 Tf

      

What should I change to keep the pattern from being greedy?

EDIT: Clarifications

  • I forgot to mention that I need to grab all Page object IDs.
  • Since some people have told me to use a more specific regex, I should say that the object above is not the only way objects can be laid out; the form below is possible too. You can see that spaces are optional and that there can be other tags before the '/Type /Page' tag.

Example:

4 0 obj
<< /UselessTag/Type/Page/
...
>>
endobj

      

  • There are also tags called Pages, PageLayout and SinglePage, and I don't want to capture them.

6 answers


You can use

'~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm'

      

See regex demo



More details

  • ^ - start-of-line anchor (with the m modifier, ^ matches at the start of every line, not only at the start of the whole string)
  • (\d+) 0 obj - 1 or more digits (captured in group 1), followed by a space, 0, a space and the substring obj
  • (?:(?!^\d+ 0 obj$).)*? - a tempered greedy token that matches any char (.) that does not start the pattern ^\d+ 0 obj$, as few times as possible
  • \/Type\s*\/Page\s - /Type, 0+ whitespace characters (replace \s with \h to match only horizontal whitespace), then /Page followed by a whitespace character
  • .*? - any 0+ characters, as few as possible, up to the first occurrence of
  • endobj - the literal endobj, followed by
  • $ - end of line.
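To see the pattern described above in action, here is a minimal PHP sketch. The sample is a trimmed-down version of the file in the question; object 6 is my own assumption (the question's file is cut off before it, but /Kids refers to it), and the variable names are only illustrative.

```php
<?php
// Trimmed sample modelled on the question's PDF.
// Object 6 is an assumed second page; it is not shown in the question.
$pdf = <<<'PDF'
1 0 obj
<<
/Type /Catalog /PageLayout /SinglePage
/Pages 3 0 R
>>
endobj

3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj

6 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj
PDF;

// Same pattern as in the answer; "/" needs no escaping with "~" as delimiter.
$pattern = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?/Type\s*/Page\s.*?endobj$~sm';

preg_match_all($pattern, $pdf, $matches);

// Group 1 holds the IDs of the objects that contain "/Type /Page".
$pageIds = $matches[1];
print_r($pageIds); // for this sample: 4 and 6
```

Objects 1 and 3 are skipped because the tempered token refuses to run past the next "N 0 obj" line, so a failed attempt cannot leak into a later object. Note that \/Page\s requires whitespace after /Page, so the compact /Type/Page/ form from the question's edit would need a slightly looser ending.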


You can make a quantifier lazy (non-greedy) by adding a question mark after it:

Example:

 \(.*\)

      

In the input below, this greedy pattern produces a single match, (test) test (test):

test (test) test (test) <test>



Example:

 \(.*?\)

      

In the input below, this lazy pattern matches each (test) separately:

test (test) test (test) test (test)
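A small PHP sketch of the difference, for anyone who wants to run it. The last part is my own addition: it applies the pattern from the question to a made-up two-object string to show that a lazy quantifier only shortens the right-hand end of a match and never moves its starting point, which is presumably why the question's pattern still begins at the first object.

```php
<?php
$text = 'test (test) test (test) <test>';

// Greedy: .* runs to the end of the string and backtracks to the last ")".
preg_match_all('/\(.*\)/', $text, $greedy);

// Lazy: .*? stops at the first ")" it can reach.
preg_match_all('/\(.*?\)/', $text, $lazy);

print_r($greedy[0]); // one match: "(test) test (test)"
print_r($lazy[0]);   // two matches: "(test)" and "(test)"

// Made-up sample with two objects; only object 4 is a page.
$pdf = "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
     . "4 0 obj\n<< /Type /Page /Parent 3 0 R >>\nendobj\n";

// The question's pattern: .+? is lazy, but the match still starts at "1 0 obj".
preg_match('/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s', $pdf, $m);
echo $m[1], "\n"; // prints 1, not 4
```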



Try a more specific regex so that it doesn't match the unwanted part of the text.

preg_match_all("/([0-9]+?) 0 obj\n\<\<\n\/Type\s\/Page[ \n]*?\//s", $input_lines, $output_array);

      

Proof: https://regex101.com/r/HjyQpS/1



This should work:

(\d+) 0 obj[^>]+/Page$

      

demo Regex101



I would not use regular expressions on a PDF. There are several situations in which this approach fails.

  • The page object is inside an object stream (and therefore compressed, most likely with the Deflate algorithm); this is allowed in PDF 1.5 and later.
  • Incremental updates within a PDF document can result in double hits on the same page.
  • The /Page name is not inside the dictionary you want to match but in a separate indirect object (never seen, but theoretically possible). For example, you have:
5 0 obj
<< /Type 6 0 R ....>>
endobj     
6 0 obj
/Page
endobj

      

Note: you also cannot expect the pages to be stored in the PDF file in the same order in which the viewer displays them.

But if you really have to do it this way, I would first match the PDF against

/([0-9]+) 0 obj(.+?)endobj/s

(the s modifier lets . match newlines as well) and then search the second group of each match for

/\/Type\s*\/Page[\s>]/

Allowing > at the end is important because you should also be able to match "/Type /Page>>", where /Type /Page is the last entry in the PDF dictionary.
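The two steps could be sketched in PHP roughly like this. findPageObjectIds is a hypothetical helper name, the sample string is made up, and I added "/" to the final character class as an assumption so that the compact /Type/Page/ form from the question is also accepted. All the caveats listed above still apply.

```php
<?php
/**
 * Hypothetical helper: returns the IDs of all objects whose body
 * contains /Type /Page (generation number assumed to be 0).
 */
function findPageObjectIds(string $pdf): array
{
    $ids = [];

    // Step 1: split the file into objects.
    // Group 1 is the object ID, group 2 is everything up to the next "endobj".
    preg_match_all('/([0-9]+) 0 obj(.+?)endobj/s', $pdf, $objects, PREG_SET_ORDER);

    foreach ($objects as $object) {
        // Step 2: look for /Type /Page inside this one object only.
        // "/" in the class is an assumption that covers "/Type/Page/...".
        if (preg_match('/\/Type\s*\/Page[\s>\/]/', $object[2])) {
            $ids[] = $object[1];
        }
    }

    return $ids;
}

// Made-up sample: a Pages node, a compact page and a page ending in ">>".
$pdf = "3 0 obj\n<<\n/Type /Pages\n/Kids [ 4 0 R 6 0 R ]\n>>\nendobj\n"
     . "4 0 obj\n<< /UselessTag/Type/Page/Parent 3 0 R\n>>\nendobj\n"
     . "6 0 obj\n<</Parent 3 0 R/Type /Page>>\nendobj\n";

print_r(findPageObjectIds($pdf)); // for this sample: 4 and 6
```

Because the second pattern only ever sees one object body at a time, a match can no longer start in one object and end in another, and /Pages is rejected since the character after /Page must be whitespace, ">" or "/".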



Use this regex:

/\d+\s0\sobj.+endobj/smU

      

Note that the U modifier makes the quantifiers ungreedy (lazy) by default. See a matching example here: https://www.tinywebhut.com/regex/8
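A minimal PHP sketch of what the U modifier changes, using a made-up two-object string. As far as I can tell, the pattern as given matches every object rather than only pages and has no capture group for the ID, so a /Type /Page check on each match would still be needed.

```php
<?php
// Made-up sample with two objects.
$pdf = "1 0 obj\n<< /Type /Catalog >>\nendobj\n\n"
     . "4 0 obj\n<< /Type /Page >>\nendobj\n";

// With U, .+ is lazy: each match ends at the nearest "endobj".
preg_match_all('/\d+\s0\sobj.+endobj/smU', $pdf, $ungreedy);

// Without U, .+ is greedy: one match runs to the last "endobj".
preg_match_all('/\d+\s0\sobj.+endobj/sm', $pdf, $greedy);

echo count($ungreedy[0]), "\n"; // 2 (one match per object)
echo count($greedy[0]), "\n";   // 1 (both objects in a single match)
```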
