Regex to capture the smallest group

Question

I am trying to grab an ID for a PDF Page object that looks like this:

4 0 obj
<<
/Type /Page /
...
>>
endobj

The identifier has the form "ID 0 obj". The problem is that my file contains multiple objects, so the following pattern matches from the first object declaration up to the first occurrence of the Page object:

preg_match_all("/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s", $input_lines, $output_array);

Here is a sample of my file. If you try it, you will see that several objects contain the word "Page":

%PDF-1.3
%¦¦¦¦

1 0 obj
<<
/Type /Catalog /AcroForm << /Fields [12 0 R 13 0 R] /NeedAppearances false  /SigFlags 3 /Version /1.7 /Pages 3 0 R /Names << >> /ViewerPreferences << /Direction /L2R >> /PageLayout /SinglePage /PageMode /UseNone /OpenAction [0 0 R /FitH null] /DR << /Font << /F1 14 0 R >> >> /DA (/F1 0 Tf 0 g) /Q 0 >> /Perms << /DocMDP 11 0 R >>
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj

2 0 obj
<<
/Type /Outlines
/Count 0
>>
endobj

3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources <<
/Font <<
/F1 9 0 R
>>
/ProcSet 8 0 R
>>
/MediaBox [0 0 612.0000 792.0000]
/Contents 5 0 R
>>
endobj

5 0 obj
<< /Length 1074 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( A Simple PDF File ) Tj
ET
BT
/F1 0010 Tf

What should I change to make the match non-greedy?

EDIT: Clarifications

I forgot to mention that I need to grab the IDs of all Page objects.
Since some people suggested a more specific regex: the sample above is not a formal description of how objects are written, and the form below is possible too. Spaces are optional, and several tags can precede the '/Type /Page' tag.

Example:

4 0 obj
<< /UselessTag/Type/Page/
...
>>
endobj

There are also tags called Pages, PageLayout and SinglePage, and I don't want to match them.

php regex

Shashimee Jul 12 '17 at 13:21

6 answers

You can make a quantifier lazy (non-greedy) by appending a question mark to it:

Example:

 \(.*\)

Matches greedily, as one match from the first ( to the last ):

test (test) test (test) <test>

Example:

 \(.*?\)

Matches lazily, each (test) group separately:

test (test) test (test) test (test)
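
To make the difference concrete, here is a minimal PHP sketch. The sample string is my own assumption for illustration, not something taken from the question.

```php
<?php
// Sample text assumed for illustration only.
$text = 'test (test) test (test) test';

// Greedy: .* runs to the last ")" and then backtracks, giving one long match.
preg_match_all('/\(.*\)/', $text, $greedy);

// Lazy: .*? stops at the first ")" it reaches, giving one match per group.
preg_match_all('/\(.*?\)/', $text, $lazy);

print_r($greedy[0]); // one element: "(test) test (test)"
print_r($lazy[0]);   // two elements: "(test)" and "(test)"
```

Note that the pattern in the question already uses .+?. A lazy quantifier only shortens the match on the right-hand side; it does not stop the match from starting at the first "0 obj", which seems to be the actual problem there.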

Bernhard Jul 12 '17 at 13:23

Try a more specific regex so that it doesn't match the unwanted part of the text.

preg_match_all("/([0-9]+?) 0 obj\n\<\<\n\/Type\s\/Page[ \n]*?\//s", $input_lines, $output_array);

Proof: https://regex101.com/r/HjyQpS/1

Māris Kiseļovs Jul 12 '17 at 13:30

This should work:

(\d+) 0 obj[^>]+/Page$

demo Regex101
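
A minimal PHP sketch of how this pattern could be run. The ~ delimiters and the m flag (so that $ matches at the end of each line) are assumptions on my part, and the sample string is a hypothetical, shortened version of the file in the question.

```php
<?php
// Hypothetical sample, shortened from the file in the question.
$pdf = "3 0 obj\n<<\n/Type /Pages\n/Count 2\n>>\nendobj\n\n"
     . "4 0 obj\n<<\n/Type /Page\n/Parent 3 0 R\n>>\nendobj\n";

// [^>]+ cannot cross a ">", so the match ends before the first ">".
// /Page$ only matches when /Page ends a line, which excludes /Pages.
preg_match_all('~(\d+) 0 obj[^>]+/Page$~m', $pdf, $matches);

print_r($matches[1]); // IDs of the matched objects
```

This relies on /Page being the last thing on its line and on no > appearing before it, so the one-line form from the EDIT in the question (/Type/Page/) would probably not match.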

BrightOne Jul 12 '17 at 13:30

I would not use regular expressions on PDF files. There are several situations in which this approach fails:

The page object is inside an object stream (and therefore compressed, most likely with the Deflate algorithm). This is allowed from PDF version 1.5 onward.
Incremental updates within a PDF document can produce duplicate hits for the same page.
The /Page name is not inside the dictionary you want to match, but in a separate indirect object (I have never seen this, but it is theoretically possible). For example:

5 0 obj
<< /Type 6 0 R ....>>
endobj     
6 0 obj
/Page
endobj

Note: you also cannot expect the pages to be written to the PDF file in the same order in which the viewer shows them.

But if you really have to do it this way, I would first split the PDF into objects with

/([0-9]+) 0 obj(.+?)endobj/

and then search the second group for a match of

/\/Type\s*\/Page[\s>]/

Allowing > at the end is important because you should also be able to match "/Type /Page>>", where /Type /Page is the last entry in the PDF dictionary.
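
The two steps could be sketched in PHP roughly as follows. The sample input is a hypothetical, shortened version of the file in the question, the variable names are mine, and the s flag on the first pattern is an assumption added so that . also matches newlines.

```php
<?php
// Hypothetical sample input, shortened from the file in the question.
$pdf = <<<PDF
3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj

6 0 obj
<</UselessTag/Type/Page>>
endobj
PDF;

// Step 1: split the file into (ID, body) pairs.
// The s flag is an assumption: it lets . match newlines.
preg_match_all('/([0-9]+) 0 obj(.+?)endobj/s', $pdf, $objects, PREG_SET_ORDER);

// Step 2: keep the IDs whose body contains /Type /Page followed by
// whitespace or ">", which rules out /Pages.
$pageIds = [];
foreach ($objects as $object) {
    if (preg_match('/\/Type\s*\/Page[\s>]/', $object[2])) {
        $pageIds[] = $object[1];
    }
}

print_r($pageIds); // IDs 4 and 6 for this sample
```

Checking each object body separately keeps a /Type /Page hit from being attributed to an earlier object, which is the problem described in the question. The caveats listed above (object streams, incremental updates, indirect /Type values) still apply.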

Patrick Fritzsch Jul 12 '17 at 14:00

Use this regex:

/\d+\s0\sobj.+endobj/smU

Note that the U modifier makes the quantifiers non-greedy. See a matching example here: https://www.tinywebhut.com/regex/8
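
A small PHP sketch of what the U modifier changes. The two-object sample string is hypothetical and shortened from the file in the question.

```php
<?php
// Hypothetical two-object sample, shortened from the file in the question.
$pdf = "2 0 obj\n<<\n/Type /Outlines\n>>\nendobj\n\n"
     . "4 0 obj\n<<\n/Type /Page\n>>\nendobj\n";

// Without U, .+ is greedy and the single match runs to the last endobj.
preg_match_all('/\d+\s0\sobj.+endobj/sm', $pdf, $greedy);

// With U (PCRE_UNGREEDY), .+ behaves like .+? and stops at the first endobj.
preg_match_all('/\d+\s0\sobj.+endobj/smU', $pdf, $ungreedy);

echo count($greedy[0]), "\n";   // 1
echo count($ungreedy[0]), "\n"; // 2
```

As far as I can tell, this pattern matches every object, not only Page objects, so a second check for /Type /Page on each match would still be needed.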

Saral Jul 12 '17 at 14:17

Wiktor Stribiżew · Accepted Answer · 2017-07-12T14:00:08+0000

You can use

'~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm'

See regex demo

Details

^ - start-of-line anchor (with the m modifier, ^ matches at the start of each line, not only at the start of the whole string)
(\d+) 0 obj - 1 or more digits (captured in group 1) followed by a space, 0, a space and the substring obj
(?:(?!^\d+ 0 obj$).)*? - a tempered greedy token that matches any char (.) that does not start the ^\d+ 0 obj$ pattern, as few times as possible
\/Type\s*\/Page\s - /Type, 0+ whitespace characters (replace \s with \h to match only horizontal whitespace), /Page followed by a whitespace character
.*? - any 0+ characters, as few as possible, up to the first occurrence of
endobj - the literal endobj, followed by
$ - the end of a line.
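
For illustration, here is a minimal PHP sketch that runs the pattern above with preg_match_all. The sample input is a hypothetical, shortened version of the file in the question, built with explicit \n line endings because $ in multiline mode matches before \n only.

```php
<?php
// Hypothetical sample, shortened from the file in the question.
$pdf = implode("\n", [
    '1 0 obj',
    '<<',
    '/Type /Catalog /PageLayout /SinglePage /PageMode /UseNone',
    '/Pages 3 0 R',
    '>>',
    'endobj',
    '',
    '3 0 obj',
    '<<',
    '/Type /Pages',
    '/Count 2',
    '/Kids [ 4 0 R 6 0 R ]',
    '>>',
    'endobj',
    '',
    '4 0 obj',
    '<<',
    '/Type /Page',
    '/Parent 3 0 R',
    '>>',
    'endobj',
    '',
    '6 0 obj',
    '<< /UselessTag/Type/Page',
    '/Parent 3 0 R',
    '>>',
    'endobj',
]);

$pattern = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm';
preg_match_all($pattern, $pdf, $matches);

// Group 1 holds the object IDs; the tempered token keeps a match from
// running past the next "N 0 obj" line into a later object.
print_r($matches[1]); // IDs 4 and 6 for this sample
```

Objects 1 and 3 are skipped here because /PageLayout, /SinglePage and /PageMode are not preceded by /Type, and /Pages is not followed by whitespace right after /Page.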
