Python: how to create a regex for an inline ordered list?

Question

I have a form field whose values mostly consist of just an inline ordered list:

1. This item may contain characters, symbols or numbers. 2. And this item also...

The following code does not work for validating user input (users should only be able to enter an ordered list as a string):

definiton_re = re.compile(r'^(?:\d\.\s(?:.+?))+$')
validate_definiton = RegexValidator(definiton_re, _("Enter a valid 'definition' in format: 1. meaning #1, 2. meaning #2...etc"), 'invalid')

PS: Here I am using the RegexValidator class from the Django framework to validate the value of a form field.

python django regex

Sultan alotaibi 16 Aug 14 at 19:36

2 answers

Here is my solution. It works pretty well.

import re
from django.core.validators import RegexValidator
from django.utils.translation import gettext_lazy as _

input = '1. List item #1, 2. List item 2, 3. List item #3.'
regex = re.compile(r'(?:^|\s)(?:\d{1,2}\.\s)(.+?)(?=(?:, \d{1,2}\.)|$)')
# Parsing.
regex.findall(input)  # Result: ['List item #1', 'List item 2', 'List item #3.']
# Validation.
validate_input = RegexValidator(regex, _("Input must be in format: 1. any thing..., 2. any thing...etc"), 'invalid')
validate_input(input)  # No errors.

Sultan alotaibi 17 Aug 14 at 17:05

Unihedron · Accepted Answer · 2014-08-17T18:11:15+0000

Nice solution from the OP. To take it further, let's do some regex golf and optimization.

(?<!\S)\d{1,2}\.\s((?:(?!,\s\d{1,2}\.),?[^,]*)+)

Here's what's new:

(?:^|\s)
matches by backtracking between the alternatives. We use (?<!\S) instead, which asserts that the position is not preceded by a non-whitespace character.

\d{1,2}\.\s
does not have to be inside a non-capturing group.

(.+?)(?=(?:, \d{1,2}\.)|$)
is too bulky. We'll change this bit to:
- (
  Capturing group.
- (?:
  Non-capturing group that gets repeated.
- (?!
  Negative lookahead: assert that what follows our position is NOT:
- ,\s\d{1,2}\.
  A comma, a whitespace character, then a list index.
- )
- ,?[^,]*
  Here's the interesting optimization:
- - We match a comma, if there is one. We know from the lookahead that this position does not start a new list index, so we can safely assume that the following run of non-comma characters (if any) does not belong to the next item. We consume it with the * quantifier, and there is no backtracking.
- - This is a significant improvement over (.+?).
- )+
  Keep repeating the group until the negative lookahead fails.
- )
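
To make the breakdown above easier to follow, here is a minimal sketch that writes the same regex in verbose mode, with one comment per piece. The names ITEM_RE and text are only illustrative, and the second sample string is an assumed test input, not one from the question.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each piece can be
# commented. ITEM_RE is an illustrative name, not part of the answer.
ITEM_RE = re.compile(r"""
    (?<!\S)               # position is not preceded by a non-whitespace char
    \d{1,2}\.\s           # list index: one or two digits, a dot, whitespace
    (                     # capture the item text
      (?:
        (?!,\s\d{1,2}\.)  # stop when the next list index starts here
        ,?[^,]*           # optional comma, then a run of non-comma characters
      )+
    )
""", re.VERBOSE)

text = '1. List item #1, 2. List item 2, 3. List item #3.'
items = ITEM_RE.findall(text)
print(items)

# An item may itself contain a comma, as long as no list index follows it.
print(ITEM_RE.findall('1. red, green, 2. blue'))
```

The first call should print the same three items as the OP's regex, and the second shows that a comma inside an item does not end it.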

You can use it in place of the regex in the other answer, and here's a regex demo!

That said, at first glance the parsing part of this problem is better solved with re.split():

import re

input = '1. List item #1, 2. List item 2, 3. List item #3.'
lines = re.split(r'(?:^|, )\d{1,2}\. ', input)
# Gives ['', 'List item #1', 'List item 2', 'List item #3.']
if lines[0] == '':
    # Throw away the empty first element produced by the split.
    lines = lines[1:]
print(lines)

Here is an online demo of the code.
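
The split-based parsing can also be wrapped in a small reusable function. This is only a sketch: the helper name split_inline_list is hypothetical, and it assumes items are separated by a comma and a single space, exactly as in the pattern above.

```python
import re

# Pattern from the answer: a list index at the start of the string,
# or one preceded by ", ".
_INDEX_RE = re.compile(r'(?:^|, )\d{1,2}\. ')


def split_inline_list(text):
    """Split '1. foo, 2. bar' into ['foo', 'bar'] (hypothetical helper)."""
    parts = _INDEX_RE.split(text)
    # The first index matches at position 0, so split() yields a leading ''.
    if parts and parts[0] == '':
        parts = parts[1:]
    return parts


print(split_inline_list('1. List item #1, 2. List item 2, 3. List item #3.'))
print(split_inline_list('1. red, green, 2. blue'))
```

Because the separator pattern requires a list index after the comma, a comma inside an item is left alone.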

Unfortunately, for validation you will still need the regex approach; just compile the regex from above:

regex = re.compile(r'(?<!\S)\d{1,2}\.\s((?:(?!,\s\d{1,2}\.),?[^,]*)+)')
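
For reference, here is a rough sketch of what the validation amounts to without Django. It assumes search-style matching, which, as far as I know, is how RegexValidator applies its pattern by default. The helper name looks_like_ordered_list is hypothetical.

```python
import re

regex = re.compile(r'(?<!\S)\d{1,2}\.\s((?:(?!,\s\d{1,2}\.),?[^,]*)+)')


def looks_like_ordered_list(value):
    # Hypothetical helper: True if at least one "N. text" item is found
    # anywhere in the value (search, not a full match).
    return regex.search(value) is not None


print(looks_like_ordered_list('1. meaning #1, 2. meaning #2'))
print(looks_like_ordered_list('no list here'))
```

Note that a search succeeds as soon as one item is found anywhere in the string, so this is a fairly loose check.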
