Find the largest match area in a text file

A.txt contains lines that look like this (or is this a small part):

Green- Blue- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 1
Green- Blue- 1

      

Basically, the last line is either 1 or 3. Suppose the above example goes on for a very long time, I need to find the largest number of consecutive lines that have a 1 at the end, keeping the number 3s less than or equal to some number (say 2). For example, suppose A.txt looks like this in general:

Green- Blue- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 3
Green- Blue- 1
Green- Purple- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 1
Blue- Blue- 3

      

The script then writes the following lines to another text file:

Green- Blue- 1
Green- Purple- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 1

      

How should I code this? Thanks in advance!

+3


source to share


4 answers


You really have no other choice to iterate over the entire file, keeping track of the largest sequence. Here's my example, encapsulated by a function: It uses a stack and iterates over file by line, so it should be efficient for large input files.

def foo(in_file, out_file, max_count):
    biggest, stack = [], []
    count = 0
    with open(in_file) as f:
        for line in f:
            if line[-2] == '3':
                count += 1
            if count > max_count:
                if len(stack) > len(biggest):
                    biggest = list(stack)
                # this line trims the list after the first element that ends with '3'
                stack = stack[stack.index(next(x for x in stack if x[-2] == '3')) + 1:]
                count = max_count
            stack.append(line)

    with open(out_file, 'w') as f:
        f.write(''.join(max(biggest, stack)))

      



Note . This will only work as intended if the file contains a blank line at the end and assumes it max_count

will always be greater than 0 (otherwise call next

throws Exception that is not handled).

+2


source


First of all, the start line is completely irrelevant. Second, there are probably 100 ways to fix this problem. I'm just going to list the one that seems best to me

You can also assume that the initial boundary will always be:

a) Beginning of the list

b) Immediately after 3

You can also assume that the end boundary will always be:

a) End of the list

b) Right in front of 3

So let's create a new one



threes = [-1, ... numbers.length + 1]

      

where ... are the line numbers of each 3. I add -1 and numbers.length + 1 to the list to "pretend" that our list is surrounded by two 3s to keep things simple.

Since it is not listed in the problem statement, we can also assume that the list will always contain at least 2 3s if possible. The reason is that this will give us the largest range.

Now we need to find the maximum range of line numbers between any two triplets.

max_range = -1 # number of lines between two 3s.
max_start = -1 # start line
max_end = -1   # end line

if len(threes) == 2: # special case here.  If the original list contains no 3s, we will take the whole list.
    max_start = threes[0]
    max_end = threes[1]
    max_range = max_end - max_start
else:
    for i in range(len(threes) - 2):
        # The general case.  Find the range between any two consecutive 3s.
        start = threes[i]
        end = threes[i + 2]
        range = end - start

        if range > max_range:
            max_start = start
            max_end = end
            max_range = range
max_start += 1
max_end -= 1
max_range -= 2

      

There are a few edge cases here, but this should get you started.

The first edge case (not defined in the problem) is what happens if I end up with [1, 1, 1, 3, 3]? Should I take 0-3, 0-4, or 0-5? All seem to be valid solutions. In this code, I am accepting 0-5 because it is not specified and it makes the code simpler.

+1


source


Something you can look into preserving indices using the combination itertools.groupby

txt = '''Green- Blue- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 3
Green- Blue- 1
Green- Purple- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 1
Blue- Blue- 3'''

import operator
from itertools import groupby
str_lst = list( enumerate( txt.split('\n') ) )

grp_lst = [ list(g) for k, g in groupby( [ (k,v[-1]) for k, v in str_lst ], key=operator.itemgetter(1) ) ]
filter_lst  = [ (i[0], len(i)) for i in grp_list if i[0][1] == '1' ]

for i in grp_list:
    if i[0] == max( dict(filter_lst).items(), key=operator.itemgetter(1) )[0]:
        idx = grp_list.index(i)
        break

for i in sum( grp_lst[idx:idx+3], [] ):
    print (str_lst[i[0]][1])

      

Output:

Green- Blue- 1
Green- Purple- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 1

      

+1


source


This is my decision.

First read the file and extract only the data you need, i.e. the last digit.

x = ''
for i, line in enumerate(txt.split('\n')):
    try:
        x += line[-1]
    except IndexError:
        pass

      

You will end up with a line containing all 1s and 3s as they appear after the line.

>>>print x
'1333113313'

      

At this point, you can iterate over that string and collect all possible substring that doesn't contain 3 s more than doubled. You can keep track of the index of the first letter of the string as well as its length.

results = {}
for i, n in enumerate(x):
    for idx in range(i+1, len(x)):
        if x[i:idx].count('3') <= 2:
            results[i] = len(x[i:idx])
        else:
            break

      

Finally, sort the results by length and you will get the line number that the longest sequence starts with and the number of lines it takes.

m = sorted(results.items(), key=operator.itemgetter(1))[-1]
>>>print m
(4, 5)

      

You can use this information to write the output file. This way you will save 5 lines starting at line 4.

with open('myfile.txt', 'r') as inp, open('out.txt', 'w') as out:
    for line in inp.readlines()[m[0]:m[0]+m[1]]
        out.write(line)

      

+1


source







All Articles