Design Pattern for Combining Lazy Lists

I am writing a program like this:

  • Find all files with the correct extension in a given directory
  • Foreach, find all occurrences of the given line in these files
  • Print each line

I would like to write this in a functional way, like a series of generator functions (things that only call yield return

and return one element at a time when it is lazy loaded), so my code would look like this:

IEnumerable<string> allFiles = GetAllFiles();
IEnumerable<string> matchingFiles = GetMatches( "*.txt", allFiles );
IEnumerable<string> contents = GetFileContents( matchingFiles );
IEnumerable<string> matchingLines = GetMatchingLines( contents );

foreach( var lineText in matchingLines )
  Console.WriteLine( "Found: " + lineText );

      

It's ok, but what I would also like to do is print some statistics at the end. Something like that:

Found 233 matches in 150 matching files. Scanned 3,297 total files in 5.72s

      

The problem is to write code in a "pure functional" style, like above, each element is lazy loaded.
You only know how many files match the total until the final foreach loop completes, and since only one element is always yield

ed at a time, the code has no place to keep track of how many things it found earlier. If you call LINQ method matchingLines.Count()

it will re-enumerate the collection!

I can think of many ways to solve this problem, but they all seem somewhat ugly. It strikes me as something that people should have done before, and I'm sure there will be good design that shows the best way to do it.

Any ideas? Greetings

+1


source to share


6 answers


In a similar vein to the other answers, but taking a slightly more general approach ...

... why not create a Decorator class that can wrap your existing IEnumerable implementation and calculate statistics as it passes other elements through.

Here's a class Counter

that I just dumped together, but you can also create variants for other kinds of aggregation.

public class Counter<T> : IEnumerable<T>
{
    public int Count { get; private set; }

    public Counter(IEnumerable<T> source)
    {
        mSource = source;
        Count = 0;
    }

    public IEnumerator<T> GetEnumerator()
    {
        foreach (var T in mSource)
        {
            Count++;
            yield return T;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        foreach (var T in mSource)
        {
            Count++;
            yield return T;
        }
    }

    private IEnumerable<T> mSource;
}

      

You can create three instances Counter

:

  • One to carry over GetAllFiles()

    counting the total number of files;
  • One to carry over the GetMatches()

    count of the number of matching files; and
  • One to wrap the GetMatchingLines()

    count of the number of matching lines.


The key with this approach is that you don't have multiple responsibilities on existing classes / methods - the method GetMatchingLines()

only handles the match, you also don't ask it to track statistics.

Clarification in response to comment Mitcham

:

The last code will look something like this:

var files = new Counter<string>( GetAllFiles());
var matchingFiles = new Counter<string>(GetMatches( "*.txt", files ));
var contents = GetFileContents( matchingFiles );
var linesFound = new Counter<string>(GetMatchingLines( contents ));

foreach( var lineText in linesFound )
    Console.WriteLine( "Found: " + lineText );

string message 
    = String.Format( 
        "Found {0} matches in {1} matching files. Scanned {2} files",
        linesFound.Count,
        matchingFiles.Count,
        files.Count);
Console.WriteLine(message);

      

Note that this is still a functional approach - the variables used are immutable (more like bindings than variables) and the generic function has no side effects.

+2


source


I would say that you need to encapsulate the process in a Matches class where your methods collect statistics as they progress.

public class Matcher
{
  private int totalFileCount;
  private int matchedCount;
  private DateTime start;
  private int lineCount;
  private DateTime stop;

  public IEnumerable<string> Match()
  {
     return GetMatchedFiles();
     System.Console.WriteLine(string.Format(
       "Found {0} matches in {1} matching files." + 
       " {2} total files scanned in {3}.", 
       lineCount, matchedCount, 
       totalFileCount, (stop-start).ToString());
  }

  private IEnumerable<File> GetMatchedFiles(string pattern)
  {
     foreach(File file in SomeFileRetrievalMethod())
     {
        totalFileCount++;
        if (MatchPattern(pattern,file.FileName))
        {
          matchedCount++;
          yield return file;
        }
     }
  }
}

      



I'll focus on this as I should be coding stuff, but the general idea is there. The whole point of a "pure" functional program is to have no side effects, and this type of statistical computation is a side effect.

+2


source


I can think of two ideas

  • Pass a context object and return (string + context) from your counters - purely functional solution

  • use local thread storage for statistics ( CallContext ) you can be fantastic and maintain a stack of contexts. so you will have code like this.

    using (var stats = DirStats.Create())
    {
        IEnumerable<string> allFiles = GetAllFiles();
        IEnumerable<string> matchingFiles = GetMatches( "*.txt", allFiles );
        IEnumerable<string> contents = GetFileContents( matchingFiles );
        stats.Print()
        IEnumerable<string> matchingLines = GetMatchingLines( contents );
        stats.Print();
    } 
    
          

+2


source


If you're happy to turn your code upside down, you might be interested in Push LINQ. The basic idea is to reverse the pull model IEnumerable<T>

and turn it into a push model with observers - each part of the pipeline efficiently passes its data for any number of observers (using event handlers) that usually form new parts of the pipeline. This provides a very simple way to connect multiple aggregates to the same data.

See this blog post for details . I recently talked about this in London - my page contains a few links to sample code, slide deck, video, etc.

It's a fun little project, but needs a little ironing.

+1


source


I took Bevan's code and refactored until I was happy with it. Interesting stuff.

public class Counter
{
    public int Count { get; set; }
}

public static class CounterExtensions
{
    public static IEnumerable<T> ObserveCount<T>
      (this IEnumerable<T> source, Counter count)
    {
        foreach (T t in source)
        {
            count.Count++;
            yield return t;
        }
    }

    public static IEnumerable<T> ObserveCount<T>
      (this IEnumerable<T> source, IList<Counter> counters)
    {
        Counter c = new Counter();
        counters.Add(c);
        return source.ObserveCount(c);
    }
}


public static class CounterTest
{
    public static void Test1()
    {
        IList<Counter> counters = new List<Counter>();
  //
        IEnumerable<int> step1 =
            Enumerable.Range(0, 100).ObserveCount(counters);
  //
        IEnumerable<int> step2 =
            step1.Where(i => i % 10 == 0).ObserveCount(counters);
  //
        IEnumerable<int> step3 =
            step2.Take(3).ObserveCount(counters);
  //
        step3.ToList();
        foreach (Counter c in counters)
        {
            Console.WriteLine(c.Count);
        }
    }
}

      

Output of the expected result: 21, 3, 3

+1


source


Assuming these functions are your own, the only thing I can think of is a visitor template passing in an abstract visitor function that calls you when every thing happens. For example: pass ILineVisitor to GetFileContents (which I am assuming breaks the file into lines). ILineVisitor would have a method like OnVisitLine (String line), then you could implement ILineVisitor and force it to save the relevant statistics. Rinse and repeat with ILineMatchVisitor, IFileVisitor, etc. Or you can use one IVisitor with the OnVisit () method, which has different semantics in each case.

Your functions would have to take the visitor and call it OnVisit () at the appropriate time, which might seem annoying, but at least the visitor can be used to do a lot of interesting things besides re doing here. In fact, you might not actually write GetMatchingLines, passing the visitor who checks for a match on OnVisitLine (line of line) to GetFileContents.

Is this one of the ugly things you've already covered?

0


source







All Articles