Is there a way to optimize my PowerShell function to remove pattern matches from a large file?

I have a large text file (~20K lines, ~80 characters per line) and a large array (~1500 elements) of objects containing patterns that I want to remove from the text file. Note: if a pattern from the array appears on a line in the input file, I want to delete the entire line, not just the matched pattern.

The input file is CSV-ish, with lines similar to:

A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;   


The pattern I am looking for on each line of the input file is the ID field, e.g.

XX000029

which is part of the sample line above.

My somewhat naive function to achieve this looks like this:

function Remove-IdsFromFile {
  param(
    [Parameter(Mandatory=$true,Position=0)]
    [string]$BigFile,
    [Parameter(Mandatory=$true,Position=1)]
    [Object[]]$IgnorePatterns
  )

  try{
    # -ErrorAction Stop turns a read failure into a terminating error so the catch runs
    $FileContent = Get-Content $BigFile -ErrorAction Stop
  }catch{
    Write-Error $_
    return
  }

  # Re-filters the entire file once per pattern: ~1500 passes over ~20K lines
  $IgnorePatterns | ForEach-Object {
    $IgnoreId = $_.IgnoreId
    $FileContent = $FileContent | Where-Object { $_ -notmatch $IgnoreId }
    Write-Host $FileContent.Count
  }
  $FileContent | Set-Content "CleansedBigFile.txt"
}


It works, but it is slow.

How can I make it faster?



1 answer


function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"

    If(Test-Path $BigFile){
        $reader = New-Object System.IO.StreamReader($BigFile)

        $line = $reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$line | Add-Content "CleansedBigFile.txt"}

            # Attempt to read the next line.
            $line = $reader.ReadLine()
        }

        $reader.Close()

    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}


StreamReader is one of the preferred ways to read large text files. We also build a single regular expression: one pattern string of alternatives to match against. While building that pattern string we run each pattern through [regex]::Escape() as a precaution, in case regex control characters are present. We have to guess here, since we only see one sample pattern.

As long as $IgnorePatterns can easily be cast as strings, this should work fine. A small sample of what $regex looks like would be:

XX000029|XX000028|XX000027


If $IgnorePatterns is populated from a database you may have less control over this, but since we are matching with a regex anyway, you could shrink the pattern set by writing actual regex patterns (instead of a plain alternation of literals) like the example above. You could reduce the sample above to XX00002[7-9], for example.
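A quick sketch of that reduction (the sample IDs are made up; -match returns a Boolean):

```powershell
# Three literal alternatives...
'XX000029' -match 'XX000029|XX000028|XX000027'   # True

# ...collapse into a single character class
'XX000029' -match 'XX00002[7-9]'                 # True
'XX000026' -match 'XX00002[7-9]'                 # False
```

Fewer alternatives means less work per line for the regex engine, which matters when the engine runs once for every line of the file.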



I don't know if a regex will provide a performance boost with 1500 possible. It is supposed to be located here StreamReader

. However, I removed the water using Add-Content

to the outlet, which does not receive any rewards for fast (can use writing in its place).

Reader and writer

I still need to test this to be sure it works, but the version below uses both a StreamReader and a StreamWriter. If it performs better, I will just replace the code above.

function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"

    If(Test-Path $BigFile){
        # Prepare the StreamReader
        $reader = New-Object System.IO.StreamReader($BigFile)

        #Prepare the StreamWriter
        $writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")

        $line=$reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$writer.WriteLine($line)}

            # Attempt to read the next line. 
            $line=$reader.ReadLine()
        }

        # Don't cross the streams!
        $reader.Close()
        $writer.Close()

    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}
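Assuming the patterns can be passed as plain strings (e.g. by projecting out the IgnoreId property first, to match the asker's objects), a call would look like this. $IgnoreObjects and the file path are hypothetical sample names:

```powershell
# Extract the ID strings from the pattern objects before calling
$patterns = $IgnoreObjects | ForEach-Object { $_.IgnoreId }

Remove-IdsFromFile -BigFile 'C:\data\BigFile.txt' -IgnorePatterns $patterns
```

Passing strings rather than the raw objects matters because [regex]::Escape($_) inside the function escapes whatever string the object converts to, not its IgnoreId property.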


You might want some error protection for the streams, but it appears to work as-is.
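That error protection could be a try/finally around the read loop, so both file handles are released even if matching throws. A sketch of the same loop under that assumption (not benchmarked here):

```powershell
$reader = New-Object System.IO.StreamReader($BigFile)
$writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null) {
        if ($line -notmatch $regex) { $writer.WriteLine($line) }
    }
} finally {
    # Always release the file handles, even when the loop throws
    $reader.Close()
    $writer.Close()
}
```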
