Is there a way to optimize my PowerShell function to remove pattern matches from a large file?
I have a large text file (~20K lines, ~80 characters per line). I also have a large array (~1500 elements) of objects containing patterns that I want to remove from the large text file. Note that if a pattern from the array appears on a line in the input file, I want to delete the entire line, not just the pattern.
The input file is CSV-like, with lines similar to:
A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;
The pattern in the array that I am looking for on each line of the input file is the
XX000029
part of the line above.
My somewhat naive function to achieve this looks like this:
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    try{
        $FileContent = Get-Content $BigFile
    }catch{
        Write-Error $_
    }

    $IgnorePatterns | ForEach-Object {
        $IgnoreId = $_.IgnoreId
        $FileContent = $FileContent | Where-Object { $_ -notmatch $IgnoreId }
        Write-Host $FileContent.count
    }

    $FileContent | Set-Content "CleansedBigFile.txt"
}
It works, but it is slow.
How can I make it faster?
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"

    If(Test-Path $BigFile){
        $reader = New-Object System.IO.StreamReader($BigFile)

        $line = $reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$line | Add-Content "CleansedBigFile.txt"}

            # Attempt to read the next line.
            $line = $reader.ReadLine()
        }

        $reader.Close()
    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}
StreamReader is one of the preferred ways to read large text files. We also build a single regular-expression pattern string to match against. Each pattern is run through [regex]::Escape() as a precaution, in case regex control characters are present. We have to guess about that, since we only see one sample pattern.
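As a quick illustration of that precaution, Escape() leaves an ID like the question's untouched but backslash-escapes regex metacharacters (the second input is an invented example):

    [regex]::Escape('XX000029')   # XX000029   -- nothing to escape
    [regex]::Escape('A.B(1)*')    # A\.B\(1\)\* -- dot, parens and star escaped

Without this step, a pattern containing characters such as . or * would silently match more lines than intended.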
If $IgnorePatterns can easily be treated as strings, this should work fine. A small sample of what $regex would look like:
XX000029|XX000028|XX000027
If $IgnorePatterns is populated from a database you may have less control over this, but since we are using regex, you could shrink the pattern set by writing actual regex constructs instead of a plain alternation of literals. The sample above, for instance, can be reduced to XX00002[7-9].
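A minimal sketch of that shrinking, assuming three consecutive IDs (the IDs other than XX000029 and the sample line reuse the question's format but are otherwise invented):

    # Three consecutive IDs: a literal alternation and its character-class equivalent.
    $ids         = 'XX000027','XX000028','XX000029'
    $alternation = ($ids | ForEach-Object { [regex]::Escape($_) }) -join '|'
    $compact     = 'XX00002[7-9]'

    $sample = 'A;AAA-BBB;XXX;XX000028;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;'
    $sample -match $alternation   # True
    $sample -match $compact       # True

Both patterns flag the same lines; the character class just gives the regex engine one branch to try instead of three.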
I don't know if the regex will provide a performance boost with 1500 possible alternatives; the StreamReader is where the gain is expected to come from. However, I hobbled it by using Add-Content for the output, which wins no awards for speed (a stream writer could be used in its place).
Reader and writer
I still need to test this to make sure it works, but it simply uses a StreamReader and a StreamWriter. If it performs better, I would just replace the code above with this.
function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"

    If(Test-Path $BigFile){
        # Prepare the StreamReader
        $reader = New-Object System.IO.StreamReader($BigFile)

        # Prepare the StreamWriter
        $writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")

        $line = $reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$writer.WriteLine($line)}

            # Attempt to read the next line.
            $line = $reader.ReadLine()
        }

        # Don't cross the streams!
        $reader.Close()
        $writer.Close()
    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}
You might need some error protection around the streams, but it appears to work.
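One caveat on calling it: the question's array holds objects with an IgnoreId property, while this function escapes each array element directly, so the strings should be pulled out first. A hypothetical call (the file name and second ID are made up for the example):

    # The question's array: objects carrying the pattern in an IgnoreId property.
    $ignore = @(
        [pscustomobject]@{ IgnoreId = 'XX000029' }
        [pscustomobject]@{ IgnoreId = 'XX000031' }
    )

    # Member enumeration flattens the objects into plain strings for the function.
    Remove-IdsFromFile -BigFile 'BigFile.txt' -IgnorePatterns $ignore.IgnoreId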