Comparing three arrays and getting values that are only in array 1
k7s5a's helpful answer is conceptually elegant and convenient, but there is a caveat:
It doesn't scale well, because an array lookup must be performed for every element of $a1.
At least for larger arrays, PowerShell's Compare-Object cmdlet is the better choice:
If the input arrays are ALREADY SORTED:
(Compare-Object $a1 ($a2 + $a3) | Where-Object SideIndicator -eq '<=').InputObject
Note:
* Compare-Object does not require sorted input, but sorted input can significantly improve performance (see below).
* As Esperento57 points out, (Compare-Object $a1 ($a2 + $a3)).InputObject is sufficient in this particular case, but only because $a2 and $a3 happen not to contain elements that are missing from $a1.
  Therefore, the more general solution is to filter with Where-Object SideIndicator -eq '<=', since it restricts the results to objects that are present only in the LHS ($a1), rather than also reporting objects that are present only in the RHS.
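To illustrate the second point, here is a minimal sketch with made-up sample arrays (they are an assumption for illustration, not the arrays from the question), in which the RHS contains a value, 99, that is missing from $a1:

```powershell
# Hypothetical sample data: 99 exists only on the RHS ($a2 + $a3).
$a1 = 1, 2, 3, 4, 5
$a2 = 1, 2
$a3 = 3, 99

# Unfiltered: reports differences in BOTH directions.
$unfiltered = (Compare-Object $a1 ($a2 + $a3)).InputObject

# Filtered: reports only the elements that are unique to the LHS ($a1).
$onlyInA1 = (Compare-Object $a1 ($a2 + $a3) |
  Where-Object SideIndicator -eq '<=').InputObject

# Sort for display, because the output order should not be relied upon.
"Unfiltered: $($unfiltered | Sort-Object)"
"Filtered:   $($onlyInA1 | Sort-Object)"
```

With this sample data, the unfiltered result contains 4, 5 and 99, whereas the filtered result contains only 4 and 5, which is the desired "only in array 1" set.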
If the input arrays are NOT SORTED:
Explicitly sorting the input arrays before comparing them significantly improves performance:
(Compare-Object ($a1 | Sort-Object) ($a2 + $a3 | Sort-Object) |
Where-Object SideIndicator -eq '<=').InputObject
The following example, which uses an array of 10,000 elements, illustrates the difference in performance:
$count = 10000 # Adjust this number to test scaling.
$a1 = 0..$($count-1) # With 10,000: 0..9999
$a2 = 0..$($count/2) # With 10,000: 0..5000
$a3 = $($count/2+1)..($count-3) # With 10,000: 5001..9997
$(foreach ($pass in 1..2) {
  if ($pass -eq 1) {
    $passDescr = "SORTED input"
  } else {
    $passDescr = "UNSORTED input"
    # Shuffle the arrays.
    $a1 = $a1 | Get-Random -Count ([int]::MaxValue)
    $a2 = $a2 | Get-Random -Count ([int]::MaxValue)
    $a3 = $a3 | Get-Random -Count ([int]::MaxValue)
  }
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "CompareObject, explicitly sorted first"
    Timing = (Measure-Command {
      (Compare-Object ($a1 | Sort-Object) ($a2 + $a3 | Sort-Object) | Where-Object SideIndicator -eq '<=').InputObject |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "CompareObject"
    Timing = (Measure-Command {
      (Compare-Object $a1 ($a2 + $a3) | Where-Object SideIndicator -eq '<=').InputObject |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), two-pass"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3
      $a1 | Where-Object { !$a2AndA3.Contains($_) } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), two-pass, explicitly sorted first"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3 | Sort-Object
      $a1 | Sort-Object | Where-Object { !$a2AndA3.Contains($_) } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), single-pass"
    Timing = (Measure-Command {
      $a1 | Where-Object { !($a2 + $a3).Contains($_) } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, two-pass"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3
      $a1 | Where-Object { $a2AndA3 -notcontains $_ } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, two-pass, explicitly sorted first"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3 | Sort-Object
      $a1 | Sort-Object | Where-Object { $a2AndA3 -notcontains $_ } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, single-pass"
    Timing = (Measure-Command {
      $a1 | Where-Object { ($a2 + $a3) -notcontains $_ } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  }
}) |
  Group-Object TestCategory | ForEach-Object {
    "`n=========== $($_.Name)`n"
    $_.Group | Sort-Object Timing | Select-Object Test, @{ l='Timing'; e={ '{0:N3}' -f $_.Timing } }
  }
Example output from my machine (output of missing array elements omitted):
=========== SORTED input

Test                                            Timing
----                                            ------
CompareObject                                   0.068
CompareObject, explicitly sorted first          0.187
!.Contains(), two-pass                          0.548
-notcontains, two-pass                          6.186
-notcontains, two-pass, explicitly sorted first 6.972
!.Contains(), two-pass, explicitly sorted first 12.137
!.Contains(), single-pass                       13.354
-notcontains, single-pass                       18.379

=========== UNSORTED input

Test                                            Timing
----                                            ------
CompareObject, explicitly sorted first          0.198
CompareObject                                   6.617
-notcontains, two-pass                          6.927
-notcontains, two-pass, explicitly sorted first 7.142
!.Contains(), two-pass                          12.263
!.Contains(), two-pass, explicitly sorted first 12.641
-notcontains, single-pass                       19.273
!.Contains(), single-pass                       25.174
* While the timings will vary with many factors, the results show that Compare-Object scales much better, provided the input is either presorted or sorted on demand, and the performance gap widens as the number of elements grows.
* If you don't use Compare-Object, performance can be improved somewhat, but the inability to take advantage of sorted input is the fundamentally limiting factor:
  * Neither -notcontains / -contains nor .Contains() can take full advantage of presorted input.
  * If the input is already sorted: using the .Contains() .NET method (from the IList interface) rather than the PowerShell -contains / -notcontains operators (which an earlier version of k7s5a's answer used) improves performance.
  * Joining arrays $a2 and $a3 once, up front, and then using the joined array in the pipeline improves performance (that way, the arrays do not have to be joined in every iteration).
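The difference between joining the arrays in every iteration and joining them once, up front, can be sketched as follows. The small sample arrays are hypothetical and chosen only to keep the output short; the two variants correspond to the "single-pass" and "two-pass" tests in the benchmark above:

```powershell
# Hypothetical sample data.
$a1 = 1..10
$a2 = 1..4
$a3 = 5..8

# Single-pass: ($a2 + $a3) is evaluated anew for every element of $a1,
# so a new joined array is created in each iteration.
$singlePass = $a1 | Where-Object { !($a2 + $a3).Contains($_) }

# Two-pass: join the arrays once, up front, then reuse the joined array.
$a2AndA3 = $a2 + $a3
$twoPass = $a1 | Where-Object { !$a2AndA3.Contains($_) }

"Single-pass: $singlePass"
"Two-pass:    $twoPass"
```

Both variants return the same elements (9 and 10 here); they differ only in how often the joined array is built, which is why the two-pass form is faster in the timings above.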