Comparing three arrays and getting the values that are only in array 1

I want to find all elements of array $a1 that are in neither array $a2 nor array $a3.

For example:

$a1 = @(1,2,3,4,5,6,7,8)
$a2 = @(1,2,3)
$a3 = @(4,5,6,7)

      

Expected Result:

8

      



2 answers


Try the following:

 $a2AndA3 = $a2 + $a3
 $notInA2AndA3 = $a1 | Where-Object {!$a2AndA3.contains($_)}

      



As one liner:

$notInA2AndA3 = $a1 | Where {!($a2 + $a3).contains($_)}

      



k7s5a's helpful answer is conceptually elegant and convenient, but there is a caveat:

It doesn't scale well, because an array lookup has to be performed for every element of $a1.

At least for larger arrays, PowerShell's Compare-Object cmdlet is the better choice:

If the input arrays are ALREADY SORTED:

(Compare-Object $a1 ($a2 + $a3) | Where-Object SideIndicator -eq '<=').InputObject

      

Note:
* Compare-Object does not require sorted input, but sorted input can significantly improve performance - see below.
* As Esperento57 points out, (Compare-Object $a1 ($a2 + $a3)).InputObject is enough in this particular case, but only because $a2 and $a3 do not contain elements that are missing from $a1.
Therefore, the more general solution is to use the filter Where-Object SideIndicator -eq '<=', since it restricts the results to objects that are present only in the LHS ($a1), and not vice versa.
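
To make the note about the SideIndicator filter concrete, here is a minimal sketch. The array $other is a made-up example (not from the question) whose element 9 does not occur in $a1, so the unfiltered command also reports that element:

```powershell
# Illustrative data: $other is a hypothetical RHS array containing 9, which is NOT in $a1.
$a1    = 1, 2, 3, 4, 5, 6, 7, 8
$other = 1, 2, 3, 4, 5, 6, 7, 9

# Unfiltered: differences from BOTH sides are reported.
$unfiltered = (Compare-Object $a1 $other).InputObject

# Filtered: only the elements unique to the LHS ($a1) remain.
$onlyInA1 = (Compare-Object $a1 $other | Where-Object SideIndicator -eq '<=').InputObject

"Unfiltered: $($unfiltered | Sort-Object)"   # contains both 8 and 9
"Filtered:   $onlyInA1"                      # contains only 8
```

With the arrays from the question, both variants return just 8, which is why the shorter form happens to work there.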

If the input arrays are NOT SORTED:

Explicitly sorting the input arrays before comparing them significantly improves performance:



(Compare-Object ($a1 | Sort-Object) ($a2 + $a3 | Sort-Object) | 
   Where-Object SideIndicator -eq '<=').InputObject
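
If you need this repeatedly, the sort-then-compare command could be wrapped in a small helper. The sketch below is only one possible shape: the function name Get-OnlyInFirst and its parameter names are hypothetical, and it assumes that both arrays are non-empty, because Compare-Object rejects null input.

```powershell
# Hypothetical helper (name and parameters are illustrative, not a built-in cmdlet).
# Assumes both input arrays are non-empty; Compare-Object does not accept $null input.
function Get-OnlyInFirst {
    param(
        [Parameter(Mandatory)] [object[]] $Reference,
        [Parameter(Mandatory)] [object[]] $Difference
    )
    # Sort both sides first, then keep only the elements unique to $Reference.
    Compare-Object ($Reference | Sort-Object) ($Difference | Sort-Object) |
        Where-Object SideIndicator -eq '<=' |
        ForEach-Object InputObject
}

# Usage with deliberately unsorted versions of the arrays from the question.
$a1 = 8, 3, 1, 6, 2, 7, 5, 4
$a2 = 3, 1, 2
$a3 = 7, 4, 6, 5

$result = Get-OnlyInFirst $a1 ($a2 + $a3)
$result   # 8
```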

      


The following example, which uses an array of 10,000 elements, illustrates the difference in performance:

$count = 10000                     # Adjust this number to test scaling.
$a1 = 0..$($count-1)               # With 10,000: 0..9999
$a2 = 0..$($count/2)               # With 10,000: 0..5000
$a3 = $($count/2+1)..($count-3)    # With 10,000: 5001..9997

$(foreach ($pass in 1..2) {

  if ($pass -eq 1 ) {
    $passDescr = "SORTED input"
  } else {
    $passDescr = "UNSORTED input"
    # Shuffle the arrays.
    $a1 = $a1 | Get-Random -Count ([int]::MaxValue)
    $a2 = $a2 | Get-Random -Count ([int]::MaxValue)
    $a3 = $a3 | Get-Random -Count ([int]::MaxValue)
  }

  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "CompareObject, explicitly sorted first"
    Timing = (Measure-Command {
        (Compare-Object ($a1 | Sort-Object) ($a2 + $a3 | Sort-Object) | Where-Object SideIndicator -eq '<=').InputObject |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "CompareObject"
    Timing = (Measure-Command {
        (Compare-Object $a1 ($a2 + $a3) | Where-Object SideIndicator -eq '<=').InputObject |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), two-pass"
    Timing = (Measure-Command {
        $a2AndA3 = $a2 + $a3
        $a1 | Where-Object { !$a2AndA3.Contains($_) } | 
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), two-pass, explicitly sorted first"
    Timing = (Measure-Command {
        $a2AndA3 = $a2 + $a3 | Sort-Object
        $a1 | Sort-Object | Where-Object { !$a2AndA3.Contains($_) } | 
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), single-pass"
    Timing = (Measure-Command {
        $a1 | Where-Object { !($a2 + $a3).Contains($_) } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, two-pass"
    Timing = (Measure-Command {
        $a2AndA3 = $a2 + $a3
        $a1 | Where-Object { $a2AndA3 -notcontains $_ } |
        Out-Host; '---' | Out-Host    
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, two-pass, explicitly sorted first"
    Timing = (Measure-Command {
        $a2AndA3 = $a2 + $a3 | Sort-Object
        $a1 | Sort-Object | Where-Object { $a2AndA3 -notcontains $_ } |
        Out-Host; '---' | Out-Host    
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, single-pass"
    Timing = (Measure-Command {
        $a1 | Where-Object { ($a2 + $a3) -notcontains $_ } |
        Out-Host; '---' | Out-Host    
    }).TotalSeconds
  } 
}) |
  Group-Object TestCategory | ForEach-Object {
    "`n=========== $($_.Name)`n"
    $_.Group | Sort-Object Timing | Select-Object Test, @{ l='Timing'; e={ '{0:N3}' -f $_.Timing } }
  }

      

Example output from my machine (output of missing array elements omitted):

=========== SORTED input


Test                                            Timing
----                                            ------
CompareObject                                   0.068
CompareObject, explicitly sorted first          0.187
!.Contains(), two-pass                          0.548
-notcontains, two-pass                          6.186
-notcontains, two-pass, explicitly sorted first 6.972
!.Contains(), two-pass, explicitly sorted first 12.137
!.Contains(), single-pass                       13.354
-notcontains, single-pass                       18.379

=========== UNSORTED input

CompareObject, explicitly sorted first          0.198
CompareObject                                   6.617
-notcontains, two-pass                          6.927
-notcontains, two-pass, explicitly sorted first 7.142
!.Contains(), two-pass                          12.263
!.Contains(), two-pass, explicitly sorted first 12.641
-notcontains, single-pass                       19.273
!.Contains(), single-pass                       25.174

      

  • While the timings will vary based on many factors, you can see that Compare-Object scales much better if the input is pre-sorted or sorted on demand, and that the performance gap widens as the number of elements grows.

  • If you don't use Compare-Object, performance can be improved somewhat, but the inability to take advantage of sorted input is the fundamental limiting factor:

    • Neither -notcontains / -contains nor .Contains() can take full advantage of presorted input.

    • If the input is already sorted: using the .Contains() .NET method (from the IList interface) rather than the PowerShell -contains / -notcontains operators (which an earlier version of k7s5a's answer used) improves performance.

    • Concatenating arrays $a2 and $a3 once, up front, and then using the concatenated array in the pipeline improves performance (that way, the arrays don't have to be concatenated in every iteration).
