Split CSV with PowerShell

I have large CSV files (50-500 MB each). Running complex PowerShell commands on them takes forever and/or hits memory issues.

The processing requires grouping on a common field, say column A. The data is already sorted by that column, so if I split these files naively (e.g. every x thousand rows), rows belonging to the same group can end up in different parts. And there are thousands of distinct groups in column A, so splitting into one file per group would create far too many files.

How can I split it into files of roughly 10,000 lines without breaking up groups? For example, lines 1-13 have A1 in column A, lines 14-17 have A2, and so on, and lines 9997-10012 have A784. In that case I would like the first file to contain lines 1-10012 and the next one to start at line 10013.

Obviously, I want to keep whole rows (not just column A), so that if I concatenated all the resulting files back together, the result would be the same as the original file.
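To make the rule concrete, here is a sketch of the intended splitting logic in Python (illustrative only; the function name and chunk size are made up, not part of any answer below):

```python
def split_csv(lines, chunk_size=10000):
    """Yield chunks of at least chunk_size data rows, never cutting a
    run of equal column-A values in two. Each chunk gets the header."""
    it = iter(lines)
    header = next(it)
    chunk, prev_key = [], None
    for line in it:
        key = line.split(",", 1)[0]
        # Cut a boundary only once the quota is met AND the group changes
        if len(chunk) >= chunk_size and key != prev_key:
            yield [header] + chunk
            chunk = []
        chunk.append(line)
        prev_key = key
    if chunk:
        yield [header] + chunk
```

Concatenating the chunks (minus the repeated headers) reproduces the original file.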



3 answers


Not tested. This assumes ColumnA is the first column and the data is comma-delimited. You will need to adjust the line that builds the regex to match your data.



 $count = 0
 $tail = @()
 $header = Get-Content file.csv -TotalCount 1

 Get-Content file.csv -ReadCount 1000 |
  foreach {
   #Add tail entries from the last batch to the beginning of this batch
   $newbatch = $tail + $_

   #Create a regex matching the column-A value of the last entry in this batch
   $regex = '^' + [regex]::Escape(($newbatch[-1].split(',')[0]))

   #Write everything that doesn't match the last entry to a new file,
   #adding the header first if this is not the first file
   if ($count)
     {
       $header |
         set-content "c:\somedir\filepart_$count"
     }

   $newbatch -notmatch $regex |
    add-content "c:\somedir\filepart_$count"

   #Hold back the tail entries to prepend to the next batch
   $tail = @($newbatch -match $regex)

   #Increment the file counter
   $count++
 }

 #Flush the tail left over after the final batch, or its last group is lost
 if ($tail)
   {
     $header | set-content "c:\somedir\filepart_$count"
     $tail | add-content "c:\somedir\filepart_$count"
   }
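The same batch-plus-tail idea, sketched in Python for anyone who wants to check the logic (the helper names and batch size are mine, not part of the answer):

```python
from itertools import islice

def split_batched(lines, write_part, batch_size=1000):
    """Read fixed-size batches, hold back the trailing rows that share
    the last row's column-A value, and prepend them to the next batch,
    so no group is ever split across two output parts."""
    it = iter(lines)
    header = next(it)
    tail, count = [], 0
    while True:
        fresh = list(islice(it, batch_size))
        if not fresh:
            break
        batch = tail + fresh
        last_key = batch[-1].split(",", 1)[0]
        # Rows matching the last key are held back for the next batch
        body = [ln for ln in batch if ln.split(",", 1)[0] != last_key]
        tail = [ln for ln in batch if ln.split(",", 1)[0] == last_key]
        if body:
            write_part(count, [header] + body)
            count += 1
    # Flush whatever is left over after the final batch
    if tail:
        write_part(count, [header] + tail)
```

Like the PowerShell above, this only works because the input is sorted by column A, so all rows with the last key sit contiguously at the end of the batch.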

      



This is my attempt, it got a bit convoluted :-P It loads the entire file into memory before splitting it, but it is plain text, so it should take less memory than imported objects (still on the order of the file size, though).

$filepath = "C:\Users\graimer\Desktop\file.csv"
$file = Get-Item $filepath
$content = Get-Content $file
$csvheader = $content[0]
$lines = $content.Count
$minlines = 10000
$filepart = 1

$start = 1

while ($start -lt $lines) {
    #Set minimum $end value (last line)
    if ($start + $minlines -le $lines - 1) { $end = $start + $minlines - 1 } else { $end = $lines - 1 }

    #Value to compare. ColA is the first column in my file = [0]; ColB would be the second = [1]
    $avalue = $content[$end].split(",")[0]
    #If not the last line of the file
    if ($end -ne $lines - 1) {
        #Increase $end while ColA stays the same, without running past the last line
        while ($end -lt $lines -and $content[$end].split(",")[0] -eq $avalue) { $end++ }
        #Return to the last line with the equal ColA value
        $end--
    }
    #Create new csv-part
    $filename = $file.FullName.Replace($file.BaseName, ($file.BaseName + ".part$filepart"))
    @($csvheader, $content[$start..$end]) | Set-Content $filename

    #Fix counters
    $filepart++
    $start = $end + 1
}

      

file.csv:



ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
A5,1,10
A6,1,10
A7,1,10

      

Results (I used $minlines = 5):

file.part1.csv:

ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20

file.part2.csv:

ColA,ColB,ColC
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60

file.part3.csv:

ColA,ColB,ColC
A5,1,10
A6,1,10
A7,1,10

      



This requires PowerShell v3 (due to the -Append parameter on Export-Csv).

Also, I am assuming you have column headers and the first column is named col1. Adjust if necessary.

Import-Csv MYFILE.csv | ForEach-Object { $_ | Export-Csv -NoTypeInformation -NoClobber -Append ($_.col1 + ".csv") }

This will create one file for each distinct value in the first column, with that value as the filename.
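Roughly the same fan-out, sketched in Python (the function name and the dest_dir parameter are illustrative additions; opening the output file per row mirrors the simplicity of the one-liner, not its performance):

```python
import csv
import os

def split_by_first_column(src_path, dest_dir="."):
    """Append each row to a CSV file named after the value in the
    first column, writing the header once per output file."""
    with open(src_path, newline="") as f:
        reader = csv.DictReader(f)
        first_col = reader.fieldnames[0]
        for row in reader:
            out_path = os.path.join(dest_dir, row[first_col] + ".csv")
            is_new = not os.path.exists(out_path)
            # Re-opening per row is slow but simple; cache handles for speed
            with open(out_path, "a", newline="") as g:
                writer = csv.DictWriter(g, fieldnames=reader.fieldnames)
                if is_new:
                    writer.writeheader()
                writer.writerow(row)
```

Note that, as the question points out, with thousands of distinct values this produces thousands of files; it only fits when one file per group is acceptable.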
