Split CSV with PowerShell
I have large CSV files (50-500 MB each). Running complex PowerShell commands on them takes forever and/or runs into memory issues.
Processing the data requires grouping by a common field, for example column A. So, assuming the data is already sorted by that column, if I split the files arbitrarily (i.e. every x thousand rows), matching records could still end up in different parts. There are thousands of different groups in A, so splitting each group into its own file would create too many files.
How can I split a file into parts of roughly 10,000 lines without breaking up groups? For example, lines 1-13 have A1 in column A, lines 14-17 have A2, and so on, and lines 9997-10012 have A784. In this case, I would like the first file to contain lines 1-10012 and the next to start at line 10013.
Obviously, I would like to keep whole rows (not just column A), so if I pasted all the resulting files together, it would be the same as the original file.
Not tested. This assumes column A is the first column and that the data is comma-delimited. You will need to adjust the line that builds the regex to match your data.
$count = 0
$tail = @()
$header = get-content file.csv -TotalCount 1
#-ReadCount sets the batch size, and so the approximate size of each part
get-content file.csv -ReadCount 1000 |
foreach {
    #add tail entries from last batch to beginning of this batch
    $newbatch = $tail + $_
    #create regex to match last entry in this batch
    #(the trailing comma stops A1 from also matching A10)
    $regex = '^' + [regex]::Escape(($newbatch[-1].split(',')[0])) + ','
    #Extract everything that doesn't match the last entry to new file
    #Add header if this is not the first file
    if ($count)
    {
        $header |
        set-content "c:\somedir\filepart_$count"
    }
    $newbatch -notmatch $regex |
    add-content "c:\somedir\filepart_$count"
    #Extract tail entries to add to next batch
    $tail = @($newbatch -match $regex)
    #Increment file counter
    $count++
}
#Write the tail of the final batch, which the loop never flushes
if ($tail)
{
    @($header) + $tail |
    set-content "c:\somedir\filepart_$count"
}
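Since the script is untested, it may be worth checking on a small sample that the parts still add up to the original. Below is a minimal sketch of such a check. Test-CsvParts and its parameter names are made up for illustration, and it assumes every part starts with a copy of the header row. It reads everything into memory, so it is only meant for sample-sized files.

```powershell
# Hypothetical helper, not part of the script above.
# Assumes each part file begins with the same header row as the original.
function Test-CsvParts {
    param(
        [string]$Original,   # path of the unsplit file
        [string[]]$Parts     # part files, in the order they were written
    )
    # Data rows of the original, header skipped
    $expected = @(Get-Content $Original | Select-Object -Skip 1)
    # Data rows of all parts, each part's header skipped
    $actual = @(foreach ($p in $Parts) { Get-Content $p | Select-Object -Skip 1 })
    # True only if the rows are identical and in the same order
    ($expected -join "`n") -ceq ($actual -join "`n")
}
```

Pass the parts in the order they were created; the function returns $true when the rows match.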
This is my attempt; it got messy :-P It loads the entire file into memory to split it, but as plain text. That should take less memory than imported objects, but still roughly the size of the file.
$filepath = "C:\Users\graimer\Desktop\file.csv"
$file = Get-Item $filepath
$content = Get-Content $file.FullName
$csvheader = $content[0]
$lines = $content.Count
$minlines = 10000
$filepart = 1
$start = 1
#-le so that a final group of a single line is not dropped
while ($start -le $lines - 1) {
    #Set minimum $end value (last line)
    if ($start + $minlines -le $lines - 1) { $end = $start + $minlines - 1 } else { $end = $lines - 1 }
    #Value to compare. ColA is first column in my file = [0] . ColB is second column = [1]
    $avalue = $content[$end].split(",")[0]
    #If not last line in file
    if ($end -ne $lines - 1) {
        #Increase $end by 1 while ColA is the same, without running past the last line
        while ($end -lt $lines -and $content[$end].split(",")[0] -eq $avalue) { $end++ }
        #Return to last line with equal ColA value
        $end--
    }
    #Create new csv-part next to the original, e.g. file.part1.csv
    $filename = Join-Path $file.DirectoryName ($file.BaseName + ".part$filepart" + $file.Extension)
    @($csvheader, $content[$start..$end]) | Set-Content $filename
    #Fix counters
    $filepart++
    $start = $end + 1
}
file.csv:
ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
A5,1,10
A6,1,10
A7,1,10
Results (I used $minlines = 5):
file.part1.csv:
ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
file.part2.csv:
ColA,ColB,ColC
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
file.part3.csv:
ColA,ColB,ColC
A5,1,10
A6,1,10
A7,1,10
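If holding the whole file in memory is the problem, the same boundary rule can probably be applied while streaming the file line by line. A minimal, untested sketch: Split-CsvByGroup and its parameter names are made up for illustration. It assumes a header row, a comma-delimited file already sorted by the first column, and no quoted fields containing commas or line breaks.

```powershell
# Hypothetical function, a sketch under the assumptions stated above.
function Split-CsvByGroup {
    param(
        [string]$Path,
        [int]$MinLines = 10000
    )
    $file = Get-Item $Path
    $reader = New-Object System.IO.StreamReader $file.FullName
    $writer = $null
    try {
        $header = $reader.ReadLine()
        $part = 0
        $written = 0
        $lastKey = $null
        while ($null -ne ($line = $reader.ReadLine())) {
            # First column is the grouping key
            $key = $line.Split(',')[0]
            # Open a new part for the first row, or at the first group
            # boundary after the current part has reached $MinLines rows
            if ($null -eq $writer -or ($written -ge $MinLines -and $key -ne $lastKey)) {
                if ($writer) { $writer.Dispose() }
                $part++
                $name = $file.BaseName + ".part$part" + $file.Extension
                $outPath = Join-Path $file.DirectoryName $name
                $writer = New-Object System.IO.StreamWriter $outPath
                $writer.WriteLine($header)
                $written = 0
            }
            $writer.WriteLine($line)
            $written++
            $lastKey = $key
        }
    }
    finally {
        # Close both files even if something fails part-way through
        if ($writer) { $writer.Dispose() }
        $reader.Dispose()
    }
}
```

It would be called as Split-CsvByGroup -Path C:\Users\graimer\Desktop\file.csv -MinLines 10000. Only one line is held in memory at a time, at the cost of not knowing the number of parts in advance.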
This requires PowerShell v3 (because of the -Append parameter of Export-Csv).
Also, I am assuming you have column headers and that the first column is named col1. Adjust if necessary.
import-csv MYFILE.csv|foreach-object{$_|export-csv -notypeinfo -noclobber -append ($_.col1 + ".csv")}
This will create one file for each distinct value in the first column, with that value as the filename.
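Because that means one file per distinct value, it may help to count the distinct values first. A small sketch, where Get-GroupCount is a made-up helper name and col1 is the same assumed column name as above:

```powershell
# Hypothetical helper; col1 is the assumed name of the first column.
function Get-GroupCount {
    param(
        [string]$Path,
        [string]$Column = 'col1'
    )
    # -NoElement keeps only the per-value counts, not the grouped rows.
    # @() makes sure a single group still comes back as an array.
    $groups = @(Import-Csv $Path | Group-Object -Property $Column -NoElement)
    $groups.Count
}
```

Get-GroupCount MYFILE.csv then returns the number of files the one-liner would create.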