How do I write a wrapper script to take the first record for a specific column?

Below is a small file for demonstration. It has two columns, and I would like to write a shell script that keeps only the first occurrence of each name.

--- input.txt ---

Name,Count
Linux,2
Unix,10
Linux,10
Unix,4
Windows,6

      

--- desired output.txt ---

Name,Count
Linux,2
Unix,10
Windows,6

      

The real input.txt is much larger (gigabytes in size), so something that scales well would be great.

Also, I apologize if similar questions have been asked before (I could not find a solution to this by searching).


2 answers


This would do it:

awk -F, '!seen[$1]++' input.txt

      



-F, sets the input field separator to a comma, so $1 on each line is the part before the comma (Name, Linux, Unix, etc.). seen is an array that counts how many times each value of $1 has already appeared. Each time a line is read, seen[$1] is incremented. Because ++ is a post-increment, the test !seen[$1] sees the old value, so the line is printed only when seen[$1] was 0, which is true only the first time a key is encountered. A pattern with no action prints the line, which is why no explicit print is needed.
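Since the question asks for a wrapper script, the one-liner above could be wrapped roughly as follows. This is only a sketch: the function name first_by_column and its argument order are made up for illustration, and the column number is passed to awk with -v so that columns other than the first can be used.

```shell
#!/bin/sh
# first_by_column is a hypothetical helper name, not part of the answer above.
# Usage: first_by_column COLUMN [FILE...]
# Prints every line whose value in COLUMN has not appeared on an earlier line.
# With no FILE arguments, awk reads standard input.
first_by_column() {
    col=$1
    shift
    # -v hands the column number to awk; seen[] holds one counter per key.
    awk -F, -v col="$col" '!seen[$col]++' "$@"
}
```

It could then be called as first_by_column 1 input.txt > output.txt. The header line survives because "Name" is itself a first occurrence. The file is read once, line by line, and the seen array holds one entry per distinct key, so memory use grows with the number of distinct names rather than with the size of the file.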


You can also do it with awk like this:

awk -F, '$1 in a{next}{a[$1]}1' input.txt > output.txt
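The one-liner above is compact, so here is the same logic spread over several lines with comments. It is a sketch of how that command can be read, wrapped in a function whose name (first_seen_expanded) is made up for illustration.

```shell
#!/bin/sh
# first_seen_expanded is a hypothetical name used only for this illustration.
# Usage: first_seen_expanded [FILE...]   (reads standard input if no FILE)
first_seen_expanded() {
    awk -F, '
        $1 in a { next }    # key already recorded: skip to the next line
        { a[$1]; print }    # merely referencing a[$1] creates the key; then print
    ' "$@"
}
```

The trailing 1 in the original command is an always-true pattern with the default action, which is equivalent to the explicit print used here.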

      



Also, using mawk in place of other versions of awk will definitely provide a significant speed boost for large files.
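If mawk may or may not be installed on the machine, a wrapper can prefer it and fall back to whatever awk is on the PATH. A minimal sketch, assuming a POSIX shell; the function name dedupe_first is hypothetical:

```shell
#!/bin/sh
# dedupe_first is a hypothetical name. It prefers mawk when it is installed
# and otherwise falls back to the default awk.
# Usage: dedupe_first [FILE...]   (reads standard input if no FILE)
dedupe_first() {
    if command -v mawk >/dev/null 2>&1; then
        awk_bin=mawk
    else
        awk_bin=awk
    fi
    "$awk_bin" -F, '!seen[$1]++' "$@"
}
```

Running it under time, for example time dedupe_first input.txt > output.txt, once with and once without mawk installed is an easy way to check how much difference it makes on your own data.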
