How do I write a wrapper script to take the first record for a specific column?

Question

Below is a small file for demonstration. It has two columns, and I would like to write a shell script that keeps only the first occurrence of each name.

--- input.txt ---

Name,Count
Linux,2
Unix,10
Linux,10
Unix,4
Windows,6

--- desired output.txt ---

Name,Count
Linux,2
Unix,10
Windows,6

The real input.txt is much larger (several GB), so something that scales would be great.

Also, I apologize if similar questions have been asked before (I could not find a solution to this by searching).

unix shell awk

vieplivee 17 Sep 14 at 17:17

2 answers

Tom fenech · Answer 1 · 2014-09-17T17:20:32+0000

This would do it:

awk -F, '!seen[$1]++' input.txt

-F, sets the input field separator to a comma, so $1 on each line is the part before the comma (Name, Linux, Unix, etc.). seen is an array that keeps track of the values of $1 that have already been encountered. Each time a line is read, seen[$1] is incremented for that line's key. Because ++ is a post-increment, the test uses the value from before the increment: the line is printed only when seen[$1] is 0, which is true only the first time a new key is seen.
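Since the question asks for a wrapper script, the one-liner above could be wrapped roughly as follows. This is only a sketch: the function name first_per_key and its optional column and separator arguments are hypothetical and not part of the original answer.

```shell
#!/bin/sh
# first_per_key: hypothetical helper, introduced here for illustration only.
# Usage: first_per_key FILE [COLUMN] [SEPARATOR]
# Prints each line of FILE whose value in COLUMN has not appeared before.
first_per_key() {
    file=$1
    col=${2:-1}   # assumed default: first column
    sep=${3:-,}   # assumed default: comma-separated input

    # seen[key] is 0 (unset) the first time a key appears, so !seen[key]++
    # is true exactly once per key and awk prints that line.
    awk -F"$sep" -v col="$col" '!seen[$col]++' "$file"
}
```

It could be called as first_per_key input.txt > output.txt. The array holds one entry per distinct key, so memory use grows with the number of distinct names rather than with the size of the file.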

John B · Answer 2 · 2014-09-17T18:45:56+0000

You can also do it in awk like this:

awk -F, '$1 in a{next}{a[$1]}1' input.txt > output.txt

Also, substituting mawk for other versions of awk will likely provide a significant speed boost for large files.
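If mawk may or may not be installed on a given machine, one way to act on this suggestion is to choose the interpreter at run time. A minimal sketch, assuming the hypothetical helper names pick_awk and dedupe_first; the awk program itself is the one from this answer, only spaced out.

```shell
#!/bin/sh
# pick_awk: hypothetical helper. Prints "mawk" when it is on PATH,
# otherwise falls back to the default "awk".
pick_awk() {
    if command -v mawk >/dev/null 2>&1; then
        echo mawk
    else
        echo awk
    fi
}

# dedupe_first: hypothetical helper. Keeps the first line for each value
# of the first comma-separated field in the file given as $1.
dedupe_first() {
    # "$1 in a" tests for the key without creating it; a[$1] then records
    # the key, and the trailing 1 prints the current line.
    "$(pick_awk)" -F, '$1 in a { next } { a[$1] } 1' "$1"
}
```

Usage would be dedupe_first input.txt > output.txt. Whether mawk is actually faster on a particular file is worth checking with time on a sample of the real data.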
