Merge two files without pseudo-repetitions

Question

I have two text files, file1.txt and file2.txt, that contain lines of words like this:

fare
word
word-ed
wo-ded
wor

and

fa-re
text
uncial
woded
wor
worded

or something like that. By a word, I mean a sequence of the letters a-z, possibly accented, together with the symbol -. My question is: how can I create a third file, output.txt, from the Linux command line (using awk, sed, etc.) from these two files so that it satisfies the following conditions:

If the same word occurs in both files, output.txt contains it exactly once.
If a hyphenated version of a word (for example, fa-re in file2.txt) occurs in one file and the unhyphenated version occurs in the other, only the hyphenated version is kept in output.txt (in our example, only fa-re is kept).

Thus, output.txt should contain the following words:

fa-re
word
word-ed
wo-ded
wor
text
uncial

================ Edit =========================

I have also changed the files and given the expected output file. I will make sure manually that no word appears with different hyphenations (e.g. wod-ed and wo-ded).

+3

command-line linux awk sed

usr203050 07 Aug 15 at 19:02

3 answers

Awk solution

!($1 in words) {
    split($1, f, "-")
    w = f[1] f[2]
    if (f[2])
        words[w] = $1
    else
        words[w]
}
END {
    for (k in words)
        if (words[k])
            print words[k]
        else
            print k
}

$ awk -f script.awk file1.txt file2.txt
wor
fa-re
text
wo-ded
uncial
word-ed
word

Structure

!($1 in words) {
    ...
}

Only process the line if the first field is not already a key in the array words.

split($1, f, "-")

Splits the first field into the array f, using - as the delimiter. The first and second parts of the word will be in f[1] and f[2] respectively. If the word is not hyphenated, all of it will be in f[1].

w = f[1] f[2]

Assigns the de-hyphenated word to w by concatenating the first and second parts of the word. If the word was not hyphenated to begin with, the result is the same word, because f[2] is empty.

if (f[2])
    words[w] = $1
else
    words[w]

Stores the de-hyphenated word as a key in the array words. If the word was hyphenated (f[2] is not empty), the original hyphenated form is stored as that key's value.

END {
    for (k in words)
        if (words[k])
            print words[k]
        else
            print k
}

After the files have been processed, iterate through the array words: if a key holds a value (the hyphenated word), print the value; otherwise print the key (the word without a hyphen).
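Note that joining only f[1] and f[2] covers words with at most one hyphen. Here is a minimal sketch of a variant that strips every hyphen with gsub instead. It is not the answer's script: the sample words (including un-der-go), the temporary directory, and the final sort (added only to make the order reproducible, since for-in order is unspecified in awk) are assumptions for illustration.

```shell
# Hypothetical sample input, one word per line (not the files from the question).
tmp=$(mktemp -d)
printf '%s\n' fare word-ed un-der-go > "$tmp/file1.txt"
printf '%s\n' fa-re worded undergo text > "$tmp/file2.txt"

merged=$(awk '
NF {                                # skip blank lines
    key = $1
    gsub(/-/, "", key)              # strip every hyphen, not only the first
    if (!(key in words))
        words[key] = ""             # first sighting: remember the plain key
    if (index($1, "-"))
        words[key] = $1             # a hyphenated spelling takes precedence
}
END {
    for (k in words) {
        if (words[k] != "")
            print words[k]          # hyphenated form was seen
        else
            print k                 # only the plain form was seen
    }
}' "$tmp/file1.txt" "$tmp/file2.txt" | LC_ALL=C sort)

printf '%s\n' "$merged"
rm -rf "$tmp"
```

With this sample input, un-der-go and undergo collapse into the same key, and the hyphenated spelling is the one printed.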

+1

John B 07 Aug 15 at 20:40

This is not exactly what you asked for, but it might be better suited to what you need.

awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for( i in w) print w[i]}'

This groups all the words in the files into equivalence classes (words that match once hyphens are removed). You can run another pass over this result to get what you want.

uncial
word
woded wo-ded 
wor wor
worded word-ed
text
fa-re fare

The advantage is that you don't have to check manually whether there are alternative hyphenations of a word, and you can see how many different instances you have of each word. For example, this filters the previous list down to the desired output.

awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0)w=$i; print w}'
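The two commands can be chained into one pipeline. This is a sketch only, assuming the sample files from the question hold one word per line; the temporary directory and the trailing sort (there just to make the order reproducible) are my additions, not part of the answer.

```shell
# Recreate the sample files from the question, one word per line.
tmp=$(mktemp -d)
printf '%s\n' fare word word-ed wo-ded wor > "$tmp/file1.txt"
printf '%s\n' fa-re text uncial woded wor worded > "$tmp/file2.txt"

# Pass 1 groups spellings by their hyphen-free key;
# pass 2 keeps a hyphenated spelling from each group when one exists.
result=$(awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for (i in w) print w[i]}' \
             "$tmp/file1.txt" "$tmp/file2.txt" |
         awk '{w=$1; for (i=1; i<=NF; i++) if (match($i,/-/)!=0) w=$i; print w}' |
         LC_ALL=C sort)

printf '%s\n' "$result"
rm -rf "$tmp"
```

If a group holds several different hyphenations, the second pass keeps the last one on the line, so the intermediate list is still worth inspecting by eye.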

+1

karakfa 07 Aug 15 at 21:18

jas · Accepted Answer · 2015-08-07T20:48:53+0000

Another awk:

!($1 in a) || $1 ~ "-" { 
    key = value = $1; gsub("-","",key); a[key] = value 
}
END { for (i in a) print a[i] }

$ awk -f npr.awk file1.txt file2.txt
text
word-ed
uncial
wor
wo-ded
word
fa-re
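To see what the `|| $1 ~ "-"` part of the condition does, here is a small sketch that runs the same script with the files in both orders and compares the results. It assumes the sample files from the question hold one word per line; the temporary directory, the comments, and the sort (for a reproducible order) are my additions.

```shell
# Recreate the sample files from the question, one word per line.
tmp=$(mktemp -d)
printf '%s\n' fare word word-ed wo-ded wor > "$tmp/file1.txt"
printf '%s\n' fa-re text uncial woded wor worded > "$tmp/file2.txt"

cat > "$tmp/npr.awk" <<'EOF'
# Run the block for a word not seen before, or for any hyphenated word.
!($1 in a) || $1 ~ "-" {
    key = value = $1
    gsub("-", "", key)    # key: the word with its hyphens removed
    a[key] = value        # a hyphenated spelling overwrites a plain one
}
END { for (i in a) print a[i] }
EOF

forward=$(awk -f "$tmp/npr.awk" "$tmp/file1.txt" "$tmp/file2.txt" | LC_ALL=C sort)
reverse=$(awk -f "$tmp/npr.awk" "$tmp/file2.txt" "$tmp/file1.txt" | LC_ALL=C sort)

printf '%s\n' "$forward"
[ "$forward" = "$reverse" ] && echo "same result in either file order"
rm -rf "$tmp"
```

Because a hyphenated word always enters the block, it replaces a plain spelling stored earlier, while a plain word that arrives later is skipped by the `$1 in a` test. That is why the order of the files should not matter for this data.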
