Merge two files without pseudo-repetitions

I have two text files, file1.txt and file2.txt, that contain one word per line, like this:

fare
word
word-ed
wo-ded
wor

and

fa-re
text
uncial
woded
wor
worded

or something like that. By "word" I mean a sequence of letters a-z, possibly accented, possibly containing the symbol -. My question is: how can I create a third file, output.txt, from the Linux command line (using awk, sed, etc.) from these two files, so that it satisfies the following conditions:

  • If the same word occurs in both files, output.txt contains it exactly once.
  • If a hyphenated version of a word in one file (for example, fa-re in file2.txt) occurs in the other file, then only the hyphenated version is kept in output.txt (in our example, only fa-re is kept).

Thus, output.txt should contain the following words: fa-re word word-ed wo-ded wor text uncial

================ Edit =========================

I have changed the example files and added the expected output. I will manually make sure there are no words hyphenated in different places (e.g. wod-ed and wo-ded).



3 answers


Another awk solution, saved here as npr.awk:



!($1 in a) || $1 ~ "-" { 
    key = value = $1; gsub("-","",key); a[key] = value 
}
END { for (i in a) print a[i] }

$ awk -f npr.awk file1.txt file2.txt
text
word-ed
uncial
wor
wo-ded
word
fa-re
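
The order in which awk's for (i in a) loop visits keys is unspecified, so the order of the lines above can differ between awk implementations. A minimal sketch, assuming sorted output is acceptable: it wraps the same logic in a shell function and sorts the result. The name merge_words and the extra NF guard (which skips blank lines) are illustrative additions, not part of the answer above.

```shell
# merge_words: hypothetical helper name; same logic as npr.awk, with sorted output.
merge_words() {
    awk '
        # Skip blank lines; keep a word if its key is new or it is hyphenated.
        NF && (!($1 in a) || $1 ~ /-/) {
            key = $1
            gsub(/-/, "", key)     # key is the word with hyphens removed
            a[key] = $1            # a hyphenated spelling overwrites a plain one
        }
        END { for (k in a) print a[k] }
    ' "$@" | LC_ALL=C sort         # byte-order sort, independent of the locale
}
```

It could be used as merge_words file1.txt file2.txt > output.txt.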

      



Awk solution

!($1 in words) {
    split($1, f, "-")
    w = f[1] f[2]
    if (f[2])
        words[w] = $1
    else
        words[w]
}
END {
    for (k in words)
        if (words[k])
            print words[k]
        else
            print k
}

      

$ awk -f script.awk file1.txt file2.txt
wor
fa-re
text
wo-ded
uncial
word-ed
word

      

Structure

!($1 in words) {
    ...
}

      

Only process the line if the first field is not already a key in the array words.


split($1, f, "-")

      

Splits the first field into the array f, using - as the delimiter. The first and second parts of the word will be in f[1] and f[2] respectively. If the word is not hyphenated, it will be entirely in f[1].
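
To see what split leaves in f for a hyphenated and for a plain word, here is a small sketch. The helper name show_parts is hypothetical and exists only for this illustration.

```shell
# show_parts: hypothetical helper; prints the part count, then f[1] and f[2].
show_parts() {
    printf '%s\n' "$1" |
    awk '{ n = split($1, f, "-"); printf "%d:%s|%s\n", n, f[1], f[2] }'
}
```

Here show_parts word-ed prints 2:word|ed, while show_parts word prints 1:word| because f[2] is empty.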




w = f[1] f[2]

      

Builds the dehyphenated word w by concatenating the first and second parts of the word. If the word was not hyphenated to begin with, the result is the same as the original word, because f[2] is empty.


if (f[2])
    words[w] = $1
else
    words[w]

      

Store the dehyphenated word as a key in the array words. If the word was hyphenated (f[2] not empty), store the original hyphenated word as that key's value.


END {
    for (k in words)
        if (words[k])
            print words[k]
        else
            print k
}

      

After the files have been processed, iterate through the array words; if a key holds a value (the hyphenated word), print that value, otherwise print the key (no hyphen).
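
Note that w = f[1] f[2] joins only the first two parts, so a word with two or more hyphens would lose its tail in the key. A minimal sketch of a variant, assuming such words can occur and that the first-seen order of the words should be preserved. The names merge_keep_order, best and order are illustrative, not part of the answer above.

```shell
# merge_keep_order: hypothetical helper; strips all hyphens and keeps input order.
merge_keep_order() {
    awk '
        NF {
            key = $1
            gsub(/-/, "", key)        # remove every hyphen, not just the first
            if (!(key in best)) {
                order[++n] = key      # remember the order keys were first seen
                best[key] = $1
            } else if ($1 ~ /-/) {
                best[key] = $1        # a hyphenated spelling replaces a plain one
            }
        }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    ' "$@"
}
```

With the two example files, merge_keep_order file1.txt file2.txt prints the words in the same order as the expected output listed in the question.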



This is not exactly what you asked for, but it might be better suited to what you need.

awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for( i in w) print w[i]}' file1.txt file2.txt

      

This groups all the words in the files by equivalence class (words that match once hyphens are removed). You can then run a second pass over this result to get what you want.

uncial
word
woded wo-ded 
wor wor
worded word-ed
text
fa-re fare

      

The benefit is that you do not have to check manually whether there are alternative hyphenations of a word, and you can see how many different instances you have for each word. For example, this filters the previous list down to the desired output:

awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0)w=$i; print w}'
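
The two passes can be chained into a single pipeline. A sketch, assuming the word files are passed as arguments. The function name group_then_pick and the NF guard against blank lines are illustrative additions.

```shell
# group_then_pick: hypothetical helper chaining the grouping and filtering passes.
group_then_pick() {
    # Pass 1: collect all spellings that share a dehyphenated key on one line.
    awk 'NF { k = $1; gsub(/-/, "", k); w[k] = $1 FS w[k] }
         END { for (i in w) print w[i] }' "$@" |
    # Pass 2: per line, print a hyphenated spelling if present, else the first word.
    awk '{ w = $1; for (i = 1; i <= NF; i++) if ($i ~ /-/) w = $i; print w }'
}
```

As with the other for-in based scripts, the line order is unspecified, so pipe the result through sort if a stable order matters, for example group_then_pick file1.txt file2.txt | sort > output.txt.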

      
