Merge two files without pseudo-repetitions
I have two text files, file1.txt and file2.txt, that contain lines of words like this:
fare
word
word-ed
wo-ded
wor
and
fa-re
text
uncial
woded
wor
worded
or something like that. In short, each line is a sequence of letters a-z, possibly accented, possibly containing a hyphen (-). My question is: how can I create a third file, output.txt, from the Linux command line (using awk, sed, etc.) from these two files so that it satisfies the following conditions:
- If the same word occurs in both files, output.txt contains it exactly once.
- If a hyphenated version of a word (for example, fa-re in file2.txt) occurs in one file and the unhyphenated version occurs in the other, only the hyphenated version is kept in output.txt (in our example, only fa-re is kept).
Thus, output.txt should contain the following words:
fa-re
word
word-ed
wo-ded
wor
text
uncial
================ Edit =========================
I have also changed the files and given the expected output file. I will try to make sure manually that no word appears with two different hyphenations (e.g. wod-ed and wo-ded).
Another awk:
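# Keep a word if its raw form is not yet a key, or if it is hyphenated.
!($1 in a) || $1 ~ "-" {
    key = value = $1
    gsub("-", "", key)    # key: the word with hyphens removed
    a[key] = value        # a hyphenated form overwrites the plain one
}
END { for (i in a) print a[i] }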
$ awk -f npr.awk file1.txt file2.txt
text
word-ed
uncial
wor
wo-ded
word
fa-re
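Note that for (i in a) visits the keys in an unspecified order, which is why the output above does not follow the order of the input files. If the order matters, a minimal sketch along the following lines may help. The order array and the counter n are hypothetical additions, not part of the script above, and the sample files are recreated in a scratch directory only to make the example self-contained.

```shell
# Recreate the sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' fare word word-ed wo-ded wor > "$dir/file1.txt"
printf '%s\n' fa-re text uncial woded wor worded > "$dir/file2.txt"

# Same idea as the script above, plus order[] to remember first appearance.
result=$(awk '
{
    key = $1
    gsub("-", "", key)                         # key: word without hyphens
    if (!(key in a)) order[++n] = key          # first time this word is seen
    if (!(key in a) || $1 ~ "-") a[key] = $1   # hyphenated form wins
}
END { for (i = 1; i <= n; i++) print a[order[i]] }
' "$dir/file1.txt" "$dir/file2.txt")

printf '%s\n' "$result"
rm -r "$dir"
```

On the sample files this prints fa-re, word, word-ed, wo-ded, wor, text, uncial, one per line, which is the order given in the question.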
Awk solution
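# Skip lines whose first field is already a key in words.
!($1 in words) {
    split($1, f, "-")      # f[1], f[2]: the parts around the hyphen
    w = f[1] f[2]          # w: the word without its hyphen
    if (f[2])
        words[w] = $1      # hyphenated: store the original as the value
    else
        words[w]           # not hyphenated: just create the key
}
END {
    for (k in words)
        if (words[k])
            print words[k] # a hyphenated form was seen
        else
            print k
}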
$ awk -f script.awk file1.txt file2.txt
wor
fa-re
text
wo-ded
uncial
word-ed
word
Structure
    !($1 in words) {
        ...
    }
Only process the line if the first field is not already a key in the array words.
    split($1, f, "-")
Splits the first field into the array f, using - as the delimiter. The first and second parts of the word will be in f[1] and f[2] respectively. If the word is not hyphenated, it will be entirely in f[1].
    w = f[1] f[2]
Assigns the unhyphenated word to w by concatenating the first and second parts of the word. If the word was not hyphenated to begin with, the result is the same word, because f[2] is empty.
    if (f[2])
        words[w] = $1
    else
        words[w]
Stores the unhyphenated word as a key in the array words. If the word was hyphenated (f[2] is not empty), the hyphenated form is stored as the value of that key.
    END {
        for (k in words)
            if (words[k])
                print words[k]
            else
                print k
    }
After the files have been processed, iterate through the array words. If a key holds a value (the hyphenated word), print that value; otherwise print the key (the word without a hyphen).
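One limitation worth noting: because only f[1] and f[2] are concatenated, a word with more than one hyphen would lose everything after the second part when the key is built. The question does not say whether such words occur, so the following is only a sketch under that assumption. It builds the key with gsub instead of split, and the sample words (mother-in-law, motherinlaw) are hypothetical.

```shell
# Hypothetical input with a word that contains two hyphens.
result=$(printf '%s\n' mother-in-law motherinlaw word | awk '
{
    key = $1
    gsub(/-/, "", key)     # strip every hyphen, not only the first one
    # Store the word if it is new, or if this form is hyphenated.
    if (!(key in words) || index($1, "-")) words[key] = $1
}
END { for (k in words) print words[k] }
' | LC_ALL=C sort)

printf '%s\n' "$result"
```

With this input the sketch prints mother-in-law and word; the unhyphenated motherinlaw is merged into its hyphenated form.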
This is not exactly what you asked for, but it might be better suited to what you need.
awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for( i in w) print w[i]}'
This groups all words in the files by equivalence class (words that are identical once hyphens are removed). You can run another pass over this result to get what you want.
uncial
word
woded wo-ded
wor wor
worded word-ed
text
fa-re fare
The benefit is that you do not have to check manually whether a word has alternative hyphenations, and you can see how many different instances you have for each word. For example, this will filter the previous list down to the desired output.
awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0)w=$i; print w}'
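For completeness, here is a sketch of how the two passes might be chained into one pipeline. It assumes the sample files from the question, recreated in a scratch directory, and sorts the result with LC_ALL=C sort only so that the output order is reproducible; neither the scratch directory nor the sort is part of the commands above.

```shell
# Recreate the sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' fare word word-ed wo-ded wor > "$dir/file1.txt"
printf '%s\n' fa-re text uncial woded wor worded > "$dir/file2.txt"

# Pass 1 groups the variants of each word on one line;
# pass 2 keeps the hyphenated variant if the line has one, else the first field.
result=$(awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for (i in w) print w[i]}' \
        "$dir/file1.txt" "$dir/file2.txt" |
    awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0) w=$i; print w}' |
    LC_ALL=C sort)

printf '%s\n' "$result"
rm -r "$dir"
```

On the sample files this prints fa-re, text, uncial, wo-ded, wor, word, word-ed, one per line. If a group contains two different hyphenations, the second pass keeps whichever appears last on the line.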