Probability Distribution of each unique number in the array (length unknown) after excluding zeros

Part of my data file looks like

ifile.txt
1
1
3
0
6
3
0
3
3
5

      

I would like to find the probability of every number, excluding zeros. For example, P(1) = 2/8, P(3) = 4/8, etc.

Desired output

ofile.txt
1  0.250
3  0.500
5  0.125
6  0.125

      

The 1st column shows the unique numbers (except 0) and the 2nd column shows their probabilities. I tried the following, but it looks like a very long-winded approach. I ran into a problem with the for loop, as there are so many unique numbers to list.

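n=$(awk '$1 > 0 {print $0}' ifile.txt | wc -l)
for i in 1 3 5 6 .....
do
# pass the shell variable into awk with -v; "$i" is not expanded inside single quotes
n1=$(awk -v i="$i" '$1 == i {print $0}' ifile.txt | wc -l)
p=$(echo "$n1/$n" | bc -l)
# one argument per format specifier
printf "%d %.3f\n" "$i" "$p" >> ofile.txt
done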

      



3 answers


Use an associative array in awk to get the count of each unique number in a single pass.



awk '$0 != "0" { count[$0]++; total++ } 
     END { for(i in count) printf("%d %.3f\n", i, count[i]/total) }' ifile.txt | sort -n > ofile.txt
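
For files that may contain blank lines, padded values, or zeros written as 0.0, a slightly more defensive variant of the same single-pass idea could look like the sketch below. The function name nonzero_probs is made up for this illustration, and the numeric test ($1 + 0 != 0) is an assumption about what should count as a zero.

```shell
# Hypothetical helper (name chosen for this sketch); reads one number per line on stdin.
nonzero_probs() {
    awk 'NF && $1 + 0 != 0 { count[$1]++; total++ }
         END {
             if (total == 0) exit
             for (v in count) printf "%s %.3f\n", v, count[v] / total
         }' | sort -n
}

# Sample data from the question.
printf '%s\n' 1 1 3 0 6 3 0 3 3 5 | nonzero_probs
# 1 0.250
# 3 0.500
# 5 0.125
# 6 0.125
```

With the real data this would be called as nonzero_probs < ifile.txt > ofile.txt. The sort -n at the end is needed because awk does not guarantee any order for "for (v in count)".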

      



How about sort | uniq -c to get the count of each number in roughly n log n time instead of n^2, and then dividing each count by your total non-zero count from wc -l?
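
A minimal sketch of that pipeline, assuming one integer per line and no blank lines. The function name probs_sort_uniq is hypothetical. Instead of a separate wc -l call, this version adds up the counts printed by uniq -c, which gives the same non-zero total.

```shell
# Hypothetical helper (name chosen for this sketch); reads one number per line on stdin.
# grep -vx drops lines that are exactly 0, sort -n groups equal values,
# uniq -c prefixes each distinct value with its count, and awk divides
# each count by the sum of all counts.
probs_sort_uniq() {
    grep -vx '0' | sort -n | uniq -c |
    awk '{ cnt[NR] = $1; val[NR] = $2; total += $1 }
         END { for (k = 1; k <= NR; k++) printf "%s %.3f\n", val[k], cnt[k] / total }'
}

# Sample data from the question.
printf '%s\n' 1 1 3 0 6 3 0 3 3 5 | probs_sort_uniq
# 1 0.250
# 3 0.500
# 5 0.125
# 6 0.125
```

The awk step has to hold the uniq -c lines in memory until END, because the total is only known after the last line. That is one line per distinct value, not one per input line.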





Novelocrat's suggestion of sort|uniq -c can be used here:

sed '/^0/ d' ifile.txt|sort|uniq -c >i
awk 'FNR==NR{n+=$1;next;}{print $2,$1/n}' i i

      

Short description:

sed '/^0/ d' ifile.txt removes the lines starting with 0, and sort|uniq -c >i then gives you the file i:

   2 1
   4 3
   1 5
   1 6

      

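In awk, the file i is passed twice. FNR==NR is true only while the first copy is being read, so FNR==NR{n+=$1;next;} totals column 1 of i into n (next skips the remaining commands for that line). On the second copy, print $2,$1/n prints column 2 of i followed by column 1 divided by n.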
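
The same two-pass idea can be sketched with a temporary file in place of the fixed name i, and with printf for the three-decimal format the question asks for. The function name probs_two_pass is made up for this sketch, and it assumes one integer per line with no blank lines.

```shell
# Hypothetical helper (name chosen for this sketch); reads one number per line on stdin.
probs_two_pass() {
    counts=$(mktemp) || return 1
    # Store "count value" pairs, one per distinct non-zero value.
    grep -vx '0' | sort -n | uniq -c > "$counts"
    # First read of the file (FNR == NR): add up the counts into n.
    # Second read: print each value with its share of n.
    awk 'FNR == NR { n += $1; next }
         { printf "%s %.3f\n", $2, $1 / n }' "$counts" "$counts"
    rm -f "$counts"
}

# Sample data from the question.
printf '%s\n' 1 1 3 0 6 3 0 3 3 5 | probs_two_pass
# 1 0.250
# 3 0.500
# 5 0.125
# 6 0.125
```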
