Sorting a hash in Perl when keys are dynamic

Question

I have a hash like this:

my %data = (
    'B2' => {
        'one' => {
            timestamp => '00:12:30'
        },
        'two' => {
            timestamp => '00:09:30'
        }
    },
    'C3' => {
        'three' => {
            timestamp => '00:13:45'
        },
        'adam' => {
            timestamp => '00:09:30'
        }
    }
);

(The structure is actually more complicated; I have simplified it here.)

I want to sort "globally" by timestamp and then by the keys of the inner hashes (one, two, three, adam). But the keys of the inner hashes are dynamic; I don't know what they will be until the data is read from files.

I want the sorted result of the above hash to be as follows:

00:09:30,C3,adam
00:09:30,B2,two
00:12:30,B2,one
00:13:45,C3,three

I have looked at a lot of questions and answers about sorting hashes by keys and/or values, but I could not figure out how to do it when the key names are not known beforehand. (Or maybe I just don't understand them.)

Right now I do it in two steps.

Flattening the hash into an array:

my @flattened;
for my $outer_key (keys %data) {
    for my $inner_key (keys %{$data{$outer_key}}) {
        push @flattened, [
            $data{$outer_key}{$inner_key}{timestamp}
            , $outer_key
            , $inner_key
        ];
    }
}

And then do the sort:

for my $ary (sort { $a->[0] cmp $b->[0] || $a->[2] cmp $b->[2] } @flattened) {
    print join ',' => @$ary;
    print "\n";
}

I am wondering if there is a more concise, elegant and efficient way to do this?

sorting perl hash

user3112401 06 June 15 at 0:52

1 answer

chilemagic · Accepted Answer · 2015-06-06T18:03:59+0000

This type of question might be better suited to the Programmers Stack Exchange site or Code Review. Since it asks about implementation, I think it is fine to ask here. The sites tend to overlap somewhat.

As @DondiMichaelStroma pointed out, and as you already know, your code works great! However, there are many ways to do this. If it were in a small script, I would probably leave it as it is and move on to the next part of the project. If it were in a more professional codebase, I would make some changes.

When I write for a professional codebase, I try to keep a few things in mind:

Readability
Efficiency when it matters
Not gold-plating
Unit testing

So let's take a look at your code:

my %data = (
    'B2' => {
        'one' => {
            timestamp => '00:12:30'
        },
        'two' => {
            timestamp => '00:09:30'
        }
    },
    'C3' => {
        'three' => {
            timestamp => '00:13:45'
        },
        'adam' => {
            timestamp => '00:09:30'
        }
    }
);

The way the data is defined is clear and well formatted. This may not be how %data is built in your code, but perhaps a unit test would have a hash like this.

my @flattened;
for my $outer_key (keys %data) {
    for my $inner_key (keys %{$data{$outer_key}}) {
        push @flattened, [
            $data{$outer_key}{$inner_key}{timestamp}
            , $outer_key
            , $inner_key
        ];
    }
}
for my $ary (sort { $a->[0] cmp $b->[0] || $a->[2] cmp $b->[2] } @flattened) {
    print join ',' => @$ary;
    print "\n";
}

Variable names could be more descriptive, and the array @flattened has some redundant data in it. If you print it using Data::Dumper, you can see that C3 and B2 each appear in several places.

$VAR1 = [
          '00:13:45',
          'C3',
          'three'
        ];
$VAR2 = [
          '00:09:30',
          'C3',
          'adam'
        ];
$VAR3 = [
          '00:12:30',
          'B2',
          'one'
        ];
$VAR4 = [
          '00:09:30',
          'B2',
          'two'
        ];

Maybe this doesn't matter, or maybe you want to keep the ability to get all the data under the key B2.
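To make the cost of the flat layout concrete, here is a minimal sketch of collecting everything that belongs to B2. The rows are hardcoded from the dump above, and @b2_rows is just an illustrative name:

```perl
use strict;
use warnings;

# Rows copied from the dump above: [timestamp, outer key, inner key].
my @flattened = (
    [ '00:13:45', 'C3', 'three' ],
    [ '00:09:30', 'C3', 'adam'  ],
    [ '00:12:30', 'B2', 'one'   ],
    [ '00:09:30', 'B2', 'two'   ],
);

# With the flat layout, finding one group means scanning every row.
my @b2_rows = grep { $_->[1] eq 'B2' } @flattened;

print join( ',', @$_ ), "\n" for @b2_rows;
```

This works, but every lookup by outer key has to walk the whole array.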

Here's another way to store this data:

my %flattened = (
    'B2' => [['one', '00:12:30'],
             ['two', '00:09:30']],
    'C3' => [['three','00:13:45'],
             ['adam', '00:09:30']]
);

This can make sorting more complex, but it simplifies the data structure! Maybe this is getting close to gold-plating, or maybe you will benefit from this data structure in another part of the code. My preference is to keep the data structure simple and add extra code only where it is needed. If you decide you need to dump %flattened to a log file, you may appreciate not seeing duplicated data.
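As a rough illustration of what this layout buys you, the sketch below looks up one group directly and sorts only that group by timestamp. It uses the same sample data; @b2_by_time is a name made up for the example:

```perl
use strict;
use warnings;

# Hash of arrays: outer key => list of [name, timestamp] pairs.
my %flattened = (
    'B2' => [ [ 'one',   '00:12:30' ], [ 'two',  '00:09:30' ] ],
    'C3' => [ [ 'three', '00:13:45' ], [ 'adam', '00:09:30' ] ],
);

# Everything under B2 is a single hash lookup; sort just that group.
# Zero-padded HH:MM:SS strings sort correctly with a string comparison.
my @b2_by_time = sort { $a->[1] cmp $b->[1] } @{ $flattened{B2} };

print "$_->[1],B2,$_->[0]\n" for @b2_by_time;
```

The outer key is stored once per group rather than once per row, which is the redundancy the flat array had.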

Implementation

Design: I think we want to keep this as two separate operations. This will improve the clarity of the code, and we can test each function separately. The first function will convert between the data formats we want to use, and the second function will sort the data. These functions should live in a Perl module, and we can use Test::More to unit test them. I don't know where these functions are called from, so let's assume we call them from main.pl and put them in a module named Helper.pm. These names should be more descriptive, but again I'm not sure what the application is here! Better names result in more readable code.

main.pl

Here's what main.pl might look like. Despite the lack of comments, descriptive names can make it self-documenting. These names could be improved too!

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Utilities::Helper qw(sort_by_times_then_names convert_to_simple_format);

my %data = populate_data();    # placeholder: build %data from your input files

my @sorted_data = @{ sort_by_times_then_names( convert_to_simple_format( \%data ) ) };

print Dumper(@sorted_data);

Utilities/Helper.pm

Is this readable and elegant? I think it could use some improvements. More descriptive variable names would also help in this module. However, it is easy to test, and it keeps our main code clean and the data structures simple.

package Utilities::Helper;
use strict;
use warnings;

use Exporter qw(import);
our @EXPORT_OK = qw(sort_by_times_then_names convert_to_simple_format);

# We could put a comment here explaining the expected input and output formats.
sub sort_by_times_then_names {

    my ( $data_ref ) = @_;

    # Here we use a map/sort chain in the style of the Schwartzian Transform.
    # Normally, we would just be sorting an array. But here we
    # are converting the hash into an array and then sorting it.
    # Maybe that should be broken up into two steps to make it more clear!
    # A final "map { $_ }" stage is not needed, so it is left out.
    my @sorted = sort {
                        $a->[2] cmp $b->[2] # sort by timestamp
                                 ||
                        $a->[1] cmp $b->[1] # then sort by name
                      }
                 map  { my $outer_key=$_;       # convert $data_ref to an array of arrays
                        map {                    # first element is the outer_key
                             [$outer_key, @{$_}] # second element is the name
                            }                    # third element is the timestamp
                            @{$data_ref->{$_}}
                      }
                      keys %{$data_ref};
    # If you want the elements in a different order in the array,
    # you could modify the above code or change it when you print it.
    return \@sorted;
}


# We could put a comment here explaining the expected input and output formats.
sub convert_to_simple_format {
    my ( $data_ref ) = @_;

    my %reformatted_data;

    # $outer_key and $inner_key could be renamed to describe more accurately the data they represent.
    # Are they names? IDs? Places? License plate numbers?
    # Maybe we want to keep it generic so this function can handle different kinds of data.
    # I still like the idea of using nested for loops for this logic, because it is clear and intuitive.
    for my $outer_key ( keys %{$data_ref} ) {
        for my $inner_key ( keys %{$data_ref->{$outer_key}} ) {
            push @{$reformatted_data{$outer_key}},
                 [$inner_key, $data_ref->{$outer_key}{$inner_key}{timestamp}];
        }
    }

    return \%reformatted_data;
}

1;

run_unit_tests.pl

Finally, let's implement some unit tests. This may be more than you were looking for with this question, but I think clean seams for testing are part of elegant code, and I want to demonstrate that. Test::More is really great for this. I'll even throw in a test harness and formatter so we can get nicely formatted output. You can use TAP::Formatter::Console if you don't have TAP::Formatter::JUnit installed.

#!/usr/bin/env perl
use strict;
use warnings;
use TAP::Harness;

my $harness = TAP::Harness->new({
    formatter_class => 'TAP::Formatter::JUnit',
    merge           => 1,
    verbosity       => 1,
    normalize       => 1,
    color           => 1,
    timer           => 1,
});

$harness->runtests('t/helper.t');
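For the TAP::Formatter::Console route, a variant of the runner might look like the sketch below. Two things here are assumptions rather than part of the original setup: the lib entry expects Utilities/Helper.pm to sit under the current directory, and the script falls back to every file under t/ when no test files are given on the command line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use TAP::Harness;

# Same idea as the runner above, but with the console formatter that
# ships with Test::Harness, so no extra module needs to be installed.
my $harness = TAP::Harness->new({
    formatter_class => 'TAP::Formatter::Console',
    lib             => ['.'],    # assumption: Utilities/ lives under the current directory
    verbosity       => 1,
    timer           => 1,
});

# Run the test files given on the command line, or everything under t/.
my @test_files = @ARGV ? @ARGV : glob 't/*.t';
$harness->runtests(@test_files) if @test_files;
```

The output is then plain TAP on the terminal instead of JUnit XML.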

t/helper.t

#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
use Utilities::Helper qw(sort_by_times_then_names convert_to_simple_format);

my %data = (
    'B2' => {
        'one' => {
            timestamp => '00:12:30'
        },
        'two' => {
            timestamp => '00:09:30'
        }
    },
    'C3' => {
        'three' => {
            timestamp => '00:13:45'
        },
        'adam' => {
            timestamp => '00:09:30'
        }
    }
);

my %formatted_data = %{ convert_to_simple_format( \%data ) };

my %expected_formatted_data = (
    'B2' => [['one', '00:12:30'],
             ['two', '00:09:30']],
    'C3' => [['three','00:13:45'],
             ['adam', '00:09:30']]
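);
@$_ = sort { $a->[0] cmp $b->[0] } @$_ for values(%formatted_data), values(%expected_formatted_data);  # hash key order is not guaranteed, so compare in name order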
is_deeply(\%formatted_data, \%expected_formatted_data, "convert_to_simple_format test");

my @sorted_data = @{ sort_by_times_then_names( \%formatted_data ) };

my @expected_sorted_data = ( ['C3','adam', '00:09:30'],
                             ['B2','two',  '00:09:30'],
                             ['B2','one',  '00:12:30'],
                             ['C3','thee','00:13:45'] # intentional typo to demonstrate the failure output
                            );

is_deeply(\@sorted_data, \@expected_sorted_data, "sort_by_times_then_names test");

done_testing;

Test output

The good thing about testing this way is that it will tell you what is wrong when the test fails.

<testsuites>
  <testsuite failures="1"
             errors="1"
             time="0.0478239059448242"
             tests="2"
             name="helper_t">
    <testcase time="0.0452120304107666"
              name="1 - convert_to_simple_format test"></testcase>
    <testcase time="0.000266075134277344"
              name="2 - sort_by_times_then_names test">
      <failure type="TestFailed"
               message="not ok 2 - sort_by_times_then_names test"><![CDATA[not ok 2 - sort_by_times_then_names test

#   Failed test 'sort_by_times_then_names test'
#   at t/helper.t line 45.
#     Structures begin differing at:
#          $got->[3][1] = 'three'
#     $expected->[3][1] = 'thee']]></failure>
    </testcase>
    <testcase time="0.00154280662536621" name="(teardown)" />
    <system-out><![CDATA[ok 1 - convert_to_simple_format test
not ok 2 - sort_by_times_then_names test

#   Failed test 'sort_by_times_then_names test'
#   at t/helper.t line 45.
#     Structures begin differing at:
#          $got->[3][1] = 'three'
#     $expected->[3][1] = 'thee'
1..2
]]></system-out>
    <system-err><![CDATA[Dubious, test returned 1 (wstat 256, 0x100)
]]></system-err>
    <error message="Dubious, test returned 1 (wstat 256, 0x100)" />
  </testsuite>
</testsuites>

In summary, I prefer code that is readable and easy to understand over code that is concise. Sometimes you can write less efficient code that is easier to write and logically simpler. Putting ugly code inside functions is a great way to hide it! It isn't worth messing with the code to save 15 ms at run time. If your data set is large enough that performance becomes a problem, Perl may not be the best tool for the job. If you're really looking for concise code, post a challenge on Code Golf Stack Exchange.
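For comparison, if conciseness is the goal, the flatten and sort steps from the question can be chained into a single expression. This is only a sketch using the question's sample data, and it is not necessarily clearer than the two-step version:

```perl
use strict;
use warnings;

my %data = (
    B2 => { one   => { timestamp => '00:12:30' }, two  => { timestamp => '00:09:30' } },
    C3 => { three => { timestamp => '00:13:45' }, adam => { timestamp => '00:09:30' } },
);

# Flatten to [timestamp, outer key, inner key] rows, then sort by
# timestamp and break ties on the inner key.
my @rows =
    sort { $a->[0] cmp $b->[0] || $a->[2] cmp $b->[2] }
    map  {
        my $outer = $_;
        map { [ $data{$outer}{$_}{timestamp}, $outer, $_ ] } keys %{ $data{$outer} };
    } keys %data;

print join( ',', @$_ ), "\n" for @rows;
```

It prints the four lines the question asks for, in the same order, with no intermediate array.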
