Perl Data Structures

Tips, Tricks, and Traps

G. Wade Johnson

Perl Data Structure Basics

There are 3 major types of data structures in Perl. There are a few other things that might count as data structures, but you may never run into them. We'll skip those others in this talk.

Scalar

A scalar can store integers or IEEE floating point numbers. Scalars with numeric values act pretty much the way you would expect coming from any other language.

A scalar can also store a string. The strings are not particularly limited in size by anything except available memory. Unlike some languages (Java), Perl strings are mutable.

A scalar can also hold a reference, which is basically a pointer to a data structure or code. The undef value is a special value that means the scalar is not currently set to anything. Perl provides ways to unset a value, so undef is not quite the same as never used.
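As a rough sketch of the kinds of values described above, the following stores each one in a scalar. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $count = 42;            # integer
my $ratio = 0.25;          # floating point number
my $name  = 'Perl';        # string
my $ref   = \$name;        # reference to another scalar
my $unset;                 # declared but never assigned: undef

$name .= ' 5';             # strings are mutable; the reference sees the change
my $via_ref = $$ref;       # 'Perl 5'

my $was_defined = defined $count;   # true
undef $count;                       # explicitly unset the value
my $now_defined = defined $count;   # false
```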

Scalar: Operations

Most of the scalar operations are pretty much what you would expect from any dynamic language.

Perl supports all of the normal arithmetic operations. They work on integers and floats, and combinations of the two.

The two string operators are repetition and concatenation. The repetition operator (x) makes a new string that consists of the left argument repeated the number of times supplied by the right argument. The . operator, on the other hand, makes a new string by concatenating the left argument with the right argument.

Scalar: Example


    my $num = 1;
    $num += 2;

    my $str = 'hip, ' x 2;
    $str .= 'hooray!';

This little piece of example code doesn't really do much of anything except show how scalar variables are set and manipulated a bit. The my keyword specifies a lexical variable, which is what most languages call a local variable.

Array: Zero or More Scalars

Arrays contain lists of scalars. They are mutable and handle memory as they grow and/or shrink.

Array: Operations

Retrieve items from an array with the $array[0] syntax. Perl arrays support easy operators for adding/removing from the beginning/left of an array or the end/right end of the array. Perl automatically handles expanding or contracting the memory used.

The for statement can be used to iterate over a list. The two names for and foreach are aliases. There is no functional difference between them.

Array: Example


   my @array = (1, 2, 3);
   push @array, 'a';
   my $one = shift @array;
   for my $e (@array)
   {
      say $e;
   }

Another simple example, this time showing how arrays are set and manipulated.

Hash

Called different things in different languages: associative array, hash, dictionary, hashmap, etc. You can store 0 or more scalars in a hash, indexed or mapped, by strings. The key is actually a string, not a scalar.

There are hash operations for returning the list of keys in a hash and the list of values in a hash. The order of items in either list is not guaranteed and can change over the life of the hash.

Hash: Operations

The {} characters mark the fact that we are indexing into a hash.

If you assign a list to a hash, Perl treats the list items as alternating keys and values. If you use a hash in list context, Perl returns a list that alternates keys and values. The order of the keys is not defined, but each key is followed by its value.

The exists operation can tell if a particular key is in the hash. The delete operation is used to remove a key/value pair from a hash (returning the value).
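The operations above can be sketched as follows. The hash contents are invented for illustration.

```perl
use strict;
use warnings;

my %color = (apple => 'red', banana => 'yellow');   # list of key/value pairs

$color{cherry} = 'dark red';               # {} indexes into the hash
my $has_apple  = exists $color{apple};     # true
my $removed    = delete $color{banana};    # returns 'yellow'
my $has_banana = exists $color{banana};    # false

my @flat  = %color;      # list context: key, value, key, value, ...
my $pairs = @flat / 2;   # 2 pairs remain, in no defined order
```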

Hash: Example


    my %hash = (
        david   => 'David Shenk',
        kirsten => 'Kirsten Austin',
        mark    => 'Moranderan',
        connie  => 'Connie Ronin',
    );
    say $_ foreach sort keys %hash;
    say "David's full name is '$hash{david}'";

Another example that basically shows a little of the syntax of working with hashes.

Scalars: Implementation

Let's dig into the implementation of a scalar a little. A Perl scalar is actually a data structure of its own that contains several pieces. There are 1-3 pointers to underlying data. There is a reference count used to help with Perl's memory management. There are a handful of flags that tell Perl what the data in the structure means. Finally, there are a couple of optional lengths.

Most of the time, you don't need to know anything about this internal structure. However, once in a great while, these internals may explain behaviors that might be confusing.

Explore the internals a bit using Devel::Peek to dump scalars after performing various operations on them.
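One possible way to do that exploration is sketched below. Dump writes to STDERR, and the exact output varies between Perl versions, so no expected output is shown here.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

my $value = 42;        # starts life holding only an integer
Dump($value);          # look at the FLAGS and IV lines

my $text = "$value";   # using it as a string may cache a string form
Dump($value);          # compare FLAGS and look for a PV line

$value += 0.5;         # now a floating point number
Dump($value);          # look for an NV line
```

The REFCNT line in each dump is the reference count mentioned above.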

Scalar Tips

Most of these tips just follow up on the implications of the data on the previous slide.

Scalar Tips

The difference between pre-increment(decrement) and post-increment(decrement) is that the pre-form returns the new value and the post-form returns the old value of the scalar. In both cases, the operators make the same changes to the scalar.

The increment operator also works on strings (the decrement operator does not). Explore the behavior in the REPL. Part of the reason for this is that string increment is used to implement the range operator (..) on strings.
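A short sketch of both tips, using made-up variable names:

```perl
use strict;
use warnings;

my $n   = 5;
my $old = $n++;   # post-increment returns the old value: $old is 5, $n is 6
my $new = ++$n;   # pre-increment returns the new value: $new is 7, $n is 7

# String increment carries within letters and digits.
my $s = 'az';
$s++;             # 'ba'
my $z = 'zz';
$z++;             # 'aaa'
my $v = 'a9';
$v++;             # 'b0'

# The range operator on strings steps the same way.
my @cols = ('aa' .. 'ad');   # aa ab ac ad
```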

Scalar Trap

Since post-increment (or post-decrement) returns the old value while updating the scalar, Perl must create a new scalar to hold the old value. If you are not using the old value, this is an unnecessary create/destroy of a temporary scalar. In most circumstances, this cost is so small that you don't really need to care about it. But in a really tight loop, it could have an impact on running code.

Array Implementation

Once again, use Devel::Peek to explore the internals of the array. Show the different pieces, but there's not as much need to describe in this case.

Discuss the allocation strategy a bit, mostly to explain why we don't do an allocation on each change in the size of the array.

Array Tips

Negative indexes count back from the end of the array, which is often easier than calculating from the current length. Iterating over elements is safer than trying to walk indexes: there is no chance of accidentally walking off the end of the array (and extending it). Also, anyone reading the code can tell that it will touch all elements, without having to work through index arithmetic to figure out what is happening.

Resetting the final index ($#array) can be used for truncating an array. It's more useful for pre-allocating a large array to avoid multiple re-allocations.
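These tips can be sketched as follows. The array names and sizes are arbitrary.

```perl
use strict;
use warnings;

my @letters = qw(a b c d e);

my $last         = $letters[-1];   # 'e', same as $letters[$#letters]
my $next_to_last = $letters[-2];   # 'd'

$#letters = 2;                     # truncate: @letters is now (a b c)
my $len_after_truncate = @letters;

my @big;
$#big = 999;                       # pre-extend to 1000 elements, all undef
my $big_len = @big;
```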

Array Traps

You cannot accidentally extend an array off the front: assigning to a negative index that reaches before the first element is a fatal error, since that is almost never what you want. Assigning past the end of the array extends it, which is likely what you wanted. Once again, this is a case of Perl attempting to do what you intend, at the expense of consistency.
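A small sketch of how the two ends of an array behave differently. The fatal error is caught with eval so the script keeps running.

```perl
use strict;
use warnings;

my @items = (1, 2, 3);

my $missing        = $items[10];   # reading past the end gives undef...
my $len_after_read = @items;       # ...and the array still has 3 elements

$items[5] = 'six';                 # assigning past the end extends the array
my $len_after_write = @items;      # 6; elements 3 and 4 are undef

my $before_front = $items[-10];    # reading before the front also gives undef

# Assigning before the front dies at run time.
my $ok    = eval { $items[-10] = 'oops'; 1 };
my $error = $ok ? '' : $@;         # "Modification of non-creatable array value attempted..."
```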

Array Trick: List Assignment


   ($a, $b, $c) = (1, 2, 3);
   ($a, undef, $c) = func();
   ($first, $second, @rest) = func();
   my $count = ()= func();

If the left side of an assignment is a list of variables surrounded by parentheses, Perl assigns individual values from the right-hand list into the variables in the left-hand list. Entries on the left that are specified as undef are skipped from the assignment. If there are more items on the right than on the left, the extra values are not assigned. If an item in the left-hand list is an array, it takes all remaining items.

One special feature of list assignment is that, when it is performed in scalar context, it returns the number of items in the right-hand list. This gives the somewhat odd syntax in the last line, which executes func() in list context (because of the list assignment), does not save any of the returned values, but counts those values and assigns that count to $count.
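Here is a runnable sketch of those forms. The func() defined here is a hypothetical stand-in that returns four values.

```perl
use strict;
use warnings;

# Hypothetical function that returns a list of four values.
sub func { return (10, 20, 30, 40) }

my ($first, undef, $third) = func();   # skip the second value, drop the fourth
my ($head, @rest)          = func();   # @rest takes everything after the first
my $count = () = func();               # 4: the number of items on the right

# List assignment also swaps variables without a temporary.
my ($x, $y) = (1, 2);
($x, $y) = ($y, $x);
```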

Array Complexity

A little information about the cost of using arrays.

The extension complexity is caused by the way extra memory is allocated to reduce the cost of each new element added. The trade-off is a larger amount of memory may be allocated than you will need.

Hash Tips

Wrapping your mind around the facts of the hash keys/values not being in a defined order.

Hash Tips

Talk about missing items. Sometimes just checking whether a value is defined is enough. Sometimes it's not, because a key can exist in the hash with an undefined value.
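The difference between testing for truth, for definedness, and for existence can be sketched like this, with an invented hash:

```perl
use strict;
use warnings;

my %stock = (apples => 3, pears => 0, plums => undef);

my $pears_true    = $stock{pears}         ? 1 : 0;   # 0: the value is false
my $pears_defined = defined $stock{pears} ? 1 : 0;   # 1: the value is defined
my $plums_defined = defined $stock{plums} ? 1 : 0;   # 0: key present, value undef
my $plums_exists  = exists $stock{plums}  ? 1 : 0;   # 1: the key is in the hash
my $kiwis_exists  = exists $stock{kiwis}  ? 1 : 0;   # 0: the key was never stored
```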

Hash Tricks: Sets


   my %set = map { $_ => 1 }
                 qw(Fortran C C++ Perl Forth Java
                    JavaScript Ruby Lisp Rust);
   say 'yes' if $set{'C++'};
   say 'no' unless $set{'C#'};

An example showing the use of a hash as a set of strings. The actual values don't mean anything. We just care about whether or not there is a value associated with the key. Assigning a key multiple times still ends up with only one item, obviously, making this act as a set.

Hash Tricks: Count Words


  my %words;
  my $line;
  while($line = <<>>)
  {
    ++$words{$_} foreach(split /[^A-Za-z]+/, $line);
  }

This example reads text from input (using the diamond operator) and splits each line into words. We key the hash with the words that we get from the split and apply the increment operator to each value, which Perl handles correctly even if the value was previously unset. The result is pretty concise, and not too hard to understand once you understand hashes and the basics.
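A variation on the same idea is sketched below. It uses a fixed list of lines instead of input so that it runs standalone. The case folding with lc, the grep that drops empty fields, and the final report are additions for illustration, not part of the slide.

```perl
use strict;
use warnings;

my @lines = ("the cat and the hat\n", "The end.\n");   # stand-in for input

my %words;
for my $line (@lines) {
    # grep drops the empty field that split returns when a line
    # starts with a non-letter.
    ++$words{$_} for grep { length } split /[^A-Za-z]+/, lc $line;
}

# Most frequent first; ties are broken alphabetically.
my @report = map  { "$_: $words{$_}" }
             sort { $words{$b} <=> $words{$a} || $a cmp $b }
             keys %words;
```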

Hash Trick: Transpose


   my %sounds = qw(dog bark cat meow bunny thump snake hiss);
   my %animal_makes = reverse %sounds;

One unusual side effect of the conversion between hashes and lists is the way reverse can be used to swap the keys and values in a hash. This only works well if none of the values are duplicates, of course.

Treating the hash as a list returns a list alternating keys and values. Reversing that list converts it to a list alternating values and keys. Assigning to a new hash uses the old values as keys and the old keys as the corresponding values.
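To see why duplicate values are a problem, consider this small sketch with made-up data in which two animals share a sound:

```perl
use strict;
use warnings;

my %sounds     = (dog => 'bark', cat => 'meow', seal => 'bark');
my %animal_for = reverse %sounds;

my $count = keys %animal_for;   # 2, not 3: 'bark' can only keep one animal
# Whether $animal_for{bark} is 'dog' or 'seal' depends on hash order.
```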

Hash Traps

Keys are strings, not scalars. There are a few different side-effects of that, but mostly it's just something to be aware of. Some side effects:

  • undef becomes an empty string
  • The number 1 and the string '1' map to the same thing
  • References are converted to a string form that is not particularly useful and cannot be used to refer to the original object
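Each of these side effects can be demonstrated with a short sketch. The address shown in the comment is only an example of the format.

```perl
use strict;
use warnings;

my %h;
$h{1}   = 'number';
$h{'1'} = 'string';           # same key: overwrites the previous value
$h{1.0} = 'float';            # the number 1.0 also stringifies to '1'
my $key_count = keys %h;      # 1

my $list = [1, 2, 3];
$h{$list} = 'by reference';   # key is a string such as "ARRAY(0x55d0c8a1e2b8)"
my ($ref_key) = grep { /^ARRAY/ } keys %h;
my $is_ref = ref $ref_key;    # '': the key is a plain string, not a reference

my $nothing;
{
    no warnings 'uninitialized';   # an undef key warns under 'use warnings'
    $h{$nothing} = 'undef key';    # stored under the empty string
}
my $has_empty = exists $h{''};
```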

Hash Complexity

A little information about the cost of using hashes.

The first two are related.

The extension complexity is caused by the way extra memory is allocated to reduce the cost of each new element added. The trade-off is a larger amount of memory may be allocated than you will need.

Data Structure Complexity Trap

A lot of people have the impression that hashes are faster than arrays. That is not completely true. The reality is a bit more complicated. Part of what makes the hash data structure work is a hashing function that converts a string into an index into the underlying data structure. This hashing function takes a small amount of time.

For really short arrays, searching the array for a string is faster than hashing a key and indexing. Under many circumstances the difference in time is negligible, but in certain cases it can make a difference. Unless you know the performance difference actually matters in your case, use the data structure that is the most readable.

If you are doing these lookups in a tight loop, or millions of times, it might be worth measuring the time spent and deciding if it is worth the optimization.

Data Structure Complexity Trap

One place where I have regularly seen people use a hash when it definitely has a performance impact is building a hash from a list, looking up one value, and then discarding the hash. This is never a reasonable thing to do. It doesn't matter how big the list is, or how fast the hashing function, just searching the list is always faster than this.

The problem is that building the hash requires walking the whole list and performing an operation that is likely more expensive than the comparison on every string, just to create the hash. Then, you do the hash lookup (which costs time). If you immediately throw away the hash, you have gained no benefit. Some people will tout the constant-time lookup of the hash and completely ignore the linear-time construction of the hash.

If you end up using the same hash for multiple lookups, it will eventually become faster to have used the hash. As usual, measure to figure out which trade-off works best for your code.
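The three situations can be sketched side by side. The language list is invented, and any comes from the core List::Util module (assuming a version recent enough to provide it).

```perl
use strict;
use warnings;
use List::Util qw(any);

my @langs = qw(Fortran C Perl Forth Java Ruby Lisp Rust);

# The trap: build a hash, use it once, throw it away.
my $found_via_hash = do {
    my %set = map { $_ => 1 } @langs;   # walks and hashes every element
    exists $set{Perl};
};

# One lookup: just scan the list; any() stops at the first match.
my $found_via_scan = any { $_ eq 'Perl' } @langs;

# Many lookups: build the hash once and reuse it.
my %is_lang = map { $_ => 1 } @langs;
my @known   = grep { $is_lang{$_} } qw(Perl COBOL Rust Ada);
```

Which form is faster for a given list size and number of lookups is something to measure, for example with the core Benchmark module.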

List Tools: map

Apply functionality to a list


  my @days = qw(Monday Tuesday Wednesday Thursday Friday Saturday Sunday);
  my @short_days = map { substr($_, 0, 3) } @days;

The map function transforms a list by applying a block of code to each element, returning a new list of the results. The original list is left alone unless the code changes $_. Although the code is applied to each element, it is actually more useful to think of map as an operation on the whole list.

List Tools: grep

Filter a list


   my @lines = <<>>;
   my @long  = grep { length $_ > 100 } @lines;

The grep function filters a list to generate a new list by executing the supplied code on each element. Any element that causes the code to return a true value is included in the returned list.

List Tools: sort

Order a list


   my @words = <<>>;
   my @len_sorted = sort { length $a <=> length $b || $a cmp $b } @words;

The sort function sorts the items in a list, returning the sorted list. With no supplied comparison code, the function sorts ASCIIbetically. If a code block is supplied the two elements to compare are aliased to the variables $a and $b. The code block should return a negative number if the first element should sort before the second, a positive number if the second should sort before the first, or 0/false if the two items are equivalent.
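The difference between the default string ordering and a supplied numeric comparison shows up as soon as the list holds numbers, as in this small sketch:

```perl
use strict;
use warnings;

my @nums = (10, 9, 100, 1);

my @as_strings = sort @nums;                 # 1 10 100 9 (string comparison)
my @ascending  = sort { $a <=> $b } @nums;   # 1 9 10 100
my @descending = sort { $b <=> $a } @nums;   # 100 10 9 1
```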

Array Trick: splice


   my @array = qw(a b c d e f g h i);
   splice @array, 2, 3, qw(C D E E1 E2);

The splice function allows you to manipulate an array in several ways. It can remove items from an array, insert items anywhere in an array, or combine those functions to replace items in the array with different items from a list. The array functions push, pop, shift, and unshift can all be implemented simply in terms of splice, but splice is more flexible.
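The following sketch repeats the slide's call while capturing the removed items, then shows splice standing in for the simpler array functions. The @queue array is made up for illustration.

```perl
use strict;
use warnings;

my @array   = qw(a b c d e f g h i);
my @removed = splice @array, 2, 3, qw(C D E E1 E2);
# @removed is (c d e); @array is now (a b C D E E1 E2 f g h i)

my @queue = (1, 2, 3);
splice @queue, scalar(@queue), 0, 4;   # like push @queue, 4
splice @queue, 0, 0, 0;                # like unshift @queue, 0
my $tail = splice @queue, -1;          # like pop @queue
my $head = splice @queue, 0, 1;        # like shift @queue
```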

Modules

If you want more tools for manipulating Perl data structures, check out these modules. They provide further functionality for each of the data structure types.

References

Much of the information I covered in this talk (and quite a bit more) is available on any system that has Perl installed. The first command describes the Perl data structures. The second describes many Perl operators, including the ones I mentioned here.

Schwartzian Transform


   my @files = map  { $_->[0] }
               sort { $a->[1] <=> $b->[1] }
               map  { [ $_, -s $_ ] }
               @files;

A very useful sort optimization called the Schwartzian Transform chains the sort and map functions in an interesting way. The basic idea is that we want to sort a list based on some function that is expensive to calculate.

Using the obvious approach, we will perform the expensive calculation twice for each comparison (left and right side of the comparison). For a list of length n, sort will do about n * log2(n) comparisons (where log2 is the log base 2). In other words, for 1024 items in the list, we will do 1024 * 10 or 10,240 comparisons, which means we run the expensive function 20,480 times.

Sorting 1024 items compared as strings happens incredibly fast, but if the expensive function took 0.1 seconds to run, the sort would now take 34 minutes to run.

Using the Schwartzian Transform, we convert each element in the list into a pair of the element and the calculated value. This means we only run the expensive function one time for each item of the list. For a list of 1024 items, that would take 1 minute 42 seconds. Now the sort does a simple comparison every time it needs to, and the top map converts the pairs back into the original elements.

This pattern is a little complex, but you should be able to walk through it from bottom to top and understand what it does. In the example on the slide, the expensive function is the -s file size test. Without the transform, we would be accessing the disk for each comparison, which is not only (relatively) slow, but also variable in time because of the hardware. This is a great example of a slow comparison function that I have seen used in real life.
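To make the savings visible without touching the disk, here is a sketch that counts calls instead. The expensive() function is a hypothetical stand-in that just returns the string length, and the item list is invented.

```perl
use strict;
use warnings;

my $calls = 0;
# Hypothetical stand-in for an expensive function; it only counts its calls.
sub expensive { my ($s) = @_; ++$calls; return length $s }

my @items = qw(pear fig banana kiwi apple cherry plum date);

# Obvious approach: the function runs twice per comparison.
my @naive = sort { expensive($a) <=> expensive($b) || $a cmp $b } @items;
my $naive_calls = $calls;

# Schwartzian Transform: the function runs once per item.
$calls = 0;
my @sorted = map  { $_->[0] }
             sort { $a->[1] <=> $b->[1] || $a->[0] cmp $b->[0] }
             map  { [ $_, expensive($_) ] }
             @items;
my $st_calls = $calls;   # 8, one per item
```

Both versions produce the same order. Only the number of calls to expensive() differs, and the gap widens as the list grows.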

Scalar Trick: Dual Vars

Although not used too often, this is an interesting trick. Using the dualvar function in the Scalar::Util module, you can create scalars with both a string and numeric component. An example would be a scalar that contained both the numeric code and printable message for an HTTP response:

    my $status = dualvar( 404, "Not Found" );

If you use $status as a number (say by adding 0 to it or comparing it to 404), it has the numeric value. If you were to use it as a string (say by printing or interpolating it into a string), it has the string value.
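A complete, runnable version of that idea might look like this, with variable names chosen for illustration:

```perl
use strict;
use warnings;
use Scalar::Util qw(dualvar);

my $status = dualvar(404, 'Not Found');

my $code      = $status + 0;                # 404
my $message   = "Error: $status";           # 'Error: Not Found'
my $not_found = ($status == 404);           # true: numeric comparison
my $by_name   = ($status eq 'Not Found');   # true: string comparison
```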