There are 3 major types of data structures in Perl. There are a few other things that might count as data structures, but you may never run into them. We'll skip those others in this talk.
undef
A scalar can store integers or IEEE floating point numbers. Scalars with numeric values act pretty much the way you would expect coming from any other language.
A scalar can also store a string. The strings are not particularly limited in size by anything except available memory. Unlike some languages (Java), Perl strings are mutable.
A Perl reference is basically a pointer or reference to a data structure
or code. The undef
value is a special value that means the scalar
has not been set to anything at this time. Perl provides ways to unset
a value, so undef
is not quite the same as never used.
+
, -
, *
,
/
, %
, ++
, --
x
, .
==
, !=
, <
,
>
, <=
, >=
eq
, ne
, lt
,
gt
, le
, ge
!
, &&
, ||
,
not
, and
, or
Most of the scalar operations are pretty much what you would expect from any dynamic language.
Perl supports all of the normal arithmetic operations. They work on integers and floats, and combinations of the two.
The two string operations are repeat and concatenate. The repeat
operator (x
) makes a new string that consists of the left argument repeated
the number of times supplied by the right argument. The .
operator, on the
other hand, makes a new string by concatenating the left argument with the right argument.
my $num = 1;
$num += 2;
my $str = 'hip ' x 2;
$str .= ', hooray!'
This little piece of example code doesn't really do much of anything except show how
scalar variables are set and manipulated a bit. The my
keyword specifies a
lexical variable, which is what most languages call a local variable.
undef
Arrays contain lists of scalars. They are mutable and handle memory as they grow and/or shrink.
[]
shift
/unshift
push
/pop
for
/foreach
Retrieve items from an array with the $array[0]
syntax. Perl arrays support
easy operators for adding/removing from the beginning/left of an array or the end/right
end of the array. Perl automatically handles expanding or contracting the memory used.
The for
statement can be used to iterate over a list. The two names
for
and foreach
are aliases. There is no functional difference
between them.
my @array = (1, 2, 3);
push @array, 'a';
my $one = shift @array;
for my $e (@array)
{
say $e;
}
Another simple example which shows some very simple code to set and manipulate arrays.
Called different things in different languages: associative array, hash, dictionary, hashmap, etc. You can store 0 or more scalars in a hash, indexed or mapped, by strings. The key is actually a string, not a scalar.
There are hash operations for returning the list of keys in a hash and the list of values in a hash. The order of items in either list is not guaranteed and can change over the life of the hash.
{}
exists
delete
The {}
characters mark the fact that we are indexing into a hash.
If you assign a list to a hash, Perl treats the list items as alternating keys and values. If you use a hash in list context, Perl returns a list that alternates keys and values. The order of the keys is not defined, but each key is followed by its value.
The exists
operation can tell if a particular key is in the hash. The
delete
operation is used to remove a key/value pair from a hash (returning
the value).
my %hash = (
david => 'David Shenk',
kirsten => 'Kirsten Austin',
mark => 'Moranderan',
connie => 'Connie Ronin',
);
say $_ foreach sort keys %hash;
say "David's full name is '$hash{david}'";
Another example that basically shows a little of the syntax of working with hashes.
Let's dig into the implementation of a scalar a little. A Perl scalar is actually a data structure of its own that contains several pieces. There are 1-3 pointers to underlying data. There is a reference count used to help with Perl's memory management. There are a handful of flags that tell Perl what the data in the structure means. Finally, there are a couple of optional lengths.
Most of the time, you don't need to know anything about this internal structure. However, once in a great while, these internals may explain behaviors that might be confusing.
Explore the internals a bit using Devel::Peek
to dump scalars after
performing various operations on them.
Most of these tips just follow up on the implications of the data on the previous slide.
++$i
) vs post-increment ($i++
)"aA0"
--$i
) vs post-decrement ($i--
)The difference between pre-increment(decrement) and post-increment(decrement) is that the pre-form returns the new value and the post-form returns the old value of the scalar. In both cases, the operators make the same changes to the scalar.
The increment operators also work on strings. Explore the behavior in the REPL. Part of the reason for this is that the
increment operation is used while managing the range operator ..
.
Since post-increment(decrement) return the old value when they are updating the scalar, Perl must create a new scalar to hold the old value. If you are not using the old value, this is an unnecessary create/destroy of a temporary scalar. In most circumstances, this cost is so small, you don't really need to care about it. But, in a really tight loop, it could have an impact on running code.
Once again, use Devel::Peek
to explore the internals of the array.
Show the different pieces, but there's not as much need to describe in this case.
Discuss the allocation strategy a bit, mostly to explain why we don't do an allocation on each change in the size of the array.
$#array
gives the right-most index in array$#array
to change the size of the arrayNegative indexes can be useful rather than working with the current length. Iterating over elements is safer than trying to walk indexes. No chance of accidentally walking off the end of the array (and extending it). Also anyone reading the code can tell the code will touch all elements, without having to read the code to figure out what is happening.
Resetting the final index can be used for truncating array. It's more useful for pre-allocating a large array to avoid multiple re-allocations.
delete
removes elements from the end of the arraydelete
replaces elements with undef
if not at the endCannot accidentally index off the front of the array, since that's almost never what you want. Indexing off the end of the array, extends, which is likely what you wanted. Once again, a case of Perl attempting to do what you intend, at the expense of consistency.
($a, $b, $c) = (1, 2, 3);
($a, undef, $c) = func();
($first, $second, @rest) = func();
my $count = ()= func();
If the left side of an assignment is a list of variables surrounded by parentheses, Perl assigns
individual values from the right-hand list into the variables in the left-hand list. Entries on the
left that are specified as undef
are skipped from the assignment. If there are more
items on the right than on the left, the extra values are not assigned. If an item in the left-hand
list is an array, it takes all remaining items.
One special feature of list assignment occurs if it is performed in scalar context, it returns
the number of items from the right-hand list. This gives the somewhat odd syntax in the last item
which executes func()
in list context (because of the list assignment), does not save
any of the returned values, but counts those values and assigns that count to $count
.
A little information about the cost of using arrays.
The extension complexity is caused by the way extra memory is allocated to reduce the cost of each new element added. The trade-off is a larger amount of memory may be allocated than you will need.
keys
, values
, and each
return those items out of orderWrapping your mind around the facts of the hash keys/values not being in a defined order.
undef
undef
s, use exists
Talk about missing items. There are times when just checking for a value to be defined is enough, sometimes it's not.
my %set = map { $_ => 1 }
qw(Fortran C C++ Perl Forth Java
JavaScript Ruby Lisp Rust);
say 'yes' if $set{'C++'};
say 'no' unless $set{'C#'};
An example showing the use of a hash as a set of strings. The actual values don't mean anything. We just care about whether or not there is a value associated with the key. Assigning a key multiple times still ends up with only one item, obviously, making this act as a set.
my %words;
my $line;
while($line = <<>>)
{
++$words{$_} foreach(split /[^A-Za-z]+/, $line);
}
This example shows taking text from input (using the diamond operator). Using the increment operator
on values, which Perl automatically handles correctly if it was unset. We key the hash with the words
that we get from the split
. The result is pretty concise, and not too hard to understand
once you understand hashes and the basics.
my %sounds = qw(dog bark cat meow bunny thump snake hiss);
my %animal_makes = reverse %sounds;
One unusual side effect of the conversion between hashes and lists is the way reverse
can
be used to swap the keys and values in a hash. This only works well if none of the values are duplicates,
of course.
Treating the hash as a list returns a list alternating keys and values. Reversing that list converts it to a list alternating values and keys. Assigning to a new hash uses the old values as keys and the old keys as the corresponding values.
Keys are strings, not scalars. There are a few different side-effects of that, but mostly it's just something to be aware of. Some side effects:
undef
becomes an empty string1
and the string '1'
map to the same thingA little information about the cost of using hashes.
The first two are related.
The extension complexity is caused by the way extra memory is allocated to reduce the cost of each new element added. The trade-off is a larger amount of memory may be allocated than you will need.
A lot of people have the impression that hashes are faster than arrays. That is not completely true. The reality is a bit more complicated. Part of what makes the hash data structure work is a hashing function that converts a string into an index into the underlying data structure. This hashing function takes a small amount of time.
For really short arrays, searching the array for a string is faster than hashing a key and indexing. Under many circumstances the difference in time is negligible, but it certain cases it can make a difference. Unless you know the performance difference actually matters in your case, use the data structure that is the most readable.
If you are doing these lookups in a tight loop, or millions of times, it might be worth measuring the time spent and deciding if it is worth the optimization.
grep
over the list would be fasterOne place where I have regularly seen people use a hash when it definitely has a performance impact is building a hash from a list, looking up one value, and then discarding the hash. This is never a reasonable thing to do. It doesn't matter how big the list is, or how fast the hashing function, just searching the list is always faster than this.
The problem is that building the hash requires walking the whole list and performing an operation that is likely more expensive than the comparison on every string, just to create the hash. Then, you do the hash lookup (which costs time). If you immediately throw away the hash, you have gained no benefit. Some people will tout the constant-time lookup of the hash and completely ignore the linear-time construction of the hash.
If you end up using the same hash for multiple lookups, it will eventually become faster to have used the hash. As usual, measure to figure out which trade-off works best for your code.
Apply functionality to a list
my @days = qw(Monday Tuesday Wednesday Thursday Friday Saturday Sunday);
my @short_days = map { substr($_, 0, 3) } @days;
The map
modifies a list by applying a function to each element of that list,
returning a new list of the modified values. Although the code is applied to each element, it
is actually more useful to treat this as modifying the list.
Filter a list
my @lines = <<>>;
my @long = grep { length $_ > 100 } @lines;
The grep
filters a list to generate a new list by executing the supplied code on
each element. Any element that causes the code to return a true value is removed in the returned
list.
Order a list
my @words = <<>>;
my @len_sorted = sort { length $a <=> length $b || $a cmp $b } @words;
The sort
function sorts the items in a list, returning the sorted list. With no supplied comparison
code, the function sorts ASCIIbetically. If a code block is supplied the two elements to compare are aliased to the
variables $a
and $b
. The code block should return a negative number if the first element
should sort before the second, a positive number if the second should sort before the first, or 0/false if the two
items are equivalent.
my @array = qw(a b c d e f g h i);
splice @array, 2, 3, qw(C D E E1 E2);
The splice
function allows you to manipulate an array in several ways. It can remove items from an
array, insert items anywhere in an array, or combine those functions to replace items in the array with different
items from a list. The array functions push
, pop
, shift
, unshift
and delete
can all be simply implemented in terms of splice
. But, splice
is more flexible.
Scalar::Util
List::Util
, List::MoreUtils
Hash::Util
If you want more tools for manipulating Perl data structures, check out these modules. They provide further functionality for each of the data structure types.
Much of the information I covered in this talk (and quite a bit more) is available on any system that has Perl installed. The first command describes the Perl data structures. The second describes many Perl operators, including the ones I mentioned here.
my @files = map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { [ $_, -s $_ ] }
@files;
A very useful sort optimization called the Schwartzian Transform chains the sort
and map
functions in an interesting way. The basic idea is that we want to sort a list based on some function that is
expensive to calculate.
Using the obvious approach, we will perform the expensive calculation twice for each comparison (left and
right side of the comparison). For a list of length n
, sort will do about n * log2(n)
comparisons (for log2
is the log base 2). In other words, for 1024 items in the list, we will do
1024 * 10
or 10,240 comparisons. Which means we run the expensive function 20,480 times.
Sorting 1024 items compared as strings happens incredibly fast, but if the expensive function took 0.1 seconds to run, the sort would now take 34 minutes to run.
Using the Schwartian Transform, we convert each element in the list into a pair of the element, and the calculated
value. This means we only run the expensive function one time for each item of the list. For a list of 1024 items, that
would take 1 minute 42 seconds. Now, the sort
does a simple comparison all of the times it needs to and
the top map
, converts the pairs back into the original elements.
This pattern is a little complex, but you should be able to walk through it from bottom to top and understand what it does. In the example on the slide, we are accessing the disk for each comparison, which is not only (relatively) slow, but also variable in time because of the hardware. This is a great example of a slow comparison function that I have seen used in real life.
Although not used to often, this is an interesting trick. Using the dualvar
function in the
Scalar::Util
module, you can create scalars with both a string and numeric component. An example
would be a scalar that contained both the numeric code and printable message for an HTTP response:
my $status = dualvar( 404, "Not Found" );
If you use $status
as a number (say by adding 0
to it or comparing it to 404, it
has the numeric value. If you were to use it as a string (say by printing or interpolating it into a string), it
has the string value.