A first concurrency in Erlang

I've recently started learning the Erlang language using Joe Armstrong's new, in-beta, book. In the past I learned some Lisp but I ended up admiring Lisp more in the abstract than I liked it in practice and so I didn't stick at it. Lately I've had a yen to return to functional programming because so much of what I find elegant about Ruby seems to have it's roots in the FP world and I feel I've so much more to learn yet.

Getting started with a new programming language for me starts with picking a problem to solve that is useful, not too difficult, and exercises a reasonable spectrum of the languages features. In this case, because Erlang is so strong in it's support for concurrency, it seemed natural to look for a problem whose solution would be enhanced by a parallel approach. Since I have a very large collection of PDF's which includes quite a lot of duplicates, and since I don't like any of the de-dupe solutions I've tried, I thought a good first problem would be scanning for and finding duplicate PDF's across all my disks.

I spent a bit of time looking at the library that comes with Erlang and was pleased to discover the fold_files function in the filelib module. At a stroke this takes care of most of the actual work of iterating the file-system and filtering the required files. What remained was learning how to build the duplicate database in functional style and bearing in mind Erlang immutable data structures and single-assignment variables.

My approach to finding duplicates was to calculate an MD5 hash of the file size along with the first & last 16 bytes of the file which looks like this:

get_file_hash( FilePath, Size ) ->
  {ok,[Hunk1,Hunk2]} = myfile:open(
    FilePath, [read], fun( File ) -> file:pread( File, [{0,16},{Size-16,16}] ) end
  ),
  hex:hex(
    binary_to_list(
      erlang:md5(
        lists:append( integer_to_list( Size ), lists:append( Hunk1, Hunk2 ) )
      )
    )
  ).

As an aside I really like the way Erlang functions are declared, e.g.

length( [] ) -> 0;
length( [H|T] ) -> 1 + length( T ).

which uses pattern matching on the arguments to find the right clause of the function to evaluate. It's an elegant approach although I did get caught out by not realising that clauses of a function with the same name but different arity do no belong together, e.g.

find( Path ) -> expr;
find( Path, [] ) -> expr.

does not work because the two are not clauses of the same function (being instead clauses of find/1 and find/2 respectively). If you find yourself with perplexing head mismatch errors try looking for this pattern and separate the clauses properly (using a .).

My definition of the function myfile:open(), above, reflects my failure to close any files during the first attempts at iterating a large filesystem leading to emfile errors. I realised I had forgotten that it was even possible to call Ruby's File#open without passing a block because it's so natural and effective to do so, it's something I think Matz got dead right, hence:

myfile:open( Filename, Modes, Block ) when is_function( Block ) ->
  {ok,File} = file:open( Filename, Modes ),
  RetVal = Block( File ),
  file:close( File ),
  RetVal.

that demonstrates another neat aspect of Erlang function clauses, guard expressions.

Another aside is that, in Erlang, the = operator is not "equals" but "matches". This relates to the fact that Erlang, in a manner somewhat akin to Prolog, uses unification in expressions so that matching unbound variables will bind them and thereafter attempt to unify them, a simple example:

X = 2.

where X is an unbound variable will bind X to the value 2. If later you were to evaluate,

X = 3.

you would get an error because 3 does not match the value, 2, that is already bound to X. This, I think, can take some getting used to but is pretty neat. For example, if File is a filespec record then:

#filespec{hash=Key,size=Size} = File

will bind the variables Key and Size (in parallel) to the hash and size attributes of the record bound to File.

I was surprised at how quickly I arrived at a solution to my basic problem. It was almost as fast I could imagine doing it in Ruby. However what the basic solution lacked was that it took no advantage of Erlangs famous concurrency primitives, spawn & co. Initially I was quite unsure about how to parallelize my solution since so much of the work was going on in fold_files. After some pondering I concluded that what I was looking for was futures. I looked through the library that comes with Erlang but couldn't recognize was I was looking for, so I cooked up the following:

-module(futures).
-export([new/1,value/1]).

new( Fun ) ->
  Pid = spawn( fun oneshot/0 ),
  Pid ! { self(), Fun },
  Pid.

value( Pid ) ->
  receive
    { Pid, { ok, Result } } -> Result
  end.

oneshot() ->
  receive
    { Sender, Fun } ->
      Sender ! { self(), { ok, Fun() } };
    Any ->
      io:format( "Oneshot(~p) received unknown message ~p~n", [self(),Any] )
  end.

Which, it turns out, isn't a shocking travesty ;-) I did get some useful suggestions from the fine folks in #erlang including looking at Joe Armstrong's definition of pmap. Now I was able to make essentially a two line change to have a fully concurrent solution:

scan( Path ) ->
  lists:map(
    fun( Future ) -> futures:value( Future ) end,
    filelib:fold_files(
      Path,
      ".*\.pdf",
      true,
      fun( F, Acc ) -> [
        futures:new( fun() ->
          file_info:file_info( F )
        end ) | Acc ] end,
      []
    )
  ).

Creating the future spawns an Erlang process that runs the embedded function (that does all the nasty mucking about with files and MD5 hashing). Later the future is referenced for it's value which will block until a value is available (but by which time all the processes have been spawned off anyway). Crucial to this being efficient is Erlangs ability to cope with hundreds of thousands of processes (I was able to run tests creating millions of processes on my MacBook Pro CD2 with 2GB memory). For more about that take a look here.

The way I have defined the future module highlights what is for me, so far, the only thing I don't like about Erlang (and probably all functional languages). I miss the ability to wrap the future into a nicely packaged class. Calling:

Pid = futures:new( fun .... end ).
futures:value( Pid ).

seems less elegant than (hypothetically):

future = Future.new do ... end
future.value()

Something about having to treat the generic process id as a future... maybe it's a small point but modules do expose the details more and feel somewhat clunky after using Ruby classes for such a long time. This last is apparent in things like:

lists:map( fun ... end, List )

vs.

list.map do ... end

as well. There have to be tradeoff's I suppose.

I've still got a few bugs to work through (the odd future call seemingly to returnin a pid instead of the expected result which is a bit weird) but otherwise I have quickly arived at 80% of a fast neat solution to my problem. The last step of which will be to have it create a customisable script for handling the duplicate files.

I'm beginning to like Erlang a lot.

A first concurrency in Erlang

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112