Erlang processes and messaging architecture

The challenge I have at hand is to read the lines of a large file, process them, and return ordered results.

My algorithm:

  • Start with a master process that evaluates the workload (recorded on the first line of the file).
  • Spawn workers: each worker reads a part of the file using pread/3, processes that part, and sends its results to the master.
  • The master gathers all sub-results, sorts them, and returns them, so no communication between workers is required.

My questions:

  • How do I find the optimal balance between the number of Erlang processes and the number of cores? If I create one process per CPU core, will that make full use of my CPU?
  • How does pread/3 reach the specified position; does it iterate over all lines in the file? And is pread/3 a good approach for reading a file in parallel?
  • Is it better to send one big message from process A to B, or N small messages? I found part of the answer in the link below, but I would appreciate more detail on the Erlang messaging architecture.




1 answer


  1. Erlang processes are cheap. You are free (and encouraged) to use more processes than you have cores. There may be an upper limit to what is practical for your problem (spawning one process per line while loading 1 TB of data is a lot, depending on the size of a line).

    The easiest way, when you don't know in advance, is to let the user decide: make N configurable, spawn N workers, distribute the work among them, and wait for the responses. Re-run the program with a different N if you don't like how it performs.

    The trickier way is to benchmark, pick whatever you think makes sense as a maximum, put it behind a pool library (as you prefer: some pools are for pre-allocated resources, some for a resizable amount), and settle for a one-size-fits-all solution.

    But there really is no simple "optimal number per core". You can run 50 processes just as well as 65,000 of them if you like; if the task is embarrassingly parallel, the VM should be able to make use of most of them and saturate the cores anyway.
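As a rough illustration of the "let the user pick N" approach, here is a hedged sketch (module and function names are made up for this example): spawn N workers, split the items among them, and collect one reply per worker in spawn order, so the appended result keeps the original ordering.

```erlang
-module(pool_sketch).
-export([run/3, chunk/2]).

%% Spawn up to N workers, each mapping Fun over one contiguous
%% chunk of Items; collect replies in spawn order so the final
%% list stays ordered. Re-run with a different N to tune.
run(N, Fun, Items) ->
    Parent = self(),
    Pids = [spawn_link(fun() ->
                Parent ! {self(), [Fun(I) || I <- Part]}
            end) || Part <- chunk(Items, N)],
    %% Await exactly one reply per worker, in spawn order.
    lists:append([receive {Pid, Res} -> Res end || Pid <- Pids]).

%% Split a list into at most N contiguous parts of near-equal size.
chunk(Items, N) ->
    Len = length(Items),
    Size = max(1, (Len + N - 1) div N),
    chunk_(Items, Size).

chunk_([], _) -> [];
chunk_(Items, Size) when length(Items) =< Size -> [Items];
chunk_(Items, Size) ->
    {Head, Tail} = lists:split(Size, Items),
    [Head | chunk_(Tail, Size)].
```

For example, `pool_sketch:run(3, fun(X) -> X * X end, [1,2,3,4,5])` squares the list across three workers and still returns the results in input order.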

-

  2. Reading files in parallel is an interesting question. It may or may not be faster (as the comments on the question mentioned), and it can only yield a speedup if the work on each line is minimal, so that reading the file dominates the cost.

    The hard bit really is that functions such as pread/2-3 take byte offsets. Your question is phrased in terms of the lines of the file, so the byte offsets you hand to workers may land in the middle of a line. If your block boundary falls after the word my in this is my line\nhere it goes\n, one worker will see that it has an incomplete line, and the other will only report  line\nhere it goes\n, missing the preceding this is my.

    Typically, this kind of annoying detail is what leads you to have a first process own the file and sift through it, handing out bits of text to the workers for processing; that process acts as a kind of coordinator.

    The nice aspect of this strategy is that since the coordinating process knows everything it sent as a message, it also knows when all responses have come back, making it easy to know when to return results. If everything is uncoordinated, you have to trust both the spawner and the workers to report "we are all done" as a separate set of independent messages before you know you are finished.

    In practice, you will probably find that what matters most is how your hardware handles file operations, more than "how many processes can read the file at once." There is only one hard drive (or SSD), and all the data has to go through it anyway; parallelism may end up limited by access at that end.
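The coordinator idea above can be sketched as follows (a hedged illustration, not the answerer's code; the module name and the idea of tagging lines with an index are my own additions). One process owns the file descriptor, reads whole lines so no worker ever sees a split line, and, because it counted what it sent, knows exactly how many replies to wait for:

```erlang
-module(coord_sketch).
-export([process_file/2]).

%% The coordinator owns the file: it reads complete lines, tags each
%% with its index, spawns a worker per line, and collects exactly as
%% many replies as it sent, restoring file order at the end.
process_file(Path, Fun) ->
    {ok, Io} = file:open(Path, [read, raw, binary, {read_ahead, 64 * 1024}]),
    Parent = self(),
    Sent = dispatch(Io, Fun, Parent, 0),
    ok = file:close(Io),
    %% One reply per message sent: trivial to know when we are done.
    Replies = [receive {Idx, Res} -> {Idx, Res} end || _ <- lists:seq(1, Sent)],
    [Res || {_, Res} <- lists:keysort(1, Replies)].

dispatch(Io, Fun, Parent, N) ->
    case file:read_line(Io) of
        {ok, Line} ->
            Idx = N + 1,
            spawn_link(fun() -> Parent ! {Idx, Fun(Line)} end),
            dispatch(Io, Fun, Parent, Idx);
        eof ->
            N
    end.
```

Because only the coordinator touches the file descriptor, the raw mode restriction (the descriptor is tied to the opening process) is never an issue for the workers.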

-

  3. Use messages that make sense for your program. The most efficient program would have many processes able to do work without sending messages, communicating, or acquiring locks.

    A more realistic, still highly efficient program would use very few, very small messages.

    The interesting thing is that your problem is inherently data-driven. So there are a few things you can do:

    • make sure you are reading the text as binaries; large binaries (> 64 bytes) are allocated on a global binary heap, shared between processes, and garbage-collected with reference counting
    • hand over information about what needs to be done rather than the data needed to do it; this would have to be measured, but the coordinator could walk through the file, mark where lines end, and just hand byte offsets to the workers so they go and read the file themselves; note that you then end up reading the file twice, so if memory allocation is not your main overhead, this will likely be slower
    • make sure the file is opened in raw or ram mode; other modes use a middleman process to read and forward the data (this is useful when reading files over the network between clustered Erlang nodes), whereas the raw and ram modes hand the file descriptor directly to the calling process and are much faster
    • first, write a clear, understandable, and correct program; only if it is too slow should you try to restructure and optimize it — you may well find it fast enough on the first try
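The "hand over information, not data" point can be illustrated with a small sketch (my own example, not from the answer): scan the file once to record the {Offset, Length} of each line, which workers could then fetch themselves with pread/3 instead of receiving the line contents in a message.

```erlang
-module(offsets_sketch).
-export([line_offsets/1]).

%% Return [{ByteOffset, ByteLength}] for each line in Bin, newline
%% included; a coordinator could send these pairs to workers, which
%% would then file:pread/3 their own lines from the file.
line_offsets(Bin) ->
    line_offsets(Bin, 0, 0, []).

line_offsets(<<$\n, Rest/binary>>, Start, Pos, Acc) ->
    %% Line ends here (newline counted in the length).
    line_offsets(Rest, Pos + 1, Pos + 1, [{Start, Pos + 1 - Start} | Acc]);
line_offsets(<<_, Rest/binary>>, Start, Pos, Acc) ->
    line_offsets(Rest, Start, Pos + 1, Acc);
line_offsets(<<>>, Start, Pos, Acc) when Pos > Start ->
    %% Trailing line without a final newline.
    lists:reverse([{Start, Pos - Start} | Acc]);
line_offsets(<<>>, _, _, Acc) ->
    lists:reverse(Acc).
```

For the example string from point 2, this yields one pair per line and never splits a line across a boundary, since the scan itself decides where lines end.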


Hope this helps.

PS You can try some really simple things first:

  • either:

    • read the whole file at once with {ok, Bin} = file:read_file(Path) and split it into lines (with binary:split(Bin, <<"\n">>, [global])),
    • use {ok, Io} = file:open(File, [read,ram]) and then call file:read_line(Io) repeatedly on the file descriptor,
    • use {ok, Io} = file:open(File, [read,raw,{read_ahead,BlockSize}]) and then call file:read_line(Io) repeatedly on the file descriptor
  • call rpc:pmap({?MODULE, Function}, ExtraArgs, Lines) to run everything in parallel automatically (it will spawn one process per line)

  • call lists:sort/1 on the result.

Then from there, you can refine each step if you identify it as problematic.
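Putting the simple steps together gives something like this (a hedged sketch; the module name and the trivial per-line function are placeholders for your real processing):

```erlang
-module(simple_sketch).
-export([run/1, process_line/1]).

%% The naive pipeline: read the whole file, split it into lines,
%% fan out with rpc:pmap/3 (one process per line), sort the result.
run(Path) ->
    {ok, Bin} = file:read_file(Path),
    %% trim drops the empty trailing element after the last newline.
    Lines = binary:split(Bin, <<"\n">>, [global, trim]),
    Results = rpc:pmap({?MODULE, process_line}, [], Lines),
    lists:sort(Results).

%% Stand-in for the real per-line work: here it just returns the line.
process_line(Line) ->
    Line.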
