Example of multi-threaded application: array of threads/ja

From Lazarus wiki
Revision as of 01:34, 15 March 2014 by TheCreativeCAT (talk | contribs) (Creating a Japanese page)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

ここで、私は多数のスレッドを生成して、それらが作業を終えるのを待つ例を示そうと思います(同期は必要としていません)。これを書く気になったのは、Multithreaded Application Tutorial/jaを読んでも、その種のプログラムを書く方法がはっきり判らなかったからです。私が書いているのはOS X用のアプリケーションですが、ここに出てきたコードはどんなシステムでも動作するでしょう。

次のようなループを考えてみましょう:

...
for i:=1 to n do begin
  power(i,0.5)
end;
...

このループは、冪乗を一つづつn回計算しています。スレッドを使って同じことを並列に処理させてみましょう。

マルチスレッドは必要なの?

現代のマルチコアコンピュータでは、マルチスレッドを利用すればパフォーマンスが劇的に改善する場合があります。しかしながら、現代のコンピュータは極めて高速であり、スレッド化したコードは逐次処理に比べ、しばしばデバグと保守が困難です。処理時間の短縮とマルチスレッドプログラミングの複雑さとを天秤に掛け、またアルゴリズムが並列処理に適しているかよく考えてみる必要があります。最後に、並列処理の中にも、CPUスレッドの利用が適している場合(この例)とGPUの利用が適している場合があることを知っていてください(後者の場合、OpenCLのようなツールが最適です)。

Managing memory.

Since multiple threads will be working simultaneously, you need to ensure that they do not have memory contention issues. We will have contention issues if multiple threads are writing to the same locations of memory. Some algorithms do not lend themselves to multi-threading, because each computation depends earlier results. On the other hand, multi-threading works very efficiently on problems where the computations can be performed independently and in parallel. In this example we will solve a problem that is completely independent and therefore easy to attack with multiple threads. Advanced algorithms will have to use memory locking features to avoid contention.

In our example, we will have each thread write to distinct memory locations. Specifically, we will create an array 1..n and compute the value power(i,0.5) where i is in the range 1..n. Each thread will be given an independent portion of the range to compute. Consider n=1000. If we use one thread, it will be tasked with the whole range 1..1000, whereas if we use two threads one will tackle 1..500 and the other 501..1000. This way threads will be working to fill different portions of our memory array.

1. Detect number of cores available.

A computer with only a single core will not benefit from threading, whereas a computer with four physical cores each with hyperthreading (able to run two tasks simultaneously) will be able to process up to eight tasks at once. The following unit "cpucount" reports the number of cores available. You can use this to determine how many threads your program should run on a given computer. For computers with four or more cores, you many want to run n-1 threads (where n is the core count), reserving one core for the graphical interface and other tasks, as the performance difference between n and n-1 threads will not be great with this many cores.

unit cpucount;
interface
//returns number of cores: a computer with two hyperthreaded cores will report 4
function GetLogicalCpuCount: Integer;

implementation
{$IFDEF UNIX}
{$IFDEF Darwin}
uses Process,SysUtils,Controls,classes;

function GetLogicalCpuCount: Integer;
//returns number of CPUs for MacOSX computer
//requires Process in Uses Clause
//see http://wiki.lazarus.freepascal.org/Executing_External_Programs
var
   lProcess: TProcess;
   lLen,lPos: integer;
   lStr: string;
   lStringList: TStringList;
begin
     Result := 1;
     lProcess := TProcess.Create(nil);
     lStringList := TStringList.Create;
     lProcess.CommandLine := 'sysctl hw.ncpu';
     lProcess.Options := lProcess.Options + [poWaitOnExit, poUsePipes];
     lProcess.Execute;
     lStringList.LoadFromStream(lProcess.Output);
     lLen := length(lStringList.Text);
     if lLen > 0 then begin
        lStr := '';
        for lPos := 1 to lLen do
            if lStringList.Text[lPos] in ['0'..'9'] then
               lStr := lStr + lStringList.Text[lPos];
        if length(lStr) > 0 then
           result := strtoint(lStr);
     end;//if at least one character returned
     if result < 1 then //just incase there is a horrible error, e.g. 0
        result := 1;
     lStringList.Free;
     lProcess.Free;
end;
{$ELSE} //Not Darwin ... Assume Linux
uses
    classes,sysutils;
function GetLogicalCpuCount: Integer;
var lS: TStringList;
    lFilename: string;
    lLine,lnLines: integer;
begin
     result := 1;
     lFilename := '/proc/cpuinfo';
     if not fileexists(lFilename) then exit;
     lS:= TStringList.Create;
     lS.LoadFromFile(lFilename);
     lnLines := lS.Count;
     if lnLines > 0 then begin
        result := 0;
        for lLine := 1 to lnLines do
            if lS[lLine-1] = '' then
               inc(result);
     end;
     if result < 1 then
        result := 1;
     lS.Free;
end;
{$ENDIF} //If Darwin Else Linux

{$ELSE} //If UNIX ELSE NOT Unix - assume Windows
uses Windows;
function GetLogicalCpuCount: Integer;
var
  SystemInfo: _SYSTEM_INFO;
begin
  GetSystemInfo(SystemInfo);
  Result := SystemInfo.dwNumberOfProcessors;
end;
{$ENDIF}
end.

2. Create a custom threads class.

I use a separate unit for defining the behavior of the threads. Note that I am setting the "FreeOnTerminate" to false - so my program will need to dispose of each thread when it is done. This makes it easier to juggle multiple threads (if you set FreeOnTerminate to true and launch multiple very fast jobs it is possible that the thread will be released before your program checks whether the thread is completed - and checking a non-existent thread can cause an exception). By setting FreeOnTerminate to false I can ensure that each thread completed successfully.

unit mythreads;
{$mode objfpc}{$H+}
interface
uses
  Classes, SysUtils, Math;
type
  TData = array of double;
  PData = ^TData;
 Type
    TMyThread = class(TThread)
    private
    protected
      tPtr: PData;
      tstart,tfinish: integer;
      procedure Execute; override;
    public
      property Terminated;
      Constructor Create(lstart, lfinish: integer; var lPtr: PData);
    end;

implementation

  constructor TMyThread.Create(lstart, lfinish: integer; var lPtr: PData);
  begin
    FreeOnTerminate := False;
    tstart := lstart;
    tfinish := lfinish;
    tPtr := lPtr;
    inherited Create(false);
  end;
  procedure TMyThread.Execute;
  var
    i: integer;
  begin
    for i := tstart to tfinish do
        tPtr^[i] := power(i,0.5);
  end;

end.

3. Write the main program.

You need to add 'cthreads' to the main unit, not to unit with threads!

Note that there are two ways to determine whether all the threads have completed. You can use the in-built "waitFor" function - this works very nicely but on my OSX computer I noted that it refreshes only every 100ms. This is perfect for real world programs (we only use threading for computationally slow problems) and reduces thread overhead. However, for quick example benchmarks it can hide the benefits of threading (as operations require a minimum of 100ms regardless of the number of threads). Therefore, in this example I detect the threads terminated status every 2ms. This provides more accurate benchmark timing.

Remember to free each thread when you are done with it. Since we set "FreeOnTerminate := False" the program needs to do this explicitly.

Tips: in my Lazarus IDE I was not able to debug multi-threading applications if I don't use 'pthreads'. I have read that if you use 'cmem', the program works faster, but I strongly recommend you to check it for any particular case (my program hangs when I use 'cmem').

uses //    cmem,pthreads,
  cthreads, Classes, SysUtils, CustApp, MyThreads, Math, cpucount;

procedure DoUnThreaded (nValues: integer);
var
 dataArray: TData;
 i: integer;
 StartMS: double;
begin
     if (nValues < 1) then exit;
     StartMS:=timestamptomsecs(datetimetotimestamp(now));
     setlength(dataArray, nValues+1);//+1 since indexed 0..n-1
     for i:=1 to nValues do
         dataArray[i] := power(i,0.5);  ;
     Writeln('Serially processed '+inttostr(nValues)+' values in '+floattostr(timestamptomsecs(datetimetotimestamp(now))-StartMS)+'ms, with '+inttostr(nValues)+'^0.5 = '+floattostr(dataArray[nValues]));
end;

procedure DoThreading (nThreadsIn, nValues: integer);
var
 threadArray: array  of TMyThread;
 dataArray: TData;
 lData : PData;
 nThreads, i,lStart,lFinish: integer;
 StartMS: double;
begin
     if (nThreadsIn < 1) or (nValues < 1) then exit;
     nThreads := nThreadsIn;
     if  nThreads > nValues then nThreads := nValues;
     StartMS:=timestamptomsecs(datetimetotimestamp(now));
     setlength(threadArray,nThreads+1);//+1 since indexed 0..n-1
     setlength(dataArray, nValues+1);//+1 since indexed 0..n-1
     lData := @dataArray;
     lStart := 1;
     for i:=1 to nThreads do begin
         if i < nThreads then
            lFinish:=i*(nValues div nThreads)
         else
             lFinish:=  nValues;
         threadArray[i]:= TMyThread.Create(lStart, lFinish, lData);
         //Writeln('Thread '+inttostr(i)+' processing '+inttostr(lStart)+'..'+inttostr(lFinish));
         lStart := lFinish+1;
     end;
     for i:=1 to nThreads do if not ThreadArray[i].Terminated then Sleep(2);
     //for i:=1 to nThreads do threadArray[i].waitFor;  //appears to sleep for 100ms on OSX
     for i:=1 to nThreads do threadArray[i].Free;
     Writeln(inttostr(nThreads)+' Threads processed '+inttostr(nValues)+' values in '+floattostr(timestamptomsecs(datetimetotimestamp(now))-StartMS)+'ms, with '+inttostr(nValues)+'^0.5 = '+floattostr(dataArray[nValues]));
end;

begin
  Writeln('Computer reports '+inttostr(GetLogicalCpuCount)+' cores: probably optimal number of threads ');
  DoUnthreaded(10);
  DoThreading(1,10);
  DoThreading(2,10);
  DoThreading(4,10);
  DoThreading(8,10);
  DoUnthreaded(100000000);
  DoThreading(1,100000000);
  DoThreading(2,100000000);
  DoThreading(4,100000000);
  DoThreading(8,100000000);
end.

Results.

The results show there is a delay in creating threads, but that for large tasks multiple parallel threads outperform serial processing. Note that when the number of threads exceeds the number of cores (4 for this computer) there is little benefit for additional threads. You should not expect speed to scale perfectly with the number of threads: there is some overhead to threading and most moderns CPUs will operate slightly faster when there is only one intensive task than when running running tasks that tax multiple cores ('turboboost').

  • Computer reports 4 cores: probably optimal number of threads
  • Serially processed 10 values in 0ms, with 10^0.5 = 3.16227766016838
  • 1 Threads processed 10 values in 3ms, with 10^0.5 = 3.16227766016838
  • 2 Threads processed 10 values in 5ms, with 10^0.5 = 3.16227766016838
  • 4 Threads processed 10 values in 9ms, with 10^0.5 = 3.16227766016838
  • 8 Threads processed 10 values in 19ms, with 10^0.5 = 3.16227766016838
  • Serially processed 100000000 values in 10214ms, with 100000000^0.5 = 10000
  • 1 Threads processed 100000000 values in 10320ms, with 100000000^0.5 = 10000
  • 2 Threads processed 100000000 values in 5894ms, with 100000000^0.5 = 10000
  • 4 Threads processed 100000000 values in 3801ms, with 100000000^0.5 = 10000
  • 8 Threads processed 100000000 values in 3733ms, with 100000000^0.5 = 10000