Difference between revisions of "Talk:PHASH"

From Lazarus wiki
Jump to navigationJump to search
 
(6 intermediate revisions by one other user not shown)
Line 1: Line 1:
 
== Calculating Perceptual hashes (PHASH) and comparing image similarity) ==
 
== Calculating Perceptual hashes (PHASH) and comparing image similarity) ==
  
 +
Some thoughts - What I do when I get a list of similar pictures I create two bat files. The first renames the files so that each of the duplicates are sequential in a directory and include the names and maybe other information. I will use the size so that the larger file will be first and the smaller next so that when I use an image viewer I can just delete the second after verifying it is similar while walking the directory. Afterwords I run the restore bat file to move them back to the correct directories. I am still playing around with the algorithm. Currently re-sizing the image to 34x34 and dropping the outside edge in an attempt to remove borders and lettering. I have found that once the bit difference exceeds 10 the files are usually different. If it is under 8 then there is a good chance it is a duplicate.
  
Here is a program to calculate the perceptual hash for images different than a CRC or md5 hash a perceptual hash will be similar for similar  images. A slight change in the image (sharpen, blur, resize) will give perceptual hash that is similar where a CRC or MD5 hash will be completely different. I looked for a PHASH program for pascal and could not find any. I did find three methods for getting a perceptual hash. The first was using a DCT transform but the code was written in C and was not easy to link the program in. The second way is to resize it and make a hash based on whether each pixel was more or less then its next neighbor. The third way was to use image moments. ImageMagick(IM) uses this method but I found that after a certain threshold the images where radically different. Also calculating the differences required a lot of computing power as you had to square the difference of 42 numbers for each image comparison. I tried all three methods, only the first had good promise. The last one was not even part of the IM Lazarus interface.
+
Note JPG files are lossy format so you need to be aware that EACH time you save the file you loose some quality. Also note that just because a image is larger does not mean that the image is better. It could be that the other image was enlarged or adjusted and you find the quality of the larger image is worse than the smaller image. The best way to compare and select the best picture is for you to do the comparison.
  
This program uses ImageMagick and I want to express my appreciation for the work of the translators of this tool to work with Lazarus. However I had a lot of problems getting it to work properly. First because it has not kept up with the code there were some routines that did not work because they have been deprecated (and no longer available) in ImageMagick. Second because some of the newer routines were missing entirely. And third because I was using the Q8 version of ImageMagick and it appeared that the translation was for the Q16 version. PPixelPacket for instance assumes the colors are 16 bit and not
 
8 bit so I had to change how I handled the structure). I use 8 bit ImageMagick because 99.99999% of the pictures I wanted to process are 8 bit or less and the Q8 program is much faster then the Q16 version.
 
  
To get this to work under windows 7 you need to compile under 32bit mode (and call 32 bit ImageMagick) as the calling conventions for 64 bit are quite different and Lazarus does not support the 64 bit calling conventions. (this may change but keep it in mind).
+
== Unit tests ==
  
This program takes one main argument - the directory to parse for images (currently only JPG and GIF but you can add more). The program maintains a file 'yalls' that contains information for each image it finds in that directory - deleting files that no longer exists, adding new files and if the md5 does not match recalculating data for that file. (for speed you can have it skip the md5 check and just rewrite existing lines for files by placing SKIP before the directory argument.
+
Do you have any unit tests or statements of expected results?  Especially for things like the DCT Matrix. I found this code while looking for the pHash algorithm because I want to implement it in Nim. Your code is quite clear so I can fairly easily port it, I think. However it would be very handy to be able to verify that it actually performs the same as yours. [[User:Kwhitefoot|Kwhitefoot]] ([[User talk:Kwhitefoot|talk]]) 20:27, 4 December 2022 (CET)
Since I had a specific set of directories to parse if you do not use any argument it will parse those directories. The output file will consist of one line for each image it finds consisting of fields separated by blanks. The program expects the filenames are without spaces and would need modification otherwise. There was a memory leak with the program and after about 200,000 images it crashed. This may be gone but if the program dies just move the yall file to yalls and restart with JPEGINFO SKIP to restart from where it ended.
 
 
 
The output file contains:
 
<code>
 
  MD5 - 32 hex bytes of the MD5 sum of the entire file.
 
  Signature - 64 hex bytes that comes from IM identify that is I think the md5 of the image only
 
  PHASH - 16 hex bytes of the DCT hash of the image
 
  Date of the file in YYYYMMDDHHMMSS format (from ImageMagick identify)
 
  Size in bytes of the file
 
  Bitdepth of the image
 
  Width of the image
 
  Height of the image
 
  Filename of the image
 
  Complete fileid of the image
 
  Example:
 
  md5      fb54e12c44d2b14d4d3266251109799e
 
  signature 2318170444932902d0541895584df4c2f8f38ad8b1d4dad899870760817422f2
 
  phash    CCFB4EAEF3F30C0D
 
  date      20151001010847
 
  size      901
 
  bitdepth  8
 
  width    32
 
  width    32
 
  filename  01.gif
 
  fid      F:\CCC\01.gif
 
</code>
 
I will add another program later if there is interest that calculates phash similarity. This program was written on a windows7 platform.
 
Basically to get phash similarity you count the number of bits that are different. If that number is 0 the pictures are likely to be the same,
 
if it is over 10 the pictures are likely to be different, otherwise the pictures may be similar to some degree.   
 
<syntaxhighlight>
 
program JPEGInfo;
 
 
 
{$APPTYPE CONSOLE} {$R-}
 
 
 
uses strutils,sysutils,classes, md5,magick_wand, ImageMagick,DateUtils;
 
 
 
 
 
{ Bytes = array[0..0] of Byte;}
 
{ LongWords = array[0..255] of Byte;}
 
 
 
 
 
type
 
  clr = record
 
    red:byte;
 
    green:byte;
 
    blue:byte;
 
    opacity:byte;
 
  end;
 
var
 
allfile:text;
 
infile:text;
 
donedir,donefname, inrec:string;
 
skipped,deleted,added,modified,count:longword;
 
wand: PMagickWand;
 
updatefileexists, skipmd5check,notstarted,readnextrec :boolean;
 
dctmatrix : array[0..7,0..31] of real;
 
status: MagickBooleanType;
 
starttime:tdatetime;
 
filename:pchar;
 
udir,ufname:string;
 
Size : Integer;
 
timed:longint;
 
DirName: string;
 
 
 
procedure ThrowWandException(wand: PMagickWand);
 
var
 
  description: PChar;
 
  severity: ExceptionType;
 
begin
 
  description := MagickGetException(wand, @severity);
 
  WriteLn(Format('An error occurred. Description: %s', [description]));
 
  description := MagickRelinquishMemory(description);
 
  Abort;
 
end;
 
 
 
function pad(msg:string;len:byte):string;
 
var par : byte;
 
begin
 
  result := msg;
 
  if length(msg) > len then result := copy(msg,1,len);
 
  while length(result) < len do result := result + ' ';
 
  for par := 1 to len do if ord(result[par]) < 32 then result[par] := ' ';
 
end;
 
 
 
{read the next old file record into inrec and fileid into donefname,donedir}
 
procedure getnextok;
 
var i: integer;
 
 
 
procedure skipitem;
 
begin
 
  while (i < length(inrec)) and (inrec[i] <> ' ') do inc(i);
 
end;
 
 
 
procedure skipspaces;
 
begin
 
  while inrec[i] = ' ' do inc(i)
 
end;
 
begin
 
  if eof(infile) then donedir := chr(255)
 
  else begin
 
      readln(infile,inrec);
 
      i := 1;
 
      skipitem; {md5}  skipspaces;  {fb54e12c44d2b14d4d3266251109799e}
 
      skipitem; {md6}  skipspaces;  {2318170444932902d0541895584df4c2f8f38ad8b1d4dad899870760817422f2}
 
      skipitem; {phash} skipspaces;  {CCFB4EAEF3F30C0D}
 
      skipitem; {date} skipspaces;  {20151001010847}
 
      skipitem; {size} skipspaces;  {        901}
 
      skipitem; {depth} skipspaces;  {  8}
 
      skipitem; {width} skipspaces;  {    32}
 
      skipitem; {height} skipspaces; {    32}
 
      skipitem; {name} skipspaces;  {01.gif                  }
 
      donefname:= copy(inrec,i,999); {F:\CCC\01.gif}
 
      i := rpos('\',donefname);
 
      if i = 0 then writeln('error reading line from yall:',inrec);
 
      donedir := StringReplace(uppercase(copy(donefname,1,i)),'\',' ',[rfReplaceAll]);;
 
      donefname := uppercase(copy(donefname,i+1,9999));
 
  end;
 
  readnextrec := false;
 
end;
 
function GetDif(time:longint):string;
 
var part: longint; t:longint;
 
begin
 
  result := '';
 
  t := time;
 
  if time > 86400 then begin
 
      part := t div 86400;
 
      t := t mod 86400;
 
      result := result + format('%.2d',[part])+' '; end;
 
  if time >  3600 then begin
 
      part := t div  3600;
 
      t := t mod  3600;
 
      result := result + format('%.2d',[part])+':';  end;
 
  if time >    60 then begin
 
      part := t div    60;
 
      t := t mod    60;
 
      result := result + format('%.2d',[part])+':'; end;
 
  if time >    0 then
 
      result := result + format('%.2d',[t]);
 
end;
 
 
 
function gethash:string;
 
var
 
  intermediate : array[0..7,0..31] of real;
 
  i,j,r,c,k:byte;
 
  img:array[0..31,0..31] of byte;
 
  dctresult:array[0..7,0..7] of real;
 
  pimg: Pimage;
 
  pack:record case integer of
 
    1:( pkt:PPixelPacket);
 
    2:( co:^clr);
 
  end;
 
  hash,one:qword; med:DOUBLE;
 
begin
 
  result := 'xxxxxxxxxxxxxxxx'; { in case an error occurs}
 
  try
 
      {Resize the image to 32x32. If it is my test case file
 
      'f:\ccc\01.gif' then it is already at the correct size and I
 
        found that if imagemagic resizes it the file gets changed ---
 
        I did not want that, it might be faster to grayscale the image
 
        first then resize but not if you use MagickQuantizeImage too slow}
 
      if uppercase(filename) <> 'F:\CCC\01.GIF' then
 
        MagickResizeImage(wand, 32, 32, LanczosFilter, 1.0);
 
      {now we convert the image to grayscale - this procedure is similar to
 
        MagickQuantizeImage(wand, 256, GRAYColorspace, iszero, iszero, iszero);
 
        but is a LOT faster. BTW I don't know why I had to use variables
 
        for this function. The conversion numbers were selected to be
 
        close to what MagickQuantizeImage used. Note I had an issue
 
        here. Under Q8 IM GetAuthenticPixels it returns an array of
 
        byte values for colors but the lazarus interface mapped it to word sizes.
 
        You could get an entire array of values with getauthenticpixels but
 
        the speed won't be much different so I didn't bother.
 
        We dump the results into the img array}
 
      pimg := GetImageFromMagickWand(wand);
 
      for j := 0 to 31 do begin
 
        for i := 0 to 31 do begin
 
            pack.pkt := GetAuthenticPixels(pimg, i, j, 1, 1, nil);
 
            img[i,j] :=round(pack.co^.red*0.072+
 
                            pack.co^.green*0.715+
 
                            pack.co^.blue*0.213);
 
        end;
 
      end;
 
      {results are in the img array. Now do matrix multiplcation
 
          DCTMATRIX * img * transposed(DCTMATRIX)
 
      since we only need the top 8x8 array we only do the
 
      calculations for that part}
 
      {intermediate = DCTMATRIX*IMG}
 
      fillchar(intermediate,sizeof(intermediate),0);
 
      for r := 0 to 7 do
 
        for c := 0 to 31 do
 
            for k := 0 to 31 do
 
                intermediate[r,c] := intermediate[r,c] +
 
                                    dctmatrix[r,k]*img[k,c];
 
      {dctresult  = intermediate*transposed(DCTMATRIX)}
 
      fillchar(dctresult,sizeof(dctresult),0);
 
      for r := 0 to 7 do
 
        for c := 0 to 7 do
 
            for k := 0 to 31 do
 
                dctresult[r,c] := dctresult[r,c] +
 
                intermediate[r,k]*dctmatrix[c,k]{inverted dctmatrix};
 
      {now we need to calculate the median value. Note the first value
 
      (0,0) is much too large and affects the median value too much.
 
      To fix it is set to 0 but perhaps some other solution can be done
 
      such as do a log on the value}
 
      med := 0;
 
      dctresult[0,0] := 0;
 
      for r := 0 to 7 do
 
        for c := 0 to 7 do
 
            med := med + dctresult[r,c];
 
      med := med / 64;
 
      {the hash is calculated by setting the bit of
 
      each value greater than the median}
 
      one := 1;
 
      hash := 0;
 
      for r := 0 to 7 do
 
        for c := 0 to 7 do begin
 
            if dctresult[r,c] > med then inc(hash,one);
 
            one := one shl 1;
 
        end;
 
      one := 1;
 
      {return the phash}
 
      result := inttohex(hash,16);
 
  finally
 
  end;
 
end;
 
 
 
procedure processfile(dir:string;fname:string);
 
var
 
  F: file;
 
  idents:pchar;
 
  i:integer;
 
  md5,md6, date, fsize,depth,imageheight,imagewidth,fno:string;
 
  P: Pointer;
 
begin
 
  inc(count);
 
  if (count mod 1000) = 0 then begin
 
      timed := SecondsBetween(Now,starttime);
 
      writeln('at count ',count,' E:'+ GetDif(timed)+' D:',deleted,' A:',added,' M:',modified,' S:',skipped,' ',dir,fname);
 
  end;
 
  filename := pchar(dir+fname);
 
  udir := StringReplace(uppercase(dir),'\',' ',[rfReplaceAll]);;
 
  ufname := uppercase(fname);
 
  if updatefileexists then begin
 
  if readnextrec then getnextok;
 
  while (udir > donedir) or ((udir = donedir) and (ufname > donefname)) do begin getnextok; inc(deleted); end;
 
  if skipmd5check then
 
      if (udir+ufname) = (donedir+donefname) then begin
 
            inc(skipped);
 
            writeln(allfile,inrec);
 
            readnextrec := true;
 
        exit;
 
      end;
 
    end;
 
  TRY
 
      Assign(F,FileName);
 
      FileMode:=fmOpenRead + fmShareDenyNone;
 
      Reset(F, 1);
 
      Size := FileSize(F);
 
      GetMem(P, Size);
 
      BlockRead(F, P^, Size);
 
      md5 := md5print(md5buffer(p^,Size));
 
  finally
 
      FreeMem(P);
 
      CloseFile(F);
 
  end;
 
  if updatefileexists then begin
 
      if (udir+ufname) = (donedir+donefname) then begin
 
        if copy(inrec,1,32) = md5 then begin
 
            inc(skipped);
 
            writeln(allfile,inrec);
 
            exit;
 
        end;
 
        inc(modified);
 
      end
 
      else inc(added);
 
  end;
 
  if notstarted then begin
 
      notstarted := false;
 
      writeln('Started at line:',count,' with file:',filename);
 
  end;
 
  try
 
      wand := NewMagickWand;
 
      status := MagickReadImage(wand, filename);
 
      if (status = MagickFalse) then ThrowWandException(wand);
 
      idents  := magickidentifyimage(wand);
 
      md6  := copy(idents,pos('signature:',idents)+11,64);
 
      i := pos('date:modify:',idents)+13;
 
      if i = 0 then i := pos('date:create:',idents)+13;
 
      date  := copy(idents,i  ,4)+copy(idents,i+ 5,2)+copy(idents,i+ 8,2)+
 
              copy(idents,i+11,2)+copy(idents,i+14,2)+copy(idents,i+17,2);
 
      fsize := copy(idents,pos('Filesize:',idents)+10,64);
 
      i := POS(Chr(10),FSIZE);
 
      if i > 0 then fsize := copy(fsize,1,i-1);
 
      i := pos('Depth:',  idents)+7;
 
      depth := copy(idents, i,1);
 
      i := pos('Geometry:',idents)+10;
 
      imageheight    := copy(idents, i,99);
 
      i := pos('x',imageheight);
 
      imagewidth := copy(imageheight,i+1,99);
 
      imageheight := copy(imageheight,1,i-1);
 
      i := pos('+',imagewidth);
 
      imagewidth := copy(imagewidth,1,i-1);
 
      fno := fname;
 
      if length(fno) < 22 then fno := pad(fno,25);
 
      writeln(allfile,md5,' ',md6,' ',gethash,' ',date:14,' ',
 
          size:12, depth:2,imageheight:7,imagewidth:7,' ',
 
          fno,' ',filename);
 
    finally
 
      wand := DestroyMagickWand(wand);
 
    end;
 
end;
 
 
 
// Recursive procedure to build a list of files
 
procedure FindFiles(StartDir: string);
 
var
 
  SR: TSearchRec;
 
  IsFound: Boolean;
 
begin
 
  if StartDir[length(StartDir)] <> '\' then  StartDir := StartDir + '\';
 
  { Build a list of the files in directory StartDir
 
    (but not the directories!)                        }
 
  IsFound := FindFirst(StartDir+'*.jpg', faAnyFile-faDirectory, SR) = 0;
 
  while IsFound do begin
 
    processfile(StartDir,SR.Name);
 
    IsFound := FindNext(SR) = 0;
 
  end;
 
  FindClose(SR);
 
  IsFound := FindFirst(StartDir+'*.gif', faAnyFile-faDirectory, SR) = 0;
 
  while IsFound do begin
 
    processfile(StartDir,SR.Name);
 
    IsFound := FindNext(SR) = 0;
 
  end;
 
  FindClose(SR);
 
  // Build a list of subdirectories
 
  IsFound := FindFirst(StartDir+'*.*', faDirectory, SR) = 0;
 
  while IsFound do begin
 
    if ((SR.Attr and faDirectory) <> 0) and (SR.Name[1] <> '.') then
 
      FindFiles(StartDir + SR.Name);
 
    IsFound := FindNext(SR) = 0;
 
  end;
 
  FindClose(SR);
 
end;
 
 
 
{$R *.res}
 
procedure makedct;
 
var c1:real; ii,p,q: byte;
 
begin
 
  c1 := 1/sqrt(32);
 
  for ii := 0 to 31 do dctmatrix[0,ii] := c1;
 
  c1 := c1*2;
 
  for p := 1 to 7 do
 
      for q := 0 to 31 do
 
          dctmatrix[p,q] := c1 * cos(pi/64*p*(2*q+1));
 
end;
 
 
 
begin
 
  starttime := now;  {keep track of time}
 
  {have we started processing files as opposed to copy old recs}
 
  notstarted := true;
 
  skipmd5check := false; {on copy skip md5 check}
 
  DirName := ParamStr(1); {did he pass a filename}
 
  if length(dirname) > 3 then
 
    if copy(upcase(dirname),1,4) = 'SKIP' then begin
 
        dirname := paramstr(2);
 
        skipmd5check := true;
 
  end;
 
  count := 0;
 
  deleted := 0; modified := 0; added := 0; skipped := 0;
 
  {allow access to file while writing it}
 
  FileMode:=fmOpenwrite + fmShareDenyNone;
 
  assign(allfile,'yall');
 
  rewrite(allfile);
 
  readnextrec := true;
 
  updatefileexists := fileexists('yalls');
 
  if updatefileexists then begin
 
    {allow access to file while reading it}
 
    FileMode:=fmOpenwrite + fmShareDenyNone;
 
    assign(infile,'yalls');
 
    reset(infile);
 
    getnextok;
 
  end;
 
 
 
  MagickWandGenesis;
 
  makedct;  {make DCT matrix (or at least what is needed)}
 
  {This is a set of directories that I desired to maintain}
 
  if (dirname = 'F:\') or (dirname = 'F:\') or
 
    (length(dirname) < 3) then  begin
 
    FindFiles('F:\CCC\');
 
    FindFiles('F:\CD\');
 
    FindFiles('F:\CHECK\');
 
    FindFiles('F:\QDISK\');
 
  end
 
  else FindFiles(DirName); {or use a user supplied directory name}
 
  writeln('found:',count,' files ',' Deleted:',deleted,' Added:',added,' Changed:',modified,' SKIPPED:',skipped);
 
  close(allfile);
 
  if updatefileexists then close(infile);
 
  MagickWandTerminus;
 
  { I get paranoid and like to keep a few versions of the files around}
 
  if fileexists('yalls.005') then deletefile('yalls.005');
 
  if fileexists('yalls.004') then renamefile('yalls.004','yalls.005');
 
  if fileexists('yalls.003') then renamefile('yalls.003','yalls.004');
 
  if fileexists('yalls.002') then renamefile('yalls.002','yalls.003');
 
  if fileexists('yalls.001') then renamefile('yalls.001','yalls.002');
 
  if fileexists('yalls') then renamefile('yalls','yalls.001');
 
  renamefile('yall','yalls');
 
end.
 
</syntaxhighlight>
 
--[[User:CampLCC|CampLCC]] 05:45, 3 October 2015 (CEST)
 

Latest revision as of 21:27, 4 December 2022

Calculating Perceptual hashes (PHASH) and comparing image similarity)

Some thoughts - What I do when I get a list of similar pictures I create two bat files. The first renames the files so that each of the duplicates are sequential in a directory and include the names and maybe other information. I will use the size so that the larger file will be first and the smaller next so that when I use an image viewer I can just delete the second after verifying it is similar while walking the directory. Afterwords I run the restore bat file to move them back to the correct directories. I am still playing around with the algorithm. Currently re-sizing the image to 34x34 and dropping the outside edge in an attempt to remove borders and lettering. I have found that once the bit difference exceeds 10 the files are usually different. If it is under 8 then there is a good chance it is a duplicate.

Note JPG files are lossy format so you need to be aware that EACH time you save the file you loose some quality. Also note that just because a image is larger does not mean that the image is better. It could be that the other image was enlarged or adjusted and you find the quality of the larger image is worse than the smaller image. The best way to compare and select the best picture is for you to do the comparison.


Unit tests

Do you have any unit tests or statements of expected results? Especially for things like the DCT Matrix. I found this code while looking for the pHash algorithm because I want to implement it in Nim. Your code is quite clear so I can fairly easily port it, I think. However it would be very handy to be able to verify that it actually performs the same as yours. Kwhitefoot (talk) 20:27, 4 December 2022 (CET)