LGenerics

From Lazarus wiki
Revision as of 08:56, 14 April 2022 by Alextpp (talk | contribs) (JSON patch)
Jump to navigationJump to search

About

Collection of generic algorithms and data structures for Free Pascal.

Requires: FPC 3.2+, Lazarus 1.9+.

Author: A.Koverdyaev (avk)

License: Apache License 2.0

Features

Implemented primitives

  • stack (unit lgStack)
  • queue (unit lgQueue)
  • deque (unit lgDeque)
  • vector (unit lgVector)
  • vector of bits (unit lgVector)
  • priority queue based on binary heap (unit lgPriorityQueue)
  • priority queue with key update and melding based on pairing heap (unit lgPriorityQueue)
  • sorted list (unit lgList)
  • hashed list - array based list with the ability to fast search by key (unit lgList)
  • hashset (unit lgHashSet)
  • fine-grained concurrent hashset (unit lgHashSet)
  • sorted set (unit lgTreeSet)
  • set of arbitrary size (unit lgUtil, TGSet)
  • hash multiset (unit lgHashMultiSet)
  • fine-grained concurrent hashmultiset (unit lgHashMultiSet)
  • sorted multiset (unit lgTreeMultiSet)
  • hashmap (unit lgHashMap)
  • fine-grained concurrent hashmap (unit lgHashMap)
  • sorted map (unit lgTreeMap)
  • hash multimap (unit lgMultiMap)
  • tree multimap (unit lgMultiMap)
  • list miltimap (unit lgMultiMap)
  • bijective map (unit lgBiMap)
  • sparse 2D table (unit lgTable2D)
  • disjoint set (unit lgHashSet)
  • AVL tree (unit lgAvlTree)
  • red-black tree (unit lgRbTree)
  • some treap variants (unit lgTreap)
  • general rooted tree (unit lgRootTree)
  • sparse labeled undirected graph (unit lgSimpleGraph)
  • sparse labeled directed graph (unit lgSimpleDigraph)
  • lite containers based on advanced records
  • extended IEnumearble interface - filtering, mapping, etc.

Algorithms on graphs

  • core functions:
    • vertices/edges addition/removal/query/enumeration, edge contraction, degree
    • load/save to own binary format, primitive export to DOT format
  • connectivity:
    • connected/strongly connected components, bipartite detection, degeneracy, k-core
    • articulation points, bridges, biconnected components
    • edge-connectivity
  • traversals:
    • BFS/DFS traversals with visitors,
    • cycle/negative cycle detection,
    • topological sort
  • operations:
    • induced subgraphs, complement, reverse, union, intersect, symmetric difference,
  • chordality testing
  • planarity testing: FMR Left-Right Planarity algorithm
  • distance within graph:
    • eccentricity, radius, diameter, center, periphery
  • matching:
    • maximum cardinality matching on bipartite/arbitrary graphs
    • minimum/maximum weight matching on bipartite graphs
  • dominators in flowgraps: simple iterative and Semi-NCA algorithms
  • some suggestions for NP-hard problems:
    • maximum independent set, maximal independent sets enumeration
    • maximum clique, cliques enumeration
    • minimum vertex cover, minimal vertex covers enumeration
    • vertex coloring, approximations and exact
    • minimum dominating set
    • Hamiltonian cycles and paths
    • local search TSP approximations, BnB TSP solver
  • minimum spanning trees: Prims's and Kruskal's algorithms
  • single source shortest paths:
    • Dijkstra with pairing heap, Bellman-Ford-Moor with Tarjan's subtree disassembly(BFMT)
  • single pair shortest paths:
    • Dijkstra with binary heap, BFMT, bidirection Dijkstra, A*, NBA*
  • all pairs shortest paths:
    • Floyd–Warshall, Johnson, BFMT
  • networks:
    • maximum flow: push/relabel, capacity scaling Dinitz
    • minimum-cost flow: Busacker-Gowen, cost scaling push/relabel algorithm
    • global minimum cut: Stoer–Wagner, Nagamochi-Ibaraki

Algorithms on arrays and vectors

(mostly unit lgArrayHelpers)

  • reverse, right/left cyclic shifts
  • permutations
  • binary search
  • N-th order statistics
  • inversion counting
  • distinct values selection
  • quicksort
  • introsort
  • dual pivot quicksort
  • mergesort
  • timsort (unit lgMiscUtils)
  • counting sort
  • radix sort
  • translation of Orson Peters' PDQSort algorithm
  • static segment tree
  • longest increasing subsequence
  • ...

Algorithms on strings and sequences

(units lgStrHelpers, lgSeqUtils)

  • exact string matching
    • Boyer-Moore string matching algorithm(in Fast Search variant), case sensitive and case insensitive(unit lgStrHelpers)
    • Boyer-Moore-Horspool-Raita algorithm(unit lgStrHelpers)
  • longest common subsequence of two sequences:
    • reducing the LCS problem to LIS
    • Kumar-Rangan algorithm for LCS
    • Myers algorithm for LCS
  • the Levenshtein distance:
    • simple DP algorithm
    • modified Berghel-Roach algorithm
    • Myers bit-vector algorithm with cut-off heuristic
  • LCS distance:
    • Myers algorithm for LCS distance
  • fuzzy string matching(k differences)
    • Ukkonen EDP algorithm
  • fuzzy string matching with preprocessing(something similar to fuzzywuzzy)

JSON Patch

Class TJsonPatch allows to implement JSON Patch (RFC 6902).

JSON Patch is a JSON document format (media type is application/json-patch+json) for expressing a sequence of operations applied to a target JSON document; it is always an array of objects, each representing a single operation, supported set of operations: ADD, REMOVE, REPLACE, MOVE, COPY, and TEST. For example, for this document:

    {
        "foo": [1, 3],
        "boo": ["bar", "baz"]
    }

applying this patch

    [
        {"op": "add", "path": "/foo/1", "value": "fizz"},
        {"op": "replace", "path": "/boo/1", "value": "qux"},
        {"op": "add", "path": "/ozz", "value": "zoo"}
    ]

will have the following result:

    {
        "foo": [1, "fizz", 3],
        "boo": ["bar", "qux"],
        "ozz": "zoo"
    }

The TJsonPatch class also contains a POC implementation of the Diff utility, which generates a patch by comparing source and target documents. For example, for this source:

    {
        "a": 1,
        "b": null,
        "c": ["test", "plop"]
    }

and for this target:

    {
        "a": 6,
        "c": ["test", "flop"],
        "d": "baz"
    }

it will generate a patch like this:

    [
        {"op": "replace", "path": "/a", "value": 6},
        {"op": "remove", "path": "/b"},
        {"op": "replace", "path": "/c/1", "value": "flop"},
        {"op": "add", "path": "/d", "value": "baz"}
    ]

Other

  • non-cryptogarphic hashes (unit lgHash):
    • Yann Collet's xxHash32, xxHash64
    • Austin Appleby's MurmurHash2, MurmurHash2A, MurmurHash3_x86_32, MurmurHash64A
  • brief and dirty implementation of futures concept (unit lgAsync)
  • brief channel implementation (unit lgAsync)
  • brief implementation of thread pool (unit lgAsync)
  • 128-bit integers (unit lgInt128)
  • JSON validator/parser/generator(unit lgJson)
  • JSON Patch/Diff(unit lgJson)
  • Eisel-Lemire fast string-to-double conversion algorithm(unit lgJson)
  • Ryū double-to-string conversion algorithm(unit lgJson)

Perfomance

Containers

Hash maps

There is a wonderful benchmark from BeniBela, which also covers hashmaps from LGenerics.

Algorithms on arrays and vectors

Sorting

The benchmark sorts 1000000 integers and calculates the number of CPU cycles spent on one element and involves:

  • QuickSort_ItemList_Context from unit SortBase (Rtl/SortBase)
  • TOrderingArrayUtils.Sort from Fcl-Stl.GArrayUtils (Fcl-Stl)
  • TArrayHelper.Sort from Rtl-Generics.Generics.Collections (Rtl-Generics)
  • std::sort from libstdc++ (std::sort)
  • std::stable_sort from libstdc++ (std::stable_sort)
  • TGOrdinalArrayHelper.QuickSort (LG/QuickSort)
  • TGOrdinalArrayHelper.IntroSort (LG/IntroSort)
  • TGOrdinalArrayHelper.DualPivotQuickSort (LG/DualPivotQuickSort)
  • TGOrdinalArrayHelper.PDQSort (LG/PDQSort)
  • TGComparableArrayHelper.MergeSort (LG/MergeSort)
  • TGComparableTimSort.Sort (LG/TimSort)
  • TGOrdinalArrayHelper.Sort (LG/Sort)
  • A negative value indicates that the algorithm exhibits quadratic behavior.

sort bench 1000000.png

Algorithms on graphs

For performance comparison the AGraph library (taken from here) was chosen as a reference implementation. All tests were compiled for x86 and run on win64 machine (unfortunately the 64-bit version of AGraph crashes on these tests).

So, AGraph vs LGenerics:

problem graph AGraph time(ms) LGenerics time(ms) notes
BFS traversal undirected, V=500000, E=4000000 640 109
BFS traversal directed, V=500000, E=4000000 484 94
Connected components undirected, V=500000, E=4000000 764 0 If no edges were removed, TGSimpleGraph always knows the number and ownership of connected components
Strongly connected components directed, V=500000, E=4000000 1060 265
Single-source shortest path undirected, V=500000, E=4000000 3011 842
Single-source shortest path directed, V=500000, E=4000000 2153 718
Single-pair shortest path undirected, V=500000, E=4000000 2777 31
Single-pair shortest path directed, V=500000, E=4000000 1825 47
Minimum spanning tree undirected, V=500000, E=4000000 5974 640
Maximum flow dense, V=1000, E=499500 29499 47
Maximum flow sparse, V=64000, E=512000 3339 63
Planarity test undirected, V=219086, E=657252 1419 172
Maximum clique DIMACS brock200_2.clq, V=200, E=9876 84210 16
Maximum clique DIMACS brock200_4.clq, V=200, E=13089 Cancelled(timeout 1 h) 312

Algorithms on strings and sequences

It was curious to compare the performance of the SimRatioLevEx() function (which is inspired by FuzzyWuzzy) with some incarnations of the FuzzyWuzzy (listed here) on benchmark datasets. Disclamer: SimRatioLevEx() does not reproduce FuzzyWuzzy, but it does some things in a similar way, in particular, SimRatioLevEx() in smTokenSetEx mode and token_set_ratio() do roughly the same job.

It seems the C++ version only supports single-byte strings, so only compared to the single-byte version of SimRatioLevEx():

       Dataset            SimRatioLevEx() token_set_ratio()
     
    Short/big_dist             1154              6440
    Short/small_dist            967              3020
    Medium/big_dist             811              3450
    Medium/small_dist           702              1470
    Long/big_dist              1966             15000
    Long/small_dist            1061              2250

The numbers indicate the run time in milliseconds; the C++ version was compiled with gcc-8.1.0 with options -O3 -m64; the Pascal version was compiled with FPC-3.3.1-9941-g8e6e9bbf33, -O3. The benchmark was run on a Windows x64 machine.

The Go version, on the contrary, always works with Unicode strings:

       Dataset          SimRatioLevExUtf8() TokenSetRatio()
     
    Short/big_dist             2143             18705
    Short/small_dist           1593              2224
    Medium/big_dist            1266             15062
    Medium/small_dist           853              1742
    Long/big_dist              3853             79851
    Long/small_dist            1269              3126

Go version: go1.10.4 linux/amd64; FPC-3.3.1-10683-g2a19e152b7 -O3. The benchmark was run on a virtual Linux machine.

Other

JSON

A small performance comparison of the lgJson.TJsonNode class with some other implementations, the benchmark was compiled for x86_64 and run on a win64 machine.

description FpJson JsonTools lgJson notes
Average parsing speed on a set of JSON samples from the example/json/parse_bench folder, MB/s 34,45 46,11 68,12
Average validating speed on a set of JSON samples from the example/json/parse_bench folder, MB/s 34,4 46,02 349,48 FpJson does not provide any function to validate JSON, so a parser was used
Recursive traversal(using the built-in enumerator) of the document of moderate size(34 KB) with all values read(50000 repeats), milliseconds 8315 6802 639
Search in a moderately sized document using path(750000 repeats), milliseconds 8549 5398 873
Dynamic creation of a small document and reading it as formatted JSON(150000 repeats), milliseconds 7597 4150 1903
Saving a moderate size document to a stream(10000 repeats), milliseconds 4820 4587 827
RFC 8259 compliance test, 295 test files from test/json_testset/testset 5 failed 23 failed all passed

Download

GitHub: https://github.com/avk959/LGenerics