Generating initialised data

From Lazarus wiki
Revision as of 13:29, 6 March 2015 by Jonas (talk | contribs)

History

Originally, the only way to generate initialised data in the compiler was by directly generating assembler statements in the parser (e.g. while handling typed constants) or the code generator (e.g. for RTTI and VMTs). In particular, the typed constant parser almost directly translated all definitions into tai* assembler constant entities.

The typed constant parser was made more generic for the JVM port. This was required because, for example, records had to be implemented as classes on that platform, and it is not possible to define a class constant with pre-initialised fields in Java bytecode.

The existing typed constant parsing code (from compiler/ptconst.pas) was refactored into two classes (compiler/ngtcon.pas): a base class called ttypedconstbuilder and its subclass tasmlisttypedconstbuilder. The base class mostly handles the parsing of the typed constants, while the subclass handles the generation of the initialised data into the assembler list (tai*). Additionally, a tnodetreetypedconstbuilder subclass was implemented to handle data initialisation through the generation of parser nodes such as explicit assignment statements. These nodes are then added to the initialisation code of the current unit, which is used to initialise complex data structures on the JVM platform.

This extra functionality by itself did not solve the issue of generating initialised data outside typed constant definitions from the original program. In part this was not necessary for the JVM port, as e.g. RTTI and VMTs are automatically deduced by the JVM from the bytecode. It was also alleviated by the introduction of the compiler/symcreat.str_parse_typedconst() routine, which can be passed a string of (potentially compiler-internally generated) Pascal code and transforms it into initialised data. It requires the expression to be valid Pascal though, which means that no compiler-internal identifiers can be used and that all used types must already be defined and accessible via a valid Pascal identifier.

The need for typed initialised data

Unlike the JVM port, the LLVM port should be able to handle the same Pascal code that is supported on so-called "native" platforms. Its byte code can represent arbitrarily structured initialised data, but unlike other targets it requires type information to be attached to every bit of data. The existing tasmlisttypedconstbuilder therefore had to be extended again, and furthermore all initialised data generated in other places also needs to get type information.

This has been implemented in the compiler/aasmcnst.pas unit via the ttai_typedconstbuilder class and its descendants. All typed constant parsing (for non-JVM platforms) remained in the tasmlisttypedconstbuilder, but the generation of the initialisation data itself is now handled via the ttai_typedconstbuilder. Since this latter class does not depend on Pascal code as input, it can also be used more easily elsewhere in the compiler to generate initialised data.

Generating typed initialised data

Note that everything that follows only holds for non-JVM targets. Due to its specific nature, the JVM target has to be handled differently and is out of scope of the rest of this document.

The old method of directly generating tai* entities as initialised data still works for non-LLVM targets. This means that existing code shared by all targets can be converted to the new approach gradually, rather than everything having to be switched over at once. New shared code should however use either the aforementioned compiler/symcreat.str_parse_typedconst() or the ttai_typedconstbuilder to ensure compatibility with the LLVM and potentially other "high level" targets.

A major advantage of the new way is also that it will automatically insert padding bytes where required for alignment, which was a major source of errors in the old approach.
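To make the alignment remark concrete, here is a minimal stand-alone sketch of the arithmetic behind padding. It is not the compiler's code: the padding_for function and the offsets used are made up for illustration.

```pascal
program padding_sketch;
{$mode objfpc}

{ Hypothetical helper: number of padding bytes needed to move "offset" up
  to the next multiple of "alignment" (alignment must be positive). }
function padding_for(offset, alignment: longint): longint;
begin
  result := (alignment - (offset mod alignment)) mod alignment;
end;

var
  offset: longint;
begin
  offset := 0;
  { a byte field occupies offset 0 }
  inc(offset, 1);
  { a 4-byte field with 4-byte alignment cannot start at offset 1 }
  writeln('padding before 32-bit field: ', padding_for(offset, 4));
  inc(offset, padding_for(offset, 4) + 4);
  writeln('offset after 32-bit field: ', offset);
end.
```

With the old approach every caller had to emit such padding bytes by hand; forgetting them shifted all data that followed.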

Setup

To start generating an initialised data entry, create a new specialised ttai_typedconstbuilder instance:

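var
  tcb: ttai_typedconstbuilder;
begin
  { instantiate the builder class selected for the current target }
  tcb:=ctai_typedconstbuilder.create;
  ...
  { release the builder once its data has been copied elsewhere }
  tcb.free;
end;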

ctai_typedconstbuilder is a class reference variable that holds the ttai_typedconstbuilder descendant type appropriate for the current target.
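For readers unfamiliar with class reference variables, the following stand-alone sketch shows the general pattern. The names tbuilder, ttargetbuilder and cbuilder are hypothetical stand-ins and not the compiler's declarations.

```pascal
program classref_sketch;
{$mode objfpc}

type
  { stand-in for ttai_typedconstbuilder }
  tbuilder = class
    constructor create; virtual;
    function describe: string; virtual;
  end;
  tbuilderclass = class of tbuilder;

  { stand-in for a target-specific descendant }
  ttargetbuilder = class(tbuilder)
    function describe: string; override;
  end;

constructor tbuilder.create;
begin
  inherited create;
end;

function tbuilder.describe: string;
begin
  result := 'generic builder';
end;

function ttargetbuilder.describe: string;
begin
  result := 'target-specific builder';
end;

var
  { stand-in for ctai_typedconstbuilder }
  cbuilder: tbuilderclass;
  b: tbuilder;
begin
  { a target would register its own descendant during start-up }
  cbuilder := ttargetbuilder;
  { shared code only ever refers to the class reference variable }
  b := cbuilder.create;
  try
    writeln(b.describe);
  finally
    b.free;
  end;
end.
```

Because the constructor is virtual, creating an instance through the class reference yields the descendant that was assigned to it, so the program prints "target-specific builder".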

Generating data

All generated data consists of two aspects: the data itself and its type. The data is represented by tai* subclasses, just like before. The type is represented by a tdef subclass.

There are three kinds of initialised data:

  • Fundamental or simple constants. Examples are ordinal constants, floating point numbers and pointer constants (including addresses of static variables and procedures).
  • Composite expressions. The result type of such an expression is a fundamental/simple constant, but the expression that initialises it is more complex. Examples are expressions that contain type conversions, array indexing (with a constant index, obviously) and record subscripts.
  • Aggregate constants. These are initialised records and (non-dynamic) arrays. Elements of aggregate constants can in turn be any of these three kinds of initialised data.

Basic routines

The most basic routine to generate a fundamental/simple constant is

  procedure ttai_typedconstbuilder.emit_tai(p: tai; def: tdef);

As mentioned above, the first argument is the tai* entity that you would normally directly concatenate to the tasmlist, while the second one is a def that describes the data.

There is one variant of this method:

  procedure ttai_typedconstbuilder.emit_tai_procvar2procdef(p: tai; pvdef: tprocvardef);

The reason is that when taking the address of a method or a nested routine, the result is a complex procedural variable (procedure of object, procedure is nested, ...). However, sometimes we are only interested in the address of the method rather than in a complete tmethod record. The above method can be used in that case, and it will also create an appropriate tdef to describe this address based on the complex procvardef described by pvdef.

Example:

   ...
   { create a byte value with value 5 }
   tcb.emit_tai(tai_const.create_8bit(5),u8inttype);
   ...


Generating composite expressions

In case of a type-safe byte code, we cannot just add an arbitrary offset to a symbol or reinterpret data without encoding this typecast explicitly. We therefore have to build a composite expression that contains all necessary information.

Composite expressions can be constructed via a queue interface of the ttai_typedconstbuilder class. First, initialise the queue:

  procedure ttai_typedconstbuilder.queue_init(todef: tdef);

todef is the def of the entity to which the expression will be assigned. This ensures the queue itself can insert any necessary type conversions when parsing e.g. const x: byte = 100;, as the parser will interpret the 100 as a native integer by default.

Once a queue has been initialised, intermediate operations can be queued from outer to inner:

  { queue an array/string indexing operation (performs all range checking,
    so it doesn't have to be duplicated in all descendents). }
  procedure ttai_typedconstbuilder.queue_vecn(def: tdef; const index: tconstexprint);
  { queue a subscripting operation }
  procedure ttai_typedconstbuilder.queue_subscriptn(def: tabstractrecorddef; vs: tfieldvarsym);
  { queue a type conversion operation }
  procedure ttai_typedconstbuilder.queue_typeconvn(fromdef, todef: tdef);
  { queue an address taking operation }
  procedure ttai_typedconstbuilder.queue_addrn(fromdef, todef: tdef);

The operations that form the expression must be added to the queue from outermost to innermost, i.e. in the order in which the parser would encounter them when processing a node tree built from the equivalent Pascal expression.

Finally, the data element onto which these queued operations should be applied must be supplied:

  { finalise the queue (so a new one can be created) and flush the
    previously queued operations, applying them in reverse order on a...}
  { ... procdef }
  procedure ttai_typedconstbuilder.queue_emit_proc(pd: tprocdef);
  { ... staticvarsym }
  procedure ttai_typedconstbuilder.queue_emit_staticvar(vs: tstaticvarsym);
  { ... labelsym }
  procedure ttai_typedconstbuilder.queue_emit_label(l: tlabelsym);
  { ... constsym }
  procedure ttai_typedconstbuilder.queue_emit_const(cs: tconstsym);
  { ... asmsym/asmlabel }
  procedure ttai_typedconstbuilder.queue_emit_asmsym(sym: tasmsymbol; def: tdef);

As documented, this final operation also flushes the queued operations, so the queue must be initialised anew with queue_init before another composite expression can be queued.

Example:

  ...
  { encode @recvar.arrayfield[4], with arrayfield an "array[0..4] of word" }
  tcb.queue_init(getpointerdef(u16inttype));
  { taking the address of a word yields a pointer to a word }
  tcb.queue_addrn(u16inttype,getpointerdef(u16inttype));
  tcb.queue_vecn(arrayfielddef,4);
  tcb.queue_subscriptn(recvardef,arrayfieldsym);
  tcb.queue_emit_staticvar(recvarsym);
  ...
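The outer-to-inner queueing order and the reverse application at emit time can be modelled with a small stand-alone program. This is only a sketch of the idea using strings; enqueue, emit and the op_* constants are hypothetical and do not exist in the compiler.

```pascal
program queue_sketch;
{$mode objfpc}{$H+}

type
  topkind = (op_addr, op_vec, op_subscript);
  tqueuedop = record
    kind: topkind;
    arg: string;
  end;

var
  queue: array of tqueuedop;

{ remember an operation; callers queue from outermost to innermost }
procedure enqueue(kind: topkind; const arg: string);
begin
  setlength(queue, length(queue) + 1);
  queue[high(queue)].kind := kind;
  queue[high(queue)].arg := arg;
end;

{ apply the queued operations in reverse order (innermost first) to the
  base symbol, then flush the queue so a new expression can be started }
function emit(const base: string): string;
var
  i: longint;
begin
  result := base;
  for i := high(queue) downto 0 do
    case queue[i].kind of
      op_subscript: result := result + '.' + queue[i].arg;
      op_vec:       result := result + '[' + queue[i].arg + ']';
      op_addr:      result := '@' + result;
    end;
  setlength(queue, 0);
end;

begin
  enqueue(op_addr, '');
  enqueue(op_vec, '4');
  enqueue(op_subscript, 'arrayfield');
  writeln(emit('recvar'));
end.
```

The program prints "@recvar.arrayfield[4]": the subscript is applied to the base symbol first, then the index, and the address operator last, even though they were queued in the opposite order.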

Generating aggregate data

Just like with composite expressions, a type-safe byte code needs to be told explicitly what the structure of an aggregate (record, array) is. In particular, all data belonging to a single aggregate must be explicitly grouped together. One can no longer just emit a label and then sequentially dump all data after it.

The most basic way to start and finish an aggregate is using the following methods:

  { begin a potential aggregate type. Must be called for any type
   that consists of multiple tai constant data entries, or that
   represents an aggregate at the Pascal level (a record, a non-dynamic
   array, ...) }
  procedure ttai_typedconstbuilder.maybe_begin_aggregate(def: tdef);
  { end a potential aggregate type. Must be paired with every
   maybe_begin_aggregate }
  procedure ttai_typedconstbuilder.maybe_end_aggregate(def: tdef);

The reason for the "maybe_" at the start of those method names is that the way a particular Pascal data type is represented in the typed bytecode (or plain assembler code) may vary from platform to platform. On some platforms it may be represented as an aggregate, on others it may be a single, simple data element (e.g. a small set may just be an ordinal constant). As the comments indicate, they must nevertheless be called for any type for which you may emit multiple individual constant entities (e.g. multiple character bytes to represent a string) or that represents an aggregate in Pascal.

Aggregates can, of course, be nested, and inside an aggregate you can emit both fundamental/simple constants and composite expressions.

Example TODO.
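Pending a complete example, the following fragment sketches how the two methods might be combined with emit_tai. It is compiler-internal code and not a stand-alone program. It assumes that tcb is an existing builder instance and that recdef is the tdef of a record declared as "record a: byte; b: word; end"; both names are hypothetical.

```pascal
  ...
  { open the aggregate that groups both fields }
  tcb.maybe_begin_aggregate(recdef);
  { field a: byte, value 1 }
  tcb.emit_tai(tai_const.create_8bit(1),u8inttype);
  { field b: word, value 2; padding in front of it is left to the builder }
  tcb.emit_tai(tai_const.create_16bit(2),u16inttype);
  { close the aggregate with the same def it was opened with }
  tcb.maybe_end_aggregate(recdef);
  ...
```

A nested record or array field would get its own maybe_begin_aggregate/maybe_end_aggregate pair inside the outer one.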

Dealing with variant records

Explanation TODO.

Arbitrarily structured data

In some cases, you do not have the tdef describing the structured data in advance. An example is RTTI info. Rather than manually constructing an appropriate tdef in advance, the ttai_typedconstbuilder can do this for you:

  { similar as above, but in case
    a) it's definitely a record
    b) the def of the record should be automatically constructed based on
       the types of the emitted fields
  }
  function ttai_typedconstbuilder.begin_anonymous_record(const optionalname: string; packrecords: shortint): trecorddef;
  function ttai_typedconstbuilder.end_anonymous_record: trecorddef;

If optionalname is different from an empty string, a ttypesym with that name will be created for the constructed tdef (so it can be looked up again later under that name). The packrecords parameter can be used to control the alignment of the fields.

The ttai_typedconstbuilder.begin_anonymous_record() method immediately returns a trecorddef. Although the structure of the record is not yet known at that time, this is the tdef that will be completed once end_anonymous_record has been called. It can be used to e.g. already create a pointer to the recorddef, in case this is required for one of the fields inside.

Example TODO.
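Pending a complete example, the following fragment sketches the intended call sequence. It is compiler-internal code and not a stand-alone program. It assumes an existing builder instance tcb, a caller-chosen shortint value packing, and local variables recdef (trecorddef) and ptrdef (tdef); all of these names are hypothetical.

```pascal
  ...
  { start an unnamed record; its def is returned immediately }
  recdef:=tcb.begin_anonymous_record('',packing);
  { the still incomplete def can already be referenced,
    e.g. to create a pointer type to the record }
  ptrdef:=getpointerdef(recdef);
  { every emitted entry becomes a field of the record }
  tcb.emit_tai(tai_const.create_8bit(1),u8inttype);
  tcb.emit_tai(tai_const.create_16bit(2),u16inttype);
  { completes recdef based on the two fields emitted above }
  tcb.end_anonymous_record;
  ...
```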

Helpers

Explanation TODO.

  { class functions and an extra list parameter, because emitting the data
   for the strings has to happen via a separate typed const builder (which
   will be created/destroyed internally by these methods) }
  class function ttai_typedconstbuilder.emit_ansistring_const(list: TAsmList; data: pchar; len: asizeint; encoding: tstringencoding; newsection: boolean): tasmlabofs;
  class function ttai_typedconstbuilder.emit_unicodestring_const(list: TAsmList; data: pointer; encoding: tstringencoding; winlike: boolean): tasmlabofs;
  { emit a shortstring constant, and return its def }
  function ttai_typedconstbuilder.emit_shortstring_const(const str: shortstring): tdef;
  { emit a guid constant }
  procedure ttai_typedconstbuilder.emit_guid_const(const guid: tguid);
  { emit a procdef constant }
  procedure ttai_typedconstbuilder.emit_procdef_const(pd: tprocdef);
  { emit an ordinal constant }
  procedure ttai_typedconstbuilder.emit_ord_const(value: int64; def: tdef);

Finalising the data

Explanation TODO.

  { finalize the internal asmlist (if necessary) and return it.
    This asmlist will be freed when the builder is destroyed, so add its
    contents to another list first. This property should only be accessed
    once all data has been added. }
  function ttai_typedconstbuilder.get_final_asmlist(sym: tasmsymbol; def: tdef; section: TAsmSectiontype; const secname: TSymStr; alignment: longint; const options: ttcasmlistoptions): tasmlist;

