Tensor Data Model
1. Introduction
One data model that the Wire-Cell Toolkit (WCT) supports is the tensor data model. This model is factored into two layers:
- The generic tensor data model consists of an ordered collection (set) of tensor objects. A tensor object consists of a multi-dimensional, numeric array and a structured metadata object. The WCT data interface classes ITensor and ITensorSet express these aspects.
- The specific tensor data model defines conventions on the generic tensor data model in order to map certain other data types to the generic tensor data model.
The next two sections define the generic and specific tensor data models and are written in RFC 2119 language.
2. Generic tensor data model
A tensor shall be composed of two conceptual parts: an array part and a metadata part.
The array part consists of a sequence of elements that shall be stored contiguously and associated with a shape, layout order and array element type and size.
The array shape shall be represented as a vector of integer type with
each element corresponding to the number of elements spanned by the
corresponding dimension. The array layout order shall be represented
as a vector which stores the majority of the dimensions or it shall be
empty indicating “C” layout (conventional array layout for the C
programming language). A “C” layout should be used. The array
element type shall be represented as a string with values following
the NumPy array interface typestr specification. This typically means a
letter followed by a number giving the size in bytes of the type. The
array element size shall be represented as an integer providing the
size of one element measured in bytes.
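The relationship between shape, element type, element size and "C" layout can be illustrated with a small sketch. It is not WCT code: the helper names `element_size` and `c_strides` are hypothetical, and the typestr parser only handles the common "optional byte-order character, one letter, size in bytes" form described above.

```python
import re
from math import prod

# Optional byte-order character, one type letter, then the size in bytes.
_TYPESTR = re.compile(r"([<>|]?)([A-Za-z])(\d+)")

def element_size(typestr):
    """Return the element size in bytes encoded in a typestr such as "f4"."""
    match = _TYPESTR.fullmatch(typestr)
    if match is None:
        raise ValueError(f"not a recognized typestr: {typestr!r}")
    return int(match.group(3))

def c_strides(shape, size):
    """Byte strides of a contiguous array in "C" layout (last dimension varies fastest)."""
    strides = []
    step = size
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))

shape = (3, 4)                 # 3 rows, 4 columns
dtype = "f4"                   # 4-byte floating point
size = element_size(dtype)     # element size in bytes
nbytes = prod(shape) * size    # total bytes of contiguous storage
print(size, nbytes, c_strides(shape, size))
```

With an empty layout order vector, a reader would assume the strides computed here; a non-empty order vector would call for a different stride calculation, which this sketch does not attempt.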
A tensor metadata (MD) part shall follow the JSON data model and be of JSON type object. The MD model reserves the following attribute names.
- array
- optional; when present it is an object providing array specifications using the keyword names introduced above (shape, etc).
Additional MD attributes are required by the specific tensor data model. User applications may define attributes not otherwise reserved here. The MD must be faithfully passed through any tensor converter round trip.
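As an illustration of a tensor MD object and of what a faithful round trip means, the sketch below builds an MD with the reserved array attribute and passes it through JSON serialization. Only the key name shape is confirmed by the text; the keys order, type and size, and the user attribute my_app_note, are assumptions made for this example.

```python
import json

md = {
    "array": {
        "shape": [3, 4],   # key name given by the text
        "order": [],       # assumed key name; empty means "C" layout
        "type": "f4",      # assumed key name; typestr of the element type
        "size": 4,         # assumed key name; element size in bytes
    },
    "my_app_note": "calibrated",   # hypothetical user-defined attribute
}

# A round trip through a converter, here plain JSON text.
text = json.dumps(md)
restored = json.loads(text)

# The MD is a JSON object and nothing was lost or altered.
assert isinstance(restored, dict)
assert restored == md
```

Using JSON-native types (lists rather than tuples, strings as keys) keeps the comparison after the round trip exact.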
A tensor set shall be composed of two conceptual parts: an ordered sequence of tensors and a metadata part. A tensor set shall not contain other tensor sets. The ordering of a tensor set sequence shall be stable and the sequence may be empty. A tensor may be contained in multiple tensor sets. The tensor set MD object shall follow the JSON data model and be of JSON type object but otherwise this tensor data model places no requirements on its contents. The tensor set MD must be faithfully passed through any tensor set converter round trip.
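The tensor set rules can be mirrored in a few lines of Python. The classes `Tensor` and `TensorSet` below are hypothetical stand-ins written for this sketch, not the WCT ITensor and ITensorSet interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Tensor:
    array: bytes = b""                        # contiguous element storage, may be empty
    md: dict = field(default_factory=dict)    # JSON-object-like metadata

@dataclass
class TensorSet:
    tensors: list = field(default_factory=list)   # ordered sequence, may be empty
    md: dict = field(default_factory=dict)        # JSON-object-like metadata

    def __post_init__(self):
        # A tensor set holds tensors only, never other tensor sets.
        for item in self.tensors:
            if not isinstance(item, Tensor):
                raise TypeError("a tensor set may contain only tensors")

shared = Tensor(md={"note": "shared"})
first = TensorSet([shared, Tensor()], md={"run": 1})
second = TensorSet([shared])    # the same tensor held by a second set
empty = TensorSet()             # an empty sequence is allowed
```

A Python list keeps insertion order, which is one simple way to satisfy the stable-ordering requirement.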
3. Specific tensor data model
The specific tensor data model defines representations of a number of data types in terms of the generic tensor data model.
3.1. Common metadata conventions
A tensor MD shall have an attribute named datapath of type string that
identifies the tensor in a logical hierarchical structure of multiple
tensors. The datapath value shall be unique among all tensors held in
a tensor set. The value shall be interpreted as a sequence of atomic
identifiers separated by a / (slash) character. An atomic
identifier must be valid for use as a name of C++ or Python variables
or functions. A datapath value may be stored in a tensor MD attribute
and refer to another tensor when both tensors are held in the same
tensor set.
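A strict reading of the datapath rules can be sketched as follows. The function names are hypothetical. The identifier test uses an ASCII pattern and rejects Python keywords; it does not check C++ keywords, so it is only an approximation of "valid in C++ or Python".

```python
import keyword
import re

# ASCII identifier: a letter or underscore followed by letters, digits or underscores.
_ATOM = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def datapath_atoms(datapath):
    """Split a datapath on "/" and check each atomic identifier."""
    atoms = datapath.split("/")
    for atom in atoms:
        if _ATOM.fullmatch(atom) is None or keyword.iskeyword(atom):
            raise ValueError(f"invalid atomic identifier {atom!r} in {datapath!r}")
    return atoms

def check_unique_datapaths(mds):
    """Check that every tensor MD in one tensor set has a distinct, valid datapath."""
    seen = set()
    for md in mds:
        path = md["datapath"]
        datapath_atoms(path)
        if path in seen:
            raise ValueError(f"duplicate datapath {path!r}")
        seen.add(path)
    return seen

print(datapath_atoms("pointclouds/main/x"))
```

Under this reading an empty atom, as produced by a leading, trailing or doubled slash, is rejected.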
A tensor MD shall have an attribute datatype which identifies the data type in the specific tensor data model that the tensor represents. The datatype value shall be one from the set of specific types described below. Additional requirements on MD attributes may be defined that are specific to a datatype as described in the corresponding section below.
Complex data types may be represented as an aggregation of multiple tensors. These shall be defined on a per data type basis below as a set of MD attributes providing datapath values.
3.2. Overview of specific types
The following specific types are mapped to the generic tensor data model. Each item in the list gives the datatype MD attribute value and describes the WCT data type with which it is associated.
- pcarray
- a PointCloud::Array
- pcdataset
- a PointCloud::Dataset
- pcgraph
- a PointGraph
- trace
- one ITrace as 1D array or multiple ITrace as 2D array.
- tracedata
- tagged trace indices and summary data
- frame
- an IFrame as aggregate of traces and/or traceblocks
- cluster
- an ICluster
The specific requirements for each data type are given in the following sections in terms of their tensor array and metadata and in some cases in terms of other types defined previously.
3.3. pcarray
The datatype of pcarray indicates a tensor representing one PointCloud::Array.
The tensor array information shall map directly to that of Array.
A pcarray places no additional requirements on its tensor MD.
3.4. pcdataset
The datatype of pcdataset indicates a tensor representing one PointCloud::Dataset.
The tensor array shall be empty.
The tensor MD shall have the following attributes:
- arrays
- an object representing the named arrays. Each attribute name provides the array name and each attribute value provides a datapath to a tensor of type pcarray holding the named array.
Additional user application Dataset metadata may reside in the tensor MD.
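The way a pcdataset tensor aggregates pcarray tensors through datapath references can be sketched with plain dictionaries. The datapath values, the array contents and the helper `resolve_dataset` are hypothetical; Python lists stand in for tensor arrays.

```python
# Three tensors in one tensor set, each reduced to its MD and a stand-in array.
tensors = [
    {"md": {"datatype": "pcarray", "datapath": "pc/arrays/x"}, "array": [0.0, 1.0, 2.0]},
    {"md": {"datatype": "pcarray", "datapath": "pc/arrays/y"}, "array": [5.0, 6.0, 7.0]},
    {"md": {"datatype": "pcdataset", "datapath": "pc",
            "arrays": {"x": "pc/arrays/x", "y": "pc/arrays/y"}},
     "array": []},   # a pcdataset tensor carries an empty array
]

def resolve_dataset(tensors, datapath):
    """Return {array name: array} for the pcdataset found at the given datapath."""
    by_path = {t["md"]["datapath"]: t for t in tensors}
    dataset = by_path[datapath]
    if dataset["md"]["datatype"] != "pcdataset":
        raise ValueError(f"{datapath!r} is not a pcdataset")
    if dataset["array"]:
        raise ValueError("a pcdataset tensor array shall be empty")
    resolved = {}
    for name, path in dataset["md"]["arrays"].items():
        member = by_path[path]
        if member["md"]["datatype"] != "pcarray":
            raise ValueError(f"{path!r} is not a pcarray")
        resolved[name] = member["array"]
    return resolved

print(resolve_dataset(tensors, "pc"))
```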
3.5. pcgraph
The datatype of pcgraph indicates a tensor representing a “point cloud graph”. This extends a point cloud to include relationships between pairs of points. The array part of a pcgraph tensor shall be empty. The MD part of a pcgraph tensor shall provide reference to two pcdataset instances with the following MD attributes:
- nodes
- a datapath referring to a pcdataset representing graph vertex features.
- edges
- a datapath referring to a pcdataset representing graph edges and their features.
In addition, the pcdataset referred to by the edges attribute shall provide two arrays of integer type with names tails and heads. Each shall provide indices into the nodes point cloud representing the tail and head endpoint of graph edges. A node or edge dataset may be shared between different pcgraph instances.
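A minimal sketch of the edge convention follows. The helper `check_edges` and the datapath values are hypothetical, and the requirement that tails and heads have equal length is an assumption drawn from each edge needing both endpoints.

```python
def check_edges(num_nodes, tails, heads):
    """Pair up tails and heads after checking that each indexes a node."""
    if len(tails) != len(heads):
        raise ValueError("tails and heads must have one entry per edge")
    for index in list(tails) + list(heads):
        if not 0 <= index < num_nodes:
            raise ValueError(f"edge endpoint {index} is outside the node point cloud")
    return list(zip(tails, heads))

# MD of a pcgraph tensor; its array part is empty.
graph_md = {
    "datatype": "pcgraph",
    "datapath": "graph",
    "nodes": "graph/nodes",   # datapath of the pcdataset of vertex features
    "edges": "graph/edges",   # datapath of the pcdataset with tails and heads
}

# Three nodes and two edges: 0 -> 1 and 1 -> 2.
edges = check_edges(3, tails=[0, 1], heads=[1, 2])
print(edges)
```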
3.6. trace
The datatype of trace indicates a tensor representing a single ITrace or a collection of ITrace which have been combined.
The tensor array shall represent the samples over a contiguous period of time from traces.
The tensor array shall have dimensionality of one when representing a
single ITrace. A collection of ITrace shall be represented with a
two-dimensional array with each row representing one or more traces
from a common channel. In such a case, the full trace content
associated with a given channel may be represented by one or more
rows.
The array element type shall be either "i2" (int16_t) or "f4" (float)
depending on whether ADC or signals are represented, respectively.
The tensor MD may include the attribute tbin, an integer value providing the number of sample periods (ticks) between the frame reference time and the first sample (column) in the array.
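The meaning of tbin can be expressed as simple arithmetic. In this sketch the function names, the numeric values and the datapath are hypothetical, and treating a missing tbin as zero is an assumption, since the text only says the attribute may be present.

```python
def first_sample_time(frame_time, tick, tbin=0):
    """Time of the first sample (column 0) of a trace tensor."""
    return frame_time + tbin * tick

def sample_time(frame_time, tick, column, tbin=0):
    """Time of the sample stored in the given column."""
    return first_sample_time(frame_time, tick, tbin) + column * tick

trace_md = {"datatype": "trace", "datapath": "frame/traces/t0", "tbin": 100}

# Frame reference time 0.0 and sample period 0.5, in arbitrary but consistent units.
start = first_sample_time(0.0, 0.5, trace_md.get("tbin", 0))
print(start, sample_time(0.0, 0.5, column=2, tbin=trace_md["tbin"]))
```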
3.7. tracedata
The datatype of tracedata provides per-trace information for a subset of traces. It is similar to a pcdataset and in fact may carry that value as the datatype, but it differs in the following ways.
It defines additional MD attributes:
- tag
- optional, a trace tag. If omitted or an empty string, the dataset must span the total trace ordering.
The following array names are recognized:
- chid
- channel ident numbers for the traces.
- index
- provides indices into the total trace ordering.
- summary
- trace summary values.
A chid value is required for every trace. If the tracedata has no tag then a chid array spanning the total trace ordering must be provided and neither index nor summary is recognized. If the tracedata has a tag it must provide an index array and may provide a summary array and may provide a chid array, each corresponding to the traces identified by index.
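The tagged and untagged cases can be captured in a small checker. This is one interpretation, not WCT code: the function name is hypothetical, and "not recognized" is read here as "ignored" rather than as an error.

```python
def recognized_arrays(md, arrays, total_traces):
    """Return the arrays of a tracedata that are recognized under the rules above."""
    tag = md.get("tag", "")
    if not tag:
        # Untagged: chid must span the total trace ordering; index and summary are ignored.
        if "chid" not in arrays:
            raise ValueError("untagged tracedata must provide chid")
        if len(arrays["chid"]) != total_traces:
            raise ValueError("chid must span the total trace ordering")
        return {"chid": arrays["chid"]}

    # Tagged: index is mandatory; summary and chid are optional but must match index.
    if "index" not in arrays:
        raise ValueError("tagged tracedata must provide index")
    index = arrays["index"]
    if any(not 0 <= i < total_traces for i in index):
        raise ValueError("index refers outside the total trace ordering")
    result = {"index": index}
    for name in ("summary", "chid"):
        if name in arrays:
            if len(arrays[name]) != len(index):
                raise ValueError(f"{name} must have one entry per indexed trace")
            result[name] = arrays[name]
    return result

untagged = recognized_arrays({}, {"chid": [10, 11, 12], "summary": [1.0, 2.0, 3.0]}, 3)
tagged = recognized_arrays({"tag": "gauss"}, {"index": [0, 2], "summary": [1.5, 2.5]}, 3)
print(untagged, tagged)
```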
3.8. frame
See the topic frames as tensors for details about representing frames with tensors.
The datatype of frame represents an IFrame.
The tensor array shall be empty.
The tensor MD aggregates tensors of datatype trace and tracedata and provides other values as listed:
- ident
- the frame ident number (required)
- tags
- an array of string giving frame tags
- time
- the reference time of the frame (required)
- tick
- the sample period of the traces (required)
- masks
- channel mask map (optional)
- traces
- a sequence of datapath references to tensors of datatype trace. The order of this sequence, along with the order of rows in any 2D trace tensors, determines the total order of traces.
- tracedata
- a sequence of datapath references to tensors of datatype tracedata
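The total order of traces implied by a frame MD can be sketched as below. The helper `total_trace_order`, the datapath values and the array shapes are hypothetical; a 1D trace tensor is counted as one trace and a 2D trace tensor as one trace per row.

```python
def total_trace_order(frame_md, trace_shapes):
    """List (datapath, row) pairs in the total order of traces of a frame."""
    order = []
    for path in frame_md["traces"]:
        shape = trace_shapes[path]
        rows = 1 if len(shape) == 1 else shape[0]
        for row in range(rows):
            order.append((path, row))
    return order

frame_md = {
    "datatype": "frame",
    "datapath": "frame",
    "ident": 0,      # required
    "tags": [],
    "time": 0.0,     # required
    "tick": 0.5,     # required
    "traces": ["frame/traces/a", "frame/traces/b"],
    "tracedata": [],
}

# Array shapes of the referenced trace tensors: one 2D block and one single trace.
trace_shapes = {"frame/traces/a": (2, 100), "frame/traces/b": (50,)}
order = total_trace_order(frame_md, trace_shapes)
print(order)
```

An index array in a tracedata tensor would then refer to positions in this list.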
In converting an IFrame to a frame tensor the sample values may be truncated to type "i2".
A frame tensor of type "i2" shall have its sample values inflated to type float when converted to an IFrame.
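The effect of truncating to "i2" and inflating back to float can be seen in a short sketch using the standard `array` module. The text does not say how truncation is performed, so dropping the fractional part and clamping to the 16-bit range are assumptions of this example, as are the function names.

```python
from array import array

INT16_MIN, INT16_MAX = -32768, 32767

def truncate_to_i2(samples):
    """Drop the fractional part and clamp to the 16-bit range (assumed behavior)."""
    return array("h", (max(INT16_MIN, min(INT16_MAX, int(s))) for s in samples))

def inflate_to_float(i2_samples):
    """Convert 16-bit integer samples back to 4-byte floats."""
    return array("f", (float(v) for v in i2_samples))

original = [1.9, -2.7, 40000.0]
stored = truncate_to_i2(original)     # as held in an "i2" trace tensor
restored = inflate_to_float(stored)   # as returned in an IFrame
print(list(stored), list(restored))
```

The round trip is lossy: fractional parts and out-of-range values are not recovered, which is why the truncation is optional on the way in while the inflation is mandatory on the way out.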
3.9. cluster
A cluster is a pcgraph with conventions for how to serialize each of
the node types of an ICluster graph (wire, channel, measure, blob and
slice) and the edges between them. It generally follows an array schema
outlined in the document on Cluster Arrays. This representation
is similar to the HeteroData type from pytorch-geometric.
4. Similarity to HDF5
The data model is intentionally similar to the HDF5 abstract data model and there is a conceptual mapping between the two:
- HDF5 group hierarchy \(\leftrightarrow\) ITensor metadata attribute providing a hierarchy path as array of string.
- HDF5 group \(\leftrightarrow\) No direct equivalent but an ITensor with no array is effectively the same.
- HDF5 dataset \(\leftrightarrow\) ITensor array.
- HDF5 dataspace and datatype \(\leftrightarrow\) ITensor methods shape(), dtype(), etc.
- HDF5 group or dataset attribute \(\leftrightarrow\) ITensor metadata attribute