Wire-Cell News

Updates from the Wire-Cell team.

Releases 0.15, 0.16 and 0.17

Releases 0.15, 0.16 and 0.17 (in prep) come with some changes worth mentioning. So, here we mention them. At length….

Some changes are "breaking" and others provide new, better ways to do things. Many are cleanups that have been needed for a while. As 0.17 is not technically cut, it may see additional changes. So, sit back and read about some of the improvements that you may not yet have learned about.

Release 0.15

We start with 0.15, made in June 2021. This release was actually motivated by upgrades to Boost and TBB that Fermilab wanted to apply to art and LArSoft, but it also collected a lot of development accumulated since 0.14 of May 2020.

Starting with this release, and due to backwards-incompatible changes in TBB, WCT now tracks minimum versions for some of its dependencies:

  • TBB (oneAPI) 2021.1.1
  • Boost 1.75.0

As these may change, we will strive to keep the list in the main README up to date.

This release also brought:

  • Support for Jsonnet "top level arguments" (TLA), which we now recommend for "injecting" external information in place of the previously used std.extVar() idiom.
  • Support for the ICARUS and SBND detectors.
  • Increased flow graph fanout sizes.
  • JSON and Numpy I/O for depos and frames.
  • Many other smaller features and bug fixes.

Release 0.16

This release is largely a cleanup, made 17 Aug 2021.

  • Signal-ROI refinement finding may be excluded from sigproc (backported to 0.14.1).
  • Switch from Jsonnet's C++ API to the C API and the Go implementation.
  • Fix out-of-order problems with TbbFlow.
  • Add support for streaming I/O via boost::iostreams and a new tar based stream filter.

More details on these follow.

Go Jsonnet

The C Jsonnet library (libjsonnet.so) is essentially unoptimized. It is easy to create WCT configuration that takes many seconds, even minutes, to parse. On the other hand, the Go version (libgojsonnet.so) is well optimized. What may take minutes for the C version to consume takes a few hundred milliseconds for the Go version.

The Go library is not ABI compatible with the C one but thankfully it is API compatible. In principle, the C++ API could be used on top of either the C or the Go implementation. However, problems were encountered and it was simpler to drop WCT down to the C API.

At this point, while the Go version is not required, it is strongly recommended. In particular, the WCT team will not work on any configuration speed issues that are not reproduced with the Go version.
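To give a flavor of the C API, here is a minimal standalone sketch (not actual WCT code) that evaluates a configuration file and injects external information via a TLA. The TLA name "ntick" is just for illustration; the same calls work whether one links against libjsonnet.so (-ljsonnet) or libgojsonnet.so (-lgojsonnet).

#include <cstdio>
extern "C" {
#include "libjsonnet.h"
}

int main(int argc, char* argv[])
{
    struct JsonnetVm* vm = jsonnet_make();

    // Inject external information as a top-level argument (TLA).  The
    // config file's top-level value must be a function accepting "ntick".
    jsonnet_tla_code(vm, "ntick", "6000");

    int error = 0;
    char* result = jsonnet_evaluate_file(vm, argv[1], &error);
    if (error) {
        std::fprintf(stderr, "%s", result);
    } else {
        std::printf("%s", result);
    }
    jsonnet_realloc(vm, result, 0);  // release the returned string
    jsonnet_destroy(vm);
    return error;
}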

Like the C version, the Go version's build is not totally polished, and Go is not a common compiler in our ecosystem, so here are some brief build instructions. They assume the install target is /usr/local (such as in this Dockerfile), so adjust to fit your own desired install prefix.

$ export JSONNET_VERSION=0.17.0
wget -O /usr/local/include/libjsonnet.h \
      https://raw.githubusercontent.com/google/jsonnet/v${JSONNET_VERSION}/include/libjsonnet.h
git clone https://github.com/google/go-jsonnet.git
cd go-jsonnet
git checkout v${JSONNET_VERSION}
go build ./cmd/jsonnet
go build ./cmd/jsonnetfmt
go build ./cmd/jsonnet-deps
CGO_CXXFLAGS="-std=c++17 -Wall -I/usr/local/include" \
  go build -o libgojsonnet.so -buildmode=c-shared ./c-bindings
cp libgojsonnet.h /usr/local/include/
cp libgojsonnet.so /usr/local/lib/
cp jsonnet jsonnetfmt jsonnet-deps /usr/local/bin/

The WCT build has added support for using the Go version of the library:

$ ./wcb configure \
  --with-jsonnet=/usr/local \
  --with-jsonnet-libs=gojsonnet \
  [ ... ]

TbbFlow out-of-order

It came as something of a surprise to begin seeing out-of-order data on TbbFlow flow graph edges. This turned out to be due to several issues related to confusion over details of how TBB's flow_graph works with respect to concurrency and buffering.

  • Many nodes were configured to use unlimited concurrency which should not have been.
  • The TBB split_node that underpins every "fanout" node has unlimited concurrency. This cannot be changed, but more importantly, it was not initially realized to be the case.
  • Despite the TBB documentation, the input buffering of function_node is not a FIFO and is an undocumented source of out-of-order delivery (which apparently others have already stumbled upon).

The solutions include making the concurrency of all nodes "serial" where possible. We may come back to this at some point to allow more concurrency, but so far we can do the simple thing and still saturate more threads than are typically available.

To solve the not-a-FIFO buffer problem, TbbFlow now appends a sequencer_node to existing nodes. To give the sequencer fodder, a sequence number is combined with the normal data payload passed on TbbFlow edges, as sketched below. All of this happens inside TbbFlow and developers of individual INode components need not change anything.
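To illustrate the pattern (a standalone sketch, not the actual TbbFlow code), the following pairs each payload with a sequence number and uses a tbb::flow::sequencer_node to restore FIFO order downstream of an unlimited-concurrency function_node:

#include <tbb/flow_graph.h>
#include <cstdio>

struct Seqd { size_t seq; int payload; };  // payload tagged with a sequence number

int main()
{
    tbb::flow::graph g;

    // Unlimited concurrency: results may be emitted out of input order.
    tbb::flow::function_node<Seqd, Seqd> work(g, tbb::flow::unlimited,
        [](Seqd d) { d.payload *= 2; return d; });

    // Buffers its input and releases items strictly by sequence number.
    tbb::flow::sequencer_node<Seqd> order(g,
        [](const Seqd& d) { return d.seq; });

    // A serial sink now sees items in their original order.
    tbb::flow::function_node<Seqd> sink(g, tbb::flow::serial,
        [](const Seqd& d) {
            std::printf("seq=%zu payload=%d\n", d.seq, d.payload);
            return tbb::flow::continue_msg();
        });

    tbb::flow::make_edge(work, order);
    tbb::flow::make_edge(order, sink);

    for (size_t i = 0; i < 10; ++i) {
        work.try_put(Seqd{i, int(i)});
    }
    g.wait_for_all();
    return 0;
}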

After figuring out the problem, one mystery was why this had not come to bite us earlier. The reasons are likely some mix of:

  • Though we have tested TbbFlow in the past, its use is increasing as we attempt ever larger jobs and higher thread counts. These jobs reach new regions of "configuration phase space", which gets us to:
  • The out-of-order problem will not happen, or at least is very rare, unless we, in a sense, "over-provision" TBB with "too many" threads. E.g., a ProtoDUNE-SP job usually has 6 pipelines providing the main bottleneck. Providing 8 threads allows out-of-order delivery while fewer threads do not. Parts of these jobs will have otherwise idle threads.
  • Fast multi-object output from a component also seems to be correlated with the out-of-order problem. This happens with short one-"event" jobs, which cause a component to emit an output very quickly followed by an EOS. This increases the chance that the buffer in a succeeding node will hold two objects and thus have a chance to pick the "later" one instead of the desired "earlier" one.

In any case, these problems now seem to be solved and one can safely use multi-threading. Understanding and fixing them began in 0.16 but is complete in 0.17, so best not to use 0.16 if TbbFlow is needed. Pgrapher is of course totally unaffected (but has other performance issues which are fixed in 0.17).

Streaming IO

WCT already used boost::iostreams to read our somewhat large .json.bz2 configuration "data" files. This release adds powerful boost::iostreams filters that let a tar codec be used. It also adds a new Numpy (.npy file) codec (the existing cnpy remains available). Both new codecs are implemented from scratch as header-only C++ libraries as part of the custard project.

Together, they solve some problems related to saving data beyond the various maximum sizes imposed by cnpy (really, by its ZIP format). They also give a powerful way to read and write any stream composed of individual files. Components have been developed to make use of this streaming, including I/O for depos, frames and clusters; see the *File{Sink,Source} components in sio. Work on these extends into 0.17.
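The underlying boost::iostreams pattern is a chain of filters ending in a device. Here is a minimal sketch using the stock gzip filter (custard's tar filter slots into the same kind of chain; link with -lboost_iostreams):

#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/device/file.hpp>

int main()
{
    namespace io = boost::iostreams;

    // Bytes written to "out" pass through the compressor filter and
    // land in the file device at the end of the chain.
    io::filtering_ostream out;
    out.push(io::gzip_compressor());
    out.push(io::file_sink("example.txt.gz"));
    out << "stream content\n";
    return 0;  // destructors flush and close the chain
}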

Though not considered part of the WCT release cycle, wire-cell-python also sees updates to make transparent use of .npy-in-.tar files (compressed or not), in symmetry with standard .npz Numpy files.

Release 0.17

Hot on the tail of 0.16, we will soon make 0.17. It is not yet officially cut but we will ignore that minor detail and describe many useful things it brings. Besides finishing the out-of-order fixes and streaming I/O components mentioned above, this release brings:

  • A version string!
  • Logging improvements.
  • Removal of vestigial python/ sub-directory.
  • A "batched" DepoSetDrifter and loader.
  • Graph execution tracing in TbbFlow.

Continue reading for details about these items.

Version string

A long-missing feature of WCT is the ability to know what version it is. A version string is now provided, based on the state of the git repository from which WCT was built. It will not be well defined if building from a directory outside of git's control. When built from source managed by git, the output of:

git describe --tags

gets baked into libWireCellUtil.so (see the top of wscript). This version string can be recovered on the command line:

$ wire-cell --version
0.16.0-19-gfa04dd3

Applications using WCT may get the version string from the same place the wire-cell CLI does:

WireCell::Main main;
std::cerr << main.version() << std::endl;

Logging improvements

In the course of debugging the out-of-order problem described above, logging was shown to be very inadequate. Two major problems existed:

  • Variability in log line format made it difficult to understand the bigger picture.
  • Lack of individual node identity made log lines from different component instances degenerate.

Given the concurrent nature of data flow graph execution with TbbFlow, and the out-of-order problem being worked on, these problems added up to a real mess!

To fix this, the following changes were made:

  • A new INamed interface was introduced as a way to tell components their instance names.
  • A new Aux::Logger intermediate class provides INamed as well as a log object which formats messages consistently with type and instance names.
  • Additions to low level spdlog handling related to making distinct sinks in order to support different format patterns.

To gain these benefits, all components need the following modifications:

  • Add Aux::Logger to the component inheritance. Call the new parent class constructor as illustrated:
class MyComponent : public Aux::Logger, ... {};

MyComponent::MyComponent(...)
  // shows as "[  mygrp  ] <MyComponent:instancename> ..."
  : Aux::Logger("MyComponent", "mygrp") 
  , ... { ... }
  • Replace any locally constructed log objects with the one provided by the new parent as the protected data member named log.
  • Remove from the log message itself any identifiers related to type or instance. They would simply become clutter as they are now redundant with the standard prefix given to each log line by Aux::Logger.

At the same time, we now try to format information in the log line body in a more consistent manner, using "key=value" phrases like:

log->debug("call={} value={}", m_count, m_myval);

This makes it much easier to grep logs for values. The result allows rational log queries to pull out portions of the graph.

And, finally, it is especially helpful to revisit configuration to rationalize the component instance names. It is recommended, where possible, to give an instance name which is simply the number associated with the pipeline branch (usually the APA ID number). In some cases this would produce degenerate type+instance pairs and something more must be provided in the instance name. One such case is when multiple FrameFileSink components are on the same pipeline; for these it is suggested to use an instance name matching the frame "tag" name (e.g., "orig1", "gauss5").

Here is an example of finding logs related to APA 0's pipeline executing sim, sigproc and 3D imaging.

$ grep ':0>' log.txt
...
[09:54:26.529] D [  gen   ] <DepoTransform:0> call=0 frame=0 ntraces=4887
[09:54:26.691] D [  gen   ] <Reframer:0> call=0 frame=0 all traces in: 4887 out tag: ""
[09:54:27.065] D [  gen   ] <AddNoise:0> call=0 frame=0 2560 traces
[09:54:27.841] D [  gen   ] <DepoTransform:0> EOS at call=1
[09:54:27.841] D [  gen   ] <Reframer:0> EOS at call=1
[09:54:27.841] D [  gen   ] <AddNoise:0> EOS at call=1
[09:54:28.698] D [  gen   ] <Digitizer:0> call=0 traces=2560 frame=0 outtag="orig0"
[09:54:28.698] D [  gen   ] <Digitizer:0> see EOS at call=1
[09:54:31.258] D [sigproc ] <OmnibusNoiseFilter:0> call=0, frame=0, ntraces=2560, nticks=6000 intag=orig0 outtag=raw0
[09:54:31.259] D [sigproc ] <OmnibusNoiseFilter:0> EOS at call=1
[09:54:31.259] D [sigproc ] <OmnibusSigProc:0> OmnibusSigProc: nticks=6000 tbinmin=0 tbinmax=6000
[09:54:31.316] D [sigproc ] <OmnibusSigProc:0> plane index: 0 input data identifies 0 bad regions
[09:54:35.145] D [sigproc ] <OmnibusSigProc:0> plane index: 1 input data identifies 0 bad regions
[09:54:38.329] D [sigproc ] <OmnibusSigProc:0> plane index: 2 input data identifies 0 bad regions
[09:54:40.939] D [sigproc ] <OmnibusSigProc:0> save_data plane index: 0, Qtot=147524624 added 800 traces to total 800 indices:[0,799]
[09:54:41.234] D [sigproc ] <OmnibusSigProc:0> save_data plane index: 0, Qtot=141124311 added 800 traces to total 800 indices:[800,1599]
[09:54:41.917] D [sigproc ] <OmnibusSigProc:0> save_data plane index: 1, Qtot=138288987 added 800 traces to total 800 indices:[1600,2399]
[09:54:42.225] D [sigproc ] <OmnibusSigProc:0> save_data plane index: 1, Qtot=133841197 added 2400 traces to total 1600 indices:[800,3199]
[09:54:42.834] D [sigproc ] <OmnibusSigProc:0> save_data plane index: 2, Qtot=121136654 added 960 traces to total 960 indices:[3200,4159]
[09:54:43.340] D [sigproc ] <OmnibusSigProc:0> save_data plane index: 2, Qtot=118093358 added 4320 traces to total 2560 indices:[800,5119]
[09:54:43.405] D [sigproc ] <OmnibusSigProc:0> call=0 produce 5120 traces: 2560 wiener0, 0 decon_charge0, 2560 gauss0, frame tag: sigproc
[09:54:43.412] D [sigproc ] <OmnibusSigProc:0> EOS at call=1 anode=0
[09:54:43.521] D [  img   ] <SumSlice:0> frame=0, make 1464 slices in [0,1499] from 1464
[09:54:43.535] D [  img   ] <SumSlice:0> EOS
[09:54:43.596] D [  glue  ] <SliceFanout:0> sending out 2 EOSes
[09:54:43.787] D [  glue  ] <BlobSetSync:0> EOS
[09:54:46.697] D [  img   ] <BlobClustering:0> flush 1 clusters + EOS on EOS
[09:54:55.695] D [  img   ] <BlobGrouping:0> have 33517 graph nodes
[09:54:56.018] D [  img   ] <BlobGrouping:0> EOS
[09:55:05.860] D [  img   ] <BlobSolving:0> send graph with 33517
[09:55:06.244] D [  img   ] <BlobSolving:0> EOS
...

Nice and tidy, and effectively impossible to achieve before without an exhaustively long grep command line.

This example is from a single "event" job. Note the lack of out-of-order EOS's for any given component!

The python/ subdir

For a while now we have deprecated the use of the python/ sub-directory in the wire-cell-toolkit repository, wanting all updates committed instead to wire-cell-python. Well, some people (me) kept forgetting and would commit to python/, and that led to divergence which was somewhat irritating to fix. To remove this pitfall, the two underwent a final sync (via git-subrepo) and the deprecated location has been removed. If you still used python/ and see it has disappeared, hopefully this paragraph gives you closure.

Batched depos

The Pgrapher data flow execution engine is very simple (it must be, for I came up with it). It works by first performing a topological sort on the flow graph. Each execution cycle, Pgrapher then iterates the sorted graph in reverse order. The engine starts with the last output node and, if that node has input waiting, executes it. If no input is waiting, the engine goes "upstream" one more node and repeats until it finds a ready node. Once a node is executed, the iteration restarts at the last output node, as sketched below.
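In a rough sketch (hypothetical stand-in types; the real engine wraps WCT INode components), one execution cycle looks like this:

#include <vector>

// Illustration-only stand-in for a wrapped graph node.
struct Node {
    bool input_waiting = false;
    void execute() { input_waiting = false; }  // run the wrapped component
};

// One Pgrapher-style scheduling cycle over nodes in topological order
// (inputs first, outputs last): walk backward from the output end and
// execute the first node found with input waiting.
bool one_cycle(std::vector<Node>& sorted)
{
    for (auto it = sorted.rbegin(); it != sorted.rend(); ++it) {
        if (it->input_waiting) {
            it->execute();
            return true;   // next cycle restarts at the last output node
        }
    }
    return false;          // no node ready: the graph is drained
}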

This simple algorithm is designed to move data through the graph in a compact "wave", minimizing the amount of data "in flight" in the hope of keeping memory usage low. Toward this goal it is successful, as WCT tends to contribute less than or around 1 GB to 6-APA jobs.

This algorithm is fast for small graphs, and even for large graphs if the number of data objects is small. However, it meets a pathology when individual depos are passed through the graph. Worse, the depos tend to be at the input end of the graph, causing each depo transfer to require an almost complete graph iteration. Pgrapher will then uselessly iterate almost the entire graph O(1M) times per "event"!

As a consequence, a job can take many minutes per event just to shuttle around these individual depos. Given that the rest of the simulation stage requires less than one minute for an APA, all this time to do very little is intolerable.

This pathology requires a combination of Pgrapher and many individual data objects, especially ones at the start of the graph. Fix either and the pathology is broken. We can switch to TbbFlow and keep our individual depos, or we can stick with Pgrapher but transfer depos all together.

To keep Pgrapher viable, a DepoSetDrifter was created. It takes in an entire set of depos, drifts them, and outputs the result as another depo set. This new component uses the exact same Drifter as a local (out-of-graph) helper component to do the actual drifting one-by-one. Drifter itself is plenty fast (though it could certainly become faster with SIMD); all the new depo set drifter does is avoid the Pgrapher pathology, as sketched below.
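The essence of the batching, in a sketch with hypothetical stand-in types (the real component works in terms of WCT's depo interfaces):

#include <memory>
#include <vector>

struct Depo { /* position, time, charge, ... */ };
using DepoPtr = std::shared_ptr<Depo>;

// Stand-in for the per-depo helper; may drop a depo (return nullptr).
struct DrifterHelper {
    DepoPtr drift(const DepoPtr& depo) { return depo; }
};

// One graph-node call consumes a whole set and delegates to the same
// per-depo helper, costing one Pgrapher graph iteration per set
// instead of one per depo.
std::vector<DepoPtr> drift_set(DrifterHelper& drifter,
                               const std::vector<DepoPtr>& in)
{
    std::vector<DepoPtr> out;
    out.reserve(in.size());
    for (const auto& depo : in) {
        if (auto drifted = drifter.drift(depo)) {
            out.push_back(drifted);
        }
    }
    return out;
}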

Of course you ask, why not both? While TbbFlow is very fast with the singular Drifter, it can be made even faster using the new DepoSetDrifter.

Here are some timing numbers from the four combinations made by the outer product of:

(Pgrapher, TbbFlow) x (singular depos, depo set)
TbbFlow, 1 thread, depo set: 0.746s
[11:05:22.321] D [   io   ] <NumpyDepoSetLoader:depos> call=0 sent ndepos=627983
[11:05:23.067] D [  glue  ] <DepoSetFanout:fanout> call=0 fanout depo set 0 with 537049

TbbFlow, 1 thread, singular depo: 1.423s
[11:07:11.140] D [   io   ] <NumpyDepoLoader:depos> load 0 complete with 627983 new, 627983 total
[11:07:12.563] D [  glue  ] <DepoSetFanout:fanout> call=0 fanout depo set 0 with 164164

Pgrapher, 1 thread, depo set: 0.756s
[11:08:29.559] D [   io   ] <NumpyDepoSetLoader:depos> call=0 sent ndepos=627983
[11:08:30.315] D [  glue  ] <DepoSetFanout:fanout> call=0 fanout depo set 0 with 537049

Pgrapher, 1 thread, singular depo: 221s
[11:09:03.242] D [   io   ] <NumpyDepoLoader:depos> load 0 complete with 627983 new, 627983 total
[11:12:33.302] D [  gen   ] <Drifter:> see EOS, ( dropped:90934 + drifted:537049 ) / total:627983 depos from stream, outside of all 4 drift xregions
[11:12:43.911] D [  glue  ] <DepoSetFanout:fanout> call=0 fanout depo set 0 with 164164

So, we are now able to choose TbbFlow or Pgrapher based on other requirements, while in both cases the new DepoSetDrifter is preferable. The singular Drifter may still be used (with TbbFlow, even if with only one thread) in cases where there is not yet a source that provides depos batched into sets.

Graph tracing in TbbFlow

Pgrapher has a very useful feature of measuring the time spent in each component and dumping a summary at the end of the job showing which nodes are the CPU pigs. TbbFlow has lacked any such measure and that is now (partly) corrected.

TbbFlow can now log each node's entry, exit and possibly error action. This measures what is seen at the TbbFlow node level and implementations of INode need not change anything to participate. The resulting log can then be parsed to answer various questions.

One "cute" question is "what does the graph execution history look like?". One answer is in the form of an animation produced with some help from GraphvizAnim and some code in wire-cell-python.

$ wirecell-img anidfg mylog.txt
$ animate anidfg.gif

The current styling leaves some things to be desired, but an example animation follows. Look for nodes to change from black to red which indicates they are active during a sampled time slice.

[example animation: anidfg.gif]

Expect wirecell-img to learn to produce a timing summary in the near future.