Design Choices: CPLD

This is the third post in a series detailing the rationale for design choices in our motion capture glove. I will try to anticipate questions that informed users will ask. This post discusses the sequence of choices that culminated in r0's CPLD and gives some indication of what was changed.

Discussion of the issues at hand, and a tour of r0's CPLD
Final r0 CPLD Schematic:

The fact that I didn't use functional sub-units in the block-editor gives this schematic low marks for clarity, but it allows me to easily show...

Regional subdivisions

IRQ aggregation
Signal selection logic
Internal register selection
Reset synchronization and internal clock
IRDA clock division
IRDA pin control

Quartus reports a maximum frequency for this design well over my design target of 10MHz when optimizing for area. Even so, much hand-optimization (and sacrifice) was needed to get this design to fit. Only a single flip-flop was left to spare.

Criticisms of r0's CPLD, and changes made in r1

First, some surrounding design changes should be mentioned...

  • Because the r1 CPLD is now on the main PCB, I eliminated the high-precision 10MHz oscillator. The r1 CPLD clock is now driven by a CPU timer channel. This allows actions in the firmware to directly scale the CPLD clock down to DC.
  • The IR LEDs were moved to the main PCB with the elimination of the metacarpals PCB. The new metacarpals unit was too small to carry the required current without burning traces or making the MC magnetometer useless. The LEDs are now driven by one of the CPU's timer pins.
  • Despite retaining the IR LEDs, the IRDA peripheral was scrapped. The Bluetooth module can be used for hand-to-hand communication should that be desirable.
  • The r1 CPLD has ~4x the area of r0's. Combined with the fact that the IRDA clock division and IRDA pin control functions are no longer needed, this leaves about 6x more CPLD area to use for nothing but IMU support.
  • This design was my most serious foray into Quartus timing constraints, and certain things were left unoptimized. I used synchronous counters in places where I could easily have used ring counters. Ring counters would have conveyed a power-usage advantage, but at the time I couldn't figure out how to enforce my downstream timing constraints. The r1 design will take advantage of the low target clock rate.
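
The timer-driven CPLD clock from the first item above can be sketched in a few lines. This is only an illustration of the arithmetic, not the firmware itself: the 80MHz timer input clock and the function names are assumptions made for the example, and a real timer peripheral would add prescaler and register-width limits.

```python
from typing import Optional

TIMER_HZ = 80_000_000  # hypothetical timer input clock, not taken from this post


def period_ticks(target_hz: float, timer_hz: int = TIMER_HZ) -> Optional[int]:
    """Timer period (in timer ticks) for one output cycle, or None to stop the clock."""
    if target_hz <= 0:
        return None  # hold the pin steady: the CPLD clock sits at DC
    ticks = round(timer_hz / target_hz)
    # A square wave needs at least one tick high and one tick low.
    return max(ticks, 2)


def actual_hz(ticks: Optional[int], timer_hz: int = TIMER_HZ) -> float:
    """Output frequency actually produced by a given period setting."""
    return 0.0 if ticks is None else timer_hz / ticks
```

Under these assumptions, asking for 10MHz gives a period of 8 ticks, and asking for 0Hz simply stops the output, which is the "down to DC" case.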

So what to do with all the new area?

Simple things first... Reset synchronization and the internal clock are now better-buffered. There can no longer be glitches when transitioning between clock domains.

The signal selection logic in r0 was sloppy. I was driving the sensor-facing MOSI pins regardless of selection criteria. Needless power usage. Needless EMI. This has been rectified. Additionally, the 1-to-2 demultiplexers on the sensor boards were removed in favor of direct lines for each chip-select. The A0 pins are therefore no longer required.
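
To make the cost of ungated MOSI lines concrete, here is a toy Python model (not the CPLD logic) that counts edges across all sensor-facing MOSI lines while one byte is shifted out to a single target. The line count of 5 and the function name are placeholders chosen for the example.

```python
from typing import Sequence


def line_toggles(bits: Sequence[int], n_lines: int, selected: int, gated: bool) -> int:
    """Count total transitions across all sensor-facing MOSI lines.

    gated=False mimics driving every line regardless of selection;
    gated=True drives only the selected line and holds the rest low.
    """
    toggles = 0
    prev = [0] * n_lines  # every line starts low
    for bit in bits:
        for i in range(n_lines):
            level = bit if (not gated or i == selected) else 0
            if level != prev[i]:
                toggles += 1
            prev[i] = level
    return toggles


byte = [1, 0, 1, 0, 1, 0, 1, 0]  # worst-case alternating pattern
ungated = line_toggles(byte, n_lines=5, selected=2, gated=False)
gated = line_toggles(byte, n_lines=5, selected=2, gated=True)
```

In this model the ungated case produces five times as many edges as the gated one, and every one of those extra edges is wasted switching current and radiated noise.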

Internal register selection is largely the same, with some deep upgrades to facilitate sequential/concurrent register access and automated transfer.

The SPI bus now operates with the CPLD as the master. This relieves the firmware of the duties associated with driving bus behavior during the SPI ISR. Instead, the DMA interrupt fires when large swaths of data arrive in the buffer. Additional work in the GCC linker allows this data to land directly in the memory locations where the Cortex-M DSP instructions expect to find it when they begin to handle it, prior to a type-upgrade to float.

All of this is excellent news for the computational pipeline, but the biggest benefit comes from the fact that it is now possible to read an entire IMU frame (from all 17 sensors) in 2 bus operations, where it previously required 51. This directly translates into less ISR thrash, less running memory load, and the concomitant decrease in software work required to place the resulting data.
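
The scale of that reduction is easy to work out. The sketch below assumes three bus operations per sensor in r0, which is only an inference from 51 = 17 x 3, and uses the 997Hz full sample rate mentioned later in this post.

```python
SENSORS = 17
FRAME_RATE_HZ = 997      # full IMU sample rate
R0_OPS_PER_SENSOR = 3    # assumption: inferred only from 51 / 17
R1_OPS_PER_FRAME = 2


def ops_per_second(ops_per_frame: int, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Bus operations the firmware must service each second."""
    return ops_per_frame * frame_rate_hz


r0_ops_per_frame = SENSORS * R0_OPS_PER_SENSOR
r0_rate = ops_per_second(r0_ops_per_frame)
r1_rate = ops_per_second(R1_OPS_PER_FRAME)
```

At the full sample rate that is roughly fifty thousand bus operations per second for r0 against about two thousand for r1.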

For write operations, registers were added to allow ranked register access, automatically replicating write operations across digits. This optimization cuts the number of register accesses down to one-sixth (or less) of what it is in r0 when writing instructions to all IMUs (as is the nominal case). This will also have the benefit of eliminating frame-skew caused by delays in sensor reconfiguration during dynamic-range adjustments.
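
One way to picture ranked access is as a fan-out: a single write names a position (rank) and the hardware repeats it for that position in every group. The Python sketch below is a hypothetical model only. The layout of 6 groups with 3 positions each, the register address, and the function names are all assumptions, chosen because they reproduce the one-sixth figure.

```python
from typing import List, Tuple

GROUPS = 6  # hypothetical number of sensor groups
RANKS = 3   # hypothetical number of sensor positions per group

Write = Tuple[int, int, int, int]  # (group, rank, register, value)


def expand_ranked_write(rank: int, reg: int, value: int, groups: int = GROUPS) -> List[Write]:
    """Individual sensor writes that a single ranked write stands in for."""
    return [(group, rank, reg, value) for group in range(groups)]


def accesses_to_write_all(ranked: bool, groups: int = GROUPS, ranks: int = RANKS) -> int:
    """Register accesses the CPU issues to put one value in every sensor."""
    return ranks if ranked else groups * ranks


fanout = expand_ranked_write(rank=0, reg=0x10, value=0xAB)  # 0x10 is a made-up address
```

With this layout, writing every sensor takes 3 accesses instead of 18, and all sensors at a given rank receive the new setting from the same access rather than one after another.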

The most painful sacrifice in r0's CPLD was tied to IRQ aggregation. Originally, I had planned to keep a double-buffered copy of interrupt signals on state-change, but I had to settle for an abbreviated implementation. This forced concessions in the IMU configuration, as well as latency in retrieval as firmware preempted other bus operations to read the pin state as soon as possible. Under conditions of multiple closely-spaced IRQs, events would be lost for lack of double-buffering, and there would be frame drop as the single SPI bus was monopolized for the sake of IRQ discovery. Interrupts on r0 were nearly useless.

In cooperation with some sleight-of-hand in the digit design, the new CPLD has far more robust IRQ aggregation, as well as an independent SPI channel operating in master-mode to shuffle the IRQ data to the CPU. Transfer width is a constant, and DMA is now used to make IRQ discovery possible within a handful of microseconds after a pin-state change. All this without the risk of glitching or missed IRQs that plagued this sub-system in r0.
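
The value of the second buffer shows up even in a toy model. The Python class below is a behavioral sketch under simplifying assumptions, not the CPLD's actual logic: a snapshot of the IRQ pins occupies a buffer until its transfer to the CPU completes, and the class and method names are invented for the example.

```python
from typing import List, Optional


class IrqAggregator:
    """Toy model of IRQ snapshot buffering with one or two buffers."""

    def __init__(self, double_buffered: bool) -> None:
        self.double_buffered = double_buffered
        self.front: Optional[int] = None  # snapshot being shifted out to the CPU
        self.back: Optional[int] = None   # snapshot latched while the front is busy
        self.delivered: List[int] = []
        self.lost = 0

    def pin_change(self, state: int) -> None:
        """Latch a new IRQ pin state, or count it as lost if no buffer is free."""
        if self.front is None:
            self.front = state  # idle: this snapshot goes out next
        elif self.double_buffered and self.back is None:
            self.back = state   # busy: park it in the second buffer
        else:
            self.lost += 1      # nowhere to put it

    def transfer_done(self) -> None:
        """The CPU has received the front snapshot; promote the back buffer."""
        if self.front is None:
            return
        self.delivered.append(self.front)
        self.front, self.back = self.back, None


def run(double_buffered: bool) -> IrqAggregator:
    agg = IrqAggregator(double_buffered)
    agg.pin_change(0b01)  # first IRQ arrives, transfer begins
    agg.pin_change(0b11)  # second IRQ arrives before the transfer finishes
    agg.transfer_done()
    agg.transfer_done()
    return agg


single = run(double_buffered=False)
double = run(double_buffered=True)
```

With one buffer the second, closely-spaced event is dropped. With two, both snapshots reach the CPU in order.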

Power conservation

Because r1 can continuously scale the CPLD clock, and because the CPLD is now the bus master, only as much power as is needed will be consumed for a given sample rate from the IMUs. And since that clock is not being divided by flip-flops in the CPLD, there is far less current-draw due to useless state transitions.

The improved signal selection will also add a modest improvement to the power profile.

Some of the new features in r1 also allow the CPU to relinquish the CPLD's clock source and to go to sleep entirely while the CPLD runs on its internal clock. This allows the CPLD to continue monitoring the IMUs and wake the CPU on predetermined inertial events that are defined by the user. Snapping a finger can now wake the glove, for instance. Useful for a mobile device where every electron counts.

Going forward

Once the digits arrive from our fabricator, I will be able to provide empirical measurements of latency, bandwidth, and other low-level statistics. With my current design, my numbers show CPU usage below 1% while reading all data from all 17 IMUs at their full sample rate of 997Hz.

Quartus resource usage for the r1 CPLD:

Revision Name            DigitabulumCPLD
Top-level Entity Name    r1-CPLD
Family                   MAX II
Device                   EPM570F100C5
Timing Models            Final
Total logic elements     475 / 570 ( 83 % )
Total pins               76 / 76 ( 100 % )
Total virtual pins       0
UFM blocks               1 / 1 ( 100 % )

This leaves open the possibility of adding somewhat higher-level functionality later. There are still some low-hanging optimizations to make by converting shift-registers into sum-of-terms multiplexers. Since the sensors' maximum bus frequency is 10MHz, the design only has to be timing-constrained to 20MHz, and this is easily achievable.

The expansion port on r1 allows access to a handful of uncommitted CPLD pins, as well as the CPLD JTAG interface. Sufficient interest and a successful release would make it possible to open the design for community modification and/or special accommodations for dedicated motion processors, should those things be developed.

I've worked very hard to reserve every possible CPU cycle for crunching numbers and running user code. This design choice represents the creation of a specialized bus sub-processor to handle a workload that would drown even a large microcontroller in its own transfer bureaucracy.

I dare say I've succeeded.