Work on the project started in September 2011. The design was originally inspired by experience with fREEDA [freeda] and carrot [carrot], plus ideas from other simulators and improvements that take advantage of the flexibility of Python.

The internal circuit representation and the device library attempt to
be independent of the formulation used for simulation. The `Circuit`
class has no methods to obtain nodal voltages or calculate the
Jacobian of the circuit. This is delegated to other classes that
handle particular circuit analysis approaches such as nodal,
port-based, linear/nonlinear separation, *etc.*

All elements except frequency-defined ones are modeled using the current-source approach [aplac2]. The interface essentially describes a subcircuit composed of independent and voltage-controlled current/charge sources. If the nodal approach is used, this has the advantage that an analysis can be implemented by considering just a few fundamental components. For example, there is no need for MNAM stamps because the NAM can handle all elements. This approach is also very flexible: both fREEDA-style state-variable and SPICE-style nonlinear models have been implemented.
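To make the idea concrete, here is a minimal sketch of the single linear stamp a nodal analysis needs under this approach: a voltage-controlled current source. The function name and calling convention are hypothetical, not the actual simulator API.

```python
import numpy as np

def stamp_vccs(G, out_p, out_n, ctl_p, ctl_n, g):
    """Add a VCCS with transconductance g to the nodal matrix G (in place).

    The current g * (v(ctl_p) - v(ctl_n)) flows into node out_p and
    out of node out_n.  (Hypothetical helper, for illustration only.)
    """
    for row, rsign in ((out_p, 1.0), (out_n, -1.0)):
        for col, csign in ((ctl_p, 1.0), (ctl_n, -1.0)):
            G[row, col] += rsign * csign * g
```

A plain conductance between two nodes is the special case where the output and control ports coincide: `stamp_vccs(G, 0, 1, 0, 1, g)`.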

Frequency-defined elements are modeled by their Y matrix (or an equivalent; see the S-parameter-based transmission line model). The interface returns a 3-D matrix with the parameters at all frequencies. It is possible to conceive a more compact hybrid representation for some devices that combines the current-source approach with smaller Y matrices, but this may unnecessarily complicate the implementation of simpler devices. This interface is being reviewed and may change in the future.
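The shape convention can be sketched with a toy two-port capacitor that returns its Y parameters for all requested frequencies at once (class and method names here are illustrative, not the actual device library interface):

```python
import numpy as np

class CapacitorY:
    """Hypothetical frequency-defined element returning a 3-D array
    of Y parameters with shape (nFreq, nPorts, nPorts)."""
    c = 1e-12  # capacitance in F (assumed parameter name)

    def get_Y(self, freqs):
        y = 2j * np.pi * np.asarray(freqs) * self.c
        Y = np.zeros((len(freqs), 2, 2), dtype=complex)
        Y[:, 0, 0] = y
        Y[:, 1, 1] = y
        Y[:, 0, 1] = -y
        Y[:, 1, 0] = -y
        return Y
```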

Internal terminals are not tracked directly by the circuit. One
advantage of this is that a device can process parameters
independently of the containing circuit (a reference to the circuit is
no longer needed in `process_params()`). Another advantage is that
the terminal name does not need to be unique. The `Circuit` class
now has a function to retrieve all internal terminals, which (as
explained above) are not present in the `termDict` dictionary.

By default there is no local reference terminal. A local reference
terminal can be created by calling `self.add_reference_term()`.

Internal terminals could be indexed by their position in the
terminal list, but for devices derived from a base device
(such as autothermal devices) the internal terminal indexes change if
the number of external terminals changes. This problem can be avoided
by indexing internal terminals separately and referring to internal
terminals as `(self.numTerms + number)` instead of using fixed
numbers. A more convenient solution, currently implemented, is to have
the `add_internal_terminal()` and `add_reference_term()` functions
return the internal terminal index.
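The scheme can be sketched in a few lines (class and attribute names here are illustrative only):

```python
class Device:
    """Hypothetical sketch of internal-terminal indexing."""
    def __init__(self, numTerms):
        self.numTerms = numTerms   # number of external terminals
        self._nInternal = 0

    def add_internal_terminal(self):
        # Index internal terminals past the external ones, so a derived
        # device that adds external terminals (e.g., an autothermal
        # version) automatically gets shifted but valid indexes.
        idx = self.numTerms + self._nInternal
        self._nInternal += 1
        return idx
```

For a 2-terminal device the first internal terminal gets index 2; an autothermal derivative with 3 external terminals would get 3 for the same call, without any change to the model code.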

This solution, however, would not work for a ‘hardwired’ subcircuit. Hardwired subcircuits would also present other problems with parameter lists. As hardwired subcircuits can be replaced with regular subcircuits, no support for them is planned. Just for the record, a possible solution is presented here: have a function in the `Element` class that returns terminal indexes. External terminal indexes would also change in this case:

```
class Element:
    ...
    def term_idx(self, n):
        # n-th external terminal of this element
        return self.baseIndex + n

    def int_term_idx(self, n):
        # n-th internal terminal, offset past the external ones
        return self.baseIndex + self.numTerms + n
```

`baseIndex` is set to the starting terminal number for a particular
element.

The return value of `eval_cqs()` for device models is normally a tuple
of the form `(iVec, qVec)`, where `iVec` is a vector of currents and
`qVec` a vector of charges. Either vector may be empty in cases
where there are no currents or charges to calculate. For this reason
this approach sometimes introduces some overhead compared to having
both currents and charges combined in a single vector. However, the
current output format is easier to handle at a higher level and thus
it is preferred.
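As an illustration, a minimal diode-like model following this convention might look as follows (the class name, parameter names and values are hypothetical, not taken from the actual device library):

```python
import numpy as np

class DiodeModel:
    """Illustrative sketch of the (iVec, qVec) return convention."""
    isat = 1e-15   # saturation current (A), assumed value
    vt = 0.02585   # thermal voltage (V)
    cj0 = 1e-12    # junction capacitance (F), assumed value

    def eval_cqs(self, vPort):
        # vPort[0]: controlling voltage across the junction
        iVec = np.array([self.isat * (np.exp(vPort[0] / self.vt) - 1.0)])
        qVec = np.array([self.cj0 * vPort[0]])  # crude linear charge
        return (iVec, qVec)
```

A purely resistive device would return an empty charge vector (`np.array([])`) as the second element of the tuple.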

Still, the `eval()` and `eval_and_deriv()` functions lump currents
and charges into a single vector because this is much more efficient
when implemented using AD tapes (at least with the current
interface). However, the analysis code could bypass those functions
completely and generate custom AD tapes for greater efficiency.

When the `getOP` flag is set in `eval_cqs()`, additional
operating point variables may be calculated. Originally these
variables were returned in an array so they could be taped by the
automatic differentiation library. However, keeping track of many
different variables in a vector is prone to mistakes, and there is no
clear advantage in making operating point variables (other than output
currents/charges) available for differentiation.

For this reason operating point variables are now returned in a
dictionary so they can be accessed by name. Only the dictionary is
returned when `getOP` is True. This simplifies the implementation, as
the currents and charges can be obtained using the (faster) AD tapes.
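A sketch of the dictionary-based interface (parameter and key names here are made up for illustration):

```python
import numpy as np

class ResistorModel:
    """Hypothetical sketch: OP variables returned as a dictionary."""
    g = 1e-3  # conductance (S), assumed parameter

    def eval_cqs(self, vPort, getOP=False):
        if getOP:
            # OP mode: return only the dictionary; currents and charges
            # are obtained separately from the (faster) AD tapes.
            return {'i': self.g * vPort[0],
                    'power': self.g * vPort[0]**2}
        return (np.array([self.g * vPort[0]]), np.array([]))
```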

Why is `get_OP()` a separate function? Answer: the Jacobian is often
needed to calculate operating point variables (transconductance,
capacitance), and the Jacobian cannot be evaluated (using AD) from
within `eval_cqs()`.

The current interface requires the `get_OP()` function to work
correctly for both the regular and the electrothermal version of a
given model. This is necessary if AD tapes are to be used: for
electrothermal models the AD tape input/output variables include
temperature/power.

Currently device parameters can initially be set by three different means, in order of priority:

- Explicitly specified in the device line
- Specified in a global model
- The default value

The explicitly specified parameters are kept in a separate dictionary in each element. Global netlist variables can be used to set the values of parameters specified in the device line or the model line.

- If a string is specified as the value of a numeric parameter, it is marked as a potential variable.
- Variables are specified in a `.vars` statement in the netlist and are assumed to be numeric scalars/vectors.
- When the circuit/analysis is initialized, elements/models/analyses check the global netlist variable dictionary to find and set the variable value. If the variable is not found, an exception is raised. One problem with this is that by that time the netlist line number is lost and the diagnostic message is not as good.
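The resolution step can be sketched as follows (the function and dictionary names are hypothetical; the real code lives in the element/model initialization):

```python
# Global variable dictionary, filled from .vars statements (assumed name)
netVars = {'rload': 50.0}

def resolve(value):
    """Return a numeric value; strings are looked up in netVars."""
    if isinstance(value, str):
        try:
            return netVars[value]
        except KeyError:
            # By this point the netlist line number is already lost,
            # so the diagnostic cannot point at the offending line.
            raise ValueError('undefined netlist variable: ' + value)
    return value
```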

Another difficulty is how to update dependent parameters when a variable value is changed. This would require repeating the whole process for all models/elements, as there is no way to know which ones are affected. A change in variable/model/element parameters is likely to happen in sweeps, sensitivity and optimization calculations. Given these considerations, the current solution requires re-visiting all elements and re-generating all equations. One workaround is to create a list of elements to be updated when needed in the analysis code.

For efficiency, indexing individual elements of arrays from Python
should be avoided as much as possible. Advanced numpy indexing that
avoids per-element loops over the matrices was tried, but
unfortunately `G[obj] += Jac` does not work when some of the
elements selected by `obj` are repeated (see the tentative numpy
tutorial at http://www.scipy.org/NumPy_Tutorial). Possible approaches
to overcome this are the following:

- Use a few optimized functions doing the inner loops (perhaps using cython) or implementing fancy indexing with sparse matrices.
- Create a giant AD tape for the whole circuit. The nodalAD module implements this. This approach is simpler but in practice test results (see nodalAD module) show that it does not improve efficiency. Also tape generation seems to require a dense matrix multiplication (G * x). This rules out the approach for large circuits.
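The repeated-index pitfall can be demonstrated in a few lines. `numpy.add.at` (an unbuffered, accumulating update) is shown for contrast; it gives the correct result but is generally slower than dedicated inner-loop code:

```python
import numpy as np

G = np.zeros(4)
obj = np.array([0, 1, 1, 2])   # index 1 appears twice
Jac = np.ones(4)

# Fancy-indexed in-place addition buffers the update, so the repeated
# index receives only one of its two contributions:
G[obj] += Jac

# numpy.add.at performs an unbuffered, accumulating update instead:
G2 = np.zeros(4)
np.add.at(G2, obj, Jac)
```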

Currently the first approach is used for dense matrices and is
described here: nonlinear (and frequency-defined) elements are given
new attribute vectors `nD_?pos` and `nD_?neg` that contain the
non-reference terminal row/column numbers where the internal current
sources are connected. This requires some pre-processing, but in this
way `if` statements can be avoided in the inner-loop functions. For
regular linear transconductances this does not seem necessary, as the
matrix only has to be filled once.
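A sketch of the idea (the variable and function names here are illustrative, not the actual attributes): the reference terminal is filtered out once during pre-processing, so the per-iteration inner loop is a pure indexed accumulation with no conditionals.

```python
import numpy as np

# Pre-processing: keep only the matrix rows of non-reference terminals.
termRows = [0, None, 2]            # None marks the reference terminal
pairs = [(row, k) for k, row in enumerate(termRows) if row is not None]
posRows = np.array([row for row, _ in pairs])   # matrix row numbers
srcIdx = np.array([k for _, k in pairs])        # current-vector entries

def set_i(iVec, current):
    # Inner loop executed at every iteration: no `if` tests needed
    iVec[posRows] += current[srcIdx]
```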

For the sparse-matrix nodal implementation, further pre-processing is
needed to create the index vectors that directly fill the main
matrix. The main matrix is built in triplet format (Scipy
`coo_matrix` format). Values from nonlinear elements are inserted
directly into the `coo_matrix` `data` array:

```
# Positive Jacobian entries go to their precomputed positions in the
# triplet data array
M.data[mpidx + self._mbase] = Jac.flat[jacpidx]
self._mbase += len(mpidx)
# Negative entries are stored with the sign flipped
M.data[mnidx + self._mbase] = -Jac.flat[jacnidx]
self._mbase += len(mnidx)
```
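One property worth noting about the triplet format: duplicate `(row, col)` entries are summed when the matrix is converted to CSC/CSR, which is what makes it convenient for stamping contributions from many elements into the same matrix entry:

```python
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1])
data = np.array([1.0, 2.0, 3.0])   # two contributions to entry (0, 0)

M = coo_matrix((data, (rows, cols)), shape=(2, 2))
A = M.tocsc()                      # duplicate triplets are summed here
```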

This method seems to be the fastest that can be achieved without
resorting to compilation. According to my tests, fancy indexing is
very slightly faster than using `numpy.put()` and
`numpy.take()` to fill the matrix. This does not agree with the
comments at http://wesmckinney.com/blog/?p=215 ; perhaps the result is
different for other architectures or library versions.
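For reference, the two alternatives compared above produce identical results; only their speed differs:

```python
import numpy as np

idx = np.array([1, 3, 5])
vals = np.array([10.0, 20.0, 30.0])

# Fancy-indexing assignment...
a = np.zeros(6)
a[idx] = vals

# ...versus numpy.put (with numpy.take as the read-side counterpart)
b = np.zeros(6)
np.put(b, idx, vals)
```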

Further optimization would likely require access to the low-level interface of the C libraries, plus compilation of some functions used in inner loops.

A profile of a transient analysis of the soliton line with a matrix size of 3022 was obtained using scipy matrices, optimized matrix filling and SuperLU for matrix decomposition. The number of time steps is 501. The results shown in the table were obtained on a netbook with an Intel Atom processor:

```
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000   59.380   59.380 cardoon:27(run_analyses)
      1    0.152    0.152   59.380   59.380 tran.py:73(run)
    501    0.040    0.000   44.099    0.088 fsolve.py:23(solve)
    501    0.016    0.000   44.059    0.088 nodal.py:420(solve_simple)
    501    0.260    0.001   44.043    0.088 fsolve.py:59(fsolve_Newton)
    505    0.068    0.000   40.807    0.081 nodal.py:423(get_deltax)
    501    7.340    0.015   25.678    0.051 nodalSP.py:731(get_i_Jac)
    505    0.132    0.000   14.793    0.029 nodalSP.py:204(_get_deltax)
  47282    7.824    0.000    9.421    0.000 nodalSP.py:165(_set_Jac)
    505    0.036    0.000    9.129    0.018 linsolve.py:131(factorized)
    505    0.096    0.000    9.093    0.018 linsolve.py:103(splu)
    505    8.597    0.017    8.597    0.017 :0(dgstrf)
    501    2.756    0.006    7.720    0.015 nodalSP.py:651(update_q)
   1004    4.904    0.005    4.904    0.005 :0(solve)
  23782    1.996    0.000    4.236    0.000 cppaddev.py:143(eval_and_deriv)
 230703    3.812    0.000    3.812    0.000 :0(len)
    505    0.136    0.000    3.084    0.006 coo.py:238(tocsc)
   1020    2.680    0.003    2.680    0.003 :0(max)
    499    0.036    0.000    2.492    0.005 nodalSP.py:223(get_chord_deltax)
   8013    0.408    0.000    2.476    0.000 nodalSP.py:43(set_quad)
  70829    2.388    0.000    2.388    0.000 nodal.py:52(set_i)
  23782    1.604    0.000    2.100    0.000 adfun.py:208(jacobian)
  18224    1.252    0.000    2.076    0.000 nodalSP.py:34(triplet_append)
```

There seem to be no obvious major bottlenecks. Most of the simulation
time is spent building the matrix and evaluating nonlinear devices
(`get_i_Jac`), followed by linear system solving
(`_get_deltax`). Note that nonlinear device evaluation
(`eval_and_deriv`) takes a small percentage of the total simulation
time, in part because there are few nonlinear devices in this circuit.

A profile of a transient analysis of the soliton line with a matrix size of 189 was obtained using scipy matrices, optimized matrix filling and SuperLU for matrix decomposition. The number of time steps is 101. The results shown in the table were obtained on a netbook with an Intel Atom processor:

```
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000   13.149   13.149 cardoon:27(run_analyses)
      1    0.020    0.020   13.149   13.149 tran.py:73(run)
    101    0.012    0.000   11.137    0.110 fsolve.py:23(solve)
    101    0.012    0.000   11.125    0.110 nodal.py:420(solve_simple)
    101    0.084    0.001   11.113    0.110 fsolve.py:59(fsolve_Newton)
    257    0.064    0.000   10.849    0.042 nodal.py:423(get_deltax)
    203    1.844    0.009    6.748    0.033 nodalSP.py:731(get_i_Jac)
    257    0.032    0.000    2.716    0.011 nodalSP.py:204(_get_deltax)
  11960    2.076    0.000    2.424    0.000 nodalSP.py:165(_set_Jac)
   6708    0.604    0.000    1.660    0.000 cppaddev.py:143(eval_and_deriv)
     54    0.192    0.004    1.320    0.024 nodalSP.py:438(get_i_Jac)
    257    0.088    0.000    1.320    0.005 coo.py:238(tocsc)
    257    0.008    0.000    1.284    0.005 linsolve.py:131(factorized)
    257    0.040    0.000    1.276    0.005 linsolve.py:103(splu)
    257    1.008    0.004    1.008    0.004 :0(dgstrf)
    101    0.284    0.003    0.992    0.010 nodalSP.py:651(update_q)
   6708    0.764    0.000    0.888    0.000 adfun.py:208(jacobian)
  54474    0.800    0.000    0.800    0.000 :0(len)
  14586    0.768    0.000    0.768    0.000 nodal.py:52(set_i)
    358    0.088    0.000    0.740    0.002 base.py:270(__mul__)
    260    0.112    0.000    0.528    0.002 compressed.py:20(__init__)
    358    0.064    0.000    0.384    0.001 compressed.py:259(_mul_vector)
   9334    0.376    0.000    0.376    0.000 nodal.py:41(set_xin)
```

Similar observations can be made for this case, except that in this circuit most of the devices are nonlinear.

Note: the code used in this profile is not the default currently used
in the program. A profile of a transient analysis of the soliton line
with a matrix size of 3022 (using pysparse) seems to indicate that
about half of the time is spent building the matrix and half factoring
it. At this time the main Jacobian is (almost) created and factored
from scratch at every iteration. The results shown in the table were
obtained on a netbook with an Intel Atom processor. Note that the time
to evaluate nonlinear models (`eval_and_deriv`) is only about 25% of
the time to build the matrix.

```
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000   23.301   23.301 <string>:1(<module>)
      1    0.000    0.000   23.301   23.301 profile:0(run_analyses(analysisQueue))
      1    0.000    0.000   23.301   23.301 cardoon:26(run_analyses)
      1    0.036    0.036   23.301   23.301 tran.py:66(run)
    100    0.000    0.000   19.181    0.192 fsolve.py:23(solve)
    100    0.012    0.000   19.181    0.192 nodalSP.py:257(solve_simple)
    100    0.132    0.001   19.169    0.192 fsolve.py:59(fsolve_Newton)
    257    0.056    0.000   18.197    0.071 nodalSP.py:260(get_deltax)
    253    3.516    0.014    8.993    0.036 nodalSP.py:729(get_i_Jac)
    257    0.016    0.000    8.921    0.035 nodalSP.py:246(_get_deltax)
    257    8.453    0.033    8.453    0.033 :0(factorize)
  12126    0.924    0.000    2.096    0.000 cppaddev.py:143(eval_and_deriv)
```

Of course things may be different with other circuits, but for this case a speed-up of at least a factor of two may be expected if matrix creation and factorization are optimized.

Currently voltages are stored in terminals only after the final solution is found. The main reason for this is efficiency: it is less work to operate directly on the vector of unknowns in the equation-solving routine.

The nodal voltage in the thermal port of electrothermal models represents the difference between the device temperature and the ambient temperature. In this way a zero difference is usually a good guess for the nonlinear solver and the numerical solution is more robust.

Some issues that should be addressed in the future include:

- Automatically convert the temperature difference to the actual temperature for plotting and saving (the same scheme should also work for currents in voltage sources for example).
- Propagate units across terminals in thermal circuits using an algorithm similar to freeda’s local reference group checking.

Short-term goals:

- Add support for convolution of frequency-defined elements

Longer term goals and other tasks to be done at some time:

- Add netlist variable sweep in DC analysis
- Implement time-step control in transient analysis
- Implement Harmonic Balance analysis (or equivalent)
- Add “+” notation for multiple lines in parser
- Implement sparse-matrix-based AC analysis
- Handle RuntimeWarnings more gracefully. Currently all RuntimeWarnings are suppressed while the AD tapes are being created/evaluated, as sometimes function arguments are invalid.

Housekeeping and refinement:

Check for repeated terminals in subcircuit external connections. Should the following be allowed or not?:

```
xamp1 12 12 2 26 myamplifier
```