The end goal is a streaming architecture that supports fairly arbitrary
pipelines with all kinds of data formats.  Performance is a priority, and
whatever means are necessary will be used to get there.  (The existing OGI
pipeline has some points to make, so it doesn't use all possible speedups;
some are listed later.  This is written with the OGI pipeline as a
background, so expect that.)

Since pipelines come in all kinds of configurations, threads are not
always used, though they will be encouraged in this arch.  A more
generally sane implementation of the OGI pipeline would be designed with a
lot fewer threads, using lockstep sub-pipelines for the 'sippers' to
ensure that the scheduler doesn't pull its hair out.  This one point alone
dictates that the streamer arch must be rather different from the OGI
arch, in that we'll be leaving a lot more of the basic pipeline structure
up to the coder.

Function pointers will be used to push data, which is a marked difference
from the OGI arch.  Full queues will be more like pipeline elements than
connections.
This should be doable with no performance penalty, and allows the lockstep
pipelines to be done with no extra work.  In fact, this sounds a lot like
the original swift-c arch, where you push something into one end and
eventually something comes out the other.  The strategic use of queues and
threads (thread->queue->thread->queue->thread->etc.) allows you to tailor
things rather nicely.
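
As a rough sketch of the push model, here is what a two-element lockstep
chain could look like in C.  Everything named here (spipe, doubler, sink,
push_through) is made up for illustration and is not part of the design;
the pipe type is called spipe only to stay clear of POSIX pipe().

```c
#include <assert.h>
#include <stddef.h>

typedef struct buffer { int data; } buffer;

/* Hypothetical pipe type: just a function pointer plus the state of
 * whatever element sits downstream. */
typedef struct spipe spipe;
struct spipe {
  void (*chain)(spipe *self, buffer *buf); /* downstream entry point */
  void *element;                           /* downstream element state */
};

/* A lockstep element: frob the buffer, then push it on. */
typedef struct doubler { spipe *out; } doubler;

void doubler_chain(spipe *in, buffer *buf) {
  doubler *self = in->element;
  buf->data *= 2;                    /* stand-in for a real frob step */
  self->out->chain(self->out, buf);  /* push downstream, same thread */
}

/* A sink: remembers the last value it was handed. */
typedef struct sink { int last; int count; } sink;

void sink_chain(spipe *in, buffer *buf) {
  sink *self = in->element;
  self->last = buf->data;
  self->count++;
}

/* Push one value in at the head; return what reached the sink. */
int push_through(int value) {
  sink snk = { 0, 0 };
  spipe pipe2 = { sink_chain, &snk };
  doubler dbl = { &pipe2 };
  spipe pipe1 = { doubler_chain, &dbl };
  buffer buf = { value };

  assert(pipe1.chain != NULL);
  pipe1.chain(&pipe1, &buf);  /* returns only after the sink has run */
  return snk.last;
}
```

The point is that one call at the head runs the whole lockstep section on
the caller's stack, so no scheduler is involved until a queue shows up.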

Since queues become entities placed there for a reason rather than
background, they can reasonably be controlled more directly. Specifically,
queues should be flushable on demand.  In the case of a video pipeline
with disk source and ff/rw controls, you want to be able to push data
through as fast as possible once state changes to get things started.
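
A minimal sketch of a flushable queue, assuming "flush" means dropping
whatever is still queued so fresh data gets through (draining it to the
output would be the other reading).  The names, the fixed capacity and the
release callback are assumptions, and locking between the two threads is
left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 8   /* arbitrary capacity for the sketch */

typedef struct buffer { int frame; } buffer;

typedef struct queue {
  buffer *slot[QUEUE_CAP];
  size_t head;    /* index of the oldest entry */
  size_t count;   /* number of queued entries */
} queue;

void queue_init(queue *q) {
  assert(q != NULL);
  q->head = 0;
  q->count = 0;
}

/* Returns 0 on success, -1 if the queue is full. */
int queue_push(queue *q, buffer *buf) {
  if (q->count == QUEUE_CAP) return -1;
  q->slot[(q->head + q->count) % QUEUE_CAP] = buf;
  q->count++;
  return 0;
}

/* Returns the oldest buffer, or NULL if the queue is empty. */
buffer *queue_pop(queue *q) {
  buffer *buf;
  if (q->count == 0) return NULL;
  buf = q->slot[q->head];
  q->head = (q->head + 1) % QUEUE_CAP;
  q->count--;
  return buf;
}

/* Drop everything queued.  Each stale buffer is handed to release()
 * if one is given.  Returns the number of buffers dropped. */
size_t queue_flush(queue *q, void (*release)(buffer *)) {
  size_t dropped = 0;
  buffer *buf;
  while ((buf = queue_pop(q)) != NULL) {
    if (release) release(buf);
    dropped++;
  }
  return dropped;
}
```

On a ff/rw state change the controlling code would call queue_flush() and
then start pushing the new data, so the first buffer after the seek is
also the first one the reader pulls.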

To this end, I'd like to investigate the use of signals on either end of
queues, or some other active self-scheduling mechanisms.  For instance, if
you're trying to flood data through the pipeline to get things sane after
a flush, you'd want to process one end-unit of data (as defined by the
type of data) in pipeline order as fast as possible, not queueing it very
long.  Once that's through, you can go back to normal scheduling.

Anyway, high-level pseudo-code of a small pipeline might look like:

queue queue1;   /* shared by both threads */

main(char *filename) {
  queue1 = queue_create();
  thread_create(thread1,filename);
  thread_create(thread2);
}

thread1(char *filename) {
  pipe pipe1, pipe2;

  pipe1 = create_pipe();
  pipe2 = create_pipe();

  element1_init(pipe1,filename);
  element2_init(pipe1,pipe2);
  queue_connect(pipe2,queue1);

  while(!eof(fd))
    element1_read();
}

thread2() {
  pipe pipe3, pipe4;

  pipe3 = create_pipe();
  pipe4 = create_pipe();

  queue_connect(queue1,pipe3);
  element3_init(pipe3,pipe4);
  element4_init(pipe4);

  while (1)
    queue_pull(queue1);
}


pipe element1_outpipe;
int fd;

element1_init(pipe out,char *filename) {
  fd = open_file(filename);
  element1_outpipe = out;
}

element1_read() {
  buffer buf;
  buf = read_from_disk(fd);
  element1_outpipe->chain(buf);
}


pipe element2_inpipe;
pipe element2_outpipe;

element2_init(pipe in,pipe out) {
  element2_inpipe = in;
  element2_outpipe = out;

  element2_inpipe->chain = element2_chain;
}

element2_chain(buffer buf) {
  element2_frob(buf);
  element2_outpipe->chain(buf);
}


pipe element3_inpipe;
pipe element3_outpipe;

element3_init(pipe in,pipe out) {
  element3_inpipe = in;
  element3_outpipe = out;

  element3_inpipe->chain = element3_chain;
}

element3_chain(buffer buf) {
  element3_frob(buf);
  element3_outpipe->chain(buf);
}


pipe element4_inpipe;

element4_init(pipe in) {
  element4_inpipe = in;

  element4_inpipe->chain = element4_chain;
}

element4_chain(buffer buf) {
  send_to_display(buf);
}


There's a lot of stuff that's not shown, but that's the basic idea.
Perhaps I should rearrange it so things are connected outside the
elements.  Duh.  The threads should look something like this:

thread1(char *filename) {
  pipe pipe1, pipe2;
  element element1;
  element element2;

  pipe1 = create_pipe();
  pipe2 = create_pipe();

  element1 = element1_init(filename);
  element2 = element2_init();

  stream_connect(element1,pipe1);
  stream_connect(pipe1,element2);
  stream_connect(element2,pipe2);
  stream_connect(pipe2,queue1);

  while(!eof(element1->get(fd)))
    element1->chain();
}

Or somesuch.  I'm thinking in terms of reconfigurability here.  Eventually
I'd like to see .so's for all the various codecs, and have a function that
will take two entities (say, two elements) and autoconnect them with the
appropriate magic in the middle.

Anyway, what this means for each element is that it needs to be nicely
self-contained and rather simple to use.  Extra goop can be done out of
band, with various functions and possibly GUI controls and such.

So a short walkthrough of how the above would work...  main() creates the
connecting queue and starts the two threads.  Each thread creates its
internal pipes and elements, then starts into a loop doing stuff.  In this
case it's a simple linear non-controllable playout, so the thread1 loop
just calls the element1 function to do the read from disk and push the
data.  This function is called by pointer simply because its name won't
necessarily be known at compile time.  Eventually you'll be able to say
"give me an element to convert from A to B" and it'll give it to you.
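
The "give me an element to convert from A to B" lookup could be as simple
as a table search.  This is only a sketch: the table, the constructor
functions and the format names are hypothetical, and a real version would
presumably fill the table from loaded .so's rather than a static array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct element { const char *name; } element;
typedef element *(*element_ctor)(void);

/* One row per known converter: input format, output format, and a
 * constructor to call when somebody asks for it. */
typedef struct converter_entry {
  const char *from;
  const char *to;
  element_ctor create;
} converter_entry;

/* Hypothetical converters, just enough to populate the table. */
element yuv2rgb = { "yuv2rgb" };
element mpegdec = { "mpegdec" };
element *yuv2rgb_create(void) { return &yuv2rgb; }
element *mpegdec_create(void) { return &mpegdec; }

const converter_entry registry[] = {
  { "YUV",  "RGB", yuv2rgb_create },
  { "MPEG", "YUV", mpegdec_create },
};

/* Returns an element converting from -> to, or NULL if none is known. */
element *element_find(const char *from, const char *to) {
  size_t i;
  assert(from != NULL && to != NULL);
  for (i = 0; i < sizeof registry / sizeof registry[0]; i++) {
    if (strcmp(registry[i].from, from) == 0 &&
        strcmp(registry[i].to, to) == 0)
      return registry[i].create();
  }
  return NULL;
}
```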

As an aside, this could make it quite possible to have meta-elements that
are in fact sub-pipelines themselves.  Since the connections are basically
hidden and abstracted, you could shove whatever you want in there and no
one would notice outside the meta-element.

Anyway, it calls the chain function for that pipe, which in this case is a
pointer to the main function of element2, which frobs the data and calls
chain on its outpipe.  At this point the chain function actually calls to
the queue that was created by main().

In thread2(), the main loop sits there and tells the queue to pull
something and shove it into the output pipe.  The cycle continues, and
you end up in this case with something spewed to screen.  In this case the
display rate is not continuous, since it's not clocked.  In a more sane
case, you'd have a single thread that reads from the queue with more
finesse and displays it on a clock.


So, I was starting to move away from the idea of consciously separating
queues from lockstep pipes, when I realized something rather fundamental
to understanding this arch: queues are a special case.  Specifically,
queues exist to give the scheduler the opportunity to fix, without serious
consequences (skips, jumps, etc.), something that without queues it
would break.  Without queues and the opportunity to dampen jitter, the
scheduler would fail miserably.  Queues are put in place specifically to
allow the scheduler to help recover from this disaster.  Circular, but
true.


On buffers, there are three basic cases we have to consider.  The first is
the standard memory transfer, where you just malloc a memory region that
gets filled up by the element and passed on through the pipe/queue to be
read and freed by the next element.  The next is where the source
specifically owns the buffer, as in the case of a v4l2 source element.  In
this case, you have to pass the buffers around with specific information
about what to do once everyone has finished with it.

The last is when the destination owns it.  This could be the case with
XFree 4.0 and off-screen overlay double-buffering.  If you directly
connect a YUV to RGB element to the display, you may want to have only two
buffers available, and the colorspace conversion must wait for the next
available buffer (i.e. a page-flip to expire the old region) before doing
anything.  (in this case, the last thread holds only the conversion and
display elements, which are lockstep CPU-sippers, but are tightly clocked)
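
One way to cover all three cases with a single buffer type is to carry a
count of holders and a release callback in the buffer itself.  The sketch
below assumes that approach; the field names, the two-slot pool and the
fixed 16-byte regions are invented for illustration.  The pool stands in
for either a source-owned or a destination-owned set of buffers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct buffer buffer;
struct buffer {
  void *data;
  size_t size;
  int refcount;                  /* holders still using the buffer */
  void (*release)(buffer *buf);  /* what to do once everyone is done */
  void *owner;                   /* owning source/dest state, or NULL */
};

/* Case 1: plain malloc'd memory, freed by the last user. */
void heap_release(buffer *buf) {
  free(buf->data);
  free(buf);
}

buffer *buffer_new_heap(size_t size) {
  buffer *buf = malloc(sizeof *buf);
  if (!buf) return NULL;
  buf->data = malloc(size);
  if (!buf->data) { free(buf); return NULL; }
  buf->size = size;
  buf->refcount = 1;
  buf->release = heap_release;
  buf->owner = NULL;
  return buf;
}

/* Cases 2 and 3: the owner keeps a fixed set of buffers, and release
 * just marks the slot as available again. */
typedef struct pool {
  buffer slot[2];
  unsigned char mem[2][16];
  int in_use[2];
} pool;

void pool_init(pool *p) { memset(p, 0, sizeof *p); }

void pool_release(buffer *buf) {
  pool *p = buf->owner;
  p->in_use[buf - p->slot] = 0;
}

/* Returns a free slot, or NULL meaning the caller has to wait. */
buffer *pool_acquire(pool *p) {
  int i;
  for (i = 0; i < 2; i++) {
    if (!p->in_use[i]) {
      p->in_use[i] = 1;
      p->slot[i].data = p->mem[i];
      p->slot[i].size = sizeof p->mem[i];
      p->slot[i].refcount = 1;
      p->slot[i].release = pool_release;
      p->slot[i].owner = p;
      return &p->slot[i];
    }
  }
  return NULL;
}

void buffer_ref(buffer *buf) { buf->refcount++; }

void buffer_unref(buffer *buf) {
  assert(buf->refcount > 0);
  if (--buf->refcount == 0) buf->release(buf);
}
```

With only two slots, a third pool_acquire() fails until somebody drops
their last reference, which is the "wait for the page-flip" behaviour
described above in miniature.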


I'd like to follow the gtk object model to some extent, but with some
differences.  stream_connect() above will take any pipeline entity,
connecting the output of the first to the input of the second.  Since it
doesn't know what it's getting pointers to, the data referred to must say
what it is in the first set of bytes.  The caller would cast it to the
generic pipeline entity type, the function would switch between the
combinations that are supported, and make the connection.
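
A sketch of how that could look, with a common header as the first member
of every entity so a pointer to any entity can be cast to the generic
type.  The enum values, field names and the ENTITY() macro are
assumptions, as is the rule that only element-to-pipe and pipe-to-element
connections are supported.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ENTITY_PIPE = 1, ENTITY_ELEMENT = 2 } entity_type;

/* Generic header.  It has to be the first member of every entity
 * struct so the type ID really is in the first bytes. */
typedef struct entity {
  entity_type type;
  struct entity *src;   /* entity connected on the input side */
  struct entity *dest;  /* entity connected on the output side */
} entity;

typedef struct spipe   { entity base; int id; } spipe;
typedef struct element { entity base; const char *name; } element;

/* Cast any entity to the generic type. */
#define ENTITY(p) ((entity *)(p))

/* Connect the output of 'from' to the input of 'to'.
 * Returns 0 on success, -1 for an unsupported combination. */
int stream_connect(entity *from, entity *to) {
  if (from == NULL || to == NULL) return -1;

  switch (from->type) {
  case ENTITY_ELEMENT:
    if (to->type != ENTITY_PIPE) return -1;
    break;
  case ENTITY_PIPE:
    if (to->type != ENTITY_ELEMENT) return -1;
    break;
  default:
    return -1;  /* unknown ID in the first bytes */
  }

  from->dest = to;
  to->src = from;
  return 0;
}
```

A queue would be passed in as just another ENTITY_ELEMENT here, which is
how stream_connect(pipe2,queue1) in the thread example would fit.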

Essentially there are two types of entities: pipes and elements.  A queue
is a special-case element, and actually could be just another element,
since the input interface to an element is just a push function.  It'd
throw it on the queue, and a special function for elements of type queue
would pull something back off the queue.
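
To make the "queue is just another element" idea concrete, here is a
minimal single-threaded sketch.  The input side is an ordinary chain
function that stores the buffer, and queue_pull() is the special function
that takes one back off and pushes it downstream.  All names are
hypothetical, and the locking a real two-thread queue needs is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 4   /* arbitrary capacity for the sketch */

typedef struct buffer { int data; } buffer;
typedef void (*chain_fn)(void *self, buffer *buf);

typedef struct queue_element {
  chain_fn chain;           /* input: same push interface as any element */
  buffer *slot[QUEUE_CAP];
  size_t head, count;
  chain_fn out_chain;       /* output: downstream chain function */
  void *out_self;           /* output: downstream state */
} queue_element;

/* Input side: instead of processing the buffer, park it. */
void queue_chain(void *self, buffer *buf) {
  queue_element *q = self;
  assert(q->count < QUEUE_CAP);  /* a real queue would block or drop */
  q->slot[(q->head + q->count) % QUEUE_CAP] = buf;
  q->count++;
}

/* Output side: take the oldest buffer and push it downstream.
 * Returns 1 if a buffer was pushed, 0 if the queue was empty. */
int queue_pull(queue_element *q) {
  buffer *buf;
  if (q->count == 0) return 0;
  buf = q->slot[q->head];
  q->head = (q->head + 1) % QUEUE_CAP;
  q->count--;
  q->out_chain(q->out_self, buf);
  return 1;
}

void queue_element_init(queue_element *q, chain_fn out_chain,
                        void *out_self) {
  q->chain = queue_chain;
  q->head = 0;
  q->count = 0;
  q->out_chain = out_chain;
  q->out_self = out_self;
}

/* Hypothetical downstream element: sums what it receives. */
typedef struct collector { int sum; } collector;

void collector_chain(void *self, buffer *buf) {
  collector *c = self;
  c->sum += buf->data;
}
```

Upstream cannot tell the difference: it calls chain and returns.  Nothing
reaches the downstream element until the other side calls queue_pull().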

So the structure defining a pipe or element would have an ID at the top,
then generic information, then info specific to that entity.  Each struct
would have info for both the source and dest ends, which may be null
entirely, though would probably be at minimum a pointer to the entity
connected on the other end.  Pipes are a more complex case, since they can
be 1-to-N, and thus need a list of outputs.

In fact, duh, the pointer to the entity gives indirect access to that
entity's chaining function.  This does mean that cache-line effects should
be considered, because an indirect call means extra time.  However, it
makes replugging rather trivial.

Ideally, pipes will be passthroughs, where the chaining function pointer
is simply copied from the output entity.  That makes pipes not only
transparent to the elements, but zero-time as well.  They exist only to be
glue, *EXCEPT* when they're 1-to-N, in which case they must do mutability
fun.  Ick.
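
A sketch of both pipe modes, assuming the pipe exposes a chain pointer
plus the argument to pass with it.  With one output the pipe copies the
downstream pointers and costs nothing beyond the indirect call; with more
than one it swaps in a fan-out function.  The per-output copy is just one
naive way to handle the mutability problem, and since the copy lives on
the stack the outputs must not hold on to it.  All names are invented.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OUTPUTS 4

typedef struct buffer { int data; } buffer;
typedef void (*chain_fn)(void *self, buffer *buf);

typedef struct target { chain_fn chain; void *self; } target;

typedef struct spipe {
  chain_fn chain;            /* what upstream calls */
  void *self;                /* what upstream passes along with it */
  target out[MAX_OUTPUTS];   /* list of outputs for the 1-to-N case */
  size_t n_out;
} spipe;

void pipe_init(spipe *p) {
  p->chain = NULL;
  p->self = NULL;
  p->n_out = 0;
}

/* 1-to-N: every output gets its own copy of the buffer, so a frob in
 * one branch is never seen by another. */
void fanout_chain(void *self, buffer *buf) {
  spipe *p = self;
  size_t i;
  for (i = 0; i < p->n_out; i++) {
    buffer copy = *buf;
    p->out[i].chain(p->out[i].self, &copy);
  }
}

/* Returns 0 on success, -1 if the output list is full. */
int pipe_add_output(spipe *p, chain_fn chain, void *self) {
  if (p->n_out == MAX_OUTPUTS) return -1;
  p->out[p->n_out].chain = chain;
  p->out[p->n_out].self = self;
  p->n_out++;

  if (p->n_out == 1) {   /* passthrough: copy the output's pointers */
    p->chain = chain;
    p->self = self;
  } else {               /* 1-to-N: route through the fan-out */
    p->chain = fanout_chain;
    p->self = p;
  }
  return 0;
}

/* Hypothetical element that modifies the buffer it is given. */
typedef struct adder { int add; int seen; } adder;

void adder_chain(void *self, buffer *buf) {
  adder *a = self;
  buf->data += a->add;
  a->seen = buf->data;
}
```

Upstream always calls p.chain(p.self, buf) and never needs to know which
mode the pipe is in.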
