Comments on PL/SQL Challenge: Explorations into Pipelined Functions (9398)
Steven Feuerstein

Iudith (2011-11-04 17:22)

Hello All,

All these examples are very instructive indeed.

According to most descriptions of the pipelined function feature in various sources, we would indeed expect the function to keep working truly "in parallel" with the calling SELECT. The examples above suggest that this is NOT what happens, even for a non-autonomous pipelined function.

If such parallel work were really happening, we might wonder what would occur in a case where, for example, the pipelined function generates rows "very fast" while the loop processes them "very slowly" (due to some time-consuming processing). In such a case, the pipelined function would need to buffer all the rows it produces "ahead of the request moment", and this could go as far as buffering the entire collection, just as a non-pipelined function does.

What we could reasonably expect, though, is for the pipelined function to buffer a number of rows equal to the retrieval array size, that is, to "stop and wait for a new request" after producing the "n" rows for the next fetch. An additional nuance is that the fetch array size itself may be changed on the fly, for example if we perform "FETCH ... BULK COLLECT INTO ... LIMIT n" using different values for "n", which is of course an uncommon practice.

If we add some DML processing to this and make the function autonomous, then having such a function execute in parallel with the main processing could make the results effectively dependent on the array size, affecting not only performance but also the data content itself. That too is something to consider, though probably a scenario less often met in practice.

In summary, with such parallel work and buffering "up to one full array", we could benefit at least partially from having the rows produced in parallel with their consumption. If, instead, row production and consumption happen completely serially, as the examples above show, then we are left with the other benefits of pipelined functions: the streaming (chaining) of several such functions, the ease they offer for integrating PL/SQL processing within single SQL statements, and so on.

Maybe this blog can be the basis for a good question that Steven could be so kind as to raise with the PL/SQL development team at next week's meeting, for the benefit of all of us :) :)

Thanks & Best Regards,
Iudith
Chad (2011-11-04 15:54)

Part 3

-- Results returning 110 rows with the pipelined function in an autonomous transaction

value, row, date from "select", date from CURSOR FOR loop
-----------------------------------------------------------
2, 1, 11/04/2011 09:10:59, 11/04/2011 09:19:15
4, 2, 11/04/2011 09:11:04, 11/04/2011 09:19:16
6, 3, 11/04/2011 09:11:09, 11/04/2011 09:19:17
8, 4, 11/04/2011 09:11:14, 11/04/2011 09:19:18
...
196, 98, 11/04/2011 09:19:04, 11/04/2011 09:20:52
198, 99, 11/04/2011 09:19:09, 11/04/2011 09:20:53
200, 100, 11/04/2011 09:19:14, 11/04/2011 09:20:54
202, 101, 11/04/2011 09:20:59, 11/04/2011 09:21:45
204, 102, 11/04/2011 09:21:04, 11/04/2011 09:21:46
206, 103, 11/04/2011 09:21:09, 11/04/2011 09:21:47
208, 104, 11/04/2011 09:21:14, 11/04/2011 09:21:48
210, 105, 11/04/2011 09:21:19, 11/04/2011 09:21:49
212, 106, 11/04/2011 09:21:24, 11/04/2011 09:21:50
214, 107, 11/04/2011 09:21:29, 11/04/2011 09:21:51
216, 108, 11/04/2011 09:21:34, 11/04/2011 09:21:52
218, 109, 11/04/2011 09:21:39, 11/04/2011 09:21:53
220, 110, 11/04/2011 09:21:44, 11/04/2011 09:21:54

PL/SQL procedure successfully completed.

Elapsed: 00:11:00.03

-- Remove the UPDATE/COMMIT from the pipelined function and remove the autonomous transaction setting.
-- Similar results are seen when returning 110 rows from a non-autonomous version of the pipelined function.

2, 1, 11/04/2011 10:01:43, 11/04/2011 10:09:59
4, 2, 11/04/2011 10:01:48, 11/04/2011 10:10:00
6, 3, 11/04/2011 10:01:53, 11/04/2011 10:10:01
8, 4, 11/04/2011 10:01:58, 11/04/2011 10:10:02
...
196, 98, 11/04/2011 10:09:48, 11/04/2011 10:11:36
198, 99, 11/04/2011 10:09:53, 11/04/2011 10:11:37
200, 100, 11/04/2011 10:09:58, 11/04/2011 10:11:38
202, 101, 11/04/2011 10:11:43, 11/04/2011 10:12:29
204, 102, 11/04/2011 10:11:48, 11/04/2011 10:12:30
206, 103, 11/04/2011 10:11:53, 11/04/2011 10:12:31
208, 104, 11/04/2011 10:11:58, 11/04/2011 10:12:32
210, 105, 11/04/2011 10:12:03, 11/04/2011 10:12:33
212, 106, 11/04/2011 10:12:08, 11/04/2011 10:12:34
214, 107, 11/04/2011 10:12:13, 11/04/2011 10:12:35
216, 108, 11/04/2011 10:12:18, 11/04/2011 10:12:36
218, 109, 11/04/2011 10:12:23, 11/04/2011 10:12:37
220, 110, 11/04/2011 10:12:28, 11/04/2011 10:12:38

PL/SQL procedure successfully completed.

It would seem that all interaction with the pipelined function is serial in nature, whether or not it runs in an autonomous transaction. It only begins processing/generating the next row when the invoker requests it.

Thanks,
Chad
Chad (2011-11-04 15:53)

Make that a 3 part response.

Part 2

Looking further and constructing an example where the number of rows returned by the pipelined function is greater than 100, it appears that point (1) may be false, at least in my understanding of what it says and how it relates to the autonomous transaction nature of this particular pipelined function (it would appear to indicate that the pipelined function continues operating in parallel with the invoker, generating the next value while the invoker works with the returned data). If my understanding of (1) is not correct, the following can be ignored.

As seen in the examples with 3 rows, the 2-second delay in the pipelined function and the 3-second delay in the get_sysdate function execute serially. If the example is extended to return 110 rows, one can see that the first 100 values from the pipelined function are materialized in accordance with the array optimization of the CURSOR FOR loop. There is a 5-second interval per row as they are returned, indicated by the times 9:10:59 - 9:19:14 for rows 1-100, and then a one-second interval for the CURSOR FOR loop processing of these same 100 rows (9:19:15 - 9:20:54).

If the pipelined function were continuing its work (in parallel) when control passed to the CURSOR FOR loop after 100 rows, the time gap would not remain 5 seconds for rows 100+. At minimum, the gap between the CURSOR FOR loop timestamp of row 100 and the date from the 'select' for row 101 would be 3 seconds rather than the observed 5 seconds (9:20:54 - 9:20:59), since the pipelined function's 2 seconds of work would have occurred in parallel while the CURSOR FOR loop operated on the first 100 rows.

It would appear that the autonomous transaction nature of this particular example enforces serialization for all interaction with the pipelined function.
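For readers trying to reproduce timings like Chad's, here is a minimal sketch of the shape of autonomous pipelined function under discussion. All names here are hypothetical (the quiz's actual code does not appear in this thread); only the structure (DML, COMMIT, delay, PIPE ROW) matters:

```sql
CREATE OR REPLACE TYPE plch_numbers_t AS TABLE OF NUMBER;
/

CREATE OR REPLACE FUNCTION plch_double_values (p_rows IN PLS_INTEGER)
   RETURN plch_numbers_t
   PIPELINED
IS
   PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
   FOR i IN 1 .. p_rows
   LOOP
      UPDATE plch_log SET hits = hits + 1;   -- hypothetical DML

      COMMIT;   -- an autonomous transaction must commit (or roll back)
                -- before control can return to the caller via PIPE ROW

      DBMS_LOCK.sleep (2);                   -- the 2-second delay per row

      PIPE ROW (i * 2);
   END LOOP;

   RETURN;
END plch_double_values;
/
```

In Chad's figures the remaining 3 seconds per row come from the get_sysdate function called in the SELECT list, not from the pipelined function itself.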
Chad (2011-11-04 15:50)

I'm posting a 2 part response.

Chad

Part 1

_Nikotin - good point. I had the exact same thought after posting, realizing that the small test case (and the quiz question) did not cross the default array optimization barrier of 100. I knew I had to look into this the next work day.

Kim Berg Hansen - I agree with your analysis. However, I can see the wording of "need not" in its context as meaning one of two things: 1) either/or, as in materialize all rows at once or materialize only 1 row at a time, or 2) something potentially other than the stated "materialize all rows", which could mean 1 row at a time or any value >= 1, potentially as high as all rows. Your analysis uses meaning 2, while I was looking at meaning 1. The default array optimization of the cursor loop materializes 100 rows at a time.

All - thanks for the info on array sizing, SQL*Plus parameters, etc. I was not able to cover all of these issues in my initial example and perhaps could have been clearer. These options and the contributed examples definitely show the impact of these settings on observed behavior. As with most of the PL/SQL Blog topics, the further one investigates, the more difficult it is to address all the issues from all angles in a succinct manner.

As such, it would seem that point (2) could be true as stated in some cases, but there could also be cases where it is false, as the back-and-forth communication for each row returned by the pipelined function could slow things down versus returning all rows with a single return. As stated by others, many times that is simply not possible, as the data set involved is just too large to materialize completely. Also, point (3) could be true, in that less PGA is used, but it may also be false if the data volume is small (though that would likely not be a realistic scenario, or something to worry about, if materializing fewer than 100 total rows).

Steven Feuerstein (2011-11-04 14:27)

From Niels Hecker - Part 3 (sorry, Niels, for spelling your name wrong in previous posts!):

And that's one of the reasons why pipelined table functions were introduced. They are really useful for ETL (extract - transform - load) processing. You can, for example, nest the calls to pipelined table functions, e.g. (changed a little bit from the documentation):

SELECT *
  FROM TABLE( PipeFunc_Outer( CURSOR(
         SELECT *
           FROM TABLE( PipeFunc_Inner()))));

where PipeFunc_Outer() can process each record from PipeFunc_Inner() immediately as it is piped out, avoiding the buffering of the complete output of PipeFunc_Inner().

The only question is: why can't we just say

SELECT *
  FROM TABLE( PipeFunc_Outer( PipeFunc_Inner()));

But that's a question for Steven (I have never tried it; perhaps it is possible).

Greetings, Niels
Steven Feuerstein (2011-11-04 14:26)

From Neils Hecker - Part 2

To show this internal processing I have changed the code a little bit more. The first execution uses a CURSOR FOR loop:

DECLARE
   CURSOR cData IS
      SELECT a.column_value, rownum, get_sysdate date_val
        FROM TABLE (plch_pipeline.double_values (
               CURSOR (SELECT line
                         FROM all_source
                        WHERE (owner = 'SYS')
                          AND (name = 'STANDARD')
                          AND (type = 'PACKAGE')
                          AND (line <= 250)
                        ORDER BY line))) a;
BEGIN
   FOR rec IN cData LOOP
      dbms_output.put_line(
         rec.column_value || ', ' || rec.rownum || ', ' ||
         rec.date_val || ', ' ||
         to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
   END LOOP;
END;
/

2, 1, 11/04/2011 10:50:27, 11/04/2011 10:53:45
4, 2, 11/04/2011 10:50:29, 11/04/2011 10:53:45
.....
198, 99, 11/04/2011 10:53:43, 11/04/2011 10:53:45
200, 100, 11/04/2011 10:53:45, 11/04/2011 10:53:45
202, 101, 11/04/2011 10:53:47, 11/04/2011 10:57:05
204, 102, 11/04/2011 10:53:49, 11/04/2011 10:57:05
.....
398, 199, 11/04/2011 10:57:03, 11/04/2011 10:57:05
400, 200, 11/04/2011 10:57:05, 11/04/2011 10:57:05
402, 201, 11/04/2011 10:57:07, 11/04/2011 10:58:45
404, 202, 11/04/2011 10:57:09, 11/04/2011 10:58:45
.....
498, 249, 11/04/2011 10:58:43, 11/04/2011 10:58:45
500, 250, 11/04/2011 10:58:45, 11/04/2011 10:58:45

As you can see, the internal bulk processing buffers 100 rows at a time and then processes them (the loop timestamps change only after rows 100 and 200).

The second code fragment is changed accordingly to fetch one row at a time:

DECLARE
   CURSOR cData IS
      SELECT a.column_value, rownum, get_sysdate date_val
        FROM TABLE (plch_pipeline.double_values (
               CURSOR (SELECT line
                         FROM all_source
                        WHERE (owner = 'SYS')
                          AND (name = 'STANDARD')
                          AND (type = 'PACKAGE')
                          AND (line <= 250)
                        ORDER BY line))) a;

   rec cData%ROWTYPE;
BEGIN
   OPEN cData;
   LOOP
      FETCH cData INTO rec;
      EXIT WHEN cData%NOTFOUND;

      dbms_output.put_line(
         rec.column_value || ', ' || rec.rownum || ', ' ||
         rec.date_val || ', ' ||
         to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
   END LOOP;
END;
/

2, 1, 11/04/2011 10:50:30, 11/04/2011 10:50:30
4, 2, 11/04/2011 10:50:32, 11/04/2011 10:50:32
.....
198, 99, 11/04/2011 10:53:46, 11/04/2011 10:53:46
200, 100, 11/04/2011 10:53:48, 11/04/2011 10:53:48
202, 101, 11/04/2011 10:53:50, 11/04/2011 10:53:50
204, 102, 11/04/2011 10:53:52, 11/04/2011 10:53:52
.....
398, 199, 11/04/2011 10:57:06, 11/04/2011 10:57:06
400, 200, 11/04/2011 10:57:08, 11/04/2011 10:57:08
402, 201, 11/04/2011 10:57:10, 11/04/2011 10:57:10
404, 202, 11/04/2011 10:57:12, 11/04/2011 10:57:12
.....
498, 249, 11/04/2011 10:58:46, 11/04/2011 10:58:46
500, 250, 11/04/2011 10:58:48, 11/04/2011 10:58:48

In the output you can easily see that each row is processed immediately when it is generated.

Steven Feuerstein (2011-11-04 14:24)

I am going to post a multi-part response from Neils Hecker. Unfortunately, Blogger is very annoying regarding the maximum length and formatting of posts. So, a heads up: we plan to move commentary into the Challenge website itself in the next release. No more blog!

From Neils - Part 1

Hi folks,

I have changed the example a little bit so that it uses a cursor.
In the first example you can see the same thing that Chad showed us before:

DECLARE
   CURSOR cData IS
      SELECT a.column_value, rownum, get_sysdate date_val
        FROM TABLE (plch_pipeline.double_values (
               CURSOR (SELECT line
                         FROM user_source
                        WHERE (name = 'PLCH_PIPELINE')
                          AND (type = 'PACKAGE')
                          AND (line <= 3)
                        ORDER BY line))) a;
BEGIN
   FOR rec IN cData LOOP
      dbms_lock.sleep(1);
      dbms_output.put_line(
         rec.column_value || ', ' || rec.rownum || ', ' ||
         rec.date_val || ', ' ||
         to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
   END LOOP;
END;
/

2, 1, 11/04/2011 10:42:25, 11/04/2011 10:42:30
4, 2, 11/04/2011 10:42:27, 11/04/2011 10:42:31
6, 3, 11/04/2011 10:42:29, 11/04/2011 10:42:32

If you instead use an explicit cursor loop, all is fine and you can see that each row is handed back to PL/SQL as it is generated:

DECLARE
   CURSOR cData IS
      SELECT a.column_value, rownum, get_sysdate date_val
        FROM TABLE (plch_pipeline.double_values (
               CURSOR (SELECT line
                         FROM user_source
                        WHERE (name = 'PLCH_PIPELINE')
                          AND (type = 'PACKAGE')
                          AND (line <= 3)
                        ORDER BY line))) a;

   rec cData%ROWTYPE;
BEGIN
   OPEN cData;
   LOOP
      FETCH cData INTO rec;
      EXIT WHEN cData%NOTFOUND;

      dbms_lock.sleep(1);
      dbms_output.put_line(
         rec.column_value || ', ' || rec.rownum || ', ' ||
         rec.date_val || ', ' ||
         to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
   END LOOP;
END;
/

2, 1, 11/04/2011 10:42:59, 11/04/2011 10:43:00
4, 2, 11/04/2011 10:43:02, 11/04/2011 10:43:03
6, 3, 11/04/2011 10:43:05, 11/04/2011 10:43:06

The difference in processing lies in the PL/SQL optimizer, which internally generates a BULK COLLECT INTO if you use a FOR loop.
Iudith (2011-11-04 13:56)

Hello All,

In addition to setting plsql_optimize_level = 0, the same test can be tried using an explicit cursor and fetching the rows one by one instead of the FOR loop, this way coming closer to a real one-to-one "dialogue" between the calling statement and the pipelined function.

Or, if testing the SELECT directly in SQL*Plus, use SET ARRAYSIZE 1. The default array size is 15, which is higher than our 3 rows returned.

I guess that when fetching row by row the results will be a little different; I will try it when I am back at my database. I don't expect this to depend on whether or not the pipelined function performs DML (and thus must be defined as autonomous).

Anyway, the essential lesson of this quiz is that a pipelined function performing DML operations alternating with PIPE ROW statements is effectively "broken" into several transactions, because each return of control to the calling statement requires a terminating COMMIT; so, in fact, our autonomous transaction becomes "several autonomous transactions". We should of course be aware of the data consistency implications of such a technical requirement.

I can confirm from my own experience that PGA memory usage decreases significantly when using a pipelined function versus a non-pipelined table function, especially with very large collections.

The pipelined functions topic is a very interesting one. I remember how enthusiastic I was about them when they helped me solve a tough problem of creating custom views to retrieve data from a database used by SAP, which sometimes uses some "exotic modes" of storing data, even in an Oracle database.

Thanks & Best Regards,
Iudith

_Nikotin (2011-11-04 11:28)

It's simple array optimization for the CURSOR FOR loop in PL/SQL. You can set plsql_optimize_level = 0 to disable it.

And you can also use SYSDATE directly in PL/SQL; there is no need to SELECT it INTO a variable FROM dual.
Kim Berg Hansen (2011-11-04 07:40)

Interesting points.

But notice a couple of operative words:

(2) ... collection *need not* be constructed and returned ...

(3) ... the object cache *need not* materialize ...

The pipelined function *need not* construct everything in one go. But of course the database *may* have to anyway, depending on how and where the function is used, or on how the client fetches the data. That the SQL*Plus prompt does not show each row as it is created is, I suspect, array fetching kicking in.

So the point is: if you create a non-pipelined function returning a collection, you *will* construct the entire thing. If you make it pipelined, that sometimes won't help much, but often will help a lot, depending on how and where it is used.
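As a postscript, Iudith's point about the fetch array size can be made concrete with an explicit BULK COLLECT ... LIMIT: the LIMIT sets how many rows the pipelined function must produce before control returns to the caller. A sketch, using a hypothetical pipelined function plch_double_values that returns a collection of numbers:

```sql
DECLARE
   CURSOR c IS
      SELECT column_value
        FROM TABLE (plch_double_values (10));   -- hypothetical function

   TYPE numbers_aat IS TABLE OF NUMBER
      INDEX BY PLS_INTEGER;

   l_batch numbers_aat;
BEGIN
   OPEN c;

   LOOP
      -- The pipelined function runs only until 3 rows have been piped,
      -- then waits until the next FETCH requests another batch.
      FETCH c BULK COLLECT INTO l_batch LIMIT 3;

      FOR i IN 1 .. l_batch.COUNT
      LOOP
         DBMS_OUTPUT.put_line (l_batch (i));
      END LOOP;

      EXIT WHEN c%NOTFOUND;
   END LOOP;

   CLOSE c;
END;
/
```

With LIMIT 1 this degenerates into the row-by-row "dialogue" Iudith describes; with a large LIMIT it approaches the batching behavior of the CURSOR FOR loop's implicit 100-row BULK COLLECT.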