(1) A pipelined table function returns a row to its invoker (a query) with the PIPE ROW statement and then continues to process rows.
(2) Pipelined functions can help improve response time because the entire collection need not be constructed and returned to the server before the query can work with and return a single result row.
(3) In addition, the function consumes less Program Global Area (PGA) memory, because the object cache need not materialize the entire collection.
I post his comments below for all to consider and discuss:
If one modifies the plch_pipeline package body by adding "dbms_lock.sleep(2);" just before the UPDATE statement, introducing a 2 second delay between rows returned from the pipelined function, and creates the get_sysdate function below to return the current SYSDATE accurate to the second, one can see that the pipelined function does indeed return rows to the calling query at 2 second intervals, as the output shows. This result does not directly counter (2) or (3).
However, the results of the query are returned all at once rather than one row every two seconds (think ALL_ROWS vs FIRST_ROWS), which seems to indicate some buffering/materializing of the rows, due to the autonomous transaction / pipelined function, before control is passed fully to the calling SQL statement. If the calling SQL statement were completely free to work with the individual rows returned from the pipelined autonomous transaction (the differing get_sysdate values seem to indicate partial control at the calling SQL), wouldn't the rows have been returned to the SQL*Plus prompt at 2 second intervals, similar to the FIRST_ROWS hint? This behavior would seem to counter (2) and (3).
If I put the query in a loop in a PL/SQL block, I can see that the loop does not execute until all rows from the pipelined function have been returned. Thus, the rows returned by the pipelined function are being buffered somewhere. Would this not be in the PGA? In effect, isn't the entire collection being materialized before control is returned to the calling program? Does this counter (2) and (3), since the loop does not work with the first row when it is returned, but instead waits for all rows to be materialized before it starts executing?
This behavior reinforces my understanding of autonomous transactions: although it operates outside of the calling transaction, the autonomous transaction must complete fully before control is passed back to the calling transaction. Thus, any data returned by a pipelined function running as an autonomous transaction must in fact be fully materialized before control is passed back. Whether this occurs one row at a time, with a back and forth interaction between the calling transaction and the autonomous transaction, or only once all rows are computed by the autonomous transaction, a continually growing memory structure would be created. The manner in which the rows are returned would also seem to impact performance, with the back and forth interaction seemingly slower than returning everything at once.
A more in-depth example could be constructed to determine where the pipelined rows are being materialized before the calling transaction continues execution.
CREATE OR REPLACE PACKAGE BODY plch_pipeline
IS
   FUNCTION double_values (dataset refcur_t)
      RETURN numbers_t PIPELINED
   IS
      PRAGMA AUTONOMOUS_TRANSACTION;
      l_number NUMBER;
   BEGIN
      LOOP
         FETCH dataset INTO l_number;
         EXIT WHEN dataset%NOTFOUND;
         dbms_lock.sleep(2); -- add sleep of 2 seconds
         UPDATE plch_parts SET partnum = partnum;
         COMMIT;
         PIPE ROW (l_number * 2);
      END LOOP;
      CLOSE dataset;
      RETURN;
   END;
END plch_pipeline;
/

-- generate the current sysdate to the second
create or replace function get_sysdate return varchar2
is
   ls_date varchar2(20);
begin
   select to_char(sysdate,'mm/dd/yyyy hh24:mi:ss')
     into ls_date
     from dual;
   return ls_date;
end;
/

col get_sysdate format a20

SELECT a.column_value, rownum, get_sysdate
  FROM TABLE (plch_pipeline.double_values (
          CURSOR (SELECT line
                    FROM user_source
                   WHERE name = 'PLCH_PIPELINE'
                     AND type = 'PACKAGE'
                     AND line <= 3
                   ORDER BY line))) a
/

Can verify rows are being returned by the autonomous transaction at 2 second intervals. However, the results are all returned at the same time to the SQL*Plus prompt and not as they receive their get_sysdate value.
COLUMN_VALUE     ROWNUM GET_SYSDATE
------------ ---------- --------------------
           2          1 11/03/2011 10:33:10
           4          2 11/03/2011 10:33:12
           6          3 11/03/2011 10:33:14

begin
   for rec in (SELECT a.column_value, rownum, get_sysdate date_val
                 FROM TABLE (plch_pipeline.double_values (
                         CURSOR (SELECT line
                                   FROM user_source
                                  WHERE name = 'PLCH_PIPELINE'
                                    AND type = 'PACKAGE'
                                    AND line <= 3
                                  ORDER BY line))) a)
   loop
      dbms_lock.sleep(1);
      dbms_output.put_line(
         rec.column_value || ', ' || rec.rownum || ', ' ||
         rec.date_val || ', ' ||
         to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
   end loop;
end;
/

The timestamp generated/printed within the loop has a value after the timestamp generated from the pipelined function, showing that the data from the pipelined autonomous transaction function is being materialized prior to the loop starting execution.
2, 1, 11/03/2011 11:21:31, 11/03/2011 11:21:36
4, 2, 11/03/2011 11:21:33, 11/03/2011 11:21:37
6, 3, 11/03/2011 11:21:35, 11/03/2011 11:21:38

In addition, if I modify the get_sysdate function to introduce a delay of 3 seconds, one can see from the output that there is a delay of 5 seconds between each row returned from the pipelined function (2 seconds within the pipelined function and 3 seconds from the get_sysdate call), indicating that the entire operation is running in a strictly serial manner: (1) get a value from the pipelined function, (2) make the call to get_sysdate for that row in the calling query; repeat these two steps for each of the three rows returned by the pipelined function.
create or replace function get_sysdate return varchar2
is
   ls_date varchar2(20);
begin
   dbms_lock.sleep(3);
   select to_char(sysdate,'mm/dd/yyyy hh24:mi:ss')
     into ls_date
     from dual;
   return ls_date;
end;
/

SELECT a.column_value, rownum, get_sysdate
  FROM TABLE (plch_pipeline.double_values (
          CURSOR (SELECT line
                    FROM user_source
                   WHERE name = 'PLCH_PIPELINE'
                     AND type = 'PACKAGE'
                     AND line <= 3
                   ORDER BY line))) a
/

COLUMN_VALUE     ROWNUM GET_SYSDATE
------------ ---------- --------------------
           2          1 11/03/2011 13:47:54
           4          2 11/03/2011 13:47:59
           6          3 11/03/2011 13:48:04

The same serial behavior is seen in the PL/SQL block, with all rows from the pipelined function materialized in a serial manner before the cursor loop executes.
begin
   for rec in (SELECT a.column_value, rownum, get_sysdate date_val
                 FROM TABLE (plch_pipeline.double_values (
                         CURSOR (SELECT line
                                   FROM user_source
                                  WHERE name = 'PLCH_PIPELINE'
                                    AND type = 'PACKAGE'
                                    AND line <= 3
                                  ORDER BY line))) a)
   loop
      dbms_lock.sleep(1);
      dbms_output.put_line(
         rec.column_value || ', ' || rec.rownum || ', ' ||
         rec.date_val || ', ' ||
         to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
   end loop;
end;
/

2, 1, 11/03/2011 13:51:45, 11/03/2011 13:51:56
4, 2, 11/03/2011 13:51:50, 11/03/2011 13:51:57
6, 3, 11/03/2011 13:51:55, 11/03/2011 13:51:58
Interesting points.
But notice a couple of operative words:
(2) ... collection *need not* be constructed and returned ...
(3) ... the object cache *need not* materialize ...
The pipelined function *need not* construct everything in one go. But of course the database *may* have to anyway, depending on how and where the function is used, or on how the client fetches the data. That the SQL*Plus prompt does not show each row as it is created is, I suspect, array fetching kicking in.
So the point is: if you create a non-pipelined function returning a collection, you *will* construct the entire thing; if you make it pipelined, it sometimes won't help much, but often will help a lot, depending on how and where it is used.
It's the simple array optimization of the CURSOR FOR loop in PL/SQL. You can set plsql_optimize_level = 0 to disable it.
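For anyone wanting to try this, a minimal sketch, assuming a SQL*Plus session (the procedure name is a placeholder):

-- At PLSQL_OPTIMIZE_LEVEL 0 (or 1) the compiler does not rewrite a
-- cursor FOR loop into an implicit array fetch, so each iteration
-- fetches a single row.
ALTER SESSION SET plsql_optimize_level = 0;

-- Anonymous blocks compiled after this point pick up the session
-- setting; stored units must be recompiled to use it, e.g.:
ALTER PROCEDURE some_proc COMPILE;   -- "some_proc" is a placeholder name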
And you can also reference SYSDATE directly in PL/SQL; the SELECT ... INTO ... FROM dual is not necessary.
Hello All,
In addition to setting plsql_optimize_level = 0, the same test can be tried using an explicit cursor and fetching the rows one by one instead of the FOR loop, this way being closer to a real one-to-one "dialogue" between the calling statement and the pipelined function.
Or, if directly testing the SELECT in SQL*PLUS, using SET ARRAYSIZE 1.
The default array size is 15, which is higher than our 3 rows returned.
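A minimal sketch of that SQL*Plus test (the query is the same one used above; row-by-row display is still subject to any client-side buffering):

-- Fetch one row per round trip instead of the default 15, so rows
-- should appear as the pipelined function produces them.
SET ARRAYSIZE 1
SELECT a.column_value, rownum, get_sysdate
  FROM TABLE (plch_pipeline.double_values (
          CURSOR (SELECT line
                    FROM user_source
                   WHERE name = 'PLCH_PIPELINE'
                     AND type = 'PACKAGE'
                     AND line <= 3
                   ORDER BY line))) a;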
I suspect that when fetching row by row the results will be a little bit different; I will try it when I am back at my database.
I don't expect this to depend on whether the pipelined function does perform DML (and thus should be defined as autonomous) or not.
Anyway, the essential lesson of this quiz is that a pipelined function performing DML operations alternating with PIPE ROW statements is effectively "broken" into several transactions: each return of control to the calling statement requires its terminating COMMIT, so, in fact, our one autonomous transaction becomes several autonomous transactions.
We should of course be aware of the data consistency implications of such a technical requirement.
I can confirm from my own experience that the PGA memory usage decreases significantly
when using a pipelined function versus a non-pipelined table function, especially with very large collections.
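One way to check this on your own system is to compare the session PGA statistics before and after running each variant. A sketch (querying the v$ views requires the appropriate privileges):

-- Current and peak PGA memory for the current session
SELECT sn.name, ms.value
  FROM v$mystat ms
       JOIN v$statname sn ON sn.statistic# = ms.statistic#
 WHERE sn.name IN ('session pga memory', 'session pga memory max');

Run it once before and once after each version of the function; the difference in 'session pga memory max' shows how much memory the collection cost.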
The pipelined functions topic is a very interesting one. I remember how enthusiastic I was about them when they helped me solve a tough problem of creating custom views to retrieve data from a database used by SAP, which sometimes uses some "exotic" modes of storing data, even in an Oracle database.
Thanks & Best Regards,
Iudith
I am going to post a multi-part response from Neils Hecker. Unfortunately, Blogger is very annoying regarding max length of posts and formatting. So, a heads up: we plan to move commentary into the Challenge website itself in the next release. No more blog!
From Neils - Part 1
Hi folks,
I have changed the example a little bit so that it uses a cursor. In the first example you could see the same thing that Chad showed us before:
DECLARE
CURSOR cData IS
SELECT a.column_value, rownum, get_sysdate date_val
FROM TABLE (plch_pipeline.double_values (
CURSOR (SELECT line
FROM user_source
WHERE (name = 'PLCH_PIPELINE')
AND (type = 'PACKAGE')
AND (line <= 3)
ORDER BY line))) a;
BEGIN
FOR rec IN cData LOOP
dbms_lock.sleep(1);
dbms_output.put_line(
rec.column_value || ', ' || rec.rownum || ', ' ||
rec.date_val || ', ' ||
to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
END LOOP;
END;
/
2, 1, 11/04/2011 10:42:25, 11/04/2011 10:42:30
4, 2, 11/04/2011 10:42:27, 11/04/2011 10:42:31
6, 3, 11/04/2011 10:42:29, 11/04/2011 10:42:32
If you instead use an explicit cursor loop, all is fine and you can see that each row is given back to PL/SQL as it is generated:
DECLARE
CURSOR cData IS
SELECT a.column_value, rownum, get_sysdate date_val
FROM TABLE (plch_pipeline.double_values (
CURSOR (SELECT line
FROM user_source
WHERE (name = 'PLCH_PIPELINE')
AND (type = 'PACKAGE')
AND (line <= 3)
ORDER BY line))) a;
rec cData%ROWTYPE;
BEGIN
OPEN cData;
LOOP
FETCH cData INTO rec;
EXIT WHEN cData%NOTFOUND;
dbms_lock.sleep(1);
dbms_output.put_line(
rec.column_value || ', ' || rec.rownum || ', ' ||
rec.date_val || ', ' ||
to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
END LOOP;
CLOSE cData;
END;
/
2, 1, 11/04/2011 10:42:59, 11/04/2011 10:43:00
4, 2, 11/04/2011 10:43:02, 11/04/2011 10:43:03
6, 3, 11/04/2011 10:43:05, 11/04/2011 10:43:06
The difference in the processing lies in the PL/SQL optimizer, which internally generates a BULK COLLECT INTO (fetching 100 rows at a time) if you use a FOR loop.
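The transformation the optimizer performs can be sketched roughly as the following hand-written equivalent (an illustration of the idea, not the compiler's actual output; the cursor query is elided):

DECLARE
   CURSOR cData IS SELECT ... ;          -- same query as in the FOR loop
   TYPE tRows IS TABLE OF cData%ROWTYPE;
   lRows tRows;
BEGIN
   OPEN cData;
   LOOP
      -- the whole 100-row batch is fetched before the body runs once
      FETCH cData BULK COLLECT INTO lRows LIMIT 100;
      FOR i IN 1 .. lRows.COUNT LOOP
         NULL;  -- the original loop body runs here, per buffered row
      END LOOP;
      EXIT WHEN cData%NOTFOUND;
   END LOOP;
   CLOSE cData;
END;

This is why the loop body only starts once an entire batch has been materialized.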
From Neils Hecker - Part 2
To show this internal processing I have changed the code a little bit more. The first execution uses a cursor-loop:
DECLARE
CURSOR cData IS
SELECT a.column_value, rownum, get_sysdate date_val
FROM TABLE (plch_pipeline.double_values (
CURSOR (SELECT line
FROM all_source
WHERE (owner = 'SYS')
AND (name = 'STANDARD')
AND (type = 'PACKAGE')
AND (line <= 250)
ORDER BY line))) a;
BEGIN
FOR rec IN cData LOOP
dbms_output.put_line(
rec.column_value || ', ' || rec.rownum || ', ' ||
rec.date_val || ', ' ||
to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
END LOOP;
END;
/
2, 1, 11/04/2011 10:50:27, 11/04/2011 10:53:45
4, 2, 11/04/2011 10:50:29, 11/04/2011 10:53:45
.....
198, 99, 11/04/2011 10:53:43, 11/04/2011 10:53:45
200, 100, 11/04/2011 10:53:45, 11/04/2011 10:53:45
202, 101, 11/04/2011 10:53:47, 11/04/2011 10:57:05
204, 102, 11/04/2011 10:53:49, 11/04/2011 10:57:05
.....
398, 199, 11/04/2011 10:57:03, 11/04/2011 10:57:05
400, 200, 11/04/2011 10:57:05, 11/04/2011 10:57:05
402, 201, 11/04/2011 10:57:07, 11/04/2011 10:58:45
404, 202, 11/04/2011 10:57:09, 11/04/2011 10:58:45
.....
498, 249, 11/04/2011 10:58:43, 11/04/2011 10:58:45
500, 250, 11/04/2011 10:58:45, 11/04/2011 10:58:45
As you can see from the loop timestamps, which advance only once per 100 rows, the internal bulk processing buffers 100 rows at a time and then processes them.
The second code fragment is changed accordingly:
DECLARE
CURSOR cData IS
SELECT a.column_value, rownum, get_sysdate date_val
FROM TABLE (plch_pipeline.double_values (
CURSOR (SELECT line
FROM all_source
WHERE (owner = 'SYS')
AND (name = 'STANDARD')
AND (type = 'PACKAGE')
AND (line <= 250)
ORDER BY line))) a;
rec cData%ROWTYPE;
BEGIN
OPEN cData;
LOOP
FETCH cData INTO rec;
EXIT WHEN cData%NOTFOUND;
dbms_output.put_line(
rec.column_value || ', ' || rec.rownum || ', ' ||
rec.date_val || ', ' ||
to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
END LOOP;
CLOSE cData;
END;
/
2, 1, 11/04/2011 10:50:30, 11/04/2011 10:50:30
4, 2, 11/04/2011 10:50:32, 11/04/2011 10:50:32
.....
198, 99, 11/04/2011 10:53:46, 11/04/2011 10:53:46
200, 100, 11/04/2011 10:53:48, 11/04/2011 10:53:48
202, 101, 11/04/2011 10:53:50, 11/04/2011 10:53:50
204, 102, 11/04/2011 10:53:52, 11/04/2011 10:53:52
.....
398, 199, 11/04/2011 10:57:06, 11/04/2011 10:57:06
400, 200, 11/04/2011 10:57:08, 11/04/2011 10:57:08
402, 201, 11/04/2011 10:57:10, 11/04/2011 10:57:10
404, 202, 11/04/2011 10:57:12, 11/04/2011 10:57:12
.....
498, 249, 11/04/2011 10:58:46, 11/04/2011 10:58:46
500, 250, 11/04/2011 10:58:48, 11/04/2011 10:58:48
In the output you can easily see that each row is processed immediately when it is generated.
From Niels Hecker - Part 3 (sorry, Niels, for spelling your name wrong in previous posts!):
And that's one of the reasons why pipelined table functions were introduced. They are really useful for ETL (extract - transform - load) processing. You can, for example, nest the calls of pipelined table functions, e.g. (changed a little bit from the documentation):
SELECT *
FROM TABLE( PipeFunc_Outer( CURSOR(
SELECT *
FROM TABLE( PipeFunc_Inner()))));
where PipeFunc_Outer() can process each record from PipeFunc_Inner() immediately when it is piped away to avoid the buffering of the complete output of PipeFunc_Inner().
The only question is - why can't we just say:
SELECT *
FROM TABLE( PipeFunc_Outer( PipeFunc_Inner()));
But that's a question for Steven (I have never tried it; perhaps it is possible).
Greetings, Niels
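A hedged guess at Niels's question: the outer function's parameter is presumably a REF CURSOR, while TABLE (PipeFunc_Inner()) is a row source rather than a REF CURSOR, so the CURSOR expression is needed to bridge the two. The signatures involved would look something like this (illustrative declarations only, not from the post):

CREATE OR REPLACE FUNCTION PipeFunc_Inner
   RETURN numbers_t PIPELINED;                 -- produces rows

CREATE OR REPLACE FUNCTION PipeFunc_Outer (dataset refcur_t)
   RETURN numbers_t PIPELINED;                 -- consumes a REF CURSOR

-- Hence the nesting pattern
--    CURSOR (SELECT * FROM TABLE (PipeFunc_Inner()))
-- which converts the inner function's rows into the REF CURSOR
-- that PipeFunc_Outer expects.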
I'm posting a 2 part response
Chad
Part 1
_Nikotin - good point. I had the exact same thought after posting and realizing the small test case (and the quiz question) did not cross the default array optimization barrier of 100. I knew I had to look into this the next work day.
Kim Berg Hansen - I agree with your analysis. However, I can see the wording of "need not" in its context as meaning one of two things: (1) either/or, as in materialize all rows at once or materialize only one row at a time; or (2) something other than the stated "materialize all rows," which could mean one row at a time or any number >= 1, potentially as high as all rows. Your analysis uses meaning 2, while I was looking at meaning 1. The default array optimization of the cursor FOR loop materializes 100 rows at a time.
All - thanks for the info on array sizing, SQL*Plus parameters, etc. I was not able to cover all of these issues in my initial example and perhaps could have been clearer. These options and the contributed examples definitely show the impact of these settings on observed behavior. As with most PL/SQL blog topics, the further one investigates, the more difficult it is to address all the issues from all angles in a succinct manner.
As such, it would seem that point (2) could be true as stated in some cases, but there could also be cases where it is false, as the back and forth communication for each row returned by the pipelined function could slow things down versus returning all rows in a single return. As stated by others, often that is not even possible, because the data set involved is too large to materialize completely. Also, point (3) could be true, where less PGA is used, but may also be false if the data volume is small (though that would likely not be a realistic scenario, or something to worry about, when materializing fewer than 100 total rows).
Make that a 3 part response.
Part 2
Looking further and constructing an example where the number of rows returned by the pipelined function > 100, it appears that point (1) may be false, at least in my understanding of what it says and how it relates to the autonomous transaction nature of this particular pipelined function (it would appear to indicate that the pipelined function continues operating in parallel with the invoker, generating the next value while the invoker works with the returned data). If my understanding of (1) is not correct, the following can be ignored. As seen in the examples with 3 rows, the 2 second delay in the pipelined function and the 3 second delay in the get_sysdate function are executed serially. If the example is extended to return 110 rows, one can see that the first 100 values from the pipelined function are materialized in accordance with the array optimization of the CURSOR FOR loop. There is a 5 second interval per row as they are returned (the times 9:10:59 - 9:19:14 for rows 1-100), and then a one second interval as the CURSOR FOR loop processes these same 100 rows (9:19:15 - 9:20:54). If the pipelined function were continuing its work in parallel while control was with the CURSOR FOR loop after 100 rows, the time gap would not continue to be 5 seconds for rows 100+. At minimum, the gap between the CURSOR FOR loop timestamp of row 100 and the date from the 'select' for row 101 would be 3 seconds rather than the observed 5 seconds (9:20:54 - 9:20:59), since the 2 seconds in the pipelined function would have been occurring in parallel while the CURSOR FOR loop operated on the first 100 rows. It would appear that the autonomous transaction nature of this particular example enforces serialization for all interaction with the pipelined function.
Part 3
-- results returning 110 rows with pipeline function in autonomous transaction
date from "select" date from CURSOR FOR loop
-----------------------------------------------------------
2, 1, 11/04/2011 09:10:59, 11/04/2011 09:19:15
4, 2, 11/04/2011 09:11:04, 11/04/2011 09:19:16
6, 3, 11/04/2011 09:11:09, 11/04/2011 09:19:17
8, 4, 11/04/2011 09:11:14, 11/04/2011 09:19:18
...
196, 98, 11/04/2011 09:19:04, 11/04/2011 09:20:52
198, 99, 11/04/2011 09:19:09, 11/04/2011 09:20:53
200, 100, 11/04/2011 09:19:14, 11/04/2011 09:20:54
202, 101, 11/04/2011 09:20:59, 11/04/2011 09:21:45
204, 102, 11/04/2011 09:21:04, 11/04/2011 09:21:46
206, 103, 11/04/2011 09:21:09, 11/04/2011 09:21:47
208, 104, 11/04/2011 09:21:14, 11/04/2011 09:21:48
210, 105, 11/04/2011 09:21:19, 11/04/2011 09:21:49
212, 106, 11/04/2011 09:21:24, 11/04/2011 09:21:50
214, 107, 11/04/2011 09:21:29, 11/04/2011 09:21:51
216, 108, 11/04/2011 09:21:34, 11/04/2011 09:21:52
218, 109, 11/04/2011 09:21:39, 11/04/2011 09:21:53
220, 110, 11/04/2011 09:21:44, 11/04/2011 09:21:54
PL/SQL procedure successfully completed.
Elapsed: 00:11:00.03
-- Remove the update/commit from the pipeline function and remove the autonomous transaction setting.
-- Similar results are seen when returning 110 rows from a non-autonomous version of the pipelined function.
2, 1, 11/04/2011 10:01:43, 11/04/2011 10:09:59
4, 2, 11/04/2011 10:01:48, 11/04/2011 10:10:00
6, 3, 11/04/2011 10:01:53, 11/04/2011 10:10:01
8, 4, 11/04/2011 10:01:58, 11/04/2011 10:10:02
...
196, 98, 11/04/2011 10:09:48, 11/04/2011 10:11:36
198, 99, 11/04/2011 10:09:53, 11/04/2011 10:11:37
200, 100, 11/04/2011 10:09:58, 11/04/2011 10:11:38
202, 101, 11/04/2011 10:11:43, 11/04/2011 10:12:29
204, 102, 11/04/2011 10:11:48, 11/04/2011 10:12:30
206, 103, 11/04/2011 10:11:53, 11/04/2011 10:12:31
208, 104, 11/04/2011 10:11:58, 11/04/2011 10:12:32
210, 105, 11/04/2011 10:12:03, 11/04/2011 10:12:33
212, 106, 11/04/2011 10:12:08, 11/04/2011 10:12:34
214, 107, 11/04/2011 10:12:13, 11/04/2011 10:12:35
216, 108, 11/04/2011 10:12:18, 11/04/2011 10:12:36
218, 109, 11/04/2011 10:12:23, 11/04/2011 10:12:37
220, 110, 11/04/2011 10:12:28, 11/04/2011 10:12:38
PL/SQL procedure successfully completed.
It would seem that all interaction with the pipelined function is serial in nature, both when it is an autonomous transaction and when it is not. It will only begin processing/generating the next row when requested by the invoker.
Thanks,
Chad
Hello All,
All these examples are very instructive indeed.
As per most descriptions of the pipelined functions feature in various sources,
we would indeed expect the function to continue to work really "in parallel" with the calling SELECT.
The examples above show that this is NOT what happens, even for a non-autonomous pipelined function.
If such parallel work were indeed happening, we may wonder what would happen in a case in which,
for example, the pipelined function generates rows "very fast", while processing them
in the loop is "very slow" (due to some time-consuming processing).
In such a case, the pipelined function would need to buffer all the rows it produces
"ahead of the request moment", and this may go as far as having to buffer the entire collection,
just like a non-pipelined function.
What we could reasonably expect, though, is for the pipelined function to buffer a number of rows
equal to the retrieval array size, that is, to "stop and wait for a new request" after having produced
the "n" rows for the next fetch.
An additional "nuance" here would be that the fetch array size itself may be changed on the fly,
if for example we perform "FETCH ... BULK COLLECT INTO ... LIMIT n" using different values for "n",
which is of course an uncommon practice.
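Such a varying fetch size can be sketched like this (a contrived, self-contained example; the CONNECT BY cursor stands in for a pipelined-function query):

DECLARE
   TYPE tNums IS TABLE OF NUMBER;
   lNums  tNums;
   lLimit PLS_INTEGER := 10;
   CURSOR c IS
      SELECT level FROM dual CONNECT BY level <= 100;
BEGIN
   OPEN c;
   LOOP
      FETCH c BULK COLLECT INTO lNums LIMIT lLimit;
      EXIT WHEN lNums.COUNT = 0;
      dbms_output.put_line('fetched ' || lNums.COUNT || ' rows');
      lLimit := lLimit * 2;   -- change the array size on the fly
   END LOOP;
   CLOSE c;
END;
/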
If we add to this some DML processing and make the function autonomous, then
having such a function execute in parallel with the main processing may cause the processing results
to be effectively dependent on the array size, regarding not only performance but also the data content
itself; also something to be considered, though probably a scenario less common in practice.
In summary, with such parallel work and buffering "up to one full array", we could benefit
at least partially from having the rows produced in parallel with having them consumed.
If, instead, row production and consumption happen completely serially, as the above examples show,
then we remain with the other benefits of pipelined functions, that is, the streaming (chaining) of several
such functions, and the ease they offer for integrating PL/SQL processing within single SQL statements,
and so on.
Maybe this blog can form the basis of a good question that we could ask Steven to be so kind
as to raise with the PL/SQL development team at next week's meeting, for the benefit of all of us :) :)
Thanks & Best Regards,
Iudith