03 November 2011

Explorations into Pipelined Functions (9398)

The 2 November quiz on pipelined table functions and autonomous transactions prompted the following submission from Chad Lee of Telligen, specifically the 2nd and 3rd points in the explanation:

(1) A pipelined table function returns a row to its invoker (a query) with the PIPE ROW statement and then continues to process rows.

(2) Pipelined functions can help improve response time because the entire collection need not be constructed and returned to the server before the query can work with and return a single result row.

(3) In addition, the function consumes less Process Global Area memory, because the object cache need not materialize the entire collection.

I post his comments below for all to consider and discuss:

If one modifies the plch_pipeline package body by adding "dbms_lock.sleep(2);" just before the update statement to introduce a 2 second delay between returning rows from the pipelined function, and creates the following get_sysdate function to return the current sysdate accurate to the second, one can see that the pipelined function does indeed return rows to the calling query in 2 second intervals, as seen in the output. This result does not directly counter (2) or (3).

However, the results of the query are returned all at once, rather than one row every two seconds (think all_rows vs first_rows), which seems to indicate there is some buffering/materializing of the rows due to the autonomous transaction / pipelined function before control is passed fully to the calling SQL statement. If the calling SQL statement was completely free to work with the individual rows returned from the pipelined autonomous transaction (the different get_sysdate value seems to indicate partial control at the calling SQL), wouldn't the rows have been returned to the SQL Plus prompt in 2 second intervals similar to the first_rows hint? This behavior would seem to counter (2) and (3).

If I put the query in a loop in a PL/SQL block, I can see that the loop does not execute until all rows from the pipelined function have been returned. Thus, the rows returned by the pipelined function are being buffered somewhere. Would this not be in the PGA? In effect, isn't the entire collection being materialized before control is returned to the calling program? Does this counter (2) and (3), as the loop does not work with the first row when it is returned, but instead waits for all rows to be materialized before starting execution of the loop?

This behavior reinforces my understanding of Autonomous Transactions, such that although operating outside of the calling transaction, the autonomous transaction must complete fully before control is passed back to the calling transaction. Thus, any data returned by a pipelined function used within an autonomous transaction must in fact be materialized fully before control is passed back to the calling transaction. Whether this occurs one row at a time with a back and forth interaction between the calling transaction and the autonomous transaction or when all rows are computed by the autonomous transaction, a continually growing memory structure would be created. The manner in which the rows are returned would also seem to impact performance, with the back and forth action you describe seemingly slower than returning all at once.

A more in depth example could be created to determine where the pipelined rows are being materialized before the calling transaction continues execution.
CREATE OR REPLACE PACKAGE BODY plch_pipeline
IS
   FUNCTION double_values (dataset refcur_t)
      RETURN numbers_t PIPELINED
   IS
      PRAGMA AUTONOMOUS_TRANSACTION;
      l_number   NUMBER;
   BEGIN
      LOOP
         FETCH dataset INTO l_number;
         EXIT WHEN dataset%NOTFOUND;

         dbms_lock.sleep(2);      -- add sleep of 2 seconds

         UPDATE plch_parts SET partnum = partnum;
         COMMIT;
         
         PIPE ROW (l_number * 2);
      END LOOP;
      CLOSE dataset;
      RETURN;
   END;
END plch_pipeline;
/

-- generate the current sysdate to the second
create or replace function get_sysdate
return varchar2
is
  ls_date       varchar2(20);
begin
  select to_char(sysdate,'mm/dd/yyyy hh24:mi:ss') 
    into ls_date from dual;
  return ls_date;
end;
/

col get_sysdate format a20
SELECT a.column_value, rownum, get_sysdate
  FROM TABLE (plch_pipeline.double_values (
                CURSOR (SELECT line
                          FROM user_source
                         WHERE name = 'PLCH_PIPELINE'
                           AND type = 'PACKAGE'
                           AND line <= 3
                         ORDER BY line))) a
/
Can verify rows are being returned by the autonomous transaction at 2 second intervals. However, the results are all returned at the same time to the SQL Plus prompt and not as they receive their get_sysdate value.
COLUMN_VALUE     ROWNUM GET_SYSDATE
------------ ---------- --------------------
           2          1 11/03/2011 10:33:10
           4          2 11/03/2011 10:33:12
           6          3 11/03/2011 10:33:14

begin
  for rec in (SELECT a.column_value, rownum, get_sysdate date_val
    FROM TABLE (plch_pipeline.double_values (
                CURSOR (SELECT line
                          FROM user_source
                         WHERE name = 'PLCH_PIPELINE'
                           AND type = 'PACKAGE'
                           AND line <= 3
                         ORDER BY line))) a) loop
  
    dbms_lock.sleep(1);
    dbms_output.put_line(
         rec.column_value || ', ' || rec.rownum || ', ' || 
         rec.date_val || ', ' || 
         to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
  end loop;
end;
/
The timestamp generated/printed within the loop has a value after the timestamp generated from the pipelined function, showing that the data from the pipelined autonomous transaction function is being materialized prior to the loop starting execution. */
2, 1, 11/03/2011 11:21:31, 11/03/2011 11:21:36
4, 2, 11/03/2011 11:21:33, 11/03/2011 11:21:37
6, 3, 11/03/2011 11:21:35, 11/03/2011 11:21:38
In addition, if I modify the get_sysdate function to introduce a delay of 3 seconds, one can see from the output that there is a delay of 5 seconds between each row returned from the pipelined function (2 seconds within the pipelined function and 3 seconds from the get_sysdate call), indicating that the entire operation is running in a strictly serial manner. (1 - get value from pipeline function, 2 - make call to get_sysdate for the row in the calling query, repeat these two steps for each of the three rows returned by the pipeline function)
create or replace function get_sysdate
return varchar2
is
  ls_date       varchar2(20);
begin
  dbms_lock.sleep(3);
  select to_char(sysdate,'mm/dd/yyyy hh24:mi:ss') 
    into ls_date from dual;
  return ls_date;
end;
/

SELECT a.column_value, rownum, get_sysdate
  FROM TABLE (plch_pipeline.double_values (
                CURSOR (SELECT line
                          FROM user_source
                         WHERE name = 'PLCH_PIPELINE'
                           AND type = 'PACKAGE'
                           AND line <= 3
                         ORDER BY line))) a
/


COLUMN_VALUE     ROWNUM GET_SYSDATE
------------ ---------- --------------------
           2          1 11/03/2011 13:47:54
           4          2 11/03/2011 13:47:59
           6          3 11/03/2011 13:48:04

The same serial behavior is seen in the pl/sql block, with all rows from the pipeline function materialized in a serial manner before executing the cursor loop.
begin
  for rec in (SELECT a.column_value, rownum, get_sysdate date_val
    FROM TABLE (plch_pipeline.double_values (
                CURSOR (SELECT line
                          FROM user_source
                         WHERE name = 'PLCH_PIPELINE'
                           AND type = 'PACKAGE'
                           AND line <= 3
                         ORDER BY line))) a) loop
  
    dbms_lock.sleep(1);
    dbms_output.put_line(
        rec.column_value || ', ' || rec.rownum || ', ' ||
        rec.date_val || ', ' || 
        to_char(sysdate,'mm/dd/yyyy hh24:mi:ss'));
  end loop;
end;
/

2, 1, 11/03/2011 13:51:45, 11/03/2011 13:51:56
4, 2, 11/03/2011 13:51:50, 11/03/2011 13:51:57
6, 3, 11/03/2011 13:51:55, 11/03/2011 13:51:58

02 November 2011

"Average" function not correct due to lack of rounding? (9397)

The 1 November quiz tested your knowledge of the ability to select from a nested table using the TABLE operator and, in this case, apply the AVG function to the contents of the collection. The question asked in the quiz was:

Which of the choices implements a function named plch_avg that calculates the average of all elements in a collection of the above type so that the following block displays "3.5" after execution

We compared the TABLE operator approach to algorithms one might write themselves to compute the average. One of these, marked as correct, is:
CREATE OR REPLACE FUNCTION plch_avg (numbers_in IN plch_numbers_t)
   RETURN NUMBER
IS
   l_index   PLS_INTEGER := numbers_in.FIRST;
   l_average     NUMBER := 0;
BEGIN
   WHILE l_index IS NOT NULL
   LOOP
      l_average := l_average + numbers_in (l_index)/numbers_in.count;
      l_index := numbers_in.NEXT (l_index);
   END LOOP;

   RETURN l_average;
END;
Several players objected and believe that this choice should be marked as incorrect, since it will not always produce the expected result. For example, if I pass a nested table of 9 rows (values 1 through 9), this function returns:
5.00000000000000000000000000000000000001
This is certainly true, and if we asked you to identify algorithms that would work under all circumstances, we would be wrong to mark this as correct. We would also be wrong to mark choice 8476 as correct, since an empty collection would cause the function to fail with a divide-by-zero exception.

But we asked you to identify choices so that the following block displays "3.5". It certainly does that, and so we believe that this choice should be scored as correct.

Your thoughts?