26 February 2011

DML restarts only happen with concurrent sessions? (2030)

The 25 February quiz touched on an aspect of Oracle database behavior that I suspect many players and more generally many PL/SQL developers were unaware (it took me by surprise!):

Oracle may, and often does, restart a DML statement after it experiences an internal failure. In such situations, Oracle automatically performs a rollback to an implicit savepoint it set before the statement executed. It then executes the statement again.

Two players wrote with an objection regarding the assumptions for the daily quiz and the reason for these restarts. Their comments below:

"After seeing that this question was ranked as advanced, I stared a long time at the code and really could not find the reason for it. One of the PL/SQL Challenge general assumptions says the following: "All code in the question and in the multiple choice options run in the same session (and concurrent sessions do not play a role in the quiz unless specified)." The write consistency issue which is explained magnificently by Tom Kyte (and I had the rare occasion to attend a live conference about it !) IS ALWAYS RELATED TO CONCURRENCY, which means AT LEAST 2 SESSIONS active !!! The Oracle documentation quoted in the Overall answer also speaks about A CONCURRENT UPDATE OR DELETE. So, under the quiz assumption that each piece of code (for each choice) IS NOT influenced by any other session, the choice that says that the package variable can end by being HIGHER than the c_iterations constant WAS WRONGLY SCORED AS CORRECT !!! It is true that placing non-transactional code in triggers is in most cases NOT a good practice and prone to bugs, like for example, code being re-executed an additional time unintentionally after a REAL rollback of a transaction (due to some unhandled Oracle error or an explicit ROLLBACK, and NOT because of write consistency issues !!! ) but THIS IS NOT THE CASE in the "sterile single session" context of the PL/SQL Challenge ... It's true that following this consideration the whole quiz "will lose its point", but I think that the assumptions rules fully justify a rescoring for this quiz."

and

"A quibble more than a bug with yesterday's quiz... One of the assumptions of the quiz is that concurrent sessions don't play a role unless otherwise specified. The text of the question about what *could* be displayed strongly implies that you intended to point out that a trigger may be executed more than c_iterations times in order to ensure write consistency. But I'm not aware of any way to cause that to happen without positing the existence of another session that is accessing the PLCH_EMPLOYEES table and the text of the question does not explicitly specify that we may consider the impact of other sessions."


My response: it is certainly possible that this error occurs most frequently because of concurrency issues, but I can tell you for certain that it is not the only cause. When my reviewers and I ran tests of this code (as we did that a lot), we saw restarts happening when there was not any concurrent behavior. None at all. I sat here on my lonely, little laptop - the only user on the instance, nothing else playing with the plch_employees table, and I consistently saw restarts.


So this issue is not always related to concurrency and I think the current scoring stands. As to why these restarts happen without a concurrency issue, I have no idea and it seems that Oracle is mum on the topic as well.


Your thoughts? The results of your own tests? Can no one else run the verification code and see re-starts, without a second session running and touching the plch_employees table at the same time?

12 comments:

  1. Hello All,
    Unfortunately, I will only by able to try to reproduce the case after this week-end, though it will be very difficult to find a database where a session can work alone ...

    I am still waiting for a complete example WITH A CONSISTENT EXPLANATION regarding the count-inconsistency experienced while ONE SINGLE SESSION is accessing the table.

    Why do we need at all the HIGH number of 100,000 iterations ?
    Has it anything to do with it ?!?
    What happens for a lower number ?

    On Tom Kyte's blog there is a nice example how to put it in evidence, but WITHOUT EXPLANATION about WHEN does this happen or when/why it MIGHT happen.

    Unlike the write-consistency case, which is widely discussed AND EXPLAINED in his book also and for which issue clear and COMPLETELY REPRODUCIBLE cases are offered, in this blog we can only see repeated the statement that "such restarts just happen and will always happen", but without an explanation, so we might conclude that this is the norm, rather than the exception.

    One of the blog readers suggests that it might be going "in parallel" with allocating UNDO extents, but there is no confirmation for it.

    I can hardly accept that "things just happen"
    without any reason, because of some undefined "internal error" ...

    If so, then the FIRST statement about triggers in any PL/SQL book would have to state this clearly and from the start !!!

    The truth is that Tom Kyte told us that he himself became aware of this behavior only after many years of using Oracle, and even then he was spoking ONLY about the cases where CONCURRENCY is involved and which are fully reproducible !!!

    I would happy to hear more comments on the issue.

    Thanks & Best Regards,
    Iudith

    ReplyDelete
  2. I was able to reproduce the restarts easily on my XE instance with no other user sessions. I would suspect row migration might be a general cause. If updating each row of a million row table, and a row on the first block migrated to the last block as a result of that update, somehow Oracle would need to know that the migrated row has already been updated as part of that statement. SCNs and locks aren't at the statement level, so I could see it a two stage process to migrate rows as necessary, then do the updates.

    But this test is a single row table, and with each statement updating just that one row, so row migration can't happen.

    I've tried with 100 and get occasional outputs of 101. The number of restarts isn't consistent and repeatable. I'd suspect something to do with the DBWR updating the block as it is the only other session on the DB with a reason to do anything with the block.

    ReplyDelete
  3. There is old russian discussion http://groups.google.com/group/relcom.comp.dbms.oracle/msg/c40c6f6367a66f5e
    Vladimir Begun pointed out that restart of ROW is related with undo management.
    As I understand the process. Before-row trigger is fired. New values is applied to row. When there is no space left in rollback segment it is autoextended and row actions is repeated starting with before-row trigger.
    Why undo management is not performed transparently for application? Maybe it's implementation issue...

    ReplyDelete
  4. Hello All,
    I will post the next comment in several fragments, due to the 4096 char limitation ...

    I just performed a small test on a relatively inactive database (there are other occasional sessions running, but NOT related to my table).

    I created the following test procedure
    for testing out what happens for:
    1. Different numbers of iterations
    2. If performing ROLLBACK / COMMIT / NO ACTION
    at the end of each test.

    CREATE OR REPLACE PROCEDURE plch_test (p_iterations IN PLS_INTEGER, p_action IN VARCHAR2)
    AS
    c_iterations CONSTANT PLS_INTEGER := p_iterations ;
    l_salary plch_employees.salary%TYPE;
    BEGIN
    plch_emp_count.g_count := 0;

    FOR i IN 1 .. c_iterations
    LOOP
    UPDATE plch_employees
    SET salary = salary + 1
    WHERE employee_id = 100;
    END LOOP;


    DBMS_OUTPUT.put_line('c_iterations = '||c_iterations);
    DBMS_OUTPUT.put_line('g_count = '||plch_emp_count.g_count);

    CASE p_action
    WHEN 'N' THEN NULL;
    WHEN 'R' THEN ROLLBACK;
    WHEN 'C' THEN COMMIT;
    END CASE;

    END;
    /

    The results are not completely concludent,
    but anyway they do CONSISTENTLY show that the problem starts appearing only for 100000 iterations, and NOT for a smaller number.

    ReplyDelete
  5. ( continued )
    I performed the following:

    -- TEST WITHOUT ANY ACTION AT END OF EACH CALL

    SQL>EXEC plch_test(10,'N');
    c_iterations = 10
    g_count = 10

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(100,'N');
    c_iterations = 100
    g_count = 100

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(1000,'N');
    c_iterations = 1000
    g_count = 1000

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(10000,'N');
    c_iterations = 10000
    g_count = 10000

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(100000,'N');
    c_iterations = 100000
    g_count = 100002

    PL/SQL procedure successfully completed.


    -- TEST WITH "ROLLBACK" AT END OF EACH CALL
    SQL>EXEC plch_test(10,'R');
    c_iterations = 10
    g_count = 10

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(100,'R');
    c_iterations = 100
    g_count = 100

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(1000,'R');
    c_iterations = 1000
    g_count = 1000

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(10000,'R');
    c_iterations = 10000
    g_count = 10000

    PL/SQL procedure successfully completed.

    SQL-CONS >EXEC plch_test(100000,'R');
    c_iterations = 100000
    g_count = 100001

    PL/SQL procedure successfully completed.


    -- TEST WITH "COMMIT" AT END OF EACH CALL
    SQL>EXEC plch_test(10,'C');
    c_iterations = 10
    g_count = 10

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(100,'C');
    c_iterations = 100
    g_count = 100

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(1000,'C');
    c_iterations = 1000
    g_count = 1000

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(10000,'C');
    c_iterations = 10000
    g_count = 10000

    PL/SQL procedure successfully completed.

    SQL>EXEC plch_test(100000,'C');
    c_iterations = 100000
    g_count = 100000

    PL/SQL procedure successfully completed.

    In the test above, the problem appeared for
    "NO ACTION" and for "ROLLBACK" after each test,
    but NOT for "COMMIT".

    ReplyDelete
  6. ( continued - the last part )
    In another repetition of exactly the same test,
    the problem appeared for "NO ACTION" and for "COMMIT", but NOT for "ROLLBACK".

    Again, the problem appeared ONLY FOR 100,000 iterations and NOT for less.

    I also tested the same when the update was a "DUMMY" one, setting SALARY=SALARY,
    with the same results.


    So, it looks that it might be somehow related
    to undo extents allocation, as suggested by a blogger on Tom Kyte's blog, I hope to still be able to check this also.

    Anyway, the whole issue is still "far more hidden" than a trivial ROLLBACK that might anytime happen during a transaction, and this is the one that causes performing non-transactional operations in triggers to be considered a BAD practice.

    If the quiz were directed to teaching BEST PRACTICES, then using the ROLLBACK case
    would have been the classic example (I don't remember having it already in any of the previous quizes).

    This one, as it was presented and scored touches a little bit the field of "Oracle mystics" ...

    Maybe it's good that such "dark areas" still exist (at least for us !), because they can offer nice discoveries in the future ...

    But I still think that, being unreproducible
    and especially UNEXPLAINABLE, it was still somehow improper to put it in a quiz ...

    I am almost sure that those who explicitely
    implemented this restart behavior at the Oracle kernel level DO HAVE and explanation for it,
    it is interesting to know WHY DON'T THEY PUBLISH IT LOUDLY AND CLEARLY, with a complete technical rationale.
    Is it for defending brand rights ?
    or maybe to prevent people's trust in the Oracle database to be diminished if they hear about
    "some internal errors" just happening here and there, without a consistent reason ?!?

    I still expect an precise explanation for this
    behavior, though, we see daily more and more issues for which everybody is happy if just finding a work-around for it ...

    And, in summary, what to do ?
    I still feel very uncomfortable with this quiz,
    that was based on an UNDOCUMENTED feature ...
    ( the documented one being only that of the concurrent sessions case ) ...

    Thanks & Best Regards,
    Iudith

    ReplyDelete
  7. This is one more forced example that shows how expensive restarts can be.

    Connected to:
    Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
    With the Partitioning, OLAP, Data Mining and Real Application Testing options

    SQL> set serveroutput on
    SQL> CREATE SMALLFILE UNDO TABLESPACE UNDO_TEST DATAFILE 'F:\\oracle\DB\ORCL\UNDO_TEST.ora' SIZE 5M;

    Tablespace created.

    SQL> ALTER SYSTEM SET UNDO_TABLESPACE = UNDO_TEST;

    System altered.

    SQL> drop table t0;

    Table dropped.

    SQL> create table t0 as select 0 n from dual;

    Table created.

    SQL> drop table t1;

    Table dropped.

    SQL> create table t1 as select lpad('x',4000,'x') x from dual;

    Table created.

    SQL> create or replace trigger trg1
    2 before update on t1
    3 for each row
    4 declare
    5 pragma autonomous_transaction;
    6 begin
    7 update t0 set n = n + 1;
    8 commit;
    9 end;
    10 /

    Trigger created.

    SQL> declare
    2 t0 timestamp;
    3 t1 timestamp;
    4 n pls_integer;
    5 begin
    6 for i in 1 .. 10 loop
    7 update t0 set n = 0;
    8 commit;
    9 t0 := systimestamp;
    10 for i in 1 .. 100 loop
    11 update t1 set x=x;
    12 end loop;
    13 t1 := systimestamp;
    14 commit;
    15 select n into n from t0;
    16 dbms_output.put_line('n = '||rpad(n, 7, ' ')||' t1-t0 = '||substr(t1-t0, 15, 9));
    17 end loop;
    18 end;
    19 /
    n = 112 t1-t0 = 00:00.031
    n = 112 t1-t0 = 00:00.015
    n = 112 t1-t0 = 00:00.016
    n = 112 t1-t0 = 00:00.016
    n = 112 t1-t0 = 00:00.047
    n = 146822 t1-t0 = 00:59.187
    n = 112 t1-t0 = 00:00.016
    n = 154 t1-t0 = 00:00.031
    n = 118 t1-t0 = 00:00.031
    n = 88734 t1-t0 = 00:35.719

    PL/SQL procedure successfully completed.

    ReplyDelete
  8. Tom Kyte on statement restart:
    "Why they were restarted - not really relevant - the fact is - they CAN be restarted, we can show that they can be restarted, it is documented that they can be."
    (http://tkyte.blogspot.com/2010/04/that-old-restart-problem-again.html)

    But I’m also a bit curious why it happens...

    ReplyDelete
  9. I am wondering if somebody filed SR against this behavior. I may only imagine how it may affect multi-million rows operation.

    By the way I have feeling that it (restarts, not caused by concurrency) was introduced (intentionally or not) in 9i, while I have no proof.

    ReplyDelete
  10. al0 said...
    By the way I have feeling that it (restarts, not caused by concurrency) was introduced (intentionally or not) in 9i, while I have no proof.

    From Dark Side of the Force your feeling comes. By sparks from Vitaliy's post it may be lit. (You can find the numbers 8.1.7.2.1 in this discussion...)

    ReplyDelete
  11. From Dark Side of the Force your feeling comes. Yes, I missed the version in hte post to which Vitaly refers. BTW, other mail in the same discussion refers to the 7.3.3.6 (along with 8.1.6), but unfortunately text of that mail is garbaged (seems to be in the wrong encoding), so it is not possible to understand a context.

    ReplyDelete
  12. There are my explanation without comments on russian forum.
    When I have time I'll write my observations.

    http://www.sql.ru/forum/actualutils.aspx?action=gotomsg&tid=833618&msg=10331463

    ReplyDelete