23 September 2011

"Could" Changes Really Improve Performance? (7755)

The 22 September quiz asked you to analyze a piece of code and decide:

Which of the choices describe a change I can make to the plch_emp_loop procedure that will not affect the external behavior of the procedure (what it displays or which rows it updates), but could improve its performance?

The quiz was intended to be about your ability to identify loop invariants, expressions within the body of the loop whose outcomes do not change with each execution of the loop. A standard optimization step is to replace  the invariants with a variable (or better yet a constant!) that is not re-evaluated with each iteration of the loop.

Several players wrote with concerns about the way this question was worded and scored:

1. Hi Steven, I usually do not complain about quizzes, but I felt that yesterdays quiz was a little bit vague and not up to par or standards of the plsql challenge. First thing is how the question is formed. I don't like vague words like could to be in a quiz. The should always be will, specially when it comes to performance enhancements. But looking at the question itself. "Which of the choices describe a change I can make to the plch_emp_loop procedure that will not affect the external behavior of the procedure (what it displays or which rows it updates), but could improve its performance?". What if all the data in the table have salaries less that the minimum. Then this procedure never runs and the enhancements have no effect at all anyways. 2 answer came out wrong for me, probably because I emphasized too much on the performance part of the question rather that reading the code itself. Therefore I fell into the SYSDATE trap, marking it as correct. It does improve performance, there is no doubt of that, but it of course is a changing variable and therefore will change the internal running of the program. The other one was the number 8123 "l_year NUMBER := TO_CHAR (date_in, 'YYYY');" This will make the IF statement later an implicit conversion (CHAR => NUMBER), that will not neccessarily run faster and therefore I don't think it could be a correct answer.

2. How much performance gain do you expect to get from changing from the explicit conversion TO_CHAR (date_in, 'YYYY') to the implicit conversion of l_year? I tried it, and I didn't get any significant gain.

3. Answer 3 and 4 of the 22 sept quiz contradict! On 31-12-2010 to 01-01-2011 the year changes. PS answer 3 makes an assumption about what date should be used and that changing de date is the best programming solution.

4. Option 3 was scored as incorrect: " The question did not include any constraints on when and for how long this program runs. If during the execution of the procedure, the hour changes, then we cannot safely extract the TO_CHAR (SYSDATE, 'HH24') expression and evaluate it just once at the start of the procedure." This statement is correct but I think this is not a very good example. It is a correct that the hour can change during the running of the program, but do you want to have just a part of your employees updated during a run of the program; I don't think so!

5. An answer in quiz 5142 states "The question did not include any constraints on when and for how long this program runs", but the quiz question says "never takes more than 4 hours to complete" - so I would have thought it would be safe to capture l_hour - although I guess it depends on when the program was launched. I was also curious about the affect on performance on the function call to plch_config.min_salary, since it is just an assignment, no conversion required - performance gain almost nominal?

I will offer some responses here, but then open it up for discussion from players.
  • I used the word "could" because I wanted players to focus on a logical analysis of the code (what in theory could improve performance? and not so much on analyzing the actual gains of specific changes, which as some noted could be very minimal indeed).
  • Excellent point about the change to l_year resulting in an implicit conversion, when previously there was no conversion. I did not catch that and I believe it makes that choice ambiguous, not clearly a net performance gain when the loop invariant is extracted. I think I will probably need to issue a correction for that choice.
  • The answer for choice 3 was incorrect and has been changed to remove the leading sentence. I added time constraint text to the question, but forgot to remove this from the answer.
  • I do not see a contradiction between answers 3 and 4 (8122 and 8123 to use their IDs).The year does not change while the program executes, based on " it is only executed during the first month of each quarter (January, April, July and October) and never takes more than 4 hours to complete".
Other comments and points of analysis?

    13 comments:

    1. Although response 8123 replaces an explicit date-to-character conversion with an implicit character-to-number conversion, the change could result in improved performance since the implicit conversion might be less computationally expensive than the explicit conversion. Using EXTRACT to get the year from hire date may be more efficient than the implicit conversion. Without knowing the internal details of each of the conversion routines, it is not possible to determine if any of these changes is likely to result in improved performance. None of this invalidates the scoring: the quiz was about the possibility, not the probability, of improving performance without altering the results of the procedure.

      ReplyDelete
    2. I had selected option 4 as incorrect because I thought the implicit conversion of l_year would add to the processing. On testing however, I'm not sure this is the case...

      I made the following test cases to see what difference comparing the (number) l_year to the character vs. to_char-ing the date:

      -- Char <= Char
      declare
      l_emp_date date := sysdate;
      l_comp_date date := sysdate;
      l_number number := to_char(l_comp_date, 'yyyy');
      l_start number;
      l_varchar2 varchar2(4) := to_char(l_comp_date, 'yyyy');
      begin
      l_start := dbms_utility.get_time();
      for i in 1 .. 100000 loop
      for j in 1 .. 1000 loop
      if to_char(l_emp_date, 'yyyy') <= to_char(l_comp_date, 'yyyy') then
      null;
      end if;
      end loop;
      end loop;
      dbms_output.put_line((dbms_utility.get_time() - l_start));
      end;
      /
      -- Char <= number
      declare
      l_emp_date date := sysdate;
      l_comp_date date := sysdate;
      l_number number := to_char(l_comp_date, 'yyyy');
      l_start number;
      l_varchar2 varchar2(4) := to_char(l_comp_date, 'yyyy');
      begin
      l_start := dbms_utility.get_time();
      for i in 1 .. 100000 loop
      for j in 1 .. 1000 loop
      if to_char(l_emp_date, 'yyyy') <= l_number then
      null;
      end if;
      end loop;
      end loop;
      dbms_output.put_line((dbms_utility.get_time() - l_start));
      end;
      /
      -- Char <= Char, but convert the date outside the loop
      declare
      l_emp_date date := sysdate;
      l_comp_date date := sysdate;
      l_number number := to_char(l_comp_date, 'yyyy');
      l_start number;
      l_varchar2 varchar2(4) := to_char(l_comp_date, 'yyyy');
      begin
      l_start := dbms_utility.get_time();
      for i in 1 .. 100000 loop
      for j in 1 .. 1000 loop
      if to_char(l_emp_date, 'yyyy') <= l_varchar2 then
      null;
      end if;
      end loop;
      end loop;
      dbms_output.put_line((dbms_utility.get_time() - l_start));
      end;
      /

      Running on 10.2.0.3, I get both the first and last cases (both char comparisons) as around 17s but the middle case (char compared to number) as around 14s - about 3s faster. This has held up over multiple rounds of tests.

      I'm not sure why the last case about the same amount of time as the first; the second to_char() isn't invoked in the loop. So the improvement could be due to the fact that number comparisons are more efficent than varchar; I'm interested if any others can reproduce my results and have a more solid explanation!

      ReplyDelete
    3. Hello All,

      I would just add two more remarks, in addition to the implicit conversion one that I also thought about a little bit longer while answering the quiz:


      1. In Choice 8120, defining the local l_name as having the anchored type
      plch_employees.last_name%type
      might raise a VALUE_ERROR in the declaration section if the NAME_IN value passed in happens
      to be longer than the declared 100 bytes of plch_employees.last_name ...

      This was probably not "the stone to stumble upon" ... so I treated it like something to be ignored in the context of this quiz.

      2. Regarding the overall logic of updating employees based on the current hour,
      yes, it has probably caused many of us "to raise eyebrows" regarding the requirement itself...

      In fact, if the 4 hours of running the procedure happen to encompass hour 12 at noon,
      then any of the accepted performance emprovers ( the 3 correct choices ) are likely to influence the number of the rows updated.
      Put otherwise, if the procedure runs faster due to any of the other improvements, then more employees "get the chance" to earn their 1$ salary raise before the current time hits
      hour 12 ... which in fact changes what was called "the external effect of the procedure",
      here the number of records updated.

      Putting it this way, strictly speaking, each of the correct improvements "carries the potential" to turn its own choice into being incorrect.

      Yes, the quiz style was somewhat different from what we are used to have in the daily quizzes,
      but, however, interesting in its own way.

      All the performance improvements suggested are in fact best practices, worth to be remembered
      and used naturally in current development.

      Thanks a lot & Best Regards,
      Iudith

      ReplyDelete
    4. I had emailed Steven earlier about he same issue.
      Even though it will give me a score of 50% I think I have to say all answers to this quiz are invalid.

      The results of the procedure are SYSDATE dependent (the 12 noon mark Iudith pointed out above) so anything that changes the execution time of the procedure can potentially impact the results which breaks the "(what it displays or which rows it updates)" rule.

      For example, if the procedure normally runs at 11am and takes 2-3 hours to complete.
      The first hour of execution will update rows because of this clause because it's before noon and the rest will be skipped. (This is poor design but the quiz doesn't ask about this.)

      If we now make the procedure so fast that it completes in 50 minutes, then all rows will be updated. This, we have changed the rows updated.

      That's an extreme example but simply altering the execution time by 1 second is sufficient to potentially change results.

      Thus... NONE of the answers are correct.

      Either they don't help performance or they do help performance and thus, potentially change the functionality and hence disqualify themselves.

      ReplyDelete
    5. Default installation of 11g comes with plsql_optimize_level 2.

      As per documentation - Level 1 itself "Applies a wide range of optimizations to PL/SQL programs including the elimination of unnecessary computations and exceptions"

      Most of the answers discussed here are falling under "elimination of unnecessary computation" and this is automatically eliminated.

      Hence both version of program would achieve the same result in terms of performance.

      ReplyDelete
    6. Don't change the question. It was valid enough to demonstrate the effect of taking out code of a loop. But I lake the comment of iudith. The fact that faster code changes the outcome which should not be the case.

      ReplyDelete
    7. Hello All,

      I just tried to test the sample cases submitted above by Chris Saxon, and my results on
      11.1.0.7.0 are as follows:

      -- first test
      Run1 -- Char <= Char: 180
      Run2 -- Char <= number: 174
      Run3 -- Char <= Char: 334

      -- second test
      Run1 -- Char <= Char: 171
      Run2 -- Char <= number: 178
      Run3 -- Char <= Char: 156

      So, while the first two blocks had consistent result, the 3-rd one took much more time
      "to warm up" ...

      Thanks & Best Regards,
      Iudith

      ReplyDelete
    8. Using the logic of "could" improve performance, then capturing l_hour "could" be an improvement if the process was launched before 8am.

      I thought this answer would be one of the better improvements to the code, and should have had a good chance of not affecting the outcome.

      I have not tested it yet, but I thought perhaps even more-so than referring to plch_config.min_salary

      Scott

      ReplyDelete
    9. I ran some brief tests, checking V$SQL's CPU_TIME
      and ELAPSED_TIME. There were benefits in only doing those conversions once but nothing worth the trouble.
      The repeated calls to SYSDATE (and the context switch to SQL to obtain it) have a bigger impact. I once tuned a routine by grabbing SYSDATE once and using dbms_utility.get_time to keep the value 'fresh' rather than requery it from the database....but that wasn't one of the options here.

      ReplyDelete
    10. Unfortunately the whole purpose of this quiz, testing knowledge of "invariants" has been superceded by the performance issue, and fallout from that. I thought of another exception. Just the flipside of the example I gave originally. Instead of 11am, what if the process ran at 11pm? Again, if the procedure runs for 2-3 hours, it will skip updates for the first hour because they are after noon, but it will do updates after midnight when the process rolls into the next day because sysdate will be before noon. If the procedure becomes faster it's possible that all of the updates will be skipped. Ironically, the more it skips, the faster it will become thus exacerbating the symptom.

      The explanation for 8122 has already identified the issue. The problem then is that explanation isn't being applied to all of the answers even though the same logic holds. I still vote for a rescore (even though mine will get worse) and changing the text of the quiz so that changes in performance won't affect the external behavior.

      ReplyDelete
    11. Hello All,

      When it comes to BUSINESS data processing, probably the most realistic scenario that may come close to the one presented in this quiz is to have a requirement that, depending on the time of the day, either performs ALL of the updates or none of them.
      In such a case, of course, the check should be made OUTSIDE THE LOOP.

      I can also think of a scenario that is even closer to that of the quiz, for example,
      a daily batch process that collects tables statistics by some special criteria, having a restricted time window (range of hours) in which it can be run.
      That is, in each run, only part of the tables will be analyzed in the given time frame,
      specifically, starting always with those whose statistics are the oldest collected.

      In such a case it DOES make sense to check the current time BEFORE EACH ITERATION of the loop
      and stop when the time frame is exceeded.
      But, of course, the outcome of a single run will be each time and ON PURPOSE a different one.

      All in all, back to the quiz, if we incline to consider all the choices as wrong
      because of the quiz requirement of "preserving the external effect", then I think that we
      can and probably should rather consider the question scenario itself as being incorrect per se for such a requirement.

      Basic logic says that a FALSE can imply anything, so, if we are at the point of questioning the question itself, it does not make much sense to discuss the answers correctness at all ...

      Just a few more thoughts ...

      Thanks & Best Regards,
      Iudith

      ReplyDelete
    12. I agree that the code as presented is a poor design for a "real" program. However, many of the quizzes involve practices that wouldn't be recommended but still present valid learning material. Similarly some quizzes have answers that are poor design but still valid within the context of the quiz constraints.

      The quiz didn't ask us to determine the validity of the scenario. The requirements of the quiz are very clear (make it faster, don't change results) and as such, the requirements of the answers are pretty clear too. Either the answers do NOT affect performance OR they DO and hence affect the results. Therefore, none of them are valid solutions.

      The fact that the original scenario was ambiguous and the code was not a good model is, I think, immaterial.

      If people object to having all answers be wrong, then just call it a wash. Simply give everyone full credit or give everyone 0 credit and then adjust the question/scenario so there are no ambiguities. I enjoy playing the game of the quiz, but I'd much rather have the quiz/answers be consistent and correct for future use.

      ReplyDelete
    13. Well put, Sean, and thanks for the insightful analyses by the other posters.

      Due to sloppiness in the way I set up the question, there is no avoiding the fact that any improvement in performance could change which rows are updated. So as it stands, all choices are incorrect.

      What's a quizmaster to do? Here's the plan:

      1. I will mark this quiz as invalid for purposes of ranking. It will not be used to determine ranking in the quarter, participation in the playoff, etc.

      2. I will change the question text so that it is clear that the procedure always finishes in the morning. It is the simplest way to keep as much of the quiz (and your answers) intact.

      3. I will give all players credit for the HH24 choice, which will be set to CORRECT.

      4. If any player marked all choices as incorrect (thus assuming that they applied the analysis laid out in this blog), they should submit a request for a score adjustment and they will be given 100% credit for the quiz. Only 16 players submitted a "No choices are correct" answer. You know who you are. :-)

      Regards, Steven

      ReplyDelete