02 December 2010

Ambiguities in 1 December quiz on emulating indexes in collections? (1741)

The 1 December quiz tested your knowledge of using associative arrays to emulate multiple indexes into a single collection of data (at least, that was the intention). Several players felt that there were ambiguities in the question. I will provide some of their comments below, but leave it to them and others to elaborate more fully in their responses to the blog.

1. "I thought that the 'perform faster than repeated SQL queries' phrase was ambiguous. I would assume I could run several (i.e., repeated) SQL queries for specific rows in less time than it would take to return the same rows through the package if we included the initialisation time, which is querying the whole table and performing the memory allocation. However, if we took SQL query 2, I would expect package call 2 to be quicker."

2. "While the 1st answer is simply incorrect (AFAIK, there is no such thing as TYPE ... IS INDEX ON ...) and the 2nd is obviously ineffective when doing lookups by partname, there's the 3rd answer I'm unsure about. Its execution time is roughly the same for any number of rows, but, as for me, you cannot determine whether it is faster than plain SQL without some actual testing on some real data."

3. "Today's quiz is somewhat ambiguous - the problem is with 'roughly the same'. I do not know the implementation details of INDEX BY tables, but I may relatively safely assume that they are map-based (either hash or tree). For this reason their access time should be n*log(n). Not sure if that is covered by 'roughly'."

4. "There are extra assumptions made that are not real. In a normal situation I would say that none of the procedures are safe to do. There are indexes on the database. That is fast enough. Don't create code that makes it complex to survive changes to the database. Maintenance on systems costs more than the development. Optimizing speed can be done most of the time by creating good indexes or partitioning the data."

5. "The question stated 'that the execution times for both functions will be (a) faster than repeated SQL queries of the plch_parts table and (b) roughly the same, regardless of the number of rows in the table?' Your explanation just showed that your solution works but does not *prove* (a) or (b); it does not even explain it by stating that using a collection is faster than an unindexed query on a relatively big table, especially when (unlimited memory) the whole table might be cached."

6. "I've been enjoying the daily quiz since it began, and this is the first time I've felt the need to question one of the answers! In the 1 December quiz, one of the answers related to looking up a key from one associative array, then using that key to index into the second array. At the time I agonised over whether the performance would be 'roughly the same' as the function that did just one lookup. On most systems the times we are talking about would be small, but for very large arrays, and if the function is called very often, then two lookups instead of one could be significant to response time. So I decided that 'twice as long' is not 'roughly the same' and did not tick this choice as correct; this was marked as incorrect. Therefore I question whether the specification of this question using 'roughly the same' is sufficiently clear to allow us to choose the right answer."

OK, let's leave it at that. I will make a few comments and then open it up for discussion.

First, it is true that I did not include in the verification code the scripts needed to prove my claims. I will try to find some time to do that over the next couple of days. But if one of you has the time to do so and can post your code and results, that would be fantastic (either proving my claim or disproving it).

Second, yes, certainly using language like "roughly the same" leaves room for interpretation and some ambiguity. But I hope and plan to prove with my performance scripts that it is a reasonable statement to make.

Finally, regarding the comment that my assumptions are not "safe": yes, I agree. You cannot assume for your production applications that you have unlimited memory, for example. And I am glad you pointed this out. When crafting quizzes that must by their nature involve the smallest amount of code and complexity possible (and this question had LOTS of code in it - too much, I fear), I must make some not-real-world assumptions.

Cheers, SF
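
P.S. For those who did not take the quiz, here is a rough sketch of the caching technique the question was about. This is not the exact quiz code: the package, type and subprogram names are illustrative, I am assuming only that plch_parts has a numeric partnum primary key and a VARCHAR2 partname column (the 100-character limit below is a guess), and the real quiz code loaded its collections on the first lookup rather than in an initialization section.

CREATE OR REPLACE PACKAGE plch_parts_cache
IS
   FUNCTION onerow_by_num (partnum_in IN plch_parts.partnum%TYPE)
      RETURN plch_parts%ROWTYPE;

   FUNCTION onerow_by_name (partname_in IN plch_parts.partname%TYPE)
      RETURN plch_parts%ROWTYPE;
END plch_parts_cache;
/

CREATE OR REPLACE PACKAGE BODY plch_parts_cache
IS
   /* Full rows, indexed by the numeric primary key. */
   TYPE parts_by_num_t IS TABLE OF plch_parts%ROWTYPE
      INDEX BY PLS_INTEGER;

   /* The "emulated index": maps a part name to its part number. */
   TYPE num_by_name_t IS TABLE OF plch_parts.partnum%TYPE
      INDEX BY VARCHAR2 (100);

   g_parts_by_num   parts_by_num_t;
   g_num_by_name    num_by_name_t;

   PROCEDURE load_arrays
   IS
      l_parts   parts_by_num_t;   -- dense staging copy, indexed 1 .. COUNT
   BEGIN
      /* One pass over the table fills both collections. */
      SELECT * BULK COLLECT INTO l_parts FROM plch_parts;

      FOR indx IN 1 .. l_parts.COUNT
      LOOP
         g_parts_by_num (l_parts (indx).partnum) := l_parts (indx);
         g_num_by_name (l_parts (indx).partname) := l_parts (indx).partnum;
      END LOOP;
   END load_arrays;

   FUNCTION onerow_by_num (partnum_in IN plch_parts.partnum%TYPE)
      RETURN plch_parts%ROWTYPE
   IS
   BEGIN
      RETURN g_parts_by_num (partnum_in);
   END onerow_by_num;

   FUNCTION onerow_by_name (partname_in IN plch_parts.partname%TYPE)
      RETURN plch_parts%ROWTYPE
   IS
   BEGIN
      /* Two lookups: name -> part number, then part number -> row. */
      RETURN g_parts_by_num (g_num_by_name (partname_in));
   END onerow_by_name;
BEGIN
   load_arrays;
END plch_parts_cache;
/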

9 comments:

  1. I agree with the concerns about the "roughly the same" statement. I think that the indexes defined on the table provide "roughly the same" performance, especially under "unlimited memory" conditions, where I assume the index would be resident in memory, and they are safer and more scalable.

    ReplyDelete
  2. Hello All,
    As we all know, performance issues are always the most fishy ones ...
    If the comparison is between performing those "single row lookups" again and again by a simple SQL select lookup versus a PL/SQL table lookup, then yes, the PL/SQL table lookup will be faster.

    But if such a lookup can be included as an additional join in the main select of the whole batch process, then that would be the fastest way to go.

    If, as stated in the assumption, ALL the index values will be effectively used for the lookup, then including the lookup table in the join will probably generate a HASH join between the lookup table and the main query's result set, which means that most probably the lookup table will be accessed ONCE only, using a FULL TABLE SCAN, which is practically the same thing that the "LOAD_ARRAYS" subprogram in the quiz is doing.
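
    Just to illustrate the idea with a sketch (the plch_order_lines table and its columns here are purely hypothetical; only plch_parts comes from the quiz):

    SELECT ol.order_id,
           ol.partnum,
           p.partname                   -- resolved by the join, not row by row
      FROM plch_order_lines ol          -- hypothetical batch driving table
           JOIN plch_parts p
              ON p.partnum = ol.partnum;

    With enough distinct part numbers involved, the optimizer can scan plch_parts once and hash-join it to the batch rows.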

    If speaking about best practices, then of course we use such "do it yourself" data caching in our applications (we even have some clever uses of it), but usually for more complex cases rather than for simple one-by-one lookups.

    Performance is a different "kind of beast" than PL/SQL; it almost inherently requires "a different kind of person" than PL/SQL does.

    Many times, what is most elegant, most modular and most conformant with coding best practices in PL/SQL may become a nightmare for those who tune for performance ...

    Best Regards,
    Iudith

    ReplyDelete
  3. I deliberately chose none of the options as correct. I couldn't possibly choose any option which loaded data into arrays when the problem implied there could be many thousands of entries and that initialisation is required when you perform the first single-row lookup using the function. That, by the definition of the question, is not better performance or even similar. But now, seeing the answers, I can see where you were going with it (but I still wouldn't do it).

    This question again proves that getting simple, short and reproducible test cases is a very demanding and difficult job (take note, Oracle Support!).

    Have a think about how long it would take you to put together a complex question (with wrong answers) like this and then to make it bulletproof against over a thousand experienced people around the world.

    It's a tough job and Steven does it very well.

    Glen

    ReplyDelete
  4. Please check out the verification code for the 1 December quiz; search for the text "Here is code you can to validate the performance claims made in this quiz:".

    I put together a performance comparison of repeated SQL queries vs collection caches. Here's what I found:

    Compare performance for 100000 iterations

    "Repeated queries" completed in: 3.4
    "Collection lookup by partname" completed in: .24
    "Collection lookup by partnum" completed in: .05

    Check performance of "full collection scan" for 1000 iterations

    "Repeated queries" completed in: .03
    "Full collection scan by partname" completed in: 5.19
    "Collection lookup by partnum" completed in: 0

    In other words, by switching from repeated execution of a query to retrieval by part number and part name, the elapsed time went down from 3.4 seconds to .05 and .24 seconds, respectively.

    The additional overhead of doing a string-indexed lookup is, in this example, .00019 hundredths of a second per lookup.

    So I say: "roughly the same."

    The cost of doing a scan through the collection is very high, demonstrating the value of creating a second collection to serve as an index into the first.
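
    For reference, here is a stripped-down sketch of the kind of timing block behind those numbers - not the exact verification script. It assumes the illustrative plch_parts_cache package sketched in the P.S. of the post above, and that a part with partnum 1 and partname 'Part 1' exists (both assumptions, not quiz data):

    DECLARE
       c_iterations   CONSTANT PLS_INTEGER := 100000;
       l_start        NUMBER;
       l_row          plch_parts%ROWTYPE;

       PROCEDURE show_elapsed (label_in IN VARCHAR2)
       IS
       BEGIN
          /* GET_CPU_TIME returns hundredths of a second; report seconds. */
          DBMS_OUTPUT.put_line (
                '"' || label_in || '" completed in: '
             || TO_CHAR ((DBMS_UTILITY.get_cpu_time - l_start) / 100));
          l_start := DBMS_UTILITY.get_cpu_time;
       END show_elapsed;
    BEGIN
       l_start := DBMS_UTILITY.get_cpu_time;

       FOR indx IN 1 .. c_iterations
       LOOP
          SELECT * INTO l_row FROM plch_parts WHERE partnum = 1;
       END LOOP;
       show_elapsed ('Repeated queries');

       FOR indx IN 1 .. c_iterations
       LOOP
          l_row := plch_parts_cache.onerow_by_name ('Part 1');
       END LOOP;
       show_elapsed ('Collection lookup by partname');

       FOR indx IN 1 .. c_iterations
       LOOP
          l_row := plch_parts_cache.onerow_by_num (1);
       END LOOP;
       show_elapsed ('Collection lookup by partnum');
    END;
    /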

    ReplyDelete
  5. Steven, your explanation of the solution where both associative arrays contain the full row information incorrectly states that the table indexed by name contains only the primary key value (a cut-and-paste ghost?).

    The requirement’s description with the phrase "roughly the same" is ambiguous on at least two levels: what does "roughly the same" mean, and to what does it apply? The significance of the ambiguity inherent in "roughly the same" is strongly influenced by one’s interpretation of its application. (Warning: splitting of linguistic hairs ahead!) I interpreted the requirement as being that each function, when evaluated independently, should return in the same time regardless of the number of items in the associative array; however, because you said "both functions" instead of "each function", it could be interpreted as meaning that either function should run as quickly as the other. From the comments you listed, at least one person appears to have responded based on the latter interpretation.

    The presence of unlimited memory implies that hash lookups used to locate items in an associative array will be free of collisions and thus will execute in constant time regardless of the number of items in the collection. Given the preceding assumption, interpreting the "same time" requirement as applying to each function independently makes the imprecise meaning of "roughly" irrelevant. If the alternate interpretation is used, whereby each function’s execution time is compared to that of the other, then clearly the implementation that first looks up the part number by part name and then calls the part number lookup function will run more slowly than the part number lookup. This alternate interpretation could also invalidate the choice where each function directly retrieves the desired row, since the relative performance of hash lookups of numeric types as compared to character string types is an unknown.
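
    To make the two lookup strategies concrete, here is a sketch (the names are illustrative, not the quiz code; g_num_by_name is assumed to map part names to part numbers, while g_parts_by_num and g_parts_by_name are assumed to hold full rows indexed by part number and part name, respectively):

    /* Strategy 1: the name-indexed collection holds only the part
       number, so a lookup by name costs two associative-array probes. */
    FUNCTION onerow_by_name (partname_in IN plch_parts.partname%TYPE)
       RETURN plch_parts%ROWTYPE
    IS
    BEGIN
       RETURN g_parts_by_num (g_num_by_name (partname_in));
    END onerow_by_name;

    /* Strategy 2: each collection holds a full copy of the row, so a
       lookup by name is a single probe, at the cost of duplicating the
       row data in memory. */
    FUNCTION onerow_by_name (partname_in IN plch_parts.partname%TYPE)
       RETURN plch_parts%ROWTYPE
    IS
    BEGIN
       RETURN g_parts_by_name (partname_in);
    END onerow_by_name;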

    Even though there is some ambiguity in the statement of the requirements, I don’t feel that it warrants a change in scoring.

    The argument that repeated SQL queries could be as fast as the equivalent lookups into a PL/SQL associative array is easily refuted. All SQL queries invoked from PL/SQL require that the SQL be processed by the SQL engine. Even if the query has been cached, this cache must be searched before the query can be executed. When the query is executed, the appropriate index must be searched to identify the associated row, and then the row data must be retrieved. Even if each of the lookups in the SQL engine (SQL cache, index, data) performs as efficiently as indexing into a PL/SQL associative array, there are still three lookups instead of one.

    I quite enjoy the discussions that follow many of the quizzes and view them as an opportunity to learn and occasionally to teach. I am especially pleased when the dialogue debates the merits of various concepts and approaches. Even discussions centered on perceived ambiguity have proven valuable to me since they have encouraged me to be more precise in my technical writing.

    This quiz definitely had a lot of code. I think that is not always a bad thing for an advanced question even though it does result in significant time penalties. I hope that those who were able to correctly respond to the quiz the fastest will do the rest of us a favor by sharing their secrets for speedy and accurate code evaluation. Such skills would be extremely valuable to me and my colleagues.

    ReplyDelete
  6. Thanks, jhall62; I have fixed that typo in the answer.

    ReplyDelete
  7. While I felt the "repeated SQL queries" in (a) was ambiguous, I answered in the context of the batch job described. For example: "would the batch job run quicker with this package rather than its current repeated SQL queries?".

    I felt that (b) (roughly the same regardless of the number of rows in the table), was not about absolute performance at all, but was only about how the performance would scale with changes in row counts.

    Finally, to be picky about Steven's performance testing: the quiz was for 10gR2 and the performance testing was done on 11gR2.

    ReplyDelete
  8. And good points about "both". I will separate that one long sentence into two, to make the meaning clearer.

    ReplyDelete
  9. As for being picky, I am sure you will find, if and when you run that code on a 10.2 database, that the performance delta will be the same or even greater.

    ReplyDelete