27 May 2010

26 May 2010 quiz: UTL_FILE and newline character: were we wrong?(323)

Dozens of players have questioned the validity of several of the answers we have marked as correct for the quiz on 26 May 2010.
We've seen two different objections:
1. 1. "The answer using CHR(10) is not correct as CHR(10) is not the platform-specific line terminator(s) on Windows and some other OSs".
2. 2. "The answers that put a newline after "abc" are not valid answers according to your specification."
We'll answer each objection in turn.
CHR(10)
We can certainly understand the response we got, since we felt the same way when it was initially brought to our attention. At that point, however, we took a closer look and discovered some very interesting behavior by Oracle and UTL_FILE regarding the handling of line terminators. It is correct that CHR(10) (LineFeed or LF) is not the platform-specific line terminator(s) on Windows (that would be CHR(13) || CHR(10) -- CarriageReturn + LineFeed or CF + LF) but the file is opened in text mode (default) so Oracle translates any LF to the platform-specific line terminator(s) on the platform of the host. So if it runs on Windows, LF is translated to CR + LF, if it runs on Linux, LF stays as LF.
This can be verified on Windows through the following code sample (assuming that the directory TEMP points to c:\temp):
host del c:\temp\quiz.txt
DECLARE
l_file UTL_FILE.file_type;
BEGIN
l_file := UTL_FILE.FOPEN('TEMP', 'quiz.txt', 'W');
UTL_FILE.PUT (l_file, 'First' || chr(10));
UTL_FILE.PUT (l_file, 'Second' || chr(13) || chr(10));
UTL_FILE.PUT (l_file, 'Third' || chr(10) || chr(13));
UTL_FILE.PUT_LINE (l_file, 'Fourth');
UTL_FILE.PUTF (l_file, 'Fifth\n');
UTL_FILE.FCLOSE (l_file);
END;
/
host debug c:\temp\quiz.txt
If you're on Windows Vista or newer, debug is no longer part of your Windows version. ). To get a hex dump of a file under Windows one can use a free utility called HexDump32:
This will demonstrate that the contents of the file is actually (OD = CR, 0A = LF):
00000000 46 69 72 73 74 0D 0A 53-65 63 6F 6E 64 0D 0D 0A First..Second...
00000010 54 68 69 72 64 0D 0A 0D-46 6F 75 72 74 68 0D 0A Third...Fourth..
00000020 46 69 66 74 68 0D 0A - Fifth..
So when a file is opened in text mode, UTL_FILE does this:
1. First: CHR(10) is written as OD OA, CHR(13) + CHR(10).
2. Second: CHR(13) || CHR(10) is written as OD OD OA, CHR(13) + CHR(13) + CHR(10).
3. Third: CHR(10) || CHR(13) is written as OD OA OD, CHR(13) + CHR(10) + CHR(13).
4. Fourth: PUT_LINE correctly puts OD OA, CHR(13) + CHR(10).
5. Fifth: PUTF correctly puts \n as OD OA, CHR(13) + CHR(10).
On Linux, you can similarly verify the behavior (assuming that the directory TEMP points to /tmp):
host rm /tmp/quiz.txt
DECLARE
l_file UTL_FILE.file_type;
BEGIN
l_file := UTL_FILE.FOPEN('TEMP', 'quiz.txt', 'W');
UTL_FILE.PUT (l_file, 'First' || chr(10));
UTL_FILE.PUT (l_file, 'Second' || chr(13) || chr(10));
UTL_FILE.PUT (l_file, 'Third' || chr(10) || chr(13));
UTL_FILE.PUT_LINE (l_file, 'Fourth');
UTL_FILE.PUTF (l_file, 'Fifth\n');
UTL_FILE.FCLOSE (l_file);
END;
/
host hexdump –C /tmp/quiz.txt
Why does Oracle do this? Probably because UTL_FILE is implemented in C and this is how it's done in C's functions fopen and fprintf when a file is opened in text mode. Admittedly, it's not well described in the Oracle documentation.
You can also open files with UTL_FILE in byte mode but in this case you can only use procedures PUT_RAW and GET_RAW. If you use any of the other subprograms UTL_FILE will raise an UTL_FILE.INVALID_OPERATION exception. With byte mode, Oracle doesn't do any translation of newline characters, which is demonstrated in the following (to be run under Windows):
host del c:\temp\quiz.bin
DECLARE
l_file UTL_FILE.file_type;
BEGIN
l_file := UTL_FILE.FOPEN('TEMP', 'quiz.bin', 'wb');
UTL_FILE.PUT_RAW (l_file, utl_raw.cast_to_raw('First' || chr(10)));
UTL_FILE.PUT_RAW (l_file, utl_raw.cast_to_raw('Second' || chr(13) || chr(10)));
UTL_FILE.PUT_RAW (l_file, utl_raw.cast_to_raw('Third' || chr(10) || chr(13)));
UTL_FILE.FCLOSE (l_file);
END;
/
host debug c:\temp\quiz.bin
This demonstrates the following contents:
00000000 46 69 72 73 74 0A 53 65-63 6F 6E 64 0D 0A 54 68 First.Second..Th
00000010 69 72 64 0A 0D - ird..
Newline after "abc"
We argue that the line terminator (= newline) is part of the line (like an End Of File is part of a file so is End Of Line part of a line) so as long as "abc" is put on a separate line (with optional line terminator), coming after 2 empty lines (with line terminator) the code meets the specification.
Conclusion
For all the reasons stated above, we contend that the quiz of 26 May 2010 is correct; scores and ranking will remain "as is."

14 comments:

  1. All of the examples above are based on platforms with ASCII control characters,
    none of them address platforms with EBCDIC control characters (e.g., Oracle 10gR2 on IBM z/OS using the UTFE character set). I don't have access to Oracle on z/OS, so I will concede that passing "\n" through UTL_FILE.PUT will likely produce the prescribed effect (i.e., a line termination); however, it seems unreasonable to expect that UTL_FILE should transalate CHR(10) (i.e., RPT, the "repeat" control character in EBCDIC) into a line termination on an EBCDIC platform.

    It has been many years since I last used EBCDIC, but I distinctly remember the pains caused by working with both ASCII and EBCDIC character sets (e.g., different control characters, different sort orders when relying on the binary representation of characters, lazy character range checking, etc.).

    ReplyDelete
  2. The following is directly from the SQL Reference for 10gR2:

    "Use of the CHR function (either with or without the optional USING NCHAR_CS clause) results in code that is not portable between ASCII- and EBCDIC-based machine architectures."

    As you stated, the Oracle documentation does not describe some of the behavior of UTL_FILE very well; it is even less informative about the implications of running on an EBCDIC based platform.

    ReplyDelete
  3. Thanks for the detailed response. I did not get a higher score but learning something new is worth much much more :-)

    ReplyDelete
  4. Steven, thanks for the explanation. That's a lessons learned for me ;)

    ReplyDelete
  5. You said: "UTL_FILE.PUT puts text into the file buffer, but does not append a newline character.

    OK, but I can always provide the newline character myself, which is what I did in this block. Oracle will ensure that CHR(10) is translated to the platform-specific line terminator(s)."

    And I will reply:
    "You cannot depend on the behavior described above. Even if you observe that it worked like that, you cannot count on that."
    "Unless you can prove that Oracle 10g Release 2 with all combinations of platforms and patch sets/patches ensures the newline character will be translated to a platform specific end of line character, you can't rely on it.
    Similarly, many Oracle developers for years relied on GROUP BY also doing an ORDER BY, but this was changed with Oracle 10g Release 1."

    Please look at http://www.oracle.com/technology/oramag/oracle/06-mar/o26asktom.html for similar response about trigger ordering.
    Note: This reply is rephrased reply which I have received from Steven Feuerstein and later Finn Ellebaek Nielsen at Saturday, 2010-05-01

    ReplyDelete
  6. As we wrote UTL_FILE relies on the behavior of C's fopen and fprintf. I'm fairly confident that this will ensure the correct line terminator(s) on any platform that Oracle runs on. This is not going to change as we've had UTL_FILE provided by Oracle for quite a few years now.

    I don't have access to a database instance running on an EBCDIC-based platform. It would be interesting if any players could test the behavior there.

    Any takers?

    ReplyDelete
  7. According to the Oracle SQL Reference, "Use of the CHR function (either with or without the optional USING NCHAR_CS clause) results in code that is not portable between ASCII- and EBCDIC-based machine architectures."

    The explanations/examples defending the quiz's answers do not address behavior on EBCDIC platforms; only ASCII platforms were used. While I don't have access to Oracle on an EBCDIC platform (and thus can't prove my theory), it seems unreasonable to expect that UTL_FILE would translate CHR(10) (which is RPT, the "repeat" control character in EBCDIC) into a line termination in such an environment.

    Unexpected behavior caused by the use of literal control characters is one of many issues that I had to deal with many years ago when I wrote code for both EBCDIC and ASCII platforms.

    ReplyDelete
  8. As stated in the Oracle PL/SQL Packages and Types Reference, the UTL_FILE.PUTF procedure "works like a limited printf()" and thus strings such as "\n" will produce the appropriate platform-specific line terminator. However, the examples given in the original post clearly show that UTL_FILE is not relying solely on C's fprint or fprintf to process the passed strings. When using fprintf the "\n" in the format string may result in a multi-character line terminator depending on the platform; however, both fprint and fprintf pass control characters (e.g., LF (0x0A) and CR (0x0D)) unchanged to the output. The first example demonstrates that UTL_FILE.PUT is doing additional processing of at least some control characters. This is outside of the specified behavior of the C library routines and is not specified in the Oracle UTL_FILE documentation. Given this, there is no basis for assuming that character 0x0A ("linefeed" in ASCII, "repeat" in EBCDIC) will be changed into a line terminator by UTL_FILE on an EBCDIC platform (there are, however, many reasons to hope that it would not).

    I'm not particularly concerned about having the quiz scores adjusted, but do want to warn others against relying on undocumented behavior. As adderek noted, unspecified behavior should not be relied upon -- it may change.

    On a side note, I'm glad this forum has been established. As Johannes posted, the opportunity to acquire new knowledge is far more important than a few quiz points. The issues we discuss may inspire each of us to improve the quality of our code.

    ReplyDelete
  9. I'm afraid that fprintf does translate some control characters. If the file is opened in text mode \n is translated to the platform-specific line terminator(s).

    Example: Run this under Windows:

    #include

    int main() {
    FILE *file;

    file = fopen("c:\\temp\\quiz.txt", "wt");
    if (file) {
    fprintf(file, "First\n");
    fprintf(file, "Second\r\n");
    fclose(file);
    }

    file = fopen("c:\\temp\\quiz.bin", "wb");
    if (file) {
    fprintf(file, "First\n");
    fprintf(file, "Second\r\n");
    fclose(file);
    }
    }

    and you'll the the following contents:

    quiz.txt:

    0D31:0100 46 69 72 73 74 0D 0A 53-65 63 6F 6E 64 0D 0D 0A First..Second...

    quiz.bin:

    0D31:0100 46 69 72 73 74 0A 53 65-63 6F 6E 64 0D 0A First.Second..

    This is how C has always behaved.

    I'm not to happy about relying on semi-documented Oracle features either but the point here is that if you do the obvious (CHR(13) + CHR(10) with UTL_FILE.PUT) you get the wrong result.

    ReplyDelete
  10. I was one of the persons that issued both the issues. And I found the reasoning very clear. One of the reaons I prefer the PUTLINE or the \n. But the CHR(10) translation to two characters was a surprise.
    As said before, no rescoring needed. This discussion is far more useful.

    ReplyDelete
  11. You are correct about the behavior of fprintf on Windows platforms. Most of my C programming experience has been on Unix/Linux and embedded systems and I had forgotten about Windows’ disparate handling of text and binary mode files (oh, the dangers of inference from inadequate sampling).

    The “\n” used in the demonstration code is not a control character; it is an escape sequence that generates a platform-specific newline control character. IEEE Std. 1003.1 states:

    “In addition to the escape sequences shown in the Base Definitions volume of IEEE Std 1003.1-2001 Chapter 5, File Format Notation ( '\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v' ), "\ddd", where ddd is a one, two, or three-digit octal number, shall be written as a byte with the numeric value specified by the octal number.” (http://www.opengroup.org/onlinepubs/009695399/utilities/printf.html)

    On ASCII platforms the “\n” and “\012” escape sequences evaluate to the same byte value (decimal 10). On EBCDIC platforms the “\n” and “\045” escape sequences each evaluate to the same byte value (decimal 37). As shown in the demonstration code, on Windows platforms, when writing to a file that has been opened in text mode, newlines are translated to CR-LF pairs. This occurs regardless of whether “\n” or “\012” is used to embed the newline character (it should be noted that such behavior conflicts with the IEEE specification cited above). Passing CHR(10) to UTL_FILE.PUT is the equivalent of passing “\012” to the C I/O library utilities; it is not the equivalent of “\n” on all platforms.

    I failed to select two of the correct answers because I read too quickly and failed to notice that they used PUTF instead of PUT. I did not misread the candidate answer that used CHR(10) but answered based on the knowledge that Oracle 10gR2 is available for EBCDIC platforms and that CHR(10) is not a newline on such platforms. The interpretation of control codes is platform specific and is not properly part of PL/SQL. The documentation for UTL_FILE does not specify the control character values but instead states that PUT_LINE, “\n” with PUTF, or NEWLINE with PUT should be used to end a line.

    I recognize the difficulties with creating questions and candidate answers that are free of ambiguity and unstated assumptions and greatly appreciate the openness with which challenges have been accepted. Since neither the quiz administrators nor the participants seem to have access to non-ASCII platforms, I suggest that the declared assumptions be amended to exclude such platforms.

    ReplyDelete
  12. In relation to the quiz question from 2010-04-30 about the order of trigger execution order:

    "It seems that answer 'you cannot tell what order of the trigger is execution' is the correct one BUT IT IS NOT TRUE. Triggers are fired in the order that depends on their OID (order is reversed). So you actually can tell what will be the order. That way answer 'There is no way to know for sure' is not correct. Oracle will execute the trigger with highest OID as last - so it seems like answer 'triggers are executed in the order in which they are created' could be correct as well (because the OID is sequential) however the order is reversed - thus it is not correct. I have given the answer that you can never know (that is what Oracle documentation mentioned somewhere) although I don't really feel that the answer is correct."

    This is a cite from my mail. I can perform a set of experiments like you have presented - but is there any point to do it?
    Note that I have described this "semi-documented Oracle feature" and given an answer that I felt that is expected. It is not about how it works and I am not interested in re-calculation of the points. I am only interested in clarification of the answers and not saying that "this is the correct answer because we can find a situation when it will be true although".

    ReplyDelete
  13. I'm also not happy with using a non-document /hidden feature for the automatic CHR(10) translation in a program, and I'm sure that anyone that new the distinction between /r and /n would also think that chr(10) was not the solution unless they had make that tracking of that "feature".
    Using /n would be the best pratice in that case and not relying in an automatic chr(10) translation that was a surprise to everybody in the challenge (including the creators)

    ReplyDelete
  14. You know what's even worse than the automatic CHR(10) translation? The fact that it happens not only at the end of a line but whenever the character occurs!

    Let's assume the following VARCHAR2 in a variable vText:

    "This a two-line
    text"

    As in, you have a LF/CHR(10) stored in the varchar2 between "line" and "text". If you write this to a file like this:

    UTL_FILE.PUT_LINE(OutputFile, vText);

    Then you end up with the CHR(10) WITHIN the varchar being replaced with a CHR(13) CHR(10). Needless to say, this is very destructive if the file you were writing was a script that manipulates the database again. E.g. if you create an INSERT statement in the file like this, you will end up with faulty data in the newly created line. You might even exceed the size limit of the VARCHAR2 if the original one was close to it and had multiple linebreaks. This is, imho, a bug. The replacement of CHR(10) should only ever happen at the end of a line.

    ReplyDelete