In all database systems that I’ve worked on there’s been one requirement that’s universal… the ability to extract data for loading into Excel (note, I specifically mean Microsoft Excel and not just “a spreadsheet package”). While virtually all tools have some sort of facility to accomplish this, if you need to automate the data extract and are coding it in PL/SQL then it normally comes down to producing a CSV (comma separated values) file.
Oracle 11g included an enhancement that made the creation of CSV files from cursors trivially easy: the routine DBMS_SQL.TO_CURSOR_NUMBER. The key problem prior to the introduction of this routine had been interpreting the content of the query; either the structure of the cursor had to be predefined, or dynamic SQL and extensive use of DBMS_SQL was required. Using DBMS_SQL.TO_CURSOR_NUMBER it’s possible to open a REF CURSOR, interpret its structure and take action on it… such as outputting its content in CSV format. The package PKG_CSV is designed to do exactly this.
There are 3 parts to PKG_CSV:
- routines that define how data types are to be cast to strings
- routines that define how fields are to be delimited (a comma is the norm but sometimes requirements specify otherwise) and other special character handling
- routines that do the core output from REF CURSOR to CSV format
Casting Data Types to Strings
Within PKG_CSV, the TO_CHAR function takes care of formatting the various data types to a character representation, but for many data types a format specifier is necessary. Default format specifiers have been set in PKG_CSV, and a series of routines named “set_datatype_format” is provided to permit customisation. Since the format specifiers are stored as package variables, customisations are specific to the session only.
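For example, to switch date output to ISO format for the current session, you would call the appropriate setter (the routine name and signature below are illustrative; check the package specification for the exact names):

```sql
BEGIN
   -- hypothetical setter name from the "set_datatype_format" family;
   -- the change persists only for the life of the session
   pkg_csv.set_date_format ('yyyy-mm-dd hh24:mi:ss');
END;
/
```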
On the subject of data types, not all SQL data types are supported. I drew an arbitrary line as to what the package would support. For example, I decided not to include CLOBs since outputting a large text object, complete with new lines, is at odds with the CSV format. Naturally, you are free to fill any gaps according to your requirements.
Another point to note about data types is that DBMS_SQL uses the type codes that may be found in the Oracle SQL Reference. These are contained as package level constants in the body. However, I found when testing the package that the codes for certain data types did not match the constants that were noted in the documentation. Whether this is a documentation inconsistency or a database platform/version inconsistency I don’t know. Ideally, the package should probably not have these hard coded but load them up from a table on package initialisation.
Since we’re dealing with translating data to strings, it’s also worth considering the size limitations that PL/SQL imposes on us. A VARCHAR2 column in a table has a maximum size of 4000 bytes but a VARCHAR2 variable in PL/SQL can hold up to 32k bytes. The latter limit is the one we need to be aware of, and for a single line in a CSV file it should be acceptable.
Delimiters and other characters
CSV files will often quote string fields. This is useful if the strings contain commas (or whatever character is being used to delimit the fields). However, the problem then becomes how to handle embedded delimiters. The package has 3 strategies; double up on the delimiter, escape the delimiter or remove embedded delimiters. The default strategy for the package is to double quote strings with a doubling up of embedded double quotes.
Another complication for producing CSV files is with strings that contain new lines and other special characters. PKG_CSV tracks a set of characters that will be removed from string fields. By default carriage return and line feed characters are stripped but this is configurable via the procedure set_remove_chars.
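Putting the last two points together, the default strategy for a string field can be sketched as a standalone function: strip carriage returns and line feeds, double up any embedded double quotes, then enclose the result in double quotes. This is a simplified illustration, not the package’s actual code:

```sql
CREATE OR REPLACE FUNCTION csv_quote (p_text IN VARCHAR2)
RETURN VARCHAR2
AS
BEGIN
   -- TRANSLATE removes CR and LF (the 'x' placeholder keeps the
   -- "to" string non-empty); REPLACE doubles embedded double quotes
   RETURN '"' ||
          REPLACE (TRANSLATE (p_text, 'x' || CHR(13) || CHR(10), 'x')
                  ,'"'
                  ,'""') ||
          '"';
END csv_quote;
/

-- csv_quote ('say "hello"') returns '"say ""hello"""'
```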
Core Output Routines
At its core, PKG_CSV simply takes a REF CURSOR, converts it to a DBMS_SQL cursor, interrogates the cursor to determine its structure, then fetches from the cursor, formatting each row obtained before outputting the result. The routines exposed are:
- open_handle: a function that accepts a REF CURSOR and returns a cursor number. The routine also performs the interrogation of the structure of the cursor.
- a procedure that closes the specified handle if it’s open. It does nothing, not even raise an exception, if the handle is closed or does not exist.
- a function that returns TRUE if the specified handle is open, FALSE otherwise.
- get_headers: a function that returns the cursor column names as a string.
- get_line: a function that returns a CSV formatted line from the cursor.
- a function that returns the entire contents of a cursor in CSV format as a CLOB.
- to_file: a procedure that sends the entire contents of a cursor to a file.
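The interrogation step that open_handle performs is built on two DBMS_SQL calls; a minimal sketch of converting a REF CURSOR and describing its columns might look like this (the per-column handling in PKG_CSV itself is rather more involved):

```sql
DECLARE
   l_cursor   SYS_REFCURSOR;
   l_handle   NUMBER;
   l_col_cnt  PLS_INTEGER;
   l_desc_tab dbms_sql.desc_tab;
BEGIN
   OPEN l_cursor FOR
      SELECT *
      FROM   user_objects;
   -- convert the REF CURSOR to a DBMS_SQL cursor number...
   l_handle := dbms_sql.to_cursor_number (l_cursor);
   -- ...and interrogate its structure
   dbms_sql.describe_columns (l_handle, l_col_cnt, l_desc_tab);
   FOR i IN 1..l_col_cnt
   LOOP
      dbms_output.put_line (l_desc_tab(i).col_name || ': type code ' ||
                            TO_CHAR(l_desc_tab(i).col_type));
   END LOOP;
   dbms_sql.close_cursor (l_handle);
END;
/
```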
Using the above, creating a CSV file from a cursor is simply a matter of:
DECLARE
   l_cursor    SYS_REFCURSOR;
   l_filename  VARCHAR2(100) := 'some_file.csv';
   l_directory VARCHAR2(100) := 'SOME_DIRECTORY';
BEGIN
   OPEN l_cursor FOR
      SELECT *
      FROM   dba_objects;
   pkg_csv.to_file (l_cursor, l_directory, l_filename);
END;
/
Processing each line from the cursor can be done by something like the following:
DECLARE
   l_cursor    SYS_REFCURSOR;
   l_handle    NUMBER;
   l_csv_line  VARCHAR2(10000);
   l_more_data BOOLEAN := TRUE;
BEGIN
   OPEN l_cursor FOR
      SELECT *
      FROM   dba_objects
      WHERE  rownum <= 200;
   l_handle := pkg_csv.open_handle (l_cursor);
   dbms_output.put_line (pkg_csv.get_headers (l_handle));
   WHILE (l_more_data)
   LOOP
      BEGIN
         l_csv_line := pkg_csv.get_line (l_handle);
         dbms_output.put_line (l_csv_line);
      EXCEPTION
         WHEN NO_DATA_FOUND THEN
            l_more_data := FALSE;
      END;
   END LOOP; -- while (l_more_data)
END;
/
The ability to convert each row from the cursor, as implemented by the GET_LINE routine, is crying out for a pipelined function implementation. Unfortunately the underlying call to DBMS_SQL.TO_CURSOR_NUMBER requires that the REF CURSOR be passed in as an IN OUT parameter, and pipelined functions only support IN parameters.
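One possible workaround, if a pipelined interface is wanted, is to have the caller convert the cursor first via open_handle and pipeline over the resulting handle, which being a NUMBER can be passed as an IN parameter. A sketch, assuming a schema-level collection type and a hypothetical wrapper name:

```sql
CREATE OR REPLACE TYPE t_csv_line_tab AS TABLE OF VARCHAR2(4000);
/

-- hypothetical pipelined wrapper around PKG_CSV: the REF CURSOR is
-- converted by the caller, so only the numeric handle is passed in
CREATE OR REPLACE FUNCTION csv_rows (p_handle IN NUMBER)
RETURN t_csv_line_tab PIPELINED
AS
BEGIN
   LOOP
      -- get_line raises NO_DATA_FOUND once the cursor is exhausted
      PIPE ROW (pkg_csv.get_line (p_handle));
   END LOOP;
EXCEPTION
   WHEN NO_DATA_FOUND THEN
      RETURN;
END csv_rows;
/
```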
One concern with flexible approaches is the performance penalty that may be incurred. To find out whether PKG_CSV introduces significant overhead compared with a more direct approach to producing a CSV file, I created the following script to write out 2,000 entries from DBA_OBJECTS:
SET ECHO ON
SET TIMING ON

DECLARE
   l_cursor SYS_REFCURSOR;
BEGIN
   OPEN l_cursor FOR
      SELECT *
      FROM   dba_objects
      WHERE  rownum <= 2000;
   pkg_csv.to_file (l_cursor, 'DEV_DATA', 'csv_text.txt');
END;
/

DECLARE
   TYPE t_tab IS TABLE OF dba_objects%ROWTYPE;
   l_tab         t_tab;
   l_text        VARCHAR2(4000);
   l_file_handle utl_file.file_type;
BEGIN
   SELECT *
   BULK COLLECT INTO l_tab
   FROM   dba_objects
   WHERE  rownum <= 2000;
   l_file_handle := utl_file.fopen (location     => 'DEV_DATA'
                                   ,filename     => 'DBA_OBJECTS_direct.txt'
                                   ,open_mode    => 'w'
                                   ,max_linesize => 4000);
   FOR i IN l_tab.FIRST..l_tab.LAST
   LOOP
      l_text := '"' || l_tab(i).owner          || '"' || ',' ||
                '"' || l_tab(i).object_name    || '"' || ',' ||
                '"' || l_tab(i).subobject_name || '"' || ',' ||
                TO_CHAR(l_tab(i).object_id)      || ',' ||
                TO_CHAR(l_tab(i).data_object_id) || ',' ||
                TO_CHAR(l_tab(i).object_type)    || ',' ||
                TO_CHAR(l_tab(i).created,'dd/mm/yyyy hh24:mi:ss')       || ',' ||
                TO_CHAR(l_tab(i).last_ddl_time,'dd/mm/yyyy hh24:mi:ss') || ',' ||
                '"' || l_tab(i).timestamp || '"' || ',' ||
                '"' || l_tab(i).status    || '"' || ',' ||
                '"' || l_tab(i).temporary || '"' || ',' ||
                '"' || l_tab(i).generated || '"' || ',' ||
                '"' || l_tab(i).secondary || '"' || ',' ||
                TO_CHAR(l_tab(i).namespace) || ',' ||
                '"' || l_tab(i).edition_name || '"';
      utl_file.put_line (l_file_handle, l_text, TRUE);
   END LOOP;
   utl_file.fclose(l_file_handle);
END;
/
On my notebook, the PKG_CSV approach took 1 minute 10 seconds and the direct approach took 1 minute 8 seconds, indicating that any overhead introduced by PKG_CSV is minor.
Download scripts shown in this post.
Hi. I’ve tested your package. Creating an 800 MB CSV file took 3 hours. Is there a way to optimise performance?
With any performance-related problem, the key to solving it is to work out where the time is going. In the case of PKG_CSV the time can be roughly categorised into time spent obtaining the data, i.e. running the query, and time spent processing it in PKG_CSV and writing it to disk.
The first step for you is to work out how much of the 3 hours is spent in each part. Assuming you have sufficient space within the database, I would take the query that you’re using and create a new table out of it using CTAS (CREATE TABLE AS SELECT). This will give you a reasonable understanding of how long the query you have takes to run. If this is the majority of the 3 hours then you have a query tuning problem.
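For example, with timing enabled in SQL*Plus (the table name below is a placeholder; substitute your actual extract query):

```sql
SET TIMING ON

-- placeholder names: replace the SELECT with your extract query
CREATE TABLE csv_extract_check
AS
SELECT *
FROM   your_extract_source;
```

The elapsed time reported for the CTAS approximates the raw query cost, free of any PKG_CSV processing or file I/O.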
If it turns out that PKG_CSV is accounting for the majority of the time then a hard-coded approach using a cursor loop, BULK COLLECT into a collection and output via UTL_FILE may fare better, as it will not need to do the data type analysis/translation that PKG_CSV does.
I hope the above is helpful. Without further details I’m not able to offer anything more specific.