vankuik.nl: Parsing and transforming RDF files in PLSQL

When you say 'XML', you'd probably also say 'Java' -- until now, because a stock Oracle database installation gives you a PL/SQL wrapper with which you can access the XML API's. From now on, after 'Java' you'll say 'PL/SQL'.

To get an idea of the possibilities, I developed a PL/SQL portlet which parses XML files and walks through the results to pick the interesting pieces and displays those to the Portal user. While I can't give you the source code, the examples should give more than enough clues to build your own portlet.

For this article, it would be useful to have some basic understanding of XML, but it's not a strict requirement.

RDF

The XML files of which the contents will be parsed and displayed, is created in the RDF dialect. RDF stands for Resource Description Framework and is used by many sites to expose their headlines. One example of such a site is Slashdot. This is a real community site where users can submit links to interesting news on the web. The link is then displayed along with a short description and users of the website can then discuss the news item.

When the submission is accepted by the site owners, a new RDF file is generated which can be found here.

Such an RDF file looks like this: http://slashdot.org/slashdot.rdf

With such a file, we'll have all the information for a portlet which shows the headlines of Slashdot: we have a name, a link and a description to the main site itself. We have the headlines, which we can make hyperlinks pointing directly to the items themselves. We even have a nice image and a URL to submit a searchstring to!

The Portlet

The portlet consists of two parts: one part that actually parses the RDF files and displays the headlines, and a batchjob which retrieves the RDF files in the background. We'll concentrate on the displaying part first.

Parsing and displaying

The brunt of the work is done by two procedures, show() and print_rdf090(). That first one retrieves the current user its preferences (which sites will be displayed in the portlet). It loops over the second one, which does the actual work of displaying the headlines.

The following code displays a piece of the declare part of the procedure. Using the source code of Slashdot's RDF file, you'll probably have a good idea what each variable is meant for.

    452    procedure print_rdf090
    453    ( p_site_id in rdf_sites.site_id%type
    454    )
    455    is
    456       cursor c_sit
    457       ( b_site_id in rdf_sites.site_id%type
    458       )
    459       is
    460       select sit.rdf_file
    461       from   rdf_sites sit
    462       where  site_id   = b_site_id;
    463       r_sit  c_sit%rowtype;
    464
    465       l_found           boolean; -- To check whether cursor was successful
    466
    467       l_parser          xmlparser.parser;
    468       l_xmldoc          xmldom.domdocument;
    469       l_doc_elt         xmldom.domelement;
    470       l_root_node       xmldom.domnode;
    471       l_rdf_childnodes  xmldom.domnodelist;
    472       l_channel_node    xmldom.domnode;
    473       l_item_node       xmldom.domnode;
    474       l_item_childnodes xmldom.domnodelist;
    475       l_node            xmldom.domnode;    -- A temporary DOM node
    476       l_title           varchar2(4000);
    477       l_link            varchar2(4000);
    478
    479    begin
    480       open c_sit(p_site_id);
    481       fetch c_sit into r_sit;
    482       l_found := c_sit%found;
    483       close c_sit;
    484
    485       if not l_found
    486       then
    487          raise no_data_found;
    488       else
    489          -- We found a record, but the clob isn't filled.
    490          if dbms_lob.getlength(r_sit.rdf_file) = 0
    491          then
    492             raise no_data_found;
    493          end if;
    494       end if;

And the following fragment of code will print the name of the website in the form of a hyperlink. The call to wwui_api_portlet.portlet_subheader_color() makes sure that we follow the Portal UI.

    495
    496       -- Parse document
    497       l_parser := xmlparser.newparser;
    498       xmlparser.parseclob(l_parser, r_sit.rdf_file);
    499       l_xmldoc := xmlparser.getdocument(l_parser);
    500
    501       -- Get the child nodes of the root node
    502       l_doc_elt := xmldom.getdocumentelement(l_xmldoc);
    503       l_root_node := xmldom.makenode(l_doc_elt);
    504       l_rdf_childnodes := xmldom.getchildnodes(l_root_node);
    505
    506       -- Print information in <channel> tag
    507       l_channel_node := xmldom.item(l_rdf_childnodes, 0);
    508       l_node := xmldom.getfirstchild(l_channel_node); -- go to title element
    509       l_node := xmldom.getfirstchild(l_node); -- Go to text of <title>
    510       l_title := xmldom.getNodeValue(l_node);
    511
    512       l_node := xmldom.getfirstchild(l_channel_node); -- go to title element
    513       l_node := xmldom.getnextsibling(l_node); -- go to link element
    514       l_node := xmldom.getfirstchild(l_node); -- Go to text of <link>
    515       l_link := xmldom.getNodeValue(l_node);
    516
    517       -- Print header of the table
    518       htp.tableopen('cellspacing=0 cellpadding=0 width="100%"');
    519       htp.p('<tr class="' || wwui_api_portlet.portlet_subheader_color || '">');
    520       wwutl_htp.tabledataopen;
    521       htp.anchor(ctext => p(l_title), curl => l_link);
    522       wwutl_htp.tabledataclose;
    523       htp.tablerowclose;
    524
    525       -- Print items
    526       for i in 2..xmldom.getlength(l_rdf_childnodes) - 1
    527       loop
    528          htp.tablerowopen;
    529          l_item_node := xmldom.item(l_rdf_childnodes, i);  -- Get an <item>
    530
    531          -- Get the text and the link
    532          if xmldom.getnodename(l_item_node) = 'item'      -- Just to make sure
    533          then
    534             l_item_childnodes := xmldom.getchildnodes(l_item_node);
    535
    536             -- Get the <title> tag and get its child, a text node
    537             l_node := xmldom.item(l_item_childnodes, 0);
    538             l_node := xmldom.getfirstchild(l_node);
    539             l_title := xmldom.getNodeValue(l_node);
    540
    541             -- Get the <link> tag and get its child, a text node
    542             l_node := xmldom.item(l_item_childnodes, 1);
    543             l_node := xmldom.getfirstchild(l_node);
    544             l_link := xmldom.getNodeValue(l_node);
    545
    546             wwutl_htp.tabledataopen;
    547             htp.anchor( ctext => p(l_title), curl => l_link);
    548             wwutl_htp.tabledataclose;
    549          end if;
    550          htp.tablerowclose;
    551       end loop;
    552       htp.tableclose;
    553
    554       xmlparser.freeparser(l_parser);
    555    end print_rdf090;

In between the declaration of the variables and the above code, a cursor named c_sit is opened and a clob (Character Large OBject) is fetched in the variable r_sit.rdf_file. The clob contains the RDF file of the website we're currently looping over. It was put in the clob by the batchjob, which is discussed in the next section.

Line 497 to 499 parses the clob in a DOM document and the result is assigned to the l_xmldoc variable. This really shows the power of XML; we're not making a hundred calls to substr(), instr(), length() or are pouring out mystifying regular expressions -- we're just using a standard parser to put the XML-file in a standard data structure on which we work on using a standard API (the DOM). A reminder: that standard data structure is the 'infoset', that tree of nodes which represents the XML-file. A node can be a root node, an element node, an attribute node or a text node.

In line 502, we get the document element (which is the first tag of the document) and cast it to a node type. Then in line 504, we use the temporary variable l_node to get the text between the <title> tags. The following illustration shows a part of the tree of nodes which we are strolling down.

Lines 507-515 will now probably speak for themselves and as the comment in the code tells us, the data we assign to l_title and l_link are used to start printing a HTML table. After line 523 folllows a loop which gets the contents of the <item> tags and prints them as rows in the HTML table. The result is then as is shown in the screenshot below. Note that not Slashdot's headlines are displayed here, but FreshMeat and Linux.Com; as you can see, lots of sites use RDF.

Batchjob

We cache the RDF files by creating a batchjob which reads a number of URLs from a table, checks each URL for data and puts that data in clobs. The dbms_job and the utl_http package are used for this.

     16    procedure refresh
     17    is
     18       cursor c_sit
     19       ( b_err_limit      rdf_sites.errors%type
     20       )
     21       is
     22       select sit.site_id
     23       ,      sit.name
     24       ,      sit.url
     25       ,      sit.errors
     26       ,      sit.rdf_file
     27       from   rdf_sites   sit
     28       where  sit.errors  <= b_err_limit
     29       for    update of sit.rdf_file, sit.errors;
     30       r_sit  c_sit%rowtype;
     31
     32       l_errors      rdf_sites.errors%type;
     33       l_rdf_file    rdf_sites.rdf_file%type;
     34       l_err_limit   rdf_sites.errors%type;
     35
     36       l_req        utl_http.req;
     37       l_resp       utl_http.resp;
     38       l_name       varchar2(256);
     39       l_value      varchar2(1024);
     40       l_len        integer;
     41    begin
     42       -- Initialize variables
     43       l_err_limit := 3;  -- TODO get from prefstore
     44       utl_http.set_proxy(rdf_gen.get_proxy);
     45       utl_http.set_response_error_check(false);
     46       dbms_lob.createtemporary(l_rdf_file, true);
     47
     48       -- Loop through all sites and fetch the XML file
     49       for i_sit in c_sit(l_err_limit)
     50       loop
     51
     52          l_req := utl_http.begin_request(i_sit.url);
     53          utl_http.set_header(l_req, 'User-Agent',
     54             'Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0)');
     55          l_resp := utl_http.get_response(l_req);
     56
     57          -- Look for clientside error and report it
     58          if  l_resp.status_code >= 400
     59          and l_resp.status_code <= 599
     60          then
     61             utl_http.end_response(l_resp);
     62
     63             update rdf_sites
     64             set    errors = i_sit.errors + 1
     65             where  current of c_sit;
     66          -- Everything went well, save the output for later use
     67          else
     68             -- First clean out the clob in the table
     69             l_len := dbms_lob.getlength(i_sit.rdf_file);
     70             if l_len > 0
     71             then
     72                dbms_lob.erase(i_sit.rdf_file, l_len);
     73                dbms_lob.trim(i_sit.rdf_file, 0);
     74                l_len := dbms_lob.getlength(i_sit.rdf_file);
     75             end if;
     76
     77             -- Now start grabbing from the page and append it to our temp clob
     78             dbms_lob.trim(l_rdf_file, 0);
     79             begin
     80                loop
     81                   utl_http.read_line(l_resp, l_value);
     82                   l_rdf_file := l_rdf_file || l_value;
     83                end loop;
     84             exception
     85                when utl_http.end_of_body then
     86                   utl_http.end_response(l_resp);
     87             end;
     88
     89             -- Temporary clob filled. Write it to the empty clob in the table.
     90             dbms_lob.append(i_sit.rdf_file, l_rdf_file);
     91
     92             -- Succesful retrieval, so reset the number of errors
     93             update rdf_sites
     94             set    errors   = 0
     95             where  current of c_sit;
     96
     97          end if;
     98       end loop;
     99    end refresh;

Since websites change every now and then or are unreachable, the code for the batchjob is made a little robust. When retrieving the RDF file does not succeed, a counter is incremented and at a certain limit (configurable by the Portal Page owner), the URL is skipped and an error message is logged in the Portal log tables.

XSLT

The section 'The Portlet' explains the structure of the code. Another possibility is not to scoop the data from the RDF using PL/SQL code, but to transform them directly to HTML by using XSLT.

You might be a guru concerning XSLT, but for the newbies: XSLT stands for XML Stylesheet Language for Transformations. It is an XML dialect with which a transformation of a source XML-document to a target can be described. For the target formats XML and HTML, things are made easy but really anything is possible.

We'll show the code below:

    822    procedure rdf2html
    823    ( p_site_id in rdf_sites.site_id%type
    824    )
    825    is
    826       cursor c_xsl
    827       ( b_name in rdf_xslt.name%type
    828       )
    829       is
    830       select xsl.xsl_file
    831       from   rdf_xslt     xsl
    832       where  xsl.name     = b_name;
    833       r_xsl  c_xsl%rowtype;
    834
    835       cursor c_sit
    836       ( b_site_id in rdf_sites.site_id%type
    837       )
    838       is
    839       select sit.rdf_file
    840       from   rdf_sites     sit
    841       where  sit.site_id   = b_site_id;
    842       r_sit  c_sit%rowtype;
    843
    844       l_found         boolean; -- To check whether cursor was successful
    845       C_XSL_NAME      constant varchar2(30) := 'rdf2html';
    846
    847       l_parser1       xmlparser.parser;
    848       l_parser2       xmlparser.parser;
    849       l_rdf_file      xmldom.domdocument;
    850       l_xsl_file      xmldom.domdocument;
    851       l_html          clob;
    852
    853       l_stylesheet    xslprocessor.stylesheet;
    854       l_engine        xslprocessor.processor;
    856    begin
    857       -- Initialize variables
    858       dbms_lob.createtemporary(l_html, true);
    859
    860       -- Get the rdf file
    861       open c_sit(p_site_id);
    862       fetch c_sit into r_sit;
    863       l_found := c_sit%found;
    864       close c_sit;
    865
    866       if not l_found
    867       then
    868          raise no_data_found;
    869       else
    870          -- We found a record, but the clob isn't filled.
    871          if dbms_lob.getlength(r_sit.rdf_file) = 0
    872          then
    873             raise no_data_found;
    874          end if;
    875       end if;
    876
    877       -- Get the stylesheet
    878       open c_xsl('rdf2html');
    879       fetch c_xsl into r_xsl;
    880       l_found := c_xsl%found;
    881       close c_xsl;
    882
    883       if not l_found
    884       then
    885          raise no_data_found;
    886       else
    887          -- We found a record, but the clob isn't filled.
    888          if dbms_lob.getlength(r_xsl.xsl_file) = 0
    889          then
    890             raise no_data_found;
    891          end if;
    892       end if;
    893
    894       -- Parse RDF document
    895       l_parser1 := xmlparser.newparser;
    896       xmlparser.parseclob(l_parser1, r_sit.rdf_file);
    897       l_rdf_file := xmlparser.getdocument(l_parser1);
    898
    899       -- Parse XSL document
    900       l_parser2 := xmlparser.newparser;
    901       xmlparser.parseclob(l_parser2, r_xsl.xsl_file);
    902       l_xsl_file := xmlparser.getdocument(l_parser2);
    903
    904       -- Autobots, *transform* !!
    905       l_stylesheet := xslprocessor.newstylesheet(l_xsl_file, null);
    906       l_engine := xslprocessor.newprocessor;
    907       xslprocessor.processxsl(l_engine, l_stylesheet, l_rdf_file, l_html);
    908
    909       -- Clean up stuff
    910       xslprocessor.freestylesheet(l_stylesheet);
    911       xslprocessor.freeprocessor(l_engine);
    912       xmldom.freedocument(l_xsl_file);
    913       xmldom.freedocument(l_rdf_file);
    914       xmlparser.freeparser(l_parser2);
    915       xmlparser.freeparser(l_parser1);
    916
    917       -- We now have a clob l_html which has the result of the transformation.
    918       dump_clob(l_html);
    919       dbms_lob.freetemporary(l_html);
    920    end rdf2html;

Two cursors are opened here and the RDF and the XSLT files are thrown in clobs. You have seen the code in lines 894 to 902 before; the clobs are parsed. The interesting lines are 905, 906 and 907: we transform the RDF and put it in the clob l_html. With a loop around it, the clob can be sent to the user its browser by calling htp.p().

Conclusion

A remark: performance isn't tested, but it's definitely smartest to do the parsing in the batchjob and create a datamodel which facilitates the data in the RDF-files. This hasn't been done yet since the code is rather alpha-quality and also because debugging code in a batchjob is a PITA... Note that when this option is chosen, using XSLT doesn't make sense anymore.

Anyway, it should be clear by now that XML and PL/SQL can cooperate quite nicely. For more information, I recommend the book of Steve Muench: http://www.oreilly.com/catalog/orxmlapp/

Steve is Oracle's 'XML evangelist'. And of course, a bare bones explanation can be found on http://otn.oracle.com