How can I unescape the value of an HTML attribute in Prolog?

I found the predicate xml_quote_attribute/2 in library(sgml) of SWI-Prolog. This predicate works with the first argument as input and the second argument as output:

?- xml_quote_attribute('<abc>', X).
X = '&lt;abc&gt;'.

      

But I couldn't figure out how I can do the reverse conversion. For example, the following query doesn't work:

?- xml_quote_attribute(X, '&lt;abc&gt;').
ERROR: Arguments are not sufficiently instantiated

      

Is there another predicate that does the job?

Bye

+3




4 answers


This is what Ruud's solution looks like using DCG + pushback / semicontext notation.

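:- use_module(library(dcg/basics)).

% Call as phrase(html_unescape, EscapedCodes, UnescapedCodes).
% Each decoded code is pushed back only after the rest of the input
% has been processed, so a decoded "&" is never scanned a second time.
html_unescape, [C] --> sgml_entity(C), !, html_unescape.
html_unescape, [C] --> [C], !, html_unescape.
html_unescape --> [].

sgml_entity(C)   --> "&#", integer(C), ";".
sgml_entity(0'<) --> "&lt;".
sgml_entity(0'>) --> "&gt;".
sgml_entity(0'&) --> "&amp;".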

      



Using a DCG makes the code more readable. It also removes some of the unnecessary backtracking that Cookie Monster noted as a result of using append/3 for this.
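
A grammar like this is run with phrase/2 or phrase/3 on a list of character codes. As a minimal sketch, the same idea can also be written without pushback, returning the decoded codes as an argument, and then wrapped for atoms. The names unescaped//1, entity//1 and html_unescape_atom/2 are made up for this illustration, and only the three named entities from the answer are covered.

```prolog
:- use_module(library(dcg/basics)).

% unescaped(-Codes)// describes escaped input and yields the decoded codes.
unescaped([C|Cs]) --> entity(C), !, unescaped(Cs).
unescaped([C|Cs]) --> [C], !, unescaped(Cs).
unescaped([])     --> [].

% Numeric references are limited to valid Unicode code points.
entity(C)   --> "&#", integer(C), ";", { C > 0, C =< 0x10FFFF }.
entity(0'<) --> "&lt;".
entity(0'>) --> "&gt;".
entity(0'&) --> "&amp;".

% Hypothetical convenience wrapper for atoms.
html_unescape_atom(Escaped, Unescaped) :-
    atom_codes(Escaped, Codes),
    phrase(unescaped(UCodes), Codes),
    atom_codes(Unescaped, UCodes).
```

For example, the query html_unescape_atom('&lt;a&gt; &amp;lt;', X) gives X = '<a> &lt;', so an escaped ampersand is decoded only once. Anything that is not a recognised entity is copied unchanged.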

+4




Here's a naive solution using character code lists. This will most likely not give you the best performance, but for strings that are not too large, it might be okay.

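% In SWI-Prolog 7 and later, double-quoted text is a string object by
% default; this directive makes "..." denote a list of character codes
% in this file, which the code below relies on. At the toplevel, set the
% same flag or write the text as a backquoted code list.
:- set_prolog_flag(double_quotes, codes).

html_unescape("", "") :- !.

% An entity may start here: decode the shortest prefix that is a known
% entity. If no prefix matches, fall through and copy the "&" literally.
html_unescape(Escaped, Unescaped) :-
    append("&", _, Escaped),
    append(E1, E2, Escaped),
    sgml_entity(E1, U1),
    !,
    html_unescape(E2, U2),
    append(U1, U2, Unescaped).

% Any other character is copied unchanged.
html_unescape(Escaped, Unescaped) :-
    append([C], E2, Escaped),
    html_unescape(E2, U2),
    append([C], U2, Unescaped).

% Numeric character reference, e.g. "&#26361;".
sgml_entity(Escaped, [C]) :-
    append(["&#", L, ";"], Escaped),
    catch(number_codes(C, L), error(syntax_error(_), _), fail),
    !.

sgml_entity("&lt;", "<").
sgml_entity("&gt;", ">").
sgml_entity("&amp;", "&").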

      

You will have to complete the list of SGML entities yourself.
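
As a sketch of what completing the list looks like, here are a few more of the standard named entities in the same format as the facts above. In a real file they would go next to the existing sgml_entity/2 clauses. The directive assumes a SWI-Prolog version where double quotes do not already denote code lists.

```prolog
% Assumes "..." denotes a list of character codes, as in the code above.
:- set_prolog_flag(double_quotes, codes).

sgml_entity("&quot;", "\"").    % quotation mark, code 34
sgml_entity("&apos;", "'").     % apostrophe, code 39
sgml_entity("&nbsp;", [160]).   % no-break space
sgml_entity("&copy;", [169]).   % copyright sign
sgml_entity("&euro;", [8364]).  % euro sign
```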



Output example:

?- html_unescape("&lt;a&gt; &#26361;&#25805;", L), format('~s', [L]).
<a> 曹操
L = [60, 97, 62, 32, 26361, 25805].

      

+2




If you don't mind linking a foreign module, you can write a very efficient implementation in C.

html_unescape.pl:

:- module(html_unescape, [ html_unescape/2 ]).
:- use_foreign_library(foreign('./html_unescape.so')).

      

html_unescape.c:

#include <stdio.h>
#include <string.h>
#include <SWI-Prolog.h>

static int to_utf8(char **unesc, unsigned ccode)
{
    int ok = 1;
    if (ccode < 0x80)
    {
        *(*unesc)++ = ccode;
    }
    else if (ccode < 0x800)
    {
        *(*unesc)++ = 192 + ccode / 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else if (ccode - 0xd800u < 0x800)
    {
        ok = 0;
    }
    else if (ccode < 0x10000)
    {
        *(*unesc)++ = 224 + ccode / 4096;
        *(*unesc)++ = 128 + ccode / 64 % 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else if (ccode < 0x110000)
    {
        *(*unesc)++ = 240 + ccode / 262144;
        *(*unesc)++ = 128 + ccode / 4096 % 64;
        *(*unesc)++ = 128 + ccode / 64 % 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else
    {
        ok = 0;
    }
    return ok;
}

static int numeric_entity(char **esc, char **unesc)
{
    int consumed = 0; /* stays 0 if sscanf stops before reaching %n */
    unsigned ccode;
    int ok = (sscanf(*esc, "&#%u;%n", &ccode, &consumed) > 0 ||
              sscanf(*esc, "&#x%x;%n", &ccode, &consumed) > 0) &&
             consumed > 0 &&
             to_utf8(unesc, ccode);
    if (ok)
    {
        *esc += consumed;
    }
    return ok;
}

static int symbolic_entity(char **esc, char **unesc, char *name, int ccode)
{
    int ok = strncmp(*esc, name, strlen(name)) == 0 &&
             to_utf8(unesc, ccode);
    if (ok)
    {
        *esc += strlen(name);
    }
    return ok;
}

static foreign_t pl_html_unescape(term_t escaped, term_t unescaped)
{
    char *esc;
    if (!PL_get_chars(escaped, &esc, CVT_ATOM | REP_UTF8))
    {
        PL_fail;
    }
    else if (strchr(esc, '&') == NULL)
    {
        return PL_unify(escaped, unescaped);
    }
    else
    {
        char buffer[strlen(esc) + 1];
        char *unesc = buffer;
        while (*esc != '\0')
        {
            if (*esc != '&' || !(numeric_entity(&esc, &unesc) ||
                                 symbolic_entity(&esc, &unesc, "&lt;", '<') ||
                                 symbolic_entity(&esc, &unesc, "&gt;", '>') ||
                                 symbolic_entity(&esc, &unesc, "&amp;", '&')))
                                    // TODO: more entities...
            {
                *unesc++ = *esc++;
            }
        }
        return PL_unify_chars(unescaped, PL_ATOM | REP_UTF8, unesc - buffer, buffer);
    }
}

install_t install_html_unescape()
{
    PL_register_foreign("html_unescape", 2, pl_html_unescape, 0);
}

      

The following command creates the shared library html_unescape.so from html_unescape.c. Tested on Ubuntu 14.04; it may differ on Windows.

swipl-ld -shared -o html_unescape html_unescape.c

      

Launching SWI-Prolog:

swipl html_unescape.pl

      

Output example:

?- html_unescape('&lt;a&gt; &#26361;&#25805;', S).
S = '<a> 曹操'.

      

With special thanks to the SWI-Prolog documentation and source code, and to the question "C library to convert Unicode code points to UTF-8?"

+2




This is not meant as a definitive answer, since it does not provide a solution for SWI-Prolog. For a Java-based interpreter, the problem is that XML escaping is not part of J2SE, at least not in a simple form (I haven't figured out how to use Xerces or the like).

One possible route would be to interface with StringEscapeUtils (*) from Apache Commons. But then again, this would not be necessary on Android, since it has a TextUtils class. So we rolled our own (* *) little conversion. It works like this:

?- text_escape('<abc>', X).
X = '&lt;abc&gt;'
?- text_escape(X, '&lt;abc&gt;').
X = '<abc>'

      

Note the use of the Java methods codePointAt() and charCount(), and respectively appendCodePoint(), in the Java source code. So it could also escape and unescape code points above the basic plane, i.e. in the range > 0xFFFF (not currently implemented, left as an exercise).

The Apache library, on the other hand, at least in version 2.6, is not aware of surrogate pairs and emits two decimal entities per code point instead of one.
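
To illustrate the difference, the sketch below computes the UTF-16 surrogate pair for a code point above the basic plane, using the standard UTF-16 formula. The predicate name surrogate_pair/3 is made up for this example and is not part of either library.

```prolog
% surrogate_pair(+CodePoint, -High, -Low)
% Splits a code point above 0xFFFF into its UTF-16 surrogate pair.
surrogate_pair(CodePoint, High, Low) :-
    CodePoint > 0xFFFF,
    CodePoint =< 0x10FFFF,
    V is CodePoint - 0x10000,
    High is 0xD800 + (V >> 10),
    Low  is 0xDC00 + (V /\ 0x3FF).
```

For U+1F600 (decimal 128512) this gives High = 55357 and Low = 56832. An escaper that works on code points writes &#128512; while one that works on 16-bit chars writes &#55357;&#56832; instead.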

Bye

(*) Java: StringEscapeUtils Source class
http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/Entities.java#Entities.escape%28java.io.Writer,java.lang.String%29

(* *) Jekejeke Prolog: Module xml Source
http://www.jekejeke.ch/idatab/doclet/blog/en/docs/src/05_run/20_system/03_xml.html

+1








