regex - What is the syntax in an Oracle regexp_replace statement to ignore a tag and everything inside it? -
i need regular expression replace in oracle. want replace occurrences of word or phrase exists in document, long word or phrase not exist within set of tags. tags defined me (hot html or xml) , current concept is;
<term type=pos id=123>some phrase</term> i have created function wrapper (non-functional) regexp_replace looks this;
function annotate_one_term(in_text in varchar2, search_term in varchar2, term_type in varchar2, record_id in number) return clob regex_search varchar2(512); regex_replace varchar2(512); begin regex_search := '((<term.*?</term>|[^<])*?)(^|\w)('|| search_term ||')($|\w)'; regex_replace := '\1 <term id='|| to_char(record_id)||' type=' || term_type ||'>'|| search_term ||'</term> '; dbms_output.put_line('regex_search = ' || regex_search); dbms_output.put_line('regex_replace = ' || regex_replace); return trim(regexp_replace(in_text, regex_search, regex_replace,1,0,'in')); end annotate_one_term; when called this;
select annotate_one_term( annotate_one_term('dog elephant dog cat cat dog dogfish fishdog mouse dog', 'dog cat', 'pos', 123), 'dog', 'pos',456) dual; it returns;
<term id=456 type=pos>dog</term> elephant <term id=123 type=pos>dog cat</term> cat <term id=456 type=pos>dog</term> dogfish fishdog mouse <term id=456 type=pos>dog</term> which correct. if called this;
select annotate_one_term( annotate_one_term('elephant dog cat cat dogfish fishdog mouse', 'dog cat', 'pos', 123), 'dog', 'pos',456) dual; it returns;
elephant <term id=123 type=pos <term id=456 type=pos>dog</term> cat</term> cat dogfish fishdog mouse which wrong. appears eating ">" , finding word/phrase within tags.
i actively trying increase knowledge regular expressions, 1 has eluded me far.
i understand tried match "negatively", tried direct match using closing markup <\term> , seems work:
create or replace function annotate_one_term(in_text in varchar2, search_term in varchar2, term_type in varchar2, record_id in number) return clob regex_search varchar2(512); regex_replace varchar2(512); begin regex_search := '(?</term>| |^)' || search_term || '( |$)'; regex_replace := '\1<term id=' || to_char(record_id) || ' type=' || term_type || '>' || search_term || '</term>\2'; return trim(regexp_replace(in_text, regex_search, regex_replace,1,0,'in')); end annotate_one_term; we obtain:
select annotate_one_term( annotate_one_term('dog elephant dog cat cat dog dogfish fishdog mouse dog', 'dog cat', 'pos', 123), 'dog', 'pos',456) dual; gives :
<term id=456 type=pos>dog</term> elephant <term id=123 type=pos>dog cat</term> cat <term id=456 type=pos>dog</term> dogfish fishdog mouse <term id=456 type=pos>dog</term> and
select annotate_one_term( annotate_one_term('elephant dog cat cat dogfish fishdog mouse', 'dog cat', 'pos', 123), 'dog', 'pos',456) dual; gives:
elephant <term id=123 type=pos>dog cat</term> cat dogfish fishdog mouse as expected, no cross terms. have use tricks because oracle doesn't support lookahead/lookbehind assertions (at least in version, 11g).
Comments
Post a Comment