
Extracting surrounding words in Python from a string position

Suppose I have a string:

string="""<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p> <p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>"""

      

and I have the positions of a phrase in this string, for example:

>>> pos = [m.start() for m in re.finditer("tells you", string)]
>>> pos
[263, 588]

      

I need to extract a few words before and a few words after each position. How can I implement this with Python and regular expressions?

For example:

def look_through(d, s):
    r = []
    content = readFile(d["path"])
    content = BeautifulSoup(content)
    content = content.getText()
    pos = [m.start() for m in re.finditer(s, content)]
    if pos:
        if "phrase" not in d:
            d["phrase"] = [s]
        else:
            d["phrase"].append(s)
        for p in pos:
            r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
    for b in d["decendent"] or []:
            r += look_through(b, s)
    return r

>>> dict = {
    "content": """<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p>""", 
    "name": "directory", 
    "decendent": [
         {
            "content": """<p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>""", 
            "name": "subdirectory", 
            "decendent": None
        }, 
        {
            "content": """It tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)""", 
            "name": "subdirectory_two", 
            "decendent": [
                {
                    "content": "Name 4", 
                    "name": "subsubdirectory", 
                    "decendent": None
                }
            ]
        }
    ]
}

      

So:

>>> look_through(dict, "tells you")
[
    { "content": "This article tells you how to", "phrase": "tells you", "name": "subdirectory" },
    { "content": "It tells you how to use", "phrase": "tells you", "name": "subdirectory_two" }
]

      

Thanks!



2 answers


At first I suggested using the word-boundary metacharacter (\b), but that is not quite right: it is zero-width, so it does not consume any of the string, and \B does not match what I wanted it to.
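A small sketch of what "does not consume" means in practice. The sample string and the variable names `boundaries` and `chunks` are made up for illustration:

```python
import re

text = "tells you how"

# \b matches at word edges, but every match is empty (start == end),
# so it cannot be used on its own to grab the neighbouring words.
boundaries = [m.span() for m in re.finditer(r"\b", text)]
print(boundaries)

# \w+\W+ consumes a word together with the separator that follows it.
chunks = re.findall(r"\w+\W+", text)
print(chunks)
```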

Instead, I suggest using the basic definition of a word boundary, i.e. the border between \W and \w. Match one or more word characters (\w+) together with one or more non-word characters (\W+), in the appropriate order and repeated as many times as you like, on either side of the search string.



For example: (?:\w+\W+){,3}some string(?:\W+\w+){,3}

This finds up to three words before and up to three words after "some string".
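Here is a minimal sketch of how that pattern might be wrapped in a helper. The function name `context_around`, its parameter `n`, and the sample sentence are assumptions for illustration; `re.escape` is added so the phrase is treated literally:

```python
import re

def context_around(phrase, text, n=3):
    """Return each occurrence of phrase with up to n words on either side."""
    # Same shape as the pattern above, with the word count made configurable.
    pattern = r"(?:\w+\W+){,%d}%s(?:\W+\w+){,%d}" % (n, re.escape(phrase), n)
    return [m.group(0) for m in re.finditer(pattern, text)]

sample = "This article tells you how to write HTML where text is mixed."
print(context_around("tells you", sample, n=2))
```

Note that \W+ also swallows punctuation and markup characters, so on raw HTML you would probably want to strip the tags first, as the question already does with BeautifulSoup.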



You want to capture some context around each hit, say two words before and two words after the place where your regex matches. The easiest way to do this is to split your string at the match position and anchor your searches to the ends of the two fragments. For example, to get two words before and after index 263 (your first m.start()), you would do:

import re

m_left = re.search(r"(?:\s+\S+){,2}\s+\S*$", text[:263])
m_right = re.search(r"^\S*\s+(?:\S+\s+){,2}", text[263:])
# m_right was searched in text[263:], so shift its end back by 263
print(text[m_left.start():263 + m_right.end()])

      

The first expression is best read backward from the end of the string: it is anchored at the end ($), optionally skips a partial word if the position falls in the middle of a word (\S*), skips some whitespace (\s+), and then matches up to two ({,2}) whitespace-plus-word sequences (\s+\S+). The count is "up to two" rather than exactly two because, if we reach the beginning of the string, we want to return a shorter match.



The second regular expression does the same, but in the opposite direction.

To be consistent, you probably want to start the right-hand search just after the end of the regex match rather than at its beginning. In that case, use m.end() as the start of the second slice.

How to apply this to a list of regex matches should be fairly obvious.
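For completeness, a rough sketch of that loop. The names `LEFT`, `RIGHT` and `contexts` are invented here, and the sketch assumes "up to two words" on both sides, with the right-hand search starting at m.end():

```python
import re

LEFT = re.compile(r"(?:\s+\S+){,2}\s+\S*$")   # up to two words before the cut
RIGHT = re.compile(r"^\S*\s+(?:\S+\s+){,2}")  # up to two words after the cut

def contexts(pattern, text):
    """Yield each match of pattern with up to two words of context per side."""
    for m in re.finditer(pattern, text):
        left = LEFT.search(text[:m.start()])
        right = RIGHT.search(text[m.end():])
        # Fall back to the bare match when a side has no context at all.
        start = left.start() if left else m.start()
        # right.end() is relative to text[m.end():], so add the offset back.
        end = m.end() + right.end() if right else m.end()
        yield text[start:end].strip()

sample = "Hello. This article tells you how to write HTML"
print(list(contexts("tells you", sample)))
```

One caveat with these patterns as written: every context word needs whitespace on its outer side, so a word sitting at the very start or very end of the text is not picked up.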
