Getting the visible text on a page from an IHTMLDocument2*

I am trying to get the text content of an Internet Explorer browser window.

I am following these steps:

  • get a pointer to IHTMLDocument2
  • from IHTMLDocument2, get the body as an IHTMLElement
  • on the body, call get_innerText

Edit


  • I am getting all the children of the body and trying to make a recursive call in all IHTMLElements
  • if I get any element that is not displayed or I get an element with a script tag, I ignore that element and all of its children.

My problem

  • along with the text that is displayed on the page, I also get content that has style="display: none"
  • for google.com, I also get JavaScript along with the text

I've tried a recursive approach, but I don't know how to deal with scenarios like this:

<div>
Hello World 1
<div style="display: none">Hello world 2</div>
</div>

      

In this case, I am not able to retrieve "Hello World 1".

Can anyone please help me with the best way to get the text from an IHTMLDocument2*? I am using C++ Win32, no MFC, ATL.

Thank you, Ashish.

1 answer


If you iterate backwards over document.body.all, you always visit the elements from the inside out, so you do not need to recurse; the DOM does this for you. For example (the code is in Delphi):

procedure Test();
var
  document, el: OleVariant;
  i: Integer;
begin
  document := CreateComObject(CLASS_HTMLDocument) as IDispatch;
  document.open;
  document.write('<div>Hello World 1<div style="display: none">Hello world 2<div>This DIV is also invisible</div></div></div>');
  document.close;
  for i := document.body.all.length - 1 downto 0 do // iterate backwards
  begin
    el := document.body.all.item(i);
    // filter the elements
    if (el.style.display = 'none') then
    begin
      el.removeNode(true);
    end;
  end;
  ShowMessage(document.body.innerText);
end;
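
Why the backward loop is safe can be shown with a small portable C++ model. This is only a sketch under stated assumptions: FlatElement, removeWithSubtree and visibleTextBackwards are hypothetical names, not MSHTML types or calls, and the vector merely stands in for document.body.all (elements in document order, so descendants always come after their ancestor). Removing an element together with its subtree only touches indices at or above the current one, so the indices still to be visited stay valid.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat model of document.body.all: elements in document order.
struct FlatElement {
    int depth;            // nesting depth below <body>
    bool displayNone;     // models el.style.display = 'none' (inline style only)
    std::string ownText;  // text placed directly inside this element
};

// Models el.removeNode(true): erase element i and all of its descendants.
// Descendants are the following entries with a greater depth.
void removeWithSubtree(std::vector<FlatElement>& all, std::size_t i) {
    std::size_t end = i + 1;
    while (end < all.size() && all[end].depth > all[i].depth) {
        ++end;
    }
    all.erase(all.begin() + static_cast<std::ptrdiff_t>(i),
              all.begin() + static_cast<std::ptrdiff_t>(end));
}

// Walk the collection from the last index down to 0, dropping hidden
// subtrees, then concatenate the text of whatever is left.
std::string visibleTextBackwards(std::vector<FlatElement> all) {
    for (std::size_t i = all.size(); i-- > 0; ) {
        if (all[i].displayNone) {
            removeWithSubtree(all, i);  // only indices >= i are affected
        }
    }
    std::string out;
    for (const FlatElement& el : all) {
        out += el.ownText;
    }
    return out;
}

// The markup written by the Delphi example, flattened by hand.
std::vector<FlatElement> sampleFromAnswer() {
    return {
        {0, false, "Hello World 1"},
        {1, true,  "Hello world 2"},
        {2, false, "This DIV is also invisible"},
    };
}
```

Calling visibleTextBackwards(sampleFromAnswer()) leaves only the outer DIV in the model. Concatenating ownText is a simplification of innerText; it ignores text that follows a child element.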

      




Side comment: Regarding your scenario with a recursive approach:

<div>Hello World 1<div style="display: none">Hello world 2</div></div>

      

If, for example, our element is the first DIV, el.getAdjacentText('afterBegin') will return "Hello World 1". So we could probably iterate over the elements and collect getAdjacentText('afterBegin'), but this is a bit more complicated because we need to check the parents of each element for el.currentStyle.display.
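
Since the question asks for C++, the recursive variant can be sketched as a walk over nodes rather than elements: text nodes are collected, and a hidden or script element is skipped together with its whole subtree, which removes the need to check parents separately. The Node type and the helper names below are hypothetical stand-ins so that the sketch compiles anywhere. In real MSHTML code the same walk would presumably use IHTMLDOMNode (get_nodeType, get_nodeValue, get_childNodes) and IHTMLElement2::get_currentStyle with IHTMLCurrentStyle::get_display.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node; not an MSHTML type.
struct Node {
    enum class Kind { Element, Text };
    Kind kind = Kind::Element;
    std::string tag;           // lowercase tag name (elements only)
    std::string text;          // character data (text nodes only)
    bool displayNone = false;  // models currentStyle.display == "none"
    std::vector<Node> children;
};

// Small builders for the model tree.
Node textNode(std::string s) {
    Node n;
    n.kind = Node::Kind::Text;
    n.text = std::move(s);
    return n;
}

Node element(std::string tag, bool hidden, std::vector<Node> kids) {
    Node n;
    n.tag = std::move(tag);
    n.displayNone = hidden;
    n.children = std::move(kids);
    return n;
}

// Append the text of every text node that is not inside a hidden
// element or a <script> element.
void collectVisibleText(const Node& node, std::string& out) {
    if (node.kind == Node::Kind::Text) {
        out += node.text;
        return;
    }
    if (node.displayNone || node.tag == "script") {
        return;  // skip this element and everything below it
    }
    for (const Node& child : node.children) {
        collectVisibleText(child, out);
    }
}

std::string visibleText(const Node& root) {
    std::string out;
    collectVisibleText(root, out);
    return out;
}

// The scenario from the question, built by hand.
Node sampleFromQuestion() {
    return element("div", false, {
        textNode("Hello World 1"),
        element("div", true, { textNode("Hello world 2") }),
    });
}
```

With this shape, visibleText(sampleFromQuestion()) keeps the outer DIV's own text because "Hello World 1" is a text-node child of a visible element, while the hidden inner DIV is never entered.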
