Getting the visible text on a page from an IHTMLDocument2*
I am trying to get the text content of an Internet Explorer browser window.
I am following these steps:
- Get a pointer to IHTMLDocument2.
- From IHTMLDocument2, get the body as an IHTMLElement.
- On the body, call get_innerText.
Edit
- I am getting all the children of the body and making a recursive call on each IHTMLElement.
- If I find an element that is not displayed, or an element with a script tag, I ignore that element and all of its children.
My problem
- Along with the text that is displayed on the page, I also get content from elements with style = "display: none".
- For google.com, I also get JavaScript along with the text.
I've tried a recursive approach, but I don't know how to deal with scenarios like this:
<div>
Hello World 1
<div style="display: none">Hello world 2</div>
</div>
In this case, I am not able to get "Hello World 1".
Can anyone please help me with the best way to get the text from an IHTMLDocument2*? I am using C++ Win32, no MFC, ATL.
Thank you, Ashish.
If you iterate backwards over document.body.all, you will always walk through the elements from the inside out, so you don't have to recurse. The DOM will do this for you. For example (the code is in Delphi):
// Requires ComObj, MSHTML and Dialogs in the uses clause.
procedure Test();
var
  document, el: OleVariant;
  i: Integer;
begin
  document := CreateComObject(CLASS_HTMLDocument) as IDispatch;
  document.open;
  document.write('<div>Hello World 1<div style="display: none">Hello world 2<div>This DIV is also invisible</div></div></div>');
  document.close;
  // Iterate backwards so that removing a node never shifts the indices still to be visited
  for i := document.body.all.length - 1 downto 0 do
  begin
    el := document.body.all.item(i);
    // Filter the elements; note that style.display only reflects the inline style attribute
    if (el.style.display = 'none') then
    begin
      el.removeNode(true); // true removes the element together with its children
    end;
  end;
  ShowMessage(document.body.innerText);
end;
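The loop above removes hidden nodes from the document itself. If changing the live page is undesirable, a read-only recursive walk that treats text as child nodes should give the same result for this markup. The sketch below only models the idea in portable C++: `Node`, `makeText`, `makeElement`, `collectVisibleText` and `visibleText` are hypothetical names, not MSHTML types. With MSHTML, the equivalent walk would presumably go through IHTMLDOMNode (get_nodeType, get_firstChild, get_nextSibling), which is an assumption to check against the documentation.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node; this is not an MSHTML type.
struct Node {
    std::string tag;             // empty for a text node
    std::string text;            // used only by text nodes
    bool hidden = false;         // models "display: none" on an element
    std::vector<Node> children;  // child elements and text nodes, in document order
};

// Builds a text node.
Node makeText(const std::string& s) {
    Node n;
    n.text = s;
    return n;
}

// Builds an element node with the given children.
Node makeElement(const std::string& tag, bool hidden, std::vector<Node> kids) {
    Node n;
    n.tag = tag;
    n.hidden = hidden;
    n.children = std::move(kids);
    return n;
}

// Appends the text of every visible text node below `node` to `out`.
// A hidden element or a script element is skipped with its whole subtree.
void collectVisibleText(const Node& node, std::string& out) {
    if (node.tag.empty()) {               // text node: keep its text
        if (!out.empty()) out += ' ';
        out += node.text;
        return;
    }
    if (node.hidden || node.tag == "script") return;
    for (const Node& child : node.children) {
        collectVisibleText(child, out);
    }
}

// Convenience wrapper returning the collected text.
std::string visibleText(const Node& root) {
    std::string out;
    collectVisibleText(root, out);
    return out;
}
```

Because "Hello World 1" is a text node that is a sibling of the hidden DIV rather than part of it, the walk keeps the outer text and drops only the hidden subtree.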
Side comment: Regarding your scenario with a recursive approach:
<div>Hello World 1<div style="display: none">Hello world 2</div></div>
If, for example, our element is the first DIV, el.getAdjacentText('afterBegin') will return "Hello World 1". So we can probably iterate over the elements and collect getAdjacentText('afterBegin'), but this is a bit more complicated because we need to check the parents of each element for el.currentStyle.display.
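The parent check described here can be sketched as follows. This is only a portable C++ model of the idea, not MSHTML code: the `Element` record and both functions are hypothetical, with `displayNone` standing in for currentStyle.display being "none" and `leadingText` standing in for the result of getAdjacentText('afterBegin').

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat element record, loosely modelling one entry of document.body.all.
struct Element {
    int parent;               // index of the parent in the list, -1 for the root
    bool displayNone;         // models currentStyle.display == "none"
    std::string leadingText;  // models getAdjacentText('afterBegin')
};

// Returns true if the element itself or any of its ancestors is hidden.
bool hasHiddenAncestorOrSelf(const std::vector<Element>& all, int index) {
    for (int i = index; i != -1; i = all[i].parent) {
        if (all[i].displayNone) return true;
    }
    return false;
}

// Collects the leading text of every element that is not inside a hidden subtree,
// one line per element.
std::string collectLeadingText(const std::vector<Element>& all) {
    std::string out;
    for (std::size_t i = 0; i < all.size(); ++i) {
        if (all[i].leadingText.empty()) continue;
        if (hasHiddenAncestorOrSelf(all, static_cast<int>(i))) continue;
        if (!out.empty()) out += '\n';
        out += all[i].leadingText;
    }
    return out;
}
```

The ancestor walk matters because an element nested inside a hidden DIV does not carry "display: none" itself, as with the innermost DIV in the Delphi example. The model only tracks the text that directly follows each opening tag, so text that comes after a child element would need a separate step.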