Create an NCX file with Notepad++ and regex

I have an HTML content page containing a list of hyperlinked book sections:

<a href="final/main.html">Multimedia Implementation</a><br/>
<a href="final/toc.html">Table of Contents</a><br/>
<a href="final/pref01.html">About the Author</a><br/>
<a href="final/pref02.html">About the Technical Reviewers</a><br/>
<a href="final/pref03.html">Acknowledgments</a><br/>
<a href="final/part01.html">Part I: Introduction and Overview</a><br/>
<a href="final/ch01.html">Chapter 1. Technical Overview</a><br/>
...

      

I want to create an NCX file for a Kindle book that should contain the following data:

<navPoint id="n1" playOrder="1">
<navLabel>
<text>Multimedia Implementation</text>
</navLabel>
<content src="final/main.html"/>
</navPoint>
<navPoint id="n2" playOrder="2">
<navLabel>
<text>Table of Contents</text>
</navLabel>
<content src="final/toc.html"/>
</navPoint>
<navPoint id="n3" playOrder="3">
<navLabel>
<text>About the Author</text>
</navLabel>
<content src="final/pref01.html"/>
</navPoint>
...

      

I am using Notepad++. Is it possible to automate this process with regex?



2 answers


You can't do everything with regex alone, but you can split the problem in two:

  • Generate the opening tags, such as <navPoint id="n1" playOrder="1">, using program logic (an incrementing variable).
  • Do the rest with regex.

Use the following regex to match:

<a\shref="([^"]*)">([^<]*)<\/a><br\/>

      



And replace with:

(generated string)\n<navLabel>\n<text>\2</text>\n</navLabel>\n<content src="\1"/>\n</navPoint>
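The two-part split described above (a counter from program logic, the rest from regex) can also be done in a single pass outside Notepad++. The following is a minimal Python sketch, not part of the original answer: the function name `links_to_navpoints` and the `sample` input are illustrative, and it assumes every link has exactly the `<a href="...">...</a><br/>` shape shown in the question.

```python
import itertools
import re

# Same pattern as in the answer, written for Python's re module.
LINK = re.compile(r'<a\shref="([^"]*)">([^<]*)</a><br/>')


def links_to_navpoints(html):
    """Replace each link with a numbered navPoint element."""
    counter = itertools.count(1)  # yields 1, 2, 3, ... for id and playOrder

    def repl(match):
        n = next(counter)
        return (
            f'<navPoint id="n{n}" playOrder="{n}">\n'
            "<navLabel>\n"
            f"<text>{match.group(2)}</text>\n"
            "</navLabel>\n"
            f'<content src="{match.group(1)}"/>\n'
            "</navPoint>"
        )

    return LINK.sub(repl, html)


sample = (
    '<a href="final/main.html">Multimedia Implementation</a><br/>\n'
    '<a href="final/toc.html">Table of Contents</a><br/>\n'
)
print(links_to_navpoints(sample))
```

Passing a function as the replacement is what allows the counter to advance once per match, which a plain replacement string cannot do.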

      




Yes, it is possible to replace the links with <navPoint> tags. The only thing I haven't found a solution for is the incremental numbering of the id and playOrder attributes of <navPoint>.

The following regex will do most of the work:

/^<a[^>]*href="([^"]+)"[^>]*>([^<]+).*$/gm

      

replaced by:



<navPoint id="n" playOrder="">\n<navLabel><text>$2</text></navLabel>\n<content src="$1" />\n</navPoint>\n

      

Regular Expression Details

/^<a         .. only parse lines that start with an `<a` tag
[^>]*href="  .. find the first occurrence of `href="` inside the tag
([^"]+)      .. capture the text and stop when a " is found
"[^>]*>      .. find the end of the <a> tag
([^<]+)      .. capture the text and stop when a < is found (i.e. the </a> tag)
.*$/         .. continue to the end of the line
gm           .. search the whole string and parse each line individually
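To see what the two capture groups hold, the breakdown above can be checked with a short script. This is only a sketch in Python rather than Notepad++: it uses the pattern with the `>` that closes the `<a>` tag, as the breakdown describes, and `re.MULTILINE` stands in for the `m` flag. The `sample` text is illustrative.

```python
import re

# re.MULTILINE makes ^ and $ match at every line, like the "m" flag.
PATTERN = re.compile(r'^<a[^>]*href="([^"]+)"[^>]*>([^<]+).*$', re.MULTILINE)

sample = (
    '<a href="final/main.html">Multimedia Implementation</a><br/>\n'
    '<a href="final/toc.html">Table of Contents</a><br/>\n'
)

# findall returns one (href, link text) tuple per matching line.
pairs = PATTERN.findall(sample)
for href, title in pairs:
    print(href, "->", title)
```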

      

A more detailed (but even more confusing) explanation is here: https://regex101.com/r/gA0yJ2/1. The link also shows the regex in action, and you can test changes there if you like.
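The numbering that this answer leaves open could be filled in afterwards with a few lines of script. The sketch below is one possible approach in Python, not something from the original answer: the function name `number_navpoints` is hypothetical, and it assumes the Notepad++ replacement has already produced opening tags of the form `id="n" playOrder=""`.

```python
import itertools
import re


def number_navpoints(ncx_fragment):
    """Fill the empty id/playOrder placeholders with 1, 2, 3, ..."""
    counter = itertools.count(1)

    def repl(match):
        n = next(counter)
        return f'<navPoint id="n{n}" playOrder="{n}">'

    # Accept either capitalisation of the placeholder opening tag.
    numbered = re.sub(r'<nav[Pp]oint id="n" playOrder="">', repl, ncx_fragment)
    # NCX is case-sensitive XML, so normalise lowercase closing tags too.
    return numbered.replace("</navpoint>", "</navPoint>")


fragment = (
    '<navpoint id="n" playOrder="">\n</navpoint>\n'
    '<navpoint id="n" playOrder="">\n</navpoint>\n'
)
print(number_navpoints(fragment))
```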
