AGHParser | An HTML Content Extractor Library

About AGHParser

AGHParser is an HTML content extractor. It is completely modular, so you are not required to write any confusing regular expressions. You give it either the name of an HTML file or the 'dirty' HTML stored in a char array, and you get the output back in a char array, so there is no need to parse an XML file afterwards (I have seen some HTML parsing libraries that put their output in XML files).

HTML files contain many formatting tags like <a>, <br>, etc. All this formatting is good for human readers, but if a program tries to extract only the information, without the formatting, it becomes a nightmare for the programmer because of the vast and varied language of HTML. Varied, because you can write tags like <br/>, <hr/>, etc., which do not have closing pairs the way <b> has </b>. I call these non-pair tags SINGLETONS. Transitional HTML documents can simply write <br> and <hr> instead of <br/> and <hr/>. To add to the woes, HTML files now have embedded JavaScript and style sheet code. How will a parser know whether a <table> tag in a file is an actual HTML tag or part of a JavaScript print statement, if the parser doesn't also know JavaScript? Well, AGHParser doesn't know JavaScript, so it relies on the <!-- --> comment tags that page designers normally put inside <script> and <style> tags. AGHParser can also parse <script> tags without these comment tags, but then the inside of the <script> block must not contain a </script> tag anywhere.
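To make the </script> restriction concrete, here is a minimal sketch of how a parser that knows no JavaScript might skip script blocks. It is not AGHParser's actual code; the names dropScriptBlocks and toLower are hypothetical, and the sketch assumes ASCII input and simply cuts from each "<script" to the next "</script>".

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of AGHParser): lower-case an ASCII string
// so that tag names can be matched case-insensitively.
static std::string toLower(std::string s) {
    for (char &c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Hypothetical helper: return a copy of `html` with everything from each
// "<script" up to and including the next "</script>" removed.
// It does not understand JavaScript, so a "</script>" inside a script
// string ends the block too early.
std::string dropScriptBlocks(const std::string &html) {
    const std::string lower = toLower(html);
    assert(lower.size() == html.size());  // positions in both strings line up

    const std::string closeTag = "</script>";
    std::string out;
    std::size_t pos = 0;
    while (true) {
        std::size_t open = lower.find("<script", pos);
        if (open == std::string::npos) break;
        out.append(html, pos, open - pos);   // keep the text before the block
        std::size_t close = lower.find(closeTag, open);
        if (close == std::string::npos) {    // unterminated block: drop the rest
            pos = html.size();
            break;
        }
        pos = close + closeTag.size();
    }
    out.append(html, pos, std::string::npos);  // keep whatever follows
    return out;
}
```

With well-behaved input such as "a<script>x</script>b" the sketch returns "ab". If the script itself contains the text </script>, for example inside document.write('</script>'), the block is cut at that first occurrence and the leftover JavaScript leaks into the output, which is exactly why the block must not contain a </script> tag.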

So, back to where we were. You give this library a tag to extract content from, and you can also specify tags that you want removed from the output. Since XML is a generalized but simplified version of HTML, I think AGHParser can be used to parse XML files too (though this has not been tested).
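The idea of "extract the content of one tag, then remove some tags from the output" can be sketched as below. This is only an illustration and not AGHParser's API: the functions tagName, innerContent and removeTag are hypothetical, tag names are expected in lower case, nesting of the same tag is not handled, and "removing" a tag is assumed to mean dropping the markup while keeping the text inside it.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: lower-cased name of the tag spanning html[lt..gt],
// where html[lt] == '<' and html[gt] == '>'. A leading '/' is skipped,
// so "</B>" and "<b>" both give "b", and "<br/>" gives "br".
static std::string tagName(const std::string &html, std::size_t lt, std::size_t gt) {
    std::size_t j = lt + 1;
    if (j < gt && html[j] == '/') ++j;
    std::string name;
    while (j < gt && std::isalnum(static_cast<unsigned char>(html[j])))
        name += static_cast<char>(std::tolower(static_cast<unsigned char>(html[j++])));
    return name;
}

// Hypothetical: text between the first <tag ...> and the next </tag>.
// Returns an empty string if no such pair is found. No nesting support.
std::string innerContent(const std::string &html, const std::string &tag) {
    assert(!tag.empty());
    std::size_t start = std::string::npos;
    for (std::size_t i = html.find('<'); i != std::string::npos; i = html.find('<', i + 1)) {
        std::size_t end = html.find('>', i);
        if (end == std::string::npos) break;
        if (tagName(html, i, end) != tag) continue;
        bool closing = (html[i + 1] == '/');
        if (start == std::string::npos && !closing)
            start = end + 1;                        // content begins after the opening tag
        else if (start != std::string::npos && closing)
            return html.substr(start, i - start);   // content ends before the closing tag
    }
    return "";
}

// Hypothetical: drop every opening, closing or singleton form of `tag`,
// keeping the text that was between the tags.
std::string removeTag(const std::string &html, const std::string &tag) {
    assert(!tag.empty());
    std::string out;
    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] == '<') {
            std::size_t end = html.find('>', i);
            if (end != std::string::npos && tagName(html, i, end) == tag) {
                i = end + 1;   // skip the whole tag
                continue;
            }
        }
        out += html[i++];
    }
    return out;
}
```

For example, calling innerContent on <div id="def"><b>Ogre</b>: a <a href="#">giant</a><br/></div> with the tag "div" gives the markup inside the div, and passing that result through removeTag with "a" and then "br" leaves <b>Ogre</b>: a giant.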

How it started

Have you ever tried the query 'define:Ogre', or any similar query where you precede a word with 'define:' (note that there is no gap between the ':' and the word)? If you haven't, try http://www.google.com/search?q=define:Ogre now!

You will get well-formatted meanings of the word along with links to the sources of the definitions and also related phrases. The definitions are very accurate, so much so that I fire up my browser every time I want to check the meaning of a word on Google. So I thought, why not create a desktop application, a dictionary, which sends any word given to it to Google using a URL of the form www.google.com/search?q=define:TheWord? Try it; it works! You can use similar queries to search Google Images and videos too! So the concept was to query Google.
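Building such a query URL from a word could look roughly like this. The function name buildDefineUrl is hypothetical and is not taken from GDefineParser; the sketch also assumes that characters other than letters, digits and '-', '_', '.', '~' should be percent-encoded, which matters for phrases containing spaces.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: build a URL of the form
// http://www.google.com/search?q=define:TheWord
// and percent-encode characters that are not safe in a query string.
std::string buildDefineUrl(const std::string &word) {
    assert(!word.empty());
    static const char hex[] = "0123456789ABCDEF";
    std::string url = "http://www.google.com/search?q=define:";
    for (unsigned char c : word) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            url += static_cast<char>(c);   // safe character, copy as-is
        } else {
            url += '%';                    // e.g. ' ' becomes "%20"
            url += hex[c >> 4];
            url += hex[c & 15];
        }
    }
    return url;
}
```

Here buildDefineUrl("Ogre") yields http://www.google.com/search?q=define:Ogre, the same URL as in the example above.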

Anyway, the first thing I needed was an HTML parser. I knew that many such libraries might already exist, but for the sake of learning I decided to create my own. After 3 days of coding, 2 days of debugging and 1 day of preparing this package, I am tired. Also, due to my busy schedule, I think I can no longer make that software. That's why I have created this package and released this library under the GPL. The accompanying example program (GDefineParser) is a sort of command-line version of that dictionary. It's coded fast and dirty (it is also under the GPL). So, please make the dictionary software (I will of course help if I am in a position to do so). ;-)

NOTE: If you compile the example program with gcc, it may sometimes crash silently without giving any output. It works fine when compiled with MS VC++ 6.0 or MS VC++ 2005. I don't know why. If you know, please do inform me. :-)

Background picture from wikimedia.org