Wednesday 11 September 2013

How to determine if HTML is an article?

I have built a web crawler to index a number of websites (we'll use CNN as
an example). The URL alone isn't enough to reliably determine whether a
page is actually article content rather than a blog, a video, or anything
else.
I have read on here that some users suggest using boilerpipe for this
issue, but it does not provide this functionality (it just extracts
content from the HTML code).
What is a good way, or algorithm, for determining whether a URL and/or the
corresponding HTML code is an article? Are there any APIs for this? I
realize this could differ slightly from site to site.
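To make the question concrete, here is a rough sketch of the kind of check
I am after. It is only a guess at a heuristic, not a known solution: it
assumes that article pages tend to declare an Open Graph og:type of
"article" or contain an HTML5 <article> element, which will certainly not
hold for every site. The names ArticleSignalParser and looks_like_article
are made up for this illustration.

```python
from html.parser import HTMLParser


class ArticleSignalParser(HTMLParser):
    """Collects two weak signals that a page may be an article (assumed heuristic)."""

    def __init__(self):
        super().__init__()
        self.og_type = None           # content of <meta property="og:type">, if present
        self.has_article_tag = False  # True once an <article> element is seen

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("property") == "og:type":
            self.og_type = (attr_map.get("content") or "").strip().lower()
        elif tag == "article":
            self.has_article_tag = True


def looks_like_article(html):
    """Return True if the page declares itself an article, or has an <article> tag.

    An explicit og:type takes priority; the <article> tag is only a fallback.
    """
    parser = ArticleSignalParser()
    parser.feed(html)
    parser.close()
    if parser.og_type is not None:
        return parser.og_type == "article"
    return parser.has_article_tag
```

A page whose og:type is something like "video.other" would be rejected even
if it contains an <article> tag, and a page with no metadata at all would
fall back to the tag check. Whether these signals are reliable enough in
practice is exactly what I am unsure about.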
Thank you, Rich
