html-extraction
Module for automatic summarization of text documents and HTML pages.
A reworked version of the https://www.readability.com/ parsing library (the currently live alternative is https://mercury.postlight.com/).
Domain-specific language for extracting structured data from HTML documents.