ContentExtraction

Extract the main content from HTML documents.

This package was implemented for the comparison experiments in our paper.

Mitsuo Yoshida, Takashi Inui, Mikio Yamamoto. Automatic Extraction of Blog Posts and Comments from Blog Pages. IPSJ Journal (in Japanese), vol.54, no.12, pp.2502-2512, 2013.

Classes

ContentExtractionUsingLossRatio

This class implements the algorithm described in the following paper.

Donglin Cao, Xiangwen Liao, Hongbo Xu, Shuo Bai. Blog post and comment extraction using information quantity of web format. In Information Retrieval Technology: 4th Asia Information Retrieval Symposium, pp.298-309, 2008.

ContentExtractionUsingMiBAT

This class implements the algorithm described in the following paper.

Xinying Song, Jing Liu, Yunbo Cao, Chin-Yew Lin, Hsiao-Wuen Hon. Automatic extraction of web data records containing user-generated content. In Proceedings of the 19th ACM international conference on Information and knowledge management, pp.39–48, 2010.

License

This package is licensed under the terms of the MIT license. Please see LICENSE.txt for details.

Copyright (c) 2013 Mitsuo Yoshida

Other Tools

Additionally, we used the following tools in the comparison experiments.

Tim Weninger, William H. Hsu, Jiawei Han. CETR - Content Extraction via Tag Ratios. Proceedings of the 19th International Conference on World Wide Web (WWW '10), pp.971-980, 2010.

Nikolaos Pappas, Georgios Katsimpras, Efstathios Stamatatos. Extracting Informative Textual Parts from Web Pages Containing User-Generated Content. Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies (i-KNOW '12), 2012.

The following is an earlier method by the authors. Our paper proposes a method that extends it.

Mitsuo Yoshida, Mikio Yamamoto. Primary Content Extraction from News Pages without Training Data. DBSJ Journal (in Japanese). vol.8, no.1, pp.29-34, 2009.
