ContentExtraction

Extracts the main content from HTML documents.

This package was implemented for the comparison experiments in our paper:

Mitsuo Yoshida, Takashi Inui, Mikio Yamamoto. Automatic Extraction of Blog Posts and Comments from Blog Pages. IPSJ Journal (in Japanese), vol.54, no.12, pp.2502-2512, 2013.

Classes

ContentExtractionUsingLossRatio

This class implements the algorithm described in the following paper:

Donglin Cao, Xiangwen Liao, Hongbo Xu, Shuo Bai. Blog post and comment extraction using information quantity of web format. In Information Retrieval Technology: 4th Asia Information Retrieval Symposium, pp.298-309, 2008.

ContentExtractionUsingMiBAT

This class implements the algorithm described in the following paper:

Xinying Song, Jing Liu, Yunbo Cao, Chin-Yew Lin, Hsiao-Wuen Hon. Automatic extraction of web data records containing user-generated content. In Proceedings of the 19th ACM international conference on Information and knowledge management, pp.39-48, 2010.

License

This package is licensed under the terms of the MIT license. Please see LICENSE.txt for details.

Other Tools

We also used the following tools in the comparison experiments.

CETR

Tim Weninger, William H. Hsu, Jiawei Han. CETR - Content Extraction via Tag Ratios. Proceedings of the 19th International Conference on World Wide Web (WWW '10), pp.971-980, 2010.

webpage_segmentation

Nikolaos Pappas, Georgios Katsimpras, Efstathios Stamatatos. Extracting Informative Textual Parts from Web Pages Containing User-Generated Content. Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies (i-KNOW '12), 2012.

ExtractUniqueBlock

This is an earlier method by the authors of this package. The method proposed in our paper extends it.

Mitsuo Yoshida, Mikio Yamamoto. Primary Content Extraction from News Pages without Training Data. DBSJ Journal (in Japanese), vol.8, no.1, pp.29-34, 2009.
