3/19/2023

Website meta data extractor

This article goes into the nitty-gritty details of how reading mode parses certain metadata about articles from webpages. This article is part three in a series on web reading mode and reading mode parsers.

The history of reading mode, a look at the different parsers we have today and how they came to be, and a small criticism of the Apache 2.0 license. There are many approaches to content analysis and extraction, and most only work well with English content. Why is reader mode so slow to activate, anyway? Title, author, and date metadata extraction. Visual page inspections, standard metadata, or guesswork? Everyone has their own ideas about how to best determine the metadata describing an article. Inconsistent and bad reading experience. Encourage publishers to fix their designs, and standardize reading mode now.

Every reading mode wants to display an appropriate title for the displayed article. However, some also include a byline. A byline is often the very first paragraph below the headline of an article, crediting the article's author. Bylines often include the publication time as well as the name of the author.

The following table gives an overview of which implementations use which parsers, the metadata they extract, and the method they use to extract it.

The ultimate Reading Mode metadata-extraction compatibility table

Dates, when included, are displayed in the format they’re in when detected on the page. None of the parsers try to parse the date string or do any kind of localization on it.

The rest of the article will go into details about each of the above parsers and how they pick out metadata. Before I get into that, I’ll quickly address a common problem which may very well be the reason why you’re reading this article.

You may see an unwanted (in reading mode implementations that don’t support it) or duplicated (in reading mode implementations that do) byline or publication date. It can be difficult to work out how to get rid of these. The following markup will solve this problem in all reading mode implementations, assuming that the parser in question can properly detect both the author and publication time. This works even in browsers that don’t make use of the byline.

Note that the above isn’t a one-stop answer to proper metadata extraction in all reading mode implementations. However, the above markup example will allow the parser to remove the byline paragraph. The dateline or byline will be removed from the document (assuming they’ve matched the value held by the parser), leaving a paragraph of fewer than 25 characters which, as discussed in Part 1, will also be removed from the document.

Now that that’s out of the way, let’s get into the real nitty-gritty details of reading mode metadata extraction.

Author name in Mozilla Readability, Maxthon Reader, and Mercury Reader

This is the most straightforward and even web-standard-compliant detection method. Mozilla Readability, Maxthon Reader, and Mercury Reader will find the first instance of a meta element in the document and use its value as the primary candidate for the author name. The example below will set the author name to “Cave Johnson”:

Mozilla Readability and Maxthon Reader also support other detection mechanisms including *, but these methods are unreliable and difficult to work with, as they involve DOM-distance and depth-difference calculations from the perceived beginning of the article body text.
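The meta-element approach to author detection described in this article can be sketched in a few lines of Python. This is only an illustration, not the code any of the named parsers actually run. It assumes the meta element in question is the standard HTML `<meta name="author">` (the exact markup did not survive in the text), and the names `AuthorMetaParser` and `extract_author` are hypothetical. Skipping elements with empty content is a choice made for this sketch.

```python
from html.parser import HTMLParser
from typing import Optional


class AuthorMetaParser(HTMLParser):
    """Hypothetical helper: remembers the first <meta name="author"> value."""

    def __init__(self) -> None:
        super().__init__()
        self.author: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching element counts; later ones are ignored.
        if tag != "meta" or self.author is not None:
            return
        attributes = dict(attrs)
        # Attribute values may be None when the attribute has no value.
        name = (attributes.get("name") or "").strip().lower()
        content = (attributes.get("content") or "").strip()
        # Assumption of this sketch: elements with empty content are skipped.
        if name == "author" and content:
            self.author = content


def extract_author(html: str) -> Optional[str]:
    """Return the first author meta value found in the markup, or None."""
    parser = AuthorMetaParser()
    parser.feed(html)
    parser.close()
    return parser.author


if __name__ == "__main__":
    page = '<head><meta name="author" content="Cave Johnson"></head>'
    print(extract_author(page))
```

The value is returned exactly as written in the markup, in line with the article's point that the parsers display metadata as found rather than normalizing it.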