Question

How do you parse HTML/XML and extract information from it?

This is a General Reference question for the tag



Answer

Native XML Extensions

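I prefer using one of the native XML extensions because they come bundled with PHP, are usually faster than any third-party library, and give me all the control I need over the markup.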

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

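DOM is capable of parsing and modifying real-world (broken) HTML, and it can run XPath queries. It is based on libxml.

It takes some time to become productive with DOM, but that time is well worth it. Since DOM is a language-agnostic interface, you will find implementations in many languages, so if you ever need to switch programming languages, chances are you will already know how to use that language's DOM API.

A basic usage example can be found in "Grabbing the href attribute of an A element", and a general conceptual overview in "DOMDocument in PHP".

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be fairly sure that most of the issues you run into can be solved by searching or browsing Stack Overflow.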
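As a rough illustration of the points above, here is a minimal sketch that loads deliberately broken HTML with DOMDocument and pulls out link targets with an XPath query. The sample markup and the variable names are made up for this example; only the DOMDocument, DOMXPath and libxml_* calls come from PHP itself.

```php
<?php
// Hypothetical sample: the <p> and <li> elements are never closed.
$html = '<html><body><p>Links<ul>'
      . '<li><a href="/first">First</a>'
      . '<li><a href="/second">Second</a>'
      . '</ul></body></html>';

// Collect libxml parse warnings internally instead of raising PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);   // loadHTML() uses libxml's HTML parser, which tolerates broken markup
libxml_clear_errors();

// Select every href attribute that sits on an <a> element.
$xpath = new DOMXPath($dom);

$hrefs = [];
foreach ($xpath->query('//a/@href') as $attribute) {
    $hrefs[] = $attribute->nodeValue;
}

print_r($hrefs);
```

The same query could be written with getElementsByTagName('a') and getAttribute('href'); XPath is used here only because it keeps the selection logic in one expression.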

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

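XMLReader, like DOM, is based on libxml. I am not aware of a way to trigger the HTML parser module, so using XMLReader to parse broken HTML is likely to be less robust than using DOM, where you can explicitly tell it to use libxml's HTML parser module.

A basic usage example can be found in "Getting all values from h1 tags using PHP".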
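To make the cursor idea concrete, the following sketch walks a small document node by node and collects the text of every h1 element. The input is a hypothetical, well-formed fragment (XMLReader is not given broken HTML here), and the variable names are illustrative.

```php
<?php
// Hypothetical well-formed XHTML fragment.
$xml = '<html><body><h1>First</h1><p>text</p><h1>Second</h1></body></html>';

$reader = new XMLReader();
if (!$reader->XML($xml)) {
    throw new RuntimeException('Could not initialise the reader');
}

$headings = [];

// read() moves the cursor forward one node at a time.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'h1') {
        // readString() returns the text content of the current element.
        $headings[] = $reader->readString();
    }
}

$reader->close();

print_r($headings);
```

Because only the current node is inspected at each step, this style suits documents that are too large to load comfortably into a DOM tree.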

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

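The XML Parser library is also based on libxml and implements a SAX-style XML push parser. It may be a better choice than DOM or SimpleXML in terms of memory use, but it is harder to work with than the pull parser implemented by XMLReader.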
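The sketch below hints at why the push model takes more effort: the parser calls your handlers, so you have to track state yourself. It collects the text of every title element from a hypothetical document; the element names, the sample data and the state variables are assumptions made for this example.

```php
<?php
// Hypothetical sample document.
$xml = '<books>'
     . '<book><title>PHP Basics</title></book>'
     . '<book><title>XML &amp; You</title></book>'
     . '</books>';

$titles  = [];
$inTitle = false;   // state we must track ourselves
$buffer  = '';

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, string $name, array $attributes) use (&$inTitle, &$buffer) {
        if ($name === 'title') {
            $inTitle = true;
            $buffer  = '';
        }
    },
    // Called for every closing tag.
    function ($parser, string $name) use (&$inTitle, &$buffer, &$titles) {
        if ($name === 'title') {
            $inTitle  = false;
            $titles[] = $buffer;
        }
    }
);

// Text may arrive in several chunks (for example around entities), so append.
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$inTitle, &$buffer) {
        if ($inTitle) {
            $buffer .= $data;
        }
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);
```

Compare this with the XMLReader loop: there the calling code decides when to advance, whereas here the control flow is inverted and spread across callbacks.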

SimpleXml

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

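SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, do not even consider SimpleXML, because it will choke.

A basic usage example can be found in "A simple program to CRUD node and node values of xml file", and there are plenty of additional examples in the PHP Manual.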
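For well-formed input, the property and array style of access described above looks roughly like this. The catalog document and its element names are invented for the example.

```php
<?php
// Hypothetical well-formed document.
$xml = '<?xml version="1.0"?>'
     . '<catalog>'
     . '<book id="b1"><title>First</title></book>'
     . '<book id="b2"><title>Second</title></book>'
     . '</catalog>';

$catalog = simplexml_load_string($xml);
if ($catalog === false) {
    throw new RuntimeException('Input is not well-formed XML');
}

$titles = [];
foreach ($catalog->book as $book) {
    // Attributes are read with array syntax, child elements with property syntax.
    // Casting to string extracts the text value from the SimpleXMLElement.
    $titles[(string) $book['id']] = (string) $book->title;
}

print_r($titles);
```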


3rd Party Libraries (libxml based)

If you prefer to use a third-party library, I suggest picking one that actually uses DOM/libxml underneath rather than string parsing.

FluentDom

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery (not updated for years)

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides additional Command Line Interface (CLI).

另请参阅: https://github.com/electrolinux/phpquery

Zend_Dom

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.


3rd-Party (not libxml-based)

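The benefit of building on DOM/libxml is that you get good performance out of the box, because you are building on a native extension. However, not all third-party libraries go down this route. Some of them are listed below.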

PHP Simple HTML DOM Parser

  • An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
  • Require PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

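I generally do not recommend this parser. The codebase is horrible, and the parser itself is rather slow and memory-hungry. Not all jQuery selectors (such as child selectors) are supported. Any of the libxml-based libraries should outperform it easily.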

PHP Html Parser

PHPHtmlParser is a simple, flexible HTML parser which allows you to select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape HTML, whether it's valid or not! This project was originally supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.

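Again, I would not recommend this parser. It is rather slow and has high CPU usage. There is also no function to free the memory of DOM objects it has created. These problems get worse particularly with nested loops. The documentation itself is inaccurate and contains spelling mistakes, and proposed fixes have had no response since April 14.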

Ganon

  • A universal tokenizer and HTML/XML/RSS DOM Parser
    • Ability to manipulate elements and their attributes
    • Supports invalid HTML and UTF8
  • Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)
  • A HTML beautifier (like HTML Tidy)
    • Minify CSS and Javascript
    • Sort attributes, change character case, correct indentation, etc.
  • Extensible
    • Parsing documents using callbacks based on current character/token
    • Operations separated in smaller functions for easy overriding
  • Fast and Easy

I have never used it, so I cannot tell whether it is any good.


HTML 5

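You can use the above for parsing HTML5, but there can be quirks because of the markup HTML5 allows. So for HTML5 you may want to consider a dedicated parser, such as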

html5lib

Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

Once HTML5 is finalized, we may see more dedicated parsers. There is also a blog post by the W3C titled "How-To for html 5 parsing" that is worth a look.


WebServices

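If you do not feel like programming PHP, you can also use web services. In general, I have found very little use for these, but that is just me and my use cases.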

YQL

The YQL Web Service enables applications to query, filter, and combine data from different sources across the Internet. YQL statements have a SQL-like syntax, familiar to any developer with database experience.

ScraperWiki

ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.


Regular Expressions

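Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged.

Most of the snippets you will find on the web for matching markup are brittle. In most cases they only work for one very specific piece of HTML. Tiny markup changes, such as adding whitespace somewhere, or adding or changing attributes in a tag, can make a regular expression fail if it is not written properly. You should know what you are doing before using regular expressions on HTML.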
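The brittleness is easy to reproduce. The sketch below runs a deliberately naive pattern, of the kind often seen in snippets, against four hypothetical variants of the same link, and then reads the same attribute through the DOM extension for comparison. The pattern and the sample strings are made up for this illustration.

```php
<?php
// A naive pattern that only matches one exact way of writing the tag.
$naivePattern = '/<a href="([^"]*)">/';

// Four equivalent ways of writing the same link (hypothetical samples).
$variants = [
    '<a href="/page">Page</a>',             // the form the pattern was written for
    '<a  href="/page">Page</a>',            // one extra space
    '<a class="x" href="/page">Page</a>',   // an added attribute
    "<a href='/page'>Page</a>",             // single-quoted attribute value
];

$regexHits = [];
$domHrefs  = [];

libxml_use_internal_errors(true);

foreach ($variants as $html) {
    // Regex approach: record whether the pattern matched at all.
    $regexHits[] = preg_match($naivePattern, $html) === 1;

    // Parser approach: let libxml's HTML parser deal with the syntax.
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $domHrefs[] = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
}

libxml_clear_errors();

var_dump($regexHits);
print_r($domHrefs);
```

Only the first variant satisfies the pattern, while the parser reports the same href for all four. A more careful pattern could cover these particular cases, but each new variation has to be anticipated by hand.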

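HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules anew for each expression you write. Regular expressions are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the libraries mentioned above already exist and do a much better job.

Also see "Parsing Html The Cthulhu Way".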


Books

If you want to spend some money, have a look at

I am not affiliated with PHP Architect or the authors.



