Universal PHP content parser. Parsing with PHP is easy


It often happens that you need to pull certain information out of a site, or better yet, have that information immediately added to a database or otherwise displayed on your own resource.

There are many ways to do this. For example, there is a powerful program built specifically for parsing sites, called Content Downloader. Among its disadvantages is that it is a desktop application, meaning you have to run it either from your own computer or from a remote server. The program is also paid, so using it will cost you some money (several types of licenses are available).

In addition, there is ZennoPoster, which has more advanced capabilities, since it can simulate a person working in a browser; however, it also has plenty of disadvantages.

Finally, you can write a parser in a specialized scripting language such as iMacros, but this is not always convenient, and the capabilities of such languages are very limited.

The best option is to write a PHP script that connects to the desired resource from remote hosting, through a proxy for example, and immediately adds the parsed information to the database.

What does this require? Basic knowledge of PHP, that is, the ability to work with data and a good command of the syntax, plus experience with the cURL library.

How do you extract the necessary data from a page? First, you need to download the page itself, for example with the cURL library, although the standard file_get_contents function will also work if the hosting allows remote connections via fopen (the allow_url_fopen setting). cURL, by the way, is a very powerful tool for composing POST and GET requests, using proxies, and just about anything else your heart desires, and it is installed on almost any hosting.
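A minimal sketch of such a download with cURL might look like this (the page URL and the proxy address are placeholders):

//fetch a page over HTTP and return it as a string
$ch = curl_init("https://example.com/page.html"); //placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //return the response instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); //follow redirects
//curl_setopt($ch, CURLOPT_PROXY, "127.0.0.1:8080"); //optional: route the request through a proxy
$data = curl_exec($ch);
curl_close($ch);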

Now the data needs to be processed; here you have to choose how to extract information from the page. You could use standard PHP string functions like strpos, substr, and so on, but that approach is so clumsy it is better not even to think about it.

The second thought that comes to mind is to use regular expressions. Indeed, regular expressions are an excellent way to find this or that information on a page, but there is a catch: you will have to write a lot, and you may end up writing an entire library before the code reaches a more or less readable form without losing flexibility and functionality. In other words, regular expressions are good, but not in this case.
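For comparison, even the simple job of pulling link addresses out of markup with a regular expression already looks fragile; a rough sketch (real-world markup breaks patterns like this quickly):

//naive link extraction with a regular expression
$html = '<a href="https://example.com">Example</a> <a href=\'/page2\'>Next</a>';
preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches);
print_r($matches[1]); //array of the captured href values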

Fortunately, there are ready-made libraries that let you focus on working with the page directly as a DOM (Document Object Model).

$doc = new DOMDocument();
$doc->loadHTML($data);

The first line creates the object, and the second builds a DOM from ordinary string data (which should contain the content of the page).

$searchNodes = $doc->getElementsByTagName("a");

Now the $searchNodes variable contains a DOMNodeList of the found "a" elements, which can be iterated like an array.

foreach ($searchNodes as $cur) {
    echo $cur->getAttribute("href");
}

And this code will print the values of all the href attributes (usually the address the user ends up at after clicking the link).
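Put together, the whole thing fits in a few lines (the URL is a placeholder; libxml_use_internal_errors suppresses the warnings that real-world markup would otherwise produce):

$data = file_get_contents("https://example.com"); //placeholder URL
libxml_use_internal_errors(true); //real pages rarely validate, silence the warnings
$doc = new DOMDocument();
$doc->loadHTML($data);
$searchNodes = $doc->getElementsByTagName("a");
foreach ($searchNodes as $cur) {
    echo $cur->getAttribute("href")."\n";
}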

More details about this powerful library can be found in the official documentation.

But if you want something even simpler and more convenient, take a look at the PHP Simple HTML DOM Parser library. It is very convenient and easy to learn; you can figure out what's what in literally 10-15 minutes. However, it does not work very well with some types of data.
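A minimal sketch with this library, assuming its single file simple_html_dom.php lies next to the script:

include "simple_html_dom.php";
$html = file_get_html("https://example.com"); //placeholder URL
foreach ($html->find("a") as $element) {
    echo $element->href."\n"; //attributes are exposed as magic properties
}
$html->clear(); //the library holds heavy structures in memory, free them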

There are other libraries as well, but these two are the most convenient and the easiest to learn.



How to write a parser for a website

Recently I was given the task of writing a PHP site parser. The task was completed, and this note appeared thanks to it. I had never done anything like this before, so don't judge too harshly: this is my first PHP parser.

So, where do you start solving the question "how to write a parser"? First, let's figure out what it is. In common parlance, a parser (or syntactic analyzer) is a program that receives data (for example, a web page), analyzes it in some way, structures it, makes a selection from it, and then carries out some operations with the result (writes the data to a file or a database, or displays it on the screen). We need to complete this task within the framework of web programming.

For the sake of this note, I came up with the following test task: parse the links to sites from the first 5 search results pages for a specific search query and display them on the screen. I decided to parse the results of the bing search engine. Why not write a Yandex or Google parser, you ask? Such seasoned search engines have good protection against parsing (captchas, IP bans, changing markup, cookies, and so on), and that is a topic for a separate article. Bing has no such problems in this regard. So here is what we will need to do:

  • Get (parse) the content of an HTML page using PHP
  • Extract the data we are interested in (specifically, links)
  • Parse the page navigation and get a link to the next page
  • Parse the next page using that link, get its data and the link after it
  • Repeat the above steps N times
  • Display all received links on the screen
Receiving and parsing a page

First we will write the function, then we will analyze it.

function getBingLink($link){
    $url="https://www.bing.com/search";
    //get the site content
    $content= file_get_contents($url.$link);
    //suppress markup error output
    libxml_use_internal_errors(true);
    //get an object of the DOMDocument class
    $mydom = new DOMDocument();
    //set the parser options
    $mydom->preserveWhiteSpace = false;
    $mydom->resolveExternals = false;
    $mydom->validateOnParse = false;
    //parse the HTML
    $mydom->loadHTML($content);
    //get an object of the DOMXpath class
    $xpath = new DOMXpath($mydom);
    //make a selection using xpath
    $items=$xpath->query('//*[@class="b_algo"]/h2/a');
    //display the received links in a loop
    static $a=1;
    foreach ($items as $item){
        $link=$item->getAttribute("href");
        echo $a." ".$link."<br>"; //print the counter and the link (the <br> output markup is an assumption)
        $a++;
    }
}

So let's analyze the function. To get the site content we use the PHP function file_get_contents($url.$link), substituting the request address into it. There are many other ways to get the content of an HTML page, for example cURL, but in my opinion file_get_contents is the simplest. Then we create the DOMDocument object and so on; this is all standard, and you can read more about it on the Internet. I would like to focus on the method of selecting the elements we need. For this I use xpath. You can take a look at my xpath cheat sheet. There are other selection methods, such as regular expressions, Simple HTML DOM, and phpQuery, but in my opinion it is better to understand xpath: it gives you extra options when working with XML documents, its syntax is easier than that of regular expressions, and, unlike CSS selectors, it can find an element by the text inside it. For example, let me comment on the expression //*[@class="b_algo"]/h2/a (the detailed syntax is in my xpath cheat sheet): from the entire page, we select the links located in an h2 tag inside a div with the b_algo class. Having made the selection, we get back a list of nodes and print all the received links in a loop.
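To illustrate the point about selecting by text, a query like the following (assuming the same $xpath object as in the function above) finds links by their visible text, something a CSS selector cannot do:

//select links whose visible text contains the word "Next" (illustrative query)
$nodes = $xpath->query('//a[contains(text(), "Next")]');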

Parsing page navigation and getting a link to the next page

Let's write a new function and, according to tradition, analyze it later

function getNextLink($link){
    $url="https://www.bing.com/search";
    $content= file_get_contents($url.$link);
    libxml_use_internal_errors(true);
    $mydom = new DOMDocument();
    $mydom->preserveWhiteSpace = false;
    $mydom->resolveExternals = false;
    $mydom->validateOnParse = false;
    $mydom->loadHTML($content);
    $xpath = new DOMXpath($mydom);
    //select the link that follows the active pagination button
    $page = $xpath->query('//*[@class="sb_pagS"]/../following::li/a');
    foreach ($page as $p){
        $nextlink=$p->getAttribute("href");
    }
    return $nextlink;
}

An almost identical function; only the xpath query has changed. With //*[@class="sb_pagS"]/../following::li/a we find the element with the class sb_pagS (this is the class of the active page-navigation button), go one element up in the DOM tree, take the neighboring li element that follows it, and read the link inside it. This is the link to the next page.

Parsing the results N times

Writing a function

function getFullList($link){
    static $j=1;
    getBingLink($link);
    $nlink=getNextLink($link);
    if($j < 5){ //the limit of 5 pages is assumed from the task
        $j++;
        getFullList($nlink);
    }
}

This function calls getBingLink($link) and getNextLink($link) until the counter $j reaches its limit. The function is recursive, that is, it calls itself; read more about recursion on the Internet. Note that $j is static, meaning it is not reset on the next call to the function. If this were not so, the recursion would be infinite. From experience I will also add: if you want to walk the entire page navigation, write the if condition as "while the $nlink variable is non-empty" instead of counting. There are a couple more pitfalls. If the parser runs for a long time, you may get an error because of the script execution time limit; the default is 30 s. To increase the time, set ini_set("max_execution_time", "480"); at the beginning of the file, with whatever value you need. An error can also occur because of a large number of nested calls to one function (more than 100). This is fixed by lifting the limit: put ini_set("xdebug.max_nesting_level", 0); at the beginning of the script.
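A variant that walks the entire pagination, as suggested above, might look like this (a sketch with a hypothetical name; it relies on getNextLink returning an empty value on the last page):

function getAllPages($link){
    getBingLink($link);
    $nlink=getNextLink($link);
    if(!empty($nlink)){
        getAllPages($nlink);
    }
}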

Now all that remains is to write an HTML form for entering the query and put the parser together. See the listing below.
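A minimal sketch of such a form and the glue code, assuming the three functions above are defined in the same file and that bing takes the query in the q parameter:

<?php
ini_set("max_execution_time", "480"); //let the parser run longer than the default 30 s

//getBingLink(), getNextLink() and getFullList() from above go here

if(!empty($_GET["q"])){
    //build the query string for bing and start the recursive walk
    getFullList("?q=".urlencode($_GET["q"]));
}
?>
<form method="get" action="">
    <input type="text" name="q" placeholder="search query">
    <input type="submit" value="Parse">
</form>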
