Parsing an XML file



Publication of this article is permitted only with a link to the author's site.

In this article I will show an example of how to parse a large XML file. If your server (hosting) does not prohibit increasing the script's running time, you can parse XML files of a gigabyte or more; I have personally only parsed files from Ozon of about 450 megabytes.

There are two problems when parsing large XML files:
1. There is not enough memory.
2. The script is not given enough time to run.

The time problem can be solved, provided the server does not prohibit it.
The memory problem is harder. Even on your own server, holding files of 500 megabytes in memory is not easy, and on shared hosting or a VDS it is often simply impossible to raise the memory limit.

PHP has several built-in XML processing options - SimpleXML, DOM, SAX.
All of these options are detailed in many example articles, but all of the examples show how to work with a complete XML document.

Here is one example: we get an object from an XML file.
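
A minimal sketch of such an example, assuming SimpleXML's simplexml_load_file() and a small hypothetical file that is written to a temporary path only so that the snippet is self-contained:

```php
<?php
// Hypothetical sample file, created only for this demonstration.
$file = tempnam(sys_get_temp_dir(), "xml");
file_put_contents($file, "<recipe><title>Simple bread</title></recipe>");

// simplexml_load_file() reads and parses the whole document at once,
// so the complete tree is held in memory.
$xml = simplexml_load_file($file);

echo $xml->title, PHP_EOL; // prints the text of <title>
unlink($file);
```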

Now you can process this object, BUT...
With this approach the entire XML file is read into memory and then parsed into an object.
All of the data ends up in memory, and if the allocated memory is not enough, the script stops.

This approach is not suitable for large files. You need to read the file piece by piece and process the data in turn.
In that case the XML is also checked only as the data is processed, so you either need to be able to roll back (for example, delete everything already written to the database if the XML turns out to be invalid) or make two passes through the file: the first to check validity, the second to process the data.
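
The first of the two passes could be sketched as follows. This assumes the XMLReader extension is available; is_well_formed() is a hypothetical helper name, and the check only covers well-formedness, not validation against a schema.

```php
<?php
// First pass: stream through the file and report whether it is well formed.
// XMLReader reads node by node, so memory use stays small.
function is_well_formed(string $file): bool
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    libxml_clear_errors();

    $reader = new XMLReader();
    $ok = $reader->open($file);
    if ($ok) {
        while ($reader->read()) {
            // nothing to do: we only want the parser to walk the document
        }
        $ok = count(libxml_get_errors()) === 0;
        $reader->close();
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}
```

Only if this returns true would the second, data-processing pass be started.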

Here is a theoretical example of parsing a large XML file.
This script reads the file one character at a time, collects the characters into blocks and sends the blocks to the XML parser.
This approach completely solves the memory problem and puts no load on the server, but it makes the running-time problem worse. Ways of dealing with the running time are described below.

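<?php
function webi_xml($file)
{

########
### data handling function
function data ($parser , $data )
{
print $data ;
}
############################################

########
### opening tag function
function startElement ($parser , $name , $attrs )
{
print $name ;
print_r($attrs);
}
############################################

## closing tag function
function endElement ($parser , $name )
{
print $name ;
}
############################################

$xml_parser = xml_parser_create();
xml_parser_set_option ($xml_parser , XML_OPTION_CASE_FOLDING , true );

// specify which functions handle opening and closing tags
xml_set_element_handler($xml_parser , "startElement" , "endElement" );

// specify the function that handles data
xml_set_character_data_handler($xml_parser , "data" );

// open file
$fp = fopen($file , "r" );

$perviy_vxod = 1 ; // flag: first entry into the file
$data = "" ; // data collected from the file for the parser

// loop until the end of the file is reached
while ($fp and ! feof ($fp ))
{
// read one character and add it to the collected data
$simvol = fgetc($fp); $data .= $simvol ;

// keep collecting until the end of a tag is found
if($simvol != ">" ) { continue; }

// on the first entry, drop everything before <?xml
if($perviy_vxod ) { $data = strstr ($data , "<?xml" ); $perviy_vxod = 0 ; }

// send the collected data to the parser
if (! xml_parse ($xml_parser , $data , feof ($fp )))
{
echo "\nXML Error: " .xml_error_string (xml_get_error_code ($xml_parser ));
echo " at line " . xml_get_current_line_number($xml_parser );
break;
}

$data = "" ;
}
fclose($fp);
xml_parser_free($xml_parser );
}

webi_xml("1.xml");

?>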

In this example I put everything into a single function, webi_xml(); its call is at the very bottom.
The script itself consists of three main functions:
1. startElement(), which catches the opening of a tag
2. endElement(), which catches the closing of a tag
3. data(), which receives the data.

Let's assume that file 1.xml contains a recipe:

<?xml version="1.0"?>
<recipe>
<title>Simple bread</title>
<ingredient amount="3" unit="glass">Flour</ingredient>
<ingredient amount="0.25" unit="gram">Yeast</ingredient>
<ingredient amount="1.5" unit="glass">Warm water</ingredient>
<ingredient amount="1" unit="teaspoon">Salt</ingredient>
<instructions>
<step>Mix all ingredients and knead thoroughly.</step>
<step>Cover with a cloth and leave for one hour in a warm room.</step>
<step>Knead again, put on a baking sheet and put in the oven.</step>
</instructions>
</recipe>


We start by calling the main function, webi_xml("1.xml");
Inside this function the parser is created, and all tag names are converted to upper case so that they all have the same case.

$xml_parser = xml_parser_create();
xml_parser_set_option ($xml_parser , XML_OPTION_CASE_FOLDING , true );

Now we specify which functions will handle opening tags, closing tags and data:

xml_set_element_handler($xml_parser , "startElement" , "endElement" );
xml_set_character_data_handler($xml_parser , "data" );

Next the file is opened and read one character at a time; each character is appended to a string variable until the character > is found.
If this is the very first read from the file, everything superfluous at the beginning is removed along the way: everything that stands before <?xml, the tag an XML document should start with.
The first time, the string variable collects the string
<?xml version="1.0"?>
and sends it to the parser:
xml_parse ($xml_parser , $data , feof ($fp ));
After the data is processed, the string variable is reset and collection starts again. The second time the string is
<recipe>
the third time
<title>
and the fourth time
Simple bread</title>

Note that the string variable always ends with a completed tag (>). It is not necessary to send an opening tag, its data and its closing tag together, for example
<title>Simple bread</title>
This script always hands the parser whole, unbroken tags. It can be one opening tag now and the closing tag in the next step, or 1000 lines of the file at once; what the script avoids is cutting a tag in two, for example
<tit
le>Simple bread
The script never sends data in this form, because the tag is broken.
You can come up with your own way of sending data to the parser, for example collecting 1 megabyte at a time to increase speed; just make sure each portion ends on a complete tag. The data between tags may be split:
Simple
bread

Thus you can send a large file to the parser in parts of whatever size you wish.
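
For comparison, here is a sketch of the collect-a-block-and-send-it idea using fread(). It assumes, as the fread()-based examples in the PHP manual for xml_parse() do, that the parser keeps incomplete input between calls, so fixed-size blocks can be passed as they are; if a particular PHP build misbehaves on block boundaries, cutting at > as described above remains the fallback. The function name parse_in_chunks() and the default block size are illustrative, not part of the original script.

```php
<?php
// Parse $file in fixed-size chunks and return the element names seen, in order.
function parse_in_chunks(string $file, int $chunkSize = 1048576): array
{
    $names = [];
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, true);
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$names) { $names[] = $name; },
        function ($p, $name) { /* closing tags are ignored in this sketch */ }
    );

    $fp = fopen($file, "r");
    if ($fp === false) {
        throw new RuntimeException("Cannot open $file");
    }
    try {
        while (!feof($fp)) {
            $chunk = fread($fp, $chunkSize);
            if ($chunk === false) {
                throw new RuntimeException("Read error in $file");
            }
            // The third argument tells the parser whether this is the last chunk.
            if (!xml_parse($parser, $chunk, feof($fp))) {
                throw new RuntimeException(sprintf(
                    "XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($parser)),
                    xml_get_current_line_number($parser)
                ));
            }
        }
    } finally {
        fclose($fp);
    }
    return $names;
}
```

Reading a megabyte per call instead of one character removes most of the per-character overhead, which is where the original loop spends its time.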

Now let's look at how this data is processed and how to get at it.

We start with the opening-tag function startElement ($parser , $name , $attrs )
Let's assume that processing has reached the line
<ingredient amount="3" unit="glass">Flour</ingredient>
Inside the function, the variable $name will be INGREDIENT, that is, the name of the opened tag, in upper case because case folding is enabled (the closing of the tag has not been reached yet).
The $attrs array with the attributes of this tag is also available; it contains AMOUNT = "3" and UNIT = "glass" (attribute names are folded to upper case too).

Next, the data of the opened tag is processed by the function data ($parser , $data )
The $data variable contains everything between the opening and closing tags, in our case the text Flour

Processing of our line ends with the function endElement ($parser , $name )
Here $name is the name of the closed tag, in our case INGREDIENT

After that, the whole cycle repeats for the next tag.
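
The order of the three calls can be checked with a short sketch that records every handler call for this one line. Closures are used here instead of named functions only to keep the snippet self-contained; the $events array is an illustrative name.

```php
<?php
// Record every handler call made while parsing one <ingredient> element.
$events = [];
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, true);
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$events) { $events[] = ["start", $name, $attrs]; },
    function ($p, $name) use (&$events) { $events[] = ["end", $name]; }
);
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$events) { $events[] = ["data", $data]; }
);

xml_parse($parser, '<ingredient amount="3" unit="glass">Flour</ingredient>', true);

// First the opening tag with its attributes, then the data, then the closing tag.
print_r($events);
```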

The example above only demonstrates the principle of XML processing; for real use it needs to be extended.
Usually large XML files are parsed in order to load data into a database, and to process the data properly you need to know which open tag the data belongs to, how deeply the tag is nested, and which tags are open above it in the hierarchy. With this information you can process the file correctly without any problems.
To do this, you need to introduce several global variables that collect information about the open tags, the nesting level and the data.
Here is an example that can be used:

<?php
function webi_xml($file)
{
global $webi_depth ; // counter to keep track of nesting depth
$webi_depth = 0 ;
global $webi_tag_open ; // will contain an array of currently open tags
$webi_tag_open = array();
global $webi_data_temp ; // this array will contain the data of one tag
$webi_data_temp = array();

####################################################
### data handling function
function data ($parser , $data )
{
global $webi_depth ;
global $webi_tag_open ;
global $webi_data_temp ;
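// append data to the array entry for the current nesting level and currently opened tag
// (the entry may have been cleared already, hence the ?? "" default)
$webi_data_temp [ $webi_depth ][ $webi_tag_open [ $webi_depth ]][ "data" ] = ($webi_data_temp [ $webi_depth ][ $webi_tag_open [ $webi_depth ]][ "data" ] ?? "" ) . $data ;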
}
############################################

####################################################
### opening tag function
function startElement ($parser , $name , $attrs )
{
global $webi_depth ;
global $webi_tag_open ;
global $webi_data_temp ;

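// if the nesting level is not zero, then one tag is already open
// and the data collected for it so far is in the array and can be processed
if ($webi_depth)
{
// information about the tag that is still open
print "data " . $webi_tag_open [ $webi_depth ]. "--" .($webi_data_temp [ $webi_depth ][ $webi_tag_open [ $webi_depth ]][ "data" ] ?? "" ). "\n" ;
print_r ($webi_data_temp [ $webi_depth ][ $webi_tag_open [ $webi_depth ]][ "attrs" ] ?? array()); // array of attributes
print "\n" ;
print_r($webi_tag_open); // array of open tags
print "\n\n" ;

// after processing the data, delete them to free memory
unset($GLOBALS [ "webi_data_temp" ][ $webi_depth ]);
}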

// now the opening of the next tag has begun and further processing will occur in the next step
$webi_depth++; // increase nesting

$webi_tag_open [ $webi_depth ]= $name ; // add open tag to info array
$webi_data_temp [ $webi_depth ][ $name ][ "attrs" ]= $attrs ; // now add tag attributes

}
###############################################

#################################################
## closing tag function
function endElement ($parser , $name ) {
global $webi_depth ;
global $webi_tag_open ;
global $webi_data_temp ;

// data processing starts here, for example, adding to the database, saving to a file, etc.
// $webi_tag_open contains a chain of open tags by nesting level
// for example $webi_tag_open[$webi_depth] contains the name of the open tag whose information is currently being processed
// $webi_depth tag nesting level
// $webi_data_temp[$webi_depth][$webi_tag_open[$webi_depth]]["attrs"] array of tag attributes
// $webi_data_temp[$webi_depth][$webi_tag_open[$webi_depth]]["data"] tag data

print "data " . $webi_tag_open [ $webi_depth ]. "--" .($webi_data_temp [ $webi_depth ][ $webi_tag_open [ $webi_depth ]][ "data" ] ?? "" ). "\n" ;
print_r ($webi_data_temp [ $webi_depth ][ $webi_tag_open [ $webi_depth ]][ "attrs" ] ?? array());
print "\n" ;
print_r($webi_tag_open);
print "\n\n" ;

unset($GLOBALS [ "webi_data_temp" ]); // after processing the data, delete the array with the data as a whole, since the tag was closed
unset($GLOBALS [ "webi_tag_open" ][ $webi_depth ]); // remove information about this opened tag, since it has been closed

$webi_depth --; // reduce nesting
}
############################################

$xml_parser = xml_parser_create();
xml_parser_set_option ($xml_parser , XML_OPTION_CASE_FOLDING , true );

// specify which functions will work when opening and closing tags
xml_set_element_handler($xml_parser , "startElement" , "endElement" );

// specify a function for working with data
xml_set_character_data_handler($xml_parser , "data" );

// open file
$fp = fopen($file , "r" );

$perviy_vxod = 1 ; // flag for checking the first input to the file
$data = "" ; // here we collect parts of the data from the file and send it to the xml parser

// loop until the end of the file is reached
while ($fp and ! feof ($fp ))
{
$simvol = fgetc($fp); // read one character from file
$data .= $simvol ; // add this character to the data to be sent

// if the character does not end a tag, go back to the start of the loop and add one more character, and so on until the end of a tag is found
if($simvol != ">" ) { continue; }
// the end of a tag was found, so the collected data can now be sent for processing

// on the first entry into the file, delete everything before the <?xml tag,
// since sometimes there is garbage before the beginning of the XML (clumsy editors, or the file was received by the script from another server)
if($perviy_vxod ) { $data = strstr ($data , "<?xml" ); $perviy_vxod = 0 ; }

// now we pass the data to the xml parser
if (! xml_parse ($xml_parser , $data , feof ($fp ))) {

// here you can handle errors in the XML...
// as soon as an error is encountered, parsing stops
echo "\nXML Error: " .xml_error_string (xml_get_error_code ($xml_parser ));
echo " at line " . xml_get_current_line_number($xml_parser );
break;
}

// after parsing, we throw off the collected data for the next step of the loop.
$data = "" ;
}
fclose($fp);
xml_parser_free($xml_parser );
// delete global variables
unset($GLOBALS [ "webi_depth" ]);
unset($GLOBALS [ "webi_tag_open" ]);
unset($GLOBALS [ "webi_data_temp" ]);
}

webi_xml("1.xml");

?>

The whole example is commented; now test and experiment.
Please note that in the data handling function the data is not simply assigned to the array but appended with ".=", because the text of one tag may arrive in several pieces. With a plain assignment you would from time to time end up with only a fragment of the data.
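
A small sketch of why appending is needed: one element is fed to the parser in two portions and every call of the data handler is recorded. How exactly the parser splits the text is up to the underlying library, so the sketch makes no assumption about the number of pieces; it only shows that joining them restores the full text. The names $pieces and the sample text are illustrative.

```php
<?php
// Feed one element to the parser in two portions and record every data() call.
$pieces = [];
$parser = xml_parser_create();
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$pieces) { $pieces[] = $data; }
);

xml_parse($parser, "<title>Salt &amp; warm ", false);
xml_parse($parser, "water</title>", true);

print_r($pieces);                    // the individual pieces, as delivered
echo implode("", $pieces), PHP_EOL;  // the complete text of the element
```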

That's all: there will now be enough memory for a file of any size. The script's running time can be increased in several ways.
Insert one of these calls at the beginning of the script:
set_time_limit(6000);
or
ini_set("max_execution_time" , "6000" );

Or add a line to the .htaccess file (this works only when PHP runs as an Apache module):
php_value max_execution_time 6000

Each of these raises the script's running-time limit to 6000 seconds.
The limit can be raised this way only when safe mode is off (safe mode was removed entirely in PHP 5.4).

If you have access to php.ini, you can raise the limit there:
max_execution_time = 6000

On masterhost hosting, for example, at the time of writing it is prohibited to increase the script's running time even though safe mode is disabled. If you are a pro you can build your own PHP on masterhost, but that is outside the scope of this article.
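
Because a host may silently refuse the change, it can be worth checking the result. A minimal sketch, assuming only the standard set_time_limit() and ini_get() functions:

```php
<?php
// Try to raise the limit and report what is actually in effect.
$requested = 6000;
$accepted = set_time_limit($requested); // false if the change was not allowed

$effective = (int) ini_get("max_execution_time"); // 0 means "no limit"
if ($accepted) {
    echo "Limit is now ", $effective, " seconds", PHP_EOL;
} else {
    echo "Could not change the limit; it is still ", $effective, " seconds", PHP_EOL;
}
```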

Some of the examples in this guide include an XML string. Instead of repeating it in every example, put the string in a file and include that file in each example. The string is shown in the following example. Alternatively, you can create an XML document and read it with the function simplexml_load_file().

Example #1 Example.php file with XML string

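<?php
$xmlstr = <<<XML
<?xml version='1.0' standalone='yes'?>
<movies>
 <movie>
  <title>PHP: The Parser Appears</title>
  <characters>
   <character>
    <name>Ms. Coder</name>
    <actor>Onlivia Actora</actor>
   </character>
   <character>
    <name>Mr. Coder</name>
    <actor>El Act&#211;r</actor>
   </character>
  </characters>
  <plot>
   Thus, it is a language. It's still a programming language. Or
   is it a scripting language? It's all revealed in this documentary
   similar to a horror movie.
  </plot>
  <great-lines>
   <line>PHP solves all my problems on the web</line>
  </great-lines>
  <rating type="thumbs">7</rating>
  <rating type="stars">5</rating>
 </movie>
</movies>
XML;
?>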

SimpleXML is very easy to use! Try getting some string or number from the underlying XML document.

Example #2 Getting a part of a document

<?php
include "example.php" ;

$movies = new SimpleXMLElement($xmlstr );

echo $movies -> movie [ 0 ]-> plot ;
?>

The result of running this example:

Thus, it is a language. It's still a programming language. Or is it a scripting language? It's all revealed in this documentary similar to a horror movie.

In PHP, an element whose name contains characters that are not permitted in PHP identifiers (such as a hyphen) can be accessed by enclosing the element name in curly braces and quotes.

Example #3 Getting a string

<?php
include "example.php" ;

$movies = new SimpleXMLElement($xmlstr );

echo $movies -> movie ->{ "great-lines" }-> line ;
?>

The result of running this example:

PHP solves all my problems on the web

Example #4 Accessing non-unique elements in SimpleXML

When a parent element contains several child elements with the same name, the standard iteration techniques apply.

<?php
include "example.php" ;

$movies = new SimpleXMLElement($xmlstr );

/* For each <character> node, we print the <name> separately. */
foreach ($movies -> movie -> characters -> character as $character ) {
echo $character -> name , " plays " , $character -> actor , PHP_EOL ;
}

?>

The result of running this example:

Ms. Coder plays Onlivia Actora
Mr. Coder plays El ActÓr

Note:

Properties ( $movies->movie in the previous example) are not arrays. They are iterable objects that can be accessed like arrays.
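
A short sketch of what this means in practice, using a small inline document rather than example.php so that it runs on its own. The variable names are illustrative.

```php
<?php
$movies = new SimpleXMLElement(
    "<movies><movie><title>A</title></movie><movie><title>B</title></movie></movies>"
);

$list = $movies->movie;       // a SimpleXMLElement, not an array
var_dump(is_array($list));    // bool(false)
var_dump(count($list));       // int(2)

// Copy the values into a real array when array functions are needed.
$titles = [];
foreach ($list as $movie) {
    $titles[] = (string) $movie->title;
}
print_r($titles);
```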

Example #5 Using attributes

So far, we have only read the names and values of elements. SimpleXML can also access element attributes. An attribute of an element is accessed in the same way as an element of an array.

<?php
include "example.php" ;

$movies = new SimpleXMLElement($xmlstr );

/* Access the <rating> nodes of the first <movie>.
 * We will also display the rating scale. */
foreach ($movies -> movie [ 0 ]-> rating as $rating ) {
switch((string) $rating [ "type" ]) { // Get the attribute as an array index
case "thumbs" :
echo $rating , " thumbs up" ;
break;
case "stars" :
echo $rating , " stars" ;
break;
}
}
?>

The result of running this example:

7 thumbs up5 stars

Example #6 Comparing elements and attributes with text

To compare an element or attribute with a string, or to pass it to a function as text, you must cast it to a string using (string). Otherwise, PHP will treat the element as an object.

<?php
include "example.php" ;

$movies = new SimpleXMLElement($xmlstr );

if ((string) $movies -> movie -> title == "PHP: The Parser Appears") {
print "My favorite movie.";
}

echo htmlentities ((string) $movies -> movie -> title );
?>

The result of running this example:

My favorite movie.PHP: The Parser Appears
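
The effect of the cast is easiest to see with strict comparisons. A minimal sketch on a small inline document (the $known array is illustrative):

```php
<?php
$movie = new SimpleXMLElement("<movie><title>PHP: The Parser Appears</title></movie>");

// Without the cast the left-hand side is an object, so a strict comparison fails.
var_dump($movie->title === "PHP: The Parser Appears");          // bool(false)
var_dump((string) $movie->title === "PHP: The Parser Appears"); // bool(true)

// The same applies to functions that compare strictly.
$known = ["PHP: The Parser Appears"];
var_dump(in_array($movie->title, $known, true));          // bool(false)
var_dump(in_array((string) $movie->title, $known, true)); // bool(true)
```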

Example #7 Comparing two elements

Two SimpleXMLElement objects are considered different even if they point to the same element. This has been the case since PHP 5.2.0.

<?php
include "example.php" ;

$movies1 = new SimpleXMLElement($xmlstr );
$movies2 = new SimpleXMLElement($xmlstr );
var_dump ($movies1 == $movies2 ); // false since PHP 5.2.0
?>

The result of running this example:

bool(false)

Example #8 Using XPath

SimpleXML includes built-in XPath support. To find all <character> elements:

<?php
include "example.php" ;

$movies = new SimpleXMLElement($xmlstr );

foreach ($movies -> xpath("//character" ) as $character ) {
echo $character -> name , " plays " , $character -> actor , PHP_EOL ;
}
?>

"//" serves as a wildcard. To specify an absolute path, omit one of the slashes.

The result of running this example:

Ms. Coder plays Onlivia Actora
Mr. Coder plays El ActÓr
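
The difference between the wildcard and an absolute path can be sketched on a small inline document (used here instead of example.php so that the snippet runs on its own):

```php
<?php
$movies = new SimpleXMLElement(
    "<movies><movie><characters>"
    . "<character><name>Ms. Coder</name></character>"
    . "<character><name>Mr. Coder</name></character>"
    . "</characters></movie></movies>"
);

$anywhere = $movies->xpath("//character");                        // at any depth
$absolute = $movies->xpath("/movies/movie/characters/character"); // exact path from the root

echo count($anywhere), " ", count($absolute), PHP_EOL; // 2 2
```

In this document both queries select the same two nodes; they differ only when <character> elements also occur elsewhere in the tree.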

Example #9 Setting values

Data in SimpleXML does not have to be constant. The object allows you to change all of its elements.

<?php
include "example.php" ;
$movies = new SimpleXMLElement($xmlstr );

$movies -> movie [ 0 ]-> characters -> character [ 0 ]-> name = "Miss Coder" ;

echo $movies -> asXML();
?>

The result of running this example is the document from example.php with the first character renamed (only the changed fragment is shown):

   <character>
    <name>Miss Coder</name>
    <actor>Onlivia Actora</actor>
   </character>

Example #10 Adding elements and attributes

As of PHP 5.1.3, SimpleXML can easily add child elements and attributes.

<?php
include "example.php" ;
$movies = new SimpleXMLElement($xmlstr );

$character = $movies -> movie [ 0 ]-> characters -> addChild("character" );
$character -> addChild("name" , "Mr. Parser" );
$character -> addChild("actor" , "John Doe" );

$rating = $movies -> movie [ 0 ]-> addChild("rating" , "PG" );
$rating -> addAttribute ("type" , "mpaa" );

echo $movies -> asXML();
?>

The result of running this example is the document from example.php with the new nodes added (only the added fragments are shown):

<character><name>Mr. Parser</name><actor>John Doe</actor></character>
<rating type="mpaa">PG</rating>

Example #11 Interacting with the DOM

PHP can convert XML nodes between the SimpleXML and DOM formats. This example shows how a DOM element can be converted to SimpleXML.

<?php
$dom = new DOMDocument ;
$dom -> loadXML( "<books><book><title>nonsense</title></book></books>" );
if (! $dom ) {
echo "Error parsing document";
exit;
}

$books = simplexml_import_dom($dom );

echo $books -> book [ 0 ]-> title ;
?>

The result of running this example:

nonsense

4 years ago

There is a common "trick" often proposed to convert a SimpleXML object to an array, by running it through json_encode() and then json_decode(). I'd like to explain why this is a bad idea.

Most simply, because the whole point of SimpleXML is to be easier to use and more powerful than a plain array. For instance, you can write $foo->bar->baz["bing"] and it means the same thing as $foo->bar[0]->baz[0]["bing"], regardless of how many bar or baz elements there are in the XML; and if you write (string)$foo->bar[0]->baz[0] you get all the string content of that node - including CDATA sections - regardless of whether it also has child elements or attributes. You also have access to namespace information, the ability to make simple edits to the XML, and even the ability to "import" into a DOM object, for much more powerful manipulation. All of this is lost by turning the object into an array rather than reading and understanding the examples on this page.

Additionally, because it is not designed for this purpose, the conversion to JSON and back will actually lose information in some situations. For instance, any elements or attributes in a namespace will simply be discarded, and any text content will be discarded if an element also has children or attributes. Sometimes, this won't matter, but if you get in the habit of converting everything to arrays, it's going to sting you eventually.

Of course, you could write a smarter conversion, which didn't have these limitations, but at that point, you are getting no value out of SimpleXML at all, and should just use the lower level XML Parser functions, or the XMLReader class, to create your structure. You still won't have the extra convenience functionality of SimpleXML, but that's your loss.
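
One of the losses described above, namespaced elements, can be reproduced with a short sketch. The document and the dc prefix are illustrative; the exact shape of the resulting array may differ between PHP versions, so only the missing element is checked.

```php
<?php
$xml = new SimpleXMLElement(
    '<feed xmlns:dc="http://purl.org/dc/elements/1.1/">'
    . '<title>Languages</title>'
    . '<dc:creator>Rasmus Lerdorf</dc:creator>'
    . '</feed>'
);

// The json_encode()/json_decode() round trip only sees the un-namespaced children.
$array = json_decode(json_encode($xml), true);
var_dump(array_key_exists("creator", $array)); // bool(false): dc:creator is gone

// Reading it through SimpleXML keeps the information.
$dc = $xml->children("http://purl.org/dc/elements/1.1/");
echo $dc->creator, PHP_EOL; // Rasmus Lerdorf
```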

2 years ago

If your XML string contains booleans encoded as "0" and "1", you will run into problems when you cast the element directly to bool:

$xmlstr = <<<XML
<values>
<truevalue>1</truevalue>
<falsevalue>0</falsevalue>
</values>
XML;
$values = new SimpleXMLElement($xmlstr);
$truevalue = (bool)$values->truevalue; // true
$falsevalue = (bool)$values->falsevalue; // also true!!!

Instead you need to cast to string or int first:

$truevalue = (bool)(int)$values->truevalue; // true
$falsevalue = (bool)(int)$values->falsevalue; // false

9 years ago

If you need to output valid XML in your response, don't forget to set the Content-Type header to XML in addition to echoing the result of asXML():

<?php
$xml = simplexml_load_file("...");
...
... xml stuff
...

//output xml in your response:
header("Content-Type: text/xml");
echo $xml -> asXML();
?>

9 years ago

From the README file:

SimpleXML is meant to be an easy way to access XML data.

SimpleXML objects follow four basic rules:

1) properties denote element iterators
2) numeric indices denote elements
3) non numeric indices denote attributes
4) string conversion allows to access TEXT data

When iterating properties then the extension always iterates over
all nodes with that element name. Thus method children() must be
called to iterate over subnodes. But also doing the following:
foreach ($obj->node_name as $elem) {
// do something with $elem
}
always results in iteration of "node_name" elements. So no further
check is needed to distinguish the number of nodes of that type.

When an elements TEXT data is being accessed through a property
then the result does not include the TEXT data of subelements.

Known issues
============

Due to engine problems it is currently not possible to access
a subelement by index 0: $object->property[0].

8 years ago

Using something like is_object($xml->module->admin) to check whether there actually is a node called "admin" doesn't seem to work as expected, since SimpleXML always returns an object (in that case an empty one) even if a particular node does not exist.
For me the good old empty() function seems to work just fine in such cases.
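
A small sketch of the behaviour described in this note, with isset() and count() as further ways of testing for a node. The config document and the guest element are illustrative.

```php
<?php
$xml = new SimpleXMLElement("<config><module><admin>root</admin></module></config>");

// Accessing a missing node still yields an object, so is_object() cannot tell the two apart.
var_dump(is_object($xml->module->admin)); // bool(true)
var_dump(is_object($xml->module->guest)); // bool(true), although <guest> does not exist

// isset() and count() do distinguish them.
var_dump(isset($xml->module->admin)); // bool(true)
var_dump(isset($xml->module->guest)); // bool(false)
var_dump(count($xml->module->guest)); // int(0)
```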

8 years ago

A quick tip on XPath queries and default namespaces. It looks like the XML system behind SimpleXML works the same way as, I believe, the XML system used by .NET: when you need to address something in the default namespace, you have to declare the namespace using registerXPathNamespace and then use its prefix to address the element that otherwise lives in the default namespace.

<?php
$string = <<<XML
<document xmlns="http://www.w3.org/2005/Atom">
 <title>Forty What?</title>
 <from>Joe</from>
 <to>Jane</to>
 <body>
  I know that's the answer -- but what's the question?
 </body>
</document>
XML;

$xml = simplexml_load_string ($string );
$xml -> registerXPathNamespace("def" , "http://www.w3.org/2005/Atom");

$nodes = $xml -> xpath("//def:document/def:title" );

?>

9 years ago

While SimpleXMLElement claims to be iterable, it does not seem to implement the standard Iterator interface functions like ::next and ::reset properly. Therefore while foreach() works, functions like next(), current(), or each() don't seem to work as you would expect -- the pointer never seems to move or keeps getting reset.

6 years ago

If the XML document's encoding is other than UTF-8, the encoding declaration must come immediately after version="..." and before standalone="...". This is a requirement of the XML standard.

A declaration written in that order, <?xml version="..." encoding="..." standalone="..."?>, is parsed without problems. With encoding placed after standalone, the same document fails with:
Fatal error: Uncaught exception "Exception" with message "String could not be parsed as XML" in...

Parsing XML essentially means going through an XML document and returning the corresponding data. Although an increasing number of web services return data in JSON format, most still use XML, so it is important to master XML parsing if you want to use the full range of available APIs.

With the SimpleXML extension, which was added back in PHP 5.0, working with XML in PHP is very easy. In this article, I will show you how to do it.

Usage Basics

Let's start with the following example, languages.xml:

<?xml version="1.0"?>
<languages>
  <lang name="C">
    <appeared>1972</appeared>
    <creator>Dennis Ritchie</creator>
  </lang>
  <lang name="PHP">
    <appeared>1995</appeared>
    <creator>Rasmus Lerdorf</creator>
  </lang>
  <lang name="Java">
    <appeared>1995</appeared>
    <creator>James Gosling</creator>
  </lang>
</languages>

This XML document contains a list of programming languages with some information about each language: the year it appeared and the name of its creator.

The first step is to load the XML using either simplexml_load_file() or simplexml_load_string(). As the names imply, the first loads XML from a file and the second loads XML from a string.

$languages = simplexml_load_file ("languages.xml" ) ;

Both functions read the entire DOM tree into memory and return a SimpleXMLElement object. In the example above, the object is stored in the $languages variable. You can use var_dump() or print_r() to get detailed information about the returned object, if you like.

SimpleXMLElement Object
(
    [lang] => Array
        (
            [0] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [name] => C
                        )
                    [appeared] => 1972
                    [creator] => Dennis Ritchie
                )
            [1] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [name] => PHP
                        )
                    [appeared] => 1995
                    [creator] => Rasmus Lerdorf
                )
            [2] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [name] => Java
                        )
                    [appeared] => 1995
                    [creator] => James Gosling
                )
        )
)

This XML contains the root element languages, which contains three lang elements. Each element of the array corresponds to a lang element in the XML document.

You can access the properties of an object using the -> operator. For example, $languages->lang[0] returns a SimpleXMLElement object that matches the first lang element. This object has two properties: appeared and creator.

$languages -> lang [ 0 ] -> appeared ;
$languages -> lang [ 0 ] -> creator ;

Listing the languages and their properties is very easy with a standard loop such as foreach.

foreach ($languages -> lang as $lang ) {
printf (
"%s appeared in %d and was created by %s.\n" ,
$lang["name"] ,
$lang -> appeared ,
$lang -> creator
) ;
}

Notice how I accessed the lang element's attribute name to get the name of the language. This way you can access any attribute of an element represented as a SimpleXMLElement object.

Working with namespaces

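When working with the XML of various web services, you will often encounter element namespaces. Let's change our languages.xml to show an example of using a namespace:

<?xml version="1.0"?>
<languages
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <lang name="C">
    <appeared>1972</appeared>
    <dc:creator>Dennis Ritchie</dc:creator>
  </lang>
  <lang name="PHP">
    <appeared>1995</appeared>
    <dc:creator>Rasmus Lerdorf</dc:creator>
  </lang>
  <lang name="Java">
    <appeared>1995</appeared>
    <dc:creator>James Gosling</dc:creator>
  </lang>
</languages>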

Now the creator element is in the dc namespace, which points to http://purl.org/dc/elements/1.1/. If you try to print the language creators with our previous code, it won't work. To read namespaced elements you need to use one of the following approaches.

The first approach is to use the namespace URI directly in the code when accessing the namespaced elements. The following example shows how this is done:

$dc = $languages -> lang [ 1 ] -> children( "http://purl.org/dc/elements/1.1/") ;
echo $dc -> creator ;

The children() method takes a namespace and returns the child elements that belong to it. It takes two arguments: the first is the XML namespace and the second is an optional flag that defaults to false. If the second argument is true, the first is treated as a namespace prefix. If it is false, the first is treated as a namespace URI.

The second approach is to read the namespace URIs from the document and use them when accessing the namespaced elements. This is actually the better way to access the elements, because you don't have to hard-code the URI.

$namespaces = $languages -> getNamespaces (true ) ;
$dc = $languages -> lang [ 1 ] -> children ($namespaces [ "dc" ] ) ;

echo $dc -> creator ;

The getNamespaces() method returns an array of prefixes and their associated URIs. It takes an optional parameter that defaults to false. If you set it to true, the method returns the namespaces used in the parent and child nodes. Otherwise it finds only the namespaces used in the parent node.
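
The effect of both optional arguments can be sketched on a shortened inline version of the document (one lang element only, so that the snippet is self-contained):

```php
<?php
$languages = new SimpleXMLElement(
    '<languages xmlns:dc="http://purl.org/dc/elements/1.1/">'
    . '<lang name="PHP"><appeared>1995</appeared>'
    . '<dc:creator>Rasmus Lerdorf</dc:creator></lang>'
    . '</languages>'
);

// Non-recursive: only the namespaces used by the root node itself (none here).
print_r($languages->getNamespaces(false));
// Recursive: namespaces used by the nodes below the root as well.
print_r($languages->getNamespaces(true));

$lang = $languages->lang[0];
// children() by URI (second argument false, the default) ...
echo $lang->children("http://purl.org/dc/elements/1.1/")->creator, PHP_EOL;
// ... and by prefix (second argument true).
echo $lang->children("dc", true)->creator, PHP_EOL;
```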

Now you can iterate through the list of languages like this:

$languages = simplexml_load_file ("languages.xml" ) ;
$ns = $languages -> getNamespaces (true ) ;

foreach ($languages -> lang as $lang ) {
$dc = $lang -> children ($ns [ "dc" ] ) ;
printf (
"%s appeared in %d and was created by %s.\n" ,
$lang["name"] ,
$lang -> appeared ,
$dc -> creator
) ;
}

Case Study - Parsing a YouTube Video Channel

Let's look at an example that fetches the RSS feed of a YouTube channel and displays links to all of its videos. To do this, request the following URL:

http://gdata.youtube.com/feeds/api/users/xxx/uploads

The URL returns a list of the latest videos from the given channel in XML format. We will parse the XML and get the following information for each video:

  • Link to the video
  • Thumbnail
  • Title

We start by fetching and loading the XML:

$channel = "ChannelName" ;
$url = "http://gdata.youtube.com/feeds/api/users/" . $channel . "/uploads" ;
$xml = file_get_contents ($url ) ;

$feed = simplexml_load_string ($xml ) ;
$ns = $feed -> getNamespaces (true ) ;

If you look at the XML feed, you can see several entry elements, each of which stores detailed information about one video from the channel. We only need the thumbnail image, the video URL and the title. These three elements are children of the group element, which in turn is a child of entry:

<entry>
  …
  <media:group>
    …
    <media:player url="…" />
    <media:thumbnail url="…" />
    <media:title>Title...</media:title>
    …
  </media:group>
  …
</entry>

We simply go through all the entry elements and extract the necessary information from each of them. Note that player, thumbnail and title are in the media namespace, so we proceed as in the previous example: we read the namespaces from the document and use them when accessing the elements.

foreach ($feed -> entry as $entry ) {
$group = $entry -> children ($ns [ "media" ] ) ;
$group = $group -> group ;
$thumbnail_attrs = $group -> thumbnail [ 1 ] -> attributes () ;
$image = $thumbnail_attrs [ "url" ] ;
$player = $group -> player -> attributes () ;
$link = $player["url"] ;
$title = $group -> title ;
printf ( "<p><a href=\"%s\"><img src=\"%s\" alt=\"%s\"></a></p>\n" ,
$link , $image , $title ) ;
}

Conclusion

Now that you know how to use SimpleXML to parse XML data, you can improve your skills by parsing different XML feeds with different APIs. But it's important to keep in mind that SimpleXML reads the entire DOM into memory, so if you're parsing a large dataset, you may run out of memory. To learn more about SimpleXML read the documentation.



Now we will study working with XML. XML is a format for exchanging data between sites. It is very similar to HTML, only XML allows its own tags and attributes.

Why is XML needed for parsing? Sometimes the site you need to parse has an API that lets you get what you want without much effort. So here is a piece of advice straight away: before parsing a site, check whether it has an API.

What is an API? This is a set of functions with which you can send a request to this site and get the desired response. This answer most often comes in XML format. So let's start studying it.

Working with XML in PHP

Let's say you have XML. It can be in a string, stored in a file, or served on request to a specific URL.

Suppose the XML is stored in a string. In this case you create an object from the string using new SimpleXMLElement:

$str = "<worker><name>Kolya</name><age>25</age><salary>1000</salary></worker>";
$xml = new SimpleXMLElement($str);

The $xml variable now holds an object with the parsed XML. By accessing the properties of this object you can get at the content of the XML tags. Exactly how is covered a little further down.

If the XML is stored in a file or returned from a URL (which is most often the case), use the function simplexml_load_file, which produces the same $xml object:

<worker>
<name>Kolya</name>
<age>25</age>
<salary>1000</salary>
</worker>

$xml = simplexml_load_file($path); // $path is a file path or a URL

Working methods

In the examples below, our XML is stored in a file or at a URL.

Let the following XML be given:

<worker>
<name>Kolya</name>
<age>25</age>
<salary>1000</salary>
</worker>

Let's get the name, age and salary of the employee:

$xml = simplexml_load_file($path); // $path is a file path or a URL
echo $xml->name; //outputs "Kolya"
echo $xml->age; //outputs 25
echo $xml->salary; //outputs 1000

As you can see, the $xml object has properties corresponding to the tags.

You may have noticed that the <worker> tag does not appear anywhere in these accesses. This is because it is the root tag. You can rename it, for example to <root>, and nothing will change:

<root>
<name>Kolya</name>
<age>25</age>
<salary>1000</salary>
</root>

$xml = simplexml_load_file($path); // $path is a file path or a URL
echo $xml->name; //outputs "Kolya"
echo $xml->age; //outputs 25
echo $xml->salary; //outputs 1000

There can be only one root tag in XML, just like the <html> tag in plain HTML.

Let's modify our XML a bit:

<root>
<worker>
<name>Kolya</name>
<age>25</age>
<salary>1000</salary>
</worker>
</root>

In this case, we get a chain of property accesses:

$xml = simplexml_load_file($path); // $path is a file path or a URL
echo $xml->worker->name; //outputs "Kolya"
echo $xml->worker->age; //outputs 25
echo $xml->worker->salary; //outputs 1000

Working with Attributes

Suppose some of the data is stored in attributes:

<root>
<worker name="Kolya" age="25" salary="1000">Number 1</worker>
</root>

$xml = simplexml_load_file($path); // $path is a file path or a URL
echo $xml->worker["name"]; //outputs "Kolya"
echo $xml->worker["age"]; //outputs 25
echo $xml->worker["salary"]; //outputs 1000
echo $xml->worker; //outputs "Number 1"

Tags with hyphens

In XML, tags (and attributes) with a hyphen are allowed. Such tags are accessed like this:

<root>
<worker>
<first-name>Kolya</first-name>
<last-name>Ivanov</last-name>
</worker>
</root>

$xml = simplexml_load_file($path); // $path is a file path or a URL
echo $xml->worker->{"first-name"}; //outputs "Kolya"
echo $xml->worker->{"last-name"}; //outputs "Ivanov"

Loop iteration

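Now suppose we have not one worker but several. In this case we can iterate over our object with a foreach loop:

<root>
<worker><name>Kolya</name><age>25</age><salary>1000</salary></worker>
<worker><name>Vasya</name><age>26</age><salary>2000</salary></worker>
<worker><name>Petya</name><age>27</age><salary>3000</salary></worker>
</root>

$xml = simplexml_load_file($path); // $path is a file path or a URL
foreach ($xml as $worker) {
echo $worker->name; //outputs "Kolya", "Vasya", "Petya"
}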

From object to normal array

If you don't feel comfortable working with an object, you can convert it to a normal PHP array with the following trick:

$xml = simplexml_load_file($path); // $path is a file path or a URL
var_dump(json_decode(json_encode($xml), true));


Parsing based on sitemap.xml

A site often has a sitemap.xml file. This file stores links to all pages of the site so that search engines can index them conveniently (indexing is, in fact, Yandex and Google parsing the site).

We do not need to care much about why this file exists. The main thing is that if it exists, you do not have to crawl the pages of the site with any tricky methods; you can simply use this file.

How to check whether the file exists: suppose we are parsing the site site.ru. Open site.ru/sitemap.xml in the browser. If you see something, the file is there; if not, then alas.

If there is a sitemap, it contains links to all pages of the site in XML format. Feel free to take this XML, parse it, and pick out the links to the pages you need in any way convenient for you (for example, by analysing the URL, as described in the spider method).

As a result you get a list of links for parsing; all that remains is to visit them and parse the content you need.
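
A minimal sketch of this step, assuming a sitemap in the usual urlset/url/loc layout. The sitemap content and the /articles/ filter are made up for the illustration; a real sitemap would be fetched from site.ru/sitemap.xml.

```php
<?php
// A tiny hypothetical sitemap, inlined so that the sketch runs on its own.
$sitemap = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://site.ru/</loc></url>
  <url><loc>http://site.ru/articles/1</loc></url>
  <url><loc>http://site.ru/articles/2</loc></url>
  <url><loc>http://site.ru/contacts</loc></url>
</urlset>
XML;

$xml = simplexml_load_string($sitemap);

// Keep only the links whose path starts with /articles/.
$links = [];
foreach ($xml->url as $url) {
    $loc = (string) $url->loc;
    $path = (string) parse_url($loc, PHP_URL_PATH);
    if (strpos($path, "/articles/") === 0) {
        $links[] = $loc;
    }
}
print_r($links);
```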

You can read more about the structure of sitemap.xml on Wikipedia.








2022 gtavrl.ru.