Curl is better than simplexml_load_file for web screen scraping

In my previous post, I briefly introduced how you can use the SimpleXML object to load the query URLs of various web service APIs, if you choose to build a mashup website with data provided by sites such as YouTube and Amazon. Of course, the payoff comes when you include an affiliate code in your queries, as with Amazon’s famous ECS, and you have done a good job on your page layout, design and SEO to attract many visitors to your site.
Many of these mashup web sites, besides integrating Google Maps and fancy Ajax effects, do a lot of web screen scraping. Basically, that means you get the data, say from a web service, parse it according to a series of common/standard XML tags, and format it with XHTML and CSS according to your design. Nearly every web service returns its data as XML, whether the request was a simple URL query with your affiliate code embedded or an API call. This is why SimpleXML, newly introduced in PHP5, is so useful: it is very good at parsing XML files and strings.
However, simplexml_load_file behaves like fopen: if you enable PHP to load remote files, you can pass it a remote URL and it will generate a new SimpleXML element on the fly. So many web site authors simply use this function to get their XML data. It may work fine if you only have a few URLs to query. But when you have a whole bunch of URLs for several different sites, say Yahoo, YouTube or Amazon, things start getting ugly. (In my experience, some programs generate a bunch of keywords on the fly and then use a “big” loop to send out the web service requests one by one.) You will start to see a series of warnings and errors about the XML data received, either that it is not well formed (start and end tags do not match) or the “Premature end of data in tag” error.
I am surprised how many pages on the web have or had this problem; just do a Google search for the phrase “Premature end of data in tag”. By my own analysis, these warnings and errors do not seem to come from the program logic but (maybe?) from the built-in behavior of “fopen” in PHP. I don’t think big brands like YouTube and Amazon send out badly formed XML responses from time to time. The warnings and errors also seem arbitrary: the same kind of message shows up on different URLs, and sometimes there is no warning at all. If you click the links on the Google SERP, the snippet often shows an error on the page, yet when you open the page itself it is fine. That means the page generated the warning or error message at the moment Googlebot visited it.
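The symptom can be reproduced locally, without any network access, by parsing an XML string that stops too early. The sketch below only illustrates what the parser reports for incomplete data; it does not prove where the truncation comes from in the remote case. The sample string is made up, and libxml_use_internal_errors() is used so that the messages are collected instead of printed as PHP warnings.

```php
<?php
// Made-up sample: an XML response that was cut off before the closing tags.
$truncated = '<?xml version="1.0"?><items><item><title>Example</title>';

// Collect libxml errors in memory instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Returns false because the document is not well formed.
$xml = simplexml_load_string($truncated);

$messages = [];
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }
    libxml_clear_errors();
}

// Print whatever the parser complained about.
foreach ($messages as $message) {
    echo $message, "\n";
}
```

Checking the return value and the collected messages like this is what lets a page decide to hide a broken feed instead of showing raw warnings to visitors.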
One way to work around this is not to use the data directly. After you retrieve and extract the data from a web service query, you save it in your local database; then you write other routines to process the data from your database. You set up a cron job to do the web service queries in the background, so no warning or error messages appear on the front-end web pages. This has another advantage: if the remote web service is down, you still have the data in your local database.
But this does not actually fix the problem. You don’t solve it, you just move it to a different location: there are still error and warning messages in your server’s log files. There is actually a real solution.
It is actually very simple: use the php_curl extension to perform the query. Curl is a good tool, and many hosting services include it. Start by initializing curl with the target URL, $url, that you want to query:
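A minimal sketch of that work-around might look like the following. To keep it self-contained, a directory of files stands in for the local database, which is my substitution and not part of the setup described above. The function names store_feed() and load_feed() are hypothetical. The cron job would call store_feed() with whatever it fetched, and the front-end page would only ever call load_feed().

```php
<?php
// Hypothetical helpers; a file cache stands in for the local database.

// Called from the cron job: keep a fetched XML string only if it parses.
function store_feed(string $cacheDir, string $key, string $xmlString): bool
{
    libxml_use_internal_errors(true);
    $ok = simplexml_load_string($xmlString) !== false;
    libxml_clear_errors();
    if (!$ok) {
        return false; // leave the previous good copy untouched
    }
    $path = $cacheDir . '/' . md5($key) . '.xml';
    return file_put_contents($path, $xmlString, LOCK_EX) !== false;
}

// Called from the front-end page: read the last good copy only.
function load_feed(string $cacheDir, string $key): ?SimpleXMLElement
{
    $path = $cacheDir . '/' . md5($key) . '.xml';
    if (!is_file($path)) {
        return null;
    }
    libxml_use_internal_errors(true);
    $xml = simplexml_load_file($path);
    libxml_clear_errors();
    return $xml === false ? null : $xml;
}
```

Because a broken response is rejected before it is stored, the front-end page keeps serving the last good copy even when a query fails.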

$ch = curl_init() or die( 'curl_init failed' );
curl_setopt( $ch, CURLOPT_URL, $url );

then set the CURLOPT_RETURNTRANSFER option to true (this is important, because otherwise curl_exec prints the response instead of returning it):

curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );

and then execute the query and save the returned data into a local XML string:

$data = curl_exec( $ch );

after closing the curl handle, create a new SimpleXML element from the local XML string:

curl_close( $ch );
$xml = new SimpleXMLElement($data);

This way, the warning and error messages go away.
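The steps above can be put together into one function with some error checking added. This is only a sketch under a few assumptions: fetch_xml() is a hypothetical name, the timeout values are arbitrary, and it parses with simplexml_load_string() instead of new SimpleXMLElement() so that a bad response yields null rather than an exception.

```php
<?php
// fetch_xml() is a hypothetical helper; the timeouts are arbitrary choices.
function fetch_xml(string $url): ?SimpleXMLElement
{
    $ch = curl_init();
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body, do not print it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

    $data = curl_exec($ch);
    if ($data === false) {
        // Transfer failed: log the reason and give up on this URL.
        error_log('curl error: ' . curl_error($ch));
        return null;
    }
    // On PHP 8 the handle is released when $ch goes out of scope;
    // older versions can call curl_close($ch) here.

    if ($data === '') {
        return null;
    }

    // Parse quietly: collect libxml errors instead of emitting warnings.
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($data);
    libxml_clear_errors();

    return $xml === false ? null : $xml;
}
```

A caller can then write $xml = fetch_xml($url); and skip that URL when the result is null, so a single bad response in a big loop does not put a warning on the page.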

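P.S. Don’t forget: if you want to merge the CDATA sections in the XML string into plain text nodes, you can still pass LIBXML_NOCDATA, either as new SimpleXMLElement($data, LIBXML_NOCDATA); or as simplexml_load_string($data, 'SimpleXMLElement', LIBXML_NOCDATA); (the class name argument belongs to the function form only).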
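A small illustration of what the flag changes, using a made-up XML string. Casting an element to a string gives the text either way; the difference shows up when the document is serialized again, where the CDATA section is kept without the flag and turned into ordinary escaped text with it.

```php
<?php
// Made-up sample data containing a CDATA section.
$data = '<item><title><![CDATA[Tom & Jerry]]></title></item>';

// Default parsing keeps the CDATA section as its own node type.
$plain = new SimpleXMLElement($data);

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$merged = new SimpleXMLElement($data, LIBXML_NOCDATA);

// The text content is the same in both cases.
$plainTitle  = (string) $plain->title;
$mergedTitle = (string) $merged->title;

// The serialized form differs: compare the two outputs.
echo $plain->asXML(), "\n";
echo $merged->asXML(), "\n";
```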

P.P.S. Better yet, you can also use another neat PHP extension, tidy (php_tidy), to clean up and format your XML string nicely according to the configuration you give to tidy.

1 Comment
  1. lakshan says:

    thank you very much for this article..

