PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 
Loading
It is currently Thu Dec 18, 2014 3:54 pm

All times are UTC - 5 hours




Post new topic Reply to topic  [ 10 posts ] 
Author Message
PostPosted: Tue Jun 28, 2011 11:15 am 
Offline
Forum Newbie

Joined: Tue Jun 28, 2011 10:51 am
Posts: 5
Hello everyone, I'll start by saying I'm not sure if this is the most appropriate place to put this question, so moderator please move as you see fit :)

I have written an XML parser for a fairly small file (115kb). I initially wrote it using an OO approach, and was concerned by how long the page took to load (~4-5 seconds). I thought maybe the post-processing (I read it into an array, then process through it, adding output, formatting, etc) was slowing it down, so I cut that all out, and it still took quite a long time. I decided to whip up a second version that removed the OO style and switch to a functional approach, and it cut the time down to ~1 second! A huge improvement that I was not expecting. I did clear the cache and tested the loading time to make sure I was getting a reliable answer.

Does this make sense to others? I would like to stick with a more OO approach as it is easier to structure and expand should I need to, but not if it is going to cost me this extra time. Is the accessing of the object properties that much slower to do than simply accessing a variable or am I doing something in the OO code that is causing extra instances of the class to be created?

The OO approach
Syntax: [ Download ] [ Hide ]
   class PubsParser
   {
      private $xmlDebug;
      private $xmlDebugOutput;
      private $p;

      private $inEntry;
      private $publication;
      private $data;
      private $curAuthor;
      private $inBook;
      private $pubLevel;
      private $curPub;
      private $curPubCount;
      private $publications;
      const NO_PUB_TYPE = 'No publication type declared';

      public function __construct($xmlDebug=false, $xmlDebugOutput=false)
      {
         $this->inEntry = false;
         $this->data = "";
         $this->pubLevel = 0;
         $this->curPubCount = 0;
         $this->publications = array();
         $this->xmlDebug = $xmlDebug;
         $this->xmlDebugOutput = $xmlDebugOutput;
      }

      public function debugPrint($s)
      {
         if($this->xmlDebugOutput)
         {
            echo $s;
         }
      }

      public function parseFile($file="pubs.xml")
      {
         $this->p = xml_parser_create();
         xml_parser_set_option($this->p, XML_OPTION_SKIP_WHITE, 1);
         xml_parser_set_option($this->p, XML_OPTION_CASE_FOLDING, 0);

         if(!$this->xmlDebug)
         {
            xml_set_object($this->p, $this);
            xml_set_element_handler($this->p, 'startElement', 'endElement');
            xml_set_character_data_handler($this->p, 'contents');
            $fp = fopen($file, "r") or die("Could not open file");
            while($data = fread($fp, filesize($file)))
            {
               if(!xml_parse($this->p, $data, feof($fp)))
               {
                  die(sprintf("XML error: %s at line %d",
                                      xml_error_string(xml_get_error_code($this->p)),
                             xml_get_current_line_number($this->p)));
               }
            }
         }
         else
         {
            xml_parse_into_struct($this->p, implode("", file($file)), $val, $inx);
            print_r($val);
            print_r($inx);
         }
      }

      public function getPubs()
      {
         return $this->publications;
      }

      public function startElement($p, $element, $attrib)
      {
         $this->data;
         $this->debugPrint("starting element: $element with data: *$this->data*".PHP_EOL);

         switch($element)
         {
            case 'publication':
               $this->inEntry = true;
               $type = array_key_exists('type', $attrib) ? $attrib['type'] : die(PubsParser::NO_PUB_TYPE." on line: ".xml_get_current_line_number($p));
               $this->curPubCount++;
               switch($type)
               {
                  case 'article':
                     $this->curPub = new Article();
                     break;
                  case 'bookcontrib':
                     $this->curPub = new BookContrib();
                     break;
                  case 'patent':
                     $this->curPub = new Patent();
                     break;
                  case 'patentpub':
                     $this->curPub = new PatentPub();
                     break;
               }
               $this->publication[] = $this->curPub;
               break;
            case 'publist':
               break;
            default:
               if($this->inEntry)
               {
                  switch($element)
                  {
                     case 'author':
                        $this->curAuthor = array();
                        break;
                     case 'book':
                        $this->inBook = true;
                        $this->curPub->setBook(new Book());
                        $this->publication[] = $this->curPub->getBook();
                        $this->curPub = $this->curPub->getBook();
                        break;
                  }
               }
               else
               {
                  echo $element;
                  echo 'not in entry';
               }
               break;
         }
      }

      public function endElement($p, $element)
      {
         $this->debugPrint("ending element: $element with data of: $data".PHP_EOL);
         if($element == 'publication')
         {
            $this->curPub = array_pop($this->publication);
            $this->publications[] = $this->curPub;
         }

         $curPub = $this->curPub;
         $data = $this->data;
         $data = trim($data);

         switch($element)
         {
            case 'title':
               $curPub->setTitle($data);
               break;
            case 'author':
               $curPub->addAuthor($this->curAuthor);
               break;
            case 'first':
            case 'middle':
            case 'last':
            case 'suffix':
               $this->curAuthor[$element] = $data;
               break;
            case 'journal':
               $curPub->setJournal($data);
               break;
            case 'year':
               $curPub->setYear($data);
               break;
            case 'volume':
               $curPub->setVolume($data);
               break;
            case 'spage':
            case 'epage':
               $curPub->addPage($data);
               break;
            case 'note':
               $curPub->setNote($data);
               break;
            case 'book':
               array_pop($this->publication);
               $curPub = $this->publication[count($this->publication)-1];
               break;
            case 'publisher':
               $curPub->setPublisher($data);
               break;
            case 'location':
               $curPub->setLocation($data);
               break;
            case 'series':
               $curPub->setSeries($data);
               break;
            case 'country':
               $curPub->setCountry($data);
               break;
            case 'patentnum':
               $curPub->setPatentNum($data);
               break;
            case 'date':
               $curPub->setDate($data);
               break;
            case 'pagecount':
               $curPub->setPageCount($data);
               break;
            case 'link':
               $curPub->setLink($data);
               break;
            case 'patenttype':
               $curPub->setPatentType($data);
               break;
            case 'volsupplement':
               $curPub->setVolSupplement($data);
               break;
         }

         $this->data = "";
      }

      public function contents($p, $content)
      {
         $this->debugPrint("writing contents: *$content*; to data: *$this->data*".PHP_EOL);
         $this->data .= $content;
         $this->debugPrint("data now contains: *$this->data*".PHP_EOL);
      }
   }

The functional approach
Syntax: [ Download ] [ Hide ]
   $xmlDebugOutput = false;
   $xmlDebug = false;
   $data = "";
   $inEntry = false;
   $curPubCount = 0;
   $curPub;
   $publication;
   $curAuthor;
   $inBook;
   $publications = array();
   DEFINE('NO_PUB_TYPE','No publication type declared');

   function parseFile()
   {
      global $xmlDebug;
      $p = xml_parser_create();
      $file = "pubs.xml";

      xml_parser_set_option($p, XML_OPTION_SKIP_WHITE, 1);
      xml_parser_set_option($p, XML_OPTION_CASE_FOLDING, 0);

      if(!$xmlDebug)
      {
         xml_set_element_handler($p, 'startElement', 'endElement');
         xml_set_character_data_handler($p, 'contents');
         $fp = fopen($file, "r") or die("Could not open file");
         while($data = fread($fp, filesize($file)))
         {
            if(!xml_parse($p, $data, feof($fp)))
            {
               die(sprintf("XML error: %s at line %d",
                                   xml_error_string(xml_get_error_code($p)),
                          xml_get_current_line_number($p)));
            }
         }
      }
      else
      {
         xml_parse_into_struct($p, implode("", file($file)), $val, $inx);
         print_r($val);
         print_r($inx);
      }
   }

   function startElement($p, $element, $attrib)
   {
      global $data, $inEntry, $curPubCount, $curPub, $publication, $curAuthor, $inBook;
      $data = "";
      debugPrint("starting element: $element with data: *$data*".PHP_EOL);

      switch($element)
      {
         case 'publication':
            $inEntry = true;
            $type = array_key_exists('type', $attrib) ? $attrib['type'] : die(NO_PUB_TYPE." on line: ".xml_get_current_line_number($p));
            $curPubCount++;
            switch($type)
            {
               case 'article':
                  $curPub = new Article();
                  break;
               case 'bookcontrib':
                  $curPub = new BookContrib();
                  break;
               case 'patent':
                  $curPub = new Patent();
                  break;
               case 'patentpub':
                  $curPub = new PatentPub();
                  break;
            }
            $publication[] = $curPub;
            break;
         case 'publist':
            break;
         default:
            if($inEntry)
            {
               switch($element)
               {
                  case 'author':
                     $curAuthor = array();
                     break;
                  case 'book':
                     $inBook = true;
                     $curPub->setBook(new Book());
                     $publication[] = $curPub->getBook();
                     $curPub = $curPub->getBook();
                     break;
               }
            }
            else
            {
               echo $element;
               echo 'not in entry';
            }
            break;
      }
   }

   function endElement($p, $element)
   {
      global $curPub, $publications, $data, $curAuthor, $publication;
      debugPrint("ending element: $element with data of: $data".PHP_EOL);
      if($element == 'publication')
      {
         $curPub = array_pop($publication);
         $publications[] = $curPub;
      }

      $curPub = $curPub;
      $data = $data;
      $data = trim($data);

      switch($element)
      {
         case 'title':
            $curPub->setTitle($data);
            break;
         case 'author':
            $curPub->addAuthor($curAuthor);
            break;
         case 'first':
         case 'middle':
         case 'last':
         case 'suffix':
            $curAuthor[$element] = $data;
            break;
         case 'journal':
            $curPub->setJournal($data);
            break;
         case 'year':
            $curPub->setYear($data);
            break;
         case 'volume':
            $curPub->setVolume($data);
            break;
         case 'spage':
         case 'epage':
            $curPub->addPage($data);
            break;
         case 'note':
            $curPub->setNote($data);
            break;
         case 'book':
            array_pop($publication);
            $curPub = $publication[count($publication)-1];
            break;
         case 'publisher':
            $curPub->setPublisher($data);
            break;
         case 'location':
            $curPub->setLocation($data);
            break;
         case 'series':
            $curPub->setSeries($data);
            break;
         case 'country':
            $curPub->setCountry($data);
            break;
         case 'patentnum':
            $curPub->setPatentNum($data);
            break;
         case 'date':
            $curPub->setDate($data);
            break;
         case 'pagecount':
            $curPub->setPageCount($data);
            break;
         case 'link':
            $curPub->setLink($data);
            break;
         case 'patenttype':
            $curPub->setPatentType($data);
            break;
         case 'volsupplement':
            $curPub->setVolSupplement($data);
            break;
      }
      $data = "";
   }

   function contents($p, $content)
   {
      global $data;
      debugPrint("writing contents: *$content*; to data: *$data*".PHP_EOL);
      $data .= $content;
      debugPrint("data now contains: *$data*".PHP_EOL);
   }

   function debugPrint($s)
   {
      global $xmlDebugOutput;
      if($xmlDebugOutput)
      {
         echo $s;
      }
   }
 

Code that produces the output
Syntax: [ Download ] [ Hide ]
  if(isset($_GET['func']))
   {
      require_once("exPubs.php");
      parseFile();
      $pubs = $publications;
   }
   else
   {
      require_once("PubsParser.php");
      $p = new PubsParser();
      $p->parseFile();
      $pubs = $p->getPubs();
   }

   foreach($pubs as $pub)
   {
      $pub->printPublication();
   }


There is quite a bit more formatting that takes place in that foreach loop, but even at this stripped down version, the functional code runs in 530ms and the OO code runs in about 3.5s.


Top
 Profile  
 
PostPosted: Tue Jun 28, 2011 4:10 pm 
Offline
Site Administrator
User avatar

Joined: Wed Aug 25, 2004 7:54 pm
Posts: 12721
Location: New York, NY, US
Just glancing at the code the one difference I see is that you copy a lot of data from a property to a local variable in these methods (e.g. $curPub = $this->curPub;) whereas in the functional code you do the strange $curPub = $curPub; which PHP probably optimizes to do nothing. Maybe try just using the properties directly.

_________________
(#10850)


Top
 Profile  
 
PostPosted: Tue Jun 28, 2011 4:30 pm 
Offline
Forum Newbie

Joined: Tue Jun 28, 2011 10:51 am
Posts: 5
Ah, the $curPub = $curPub is an artifact from me porting the OO code to functional. Zealous copy/paste without careful checking afterwards. I will remove the copy to a local variable. I was trying to avoid excessive access thinking the local access will be faster than the property access, but now realize that especially in the endElement function, the call really only occurs once per the switch case.

So I removed the call and switched them to direct property access. It is still a great deal slower than the functional code. I mean, this isn't a hardship, but certainly a result I would not have predicted, and something that is definitely intriguing to further understand.


Top
 Profile  
 
PostPosted: Wed Jun 29, 2011 7:47 am 
Offline
Forum Contributor
User avatar

Joined: Sun Jan 14, 2007 11:44 am
Posts: 104
Location: Cracow, Poland
Object method calls are a bit slower than function calls. May I know why you are using this old XML extension instead of XMLReader, which uses Libxml directly, instead of a compatibility layer?

http://docs.php.net/xmlreader


Top
 Profile  
 
PostPosted: Wed Jun 29, 2011 9:13 am 
Offline
Forum Newbie

Joined: Tue Jun 28, 2011 10:51 am
Posts: 5
Yes, I did try to use that at first because it would have been much easier, however, the PHP my server (which is not in my control) is running does not appear to have included XMLReader into it's install configuration, and I am not aware of a way to install it separately. I am trying to push the department to upgrade it's PHP installation we are sitting at 5.0.4 right now :(

I did just notice that SimpleXML is installed, so I may look into that, but at this point, not sure if it is worth reinventing things at this point.

I am always a fan of OO code, as I think it serves well to structure and logically explain the program, but I suppose in this case the speed of the functional code trumps that desire, would you agree?


Top
 Profile  
 
PostPosted: Wed Jun 29, 2011 12:01 pm 
Offline
Briney Mod
User avatar

Joined: Mon Jan 19, 2004 7:11 pm
Posts: 6434
Location: 53.01N x 112.48W
Optimized procedural/functional code is almost always going to be faster than OOP code. The purpose of OOP is not to make code run faster, but to make it easier to write & debug.

_________________
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.


Top
 Profile  
 
PostPosted: Wed Jun 29, 2011 12:14 pm 
Offline
Forum Newbie

Joined: Tue Jun 28, 2011 10:51 am
Posts: 5
Thanks for the input. That makes perfect sense, I guess I was just surprised how much of a difference in speed was noticed for the procedural code in this case.

I guess I'll stick with the functional code and work to keep it well-maintained and keep it from growing into something ugly and unmanageable :)

Thanks!


Top
 Profile  
 
PostPosted: Wed Jun 29, 2011 5:52 pm 
Offline
Site Administrator
User avatar

Joined: Wed Aug 25, 2004 7:54 pm
Posts: 12721
Location: New York, NY, US
macman104 wrote:
So I removed the call and switched them to direct property access. It is still a great deal slower than the functional code. I mean, this isn't a hardship, but certainly a result I would not have predicted, and something that is definitely intriguing to further understand.
It seems strange. If the difference between the two files is really just global variables vs properties and you just changed between global declarations and using $this->, then there really should not be much difference. The OO code should be a little slower, but not many times slower.

Also strange is that even in the functional code you appear to create a bunch of objects.

Do you know which part of the code is the part that slows down in the OO code? Have you put timings in each method/function? It would be interesting to know.

_________________
(#10850)


Top
 Profile  
 
PostPosted: Thu Jun 30, 2011 12:03 am 
Offline
Forum Newbie

Joined: Tue Jun 28, 2011 10:51 am
Posts: 5
Christopher wrote:
macman104 wrote:
So I removed the call and switched them to direct property access. It is still a great deal slower than the functional code. I mean, this isn't a hardship, but certainly a result I would not have predicted, and something that is definitely intriguing to further understand.
It seems strange. If the difference between the two files is really just global variables vs properties and you just changed between global declarations and using $this->, then there really should not be much difference. The OO code should be a little slower, but not many times slower.

Also strange is that even in the functional code you appear to create a bunch of objects.

Do you know which part of the code is the part that slows down in the OO code? Have you put timings in each method/function? It would be interesting to know.


I have not put timings into the method/functions, but I do agree that it seems unusually slow, and would be interesting to explore! I think something to consider too is that in the OO code the startElement endElement contents handler functions are now object functions and get called many hundreds of times throughout the xml execution, maybe accounting for the difference?

I would certainly be happy to have someone else run these to confirm that there is not something strange going on with my files. Could you instruct how to add timings to the functions? I have somewhat moved on as I was presenting the changes to my advisor to the website, but certainly something I would love to pursue with some guidance in understanding the cause of this better.


Top
 Profile  
 
PostPosted: Fri Jul 08, 2011 12:22 am 
Offline
Site Administrator
User avatar

Joined: Wed Aug 25, 2004 7:54 pm
Posts: 12721
Location: New York, NY, US
I thought about trying to run your code to test it, but it uses a number of other objects, such as Article, that are not included.

_________________
(#10850)


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 10 posts ] 

All times are UTC - 5 hours


Who is online

Users browsing this forum: No registered users and 1 guest


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Jump to:  
Powered by phpBB® Forum Software © phpBB Group