02 January 2013

PHP Crawler Programming 4 the WIN

So I just wrote about 2012 and I mentioned a thing about programming. I have a programming task that I find interesting and would like to share, but I have to take a shower and go to bed, so I will be quick. I have been tasked with making an e-store, and I am going with OpenCart. The challenge here is that the supplier up the chain doesn't really know what he is doing. He ordered his store from a contractor and now he has to pay an arm and a leg for the extension needed to export what I need, so I'm not going to get it. What I need is an XML or CSV file, or access to an SQL database: something that has each product's name, description, picture and price. Everything you need to sell something. Then I need a way to automatically import that data into the store.

I am writing everything in PHP this year, so I went with PHP for this too. But I sold PHP and myself short and made the worst procedural script mess possible.

Enter the crawler. Many websites still use GET variables to keep persistent data from page to page. In most cases that means you can copy the URL from the address bar and walk through every item on the site just by incrementing the id, in my case the product_id. For each URL you use curl to download the page and write it to your database.

Once you have the pages you need to interpret them, and I did that with the help of the DOM. The document object model is usually more interesting to JavaScript than to PHP, but the built-in DOMDocument class can load a downloaded page into an easy-to-read, easy-to-manipulate indexed object from which you can pull out the information you want. The data I extracted I wrote to the same table where I saved the pages. A nice trick when you are not sure what you want to send to an SQL database is to serialize and encode the information: my script walks the downloaded pages via the DOM, builds an array with the name, price, description and image URLs, then serializes and encodes it into a cryptic string that fits in a simple text field in the database table.

Once you have these arrays you go to step 3, injecting the data into the OpenCart database. If you look at the structure of the database you can quickly figure out what to inject where. You use curl again to download the images and place them in the image folder; make sure you add the image names to the array that holds the product information so you can set them later. Now, categories in OpenCart sort of suck. You have to search the database for the category's description, get the category_id from there, and then check the rest of the fields. Categories are layered: there is a parent_id that shows what sits on top. This took me a lot of time to code; rough sketches of each step follow, and then I will tell you why it took so long.
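
Here is a minimal sketch of the crawling step. The URL pattern, the id range and the table layout are all made up for illustration; substitute whatever the supplier's site and your own database actually use.

```php
<?php
// Sketch of the crawl: walk the product ids, fetch each page with curl,
// save the raw HTML. URL, id range and schema are assumptions.
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
$insert = $db->prepare('INSERT INTO pages (product_id, html) VALUES (?, ?)');

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

for ($id = 1; $id <= 5000; $id++) {
    curl_setopt($ch, CURLOPT_URL, 'http://supplier.example/product.php?product_id=' . $id);
    $html = curl_exec($ch);
    // Skip ids that don't exist or fail to load.
    if ($html === false || curl_getinfo($ch, CURLINFO_HTTP_CODE) != 200) {
        continue;
    }
    $insert->execute(array($id, $html));
}
curl_close($ch);
```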
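
The parsing step looks roughly like this. The element names and classes (product-name, price, gallery) are invented; you would read them off the actual downloaded pages. DOMXPath rides along with DOMDocument and saves a lot of manual tree walking.

```php
<?php
// Sketch of the DOM step: load saved HTML, pull out the fields.
// The XPath expressions are assumptions about the page markup.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$product = array(
    'name'        => trim($xpath->evaluate('string(//h1[@class="product-name"])')),
    'price'       => trim($xpath->evaluate('string(//span[@class="price"])')),
    'description' => trim($xpath->evaluate('string(//div[@id="description"])')),
    'images'      => array(),
);

// Collect every image URL in the product gallery.
foreach ($xpath->query('//div[@id="gallery"]//img/@src') as $src) {
    $product['images'][] = $src->nodeValue;
}
```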
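
The serialize-and-encode trick is two function calls. base64 is my choice of encoding here; the point is just that the serialized array becomes a plain string that fits safely into a simple text field.

```php
<?php
// Pack the product array into a single text column, unpack it later.
$blob = base64_encode(serialize($product));
$db->prepare('UPDATE pages SET data = ? WHERE product_id = ?')
   ->execute(array($blob, $id));

// Reading it back is the same trip in reverse.
$product = unserialize(base64_decode($blob));
```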
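
Fetching the images is another pass with curl, this time writing straight to files. The target path is an assumption; the important bit is swapping the remote URLs for the local file names in the product array so they can go into the database afterwards.

```php
<?php
// Sketch: download each product image into OpenCart's image folder.
foreach ($product['images'] as $i => $url) {
    $name = 'data/' . $id . '_' . $i . '.jpg';
    $fp = fopen('/var/www/opencart/image/' . $name, 'w');

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp); // stream the body into the file
    curl_exec($ch);
    curl_close($ch);
    fclose($fp);

    $product['images'][$i] = $name; // remember the local name for later
}
```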
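
And the injection itself. This is a trimmed version: the table and column names match what I remember of OpenCart's schema (with the common oc_ prefix), but a real import needs more fields than this, so check everything against your own installation. Note how finding a category means searching its description table by name and then reading parent_id off oc_category.

```php
<?php
// Sketch of the injection step. Trimmed columns; verify these names
// against your own OpenCart database before trusting them.
$oc = new PDO('mysql:host=localhost;dbname=opencart', 'user', 'pass');

// Find a category by its human-readable name, then see where it sits
// in the tree via parent_id (0 usually means top level).
$find = $oc->prepare(
    'SELECT c.category_id, c.parent_id
       FROM oc_category_description cd
       JOIN oc_category c ON c.category_id = cd.category_id
      WHERE cd.name = ?'
);
$find->execute(array('Widgets'));
$category = $find->fetch(PDO::FETCH_ASSOC);

// The product row itself, then its language-dependent description.
$oc->prepare('INSERT INTO oc_product (model, price, image, status, date_added)
              VALUES (?, ?, ?, 1, NOW())')
   ->execute(array($product['name'], $product['price'], $product['images'][0]));
$productId = $oc->lastInsertId();

$oc->prepare('INSERT INTO oc_product_description (product_id, language_id, name, description)
              VALUES (?, 1, ?, ?)')
   ->execute(array($productId, $product['name'], $product['description']));

// Finally, tie the product to its category.
$oc->prepare('INSERT INTO oc_product_to_category (product_id, category_id)
              VALUES (?, ?)')
   ->execute(array($productId, $category['category_id']));
```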

When I started learning PHP there wasn't much object-oriented philosophy in it, and there is a simple QBasic feeling to a project with 4 files that go 1 2 3 4. I really thought this task was small enough for simple linear procedural programming. There was once an argument that simple programs don't need objects, and that not using objects is for lesser programmers. I decided I would be a great programmer and have been making objects left and right. For some reason, not this time. So I hit the wall when I needed a procedure to turn in on itself and call itself from itself (there is a sketch of what I mean below). You may ask: why an object, why not a function? But there is a scope and loop issue here, and even once you resolve it there is still a simple aesthetics issue. Modular programming lets you build self-contained logic blocks, a few simple lines of code packaged and put away. That lets you hide the things you don't need and makes the code easier to work with and to manage. I now have 4 files, two of them about a hundred lines long each, and when the script hits an error I wish I would just die.

This has made me realize that no task is small enough to justify simplifying your approach. Take all the tricks from the book and throw them at the task. Instead of assuming it will be simple, diligently build your script from the ground up by the best practices you know. Maybe you will waste some time on small things, but when a task surprises you, you will be ready. A class I always use is my SQL class: it extends the built-in PDO class and carries my routines for writing and reading (also sketched below). For a different script I wrote a class that turns curl into a browser; I could have used it for this task, since it has methods to get pages and to download files. Instead I made this blob monster of code that can only step through pages, plus another one for the images. In retrospect I rewrote code I already had, and not as well as the first time. I hope my suffering over the past 3 weeks is helpful to somebody. You know how to code; don't cut corners.
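
This is the shape of the recursion I was fighting, redone as a function so that each call gets its own scope instead of trampling the loop variables. It climbs the category tree through parent_id; the table names repeat my OpenCart assumptions from above.

```php
<?php
// Sketch: resolve the full path of a category by recursing up parent_id.
function categoryPath(PDO $db, $categoryId, array $path = array())
{
    $stmt = $db->prepare(
        'SELECT c.parent_id, cd.name
           FROM oc_category c
           JOIN oc_category_description cd ON cd.category_id = c.category_id
          WHERE c.category_id = ?'
    );
    $stmt->execute(array($categoryId));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        return $path; // unknown id, stop climbing
    }

    array_unshift($path, $row['name']);

    if ((int)$row['parent_id'] === 0) {
        return $path; // reached the top of the tree
    }

    return categoryPath($db, $row['parent_id'], $path); // calls itself
}

// categoryPath($oc, 42) might return array('Electronics', 'Cameras').
```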
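
And for completeness, a rough sketch of what I mean by my SQL class. The method names are my own habit, not anything PDO ships with.

```php
<?php
// Sketch: a thin wrapper over the built-in PDO class with the two
// routines I always end up needing.
class Sql extends PDO
{
    // Run a SELECT and hand back all rows as associative arrays.
    public function read($query, array $params = array())
    {
        $stmt = $this->prepare($query);
        $stmt->execute($params);
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }

    // Run an INSERT/UPDATE/DELETE and report how many rows changed.
    public function write($query, array $params = array())
    {
        $stmt = $this->prepare($query);
        $stmt->execute($params);
        return $stmt->rowCount();
    }
}

// $db = new Sql('mysql:host=localhost;dbname=crawler', 'user', 'pass');
// $rows = $db->read('SELECT product_id FROM pages WHERE data IS NULL');
```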