02 January 2013

PHP Crawler Programming 4 the WIN

So I just wrote about 2012 and I mentioned a thing about programming. I have a programming task that I find interesting and would like to share, but I have to take a shower and go to bed, so I will be quick. I have been tasked with making an e-store, and I am going to go with OpenCart. The challenge here is that the supplier up the chain doesn't really know what he is doing. He ordered his store from a contractor, and now he has to pay an arm and a leg for the extension needed to export what I need, so I'm not going to get it. What I need is an XML or CSV file, or access to an SQL database: something that has each product's name, description, picture and price, everything you need to sell something. Then I need a way to automatically import that data into the store.
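Just to show what I mean by "everything you need to sell something", a feed as simple as this would do. The columns and values here are made up by me for illustration, not anything the supplier actually offers:

    product_id,name,price,description,image_url
    101,"Starter kit",39.90,"Basic starter kit with USB charger",http://supplier.example/images/101.jpg
    102,"Spare atomizer",9.50,"Fits the starter kit",http://supplier.example/images/102.jpg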

I am writing everything in PHP this year, so I went with PHP for this too. But I sold PHP and myself short and made the worst procedural script mess possible.

Enter the crawler. Many websites still use GET variables to keep persistent data from page to page. This means that in most cases you can just copy the URL from the address bar and go through all the items on the site by incrementing the id, in my case the product_id. Then for each URL you use curl to download the page and write it to your database.

Once you have the data you need to interpret it. I did that with the help of the DOM. The document object model is usually more interesting to JavaScript than to PHP, but the built-in DOMDocument class can load the downloaded pages into an easy to read and manipulate indexed object from which you can pull the information you want. The data I extracted I wrote to the same table where I saved the pages. A nice trick to use when you are not sure exactly what you will want to send to SQL later is to serialize and encode the information. The script that goes through the downloaded pages via DOM takes the data, turns it into an array with the name, price, description and the URLs of the images, then serializes and encodes it into a cryptic string that can be written into a simple text field in the database table. (A rough sketch of these steps is below.)

Once you have these arrays you go to step 3: injecting the data into the OpenCart database. If you look at the structure of the database you can quickly figure out what to inject where. You use curl again to download the images and place them in the image folder; make sure you add the file names of the images to the array that holds the product information so that you can set them later. Now, categories in OpenCart sort of suck. You need to search the database for the category description, from there you get the category_id, and then you check the rest of the stuff. Categories are layered: there is a parent_id that tells you what sits above what. This took me a lot of time to code, and here is why.
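To make that concrete, here is a minimal sketch of the crawl-and-parse steps. It is not my actual script: the supplier URL, the pages table and its columns, and the XPath class names are all placeholders you would swap for the real site.

    <?php
    // Rough sketch: crawl product pages by incrementing product_id,
    // parse them with DOMDocument, and store a serialized array.
    // The table pages(product_id, html, data) is hypothetical.

    $db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

    // Step 1: download pages with curl by stepping through product_id
    for ($id = 1; $id <= 4000; $id++) {
        $ch = curl_init('http://supplier.example/index.php?route=product&product_id=' . $id);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);
        if ($html === false || $html === '') {
            continue; // missing or empty product, skip it
        }
        $stmt = $db->prepare('INSERT INTO pages (product_id, html) VALUES (?, ?)');
        $stmt->execute(array($id, $html));
    }

    // Step 2: parse the saved pages with DOMDocument and serialize the result
    $rows = $db->query('SELECT product_id, html FROM pages')->fetchAll(PDO::FETCH_ASSOC);
    foreach ($rows as $row) {
        $doc = new DOMDocument();
        @$doc->loadHTML($row['html']);   // suppress warnings from sloppy markup
        $xpath = new DOMXPath($doc);

        // The class names below are made up; inspect the real pages for yours.
        $product = array(
            'name'        => trim($xpath->evaluate('string(//h1[@class="product-name"])')),
            'price'       => trim($xpath->evaluate('string(//span[@class="price"])')),
            'description' => trim($xpath->evaluate('string(//div[@class="description"])')),
            'images'      => array(),
        );
        foreach ($xpath->query('//div[@class="gallery"]//img') as $img) {
            $product['images'][] = $img->getAttribute('src');
        }

        // serialize + base64_encode gives a plain string that fits a text column
        $blob = base64_encode(serialize($product));
        $stmt = $db->prepare('UPDATE pages SET data = ? WHERE product_id = ?');
        $stmt->execute(array($blob, $row['product_id']));
    }

Reading a product back later is just unserialize(base64_decode($row['data'])).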
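And the category walk, roughly. I am writing the OpenCart table names from memory (category_description holds the names, category holds the parent_id tree), so treat this as a sketch to check against your own database, table prefix included:

    <?php
    // Rough sketch: find an OpenCart category_id by name and walk up
    // parent_id to the root. Verify table and column names on your install.

    function findCategoryId(PDO $db, $name) {
        $stmt = $db->prepare('SELECT category_id FROM category_description WHERE name = ?');
        $stmt->execute(array($name));
        $id = $stmt->fetchColumn();
        return $id === false ? null : (int)$id;
    }

    function categoryPath(PDO $db, $categoryId) {
        $path = array();
        while ($categoryId) {
            $path[] = $categoryId;
            $stmt = $db->prepare('SELECT parent_id FROM category WHERE category_id = ?');
            $stmt->execute(array($categoryId));
            $categoryId = (int)$stmt->fetchColumn();   // 0 means we reached the top
        }
        return array_reverse($path);                   // top level first
    }

    $db = new PDO('mysql:host=localhost;dbname=opencart', 'user', 'pass');
    $id = findCategoryId($db, 'Starter Kits');         // made-up category name
    if ($id !== null) {
        print_r(categoryPath($db, $id));               // e.g. top, middle, leaf ids
    }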

When I started learning PHP there wasn't much object oriented philosophy in it, and there is a simple QBasic feeling to a project that has 4 files that go 1 2 3 4. I really thought this task was small enough for simple linear procedural programming. There was an argument once about simple programs not needing objects, and about not using objects being for lesser programmers. I decided I would be a great programmer and have been making objects left and right, but for some reason not this time. So I hit the wall when I needed a procedure to turn in on itself and call itself from itself. You may ask: why an object, why not a function? But there is a scope and loop issue here, and even once you resolve it there is still an aesthetics issue. Modular programming lets you make self contained logic blocks, a few simple lines of code packaged and put away. That lets you hide the things you don't need and makes the code easier to work with and maintain. Instead, I now have 4 files, two of them about a hundred lines long each, and when the script hits an error I wish I would just die. This has made me realize that no task is small enough to excuse dropping good structure. Take all the tricks from the book and throw them at the task. Instead of assuming it will be simple, diligently build your script from the ground up with the best practices you know. Maybe you will waste some time on small things, but when a task surprises you, you will be ready for it. A class I always use is my SQL class; it extends the built-in PDO class and has the routines for reading and writing. I also once wrote a class that turns curl into a browser for a different script; I could have used it for this task, since it has methods to fetch pages and download files. Instead I made this blob monster of code that can only step through pages, and another one for the images. In retrospect, it turns out I rewrote code I already had, and not as well as the first time I wrote it. I hope my suffering these past 3 weeks is helpful to somebody. You know how to code, don't cut corners.
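For what it's worth, the SQL class I keep reusing is nothing fancy. A stripped-down sketch of the idea, with made-up method names rather than my real ones:

    <?php
    // Sketch of a small PDO wrapper: one place for the connection plus
    // simple read/write helpers, so the crawler scripts stay short.
    // Method names and constructor defaults are illustrative only.

    class Sql extends PDO {
        public function __construct($dsn, $user, $pass) {
            parent::__construct($dsn, $user, $pass, array(
                PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
            ));
        }

        // Read: prepared SELECT, returns all rows as associative arrays
        public function read($sql, array $params = array()) {
            $stmt = $this->prepare($sql);
            $stmt->execute($params);
            return $stmt->fetchAll(PDO::FETCH_ASSOC);
        }

        // Write: prepared INSERT/UPDATE/DELETE, returns affected row count
        public function write($sql, array $params = array()) {
            $stmt = $this->prepare($sql);
            $stmt->execute($params);
            return $stmt->rowCount();
        }
    }

    // Usage: the four procedural files could have shared this instead of
    // each one building its own connection and query code.
    $db = new Sql('mysql:host=localhost;dbname=crawler', 'user', 'pass');
    $pages = $db->read('SELECT product_id FROM pages WHERE data IS NULL');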

01 January 2013

2012 Personal review

Hi. It's me again. Blogging once every 2 months is the type of schedule I can keep, but today is January 1st, 2013, and I have some free time to share some personal stuff with the cosmos. So...

2012 was both hard and exciting for me; although it hasn't been a successful year, at least it was interesting. The company I had been working at for the past 7-ish years changed management at the end of 2011, but I stayed with my old employer because I thought at the time it would be a kind of career advancement for me. It turned out to be a huge step back, as my main interests are in the IT field and his are in goods and services. So I have been re-purposed from technical consultant into a do-whatever-is-needed type of manager guy. And manager doesn't quite fit the bill either, since my employer likes to have a hands-on approach to everything, but he isn't any good at anything except trade. So my job has been a solid year of time wasting.

On the other hand, at the end of 2011 I co-founded a company with an old colleague, and it has grown under his management. It is still a small operation, and the profits are not enough to provide a viable income for two people, which is why I am still at my day job while he works it full time as a one-man IT department serving several clients acquired this year. He also managed to start an online PC store, and as the workload rises, so do the profits, and the moment I can join him draws nearer.

Although it is still a one-man operation, some fun stuff has spilled over to me. Since IT is mostly a locationless and borderless activity, I have been dragged into some fun administration and development tasks.
My latest adventure is bitcoin miners; who would have thought this could be a service? The thing is that some of our clients recently needed graphics cards, but that need went away. So they were left with some interesting hardware, and somebody came up with the idea to mine bitcoins instead of trying to sell the cards off. So we put together 3 bitcoin mining systems, and we are currently producing somewhere around 2900 Mhash per second. I really don't know if this is a lot or a little, but we are producing just a bit over the expected amount for each of the 3 systems, so...

I mentioned the web store earlier, and that was also fun. The store that the company I co-founded made is hosted on a shared hosting system in my current employer's data center. And I built the shared hosting system, which was fun. If you are the one regular reader of this blog (I'm sure there is one, and whoever you are, thanks for the support), you will know that I'm an openSUSE guy, but I decided to use CentOS for this particular task. I like CentOS, and I found that it can get the job done. For shared hosting you want to be able to give every hosted website its own workspace, and there are many ways to do that, but the method I fell in love with is the Webmin/Virtualmin combination. Webmin is an all-in-one Linux management web interface that lets you configure your system through a user friendly web GUI. Virtualmin lets you create separate, isolated Webmin accounts for the different hosted web services and clients. I don't know if this is the best way to do it, but the CentOS install plus the Webmin and Virtualmin installation and configuration took me one day (about 5 hours), so totally a win.

The other interesting thing is the web stores themselves. And I am saying stores, plural, because after the PC web store made by my company launched, my day job employer decided he wanted to do the same thing. E-cigarettes are currently big in Bulgaria, and although I don't care about them and I don't see them as a viable business, I still like the challenge of bringing up an e-commerce site. So I decided to use OpenCart. Initially I was told that we would have about 4000 items in the store, and manual input of that many different kinds of goods is just not viable. So I contacted our supplier and asked for some type of database access or similar in order to automate the entire process. Unfortunately the company doesn't know what IT is. They have a web store that was made by some contractor, but that contractor was also unresponsive to the demands of e-commerce. They offered an SQL dump, which would have been a one-time thing and would have left us with a manual data refreshing process. I have one thing to say to you if you are a supplier of goods or anything else: make sure your products database is in some way accessible to the distributors you work with. Make sure you can supply an up to date XML or CSV export, or even read only access to a database with the products and their characteristics. Otherwise you are hopeless. So since they were unresponsive to my needs, I decided to backdoor them, but more on that in my next post.

So what else? Ah, game development. Right. Project Konflikt still has no artwork and the engine is still pending a total rewrite. We started another project that will be fully 3D, and we are making a mock-up in the Blender game engine. More on that another time. Write me a comment and tell me what 2012 was like for you.