PHP spider / crawler / scraper script
The script runs from the command line (Linux shell) and mines data from a directory of companies, using the directory page as its starting page.
Features:
- It must be able to find data on paginated pages, grabbing company info such as company name, address, postal code, website, etc.
- It must be able to grab data even when the data follows different patterns in the HTML or the URLs.
- Extensible: custom spiders can be created for different websites.
- Has a session / failsafe feature, meaning it can resume where it left off if it is halted.
- Has a summary tool built in.
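
The session / failsafe feature above could be sketched roughly as follows. This is only an illustration under assumptions, not the script's actual code: the helper names (loadState, saveState, scrapePage), the JSON state file and its location are all hypothetical, and the real script may persist its progress differently.

```php
<?php
// Hypothetical helper: load saved crawl progress, or start fresh.
function loadState(string $file): array
{
    if (is_file($file)) {
        $state = json_decode(file_get_contents($file), true);
        if (is_array($state)) {
            return $state;
        }
    }
    return ['next_page' => 1, 'companies' => []];
}

// Hypothetical helper: write to a temp file, then rename, so an
// interruption in the middle of a write does not corrupt the state.
function saveState(string $file, array $state): void
{
    $tmp = $file . '.tmp';
    file_put_contents($tmp, json_encode($state));
    rename($tmp, $file);
}

// Stand-in for fetching and parsing one listing page (assumption).
function scrapePage(int $page): array
{
    return ["Company $page-A", "Company $page-B"];
}

$stateFile = sys_get_temp_dir() . '/spider_state_demo.json';
$lastPage  = 3; // assumed number of listing pages

$state = loadState($stateFile);
for ($page = $state['next_page']; $page <= $lastPage; $page++) {
    $state['companies'] = array_merge($state['companies'], scrapePage($page));
    $state['next_page'] = $page + 1;
    saveState($stateFile, $state); // checkpoint after every page
}
echo count($state['companies']), " companies collected\n";
```

Because the state is saved after each page, a halted run picks up at next_page on the following start instead of crawling from page 1 again.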
Libraries / technologies used:
- RegEx
- SimpleTest: by extending its SimpleBrowser class, the script is able to act as if it were a web browser.
- SPYC
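
As a rough illustration of how RegEx can cope with data that appears in different HTML patterns, here is a minimal sketch. The extractField function, the field names, the patterns and the sample HTML are all assumptions made for the example; they are not taken from the actual script.

```php
<?php
// Hypothetical helper: try each alternative pattern in order and
// return the first captured value, cleaned of tags and entities.
function extractField(string $html, array $patterns): ?string
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $html, $m) === 1) {
            return trim(html_entity_decode(strip_tags($m[1])));
        }
    }
    return null; // no pattern matched
}

// Assumed patterns: each field lists alternatives for different page layouts.
$patterns = [
    'name' => [
        '~<h2 class="company">(.*?)</h2>~s',
        '~<strong>Name:</strong>\s*([^<]+)~',
    ],
    'postal_code' => [
        '~<span class="zip">(\d{4,6})</span>~',
        '~Postal code:\s*(\d{4,6})~i',
    ],
];

// Sample markup (made up) that only matches the second alternatives.
$html = '<div><strong>Name:</strong> Acme Ltd <br>Postal code: 12345</div>';

$company = [];
foreach ($patterns as $field => $alternatives) {
    $company[$field] = extractField($html, $alternatives);
}
print_r($company);
```

Keeping the alternatives in a plain array means a spider for a new website only needs a new set of patterns, which fits the extensibility goal listed under Features.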

