Web Page Scraping
One of the ways we've been gathering data for years has
been called "screen scraping." You log onto a program, find
certain text in certain portions of the page, and "scrape"
it off for your own application's needs.
The explosion of the Internet has turned this into a whole new animal.
So much more information is available now, but it's constantly
changing as well! RSS, blogs, XML, Web Services and other types of
feeds can supply much of the information you need on your site, but
every once in a while you still have to get out there and scrape.
The examples below demonstrate how this works.
Original Web Page
The frame below shows you a "stats" page for our website. Notice
the table about halfway down that shows the percentage of hits
from each country.
A bit hard to read, isn't it? Imagine all you really wanted to
see was this chart. Normally you'd have to display the page in an
IFRAME HTML element (like this), use a FRAMESET, or send the user
off-site to view the page. Who wants that, when you could deliver
THIS to them:
| Hits | Percent | Country | |
| 341 | 77.32% | United States | |
| 29 | 6.58% | India | |
| 17 | 3.85% | Unknown | - |
| 8 | 1.81% | Canada | |
| 6 | 1.36% | South Africa | |
| 6 | 1.36% | Ukraine | |
| 4 | 0.91% | Australia | |
| 3 | 0.68% | United Kingdom | |
| 3 | 0.68% | Philippines | |
| 3 | 0.68% | Netherlands | |
| 3 | 0.68% | Denmark | |
| 3 | 0.68% | Hong Kong | |
| 2 | 0.45% | Singapore | |
| 2 | 0.45% | Germany | |
| 2 | 0.45% | Israel | |
| 1 | 0.23% | Lithuania | |
| 1 | 0.23% | Italy | |
| 1 | 0.23% | Jamaica | |
| 1 | 0.23% | Malta | |
| 1 | 0.23% | Switzerland | |
| 1 | 0.23% | Malaysia | |
| 1 | 0.23% | Senegal | |
| 1 | 0.23% | Pakistan | |
| 1 | 0.23% | Russian Federation | |
Conclusion
So what would you rather deliver on your site: a link or an IFRAME
displaying someone else's site, or just the information you want,
scraped from their site and presented as part of your own? We aren't
advocating plagiarism or stealing copyrighted material, and we encourage
you to give credit where credit is due, but isn't this the way to go?
We can build a site for you that contains pages with this type of
functionality, or we can provide you with development tools and controls
to do it on your own!
Our scraper offers the following features:
- Accepts any accessible URL: HTML, RSS, XML, etc.
- Server-side processing: no client scripts required!
- Simple operation: set the start and end text to look for in the HTML, and let it go!
- Replacement options: remove certain characters or change them to your own.
This allows you to do things like hide images, alter text, etc.
- Optional "backup plan": display the page in an IFRAME if for some reason
the text to scrape cannot be found.
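To make the feature list concrete, here is a minimal Python sketch of the same idea: fetch a page on the server, keep whatever sits between a start and an end marker, apply replacements, and fall back to an IFRAME when the markers are missing. This is not our scraper's actual code. The function names (fetch, extract_between, apply_replacements, scrape) and their parameters are assumptions made up for illustration.

```python
import urllib.request
from html import escape


def fetch(url, timeout=10):
    """Download a page on the server side and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_between(page, start_text, end_text):
    """Return the slice from the first start_text through the next end_text
    (markers included), or None if either marker cannot be found."""
    start = page.find(start_text)
    if start == -1:
        return None
    end = page.find(end_text, start + len(start_text))
    if end == -1:
        return None
    return page[start:end + len(end_text)]


def apply_replacements(fragment, replacements):
    """Apply (old, new) pairs in order; an empty new string removes old."""
    for old, new in replacements:
        fragment = fragment.replace(old, new)
    return fragment


def scrape(url, page, start_text, end_text, replacements=()):
    """Return the cleaned fragment, or an IFRAME for url as the backup plan."""
    fragment = extract_between(page, start_text, end_text)
    if fragment is None:
        # Backup plan: show the whole page in an IFRAME instead.
        return '<iframe src="{}"></iframe>'.format(escape(url, quote=True))
    return apply_replacements(fragment, replacements)
```

A typical call would be scrape(url, fetch(url), '<table id="countries">', '</table>', [("Unknown", "n/a")]), where the table id is a made-up marker. Plain string markers are easy to configure, but they break as soon as the source page changes its markup, which is why the IFRAME fallback is worth having.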
Here is another example of the scraping functionality, this time with
all images completely removed from the scraped page:
| Hits | Percent | Country | |
| 341 | 77.32% | United States | |
| 29 | 6.58% | India | |
| 17 | 3.85% | Unknown | - |
| 8 | 1.81% | Canada | |
| 6 | 1.36% | South Africa | |
| 6 | 1.36% | Ukraine | |
| 4 | 0.91% | Australia | |
| 3 | 0.68% | United Kingdom | |
| 3 | 0.68% | Philippines | |
| 3 | 0.68% | Netherlands | |
| 3 | 0.68% | Denmark | |
| 3 | 0.68% | Hong Kong | |
| 2 | 0.45% | Singapore | |
| 2 | 0.45% | Germany | |
| 2 | 0.45% | Israel | |
| 1 | 0.23% | Lithuania | |
| 1 | 0.23% | Italy | |
| 1 | 0.23% | Jamaica | |
| 1 | 0.23% | Malta | |
| 1 | 0.23% | Switzerland | |
| 1 | 0.23% | Malaysia | |
| 1 | 0.23% | Senegal | |
| 1 | 0.23% | Pakistan | |
| 1 | 0.23% | Russian Federation | |
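Removing images from a scraped fragment can be as simple as deleting every IMG tag. The Python sketch below shows one way to do it with a regular expression. It is an illustration only, and strip_images is a hypothetical name, not part of our scraper.

```python
import re

# Matches an opening <img ...> tag in any letter case, including the
# self-closing form <img ... />.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(fragment):
    """Return fragment with every IMG tag removed."""
    return IMG_TAG.sub("", fragment)
```

A regular expression is enough for simple, well-behaved markup like a stats table. It would mis-handle a tag whose attribute value contains a literal ">" character, so a real HTML parser is the safer choice for pages you don't control.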
Bible Reading Scraper
One of our most recent projects involved this scraping technology. We used a database of Bible passages, along with BibleGateway.com's search engine, to create a mashup page of daily readings for reading through the entire Bible in a year. You can check it out here.
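As a rough idea of how a daily-reading page like this could pick what to scrape, the Python sketch below maps the day of the year to a list of passages and builds one lookup URL per passage. Everything here is an assumption for illustration: the READINGS dictionary is a tiny stand-in for the passage database with made-up entries, reading_urls is a hypothetical name, and the "/passage/?search=" address format is our assumption about BibleGateway's lookup page rather than a documented interface.

```python
from datetime import date
from urllib.parse import urlencode

# Hypothetical stand-in for the passage database: day of year -> passages.
READINGS = {
    1: ["Genesis 1-3"],
    2: ["Genesis 4-7"],
}

# Assumed address of the passage lookup page.
BASE_URL = "https://www.biblegateway.com/passage/"


def reading_urls(day, readings=READINGS):
    """Return one lookup URL per passage scheduled for the given date."""
    passages = readings.get(day.timetuple().tm_yday, [])
    return [BASE_URL + "?" + urlencode({"search": p}) for p in passages]
```

Each URL returned by reading_urls(date.today()) could then be fetched and trimmed down to the passage text with the start and end markers described in the feature list.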