Adsense Secret: Scraper Site

A scraper site is a website that copies all of its content from other websites using web scraping. No part of a scraper site is original. A search engine is not a scraper site: sites such as Yahoo and Google gather content from other websites and index it so that the index can be searched with keywords. Search engines then display snippets of the original site content in response to a user's search.

In the last few years, and due to the advent of the Google Adsense web advertising program, scraper sites have proliferated at an amazing rate for spamming search engines. Open content sites such as Wikipedia are a common source of material for scraper sites.

Made for AdSense

Some scraper sites are created for monetizing the site using advertising programs such as Google AdSense. In such case, they are called Made for AdSense sites or MFA. This is also a derogatory term used to refer to websites that have no redeeming value except to get web visitors to the website for the sole purpose of clicking on advertisements.

Made for AdSense sites are considered sites that are spamming search engines and diluting the search results by providing surfers with less-than-satisfactory search results. The scraped content is considered redundant to that which would be shown by the search engine under normal circumstances had no MFA website been found in the listings.

These types of websites are being eliminated in various search engines and sometimes show up as supplemental results instead of being displayed in the initial search results.

Some sites engage in "Adsense Arbitrage"--they will buy Adwords spots for lower cost search terms and bring the visitor to a page that is mostly Adsense. The arbitrager then makes the difference between the low value clicks he bought from AdWords and the higher value clicks generated by this traffic on his MFA sites. In 2007, Google cracked down on this business model by closing the accounts of many arbitragers. Another way Google and Yahoo are combating the proliferation of arbitrage are through quality scoring systems. For example, in Google's case, Adwords penalizes "low quality" advertiser pages by placing a higher per click value to its campaigns. This effectively evaporates the arbitrager's profit margin.

Legality

Scraper sites may violate copyright law. Even taking content from an open content site can be a copyright violation, if done in a way which does not respect the license. For instance, the GNU Free Documentation License (GFDL) and Creative Commons ShareAlike (CC-BY-SA) licenses require that a republisher inform readers of the license conditions, and give credit to the original author.

Techniques

Many scrapers will pull snippets and text from websites that rank high for keywords they have targeted. This way they hope to rank highly in the SERPs (Search Engine Results Pages). RSS feeds are vulnerable to scrapers.

Some scraper sites consist of advertisements and paragraphs of words randomly selected from a dictionary. Often a visitor will click on a pay-per-click advertisement because it is the only comprehensible text on the page. Operators of these scraper sites gain financially from these clicks. Ad networks such as Google AdSense claim to be constantly working to remove these sites from their programs, although there is an active polemic about this since these networks benefit directly from the clicks generated at these kind of sites. From the advertiser's point of view, the networks don't seem to be making enough effort to stop this problem.

Scrapers tend to be associated with link farms and are sometimes perceived as the same thing, when multiple scrapers link to the same target site. A frequent target victim site might be accused of link-farm participation, due to the artificial pattern of incoming links to a victim website, linked from multiple scraper sites.

Web Scraping

Web scraping (sometimes called harvesting) generically describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context. Those who scrape websites may wish to store the information in their own databases or manipulate the data within a spreadsheet (Often, spreadsheets are only able to contain a fraction of the data scraped). Others may utilize data extraction techniques as means of obtaining the most recent data possible, particularly when working with information subject to frequent changes. Investors analyzing stock prices, realtors researching home listings, meteorologists studying weather, or insurance salespeople following insurance prices are a few individuals who might fit this category of users of frequently updated data.

Access to certain information may also provide users with strategic advantage in business. Attorneys might wish to scrape arrest records from county courthouses in search of potential clients. Businesses that know the locations of competitors can make better decisions about where to focus further growth. Another common, but controversial use of information taken from websites is reposting scraped data to other sites.

Scraper sites

A typical example application for web scraping is a web crawler that copies content from one or more existing websites in order to generate a scraper site. The result can range from fair use excerpts or reproduction of text and content, to plagiarized content. In some instances, plagiarized content may be used as an illicit means to increase traffic and advertising revenue. The typical scraper website generates revenue using Google AdSense, hence the term 'Made for AdSense' or MFA website.

Web scraping differs from screen scraping in the sense that a website is really not a visual screen, but a live HTML/JavaScript-based content, with a graphics interface in front of it. Therefore, web scraping does not involve working at the visual interface as screen scraping, but rather working on the underlying object structure (Document Object Model) of the HTML and JavaScript.

Web scraping also differs from screen scraping in that screen scraping typically occurs many times from the same dynamic screen "page", whereas web scraping occurs only once per web page over many different static web pages. Recursive web scraping, by following links to other pages over many web sites, is called "web harvesting". Web harvesting is necessarily performed by a software called a bot or a "webbot", "crawler", "harvester" or "spider" with similar arachnological analogies used to refer to other creepy-crawly aspects of their functions. Web harvesters are typically demonised, while "webbots" are often typecast as benevolent.

There are legal web scraping sites that provide free content and are commonly used by webmasters looking to populate a hastily made site with web content, often to profit by some means from the traffic the article hopefully brings. This content does not help the ranking of the site in search engine results because the content is not original to that page. Original content is a priority of search engines. Use of free articles usually requires one to link back to the free article site, as well as to a link(s) provided by the author. This is however not necessary as some sites those which provide free articles might also have a clause in their terms of service that does not allow copying content - link back or not. The site Wikipedia.org, (particularly the English Wikipedia) is a common target for web scraping.

Legal issues

Although scraping is against the terms of use of some websites, the enforceability of these terms is unclear.While outright duplication of original expression will in many cases be illegal, the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. Also, in a February, 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) found systematic crawling, indexing and deep linking by portal site ofir.dk of real estate site Home.dk not to conflict with Danish law or the database directive of the European Union.

U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels,which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. However, to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels.

In Australia, the 2003 Spam Act outlaws some forms of web harvesting.

Technical measures to stop bots

A web master can use various measures to stop or slow a bot. Some techniques include:

* Blocking an IP address. This will also block all browsing from that address.
* If the application is well behaved, adding entries to robots.txt will be adhered to. You can stop Google and other well-behaved bots this way.
* Sometimes bots declare who they are. Well behaved ones do (for example 'googlebot'). They can be blocked on that basis. Unfortunately, malicious bots may declare they are a normal browser.
* Bots can be blocked by excess traffic monitoring.
* Bots can be blocked with tools to verify that it is a real person accessing the site, such as the CAPTCHA project.
* Sometimes bots can be blocked with carefully crafted Javascript.
* Locating bots with a honeypot or other method to identify the IP addresses of automated crawlers.

Adsense Secret

Friday, November 21, 2008

Scraper Site

Adsense Secret Headline

Blog Archive

Blog List

Follower