Thursday, September 4, 2014

Data Scraping from PDF and Excel

I am doing a little data scraping, There are 3 types of file from which i am scraping data.

1- HTML
2- PDF
3- Excel(xls)

For HTML i am comfortable, i am using HTML Agility for that.

For PDF and excel i need suggestions from anyone.



Concerning Excel. If you are in a MS environment you can either do Office Automation or use OLEDB. In a Java environment look at Apache POI.

EDIT: Concerning PDF in Java try Apache PDFBox . Can also work in .NET using IKVM

I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF files into Excel. We have used it with great success.

HTML Agility is a library. Its good to use. But then, why do you need separate tools for different data extraction purposes? Use Automation Anywhere to extract data from any source. As far as I know, it would work for all the three sources you have specified. Google it.

Source: http://stackoverflow.com/questions/3147803/data-scraping-from-pdf-and-excel

Wednesday, September 3, 2014

Excel VBA Data Mining Real-Time Data from a Web Page that Refreshes Data

I want to capture real-time data that updates into a table on a webpage; I prefer capturing it into excel using VBA, but I will write it in .NET C# or VB if I that is easier.

the data updates about 1 or 2 seconds, and I want to just grab the latest data quotes and log it into my spreadsheet; the table names are the same, only the data refreshes, and it does so automatically on the web page.

I've done a lot of Excel VBA and I know how to download a URL to a file--this is NOT what I want; I want to gain access to my webpage that is active and grab the data updates after I've logged into my site and selected a webpage that I like.

Is there a simple way to access this data on the webpage from Excel or .Net? Because it refreshes no more than once every 1 or 2 seconds, it is easy to just keep checking it for updates, and I can compare the latest data to see if it actually refreshed.


In Excel 2003, use Data/Import External Data/New Web Query
Browse to your page and select the table you want to import.
After that you can either do a manual Refresh, or use a timer procedure to do something like:

Source: http://stackoverflow.com/questions/9855794/excel-vba-data-mining-real-time-data-from-a-web-page-that-refreshes-data

Need to pull data from a website…web query? macro?


I have a list of every DOT # (Dept. of Trans.) in the country. I want to find out insurance effective date for each one of these companies. If you go to http://li-public.fmcsa.dot.gov --> "continue" --> then from the dropdown select "carrier search" and hit "go" it'll take you to a search form (that is the only way to get to this screen).

From there, you can input a DOT # X (use 61222 as an example) and it'll bring you to another screen. Click "view report in HTML" and then down on the bottom you'll see "Active/Pending Insurance". I want to pull the "effective date" from that page and stick it in the spreadsheet next to the DOT # X that I already know.

Of the thousands of DOT #'s in my list, not all will have filings on this website, if that makes a difference.

Can this be done with a Macro or Excel Web Query? I know I probably sound like a total novice, but I'd appreciate any help I could get.

Can you do it? Frankly even if you could you'd lock up the spreadsheet while it's doing that processing. And in the end, how would you handle an error half-way through?

I'd not do this in a client-facing application. This sounds more like something to do in server-side app that can do the processing and gather the information in a more controlled environment. Then you Excel spreadsheet could query that app and get the information in one fell swoop. Error handling is much simpler and you don't end up sitting there staring at Excel why it works its way through thousands of web sites. It was not built to do that elegantly.

What do you write the web service I'm describing in? Well it depends on your preference. Me, I'd write it in Ruby on Rails since it can easily handle the scraping aspect of the task and can report the data out easily as well. But it really falls back to whatever you're most comfortable coding in.


Source: http://stackoverflow.com/questions/15286429/need-to-pull-data-from-a-website-web-query-macro

How to extract data from web 2.0 graphs using a scraper


I have recently come across a web page containing a graph object that displays the (x, y) values on the object as the mouse is rolled across it. Is there any way to automate the extraction of this data?

How is the graph data loaded? If embedded in the page source then you can extract it with xpath or regex. Else use Firebug to see how it is loaded.

You will need a solution that works inside the web browser, so the AJAX/Javascript is properly rendered.

I have used iMacros with good success for web scraping in the past. There are free/open-source and "PRO" paid editions (comparison table here).

Another option is always to custom code something with the Microsoft webbrowser control.


Source: http://stackoverflow.com/questions/3980774/how-to-extract-data-from-web-2-0-graphs-using-a-scraper

Tuesday, September 2, 2014

Legality of Web Scraping vs Normal Use


I know the topic of web scraping has been discussed before (example), and I understand it's a bit of a grey area

depending on a lot of factors (e.g. website's terms of use).

What I'd like to ask is: how is web scraping any different from (a) how we access the webpage via a web browser, and

(b) how web crawlers (e.g. Google) download and index webpages?

Without knowing the legal background, I can't help but think that they're all just HTTP requests. If web scraping is

illegal, then so should crawling and indexing (for instance be illegal).

Of course if your program is hitting the server so hard that it causes a denial of service, it's a different story

altogether... my point is simply accessing and using data that is already open to the public.



I know this is a dead thread, but it would be nice to place some legal implications here due to its ranking in my

Google Search. I cannot help but figure I am not the only one who searches like I do.

Legally, in the US, there are a few factors that seem to be important.

    Are you doing anything that is akin to hacking or gaining unauthorized access via the Computer Fraud and Abuse Act.

Exploiting vulnerabilities and passing SQL in the URL to open a database no matter how bad the idiot programming like

that was is illegal with a 15 year sentence (see the cases where an individual exploited security vulnerabilities in

Verizon). Also, add a time out even if you round robin or use proxies. DDoS attacks are attacks. 1000 requests per

second can shut down a lot of servers providing public information. The result here is up to 15 years in jail.

    Copyright Law: As mentioned, pure replication of data is illegal. Even 4% replication has been deemed a breach.

With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties.

    Trespass and Chattels: The following from wikipedia says it all.

    U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to

chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a

scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering

Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

    Paywalls and Product: When going behind paywalls and breaching contract by clicking an agreement not to do

something and then doing it, you add fuel to the protection of negligence v. willingness [an issue for damages and

penalties not guilt] in civil and any criminal trials. (sorry originally wanted to say ignorance but it really isn't a

defense)

    International: EU law and other law is way more lax. Corporations with big budgets dominate our legal landscape.

They control the system in a very real way with their $$$.

Basically, get public information and information that is available without going behind a pay wall. Think like a user

of the internet and combine a bunch of sources into a unique product. Don't just 'steal' an entire site (it isn't

really stealing if it is a government site that offers public data especially for download but is if you download all

or even more than a couple of the listings on ebay). Read the terms and conditions to know who actually owns the

content.

Here are a few examples. Trulia owns its information but you could use it to go to an agents website and collect a

legal amount of information. The legal amount is determinable. However, a public MLS listing lookup site with no

agreement or terms and offering data to the public is fair game. The MLS numbers lists, however, are normally not fair

game.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a

million corporate researchers at your disposal.

AS for company policy, it is usually used internally to shield from liability and serves as a warning but is not

entirely enforceable. The legal parts letting you know about copyrights and such are and usually are supposed to be

known by everyone. Complete ignorance is not a legal protection. It does provide a ground set of rules. Be nice, or get

banned is that message as far as I know.

My personal strategy is to start with public data and embellish it within legal means.


Source: http://stackoverflow.com/questions/14735791/legality-of-web-scraping-vs-normal-use

Anyone knows an online tool that can scrape a page and create a REST API for the scraped data?


I'm looking for a SaaS solution that is able to login to a platform, scrape data (reports) and then allow accessing the

data through an API. I have some reporting platforms that provide web reporting and email reporting but with no API.

Online reporting doesn't help and email reporting, although can be automated and scraped, isn't so reliable.

If you are willing to do the scraping through your own connection, have a look at Import IO. They have a desktop

application that you use to teach the system how to scrape a page, and then you run the crawler from that application -

and you can run it for as long as you like, as far as I can tell.

You may then upload your data to the Import cloud, from where it is available via an API on the import.io servers.

Useful data can be made public to donate it "to the commons" if you wish.


I did some more digging, found iMacros as a possible solution. Its Windows based, which is a drawback in my case, but

it does allow automation of the scraping and afterwards interaction via common web scripting languages like PHP and

ASP.net.


If you are familiar with jQuery, I think you can use node.js and Cheerio module, then you can create a simple

application to do auto scraping. Actually I have already built a site to do on line web scraping based on the above

mentioned tech, the site is www.datafiddle.net, you can take a look at it.


Source: http://stackoverflow.com/questions/19646028/anyone-knows-an-online-tool-that-can-scrape-a-page-and-create-a-

rest-api-for-the

Monday, September 1, 2014

Scrapy, scraping price data from StubHub


I've been having a difficult time with this one.

I want to scrape all the prices listed for this Bruno Mars concert at the Hollywood Bowl so I can get the average

price.

http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/

I've located the prices in the HTML and the xpath is pretty straightforward but I cannot get any values to return.

I think it has something to do with the content being generated via javascript or ajax but I can't figure out how to

send the correct request to get the code to work.

Here's what I have:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from deeptix.items import DeeptixItem

class TicketSpider(BaseSpider):
    name = "deeptix"
    allowed_domains = ["stubhub.com"]
    start_urls = ["http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/"]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//div[contains(@class, "q_cont")]')
    items = []
    for site in sites:
        item = DeeptixItem()
        item['price'] = site.xpath('span[contains(@class, "q")]/text()').extract()
        items.append(item)
    return items

Any help would be greatly appreciated I've been struggling with this one for quite some time now. Thank you in advance!


Source: http://stackoverflow.com/questions/22770917/scrapy-scraping-price-data-from-stubhub