This commit is contained in:
Kyle Maxwell 2008-12-29 18:16:16 -08:00
parent 014bf88c74
commit 5ac8f6e55b
1 changed files with 86 additions and 0 deletions

86
INTRO Normal file
View File

@ -0,0 +1,86 @@
<html><textarea style="width:100%;height:100%">
Towards a universal scraping API
or, an introduction to dexter
Web scraping is a chore. Scraper scripts are brittle and slow, and everyone writes their own custom implementation, resulting in countless hours of repeated work. Let's work together to make it easier. Let's do what regular expressions did for text processing, and what SQL did for databases. Let's create a universal domain-specific language for web scraping.
What features do we need? The must haves:
- Concise
- Easy-to-learn
- Powerful
- Idiomatic
- Portable
- FAST!!!
In order to make this easy to learn, let's keep the best of what's working today. I really like Hpricot's ability to use either xml or css to specify tags to extract. (For those that don't know, you can use "h1 a" [css] or "//h1//a" [xpath] to represent all of the hyperlinks inside paragraphs in a document). Sometimes, i'd even like to mix xpath and css, i.e.: "substring-after(h1, ':')". Regular expressions are *really* useful, so let's support them too. Lets use the XPath2 syntax.
Now for some examples:
- 3rd paragraph:
p:nth-child(3)
- First sentence in that paragraph (period-delimited):
substring-before(p:nth-child(3), '.')
- Any simple phone number in an ordered list called "numbers"
re:match(ul#numbers>li, '\d{3}-\d{4}', 'g')
We support all of CSS3, XPath1, as well as all functions in XSLT 1.0 and EXSLT (required+regexp).
I think this is a pretty good way to grab a single piece of data from a page. It's simple and gives you all of the tools (CSS for simplicity, XPath for power, regex for detailed text handling) you are used to, in one expression.
We'd like to make our scraper script both portable and fast. For both these reasons, we need to be able to express the structure of the scraped data independently of the general-purpose programming language you happen to be working in. Jumping from XPath to Python and back means multiple passes over the document, and Python idioms prevent easy use of your scraper by Rubyists. If we can represent the entire scrape in a language-independent way, we can compile it into something that libxml2 can handle in one pass, giving screaming-fast (milliseconds per parse) performance.
To describe the output structure, lets use json. It's compact, and the Ruby/Python/etc bindings can use hashes/lists/dictionaries to represent the same structure. We can also have the scraper output json or native data structures. Here's an example script that grabs the title and all hyperlinks on a page:
{
"title": "h1",
"links": ["a"]
}
Applying this to http://www.yelp.com/biz/amnesia-san-francisco yields:
{
"title": "Amnesia",
"links": ["Yelp", "Welcome", "About Me", ... ]
}
You'll note that the output structure mirrors the input structure. In the Ruby binding, you can get both input and output natively:
> require "open-uri"
> require "dexter"
> Dexterous.new({"title" => "h1", "links" => ["a"]}).parse(:url => "http://www.yelp.com/biz/amnesia-san-francisco")
#=> {"title"=>"Amnesia", "links"=>["Yelp", "Welcome", "About Me"]}
We'll also add both explicit and implicit grouping Here's an extension of the previous example with explicit grouping:
{
"title": "h1",
"links(a)": [{
"text": ".",
"link": "@href"
}]
}
The json structure in the output still mirrors the input, but now you can get both the link text and the href.
Pages like craigslist are slightly trickier to group. Elements on this page go h4, p, p, p, h4, p, p, p. To group this, you could do:
{
"entry(p)":[{
"title": ".",
"date": "preceding::h4"
}]
}
If you instead wanted to group by date, you could use implicit grouping. It's implicit, because the parenthesized filter is omitted. Grouping happens by page order. We treat the first single (i.e. non-square-bracketed) value (the h4 in the below example) as the beginning of a new group, and adds following values to the group (i.e.: [h4, p, p, p], [h4, p, p], [h4, p]).
{
"entry":[{
"date": "h4",
"title": ["p"]
}]
}
In the next blog article, I'll talk about dex-validations, sharing, and automatic inference of dex scripts from web page structures. Hopefully, you have a taste of what dex scripts can do, and you like it. There's an alpha implementation under active development at []. I'd love to have more collaborators, bug reports, unit tests, docs, encouragement, etc.
</textarea></html>