python binding for parsley
Kyle Maxwell 719eeac05e misc 2009-01-18 22:50:23 -08:00
python after memory leak code review 2009-01-06 21:49:49 -08:00
ruby thread safety fixes 2009-01-06 20:03:52 -08:00
test misc 2009-01-18 22:50:23 -08:00
.gitignore i hope this is the right autotools 2009-01-03 16:27:11 -08:00
AUTHORS
ChangeLog
INSTALL reorg TODO 2009-01-04 16:46:18 -08:00
INTRO docs, broken argp update 2008-12-29 18:47:06 -08:00
Makefile.am optional key support 2009-01-06 16:38:31 -08:00
Makefile.in optional key support 2009-01-06 16:38:31 -08:00
NEWS
OUTLINE misc 2009-01-18 22:50:23 -08:00
PAPER progress 2009-01-15 16:13:11 -08:00
Portfile port working 2009-01-03 01:19:50 -08:00
Portfile.in reverted 2009-01-03 16:25:16 -08:00
README.C-LANG thread safety fixes 2009-01-06 20:03:52 -08:00
README.markdown markdown anchor 2009-01-15 15:00:52 -08:00
TODO removed dex_error cruft 2009-01-06 17:25:59 -08:00
VERSION port working 2009-01-03 01:19:50 -08:00
aclocal.m4 i hope this is the right autotools 2009-01-03 16:27:11 -08:00
bootstrap.sh
config.guess i hope this is the right autotools 2009-01-03 16:27:11 -08:00
config.status fixed bootstrap bug 2009-01-05 12:28:38 -08:00
config.sub i hope this is the right autotools 2009-01-03 16:27:11 -08:00
configure reverted 2009-01-03 16:25:16 -08:00
configure.ac reverted 2009-01-03 16:25:16 -08:00
depcomp i hope this is the right autotools 2009-01-03 16:27:11 -08:00
dex_mem.c moving memory handling around, partial work on vex 2009-01-06 13:14:37 -08:00
dex_mem.h moving memory handling around, partial work on vex 2009-01-06 13:14:37 -08:00
dexter.c after memory leak code review 2009-01-06 21:49:49 -08:00
dexter.h thread safety fixes 2009-01-06 20:03:52 -08:00
dexter_main.c thread safety fixes 2009-01-06 20:03:52 -08:00
dexterc_main.c removed dex_error cruft 2009-01-06 17:25:59 -08:00
functions.c ignore malformed html, handle funny encodings 2009-01-04 14:14:38 -08:00
functions.h remote html works, some function aliases work 2008-12-30 16:27:20 -08:00
install-sh extras 2009-01-02 19:34:26 -08:00
kstring.c moving memory handling around, partial work on vex 2009-01-06 13:14:37 -08:00
kstring.h
libtool fixed bootstrap bug 2009-01-05 12:28:38 -08:00
ltmain.sh i hope this is the right autotools 2009-01-03 16:27:11 -08:00
missing i hope this is the right autotools 2009-01-03 16:27:11 -08:00
obstack.c
obstack.h
parser.y update newline function 2009-01-12 16:47:06 -08:00
printbuf.c
printbuf.h moving memory handling around, partial work on vex 2009-01-06 13:14:37 -08:00
regexp.c
scanner.l
util.c optional key support 2009-01-06 16:38:31 -08:00
util.h rm vex 2009-01-06 14:47:37 -08:00
xml2json.c reverted 2009-01-03 16:25:16 -08:00
xml2json.h
ylwrap extras 2009-01-02 19:34:26 -08:00

README.markdown

Overview 

Dexter is a simple language for extracting data from XML-like documents (including HTML). Dexter is:

  1. Blazing fast -- A typical HTML parse takes under 50 ms.
  2. Easy to write and understand -- Dexter uses your current knowledge of JSON, CSS, and XPath.
  3. Powerful -- Dexter understands full XPath, including standard and user-defined functions.

Examples

A simple script, or "dex", looks like this:

{
  "title": "h1",
  "links(a)": [
    {
      "text": ".",
      "href": "@href"
    }
  ]
}

A dex returns JSON or XML output with the same structure as the script. Applying this dex to http://www.yelp.com/biz/amnesia-san-francisco yields either:

{
  "title": "Amnesia",
  "links": [
    {
      "href": "\/",
      "text": "Yelp"
    },
    {
      "href": "\/",
      "text": "Welcome"
    },
    {
      "href": "\/signup?return_url=%2Fuser_details",
      "text": " About Me"
    },
    .....
  ]
}

or equivalently:

<dexter:root>
  <title>Amnesia</title>
  <links>
    <dexter:group>
      <href>/</href>
      <text>Yelp</text>
    </dexter:group>
    <dexter:group>
      <href>/</href>
      <text>Welcome</text>
    </dexter:group>
    <dexter:group>
      <href>/signup?return_url=%2Fuser_details</href>
      <text> About Me</text>
    </dexter:group>
    .....
  </links>
</dexter:root>
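
Because the JSON form is plain JSON, any standard JSON library can consume it. The sketch below uses Python's standard `json` module on an abbreviated copy of the output above (the elided entries are left out); it does not call Dexter itself, and the variable names are only illustrative. Note that `\/` is a valid JSON escape for `/`, so the decoded `href` values contain ordinary slashes.

```python
import json

# Abbreviated copy of the JSON output shown above; the elided entries are omitted.
# A raw string keeps the "\/" escapes intact so the JSON decoder sees them.
raw = r'''
{
  "title": "Amnesia",
  "links": [
    {"href": "\/", "text": "Yelp"},
    {"href": "\/", "text": "Welcome"},
    {"href": "\/signup?return_url=%2Fuser_details", "text": " About Me"}
  ]
}
'''

result = json.loads(raw)

# Collect (text, href) pairs, trimming the stray whitespace around link text.
pairs = [(link["text"].strip(), link["href"]) for link in result["links"]]

print(result["title"])
for text, href in pairs:
    print(f"{text}: {href}")
```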

This dex could also have been expressed as:

{
  "title": "h1",
  "links(a)": [
    {
      "text": ".",
      "href": "@href"
    }
  ]
}

The "a" in links(a) is a "key selector": an explicit grouping (with scope) for the array. You can use any XPath 1.0 or CSS3 expression as a value or a key selector; Dexter tries to detect which one you are using. You can also use CSS selectors inside XPath functions: "substring-after(h1>a, ':')" is a valid expression.
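
To make the key syntax concrete, here is a minimal Python sketch of how a key such as `links(a)` could be split into the output key (`links`) and its key selector (`a`). The function name `split_key` is hypothetical and this is not Dexter's actual parser; it only models the `name(selector)` shape described above.

```python
def split_key(key):
    """Split 'name(selector)' into (name, selector); selector is None if absent.

    Hypothetical helper for illustration only, not part of Dexter.
    """
    name, paren, rest = key.partition("(")
    if paren and rest.endswith(")"):
        # Drop only the final ")" so selectors may contain their own parentheses.
        return name, rest[:-1]
    return key, None


print(split_key("links(a)"))
print(split_key("title"))
print(split_key("names(substring-after(h1>a, ':'))"))
```

Splitting at the first `(` and trimming only the last `)` keeps nested function calls in the selector intact.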

Variables

You can use $foo to access the value of the key "foo" in the current scope (i.e. the current curly-brace nesting level). Also available are $parent.foo, $parent.parent.foo, $root.foo, $root.foo.bar, etc.
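
One way to picture these lookups is as a chain of nested scopes. The Python sketch below is a simplified model under stated assumptions, not Dexter's implementation: `Scope` and `resolve` are hypothetical names, leading `parent` and `root` segments are treated as scope navigation, and the remaining segments are treated as nested keys.

```python
class Scope:
    """One curly-brace level: its extracted values plus a link to the enclosing scope."""

    def __init__(self, values, parent=None):
        self.values = values
        self.parent = parent

    def root(self):
        scope = self
        while scope.parent is not None:
            scope = scope.parent
        return scope


def resolve(scope, reference):
    """Resolve '$foo', '$parent.foo', '$root.foo.bar' against a scope chain."""
    parts = reference.lstrip("$").split(".")
    # Leading "parent" / "root" segments move between scopes
    # (assumption: a final segment is always a key, never navigation).
    while len(parts) > 1 and parts[0] in ("parent", "root"):
        step = parts.pop(0)
        if step == "root":
            scope = scope.root()
        elif scope.parent is None:
            raise KeyError("no parent scope for " + reference)
        else:
            scope = scope.parent
    # The remaining segments are nested keys within the selected scope.
    value = scope.values
    for key in parts:
        value = value[key]
    return value


root = Scope({"title": "Amnesia", "foo": {"bar": "baz"}})
link = Scope({"text": "Yelp"}, parent=root)
inner = Scope({"href": "/"}, parent=link)

print(resolve(link, "$text"))
print(resolve(link, "$parent.title"))
print(resolve(inner, "$parent.parent.title"))
print(resolve(inner, "$root.foo.bar"))
```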

Custom Functions

You can write custom functions in XSLT (I'd like to also support C and JavaScript). They look like:

<func:function name="user:excited">
   <xsl:param name="input" />
   <func:result select="concat($input, '!!!!!!!')" />
</func:function>

If you run:

{
  "title": "user:excited(h1)"
}

on the Yelp page, you'll get:

{
  "title": "Amnesia!!!!!!!"
}
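
For readers less familiar with XSLT, the `user:excited` function above simply appends a fixed suffix to its argument via `concat`. A rough Python equivalent, offered only as an illustration (Dexter does not run Python functions, and the name `excited` is just a stand-in):

```python
def excited(value):
    """Mirror concat($input, '!!!!!!!'): append the fixed suffix to the input."""
    return value + "!!!!!!!"


print(excited("Amnesia"))
```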