===========================
An Exmaple - A News Article
===========================

This document provides a simple example of extracting the terms of a BBC
article from May 29, 2009. We will use several term extraction tools to
compare the outcome.

  >>> text ='''
  ... Police shut Palestinian theatre in Jerusalem.
  ...
  ... Israeli police have shut down a Palestinian theatre in East Jerusalem.
  ...
  ... The action, on Thursday, prevented the closing event of an international
  ... literature festival from taking place.
  ...
  ... Police said they were acting on a court order, issued after intelligence
  ... indicated that the Palestinian Authority was involved in the event.
  ...
  ... Israel has occupied East Jerusalem since 1967 and has annexed the
  ... area. This is not recognised by the international community.
  ...
  ... The British consul-general in Jerusalem , Richard Makepeace, was
  ... attending the event.
  ...
  ... "I think all lovers of literature would regard this as a very
  ... regrettable moment and regrettable decision," he added.
  ...
  ... Mr Makepeace said the festival's closing event would be reorganised to
  ... take place at the British Council in Jerusalem.
  ...
  ... The Israeli authorities often take action against events in East
  ... Jerusalem they see as connected to the Palestinian Authority.
  ...
  ... Saturday's opening event at the same theatre was also shut down.
  ...
  ... A police notice said the closure was on the orders of Israel's internal
  ... security minister on the grounds of a breach of interim peace accords
  ... from the 1990s.
  ...
  ... These laid the framework for talks on establishing a Palestinian state
  ... alongside Israel, but left the status of Jerusalem to be determined by
  ... further negotiation.
  ...
  ... Israel has annexed East Jerusalem and declares it part of its eternal
  ... capital.
  ...
  ... Palestinians hope to establish their capital in the area.
  ... '''


Yahoo Keyword Extractor
-----------------------

Yahoo provides a service that extracts terms from a piece of content using
its immense search database.

http://developer.yahoo.com/search/content/V1/termExtraction.html

As you can see, the result is excellent::

  <ResultSet>
     <Result>british consul general</Result>
     <Result>east jerusalem</Result>
     <Result>literature festival</Result>
     <Result>richard makepeace</Result>
     <Result>international literature</Result>
     <Result>israeli authorities</Result>
     <Result>eternal capital</Result>
     <Result>peace accords</Result>
     <Result>security minister</Result>
     <Result>israeli police</Result>
     <Result>internal security</Result>
     <Result>palestinian state</Result>
     <Result>palestinian authority</Result>
     <Result>british council</Result>
     <Result>palestinians</Result>
     <Result>negotiation</Result>
     <Result>breach</Result>
     <Result>1990s</Result>
     <Result>closure</Result>
     <Result>israel</Result>
  </ResultSet>

Unfortunately, the service allows only 5000 requests per 24 hours. Also, there
is no strength indicator on the terms.


TreeTagger
----------

A POS tagger that uses some linguistics to tag a text. Here is its output::

  Police          NNS       Police
  shut            VVD       shut
  Palestinian     JJ        Palestinian
  theatre         NN        theatre
  in              IN        in
  Jerusalem       NP        Jerusalem
  .               SENT      .
  Israeli         JJ        Israeli
  police          NNS       police
  have            VHP       have
  shut            VVN       shut
  down            RP        down
  a               DT        a
  Palestinian     JJ        Palestinian
  theatre         NN        theatre
  in              IN        in
  East            NP        East
  Jerusalem       NP        Jerusalem
  .               SENT      .
  The             DT        the
  action          NN        action
  ,               ,         ,
  on              IN        on
  Thursday        NP        Thursday
  ,               ,         ,
  prevented       VVD       prevent
  the             DT        the
  closing         NN        closing
  event           NN        event
  of              IN        of
  an              DT        an
  international   JJ        international
  literature      NN        literature
  festival        NN        festival
  from            IN        from
  taking          VVG       take
  place           NN        place
  .               SENT      .
  Police          NNS       Police
  said            VVD       say
  they            PP        they
  were            VBD       be
  acting          VVG       act
  on              IN        on
  a               DT        a
  court           NN        court
  order           NN        order
  ,               ,         ,
  issued          VVN       issue
  after           IN        after
  intelligence    NN        intelligence
  indicated       VVN       indicate
  that            IN        that
  the             DT        the
  Palestinian     NP        Palestinian
  Authority       NP        Authority
  was             VBD       be
  involved        VVN       involve
  in              IN        in
  the             DT        the
  event           NN        event
  .               SENT      .
  Israel          NP        Israel
  has             VHZ       have
  occupied        VVN       occupy
  East            NP        East
  Jerusalem       NP        Jerusalem
  since           IN        since
  1967            CD        @card@
  and             CC        and
  has             VHZ       have
  annexed         VVN       annex
  the             DT        the
  area            NN        area
  .               SENT      .
  This            DT        this
  is              VBZ       be
  not             RB        not
  recognised      VVN       recognise
  by              IN        by
  the             DT        the
  international   JJ        international
  community       NN        community
  .               SENT      .
  The             DT        the
  British         JJ        British
  consul-general  NN        <unknown>
  in              IN        in
  Jerusalem       NP        Jerusalem
  ,               ,         ,
  Richard         NP        Richard
  Makepeace       NP        Makepeace
  ,               ,         ,
  was             VBD       be
  attending       VVG       attend
  the             DT        the
  event           NN        event
  .               SENT      .
  "               ``        "
  I               PP        I
  think           VVP       think
  all             DT        all
  lovers          NNS       lover
  of              IN        of
  literature      NN        literature
  would           MD        would
  regard          VV        regard
  this            DT        this
  as              IN        as
  a               DT        a
  very            RB        very
  regrettable     JJ        regrettable
  moment          NN        moment
  and             CC        and
  regrettable     JJ        regrettable
  decision        NN        decision
  ,               ,         ,
  "               ''        "
  he              PP        he
  added           VVD       add
  .               SENT      .
  Mr              NP        Mr
  Makepeace       NP        Makepeace
  said            VVD       say
  the             DT        the
  festival        NN        festival
  's              POS       's
  closing         NN        closing
  event           NN        event
  would           MD        would
  be              VB        be
  reorganised     VVN       <unknown>
  to              TO        to
  take            VV        take
  place           NN        place
  at              IN        at
  the             DT        the
  British         NP        British
  Council         NP        Council
  in              IN        in
  Jerusalem       NP        Jerusalem
  .               SENT      .
  The             DT        the
  Israeli         JJ        Israeli
  authorities     NNS       authority
  often           RB        often
  take            VVP       take
  action          NN        action
  against         IN        against
  events          NNS       event
  in              IN        in
  East            NP        East
  Jerusalem       NP        Jerusalem
  they            PP        they
  see             VVP       see
  as              RB        as
  connected       VVN       connect
  to              TO        to
  the             DT        the
  Palestinian     JJ        Palestinian
  Authority       NP        Authority
  .               SENT      .
  Saturday        NP        Saturday
  's              POS       's
  opening         NN        opening
  event           NN        event
  at              IN        at
  the             DT        the
  same            JJ        same
  theatre         NN        theatre
  was             VBD       be
  also            RB        also
  shut            VVN       shut
  down            RP        down
  .               SENT      .
  A               DT        a
  police          NN        police
  notice          NN        notice
  said            VVD       say
  the             DT        the
  closure         NN        closure
  was             VBD       be
  on              IN        on
  the             DT        the
  orders          NNS       order
  of              IN        of
  Israel          NP        Israel
  's              POS       's
  internal        JJ        internal
  security        NN        security
  minister        NN        minister
  on              IN        on
  the             DT        the
  grounds         NNS       ground
  of              IN        of
  a               DT        a
  breach          NN        breach
  of              IN        of
  interim         JJ        interim
  peace           NN        peace
  accords         NNS       accord
  from            IN        from
  the             DT        the
  1990s           NNS       1990s
  .               SENT      .
  These           DT        these
  laid            VVD       lay
  the             DT        the
  framework       NN        framework
  for             IN        for
  talks           NNS       talk
  on              IN        on
  establishing    VVG       establish
  a               DT        a
  Palestinian     JJ        Palestinian
  state NN        state
  alongside       IN        alongside
  Israel          NP        Israel
  ,               ,         ,
  but             CC        but
  left            VVD       leave
  the             DT        the
  status          NN        status
  of              IN        of
  Jerusalem       NP        Jerusalem
  to              TO        to
  be              VB        be
  determined      VVN       determine
  by              IN        by
  further         JJR       further
  negotiation     NN        negotiation
  .               SENT      .
  Israel          NP        Israel
  has             VHZ       have
  annexed         VVN       annex
  East            NP        East
  Jerusalem       NP        Jerusalem
  and             CC        and
  declares        VVZ       declare
  it              PP        it
  part            NN        part
  of              IN        of
  its             PP$       its
  eternal         JJ        eternal
  capital         NN        capital
  .               SENT      .
  Palestinians    NPS       Palestinians
  hope            VVP       hope
  to              TO        to
  establish       VV        establish
  their           PP$       their
  capital         NN        capital
  in              IN        in
  the             DT        the
  area            NN        area
  .               SENT      .

As you can see, the identification of TreeTagger is pretty good, but the
output would need some analysis to produce a useful set of terms. Furthermore,
TreeTagger is not free for commercial use.

Topia's Term Extractor
----------------------

Topia's Term Extractor tries to produce results somewhere between a POS
tagger like TreeTagger and Yahoo Keyword Extraction.

Since we are only interested in nouns, a very simple POS tagging algorithm can
be deployed, which will provide good results most of the time. We then use
some simple statistics and linguistics to produce a narrow but strong list of
terms for the content.

  >>> from topia.termextract import extract
  >>> extractor = extract.TermExtractor()

Let's look at the result of the tagger first:

  >>> printTaggedTerms(extractor.tagger(text)) #doctest: +REPORT_NDIFF
  police          NN    police
  shut            VBN   shut
  Palestinian     JJ    Palestinian
  theatre         NN    theatre
  in              IN    in
  Jerusalem       NNP   Jerusalem
  .               .     .
  Israeli         JJ    Israeli
  police          NN    police
  have            VBP   have
  shut            VBN   shut
  down            RB    down
  a               DT    a
  Palestinian     JJ    Palestinian
  theatre         NN    theatre
  in              IN    in
  East            NNP   East
  Jerusalem       NNP   Jerusalem
  .               .     .
  The             DT    The
  action          NN    action
  ,               ,     ,
  on              IN    on
  Thursday        NNP   Thursday
  ,               ,     ,
  prevented       VBN   prevented
  the             DT    the
  closing         VBG   closing
  event           NN    event
  of              IN    of
  an              DT    an
  international   JJ    international
  literature      NN    literature
  festival        NN    festival
  from            IN    from
  taking          VBG   taking
  place           NN    place
  .               .     .
  police          NN    police
  said            VBD   said
  they            PRP   they
  were            VBD   were
  acting          VBG   acting
  on              IN    on
  a               DT    a
  court           NN    court
  order           NN    order
  ,               ,     ,
  issued          VBN   issued
  after           IN    after
  intelligence    NN    intelligence
  indicated       VBD   indicated
  that            IN    that
  the             DT    the
  Palestinian     JJ    Palestinian
  Authority       NNP   Authority
  was             VBD   was
  involved        VBN   involved
  in              IN    in
  the             DT    the
  event           NN    event
  .               .     .
  Israel          NNP   Israel
  has             VBZ   has
  occupied        VBN   occupied
  East            NNP   East
  Jerusalem       NNP   Jerusalem
  since           IN    since
  1967            NN    1967
  and             CC    and
  has             VBZ   has
  annexed         VBD   annexed
  the             DT    the
  area            NN    area
  .               .     .
  This            DT    This
  is              VBZ   is
  not             RB    not
  recognised      VBD   recognised
  by              IN    by
  the             DT    the
  international   JJ    international
  community       NN    community
  .               .     .
  The             DT    The
  British         JJ    British
  consul-general  NN    consul-general
  in              IN    in
  Jerusalem       NNP   Jerusalem
  ,               ,     ,
  Richard         NNP   Richard
  Makepeace       NNP   Makepeace
  ,               ,     ,
  was             VBD   was
  attending       VBG   attending
  the             DT    the
  event           NN    event
  .               .     .
  "               "     "
  I               PRP   I
  think           VBP   think
  all             DT    all
  lovers          NNS   lover
  of              IN    of
  literature      NN    literature
  would           MD    would
  regard          VB    regard
  this            DT    this
  as              IN    as
  a               DT    a
  very            RB    very
  regrettable     JJ    regrettable
  moment          NN    moment
  and             CC    and
  regrettable     JJ    regrettable
  decision        NN    decision
  ,"              ,     ,"
  he              PRP   he
  added           VBD   added
  .               .     .
  Mr              NNP   Mr
  Makepeace       NNP   Makepeace
  said            VBD   said
  the             DT    the
  festival        NN    festival
  's              POS   's
  closing         VBG   closing
  event           NN    event
  would           MD    would
  be              VB    be
  reorganised     NN    reorganised
  to              TO    to
  take            VB    take
  place           NN    place
  at              IN    at
  the             DT    the
  British         JJ    British
  Council         NNP   Council
  in              IN    in
  Jerusalem       NNP   Jerusalem
  .               .     .
  The             DT    The
  Israeli         JJ    Israeli
  authorities     NNS   authority
  often           RB    often
  take            VB    take
  action          NN    action
  against         IN    against
  events          NNS   event
  in              IN    in
  East            NNP   East
  Jerusalem       NNP   Jerusalem
  they            PRP   they
  see             VB    see
  as              IN    as
  connected       VBN   connected
  to              TO    to
  the             DT    the
  Palestinian     JJ    Palestinian
  Authority       NNP   Authority
  .               .     .
  Saturday        NNP   Saturday
  's              POS   's
  opening         NN    opening
  event           NN    event
  at              IN    at
  the             DT    the
  same            JJ    same
  theatre         NN    theatre
  was             VBD   was
  also            RB    also
  shut            VBN   shut
  down            RB    down
  .               .     .
  A               DT    A
  police          NN    police
  notice          NN    notice
  said            VBD   said
  the             DT    the
  closure         NN    closure
  was             VBD   was
  on              IN    on
  the             DT    the
  orders          NNS   order
  of              IN    of
  Israel          NNP   Israel
  's              POS   's
  internal        JJ    internal
  security        NN    security
  minister        NN    minister
  on              IN    on
  the             DT    the
  grounds         NNS   ground
  of              IN    of
  a               DT    a
  breach          NN    breach
  of              IN    of
  interim         JJ    interim
  peace           NN    peace
  accords         NNS   accord
  from            IN    from
  the             DT    the
  1990            NN    1990
  s               PRP   s
  .               .     .
  These           DT    These
  laid            VBN   laid
  the             DT    the
  framework       NN    framework
  for             IN    for
  talks           NNS   talk
  on              IN    on
  establishing    VBG   establishing
  a               DT    a
  Palestinian     JJ    Palestinian
  state           NN    state
  alongside       IN    alongside
  Israel          NNP   Israel
  ,               ,     ,
  but             CC    but
  left            VBN   left
  the             DT    the
  status          NN    status
  of              IN    of
  Jerusalem       NNP   Jerusalem
  to              TO    to
  be              VB    be
  determined      VBN   determined
  by              IN    by
  further         JJ    further
  negotiation     NN    negotiation
  .               .     .
  Israel          NNP   Israel
  has             VBZ   has
  annexed         VBD   annexed
  East            NNP   East
  Jerusalem       NNP   Jerusalem
  and             CC    and
  declares        VBZ   declares
  it              PRP   it
  part            NN    part
  of              IN    of
  its             PRP$  its
  eternal         JJ    eternal
  capital         NN    capital
  .               .     .
  Palestinians    NNPS  Palestinian
  hope            NN    hope
  to              TO    to
  establish       VB    establish
  their           PRP$  their
  capital         NN    capital
  in              IN    in
  the             DT    the
  area            NN    area
  .               .     .

Let's now apply the extractor.

  >>> sorted(extractor(text))
  [('British Council', 1, 2),
   ('British consul-general', 1, 2),
   ('East', 4, 1),
   ('East Jerusalem', 4, 2),
   ('Israel', 4, 1),
   ('Israeli authorities', 1, 2),
   ('Israeli police', 1, 2),
   ('Jerusalem', 8, 1),
   ('Mr Makepeace', 1, 2),
   ('Palestinian', 6, 1),
   ('Palestinian Authority', 2, 2),
   ('Palestinian state', 1, 2),
   ('Palestinian theatre', 2, 2),
   ('Palestinians hope', 1, 2),
   ('Richard Makepeace', 1, 2),
   ('court order', 1, 2),
   ('event', 6, 1),
   ('literature festival', 1, 2),
   ('opening event', 1, 2),
   ('peace accords', 1, 2),
   ('police', 4, 1),
   ('police notice', 1, 2),
   ('security minister', 1, 2),
   ('theatre', 3, 1)]
