src/Gargantext/Text/Terms.hs

{-|
Module      : Gargantext.Text.Terms
Description : Ngrams definition and tools
Copyright   : (c) CNRS, 2017 - present
License     : AGPL + CECILL v3
Maintainer  : team@gargantext.org
Stability   : experimental
Portability : POSIX

An @n-gram@ is a contiguous sequence of n items from a given sample of
text. In the Gargantext application the items are words and n is a
non-negative integer.

Using Latin numerical prefixes, an n-gram of size 1 is referred to as a
"unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size
3 is a "trigram". English cardinal numbers are sometimes used, e.g.,
"four-gram", "five-gram", and so on.

Source: https://en.wikipedia.org/wiki/Ngrams

TODO
group Ngrams -> Tree
compute occ by node of Tree
group occs according to groups

compute cooccurrences
compute graph

-}

{-# LANGUAGE NoImplicitPrelude #-}

module Gargantext.Text.Terms
  where

import Data.Text (Text)

import Gargantext.Prelude
import Gargantext.Core
import Gargantext.Core.Types
import Gargantext.Text.Terms.Multi (multiterms)
import Gargantext.Text.Terms.Mono  (monoterms')

-- | Kind of terms to extract: single words ('Mono') or multi-word terms ('Multi').
data TermType = Mono | Multi

------------------------------------------------------------------------
-- | Extract terms of the given 'TermType' from a text.
-- The language is required: passing 'Nothing' calls 'panic'.
terms :: TermType -> Maybe Lang -> Text -> IO [Terms]
terms Mono  (Just lang) txt = pure $ monoterms' lang txt
terms Multi (Just lang) txt = multiterms lang txt
terms _     Nothing     _   = panic "Lang needed"
------------------------------------------------------------------------

-- | Sample abstract used as test input for 'terms'.
termTests :: Text
termTests = "It is hard to detect important articles in a specific context. Information retrieval techniques based on full text search can be inaccurate to identify main topics and they are not able to provide an indication about the importance of the article. Generating a citation network is a good way to find most popular articles but this approach is not context aware. The text around a citation mark is generally a good summary of the referred article. So citation context analysis presents an opportunity to use the wisdom of crowd for detecting important articles in a context sensitive way. In this work, we analyze citation contexts to rank articles properly for a given topic. The model proposed uses citation contexts in order to create a directed and edge-labeled citation network based on the target topic. Then we apply common ranking algorithms in order to find important articles in this newly created network. We showed that this method successfully detects a good subset of most prominent articles in a given topic. The biggest contribution of this approach is that we are able to identify important articles for a given search term even though these articles do not contain this search term. This technique can be used in other linked documents including web pages, legal documents, and patents as well as scientific papers."
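
-- | Usage sketch, not part of the original module: one way 'terms' might be
-- applied to 'termTests' in 'Mono' mode. The name @termTestsMono@ is
-- hypothetical. Assumption: 'Lang' (imported from "Gargantext.Core") exposes
-- an @EN@ constructor for English; substitute whichever constructor the
-- project actually defines.
termTestsMono :: IO [Terms]
termTestsMono = terms Mono (Just EN) termTests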