{-|
Module      : Gargantext.Text.Terms
Description : Ngrams definition and tools
Copyright   : (c) CNRS, 2017 - present
License     : AGPL + CECILL v3
Maintainer  : team@gargantext.org
Stability   : experimental

An @n-gram@ is a contiguous sequence of n items from a given sample of
text. In the Gargantext application, the items are words and n is a
non-negative integer.

Using Latin numerical prefixes, an n-gram of size 1 is referred to as a
"unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size
3 is a "trigram". English cardinal numbers are sometimes used, e.g.,
"four-gram", "five-gram", and so on.

Source: https://en.wikipedia.org/wiki/Ngrams

compute occurrences by node of the Tree
group occurrences according to groups
-}
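
{- Illustration (not part of the original module): a minimal sketch of the
n-gram definition given in the module header, at the word level. The names
`ngrams` and `wordNgrams` are hypothetical, tokenisation is plain whitespace
splitting, and returning no n-grams for n <= 0 is a choice made for this
sketch only. Gargantext's own extractors may work quite differently.

```haskell
import Data.List (tails)

-- | All contiguous runs of exactly n items, in order of appearance.
-- Yields [] when n is not positive or when n exceeds the input length.
ngrams :: Int -> [a] -> [[a]]
ngrams n xs
  | n <= 0    = []
  | otherwise = [ take n t | t <- tails xs, length (take n t) == n ]

-- | Word-level n-grams of a sentence (whitespace tokenisation only).
wordNgrams :: Int -> String -> [[String]]
wordNgrams n = ngrams n . words
```

For instance, `wordNgrams 2 "the quick brown fox"` evaluates to
`[["the","quick"],["quick","brown"],["brown","fox"]]`, the three bigrams
of that sentence.
-}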
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE TemplateHaskell   #-}

module Gargantext.Text.Terms
  where

import Data.Text (Text)
import Data.Traversable

import Gargantext.Prelude
import Gargantext.Core
import Gargantext.Core.Types
import Gargantext.Text.Terms.Multi (multiterms)
import Gargantext.Text.Terms.Mono  (monoTerms)

data TermType lang
  = Mono      { _tt_lang :: lang }
  | Multi     { _tt_lang :: lang }
  | MonoMulti { _tt_lang :: lang }

--group :: [Text] -> [Text]
-- map (filter (\t -> not . elem t)) $
------------------------------------------------------------------------
-- | Sugar to extract terms from text (hiding mapM from the end user).
--extractTerms :: Traversable t => TermType Lang -> t Text -> IO (t [Terms])
extractTerms :: TermType Lang -> [Text] -> IO [[Terms]]
extractTerms termTypeLang = mapM (terms termTypeLang)
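
{- Illustration (not part of the original module): a self-contained sketch of
the pattern above, where mapping an IO extractor over many texts is hidden
behind one function. `Term`, `termsOf`, `extractAll` and `extractAllT` are
hypothetical stand-ins rather than Gargantext names, and the extractor here
only lower-cases words. The second function follows the shape of the
commented-out signature: swapping mapM over lists for `traverse` should be
enough to accept any Traversable container.

```haskell
import Data.Char (toLower)

-- Hypothetical stand-in for the module's 'Terms' type.
newtype Term = Term String deriving (Eq, Show)

-- Hypothetical stand-in for 'terms': one lower-cased term per word.
-- It lives in IO only to mirror the shape of the real extractor.
termsOf :: String -> IO [Term]
termsOf = pure . map (Term . map toLower) . words

-- List version, same shape as 'extractTerms'.
extractAll :: [String] -> IO [[Term]]
extractAll = mapM termsOf

-- Generalised version: any Traversable container works, not only lists.
extractAllT :: Traversable t => t String -> IO (t [Term])
extractAllT = traverse termsOf
```

With these definitions, `extractAllT (Just "Foo Bar")` yields
`Just [Term "foo",Term "bar"]`, while `extractAll` keeps the list-only
signature that the module currently exposes.
-}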
------------------------------------------------------------------------
-- | Extract terms from a text according to the given 'TermType'.
-- Mono      : mono terms
-- Multi     : multi terms
-- MonoMulti : mono and multi (currently delegates to Multi)
-- TODO : multi terms should exclude mono (intersection is not empty yet)
terms :: TermType Lang -> Text -> IO [Terms]
terms (Mono      lang) txt = pure $ monoTerms lang txt
terms (Multi     lang) txt = multiterms lang txt
terms (MonoMulti lang) txt = terms (Multi lang) txt
-- terms (WithList list) txt = pure . concat $ extractTermsWithList list txt
------------------------------------------------------------------------
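
{- Illustration (not part of the original module): one possible way to
approach the TODO on 'terms', sketched on stand-in types. Everything here is
an assumption: `Lang`, `Term`, `monoTerms'`, `multiTerms'` and `terms'` are
hypothetical, the extractors are pure, and the multi extractor simply returns
unigrams and bigrams so that it overlaps with the mono one. In the module
itself, MonoMulti currently delegates to Multi.

```haskell
import Data.List (tails)

-- Hypothetical stand-ins for the Gargantext types.
data Lang = EN | FR deriving (Eq, Show)
type Term = [String]   -- a term represented as its list of words

data TermType lang
  = Mono      { _tt_lang :: lang }
  | Multi     { _tt_lang :: lang }
  | MonoMulti { _tt_lang :: lang }

-- Stand-in mono extractor: every word is a one-word term.
monoTerms' :: Lang -> String -> [Term]
monoTerms' _ = map (: []) . words

-- Stand-in multi extractor: all unigrams, then all bigrams, so its
-- output overlaps with the mono extractor.
multiTerms' :: Lang -> String -> [Term]
multiTerms' _ txt = concatMap (\n -> ngrams n ws) [1, 2]
  where
    ws = words txt
    ngrams n xs = [ take n t | t <- tails xs, length (take n t) == n ]

-- Dispatch on the constructor. MonoMulti keeps the mono terms and only
-- those multi terms that span more than one word, so nothing is duplicated.
terms' :: TermType Lang -> String -> [Term]
terms' (Mono lang)      txt = monoTerms' lang txt
terms' (Multi lang)     txt = multiTerms' lang txt
terms' (MonoMulti lang) txt =
  monoTerms' lang txt ++ filter ((> 1) . length) (multiTerms' lang txt)
```

Filtering on term length is only one way to make the two sets disjoint;
whether it suits the real extractors depends on how 'Terms' is represented.
-}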