26 December, 2017

Domain Indexes -- 2 : STOPLIST for a Context Index

In the previous example, I created a simple CONTEXTIndex.
For a CONTEXT Index, Oracle automatically applies a default "STOPLIST".  This is a list of common words that are automatically excluded from the Index.
If you notice in the previous post, the words "this", "is", "a", "to", "be" are all in the MY_TEXT column of the MY_TEXT_TABLE but do not appear in the list of tokens in DR$MY_TEXT_INDEX$I.


Can we expand and build our own STOPLIST ?

I start with a new table containing the same rows :

SQL> connect ctxuser/ctxuser
Connected.
SQL> create table another_text_table
  2  as select id_column as id_column,
  3  my_text as another_text
  4  from my_text_table;

Table created.

SQL> select another_text from another_text_table;

ANOTHER_TEXT
--------------------------------------------------------------------------------
This is a long piece of text written by Hemant
Another long text to be captured by the index

SQL> 


I then build a custom STOPLIST (which I name as "another_text") and then create the CONTEXT Index.

SQL> begin
  2  ctx_ddl.create_stoplist('another_text');
  3  ctx_ddl.add_stopword('another_text','Hemant');
  4  ctx_ddl.add_stopword('another_text','long');
  5  end;
  6  /

PL/SQL procedure successfully completed.

SQL> 
SQL> create index another_text_index
  2  on another_text_table(another_text)
  3  indextype is ctxsys.context
  4  parameters('stoplist another_text');

Index created.

SQL> 


This time, I don't have to use CTX_DDL.SYNC_INDEX as the Index is created after the table is populated. Nevertheless, if I execute DML (INSERT, UPDATE or DELETE) to change data, I will need a call to CTX_DDL.SYNC_INDEX to update the Index.

Now, I check the Tokens that are created in this index.

SQL> select token_text, token_count
  2  from dr$another_text_index$i
  3  order by token_text
  4  /

TOKEN_TEXT                                                       TOKEN_COUNT
---------------------------------------------------------------- -----------
A                                                                          1
ANOTHER                                                                    1
BE                                                                         1
BY                                                                         2
CAPTURED                                                                   1
INDEX                                                                      1
IS                                                                         1
OF                                                                         1
PIECE                                                                      1
TEXT                                                                       2
THE                                                                        1
THIS                                                                       1
TO                                                                         1
WRITTEN                                                                    1

14 rows selected.

SQL> 


The STOPLIST words that I defined ('Hemant' and 'long') are not in the Tokens list.  However, since I defined my own Custom STOPLIST, Oracle's default BASIC_STOPLIST has been overriden.  So, now the Tokens list includes words like "a", "be", "by", "is" etc.

Contrast this with the Tokens list in the CONTEXT index that was created on MY_TEXT_TABLE using the default BASIC_STOPLIST :

SQL> select token_text, token_count
  2  from dr$my_text_index$i
  3  order by token_text
  4  /

TOKEN_TEXT                                                       TOKEN_COUNT
---------------------------------------------------------------- -----------
ANOTHER                                                                    1
CAPTURED                                                                   1
HEMANT                                                                     1
INDEX                                                                      1
LONG                                                                       2
PIECE                                                                      1
TEXT                                                                       2
WRITTEN                                                                    1

8 rows selected.

SQL> 


So, if you want to build a Custom STOPLIST, make sure you identify all the "common" words you want excluded.
.
.
.

1 comment:

Rob said...

I see the context for the indexes. Yes, It should always check the tokens in every indexes that had been created.. Its quite broad explanation. Thanks a lot.