Posted to notifications@asterixdb.apache.org by "Jianfeng Jia (JIRA)" <ji...@apache.org> on 2016/10/19 20:24:59 UTC

[jira] [Commented] (ASTERIXDB-1699) Inverted Index fail to match the keyword

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15589762#comment-15589762 ] 

Jianfeng Jia commented on ASTERIXDB-1699:
-----------------------------------------

It seems that the inverted index somehow stopped growing. Here are the files in the inverted index folder:
{code}
-rw-r--r-- 1 jianfeng 129K Oct 17 00:04 2016-10-17-00-03-11-062_2016-10-16-22-51-44-465_f
-rw-r--r-- 1 jianfeng  82M Oct 17 00:05 2016-10-17-00-03-11-062_2016-10-16-22-51-44-465_i
-rw-r--r-- 1 jianfeng  48M Oct 17 00:05 2016-10-17-00-03-11-062_2016-10-16-22-51-44-465_b
-rw-r--r-- 1 jianfeng 257K Oct 17 00:05 2016-10-17-00-03-11-062_2016-10-16-22-51-44-465_d
-rw-r--r-- 1 jianfeng 129K Oct 17 00:45 2016-10-17-00-45-00-338_2016-10-17-00-04-01-056_f
-rw-r--r-- 1 jianfeng  83M Oct 17 00:46 2016-10-17-00-45-00-338_2016-10-17-00-04-01-056_i
-rw-r--r-- 1 jianfeng  49M Oct 17 00:46 2016-10-17-00-45-00-338_2016-10-17-00-04-01-056_b
-rw-r--r-- 1 jianfeng 257K Oct 17 00:46 2016-10-17-00-45-00-338_2016-10-17-00-04-01-056_d
-rw-r--r-- 1 jianfeng 129K Oct 17 13:23 2016-10-17-13-23-38-232_2016-10-17-00-45-55-667_f
-rw-r--r-- 1 jianfeng  81M Oct 17 13:24 2016-10-17-13-23-38-232_2016-10-17-00-45-55-667_i
-rw-r--r-- 1 jianfeng  48M Oct 17 13:24 2016-10-17-13-23-38-232_2016-10-17-00-45-55-667_b
-rw-r--r-- 1 jianfeng 257K Oct 17 13:24 2016-10-17-13-23-38-232_2016-10-17-00-45-55-667_d
-rw-r--r-- 1 jianfeng 129K Oct 18 23:43 2016-10-18-23-43-45-771_2016-10-17-13-25-38-229_f
-rw-r--r-- 1 jianfeng 2.3M Oct 18 23:43 2016-10-18-23-43-45-771_2016-10-17-13-25-38-229_i
-rw-r--r-- 1 jianfeng 3.3M Oct 18 23:43 2016-10-18-23-43-45-771_2016-10-17-13-25-38-229_b
-rw-r--r-- 1 jianfeng 257K Oct 18 23:43 2016-10-18-23-43-45-771_2016-10-17-13-25-38-229_d
-rw-r--r-- 1 jianfeng 129K Oct 19 00:33 2016-10-19-00-33-24-890_2016-10-19-00-33-24-890_i
-rw-r--r-- 1 jianfeng 129K Oct 19 00:33 2016-10-19-00-33-24-890_2016-10-19-00-33-24-890_f
-rw-r--r-- 1 jianfeng 385K Oct 19 00:33 2016-10-19-00-33-24-890_2016-10-19-00-33-24-890_b
-rw-r--r-- 1 jianfeng 257K Oct 19 00:33 2016-10-19-00-33-24-890_2016-10-19-00-33-24-890_d
-rw-r--r-- 1 jianfeng 129K Oct 19 06:24 2016-10-19-06-24-55-819_2016-10-19-06-24-55-819_i
-rw-r--r-- 1 jianfeng 129K Oct 19 06:24 2016-10-19-06-24-55-819_2016-10-19-06-24-55-819_f
-rw-r--r-- 1 jianfeng 385K Oct 19 06:24 2016-10-19-06-24-55-819_2016-10-19-06-24-55-819_b
-rw-r--r-- 1 jianfeng 257K Oct 19 06:24 2016-10-19-06-24-55-819_2016-10-19-06-24-55-819_d
-rw-r--r-- 1 jianfeng 129K Oct 19 06:26 2016-10-19-06-26-55-822_2016-10-19-06-26-55-822_i
-rw-r--r-- 1 jianfeng 129K Oct 19 06:26 2016-10-19-06-26-55-822_2016-10-19-06-26-55-822_f
-rw-r--r-- 1 jianfeng 385K Oct 19 06:26 2016-10-19-06-26-55-822_2016-10-19-06-26-55-822_b
-rw-r--r-- 1 jianfeng 257K Oct 19 06:26 2016-10-19-06-26-55-822_2016-10-19-06-26-55-822_d
-rw-r--r-- 1 jianfeng 129K Oct 19 08:48 2016-10-19-08-48-43-924_2016-10-19-08-48-43-924_i
-rw-r--r-- 1 jianfeng 129K Oct 19 08:48 2016-10-19-08-48-43-924_2016-10-19-08-48-43-924_f
-rw-r--r-- 1 jianfeng 385K Oct 19 08:48 2016-10-19-08-48-43-924_2016-10-19-08-48-43-924_b
-rw-r--r-- 1 jianfeng 257K Oct 19 08:48 2016-10-19-08-48-43-924_2016-10-19-08-48-43-924_d
-rw-r--r-- 1 jianfeng 129K Oct 19 09:03 2016-10-19-09-03-08-416_2016-10-19-09-03-08-416_i
-rw-r--r-- 1 jianfeng 129K Oct 19 09:03 2016-10-19-09-03-08-416_2016-10-19-09-03-08-416_f
-rw-r--r-- 1 jianfeng 385K Oct 19 09:03 2016-10-19-09-03-08-416_2016-10-19-09-03-08-416_b
-rw-r--r-- 1 jianfeng 257K Oct 19 09:03 2016-10-19-09-03-08-416_2016-10-19-09-03-08-416_d
-rw-r--r-- 1 jianfeng 129K Oct 19 10:52 2016-10-19-10-52-55-837_2016-10-19-10-52-55-837_i
-rw-r--r-- 1 jianfeng 129K Oct 19 10:52 2016-10-19-10-52-55-837_2016-10-19-10-52-55-837_f
-rw-r--r-- 1 jianfeng 385K Oct 19 10:52 2016-10-19-10-52-55-837_2016-10-19-10-52-55-837_b
-rw-r--r-- 1 jianfeng 257K Oct 19 10:52 2016-10-19-10-52-55-837_2016-10-19-10-52-55-837_d
-rw-r--r-- 1 jianfeng 129K Oct 19 10:54 2016-10-19-10-54-55-839_2016-10-19-10-54-55-839_i
-rw-r--r-- 1 jianfeng 129K Oct 19 10:54 2016-10-19-10-54-55-839_2016-10-19-10-54-55-839_f
-rw-r--r-- 1 jianfeng 385K Oct 19 10:54 2016-10-19-10-54-55-839_2016-10-19-10-54-55-839_b
-rw-r--r-- 1 jianfeng 257K Oct 19 10:54 2016-10-19-10-54-55-839_2016-10-19-10-54-55-839_d
{code}

For example, the _i files written on 10-17 total roughly 3 * 83M, but the one written on 10-18 is only 2.3M, and every _i file written on 10-19 is just 129K.
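The comparison above can be reproduced from the listing with a short script. This is only a sketch: the helper names (parse_size, size_by_day) are made up for illustration, it assumes the eight-column ls layout shown above, it groups by the modification date printed by ls, and it includes only the _i lines up to the first 10-19 component.

```python
from collections import defaultdict

# Multipliers for the human-readable size suffixes used in the listing.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '82M' or '2.3M' to a number of bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def size_by_day(listing, suffix="_i"):
    """Sum the sizes of files whose name ends with `suffix`, per 'Mon DD'."""
    totals = defaultdict(float)
    for line in listing.strip().splitlines():
        fields = line.split()
        if len(fields) != 8:  # skip anything that is not a full listing line
            continue
        size, month, day, name = fields[3], fields[4], fields[5], fields[7]
        if name.endswith(suffix):
            totals[f"{month} {day}"] += parse_size(size)
    return dict(totals)


# Subset of the listing above: the _i files up to the first 10-19 component.
LISTING = """
-rw-r--r-- 1 jianfeng  82M Oct 17 00:05 2016-10-17-00-03-11-062_2016-10-16-22-51-44-465_i
-rw-r--r-- 1 jianfeng  83M Oct 17 00:46 2016-10-17-00-45-00-338_2016-10-17-00-04-01-056_i
-rw-r--r-- 1 jianfeng  81M Oct 17 13:24 2016-10-17-13-23-38-232_2016-10-17-00-45-55-667_i
-rw-r--r-- 1 jianfeng 2.3M Oct 18 23:43 2016-10-18-23-43-45-771_2016-10-17-13-25-38-229_i
-rw-r--r-- 1 jianfeng 129K Oct 19 00:33 2016-10-19-00-33-24-890_2016-10-19-00-33-24-890_i
"""

if __name__ == "__main__":
    for day, total in size_by_day(LISTING).items():
        print(f"{day}: {total / 1024 ** 2:.1f}M")
```

Passing a different suffix (for example "_b") to size_by_day gives the same per-day view for the other file types in the folder.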

> Inverted Index fail to match the keyword
> ----------------------------------------
>
>                 Key: ASTERIXDB-1699
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1699
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: Storage
>         Environment: master : 4819ea44723b87a68406d248782861cf6e5d3305
>            Reporter: Jianfeng Jia
>            Assignee: Ian Maxon
>
> It is not clear how to reproduce this on a smaller dataset. Here is the symptom:
> If I run the following query
> {code}
> for $t in dataset twitter.ds_tweet
> where $t.'create_at' >= datetime('2016-10-19T00:00:47.473Z') and $t.'create_at' < datetime('2016-10-19T00:01:47.473Z') 
> and  /* +skip-index */ similarity-jaccard(word-tokens($t.'text'), word-tokens('sleep')) > 0.0
> return $t.text
> {code}
> It will return some results
> {code}
> "No point in going to sleep now lol"
> "Can't sleep"
> "TL Sleep"
> "i can't sleep man"
> "Blazed and I still can't sleep fackkkk.."
> "When you're proud of yourself for going to bed in time to get 6 hours of sleep #CollegeLyfeAmIRightIAmIt'sSoCrazyLol"
> "I would be sleep rn but have to lurk bc I'm no sucka & bc the fan isn't working"
> "Since I can't sleep https://t.co/ALZE4psIqP"
> "Wish I Could Sleep"
> "Of course when I go to lay down finally, I am not tired. To sleep or not to sleep?? That's the real question."
> {code}
> If I run the same query without the skip-index hint, so that it uses the index
> {code}
> for $t in dataset twitter.ds_tweet
> where $t.'create_at' >= datetime('2016-10-19T00:00:47.473Z') and $t.'create_at' < datetime('2016-10-19T00:01:47.473Z') 
> and  similarity-jaccard(word-tokens($t.'text'), word-tokens('sleep')) > 0.0
> return $t.text
> {code}
> It returns an empty result.
> The debug port is on 8001 on each cloudberry nuc nc. 
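
For reference, the predicate in both queries keeps a tweet when the Jaccard similarity between its word tokens and the tokens of 'sleep' is greater than 0.0, that is, when the text shares at least one token with the keyword. A minimal sketch of that check follows. The tokenizer here is an assumption, not AsterixDB's word-tokens implementation: it lowercases and splits on anything other than letters, digits, and apostrophes, which is consistent with the sample results matching "Sleep" as well as "sleep".

```python
import re


def word_tokens(text):
    """Hypothetical stand-in for word-tokens: lowercase, split on non-word characters."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    keyword = word_tokens("sleep")
    for text in ["Can't sleep", "Wish I Could Sleep", "Good morning"]:
        score = jaccard(word_tokens(text), keyword)
        # A score above 0.0 means the tweet contains the keyword token.
        print(f"{score:.2f}  {text}")
```

Under this model both queries should select the same tweets, so the empty result from the indexed query points at the index contents rather than at the predicate.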


