Posted to commits@cassandra.apache.org by "Mck SembWever (JIRA)" <ji...@apache.org> on 2011/09/05 09:39:10 UTC

[jira] [Created] (CASSANDRA-3136) Allow CFIF to keep going despite unavailable ranges

Allow CFIF to keep going despite unavailable ranges
---------------------------------------------------

                 Key: CASSANDRA-3136
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3136
             Project: Cassandra
          Issue Type: Improvement
          Components: Hadoop
            Reporter: Mck SembWever
            Priority: Minor


From http://thread.gmane.org/gmane.comp.db.cassandra.user/18902

<use-case-1>
We use Cassandra as storage for web pages: we store the HTML, all
URLs that have the same HTML data, and some computed data. We run Hadoop
MR jobs to compute lexical and thematic data for each page and to
export the data to binary files for later use. A URL gets into
Cassandra on a user request (a pageview), so if we delete a URL, it comes
back quickly if the page is active. Because of that, and because there
is a lot of data, we have the keyspace set to RF=1. We can drop the
whole keyspace and it will regenerate quickly and contain only
fresh data, so we don't care about losing a node.
</use-case-1>

<use-case-2>
Trying to extract a small random sample (like a Pig SAMPLE) of data out of Cassandra.
</use-case-2>

<use-case-3>
Searching for something or some pattern where one hit
is enough. If you get the hit, it's a positive result regardless of whether
ranges were ignored; if you don't, and you *know* a range was
ignored along the way, you can re-run the job later.
For example, such a job could be run at regular intervals during the day until a hit is found.
</use-case-3>
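
The decision logic in use-case-3 can be sketched as follows. This is only an illustration, assuming a job could report how many ranges it skipped, which is the capability this ticket asks for and which CFIF does not provide. Every name here (ScanResult, interpret, search_until_conclusive, run_job) is hypothetical and not part of Cassandra or Hadoop.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScanResult:
    """Hypothetical summary of one job run."""
    hit: Optional[str]    # key of the first matching row, or None if no match
    skipped_ranges: int   # ranges ignored because no replica was reachable


def interpret(result: ScanResult) -> str:
    """Classify a run as 'found', 'retry', or 'absent'."""
    if result.hit is not None:
        # One hit is enough, even if some ranges were skipped.
        return "found"
    if result.skipped_ranges > 0:
        # No hit, but part of the data was never scanned: inconclusive.
        return "retry"
    # No hit and every range was scanned.
    return "absent"


def search_until_conclusive(run_job: Callable[[], ScanResult],
                            max_attempts: int) -> str:
    """Re-run the job until the outcome is conclusive or attempts run out."""
    for _ in range(max_attempts):
        outcome = interpret(run_job())
        if outcome != "retry":
            return outcome
    return "retry"
```

In practice the re-runs would be spaced out by a scheduler (for example at regular intervals during the day) rather than looped back to back.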

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (CASSANDRA-3136) Allow CFIF to keep going despite unavailable ranges

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-3136.
---------------------------------------

    Resolution: Won't Fix

As explained when this was injected into another ticket, supporting this very niche scenario is not worth adding complexity to our Hadoop interface.  The "right" way to support fault-tolerant queries is to increase RF.



        

[jira] [Commented] (CASSANDRA-3136) Allow CFIF to keep going despite unavailable ranges

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097280#comment-13097280 ] 

Mck SembWever commented on CASSANDRA-3136:
------------------------------------------

OK... it was mentioned in CASSANDRA-2388 (by Patrik Modesto), but no one there paid it any attention as it didn't belong to that issue.



        

[jira] [Commented] (CASSANDRA-3136) Allow CFIF to keep going despite unavailable ranges

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097276#comment-13097276 ] 

Mck SembWever commented on CASSANDRA-3136:
------------------------------------------

> As explained when this was injected into another ticket...
What was this other ticket?



        

[jira] [Updated] (CASSANDRA-3136) Allow CFIF to keep going despite unavailable ranges

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated CASSANDRA-3136:
-------------------------------------

    Description: 
From http://thread.gmane.org/gmane.comp.db.cassandra.user/18902

<use-case-1 from="Patrik Modesto">
We use Cassandra as storage for web pages: we store the HTML, all
URLs that have the same HTML data, and some computed data. We run Hadoop
MR jobs to compute lexical and thematic data for each page and to
export the data to binary files for later use. A URL gets into
Cassandra on a user request (a pageview), so if we delete a URL, it comes
back quickly if the page is active. Because of that, and because there
is a lot of data, we have the keyspace set to RF=1. We can drop the
whole keyspace and it will regenerate quickly and contain only
fresh data, so we don't care about losing a node.
</use-case-1>

<use-case-2>
Trying to extract a small random sample (like a Pig SAMPLE) of data out of Cassandra.
</use-case-2>

<use-case-3>
Searching for something or some pattern where one hit
is enough. If you get the hit, it's a positive result regardless of whether
ranges were ignored; if you don't, and you *know* a range was
ignored along the way, you can re-run the job later.
For example, such a job could be run at regular intervals during the day until a hit is found.
</use-case-3>

  was:
From http://thread.gmane.org/gmane.comp.db.cassandra.user/18902

<use-case-1>
We use Cassandra as storage for web pages: we store the HTML, all
URLs that have the same HTML data, and some computed data. We run Hadoop
MR jobs to compute lexical and thematic data for each page and to
export the data to binary files for later use. A URL gets into
Cassandra on a user request (a pageview), so if we delete a URL, it comes
back quickly if the page is active. Because of that, and because there
is a lot of data, we have the keyspace set to RF=1. We can drop the
whole keyspace and it will regenerate quickly and contain only
fresh data, so we don't care about losing a node.
</use-case-1>

<use-case-2>
Trying to extract a small random sample (like a Pig SAMPLE) of data out of Cassandra.
</use-case-2>

<use-case-3>
Searching for something or some pattern where one hit
is enough. If you get the hit, it's a positive result regardless of whether
ranges were ignored; if you don't, and you *know* a range was
ignored along the way, you can re-run the job later.
For example, such a job could be run at regular intervals during the day until a hit is found.
</use-case-3>


