What is the most efficient way to store a numeric range?


















This question is about how many bits are required to store a range. Or put another way, for a given number of bits, what is the maximum range that can be stored and how?



Imagine we want to store a sub-range within the range 0-255.



So for example, 45-74.



We can store the example above as two unsigned bytes, but it strikes me that there must be some redundancy of information there. We know that the second value is larger than the first, so in the case that the first value is large, fewer bits are required for the second value, and in the case that the second value is large, fewer bits are required for the first.



I suspect that any compression technique would yield a marginal result, so it may be a better question to ask "what is the maximum range that could be stored in one byte?". This should be larger than what is achievable by storing the two numbers separately.



Are there any standard algorithms for doing this kind of thing?










Tags: data-structures, numbers, compression

asked 18 hours ago – rghome (13 votes)























  • do you also have to store the start of the range?

    – Ewan, 18 hours ago

  • @Ewan I don't really follow. In the example above, 45 is the start (the minimum) and 74 is the end (the maximum) and both have to be stored.

    – rghome, 17 hours ago

  • So is the question how much space a type which can store any range requires, or how much space a type which can store 45-74 requires?

    – Ewan, 17 hours ago

  • While thinking about this is certainly good, I sure hope you don't do this in real applications. The reason is that the complexity of real applications is so huge that we have to accept less than 100% optimized code. That is why compilers exist.

    – NoChance, 16 hours ago

  • @rghome, I agree, even the simplest requirement produces hundreds of lines of code. Each is error prone. Personally, I'd rather pay for hardware than increase the complexity of software.

    – NoChance, 13 hours ago

















4 Answers
































Just count the number of possible ranges. There are 256 ranges with lower bound 0 (0-0, 0-1, ... 0-254, 0-255), 255 ranges with lower bound 1, ... and finally 1 range with lower bound 255 (255-255). So the total number is (256 + 255 + ... + 1) = 257 * 128 = 32,896. As this is slightly higher than 2^15 = 32,768, you'll still need at least 16 bits (2 bytes) to store this information.



In general, for numbers from 0 up to n-1, the number of possible ranges is n*(n+1)/2. This is less than 256 if n is 22 or less: n = 22 gives 22*23/2 = 253 possibilities. So one byte suffices for sub-ranges of 0-21.



Another way to look at the problem is the following: storing a pair of integers in the range 0 to n-1 is almost the same as storing a subrange of 0-(n-1) plus a single bit which determines if the first number is lower or higher than the second one. (The difference comes from the case when both integers are equal, but this chance becomes increasingly smaller as n grows larger.) That's why you can only save about a single bit with this technique, and probably the main reason why it is rarely used.
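To make the counting argument concrete, here is a minimal Python sketch of this lower-bound-first numbering (my illustration, not part of the answer; encode_range and decode_range are made-up names):

    import math

    def encode_range(lo: int, hi: int, n: int = 256) -> int:
        """Map a sub-range lo-hi (0 <= lo <= hi < n) to a single index.

        Ranges are numbered lower bound first: all ranges starting at 0,
        then all ranges starting at 1, and so on, as in the count above.
        """
        assert 0 <= lo <= hi < n
        # Ranges whose lower bound is smaller than lo:
        # n + (n-1) + ... + (n-lo+1) = lo*n - lo*(lo-1)/2
        offset = lo * n - lo * (lo - 1) // 2
        return offset + (hi - lo)

    def decode_range(index: int, n: int = 256) -> tuple[int, int]:
        """Inverse of encode_range."""
        lo = 0
        while index >= n - lo:   # skip the whole block of each lower bound
            index -= n - lo
            lo += 1
        return lo, lo + index

    # 256 * 257 / 2 = 32,896 ranges need ceil(log2(32,896)) = 16 bits.
    print(math.ceil(math.log2(256 * 257 // 2)))
    # Round-trip check over every range:
    assert all(decode_range(encode_range(lo, hi)) == (lo, hi)
               for lo in range(256) for hi in range(lo, 256))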






– Glorfindel (32 votes; answered 18 hours ago, edited 16 hours ago)


























  • Thanks. The number of bits required for n ranges is log(n)/log(2). Feeding it all into Wolfram Alpha gave me the following Excel-compatible formula for calculating the maximum value for the subrange for a given number of bits: =INT((SQRT(POWER(2, N + 3) + 1) - 1) / 2)

    – rghome, 17 hours ago

  • The TL;DR is that you gain about half a bit, so in general it isn't really worth compressing.

    – rghome, 17 hours ago

  • Yeah, it tends to one bit for large N, but it isn't really worth the hassle.

    – Glorfindel, 16 hours ago

  • FYI, the N + 3 in the equation looks odd, but one power of 2 comes from your equation and the other two come from the 4ac part of the quadratic formula.

    – rghome, 13 hours ago

  • @rghome: The cases where such techniques are most likely to be worthwhile would be those where the number of values in the domain is something in the range 16 to 22, or 256 to 361, or 65536 to 92680. The costs of expanding something from 8 bits to 9, or 16 to 17, or 32 to 33, are often quite large, so saving "one bit" can be very significant in those cases.

    – supercat, 10 hours ago
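A quick Python translation of the Excel formula above (my own sketch; max_domain_size is a made-up name), confirming the 22-value and 361-value boundaries mentioned in these comments:

    from math import isqrt

    def max_domain_size(bits: int) -> int:
        # Largest n with n*(n+1)/2 <= 2**bits, via the closed form above:
        # INT((SQRT(POWER(2, N + 3) + 1) - 1) / 2)
        return (isqrt(2 ** (bits + 3) + 1) - 1) // 2

    print(max_domain_size(8))   # 22  -> one byte holds any sub-range of 0-21
    print(max_domain_size(16))  # 361 -> two bytes cover a 361-value domain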

































For such a small number of bits, it is infeasible to save many bits, as Glorfindel has pointed out.
However, if the domain you are using has a few more bits, you can achieve significant savings for the average case by encoding ranges with the start value and a delta.



Let's assume the domain is the integers, so 32 bits. With the naive approach, you need 64 bits (start, end) to store a range.



If we switch to an encoding of (start, delta), we can construct the end of the range from that. We know that in the worst case, the start is 0 and the delta has 32 bits.



2^5 is 32, so we encode the length of the delta in five bits (no zero length, always add 1), and the encoding becomes (start, length, delta). In the worst case, this costs 32*2 + 5 bits, so 69 bits. So in the worst case, if all ranges are long, this is worse than the naive encoding.



In the best case, it costs 32 + 5 + 1 = 38 bits.



This means if you have to encode a lot of ranges, and those ranges each only cover a small part of your domain, you end up using less space on average with this encoding. It doesn't matter how the starts are distributed, since the start will always take 32 bits, but it does matter how the lengths of the ranges are distributed: the more small lengths you have, the better the compression; the more ranges that cover the whole length of the domain, the worse this encoding gets.



However, if you have lots of ranges grouped around similar start points (for example because you get values from a sensor), you can achieve even bigger savings. You can apply the same technique to the start value and use a bias to offset the start value.



Let's say you have 10 000 ranges. The ranges are grouped around a certain value. You encode the bias with 32 bits.



Using the naive approach, you would need 32*2 * 10 000 = 640 000 bits to store all those ranges.



Encoding the bias takes 32 bits, and encoding each range then takes, in the best case, 5 + 1 + 5 + 1 = 12 bits, for a total of 120 000 + 32 = 120 032 bits.
In the worst case, you need 5 + 32 + 5 + 32 bits, thus 74 bits, for a total of 740 032 bits.



This means, for 10 000 values on a domain that takes 32 bits to encode, we get



  • 120 032 bits with the smart delta encoding in the best case

  • 640 000 bits with the naive (start, end) encoding, always (no best or worst case)

  • 740 032 bits with the smart delta encoding in the worst case


If you take the naive encoding as the baseline, that means either savings of up to 81.25% or up to 15.625% more cost.



Depending on how your values are distributed, those savings are significant. Know your business domain! Know what you want to encode.



As an extension, you can also change the bias. If you analyse the data and identify groups of values, you can sort the data into buckets and encode each of those buckets separately, with its own bias. This means you can apply this technique not only to ranges that are grouped around a single start value, but also to ranges that are grouped around multiple values.



If your start points are uniformly distributed, this encoding doesn't really work that well.



This encoding is obviously extremely bad to index: you cannot simply read the x-th value; it can pretty much only be read sequentially. That is acceptable in some situations, e.g. streaming over the network or bulk storage (e.g. on tape or HDD).



Evaluating the data, grouping it and choosing the correct bias can be substantial work and might require some fine-tuning for optimal results.
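Here is a minimal Python sketch of the (start, length, delta) layout described in this answer (my own illustration; a real implementation would pack the bits into bytes rather than a character string):

    def encode_range(start: int, end: int) -> str:
        """Bit-string encoding: 32-bit start, 5-bit length field, then the
        delta in exactly that many bits. The length field stores
        (bit width of delta) - 1, covering widths 1..32 in five bits
        ("no zero length, always add 1")."""
        assert 0 <= start <= end < 2 ** 32
        delta = end - start
        width = max(delta.bit_length(), 1)  # a delta of 0 still uses 1 bit
        return (format(start, "032b")
                + format(width - 1, "05b")
                + format(delta, f"0{width}b"))

    def decode_range(bits: str) -> tuple[int, int]:
        start = int(bits[:32], 2)
        width = int(bits[32:37], 2) + 1
        delta = int(bits[37:37 + width], 2)
        return start, start + delta

    # The best case costs 38 bits and the worst case 69, as computed above.
    assert len(encode_range(1000, 1000)) == 38
    assert len(encode_range(0, 2 ** 32 - 1)) == 69
    assert decode_range(encode_range(45, 74)) == (45, 74)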






– Polygnome (7 votes; answered 12 hours ago)













































To expand on the answer from @Glorfindel:

As n → ∞, (n + 1)/n → 1. Thus the number of ranges Ω = n(n + 1)/2 → n²/2, and log₂(Ω) → 2·log₂(n) − 1. Since the naive encoding takes 2·log₂(n) bits, the asymptotic maximal compression only saves 1 bit.
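A quick numeric check of that limit (my sketch, not part of the answer):

    import math

    for bits in (8, 16, 32):
        n = 2 ** bits                  # domain size
        ranges = n * (n + 1) // 2      # number of distinct ranges
        naive = 2 * bits               # bits for a naive (start, end) pair
        print(bits, naive - math.log2(ranges))  # gap approaches exactly 1 bit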






– Jared Goguen (0 votes; answered 9 hours ago)













































This kind of problem is the subject of Claude Shannon's seminal paper, A Mathematical Theory of Communication, which introduced the word "bit" and more or less invented data compression.



The general idea is that the number of bits used to encode a range is inversely proportional to the probability of that range occurring. For example, suppose the range 150-160 appears about 1/4 of the time. You may say that the sequence 00 corresponds to 150-160. To encode the range 150-160, you output "00" and stop there.



Let's also suppose that the ranges 140-155 and 145-160 each appear about 1/8 of the time. You can encode each of them with a 3-bit sequence. Any 3 bits will do as long as they don't start with "00", which has already been reserved for the range 150-160.



00: 150-160
010: 140-155
101: 145-160


You can continue in this manner until every possible range has an encoding. The least probable range may need over 100 bits. But that's okay because it rarely appears.



There are algorithms for finding the optimal coding. I won't try to explain them here, but you can find more by visiting the link above or searching for "Information Theory", "Shannon–Fano coding", or "Huffman coding".



As others have pointed out, it's probably better to store the start number and the difference between the start and end number. You should use one coding for the start and another for the difference, as they have different probability distributions (and I'm guessing the latter is more redundant). As Polygnome suggested, the best algorithm depends on your domain.
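For instance, a tiny Huffman coder in Python over the weights from this answer (the code and the "other" bucket are my illustration, not part of the answer):

    import heapq
    from itertools import count

    def huffman(freqs: dict[str, float]) -> dict[str, str]:
        """Build a prefix-free code; frequent symbols get short codewords."""
        tie = count()  # tiebreaker so the heap never has to compare lists
        heap = [(f, next(tie), [sym]) for sym, f in freqs.items()]
        heapq.heapify(heap)
        codes = {sym: "" for sym in freqs}
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)   # two least probable subtrees
            f2, _, right = heapq.heappop(heap)
            for sym in left:                    # left subtree: leading 0
                codes[sym] = "0" + codes[sym]
            for sym in right:                   # right subtree: leading 1
                codes[sym] = "1" + codes[sym]
            heapq.heappush(heap, (f1 + f2, next(tie), left + right))
        return codes

    # Weights from the answer; "other" lumps together all remaining ranges.
    codes = huffman({"150-160": 0.25, "140-155": 0.125, "145-160": 0.125,
                     "other": 0.5})
    print(codes)  # 150-160 gets a 2-bit code, the 1/8 ranges 3-bit codes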






– Patrick McElhaney (0 votes; answered 10 mins ago)























