Unsigned short changes value when assigning to struct field

Question

I have a function to parse 20 bytes long IP header buffer:

void parseIp(struct ipHeader *ip, const void *buffer)
{
    uint8_t* b = buffer;
   // memcpy(b,buffer,20);
    ip->version = (b[0] & 0xf0) >> 4;
    ip->ihl = (b[0] & 0x0f);
    ip->dscp = (b[1] & 0xfC)>>2;
    ip->ecn = (b[1] & 0x3);

    unsigned short l = (b[2] << 8) | b[3];

    printf("%d\n",l);
    ip->length = l;
    ip->identification = (b[4] << 0xFF) | b[5];
}

struct ipHeader:

struct ipHeader {
    int version;
    int ihl;
    int dscp;
    int ecn;
    unsigned short length;
    unsigned short identification;
    int flags;
    int fragment_offset;
    int time_to_live;
    int protocol;
    unsigned short header_checksum;
    unsigned char source_ip[4];
    unsigned char destination_ip[4];
};

Now the code prints l as 467, which is correct, but as this l is assigned to the struct field length it changes to 54017. I don't understand at all what is going on. I added the variable l to make sure no overflowing or type conversion errors could happen, but still it changes.

This is part of school work, so I can't change the struct.

EDIT full code:

#include <stdio.h>
#include <arpa/inet.h>
#include "ipheader.h"


/* Parses the given buffer into an IP header structure.
 * 
 * Parameters:
 * ip: pointer to the IP header structure that will be filled based
 *      on the data in the buffer
 * buffer: buffer of 20 bytes that contain the IP header. */
    void parseIp(struct ipHeader *ip, const void *buffer)
    {
        uint8_t* b = buffer;
       // memcpy(b,buffer,20);
        ip->version = (b[0] & 0xf0) >> 4;
        ip->ihl = (b[0] & 0x0f);
        ip->dscp = (b[1] & 0xfC)>>2;
        ip->ecn = (b[1] & 0x3);

        unsigned short l = (b[2] << 8) | b[3];

        printf("%d\n",l);
        ip->length = l;
        ip->identification = (b[4] << 8) | b[5];
    }


/* Builds a 20-byte byte stream based on the given IP header structure
 * 
 * Parameters:
 * buffer: pointer to the 20-byte buffer to which the header is constructed
 * ip: IP header structure that will be packed to the buffer */
void sendIp(void *buffer, const struct ipHeader *ip)
{
}


/* Prints the given IP header structure */
void printIp(const struct ipHeader *ip)
{
    /* Note: ntohs below is for converting numbers from network byte order
     to host byte order. You can ignore them for now
     To be discussed further in Network Programming course... */
    printf("version: %d   ihl: %d   dscp: %d   ecn: %d\n",
            ip->version, ip->ihl, ip->dscp, ip->ecn);
    printf("length: %d   id: %d   flags: %d   offset: %d\n",
            ntohs(ip->length), ntohs(ip->identification), ip->flags, ip->fragment_offset);
    printf("time to live: %d   protocol: %d   checksum: 0x%04x\n",
            ip->time_to_live, ip->protocol, ntohs(ip->header_checksum));
    printf("source ip: %d.%d.%d.%d\n", ip->source_ip[0], ip->source_ip[1],
            ip->source_ip[2], ip->source_ip[3]);
    printf("destination ip: %d.%d.%d.%d\n", ip->destination_ip[0],
            ip->destination_ip[1],
            ip->destination_ip[2], ip->destination_ip[3]);    
}

/* Shows hexdump of given data buffer */
void hexdump(const void *buffer, unsigned int length)
{
    const unsigned char *cbuf = buffer;
    unsigned int i;
    for (i = 0; i < length; ) {
        printf("%02x ", cbuf[i]);
        i++;
        if (!(i % 8))
            printf("\n");
    }
}

struct ipHeader {
    int version;
    int ihl;
    int dscp;
    int ecn;
    unsigned short length;
    unsigned short identification;
    int flags;
    int fragment_offset;
    int time_to_live;
    int protocol;
    unsigned short header_checksum;
    unsigned char source_ip[4];
    unsigned char destination_ip[4];
};

void parseIp(struct ipHeader *ip, const void *buffer);
void sendIp(void *buffer, const struct ipHeader *ip);

void printIp(const struct ipHeader *ip);
void hexdump(const void *buffer, unsigned int length);


#include <arpa/inet.h>
#include "ipheader.h"

int main()
{
    /* Feel free to modify this function to test different things */

    unsigned char bytes[] = {
        0x45, 0x00, 0x01, 0xd3, 0xda, 0x8d, 0x40, 0x00,
        0x40, 0x06, 0x8c, 0xd5, 0xc0, 0xa8, 0x01, 0x46,
        0x6c, 0xa0, 0xa3, 0x33 };

    struct ipHeader ip;

    parseIp(&ip, bytes);
    printIp(&ip);

    struct ipHeader ipfields = {
        4, // version
        28, // ihl
        4, // dscp
        0, // ecn
        htons(1500), // length
        htons(1234), // id
        1, // flags
        1024, // offset
        15, // time_to_live
        33, // protocol
        htons(0x1234), // checksum (invalid)
        {1, 2, 3, 4}, // source IP
        {5, 6, 7, 8} // destination IP
    };
    unsigned char sendbuf[20];

    sendIp(sendbuf, &ipfields);
    hexdump(sendbuf, sizeof(sendbuf));
}

`(b[4] << 0xFF) | b[5];` -->> `0xff` is a large shift. (you probably want to shift by eight) — joop, Mar 29 '16 at 12:27
That may well be the answer. The overflow of identification into the neighboring field changes the value of l/length. Verify this by printing length before an after assigning identification and then fix the identification overflow and see the problem go away. — LawfulEvil, Mar 29 '16 at 12:31
@LawfulEvil no, there is no overflow here, large shifts just "fall off" the left side. The lvalue `ip->identification =` is only 16 bits wide, so it will only contain `b[5]`, the rest having been shifted into outer space. — joop, Mar 29 '16 at 12:44
@joop Actually there is undefined behavior because of that: it left shifts data into (and past) the sign bits of a signed integer. Since `b[4]` is implicitly promoted to `int`, which is signed. — Lundin, Mar 29 '16 at 12:46
Anyway I think we can close this as localized/simple typo. I think the OP meant to write `b[4] & 0xFF` or something like that. — Lundin, Mar 29 '16 at 12:54
IMO the value being shifted is still unsigned. type coercion/ sign extension takes place just before the assignment of the RHS expression to the lvalue. — joop, Mar 29 '16 at 12:54
Sure, its almost certainly a typo. The question is asking, what/how is length getting changed from 467 into 54017. That hasn't been addressed. — LawfulEvil, Mar 29 '16 at 12:57
@joop No, that is not correct. `b` is an array of `uint8_t`, a small integer type. So the item `b[4]` gets implicitly promoted to `int` according to the integer promotion rule. The shift operation is executed on an `int`. The result type is that of the promoted left operand, `int`. Which then in turn will be converted during assignment to something else, in this case `unsigned short`. — Lundin, Mar 29 '16 at 12:58
I'm voting to close this question as the problem cannot be reproduced. — 2501, Mar 29 '16 at 13:45
Post the code used to determine the value of `ip->length` (which is not shown) that shows "it changes to 54017". — chux - Reinstate Monica, Mar 29 '16 at 13:58
`ip->identification = (b[4] << 0xFF) | b[5];` --> (shift too great) --> undefined behavior. Al least fix that. — chux - Reinstate Monica, Mar 29 '16 at 14:04
Thanks for everyone for quick help. I added the full code to the post. — Ou Tsei, Mar 29 '16 at 14:23
@Lundin: though you are technically correct,shifting with more than the wordsize always results in UB, it does not depend on the signedness.The uchar8_t type will be promoted to a (larger) signed integer type since that can be done without loss of value (bits), _before_ the shift. After the shift, the result would be equal in either case, given that the intermediate representation has enough bits to represent the *target* type (lvalue) , assuming sizeof (intermediate_representation) > sizeof (unsigned short) here. ergo:the left bits in the intermediate result will fall off anyway on assignment — joop, Mar 29 '16 at 15:15
@joop Yes, it does, because signedness reduces the size of the integer type with one bit. If you have `my_uint8 << 8` on a system where `int` is 16 bits large (15 bits wide + 1 bit sign) you will invoke UB in case `my_uint8` contains a value in the MSB. 6.5.7: "The result of E1 << E2 is E1 left-shifted E2 bit positions" /--/ "If E1 has a signed type and nonnegative value, and E1 × 2E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined." In my example assume `my_uint8` contains value 128. 128*2^8 = 32768. Will not fit inside 16 bit int. — Lundin, Mar 30 '16 at 06:23

M.M · Accepted Answer · 2016-03-29T22:52:35.813

For the given input:

unsigned char bytes[] = { 0x45, 0x00, 0x01, 0xd3, 0xda,

then the code:

unsigned short l = (b[2] << 8) | b[3];

produces l with a value of 467.

You say in your question, "as this l is assigned to the struct field length it changes to 54017. ". However this is not true. If you add a line immediately after your existing ip->length = l;:

printf("%d\n", ip->length);

you will still see 467.

I guess the problem you refer to is that your printIp function prints 54017.This is because that function does not print ip->length. It prints ntohs(ip->length). The ntohs macro changes the value from 567 to 54017.

To fix this, change the printIp function to print ip->length, and not ntohs(ip->length) .

Remove the other ntohs calls from that function too, and remove htons from your definition of ipfields. The integers are supposed to be stored in host order (i.e. native order) inside the struct ipHeader, and stored in network order (i.e. big-endian) when in the unsigned char buffer.

Portability Note 1: Technically you should use %hu as format specifier in both of those printf statements, because the argument type was unsigned short.

Portability Note 2: l == 467 regardless of int size, contrary to what has been suggested in some of the comments/answers so far. But to support values of b[2] greater than 0x7F when running on a system with 16-bit int, you should write ((unsigned)b[2] << 8) | b[3].

Portability Note 3: It would be a good idea to use uint16_t instead of unsigned short because there are now systems with 32-bit unsigned short. If you do this, the printf format specifier is "%"PRI16u for which you may need #include <inttypes.h>

score -1 · Answer 2 · answered Mar 29 '16 at 13:00

-1

Not sure whether your are aware of Endianness (Big endian/Little endian) Refer: https://en.wikipedia.org/wiki/Endianness

Basically a Little-endian format reverses the order and stores the least significant byte at the lower memory address with the most significant byte being stored at the highest memory address.

So, when you assign I (467 = 0x1d3), its stored in little endian format depending upon your machine endianness which is (0xd301 = 54017).

So use htons if you want the correct value to be assigned.

answered Mar 29 '16 at 13:00

Ravi

81
9

Bit shifts are endianess-independent, so I don't quite see how this answers the question. – Lundin Mar 29 '16 at 13:02
There is no bit shift happening in statement ip->length = l, its a pure assignment statement where length will be stored in reverse order if its a little endian machine, it depends how length variable is used further in the program. – Ravi Mar 29 '16 at 13:22
1

`l = (b[2] << 8) | b[3];` is endianess-independent code so I don't see how that matters? Why would you care how a specific variable is stored according to endianess, if you write no code relying on endianess? – Lundin Mar 29 '16 at 13:24
1

@Ou Tsei - can you share the code where you are using length variable and getting 54017 – Ravi Mar 29 '16 at 13:28

yano · Answer 3 · 2016-03-29T14:34:43.060

-2

As the others have mentioned, it looks like you're possibly having some shifting problems since you're shifting beyond the width of your data type, which results in ..? (I don't know, seems like there is an argument about whether this is defined behavior or not). On my machine your code results in 467 which is correct, luckily. Casting the dereferences explicitly defines what you want, however.

 unsigned short l = (((unsigned short)b[2]) << 8) | ((unsigned short)b[3]);

In addition, you need to worry about endianess (and you are), since network header code should always be big endian,,, I personally wouldn't worry about bit shifting if I didn't have to. For the parts of the header that are multibyte and fall on byte boundaries, I'd do something like this:

 ip->length = ntohs(((unsigned short*)b)[1]);
 ip->identification = ntohs(((unsigned short*)b)[2]);
 ip->header_checksum = ntohs(((unsigned short*)b[5]);
 /*
 unsigned int sourceIpAddr = ntohl(((unsigned int*)b)[3]);
 unsigned int destIpAddr = ntohl(((unsigned int*)b[4]);
 Not sure what endianess you want for the source and destination IPs since those are just byte arrays
 */

Note, the indexes change when you cast b to a different pointer type.

If I had complete control over the struct, I'd create the whole thing using bit fields, then you wouldn't have to worry about shifting at all, but you said the struct was defined for you.

edited Mar 29 '16 at 14:34

answered Mar 29 '16 at 14:15

yano

4,827
2
23
35

1

`(unsigned short*)b)[1]` etc. violates the strict aliasing rule – M.M Mar 29 '16 at 22:37
@M.M I have never heard of this in my life. I found what looked like a good example online but did not see compiler warnings or undefined behavior. Furthermore, it looks like it's quite the debate among other SO questions. But essentially, casting to different types is illegal/undefined? I should never cast a `char` pointer to another type and deref it? I should never cast a pointer to a pointer of different width and deref it? Unions still don't resolve this? IMO if it's illegal the compiler should throw errors. Thanks for bringing it to my attention, I'll def have to watch this. – yano Mar 30 '16 at 05:49
@yano The cast invokes undefined behavior and compilers don't necessarily warn for UB. In some cases they even implement well-defined but non-standard, non-portable behavior for things that are UB according to the standard. – Lundin Mar 30 '16 at 06:13
As for why it is UB, see http://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule. The problem with strict aliasing violations isn't de-referencing as a different type, but rather that the compiler is free to assume that a chunk of data will only get accessed through certain pointer types, so it might optimize away parts of your code. It is a common non-standard extension to implement well-defined behavior for such casts though (particularly among embedded systems compilers). Using a union with both char and short members would indeed have solved the aliasing problem. – Lundin Mar 30 '16 at 06:16
@Lundin Thanks for the further explanation. I did come across that question (and others) last night. This one (http://stackoverflow.com/questions/2906365/gcc-strict-aliasing-and-casting-through-a-union) was the source of confusion about the unions, but I just read the `-fstrict-aliasing` section in the `gcc` manual and I think I'm cleared up now; got to make sure to access the mem through ONLY the union type. Still,, I lost sleep last night, heh. No one likes to be told what they've believed in all their lives in a lie :(. I had no idea casting pointers is essentially wrong. – yano Mar 30 '16 at 14:25
@yano Yeah but read that post I linked. You can always cast from pointer type to pointer-to-character. Just not the other way around. And you can cast to/from void pointer, provided that you don't cast the void pointer to some other type later. – Lundin Mar 30 '16 at 14:29
@Lundin Yeah I did, thanks, I think I've got it squared away now (at least w the aliasing rules, can't say I fully grasp the assembly optimization yet). I just meant in general. I've been doing stuff like the code in my answer here for years, never had any problems, I don't _think_. If strict aliasing was covered in school I was sleeping that day. I'm working on some code now that mysteriously slows down with optimizations turned on and I'm casting `unsigned long long*` to `unsigned int*` all over the place in it ,, maybe that attributes to the slowdown. I'll have to refactor it to find out. – yano Mar 30 '16 at 15:23

Unsigned short changes value when assigning to struct field

3 Answers3