Jeffrey Sabarese

Email conversation

From: Jeffrey Sabarese
To: Me
Subject: Char. Encoding - Entities - IDE Behaviours
Date: 19 March 2007 17:15

hi there. thank you for your time!

i very much like your composition at
http://www.howtocreate.co.uk/sidehtmlentity.html

i find it very useful. i have never reproduced it, but i have saved it
locally for the purpose of quick-access on localhost, and in doing so, i've
discovered some things about Character Encodings, and the behaviour of IDE's
and Browsers vs. the declared Character Encoding.

First of all, i do not claim to be an expert on this subject, just to be
clear. :)
I realize that your entities page (in my awareness) does not have a declared
Character Set, so any time i have saved it, i have used the UTF-8 character
encoding, which does work fine, and i've also modified the doctype to XHTML
Transitional, UTF-8, which also works-- HOWEVER, if there is anything "out
of line", the character glyphs do INDEED become "skewed". I found this a
curious fact, so i tried to figure out what i had done "wrong" to make the
page NOT allow the browser to properly decode the entities as intended.

** MY INQUIRY **
To be honest, i am still unclear on the issue as a whole, however I have at
least determined a certain method for getting it "right"-- but it is
dependent upon the software i'm using ([brand 1]). For
example, using [brand 2] (and other softwares such as [brand 3]),
i've saved the page, declared the Character Set as UTF-8, and opened the
page to see that the decoded glyphs are not right. Yet, i've done precisely
the same w/ [brand 1] and it "works right".

PLEASE DON'T MISTAKE THIS FOR AN ENDORSEMENT! but a curiosity:
I wonder: what might it be about the different software; what is it that the
software does to the file "behind the scenes" which causes it to remain
"unmodified", and therefore encoded properly? It is my guess that, for
example, [brand 3] must be adding or in some way "pre"-decoding
(while saving) the entities, and then re-encoding them-- improperly. It is
quite puzzling, and compelling for me. I wonder what are your own thoughts
on this?

i hope this is O.K. with you. i promise that i use it only locally, and have
no intent to publish it-- however, i will remove my saved copies if you
request i do so.

P.S. My web site where i reference howToCreate.co.uk:
[URL]
:)

thank you so much!
-J.S.
From: Me
To: Jeffrey Sabarese
Subject: Re: Char. Encoding - Entities - IDE Behaviours
Date: 19 March 2007 21:20

Jeffrey,

> i've also modified the doctype to XHTML Transitional

The page is HTML and is not compatible with XHTML, so there is no reason to
do this. If you actually tried to serve it as XHTML (using the correct
content type), it would generate parser errors.

> I realize that your entities page (in my awareness) does not have a
> declared Character Set

It does. I use the Content-Type header as part of the HTTP communication.
The correct character set is utf-8.

> For example, using [brand 2] (and other softwares such as [brand 3]),
> i've saved the page, declared the Character Set as UTF-8, and
> opened the page to see that the decoded glyphs are not right. Yet, i've done
> precisely the same w/ [brand 1] and it "works right".

The base of UTF-8 is compatible with ascii, but the higher ranges are
completely incompatible. In the case of [brand 2], it does not seem to
understand utf-8, so it breaks the file when it opens it (treating it as
your system default encoding - 8859-1 or whatever - converting the unknown
characters into whatever characters it can manufacture from that bit
sequence). When you save it, it saves the modified broken output, but in
whatever way it interpreted them, instead of in the same bit sequence as
they were in when you opened it. So it ends up broken.
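
A minimal Python sketch of that failure may make it concrete. It assumes the editor reads the UTF-8 bytes as ISO-8859-1 and then writes its own interpretation back out as UTF-8; the function name and sample text are illustrative only, and the real editor's behaviour may differ.

```python
def simulate_misread_and_save(original_text: str) -> bytes:
    """Mimic an editor that misreads UTF-8 as ISO-8859-1, then saves as UTF-8."""
    utf8_bytes = original_text.encode("utf-8")   # the bytes actually on disk
    misread = utf8_bytes.decode("iso-8859-1")    # the editor's wrong interpretation
    return misread.encode("utf-8")               # saved in a different bit sequence


original = "£ and é"
damaged = simulate_misread_and_save(original)

# Each non-ASCII character has turned into two unrelated characters.
print(damaged.decode("utf-8"))

# Pure ASCII content is unaffected, because the base of UTF-8 matches ASCII.
print(simulate_misread_and_save("plain ascii"))
```

Only the characters above the ASCII range are damaged, which is why a mostly English page can look fine until it reaches the entity glyphs.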

In the case of the [brand 3], I do not know exactly what it does, but
I suspect it does the same thing. It may well also remove my entities from
the source and replace them with exact characters in 8859-1, which confuses
it even more when you tell the browser to interpret it as utf-8.
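
As a rough illustration of that second possibility, the Python sketch below assumes a tool that replaces entities with literal characters and saves them as ISO-8859-1, after which the bytes are read as UTF-8. The helper name and sample markup are hypothetical, not what any particular editor does.

```python
import html


def entities_to_latin1_bytes(markup: str) -> bytes:
    """Replace entities with literal characters and save as ISO-8859-1."""
    return html.unescape(markup).encode("iso-8859-1")


saved = entities_to_latin1_bytes("caf&eacute; &pound;5")

# A browser told to read these bytes as UTF-8 finds invalid sequences,
# so each single-byte character becomes a replacement character (U+FFFD).
as_utf8 = saved.decode("utf-8", errors="replace")
print(as_utf8)
```

Left as entities, the same markup is pure ASCII and survives any of these encodings unchanged.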

You need to use an editor that understands utf-8 and either recognises the
file as that, or allows you to tell it what the file is. I use an editor
that does that. Your other editor seems to be able to do this too.

> i hope this is O.K. with you. i promise that i use it only locally

This is perfectly ok.


Mark 'Tarquin' Wilton-Jones - author of http://www.howtocreate.co.uk/
From: Jeffrey Sabarese
To: Me
Subject: Char. Encoding - Entities - IDE Behaviours
Date: 22 March 2007 15:08
Attachment: Sample files

Tarquin,

Thank you for your prompt reply, and for the thoughtful responses to
my various inquiries.
this is the last i will contact you, unsolicited. thank you for your
understanding.

You might say i've conducted a bit of an experiment this morning,
inspired by your reply.

Curious what would happen in a variety of situations, i archived in
RAR format and attached to this mail a series of files which were
saved from the source-view of different Browsers, and from code
manipulation (or not) of the "htmlentities" file in different Editors.
(the file-names themselves should tell the tale)

What i hope to gain from this second mail, Sir, is a better
understanding of how i should go about coding in UTF-8, how to avoid
the pitfalls of non-compliant software, and-- rather than a "trial and
error" experiment as i did today, to simply "Know" for reasons of
software settings and features, which software supports, and which will
NOT support coding for i18n.

please see below, and the attached...

>> I realize that your entities page (in my awareness) does not have a
>> declared Character Set
>
> It does. I use the Content-Type header as part of the HTTP
> communication. The correct character set is utf-8.

Forgive me, i do not mean to argue the point, but could you tell me
which line this is? am i going mad? i do not see it. perhaps you've
written it in a way which i am unfamiliar?

>> For example, using [brand 2] (and other softwares such as [brand 3]...SNIP
>
> The base of UTF-8 is compatible with ascii, but the higher ranges are
> completely incompatible.....SNIP...it saves the modified broken output,
> but in whatever way it interpreted them, instead of in the same bit
> sequence as they were in when you opened it. So it ends up broken.

THANK YOU! it's the "upper" UTF-8 element which was puzzling. now it
makes sense. thank you for that understanding! the question now
remains: which editors to use? :)

> In the case of the [brand 3], I do not know exactly what it does,
> but I suspect it does the same thing. It may well also remove my
> entities from the source and replace them with exact characters in
> 8859-1, which confuses it even more when you tell the browser to
> interpret it as utf-8.

meaning, when i added the meta-tag w/ the UTF-8 ?? or simply setting
the browser to read it as such... or?

> You need to use an editor that understands utf-8 and either recognises
> the file as that, or allows you to tell it what the file is. I use an
> editor that does that. Your other editor seems to be able to do this too.

If you don't mind, i'd like to know what you recommend for such compatibilities.
i think i understand most of it now.

One curiosity remains: What if, for example, [brand 3] "ruins" the
file by putting / warping the encodings. is there any way to "un-do"
this process, or is the file essentially ruined?
How can i be sure that when i "view source" that the encodings are not
destroyed at that very moment?

thanks very much for your time!
Jeffrey Sabarese
From: Me
To: Jeffrey Sabarese
Subject: Re: Char. Encoding - Entities - IDE Behaviours
Date: 22 March 2007 22:27

Jeffrey,

>> I use the Content-Type header as part of the HTTP
>> communication. The correct character set is utf-8.
>
> could you tell me which line this is?

It is not a line in the HTML. It is a part of the HTTP communication between
the server and the browser when it requests the page.

This is something that I have set in Apache (my Web server):
AddType "text/html; charset=UTF-8" html

or:
AddDefaultCharset UTF-8
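
For checking what a server declares, a small Python sketch can pull the charset out of a Content-Type header value. It uses the standard library's `email.message.Message` parser; the function name is made up for illustration.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased, None when absent


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

The response headers returned by `urllib.request.urlopen` are a subclass of the same `Message` class, so calling `get_content_charset()` on them should give the same answer for a live page.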

> which editors to use? :)

Many editors are capable of understanding utf-8. You should check with the
application vendor when you get the application. I do not like to recommend
specific non-browser software, but personally I use UltraEdit, and it works
very well. Files are assumed to be my system default (ISO-8859-1) encoding
if they only contain characters that are in that character set.

However, using the file menu, I can convert the file from "Ascii" (actually
not Ascii, but my system default; 8859-1) into utf-8. I can then add
characters into it that are not available in 8859-1. If I save the file with
some characters in it that do not exist in Ascii, then it will be able to
automatically work out if that file is using my system default or utf-8 when
I reopen it, so it works for my encoding needs.

(Note that it will have a couple of problems with some obscure combinations:
if you save a file in 8859-1 containing the sequence Â£ and no other
non-ascii content, then reopen it, it will think it was a £ sign in
utf-8.)
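
The ambiguity comes from the bytes themselves: Â£ saved as ISO-8859-1 and £ saved as UTF-8 are the same two bytes. The Python sketch below shows this with a simple "try UTF-8 first, else fall back" guess. That heuristic is an assumption for illustration, not a description of how UltraEdit actually detects encodings.

```python
from typing import Tuple


def guess_decode(raw: bytes) -> Tuple[str, str]:
    """Guess an encoding: valid UTF-8 wins, otherwise assume ISO-8859-1."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1"), "iso-8859-1"


# "Â£" in ISO-8859-1 is byte-for-byte identical to "£" in UTF-8.
ambiguous = "Â£".encode("iso-8859-1")
print(ambiguous == "£".encode("utf-8"))
print(guess_decode(ambiguous))

# A lone 8859-1 pound sign is not valid UTF-8, so the fallback is used.
print(guess_decode("£5".encode("iso-8859-1")))
```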

>> confuses it even more when you tell the browser to
>> interpret it as utf-8.
>
> meaning, when i added the meta-tag w/ the UTF-8 ?? or simply setting
> the browser to read it as such... or?

either.

> One curiosity remains: What if, for example, [brand 3] "ruins" the
> file by putting / warping the encodings. is there any way to "un-do"
> this process, or is the file essentially ruined?

I would assume it is ruined, but without detailed testing (and possibly some
knowledge of encodings that goes over my head), I could not say for sure.
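
One narrow case can at least be tested. If the damage was purely UTF-8 misread as ISO-8859-1 and then saved as UTF-8, and every byte survived, reversing the two steps may recover the text. The Python sketch below tries that and gives up otherwise; the function name is hypothetical, and it does nothing for characters an editor has already replaced or dropped.

```python
from typing import Optional


def try_undo_misread(damaged_text: str) -> Optional[str]:
    """Attempt to reverse a UTF-8-read-as-ISO-8859-1 mistake; None on failure."""
    try:
        return damaged_text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # not this kind of damage, or information was lost


print(try_undo_misread("Â£ and Ã©"))

# Once a character has been swapped for U+FFFD, the original byte is gone.
print(try_undo_misread("caf\ufffd"))
```

A `None` result means this particular reversal does not apply, so the safe assumption remains that the file is ruined.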

> How can i be sure that when i "view source" that the encodings are not
> destroyed at that very moment?

They are not ruined as long as the file is served online from a site that
sends the Content-Type header (such as mine). So source viewing of an online
Web page should in theory be fine, unless something is very broken in the
application that is showing you the source code (such as [brand 2]). It is
only when you save it onto your system, and the header is lost, that it
becomes a real problem.
This site was created by Mark "Tarquin" Wilton-Jones.