Package com.github.oeuvres.alix.util
Class Char
java.lang.Object
com.github.oeuvres.alix.util.Char
Efficient character categorizer, faster than Character.is*() and optimized for tokenizing Latin-script text. The idea is to populate one large array with the properties of each code point, so that a test becomes an array lookup.
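The table-driven idea described above can be sketched as follows. This is a minimal illustration, not the Alix source: the class name CharTableSketch, the flag values, and the choice to fill the table from java.lang.Character are all assumptions made for the example.

```java
/** Hypothetical sketch of a table-driven character classifier; not the Alix implementation. */
final class CharTableSketch {
    // Illustrative flag values; the real constants in Char may differ.
    static final short LETTER = 0x0001;
    static final short DIGIT = 0x0002;
    static final short SPACE = 0x0004;
    static final short UPPERCASE = 0x0008;
    static final short LOWERCASE = 0x0010;

    // One entry per UTF-16 code unit: 65,536 shorts, about 128 KiB.
    private static final short[] PROPS = new short[Character.MAX_VALUE + 1];

    static {
        // Pay the cost of java.lang.Character once, when the class is loaded.
        for (int i = 0; i <= Character.MAX_VALUE; i++) {
            char c = (char) i;
            int flags = 0;
            if (Character.isLetter(c)) flags |= LETTER;
            if (Character.isDigit(c)) flags |= DIGIT;
            if (Character.isWhitespace(c) || Character.isSpaceChar(c)) flags |= SPACE;
            if (Character.isUpperCase(c)) flags |= UPPERCASE;
            if (Character.isLowerCase(c)) flags |= LOWERCASE;
            PROPS[i] = (short) flags;
        }
    }

    /** Raw flags for a char: one array read. */
    static short props(char c) {
        return PROPS[c];
    }

    static boolean isLetter(char c) {
        return (PROPS[c] & LETTER) != 0;
    }

    static boolean isDigit(char c) {
        return (PROPS[c] & DIGIT) != 0;
    }

    static boolean isSpace(char c) {
        return (PROPS[c] & SPACE) != 0;
    }
}
```

After the table is built, each test costs one array read and one bit mask, which is the kind of saving the class description points to for tokenizer inner loops.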
-
Field Summary
Fields
Modifier and Type | Field | Description
static final short | CONSONNANT | Binary flag, not used.
static final short | DIGIT | Binary flag, a digit.
static final short | HIGHSUR | Binary flag, isHighSurrogate.
static final short | LETTER | Binary flag, a letter.
static final short | LETTER_OR_DIGIT | Binary flag, composite shortcut, word char.
static final short | LOWERCASE | Binary flag, lower case letter.
static final short | LOWSUR | Binary flag, isLowSurrogate.
static final short | MATH | Binary flag, math operator.
static final short | PUNclause | Binary flag, punctuation char for clause in a sentence.
static final short | PUNCTUATION | Binary flag, punctuation char.
static final short | PUNCTUATION_OR_SPACE | Binary flag, composite shortcut, word separator.
static final short | PUNsent | Binary flag, punctuation char for sentence.
static final short | SPACE | Binary flag, a space.
static final short | TOKEN | Binary flag, a token char.
static final short | UPPERCASE | Binary flag, upper case letter.
static final short | VOWEL | Binary flag, not used.
-
Method Summary
Modifier and Type | Method | Description
static Chain | deligat(source) | Deligature, Æ → AE, œ → oe…
static boolean | isDigit(char c) | Is numeric, like Character.isDigit(char).
static boolean | isHighSurrogate(char c) | Is the first char of a supplementary Unicode code point, like Character.isHighSurrogate(char).
static boolean | isLetter(char c) | Is a letter, like Character.isLetter(char).
static boolean | isLetterOrDigit(char c) | Is a letter or a digit, like Character.isLetterOrDigit(char).
static boolean | isLowerCase(char c) | Is a lower case letter, like Character.isLowerCase(char).
static boolean | isLowSurrogate(char c) | Is the second char of a supplementary Unicode code point, like Character.isLowSurrogate(char).
static boolean | isMath(char c) | Is a mathematical symbol, see Character.MATH_SYMBOL.
static boolean | isPUNcl(char c) | Is a punctuation mark of clause level (inside a sentence) (,;: etc.)
static boolean | isPunctuation(char c) | Is a punctuation mark between words.
static boolean | isPunctuationOrSpace(char c) | Is punctuation or space.
static boolean | isPUNsent(char c) | Is a punctuation mark of sentence break level (!?. etc.)
static boolean | isSpace(char c) | Is a "whitespace" according to ISO (space, tabs, new lines) and also for Unicode (non-breaking spaces); see Character.isSpaceChar(char), Character.isWhitespace(char).
static boolean | isToken(char c) | Is a word character: a letter, but also '’-_ and some other tweaks for lexical parsing.
static boolean | isUpperCase(char c) | Is an upper case letter, like Character.isUpperCase(char).
static short | props(char c) | Get the internal properties for a char as flags.
static String | toASCII(CharSequence src) | ASCII version of a Latin-script string, same as Lucene ASCIIFoldingFilterFactory.
static String | toASCII(CharSequence src, boolean punStrip) | ASCII version of a Latin-script string, same as Lucene ASCIIFoldingFilterFactory, optionally stripping all non-word chars.
static char | toLower(char c) | Efficient lower casing (test isUpperCase(char) before).
static StringBuilder | toLower(s) | Lower casing a mutable string.
static String | toString(char c) | Human readable information about a char.
static char | toUpper(char c) | Efficient upper casing (test isLowerCase(char) before).
-
Field Details
-
LETTER
public static final short LETTER
Binary flag, a letter.
-
TOKEN
public static final short TOKEN
Binary flag, a token char.
-
SPACE
public static final short SPACE
Binary flag, a space.
-
PUNCTUATION
public static final short PUNCTUATION
Binary flag, punctuation char.
-
LOWERCASE
public static final short LOWERCASE
Binary flag, lower case letter.
-
UPPERCASE
public static final short UPPERCASE
Binary flag, upper case letter.
-
VOWEL
public static final short VOWEL
Binary flag, not used.
-
CONSONNANT
public static final short CONSONNANT
Binary flag, not used.
-
DIGIT
public static final short DIGIT
Binary flag, a digit.
-
PUNsent
public static final short PUNsent
Binary flag, punctuation char for sentence.
-
PUNclause
public static final short PUNclause
Binary flag, punctuation char for clause in a sentence.
-
MATH
public static final short MATH
Binary flag, math operator.
-
LOWSUR
public static final short LOWSUR
Binary flag, low surrogate (see isLowSurrogate(char)).
-
HIGHSUR
public static final short HIGHSUR
Binary flag, high surrogate (see isHighSurrogate(char)).
-
PUNCTUATION_OR_SPACE
public static final short PUNCTUATION_OR_SPACE
Binary flag, composite shortcut, word separator.
-
LETTER_OR_DIGIT
public static final short LETTER_OR_DIGIT
Binary flag, composite shortcut, word char.
-
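The "composite shortcut" flags suggest that several properties can be tested with one mask. A minimal sketch of that pattern follows; the class FlagMaskSketch, the helper names hasAny and hasAll, and the bit values are hypothetical and are not taken from Char.

```java
/** Hypothetical sketch of composite bit flags; the values are illustrative, not those of Char. */
final class FlagMaskSketch {
    static final short LETTER = 0x01;
    static final short DIGIT = 0x02;
    static final short SPACE = 0x04;
    static final short PUNCTUATION = 0x08;
    static final short UPPERCASE = 0x10;

    // Composite shortcuts: the OR of two basic flags (assumed construction).
    static final short LETTER_OR_DIGIT = LETTER | DIGIT;
    static final short PUNCTUATION_OR_SPACE = PUNCTUATION | SPACE;

    /** True if at least one bit of mask is set in props. */
    static boolean hasAny(int props, int mask) {
        return (props & mask) != 0;
    }

    /** True if every bit of mask is set in props. */
    static boolean hasAll(int props, int mask) {
        return (props & mask) == mask;
    }
}
```

With flags laid out this way, a value such as the one returned by props(char) could be checked against a composite mask in a single operation instead of two separate tests.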
-
Method Details
-
isDigit
public static boolean isDigit(char c)
Is numeric, like Character.isDigit(char).
Parameters:
c - char to test.
Returns:
true if c is a digit.
-
isHighSurrogate
public static boolean isHighSurrogate(char c)
Is the first char (16-bit code unit) of a supplementary Unicode code point, like Character.isHighSurrogate(char).
Parameters:
c - char to test.
Returns:
true if c is not a full character but the first half of a surrogate pair.
-
isLetter
public static boolean isLetter(char c)
Is a letter, like Character.isLetter(char).
Parameters:
c - char to test.
Returns:
true if c is a letter, false otherwise.
-
isLetterOrDigit
public static boolean isLetterOrDigit(char c)
Is a letter or a digit, like Character.isLetterOrDigit(char).
Parameters:
c - char to test.
Returns:
true if c is a letter or a digit, false otherwise.
-
isLowerCase
public static boolean isLowerCase(char c)
Is a lower case letter, like Character.isLowerCase(char).
Parameters:
c - char to test.
Returns:
true if c is a lower case letter, false otherwise.
-
isLowSurrogate
public static boolean isLowSurrogate(char c)
Is the second char (16-bit code unit) of a supplementary Unicode code point, like Character.isLowSurrogate(char).
Parameters:
c - char to test.
Returns:
true if c is not a full character but the second half of a surrogate pair, false otherwise.
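Because these tests work on a single char, a caller that meets a high surrogate has to look at the next char to get a whole code point. The sketch below illustrates this with the java.lang.Character methods named above as equivalents; the class and method names are hypothetical, and the same loop could presumably be written with the Char tests.

```java
/** Hypothetical sketch: walking UTF-16 text while keeping surrogate pairs together. */
final class SurrogateWalkSketch {
    /** Counts code points, treating a valid high+low surrogate pair as one character. */
    static int countCodePoints(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1));
            if (pair) {
                i++; // skip the low surrogate, it belongs to the same code point
            }
            count++;
        }
        return count;
    }
}
```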
-
isMath
public static boolean isMath(char c)
Is a mathematical symbol, see Character.MATH_SYMBOL.
Parameters:
c - char to test.
Returns:
true if c is a math symbol, false otherwise.
-
isPunctuation
public static boolean isPunctuation(char c)
Is a punctuation mark between words.
Parameters:
c - char to test.
Returns:
true if c is punctuation, false otherwise.
-
isPunctuationOrSpace
public static boolean isPunctuationOrSpace(char c)
Is punctuation or space.
Parameters:
c - char to test.
Returns:
true if c is a word separator, false otherwise.
-
isPUNsent
public static boolean isPUNsent(char c)
Is a punctuation mark of sentence break level (!?. etc.)
Parameters:
c - char to test.
Returns:
true if c is sentence-ending punctuation, false otherwise.
-
isPUNcl
public static boolean isPUNcl(char c)
Is a punctuation mark of clause level (inside a sentence) (,;: etc.)
Parameters:
c - char to test.
Returns:
true if c is clause-level punctuation, false otherwise.
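The sentence-level and clause-level distinction is typically used to cut running text into sentences. The following sketch shows one way that could look. The local isPUNsent here is a stand-in that accepts only '.', '!' and '?'; the actual set recognized by Char.isPUNsent is not listed in this page, so treat the set and the class name as assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splitting text at sentence-level punctuation. */
final class SentenceSplitSketch {
    /** Stand-in for Char.isPUNsent; the real set of characters may be larger. */
    static boolean isPUNsent(char c) {
        return c == '.' || c == '!' || c == '?';
    }

    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!isPUNsent(text.charAt(i))) {
                continue;
            }
            // Absorb runs such as "?!" or "..." into the same sentence end.
            while (i + 1 < text.length() && isPUNsent(text.charAt(i + 1))) {
                i++;
            }
            String sentence = text.substring(start, i + 1).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
            start = i + 1;
        }
        // Keep trailing text that has no closing punctuation.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            out.add(tail);
        }
        return out;
    }
}
```

Clause-level marks such as the comma stay inside the sentence in this sketch; a second pass with a clause-level test could cut each sentence further.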
-
isSpace
public static boolean isSpace(char c)
Is a "whitespace" according to ISO (space, tabs, new lines) and also for Unicode (non-breaking spaces); see Character.isSpaceChar(char), Character.isWhitespace(char).
Parameters:
c - char to test.
Returns:
true if c is a space, false otherwise.
-
isToken
public static boolean isToken(char c)
Is a word character: a letter, but also '’-_ and some other tweaks for lexical parsing.
Parameters:
c - char to test.
Returns:
true if c is a token char, false otherwise.
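A token test of this kind lets a tokenizer keep forms such as "l’arc-en-ciel" in one piece. The sketch below is an illustration under assumptions: the local isToken accepts letters plus the four characters quoted above, and it ignores the "other tweaks" that the real method applies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: cutting text into runs of token chars. */
final class TokenSketch {
    /** Stand-in for Char.isToken: letters plus apostrophes, hyphen and underscore. */
    static boolean isToken(char c) {
        return Character.isLetter(c) || c == '\'' || c == '\u2019' || c == '-' || c == '_';
    }

    static List<String> tokens(CharSequence text) {
        List<String> out = new ArrayList<>();
        int start = -1; // index where the current token began, -1 when outside a token
        for (int i = 0; i < text.length(); i++) {
            if (isToken(text.charAt(i))) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                out.add(text.subSequence(start, i).toString());
                start = -1;
            }
        }
        if (start >= 0) {
            out.add(text.subSequence(start, text.length()).toString());
        }
        return out;
    }
}
```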
-
isUpperCase
public static boolean isUpperCase(char c)
Is an upper case letter, like Character.isUpperCase(char).
Parameters:
c - char to test.
Returns:
true if c is an upper case letter, false otherwise.
-
props
public static short props(char c)
Get the internal properties for a char as flags.
Parameters:
c - char to test.
Returns:
raw flags as a short.
-
toLower
public static char toLower(char c)
Efficient lower casing (test isUpperCase(char) before).
Parameters:
c - char to transform.
Returns:
c in lower case.
-
toLower
Lower casing a mutable string.
Parameters:
s - the char sequence.
Returns:
the modified char sequence.
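In-place lower casing of a mutable sequence, combined with the "test before converting" advice given for toLower(char), could look like the sketch below. The StringBuilder parameter type is an assumption inferred from the return type shown in the method summary, and the java.lang.Character calls stand in for the Char ones.

```java
/** Hypothetical sketch: lower casing a StringBuilder in place. */
final class LowerInPlaceSketch {
    static StringBuilder toLower(StringBuilder s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Test first, so that chars already in lower case are never rewritten.
            if (Character.isUpperCase(c)) {
                s.setCharAt(i, Character.toLowerCase(c));
            }
        }
        return s; // same instance, modified
    }
}
```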
-
toASCII
static String toASCII(CharSequence src)
ASCII version of a Latin-script string, same as Lucene ASCIIFoldingFilterFactory.
Parameters:
src - Unicode char sequence.
Returns:
ASCII version of src.
-
toASCII
static String toASCII(CharSequence src, boolean punStrip)
ASCII version of a Latin-script string, same as Lucene ASCIIFoldingFilterFactory, optionally stripping all non-word chars.
Parameters:
src - Unicode char sequence.
punStrip - strip word separators.
Returns:
ASCII version of src.
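To give a rough idea of what ASCII folding does, here is a sketch built on java.text.Normalizer. It is a swapped-in technique, not the Lucene folding table and not the Alix code: it only removes combining marks after canonical decomposition, drops anything still outside ASCII (ligatures such as œ are lost rather than expanded), and interprets punStrip as "keep letters and digits only", which is an assumption.

```java
import java.text.Normalizer;

/** Hypothetical sketch: accent stripping via Unicode decomposition, not Lucene's mapping. */
final class AsciiFoldSketch {
    static String toASCII(CharSequence src, boolean punStrip) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(src, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // drop the accent itself
            }
            if (c > 0x7F) {
                continue; // no ASCII equivalent in this sketch: dropped
            }
            if (punStrip && !Character.isLetterOrDigit(c)) {
                continue; // assumed meaning of punStrip: keep word chars only
            }
            out.append(c);
        }
        return out.toString();
    }

    static String toASCII(CharSequence src) {
        return toASCII(src, false);
    }
}
```

A mapping table like the one Lucene uses covers many more characters than decomposition can, which is one reason a dedicated folding method is useful.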
-
deligat
Deligature, Æ → AE, œ → oe…
Parameters:
source - mutable char sequence.
Returns:
modified source.
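Deligature on a mutable sequence can be sketched as an in-place replacement loop. The version below uses StringBuilder in place of the Alix Chain type, and only the Æ → AE and œ → oe mappings come from the description above; the other table entries are assumptions added for illustration.

```java
/** Hypothetical sketch: expanding ligatures in place; StringBuilder stands in for Chain. */
final class DeligatureSketch {
    static StringBuilder deligat(StringBuilder source) {
        for (int i = 0; i < source.length(); i++) {
            String rep;
            switch (source.charAt(i)) {
                case '\u00C6': rep = "AE"; break; // Æ, from the description
                case '\u0153': rep = "oe"; break; // œ, from the description
                case '\u00E6': rep = "ae"; break; // æ, assumed
                case '\u0152': rep = "OE"; break; // Œ, assumed
                case '\uFB01': rep = "fi"; break; // fi ligature, assumed
                case '\uFB02': rep = "fl"; break; // fl ligature, assumed
                default: continue; // not a ligature, leave as is
            }
            source.replace(i, i + 1, rep);
            i += rep.length() - 1; // step over the inserted letters
        }
        return source;
    }
}
```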
-
toUpper
public static char toUpper(char c)
Efficient upper casing (test isLowerCase(char) before).
Parameters:
c - char to convert.
Returns:
converted char.
-
toString
static String toString(char c)
Human readable information about a char.
Parameters:
c - char to test.
Returns:
human readable information.
-