CJK types are set by these tokenizers, but you can also use CJKBigramFilter#CJKBigramFilter(TokenStream, int) to explicitly control which of the CJK scripts are turned into bigrams.
By default, when a CJK character has no adjacent characters to form a bigram, it is output as a unigram. To always output both unigrams and bigrams, set the outputUnigrams
flag to true in CJKBigramFilter#CJKBigramFilter(TokenStream, int, boolean). This enables a combined unigram+bigram approach.
In all cases, all non-CJK input is passed through unmodified.
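
The behavior described above can be illustrated with a small plain-Java model. This is only a sketch of the output rules, not Lucene's implementation: it does not use TokenStream, it ignores the script-selection flags, and it assumes that adjacent CJK characters are simply those not separated by whitespace or punctuation. The class name CjkBigramSketch and its methods are hypothetical and exist only for this illustration.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the bigram/unigram rules; not Lucene code.
final class CjkBigramSketch {

    // Treat Han, Hiragana, Katakana and Hangul code points as "CJK".
    static boolean isCjk(int cp) {
        UnicodeScript s = UnicodeScript.of(cp);
        return s == UnicodeScript.HAN || s == UnicodeScript.HIRAGANA
            || s == UnicodeScript.KATAKANA || s == UnicodeScript.HANGUL;
    }

    // Split text into tokens: CJK runs become bigrams (and optionally unigrams),
    // non-CJK words pass through unchanged, everything else is a separator.
    static List<String> tokens(String text, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            if (isCjk(cps[i])) {
                int end = i;
                while (end < cps.length && isCjk(cps[end])) {
                    end++;
                }
                emitRun(cps, i, end, outputUnigrams, out);
                i = end;
            } else if (Character.isLetterOrDigit(cps[i])) {
                int end = i;
                while (end < cps.length && Character.isLetterOrDigit(cps[end])
                        && !isCjk(cps[end])) {
                    end++;
                }
                out.add(new String(cps, i, end - i)); // non-CJK word, unmodified
                i = end;
            } else {
                i++; // whitespace or punctuation: breaks adjacency
            }
        }
        return out;
    }

    // Emit the tokens for one run of adjacent CJK characters [start, end).
    static void emitRun(int[] cps, int start, int end,
                        boolean outputUnigrams, List<String> out) {
        if (end - start == 1) {
            // No neighbor to pair with: always fall back to a unigram.
            out.add(new String(cps, start, 1));
            return;
        }
        for (int j = start; j < end; j++) {
            if (outputUnigrams) {
                out.add(new String(cps, j, 1));
            }
            if (j + 1 < end) {
                out.add(new String(cps, j, 2));
            }
        }
    }
}
```

Under this model, tokens("多くの学生", false) yields 多く, くの, の学, 学生, while tokens("多くの学生", true) yields 多, 多く, く, くの, の, の学, 学, 学生, 生. An isolated character such as the 学 in "学 abc" is emitted as a unigram in both modes, and abc is passed through unchanged.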