freesinger (Repos: 34, Followers: 107, Following: 23)

Apache Doris is an easy-to-use, high performance and unified analytics database.

6588
1876

An optimized clustering framework for 'Clustering by fast search and find of density peaks' in Science 2014.

11
11

Events

pull request opened
[fix](column)fix get_shrinked_column misspell

Proposed changes

Fix misspell

Checklist(Required)

  1. Does it affect the original behavior:
    • [ ] Yes
    • [x] No
    • [ ] I don't know
  2. Has unit tests been added:
    • [ ] Yes
    • [ ] No
    • [x] No Need
  3. Has document been added or modified:
    • [ ] Yes
    • [x] No
    • [ ] No Need
  4. Does it need to update dependencies:
    • [ ] Yes
    • [x] No
  5. Are there any changes that cannot be rolled back:
    • [ ] Yes (If Yes, please explain WHY)
    • [x] No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

Created at 2 months ago
freesinger create branch spell-fix
Created at 2 months ago
issue comment
[feature](JSON datatype)Support JSON datatype

Thanks to @xiaokang for the careful review and tremendous help. Cheers~

Created at 2 months ago

fix some data convert bugs, set Mysql type to JSON

Created at 2 months ago

feature-wip add page index row range (#12652)

Add some utils and provide the candidate row range to read for the page index filter (generated from the skipped row ranges of each column). This version supports binary operator filters.

todo:

  • use context instead of structures in close()
  • process complex type filter
  • use this instead of row group minmax filter
  • refactor _eval_binary() for row group filter and page index filter
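
The idea of deriving candidate row ranges from skipped row ranges can be sketched as follows. This is a minimal illustration, not the Doris implementation (the parquet reader lives in the C++ backend); the class RowRanges, its Range type, and the half-open range convention are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: names and the half-open [start, end) convention are assumptions.
final class RowRanges {
    /** Half-open row range [start, end). */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * Returns the rows that still have to be read: everything in [0, totalRows)
     * that is not covered by any skipped range. The skipped ranges may come from
     * several columns and may overlap.
     */
    static List<Range> candidates(long totalRows, List<Range> skipped) {
        List<Range> sorted = new ArrayList<>(skipped);
        sorted.sort(Comparator.comparingLong((Range r) -> r.start));
        List<Range> result = new ArrayList<>();
        long cursor = 0; // first row not yet classified
        for (Range r : sorted) {
            if (r.start > cursor) {
                result.add(new Range(cursor, r.start)); // gap between skipped ranges
            }
            cursor = Math.max(cursor, r.end);
        }
        if (cursor < totalRows) {
            result.add(new Range(cursor, totalRows)); // tail after the last skipped range
        }
        return result;
    }
}
```

For 100 rows with skipped ranges [0, 10), [5, 20) and [50, 60), the sketch yields the candidate ranges [20, 50) and [60, 100).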

enhancement compare LogicalProperties with output set instead of output list (#12743)

We used to compare two LogicalProperties by their output list. Because join reorder changes the order of a join plan's children, the output list changes too, so two join plans that should be equal were no longer equal in the memo, and we had to add a project on top of the new join to keep the LogicalProperties the same. This PR changes the equals and hashCode functions of LogicalProperties to compare a set of outputs instead. We no longer need to add the top project, which helps keep the memo simple and efficient.
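
The difference between list-based and set-based comparison can be shown with a small sketch. It uses plain strings in place of output slots; OutputComparison and its method names are made up for the example and are not the LogicalProperties API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; slots are modelled as plain strings.
final class OutputComparison {
    /** Order-sensitive: fails once join reorder swaps the children. */
    static boolean sameAsList(List<String> left, List<String> right) {
        return left.equals(right);
    }

    /** Order-insensitive: the same slots in any order compare equal. */
    static boolean sameAsSet(List<String> left, List<String> right) {
        Set<String> l = new HashSet<>(left);
        Set<String> r = new HashSet<>(right);
        return l.equals(r);
    }

    public static void main(String[] args) {
        List<String> joinAB = List.of("a.id", "a.name", "b.id");
        List<String> joinBA = List.of("b.id", "a.id", "a.name");
        System.out.println(sameAsList(joinAB, joinBA)); // false
        System.out.println(sameAsSet(joinAB, joinBA));  // true
    }
}
```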

enhancement use Literal promotion to avoid unnecessary cast (#12663)

Instead of adding a cast function on a literal, we directly change the literal's type. This saves cast execution time and memory. For example, in the SQL "CASE WHEN l_orderkey > 0 THEN ...", 0 is a TinyIntLiteral.
Before this PR: "CASE WHEN l_orderkey > CAST (TinyIntLiteral(0) AS INT)"
With this PR: "CASE WHEN l_orderkey > IntegerLiteral(0)"

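A rough sketch of literal promotion versus wrapping the literal in a cast is shown here. The tiny expression model (Expr, Literal, Cast, Type) is invented for illustration and does not mirror the Nereids classes; it only contrasts the two shapes of the resulting expression.

```java
// Hypothetical mini expression model; none of these names are Doris classes.
final class LiteralPromotion {
    enum Type { TINYINT, INT }

    static class Expr {}

    static final class Literal extends Expr {
        final long value;
        final Type type;

        Literal(long value, Type type) {
            this.value = value;
            this.type = type;
        }
    }

    static final class Cast extends Expr {
        final Expr child;
        final Type target;

        Cast(Expr child, Type target) {
            this.child = child;
            this.target = target;
        }
    }

    /** Cast approach: the literal is wrapped, so a cast remains in the plan. */
    static Expr coerceWithCast(Literal lit, Type target) {
        return lit.type == target ? lit : new Cast(lit, target);
    }

    /** Promotion approach: the literal is re-typed at plan time, leaving no cast. */
    static Expr coerceByPromotion(Literal lit, Type target) {
        return lit.type == target ? lit : new Literal(lit.value, target);
    }
}
```

The sketch only covers widening (TINYINT to INT), where the value always fits the target type.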
feature Push down not slot references expression of on clause (#11805)

Push down expressions in the ON clause that are not plain slot references. For example:

select * from t1 join t2 on t1.a + 1 = t2.b + 2 and t1.a + 1 > 2

project()
+---join(t1.a + 1 = t2.b + 2 && t1.a + 1 > 2)
    |---scan(t1)
    +---scan(t2)

transform to

project()
+---join(c = d && c > 2)
    |---project(t1.a -> t1.a + 1)
    |   +---scan(t1)
    +---project(t2.b -> t2.b + 2)
        +---scan(t2)

Here c and d stand for the projected expressions t1.a + 1 and t2.b + 2.

feature-wip filter rows by page index (#12664)

Proposed changes

Parquet v1.11+ supports page skipping, which helps the scanner reduce the amount of data scanned, decompressed, decoded, and inserted. According to the performance FlameGraph, decompression takes up 20% of CPU time. If a page can be filtered out as a whole, it does not need to be decompressed.

However, row numbers are not aligned across the pages of different columns. Columns containing predicates can be filtered at page granularity, but other columns need to skip rows within pages, so non-predicate columns can only save the decoding and insertion time.

Array column needs the repetition level to align with other columns, so the array column can only save the decoding and insertion time.

Explore

OffsetIndex in the column metadata can locate the page position. Theoretically, a page can be skipped completely, including the time spent reading it from HDFS. However, the average page size is around 500KB, and skipping a page requires a skip call. Frequent skip calls perform poorly and may not beat continuous reads of large blocks of data (such as 4MB).

If multiple consecutive pages are filtered, the read can skip them according to OffsetIndex. However, for ease of programming and readability, the data of all pages is loaded and filtered in turn.

add FE support for JSON datatype and RPC definitions

add JSONB data storage format type

fix JsonLiteral resolve bug

add DataTypeJson case in data_type_factory

add JSON syntax check in FE

add operators for jsonb_document, currently not support comparison between any JSON type value

add ColumnJson and DataTypeJson

add JsonField to store JsonValue

add JsonValue to convert String JSON to BINARY JSON and JsonLiteral case for vliteral

add push_json for MysqlResultWriter

JSON column need no zone_map_index

Revert "JSON column need no zone_map_index"

This reverts commit f71d1ce1ded9dbae44a5d58abcec338816b70d79.

add JSON writer and reader, ignore zone-map for JSON column

add json_to_string for DataTypeJson

add olap_data_convertor for JSON type

Created at 2 months ago

fix some data convert bugs, set Mysql type to JSON

Created at 2 months ago

improve improve join cost model (#12657)

feature Internal-query, execute SQL query statement internally in FE (#9983)

Execute SQL query statements internally (in FE). Internal-query is mainly used by the statistics module: FE obtains statistics from BE through SQL, such as a column's maximum and minimum values.

It is a utility module for statistics; it does not affect the original code or how users use the system.

The simple usage process is as follows(the following code does no exception handling):

String dbName = "test";
String sql = "SELECT * FROM table0";

// Run the statement inside FE and collect the result rows.
InternalQuery query = new InternalQuery(dbName, sql);
InternalQueryResult result = query.query();
List<ResultRow> resultRows = result.getResultRows();

for (ResultRow resultRow : resultRows) {
    List<String> columns = resultRow.getColumns();
    for (int i = 0; i < columns.size(); i++) {
        // Each accessor accepts either a column name or a column index.
        resultRow.getColumnIndex(columns.get(i));
        resultRow.getColumnName(i);
        resultRow.getColumnType(columns.get(i));
        resultRow.getColumnType(i);
        resultRow.getColumnValue(columns.get(i));
        resultRow.getColumnValue(i);
    }
}

refactor rename transform to applyExploration UT helper class PlanChecker (#12725)

fix remove statistical task multiple times in one loop cycle (#12741)

There is a problem with StatisticsTaskScheduler: the peek() method obtains a reference to the same task object, but the for-loop removes from the queue multiple times.
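
One plausible shape of such a peek/remove mismatch is sketched below. This is not the StatisticsTaskScheduler code; the class PeekRemoveDemo, the string tasks, and the loop structure are assumptions chosen only to show why peeking once while removing repeatedly loses work.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical illustration; tasks are modelled as plain strings.
final class PeekRemoveDemo {
    /** Buggy shape: the head is peeked once, but one element is removed per iteration. */
    static List<String> buggyTake(Queue<String> queue, int count) {
        List<String> taken = new ArrayList<>();
        String task = queue.peek();      // same reference for the whole loop
        for (int i = 0; i < count && task != null; i++) {
            taken.add(task);
            queue.poll();                // drops later tasks that were never processed
        }
        return taken;
    }

    /** Fixed shape: every iteration takes and removes exactly one task. */
    static List<String> fixedTake(Queue<String> queue, int count) {
        List<String> taken = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String task = queue.poll();  // null when the queue is empty
            if (task == null) {
                break;
            }
            taken.add(task);
        }
        return taken;
    }

    public static void main(String[] args) {
        System.out.println(buggyTake(new ArrayDeque<>(List.of("t1", "t2", "t3")), 2)); // [t1, t1]
        System.out.println(fixedTake(new ArrayDeque<>(List.of("t1", "t2", "t3")), 2)); // [t1, t2]
    }
}
```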

test runtime filter unit cases not rely on NereidPlanner to generate PhysicalPlan anymore (#12740)

This PR:

  1. add rewrite and implement method to PlanChecker
  2. improve unit tests of runtime filter

typo Add docs of math function (#12532)

  • docs of math function

docs add a series of date function documents (#12713)

  • docs add a series of date function documents: add docs for the hours_add, hours_sub, minutes_add, minutes_sub, seconds_add, seconds_sub, years_sub, years_add, months_add, months_sub, days_add, weeks_add, and weeks_sub functions.

feature template for building internal query SQL statements (#12714)

Template for building internal query SQL statements; it is mainly used by the statistics module. Once a template is defined, the executable statement is built from the given parameters.

For example, template and parameters:

  • template: SELECT ${col} FROM ${table} WHERE id = ${id};
  • parameters: {col=colName, table=tableName, id=1}
  • result sql: SELECT colName FROM tableName WHERE id = 1;

usage:

String template = "SELECT * FROM ${table} WHERE id = ${id};";
Map<String, String> params = new HashMap<>();
params.put("table", "table0");
params.put("id", "123");

// result: SELECT * FROM table0 WHERE id = 123;
String result = InternalSqlTemplate.processTemplate(template, params);
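
How ${name} placeholders might be substituted can be sketched with a regular expression. The class SimpleSqlTemplate below is a stand-in written for this example, not the InternalSqlTemplate implementation; failing on a missing parameter is an assumed design choice.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in; not the InternalSqlTemplate source.
final class SimpleSqlTemplate {
    // Matches ${name} where name consists of letters, digits, or underscores.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    static String process(String template, Map<String, String> params) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("missing parameter: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The sketch inserts values verbatim, so it is only suitable for trusted, internally generated parameters.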

chore add order by in test_rollup_agg_date.groovy (#12737)

feature-wip fix that incremental clone may lead to loss of delete bitmap (#12721)

fix the inlineview's slots' nullability property is not set correctly (#12681)

The output slots of an inline view may come from the nullable side of an outer join, so they should be nullable.

fix the output of window function's nullability should be consistent with output slot (#12607)

FE may force a window function to output a nullable value in some cases; BE should follow this and change the nullability accordingly.

add FE support for JSON datatype and RPC definitions

add JSONB data storage format type

fix JsonLiteral resolve bug

add DataTypeJson case in data_type_factory

add JSON syntax check in FE

add operators for jsonb_document, currently not support comparison between any JSON type value

add ColumnJson and DataTypeJson

add JsonField to store JsonValue

Created at 2 months ago

fix some data convert bugs, set Mysql type to JSON

Created at 2 months ago

improve: check same logicalProperty when insert a Group. (#12469)

feature implement uncheckedCast method in VarcharLiteral (#12468)

Implement uncheckedCast on VarcharLiteral as a temporary way to let TimestampArithmetic work. We should remove this code and do the implicit cast in the TypeCoercion rule in the future.

enhancement change aggregate and join stats calc algorithm (#12447)

The original statistics derivation algorithm relies on NDV and other column statistics, but we cannot get these stats in a production environment. This PR changes these operators' stats calculation to use a DEFAULT RATIO variable instead of column statistics. We should change these algorithms again once column stats are available in production.

improvement unset common fields to reduce plan thrift size (#12495)

  1. For a query with 1656 unions, the plan thrift size is reduced from 400MB+ to 2MB. This optimization was introduced in #4904 but lost after #9720.

  2. Disable ExprSubstitutionMap.verify when debug is disabled, so that the planning time of the query with 1656 unions is reduced from 20s to 2s.

[fix](vectorized load) fix incomplete errmsg when find partition failed (#12485)

Signed-off-by: freemandealer freeman.zhang1992@gmail.com

delete_doc_upd (#12473)

delete_doc_update

doc performance doc and script update (#12493)

feature-wip add gzip compression codec (#12488)

Query failed when reading parquet data compressed by GZIP:

mysql> select * from customer limit 1;
ERROR 1105 (HY000): errCode = 2, detailMessage = unknown compression type(GZIP)

bugfix escape identifiers for sqlserver and postgresql (#12487)

The delimited identifier format for SQL Server and PostgreSQL differs from MySQL's: SQL Server uses brackets ([ ]) and PostgreSQL uses double quotes ("").
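
A minimal sketch of dialect-specific identifier quoting is given here. The class IdentifierQuoting and its Dialect enum are hypothetical and are not the code changed by this PR; doubling an embedded closing delimiter follows the usual escaping convention of each database.

```java
// Hypothetical helper; names are illustrative, not Doris classes.
final class IdentifierQuoting {
    enum Dialect { MYSQL, SQLSERVER, POSTGRESQL }

    /** Wraps an identifier in the dialect's delimiters, doubling any embedded closing delimiter. */
    static String quote(Dialect dialect, String name) {
        switch (dialect) {
            case MYSQL:
                return "`" + name.replace("`", "``") + "`";
            case SQLSERVER:
                return "[" + name.replace("]", "]]") + "]";
            case POSTGRESQL:
                return "\"" + name.replace("\"", "\"\"") + "\"";
            default:
                throw new IllegalArgumentException("unknown dialect: " + dialect);
        }
    }

    public static void main(String[] args) {
        System.out.println(quote(Dialect.MYSQL, "order"));      // `order`
        System.out.println(quote(Dialect.SQLSERVER, "order"));  // [order]
        System.out.println(quote(Dialect.POSTGRESQL, "order")); // "order"
    }
}
```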

feature-wip bug fix, parquet footer buffer is small when containing many columns (#12477)

Reading a parquet file with many columns (>1600) failed:

mysql> select int_col from types_sf100_r100w limit 5;
ERROR 1105 (HY000): errCode = 2, detailMessage = Couldn't deserialize thrift msg: TProtocolException: Invalid data

parse_thrift_footer uses a fixed-length buffer (64k) to read the parquet footer, but the metadata of a parquet file with 1600 columns can exceed 5MB.

Therefore, the buffer needs to be allocated according to the actual footer length.
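
Sizing the buffer from the file itself is possible because a Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic "PAR1". The sketch below decodes that 8-byte tail; it is an illustration in Java, not the parse_thrift_footer code, and the class name ParquetFooterSize is made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; decodes only the unencrypted "PAR1" file tail.
final class ParquetFooterSize {
    static final int TAIL_SIZE = 8; // 4-byte footer length + 4-byte magic

    /** Returns the footer metadata length stored in the last 8 bytes of a Parquet file. */
    static int footerLength(byte[] tail) {
        if (tail.length != TAIL_SIZE) {
            throw new IllegalArgumentException("expected the last 8 bytes of the file");
        }
        String magic = new String(tail, 4, 4, StandardCharsets.US_ASCII);
        if (!"PAR1".equals(magic)) {
            throw new IllegalArgumentException("not a Parquet file tail: " + magic);
        }
        // The length precedes the magic and is stored little-endian.
        return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }
}
```

A reader following this approach would first read the last 8 bytes, then allocate a buffer of footerLength bytes and read the metadata that sits directly before the tail.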

[improvement](error info)improve the s3 path err msg #12438

feature-wip Add memtracker and span for new olap scan node (#12281)

Add memtracker and span for new olap scan node

fix column prune generate empty project list on join's child (#12486)

  • fix column prune generate empty project list on join's child

Enhancement Add readable information in subquery for array type #12463

regression add some case for array insert (#12474)

Co-authored-by: hucheng01 hucheng01@baidu.com

fix subquery predicate's slot appears in having's output by mistake (#12494)

When an uncorrelated subquery appears in the HAVING predicates, the HAVING output mistakenly contains one slot from the subquery. This PR fixes it by always adding a project on top of the HAVING.

Co-authored-by: mch_ucchi organic_chemistry@foxmail.com

feature Support function registry (#12481)

Support function registry.

The classes:

  • BuiltinFunctions: contains the built-in functions list
  • FunctionRegistry: used to register scalar functions and aggregate functions, it can find the function by name
  • FunctionBuilder: used to resolve a BoundFunction class, extract the constructor, and build to a BoundFunction by arguments(List<Expression>)

Register example: you can add built-in functions in the list for simplicity

public class BuiltinFunctions implements FunctionHelper {
    public final List<ScalarFunc> scalarFunctions = ImmutableList.of(
            scalar(Substring.class, "substr", "substring"),
            scalar(WeekOfYear.class),
            scalar(Year.class)
    );

    public final ImmutableList<AggregateFunc> aggregateFunctions = ImmutableList.of(
            agg(Avg.class),
            agg(Count.class),
            agg(Max.class),
            agg(Min.class),
            agg(Sum.class)
    );
}

Note:

  • Currently, we only support registering scalar functions and aggregate functions; registering table functions will be supported later.
  • Currently, we only support resolving a function by its name and different arity; overloads with the same arity cannot be resolved, e.g. some_function(Expression) and some_function(Literal)
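
Resolution by name and arity, and why same-arity overloads collide, can be sketched with a two-level map. ArityRegistry below is a simplified, hypothetical model (builders return strings instead of BoundFunction objects) and is not the FunctionRegistry source.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of name + arity lookup; not the Nereids FunctionRegistry.
final class ArityRegistry {
    // name -> (arity -> builder). Two overloads with the same arity would share one slot.
    private final Map<String, Map<Integer, Function<List<String>, String>>> builders =
            new HashMap<>();

    void register(String name, int arity, Function<List<String>, String> builder) {
        builders.computeIfAbsent(name.toLowerCase(Locale.ROOT), k -> new HashMap<>())
                .put(arity, builder);
    }

    String build(String name, List<String> args) {
        Map<Integer, Function<List<String>, String>> byArity =
                builders.get(name.toLowerCase(Locale.ROOT));
        if (byArity == null || !byArity.containsKey(args.size())) {
            throw new IllegalArgumentException(
                    "no function " + name + " with " + args.size() + " argument(s)");
        }
        return byArity.get(args.size()).apply(args);
    }
}
```

Because the inner map is keyed by argument count alone, registering a second builder with the same name and arity replaces the first, which mirrors the limitation described in the note.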

Improvement improve partial sort algorithm (#12349)

enhancement add optionalAnd to simplify code (#12497)

Add optionalAnd to avoid adding True which may make BE crash. Use optional to simplify code.
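
The idea of returning an Optional instead of a literal TRUE for an empty conjunct list can be sketched as follows. Predicates are modelled as strings and the signature is an assumption; the real optionalAnd in Nereids works on expression objects.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; predicates are plain strings rather than Expression objects.
final class Conjuncts {
    /** Joins predicates with AND; yields empty rather than a TRUE literal when there are none. */
    static Optional<String> optionalAnd(List<String> predicates) {
        return predicates.stream().reduce((l, r) -> "(" + l + " AND " + r + ")");
    }

    public static void main(String[] args) {
        System.out.println(optionalAnd(List.of()));                 // Optional.empty
        System.out.println(optionalAnd(List.of("a > 1", "b < 2"))); // Optional[(a > 1 AND b < 2)]
    }
}
```

Callers can then use ifPresent or map to attach a filter only when a predicate exists.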

feature-wip update delete bitmap when incremental clone (#12364)

Created at 2 months ago

typo INSERT documentation fix (#12455)

INSERT documentation fix

fix cache cleaner (#12432)

[brpc]using pooled connection and enlarge brpc connection timeout and retry… (#10443)

  • using pooled connection and enlarge brpc connection timeout and retry times

When a connection failure happens, doris fails the queries using that connection. We should lower the impact of a connection failure by using pooled connections and enlarging the connection timeout and retry times.

  • clang format

enhancement avoid abuse of Offset and Offset64 #12378

We already separate Array Offset64 and String Offset(32bit) in PR: #12341

Now we limit Offset to inside IColumn and Offset64 to inside ColumnArray only, to avoid abuse of them. Using the wrong one fails to compile.

fix threadpool schedules does not work right on concurr… (#12370)

  • fix threadpool schedules does not work right on concurrent token

Assume there is a concurrent thread token whose concurrency is 2. The 1st submit on the token is submitted to the threadpool, while the 2nd is not submitted because the pool is busy. The token's active_threads is then 1, and the thread pool does not schedule the token.

The patch fixes the problem.

[fix](grouping sets) grouping sets cause be core or return wrong results (#12313)

enhancement executeSQL rest api support streaming response (#12239)

[Enhancement](Error Msg) show details of COLUMN and TABLE name regex #11999

Co-authored-by: wuhangze wuhangze@jd.com

fix fix orthogonal_bitmap_union_count plan : wrong PREAGGREGATION (#12095)

Execution plan display when using orthogonal_bitmap_union_count function:

PREAGGREGATION: OFF

Reason: Invalid Aggregate Operator: orthogonal_bitmap_union_count

The correct plan is: PREAGGREGATION: ON

Co-authored-by: lihuigang lihuigang@meituan.com

docs update quick-compaction docs (#12417)

fix crash caused by failure of prepare (#12437)

feature: Left deep tree join order. (#12439)

  • feature: Left deep tree join order.

enhancement add syntax support for fractional literal (#12444)

Like the legacy planner, Nereids parses all fractional literals as decimal. In the future, we will add more syntax to let users control the fractional literal type.
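
One common reason to treat fractional literals as decimals is exactness, although the commit does not state its motivation. The sketch below uses java.math.BigDecimal to contrast exact decimal parsing with binary floating point; FractionalLiteral is a made-up class, not the Nereids parser.

```java
import java.math.BigDecimal;

// Hypothetical illustration of decimal versus binary floating-point literals.
final class FractionalLiteral {
    /** Parses the literal text exactly, keeping its digits and scale. */
    static BigDecimal parse(String text) {
        return new BigDecimal(text);
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2 == 0.3);                 // false
        BigDecimal sum = parse("0.1").add(parse("0.2"));
        System.out.println(sum.equals(parse("0.3")));         // true
        System.out.println(parse("1.50").precision() + "," + parse("1.50").scale()); // 3,2
    }
}
```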

feature Support function "current_date" in FE (#11702)

Issue Number: close #11699

[fix](stream load) Fix wrong conversion of null value when vstream load json format (#12460)

add FE support for JSON datatype and RPC definitions

add JSONB data storage format type

fix JsonLiteral resolve bug

add DataTypeJson case in data_type_factory

add JSON syntax check in FE

Created at 2 months ago

rename json to jsonb

Created at 2 months ago
issue comment
[feature](JSON datatype)Support JSON datatype

Sample records are as follows: image

Created at 2 months ago

rename json to jsonb

Created at 2 months ago

Bug block call clear_column_data may have ref not equal 1 (#12350)

enhancement add single space separator rule to fe check style (#12354)

Sometimes our code uses more than one space as a separator by mistake. This PR adds a CheckStyle rule, SingleSpaceSeparator, to check for that in Nereids.

Add ctas support config key type ut and doc. (#12327)

fix hash join should use children's output tuple ids not output tableref ids (#12261)

regression add tpcds sf1 unique test (#12268)

Enhancement new add the property of reserve_replica to restore statement (#11942)

Add a new property called 'reserve_replica', which means the restored table's partitions keep the same replication num as before the backup.

Co-authored-by: Stalary stalary@163.com
Co-authored-by: camby 104178625@qq.com

enhancement Split Array Offsets and String Offsets (#12341)

In old Doris versions string offsets are 32-bit, which is not enough for the Array type. If we changed string offsets from 32-bit to 64-bit, upgrading BEs one by one would be a problem, because 32-bit and 64-bit string offsets would exist at the same time. As a result, we separate the code for Array offsets.

Co-authored-by: cambyzju zhuxiaoli01@baidu.com

fix fix dead loop in unnesting subquery rule (#12345)

fix fix dead loop in unnesting subquery rule

feature-wip bug fix, get the correct group reader (#12294)

Fix the problem that the lineitem table of TPCH cannot be read, and the memory allocation error.

Co-authored-by: jinzhe jinzhe@selectdb.com

enhancement add single empty line rule to fe check style for Nereids (#12365)

test add subquery regression Testing (#12372)

Added regression tests for subqueries. Currently only correlated subqueries are covered; uncorrelated subqueries will be added after the project revision.

docs set hadoop env (#12342)

(spark-load) set hadoop env

DOCS Add docs for new time functions (#12382)

Add docs for new time functions

typo Sql blacklist documentation fix (#12376)

Sql blacklist documentation fix

fix add data valid check for ARRAY type while insert or load (#12283)

Add data valid check for ARRAY type while insert or load

Co-authored-by: cambyzju zhuxiaoli01@baidu.com

enhancement add projections and output id in explain string (#12358)

In the earlier PR #11842, we added the ability to do projection on each ExecNode, but the projection expr list was not shown in explain, which is inconvenient for debugging. This PR adds them to the explain string if they exist.

fix LogicalAggregate's equals and hashCode missing two attributes (#12393)

After applying NormalizeAggregate rule, owner groups of all aggregate children are removed. The root cause is the new aggregate node is regarded as the old aggregate node, because LogicalAggregate.equals() does not take some attributes ("normalized", "disassembled") into account.
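
The effect of leaving state flags out of equals and hashCode can be illustrated with a reduced model. AggregateNode below is a hypothetical class with only a group-by list and the two flags named in the commit; it is not the LogicalAggregate source.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical reduced model of an aggregate plan node.
final class AggregateNode {
    final List<String> groupBy;
    final boolean normalized;
    final boolean disassembled;

    AggregateNode(List<String> groupBy, boolean normalized, boolean disassembled) {
        this.groupBy = groupBy;
        this.normalized = normalized;
        this.disassembled = disassembled;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof AggregateNode)) {
            return false;
        }
        AggregateNode that = (AggregateNode) o;
        // Including both flags keeps a rewritten node distinct from the node it replaced.
        return groupBy.equals(that.groupBy)
                && normalized == that.normalized
                && disassembled == that.disassembled;
    }

    @Override
    public int hashCode() {
        return Objects.hash(groupBy, normalized, disassembled);
    }
}
```

If the flags were omitted, a node before and after normalization would compare equal, so a structure that deduplicates by equality would treat the new node as the old one.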

feature add colocate, shuffle and bucket shuffle join algorithm to Nereids (#11976)

This PR

  1. add support to Nereids for the join algorithms below, which the legacy planner already supports
    • colocate join
    • bucket shuffle join
    • shuffle join
    • broadcast join
  2. update all cost enforce derive utils
    • ChildOutputPropertyDeriver
    • EnforceMissingPropertiesHelper
    • RequestPropertyDeriver
  3. add a local quick sort plan used in enforce
  4. set PhysicalProperties to PhysicalPlan when choosing the best plan from memo
  5. rename Job#pushTask to Job#pushJob

fix fix nullptr of runtime state (#12395)

Remove default nullptr runtime state, which is very error-prone

Enhancement: Add elasticsearch docker file (#12377)

Created at 3 months ago